Tag Archives: T-SQL

Precision in Date/Time Data Types

There is a general misconception, that is common across many different programming/scripting languages, regarding Date/Time data types, their precision, and how this differs in theory versus practice.  For the sake of this article, I will be discussing the following data types: DATETIME in T-SQL, DATETIME2(7) in T-SQL, and DateTime in C#.NET.

Data Type Precision (Theoretical Precision)

Data Type Theoretical Precision
SQL DATETIME 3.33 ms
SQL DATETIME2(7) 100 ns
C# DateTime 100 ns

When viewing the information presented above, the natural assumption is that each of these data types will give you a unique value at intervals of their precision.  For example, if you were using a SQL DATETIME and queried the time every 1 ms, you would expect to see a unique value roughly once out of every three queries.  The same would be expected of the higher precision data types, should they be queried on relatively similar timelines.

This, unfortunately, is not how Date/Time data types work.  This theoretical precision is representative only of what the data type structure is physically capable of holding, not of what the variable will ever practically contain.

Operating System Precision (Practical Precision)

The important factor that most programmers forget is that every program/query/script is running in the context of some Operating System and these Operating Systems are further running in the context of a processor and RAM.  The processor, while very fast, is consumed with many different actions happening to handle multitasking loads, and thus cannot dedicate a large percentage of its cycles to updating a clock.  The way processors get around this issue is by updating a portion of memory with the current date and time, at some predetermined interval.  The Operating System exposes this data through any one of several counters.  When a program requests the current time from the OS, the OS gives the program the latest reading of the “current” time.  In this way, the counter can be said to be highly, but not perfectly accurate, and highly precise to within some threshold (which we will see next).

When a program requests the current date and time from the operating system, it does so through a built-in function/method.  SQL DATETIME can get its data from GETDATE(), SQL DATETIME2(7) should get its data from SYSTEMDATETIME(), and C# DateTime can get its data from DateTime.Now.  GETDATE() hits an OS counter that has a precision of approximately 15 ms, while SYSTEMDATETIME() and DateTime.Now hit an OS counter that has a precision of approximately 1 ms.

Q.E.D., when you call GETDATE(), you will receive a data value that, while capable of holding a time down to a precision of 3.33 ms, will actually only be precise to about 15 ms (in Windows 2000).  In our previous example of querying every 1 ms, you would only see a unique value roughly once out of every 15 queries, as opposed to the 3 we previously would have assumed.

Demonstration

In order to see the effect we have discussed, the following code will capture the number of times we saw unique values and the average time between unique values.

-- SQL GETDATE
DECLARE @TimeStart DATETIME
DECLARE @Time DATETIME
DECLARE @TimeEnd DATETIME
DECLARE @I INT
DECLARE @Count INT
 
SET @I = 0
SET @Count = 0
SET @TimeStart = GETDATE()
SET @Time = @TimeStart
 
WHILE @I < 10000000
      BEGIN
            SET @TimeEnd = GETDATE()
            IF @TimeEnd != @Time
                  BEGIN
                        SET @Count = @Count + 1
                        SET @Time = @TimeEnd
                  END
            SET @I = @I + 1
      END
PRINT CAST(@Count AS VARCHAR) + ' unique values'
PRINT CAST(DATEDIFF(millisecond, @TimeStart, @TimeEnd) / CAST(@Count AS REAL) AS VARCHAR) + ' milliseconds'
 -- SQL SYSTEMDATETIME
DECLARE @TimeStart DATETIME2
DECLARE @Time DATETIME2
DECLARE @TimeEnd DATETIME2
DECLARE @I INT
DECLARE @Count INT
 
SET @I = 0
SET @Count = 0
SET @TimeStart = SYSDATETIME()
SET @Time = @TimeStart
 
WHILE @I < 10000000
      BEGIN
            SET @TimeEnd = SYSDATETIME()
            IF @TimeEnd != @Time
                  BEGIN
                        SET @Count = @Count + 1
                        SET @Time = @TimeEnd
                  END
            SET @I = @I + 1
      END
 
PRINT CAST(@Count AS VARCHAR) + ' unique values'
PRINT CAST(DATEDIFF(microsecond, @TimeStart, @TimeEnd) / CAST(@Count AS REAL) AS VARCHAR) + ' microseconds'
// C# DateTime
DateTime timeStart;
DateTime time;
DateTime timeEnd = DateTime.Now;
 
int i = 0;
int count = 0;
timeStart = DateTime.Now;
time = timeStart;
 
while(i < 10000000)
{
       timeEnd = DateTime.Now;
       if(timeEnd != time)
       {
              count++;
              time = timeEnd;
       }
       i++;
}
 
Console.WriteLine("{0} unique values",count);
Console.WriteLine("{0} microseconds", ((timeEnd.Ticks - timeStart.Ticks) / count) / (TimeSpan.TicksPerMillisecond / 1000));

From executing these I found the following results:

Data Type Theoretical Precision Practical Precision
SQL DATETIME (on Win2000) 3.33 ms 15.5556 ms
SQL DATETIME (on Win2K3) 3.33 ms 3.33333 ms
SQL DATETIME2(7) 100 ns 1003.71 µs
C# DateTime 100 ns 1000.21 µs

From these results, we can discern a couple things.  First, the counter that SQL Server accesses in Windows 2000 has a practical precision of approximately 15 ms, while the counter that SQL Server accesses in Windows 2003 (and beyond) has a practical precision of approximately 3.33 ms, which is basically the same as the theoretical precision.  Secondly, In both the case of SQL DATETIME2(7) and C# Datetime, the practical precision is 1 ms, which falls woefully short of the 100 ns theoretical precision.

Conclusion

Developers should be aware of what their data types and functions are doing under the surface.  In the case of SQL DATETIME2 and C# DateTime, developers must keep in mind that the true precision of the system date and time retrieved through those data types only has a precision of 1 ms.

Data Sanitization with Regular Expressions, Sort of

A DBA on Twitter, using the #sqlhelp hashtag, posted about a particular problem they were having in which they wanted to sanitize some data which was presumably entered in some sort of free-form text field. Their basic idea was that they would get data like any of the following:

my id is 12345
id=12345
user name (12345)

The DBA would then want to be able to sanitize this data and come up with a string containing the simple 5-digit number (in the case presented, the id should always be 5 digits).

This is a common enough problem, one that most people working with human-input data would be familiar with. There are several ways to approach it, but some work better than others, and when you are talking about a high-load system, you want the quickest and most efficient method. In my opinion, this is a perfect opportunity to use Regular Expressions. However, T-SQL does not have native support for full-fledged Regular Expressions, which is a shame, but it does have support for a small subset of Regular Expressions, and that is all we need for this particular case.

So, first, what are we trying to accomplish? We want to take any string and look for digits, between 0 and 9. T-SQL has a function named PATINDEX that takes a subset of of regular expression patterns and an expression that evaluates to some string data type and returns an integer of the index in the string where the pattern first occurs, or zero if the pattern is not matched anywhere in the string. Combining this function with SUBSTRING should give us all of the tools we need to solve this problem.

The code to return the five-digit id as a string is as follows:

DECLARE @yourStringToTest VARCHAR(20)            -- as an example for testing
DECLARE @pattern VARCHAR(20)
SET     @yourStringToTest = 'my id is 12345'
SET     @pattern = '%[0-9]%'
SELECT
    SUBSTRING(
                @yourStringToTest,               -- Source String (for the substring)
                PATINDEX(
                            @pattern,            -- The pattern we are matching
                            @yourStringToTest    -- Source String (for the pattern matching)
                            ),
                5                                -- The number of characters we are wanting to return
                )

As you can see, this is a very simple solution, but what about if the data has some really strange artifacts? What if they have data like:

my 1st ID is 12345
requesting 3 actions, my id is 12345

This poses an additional problem, how do we get to the actual id, which is the five-digit number, but skip over the integers that occur earlier in the string. If we try the code we have above, we are going to get “1st I” and “3 act”, respectively. Clearly we need to add some additional logic to our pattern. We need to ensure that the pattern looks for 5 consecutive digits, as opposed to just the first digit we encounter. Additionally, what if we want to get any id that is at least 3 digits, but not greater than 8, for example? These are all issues that would be solved very easily with Regular Expressions, unfortunately T-SQL does not support the full set of Regular Expression syntax.

There are, however, some interesting alternatives that have been created by members of the SQL community. They involve using COM objects, so I would be wary using them in any production system, but as a one-off script you need to run, it is definitely worth a shot. You can read about this method and see the code necessary to accomplish it by checking out the post made by yoinky on this SQLTeam.com forum thread.

Additionally, you could solve the problem by creating a simple SSIS package to iterate through your source data and run a Script Component to utilize the Regular Expression pattern. I’ll work on an example of that for a future post.

I would like to see Microsoft include full support of Regular Expressions in a future release of SQL Server, but until then, at least we have PATINDEX and LIKE.