Part of the SQL Server 2022 blog series.
Time series data is a set of values organized in the order in which they occur and arrive for processing. Unlike transactional data in SQL Server, which is not time-based and may be updated often, time series data is typically written once and rarely, if ever, updated.
Some examples of time series data include stock prices, telemetry from equipment sensors on a manufacturing floor, and performance metrics from SQL Server, such as CPU, memory, I/O, and network utilization.
Time series data is often used for historical comparisons, anomaly detection and alerting, predictive analysis, and reporting, where time is a meaningful axis for viewing or analyzing data.
Time series capabilities in SQL Server were introduced in Azure SQL Edge, Microsoft’s version of SQL Server for the Internet of Things (IoT) which combines capabilities such as data streaming and time series with built-in machine learning and graph features.
With SQL Server 2022 and Azure SQL, we’ve brought time series capabilities to the entire SQL Server family. Time series capabilities in SQL Server consist of enhancements to existing T-SQL functions for handling
NULL values, plus two new functions that make working with temporal-based data easier than ever.
Create contiguous ranges with GENERATE_SERIES
When analyzing time series data, it’s common to create a contiguous set of datetime values in evenly spaced intervals (for example, every second) within a specific range. One way to accomplish this is by creating a numbers table, also known as a tally table, which contains a set of consecutive numbers between a lower and upper bound. The numbers in the table can then be used in combination with the
DATEADD function to create the range of datetime values.
Prior to SQL Server 2022, creating a numbers table usually involved some form of common table expressions,
CROSS JOIN of system objects, looping, or some other creative T-SQL. These solutions are neither elegant nor efficient at scale, with additional complexity when the step between interval values is larger than 1.
GENERATE_SERIES relational operator in SQL Server 2022 makes creating a numbers table simple by returning a single-column table of numbers between a start and stop value, with an optional parameter defining the number of values to increment/decrement between steps in the series:
GENERATE_SERIES (start, stop [, step ])
This example creates a series of numbers between 1 and 100 in steps of 5:
SELECT valueFROM GENERATE_SERIES(1, 100, 5);
Taking this concept one step further, the next example shows how GENERATE_SERIES is used with DATEADD to create a set of values between 1:00 PM and 2:00 PM in 1-minute intervals:
SELECT DATEADD(minute, s.value, 'Dec 10, 2022 1:00 PM') AS [Interval]FROM GENERATE_SERIES(0, 59, 1) AS s;
If the step argument is omitted, a default value of 1 is used when computing interval values. GENEATE_SERIES also works with decimal values, with a requirement that the start, stop, and step arguments must all be the same data type. If start is greater than stop and the step is a negative value, then the resulting series will be a decrementing set of values. If start is greater than stop and the step is positive, an empty table will be returned.
GENERATE_SERIES requires a compatibility level of 160 or higher.
Group data in intervals with DATE_BUCKET
Time series data is often grouped into fixed intervals, or buckets, for analytical purposes. For example, sensor measurements taken every minute may be averaged over 15-minute or 1-hour intervals. While
DATEADD are used to create the buckets, we need a way to determine which bucket/interval a measurement belongs to.
DATE_BUCKET function returns the datetime value corresponding to the start of each datetime bucket for an arbitrary bucket size, with an optional parameter to define the origin from which to calculate each bucket. If no origin is provided, the default value of Jan 1, 1900, will be used as the origin date:
DATE_BUCKET (datepart, number, date, origin)
The following example shows the buckets for Dec 10, 2022, for several date parts with a bucket size of 1 and an origin date of Jan 1, 2022:
DECLARE @date DATETIME = 'Dec 10, 2022 12:05 PM';DECLARE @origin DATETIME = 'Jan 1, 2022 12:00 AM';SELECT 'Now' AS [BucketName], @date AS [DateBucketValue]UNION ALLSELECT 'Year', DATE_BUCKET (YEAR, 1, @date, @origin)UNION ALLSELECT 'Quarter', DATE_BUCKET (QUARTER, 1, @date, @origin)UNION ALLSELECT 'Month', DATE_BUCKET (MONTH, 1, @date, @origin)UNION ALLSELECT 'Week', DATE_BUCKET (WEEK, 1, @date, @origin)UNION ALLSELECT 'Day', DATE_BUCKET (DAY, 1, @date, @origin)UNION ALLSELECT 'Hour', DATE_BUCKET (HOUR, 1, @date, @origin)UNION ALLSELECT 'Minutes', DATE_BUCKET (MINUTE, 1, @date, @origin)UNION ALLSELECT 'Seconds', DATE_BUCKET (SECOND, 1, @date, @origin)
Notice how the date bucket value for the Week date part is Dec 10, 2022, which is a Saturday. That’s because the provided origin date (Jan 1, 2022) is also a Saturday. (Note the default origin date of Jan 1, 1900, is a Monday). Therefore, when working with the Week date part, if you want your Week bucket to begin on a Sunday then be sure to use a known origin that falls on a Sunday.
DATE_BUCKET becomes especially useful is for bucket sizes larger than 1, for example when grouping data in 5-minute or 15-minute buckets.
SELECT 'Now' AS [BucketName], GETDATE() AS [BucketDate]UNION ALLSELECT '5 Minute Buckets', DATE_BUCKET (MINUTE, 5, GETDATE())UNION ALLSELECT 'Quarter Hour', DATE_BUCKET (MINUTE, 15, GETDATE());
DATE_BUCKET provides an easy way to determine which time-based interval a timestamped measurement belongs to using any arbitrary-sized interval.
Gap analysis with FIRST_VALUE and LAST_VALUE
LAST_VALUE are not new functions to SQL Server 2022; what is new is how
NULL values are handled. In previous versions of SQL Server,
NULL values are preserved.
When working with time series data, it’s possible to have gaps between measurements. Ideally, gaps are filled in with an imputed value. When using
LAST_VALUE to compute the value corresponding to an interval, preserving
NULL values isn’t ideal.
In the following example, a series of sensor readings taken at 15-second intervals has some gaps:
If analyzing the data in 1-minute intervals (using
DATE_BUCKET), the default value returned by
FIRST_VALUE will include the null values:
SELECT [timestamp] , DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket] , SensorReading , FIRST_VALUE (SensorReading) OVER ( PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp]) ORDER BY [timestamp] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS [Default (RESPECT NULLS)]FROM MachineTelemetryORDER BY [timestamp];
LAST_VALUE include new syntax (
IGNORE NULLS or
RESPECT NULLS) in SQL Server 2022 which allows you to decide how
NULL values should be handled:
FIRST_VALUE ( [scalar_expression ] ) [ IGNORE NULLS | RESPECT NULLS ] OVER ( [ partition_by_clause ] order_by_clause [ rows_range_clause ] )
RESPECT NULLS is the default behavior and will include null values in the result when computing the first or last value within a partition. Specifying
IGNORE NULLS will cause
NULL values to be excluded when computing the first or last value over a partition.
SELECT [timestamp] , DATE_BUCKET(MINUTE, 1, [timestamp]) AS [timestamp_bucket] , SensorReading , FIRST_VALUE (SensorReading) IGNORE NULLS OVER ( PARTITION BY DATE_BUCKET(MINUTE, 1, [timestamp]) ORDER BY [timestamp] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS [Default (RESPECT NULLS)]FROM MachineTelemetryORDER BY [timestamp];
The new options for
IGNORE NULLS and
RESPECT NULLS allow you to decide how null values should be treated when analyzing your data.
Get started with SQL Server 2022 today. For more information and additional examples, visit the following resources:
The post Working with time series data in SQL Server 2022 and Azure SQL appeared first on Microsoft SQL Server Blog.