This article briefly introduces the origin of time series compression, classifies compression algorithms, and surveys the strengths and weaknesses of common algorithms. Coding enthusiasts, come and explore the details!
Time series data is a fundamental data type generated in applications and fields such as finance, healthcare, transportation, and smart cities. Time series analysis is crucial for various tasks, including anomaly detection, prediction, classification, and clustering. However, the vast quantity and complexity of time series data can pose significant challenges for efficient storage, retrieval, and processing.
In contrast to traditional relational databases, time series databases are inherently designed for the processing of time series data. Relational databases often employ traditional B+ tree storage structures and row-based storage. This design is suitable for scenarios with frequent reads and infrequent writes, aiming to improve data query performance and reduce disk or network I/O.
However, because this design is optimized for retrieval, every insertion of a new record requires synchronous updates to the corresponding storage structures: the database must locate the target node in the B+ tree, identify the page in storage, check whether the key already exists, and so on. All of these operations add significant overhead to insertion. This is why InfluxDB, one of the most widely used open-source time series databases, adopts an LSM-tree-based storage structure with targeted optimizations to strengthen the database's write path.
As time passes, the value of previously generated data tends to decrease. In monitoring scenarios, users are likely to focus only on the current or recent state of the system, so earlier data accumulates over time and has a low probability of being queried. It is therefore worthwhile to compress older data aggressively. Given the lower query value of this data, achieving a high compression ratio while preserving the key features of the original data becomes the central concern. This enables better planning of database resources, relieves the pressure on database management, reduces costs, and enhances efficiency, providing better support for data storage and queries.
To address these challenges, there is a growing need for specialized algorithms and data structures to compress and index time series data. Data compression algorithms primarily balance data compression ratio, data decompression speed, and data compression accuracy, guiding the development of various techniques for time series data compression.
Classification of Data Compression Algorithms
For time series data, several commonly used compression algorithms exist. Based on whether there is a loss of precision after compression and decompression, compression algorithms can be divided into two categories: lossless compression algorithms and lossy compression algorithms.
Lossless compression algorithms maintain the integrity of data before and after compression, meaning that compressed data can be completely restored to the original data. Common lossless compression algorithms include:
- Differential coding algorithms [7, 8]: These algorithms compress time series data by calculating the differences between adjacent time points, reducing the storage of redundant information.
- Huffman coding algorithms [9, 10]: Using variable-length encoding, high-frequency data is represented with shorter codes, while low-frequency data is represented with longer codes.
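The idea behind differential coding can be sketched in a few lines of Python. This is an illustrative toy, not any cited implementation: regular sampling makes the deltas small and repetitive, which is exactly what a downstream entropy coder such as Huffman coding then exploits.

```python
# Minimal sketch of differential (delta) coding for a timestamp series.
# Regularly sampled time series produce small, repetitive deltas, which
# take less space to store than the raw values.

def delta_encode(values):
    """Store the first value, then each successive difference."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Invert delta_encode by cumulative summation."""
    if not deltas:
        return []
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

timestamps = [1700000000, 1700000010, 1700000020, 1700000030, 1700000041]
encoded = delta_encode(timestamps)
print(encoded)  # [1700000000, 10, 10, 10, 11]
assert delta_decode(encoded) == timestamps  # lossless round-trip
```

Note that the encoding is fully invertible, which is what makes it lossless: no information is discarded, only redundancy is exposed.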
Lossy compression algorithms achieve higher compression ratios by discarding information judged to be less important. The compressed data cannot be fully restored to the original, so some degree of error is introduced. Common lossy compression algorithms include:
- Wavelet transform-based algorithms [11, 12]: Wavelet transform decomposes time series data into multiple frequency subbands, extracting the main features of time series data, and then compressing these features through coding.
- Discrete cosine transform-based algorithms: The discrete cosine transform converts time series data into frequency-domain coefficients, achieving compression by retaining the low-frequency components that carry most of the signal's energy and discarding the high-frequency components.
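To make the transform-based approach concrete, here is a sketch using a one-level Haar wavelet, the simplest wavelet transform (the function names and threshold are illustrative, not from the cited algorithms). The transform splits the series into pairwise averages (the coarse shape) and pairwise differences (the fine detail); zeroing out small detail coefficients is the lossy step.

```python
# Illustrative one-level Haar wavelet compression (lossy).
# Averages capture the coarse shape; differences capture fine detail.
# Dropping small detail coefficients keeps the main features while
# discarding minor fluctuations.

def haar_forward(x):
    """One-level Haar transform; assumes an even-length input."""
    avg = [(a + b) / 2 for a, b in zip(x[0::2], x[1::2])]
    det = [(a - b) / 2 for a, b in zip(x[0::2], x[1::2])]
    return avg, det

def haar_inverse(avg, det):
    """Reconstruct the series from averages and details."""
    out = []
    for a, d in zip(avg, det):
        out.extend([a + d, a - d])
    return out

def compress(x, threshold):
    avg, det = haar_forward(x)
    # Lossy step: zero out detail coefficients below the threshold.
    det = [d if abs(d) >= threshold else 0.0 for d in det]
    return avg, det

signal = [10.0, 10.2, 10.1, 9.9, 20.0, 20.3, 20.1, 19.8]
avg, det = compress(signal, threshold=0.5)
restored = haar_inverse(avg, det)
# For a one-level transform, the per-point error stays below the threshold.
assert all(abs(a - b) < 0.5 for a, b in zip(signal, restored))
```

In this example every detail coefficient falls below the threshold and is dropped, so the sparse detail array compresses well while the reconstruction still tracks the two plateaus in the signal.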
Introduction to Widely Used Compression Algorithms
Gorilla is a time series compression algorithm that encodes timestamps with delta-of-delta encoding and floating-point values with XOR-based encoding. Its advantages include a high compression ratio, fast compression speed, and low storage requirements. However, its block-based streaming design means it may be constrained by memory when large buffers are needed.
LZ4 is a general-purpose data compression algorithm that can also be applied to time series data. Its advantages include very fast compression, low compression latency, and a reasonable compression ratio. However, on some time series data distributions its compression ratio can degrade noticeably.
Zstandard is a compression algorithm based on the LZ77 family, suitable for general compression and decompression tasks, and can also be used for time series data. Its advantages include a high compression ratio, low decompression latency, and an adjustable speed/ratio trade-off, though compression at the highest levels can be slow.
Snappy is a fast compression algorithm suitable for many types of data, including time series data. Its advantages are very high compression speed and low latency; the trade-off is a comparatively modest compression ratio.
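Neither LZ4, Zstandard, nor Snappy ships with the Python standard library, but zlib (DEFLATE) illustrates the same trade-off these general-purpose compressors expose: a level knob that exchanges compression speed for compression ratio. The payload below is a made-up repetitive metrics string, chosen to resemble serialized, regularly sampled time series data.

```python
import zlib

# Repetitive, machine-generated text compresses well -- much like the
# serialized form of regularly sampled time series data.
payload = b"cpu=0.93 mem=0.41 ts=1700000000\n" * 1000

fast = zlib.compress(payload, level=1)   # favour speed
small = zlib.compress(payload, level=9)  # favour ratio

assert zlib.decompress(small) == payload       # lossless round-trip
assert zlib.decompress(fast) == payload
assert len(small) < len(payload)               # both levels shrink the data
print(len(payload), len(fast), len(small))
```

On data this repetitive both levels achieve a large reduction; on less regular data the gap between levels, and between algorithms like LZ4 and Zstandard, becomes much more pronounced.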
LFZip is a lossy time series compression algorithm that predicts each value and then quantizes the prediction error within a user-specified maximum error bound, offering a good compression ratio and speed. Its drawbacks are sensitivity to outliers and noise, and the need to tune parameters to suit different data distributions.
Chimp is a lossless floating-point compression algorithm for time series that refines Gorilla's XOR scheme with smarter encoding of leading and trailing zero bits, achieving a high compression ratio and speed while adapting to different data distributions. Its drawback is that, like other XOR-based schemes, its advantage shrinks on highly volatile data where consecutive values differ widely.
Sprintz is a simple and effective lossless data compression algorithm. Its main advantages include a high compression ratio, fast speed, and low complexity. However, Sprintz depends heavily on correlation between successive samples: on weakly correlated data the stored differences grow larger, reducing its storage savings.
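The packing intuition Sprintz builds on can be sketched as follows. This is only the delta + zigzag + bit-width part of the idea; the real algorithm adds forecasting and run-length handling, and the function names here are illustrative.

```python
# After delta coding, values cluster around zero but may be negative.
# Zigzag encoding maps them to small unsigned integers, and a whole
# block can then be bit-packed at the width of its largest value.

def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small:
    0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ..."""
    return (n << 1) ^ (n >> 63)

def delta_zigzag(values):
    """Delta-code a series, then zigzag-encode each delta."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return [zigzag(d) for d in deltas]

def packed_bits(block):
    """Bits per element if the block is packed at its maximum width."""
    return max(v.bit_length() for v in block) or 1

readings = [1000, 1001, 1003, 1002, 1004, 1005]
codes = delta_zigzag(readings)
# After the first value, every code fits in a handful of bits...
width = packed_bits(codes[1:])
assert width <= 3
# ...versus 10 bits per raw reading.
print(width, max(r.bit_length() for r in readings))
```

This also shows why correlation matters: if successive readings jumped around randomly, the deltas (and hence the packed width) would be no smaller than the raw values.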
Application of Compression Algorithms in CnosDB
In CnosDB, a specific compression method (codec) can be specified per column when creating a table. The syntax structure is as follows:
CREATE TABLE TIMESERIES(
    column1 BIGINT CODEC(DELTA),
    name STRING CODEC(GZIP),
    age BIGINT UNSIGNED CODEC(NULL),
    ID DOUBLE CODEC(GORILLA)
);
When creating the table, append CODEC() after each column's type; the compression method named inside the parentheses is then applied to that column.
The compression of time series data mainly exploits its continuity and strong correlation. Because data volumes are typically large, lossy compression algorithms that trade some precision for extremely high compression ratios are widely applied.
This article briefly introduces the origin of the time series compression task, classifies compression algorithms, and provides an overview of the advantages and disadvantages of common compression algorithms. The application in CnosDB is only briefly mentioned, and those interested can refer to the content in the cited references.
References
[3] NAQVI S N Z, YFANTIDOU S, ZIMÁNYI E. Time series databases and InfluxDB [J]. Studienarbeit, Université Libre de Bruxelles, 2017, 12.
[4] ANH V N, MOFFAT A. Index compression using 64-bit words [J]. Software: Practice and Experience, 2010, 40(2): 131-147.
[5] GOLOMB S. Run-length encodings (corresp.) [J]. IEEE Transactions on Information Theory, 1966, 12(3): 399-401.
[6] RICE R F. Some practical universal noiseless coding techniques [M]. 1979.
[7] CANDY J. A use of double integration in sigma delta modulation [J]. IEEE Transactions on Communications, 1985, 33(3): 249-258.
[8] O'NEAL J. Differential pulse-code modulation (PCM) with entropy coding [J]. IEEE Transactions on Information Theory, 1976, 22(2): 169-174.
[9] OSWAL S, SINGH A, KUMARI K. Deflate compression algorithm [J]. International Journal of Engineering Research and General Science, 2016, 4(1): 430-436.
[10] GAILLY J-L, ADLER M. GNU gzip [J]. GNU Operating System, 1992.
[11] ZHANG Q, BENVENISTE A. Wavelet networks [J]. IEEE Transactions on Neural Networks, 1992, 3(6): 889-898.
[12] KINGSBURY N. Complex wavelets for shift invariant analysis and filtering of signals [J]. Applied and Computational Harmonic Analysis, 2001, 10(3): 234-253.
[13] PELKONEN T, FRANKLIN S, TELLER J, et al. Gorilla: A fast, scalable, in-memory time series database [J]. Proceedings of the VLDB Endowment, 2015, 8(12): 1816-1827.
[14] COLLET Y, et al. LZ4: Extremely fast compression algorithm [J]. code.google.com, 2013.
[15] COLLET Y. Zstandard - fast real-time compression algorithm [J]. GitHub repository [online], 2018.
[16] GUNDERSON S H. Snappy: A fast compressor/decompressor [J]. code.google.com/p/snappy, 2015.
[17] CHANDAK S, TATWAWADI K, WEN C, et al. LFZip: Lossy compression of multivariate floating-point time series data via improved prediction [C] // 2020 Data Compression Conference (DCC), [S.l.], 2020: 342-351.
[18] LIAKOS P, PAPAKONSTANTINOPOULOU K, KOTIDIS Y. Chimp: Efficient lossless floating point compression for time series databases [J]. Proceedings of the VLDB Endowment, 2022, 15(11): 3058-3070.
[19] BLALOCK D, MADDEN S, GUTTAG J. Sprintz: Time series compression for the internet of things [J]. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2018, 2(3): 1-23.