Data quality issues can severely impact data applications, affecting many fields across the information society.

Statistics suggest that over 25% of critical data in the world's top 1,000 companies is incorrect. In 2009, Dell's Taiwan website mistakenly sold 19-inch monitors priced at NT$4,800 for NT$500; 1.4 million units were ordered within 8 hours, causing a direct economic loss of NT$6 billion. A British telephone company loses £600 million annually due to data errors.

Errors can arise during the acquisition, storage, transmission, and computation of time series data, leaving data incomplete or inconsistent. The diversity of data sources and formats makes conflicts more likely, and frequent updates can quickly render data outdated and inconsistent, especially in IoT scenarios. Any sensor failure or network transmission error during time series data management can introduce data quality issues, and analyzing such dirty data without first evaluating its quality can produce misleading results.

CnosDB is a database focused on time series data processing, designed to solve time series storage and analysis problems and to give users efficient, convenient time series data management and querying. In CnosDB, we have designed and implemented several indicators that evaluate the quality of time series data from multiple dimensions. For timestamp columns, we consider missing, redundant, and delayed points; for value columns, we consider abnormal values, ranges, variations, speeds, and accelerations. These indicators can be computed with aggregate functions, including the following four:

## 1. Completeness

The `completeness` function calculates the completeness of a time series, measuring the proportion of data that is not missing. For example, the daily website traffic in a web log (i.e., the daily COUNT aggregate) is a recorded value; if traffic is usually around 1000 and suddenly drops to 100 one day, you should check whether data is missing.

### Algorithm Overview

The `completeness` function first counts the number of rows in the data (`cnt`). It then handles NaN and Inf values in the value column by applying linear smoothing to them, while counting the occurrences of these two special values (`specialcnt`). Next, it scans the data to count the number of missing points (`misscnt`). Completeness is computed as `1 - (misscnt + specialcnt) / (cnt + misscnt)`. The result lies between 0 and 1; values closer to 1 indicate better quality, and values closer to 0 indicate poorer quality.
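The formula can be sketched in Python. Note that the way missing points are inferred here (gaps larger than the median sampling interval) is a hypothetical stand-in, not CnosDB's actual implementation; only the final formula comes from the description above. Under this assumption the sketch reproduces the documented result for the `wzz` example data.

```python
import math

def completeness(times, values):
    """Sketch of completeness = 1 - (misscnt + specialcnt) / (cnt + misscnt).

    Assumption (hypothetical): a gap of roughly k expected intervals hides
    k - 1 missing points, with the expected interval taken as the median gap.
    """
    cnt = len(values)
    # specialcnt: NaN and Inf occurrences in the value column
    specialcnt = sum(1 for v in values if math.isnan(v) or math.isinf(v))
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    interval = sorted(gaps)[len(gaps) // 2]  # median gap as expected interval
    # each oversized gap implies round(gap / interval) - 1 missing points
    misscnt = sum(round(g / interval) - 1 for g in gaps if g >= 2 * interval)
    return 1 - (misscnt + specialcnt) / (cnt + misscnt)

# the example series from the wzz table used later in this article
times = [1, 3, 4, 6, 10, 13, 14, 16, 18, 19, 22, 24, 26, 31, 33]
values = [12.34, 34.54, 1.43, 14.03, 12.30, 11.54, 112.20, 14.44,
          134.02, 116.34, 1234.45, 10.36, 124.21, 6.34, float("nan")]
print(completeness(times, values))  # ≈ 14/17 ≈ 0.8235, matching the example
```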

### Function

```
completeness(time, value)
```

### Parameters

time: Timestamp

value: BIGINT / BIGINT UNSIGNED / DOUBLE

### Return Type

DOUBLE

### Example

Create table `wzz`:

```sql
CREATE TABLE wzz(value DOUBLE);

INSERT wzz VALUES (1, 12.34), (3, 34.54), (4, 1.43), (6, 14.03), (10, 12.30), (13, 11.54), (14, 112.20), (16, 14.44), (18, 134.02), (19, 116.34), (22, 1234.45), (24, 10.36), (26, 124.21), (31, 6.34), (33, acos(12345));
```

The final `acos(12345)` evaluates to NaN, because 12345 lies outside the domain of arccosine.

```sql
SELECT * FROM wzz;
```

The table after inserting the data:

```
+-------------------------------+---------+
| time                          | value   |
+-------------------------------+---------+
| 1970-01-01T00:00:00.000000001 | 12.34   |
| 1970-01-01T00:00:00.000000003 | 34.54   |
| 1970-01-01T00:00:00.000000004 | 1.43    |
| 1970-01-01T00:00:00.000000006 | 14.03   |
| 1970-01-01T00:00:00.000000010 | 12.3    |
| 1970-01-01T00:00:00.000000013 | 11.54   |
| 1970-01-01T00:00:00.000000014 | 112.2   |
| 1970-01-01T00:00:00.000000016 | 14.44   |
| 1970-01-01T00:00:00.000000018 | 134.02  |
| 1970-01-01T00:00:00.000000019 | 116.34  |
| 1970-01-01T00:00:00.000000022 | 1234.45 |
| 1970-01-01T00:00:00.000000024 | 10.36   |
| 1970-01-01T00:00:00.000000026 | 124.21  |
| 1970-01-01T00:00:00.000000031 | 6.34    |
| 1970-01-01T00:00:00.000000033 | NaN     |
+-------------------------------+---------+
```

Query the completeness of the time series data:

```sql
SELECT completeness(time, value) FROM wzz;
```

```
+----------------------------------+
| completeness(wzz.time,wzz.value) |
+----------------------------------+
| 0.8235294117647058               |
+----------------------------------+
```

## 2. Consistency

The `consistency` function calculates the consistency of a time series, measuring the proportion of data that is uniformly distributed rather than densely redundant. For example, redundant temperature sensors at a weather station may produce duplicated readings.

### Algorithm Overview

Similar to the `completeness` function, after filling missing values the `consistency` function scans the data to compute the redundancy count `redundancycnt`. Consistency is computed as `1 - redundancycnt / cnt`. The result lies between 0 and 1; values closer to 1 indicate better quality, and values closer to 0 indicate poorer quality.
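A minimal sketch of this formula, assuming (hypothetically) that a point counts as redundant when it follows its predecessor by less than half the median sampling interval; CnosDB's actual redundancy criterion is not specified in the text above:

```python
def consistency(times, values):
    """Sketch of consistency = 1 - redundancycnt / cnt.

    Assumption (hypothetical): a point is redundant when its gap to the
    previous point is smaller than half the median sampling interval.
    """
    cnt = len(values)
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    interval = sorted(gaps)[len(gaps) // 2]  # median gap
    redundancycnt = sum(1 for g in gaps if g < interval / 2)
    return 1 - redundancycnt / cnt

# a series sampled every 10 ticks with one densely redundant point at t=21
print(consistency([0, 10, 20, 21, 30, 40], [1.0] * 6))  # → 0.8333333333333334
```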

### Function

```
consistency(time, value)
```

### Parameters

time: Timestamp

value: BIGINT / BIGINT UNSIGNED / DOUBLE

### Return Type

DOUBLE

### Example

The table created is the same as the one used in the example for the `completeness` function, named `wzz`. To query the consistency of the time series data:

```sql
SELECT consistency(time, value) FROM wzz;
```

```
+---------------------------------+
| consistency(wzz.time,wzz.value) |
+---------------------------------+
| 0.8666666666666667              |
+---------------------------------+
```

## 3. Timeliness

The `timeliness` function calculates the timeliness of a time series, measuring the proportion of data that arrives on time without delay. Monitoring data timeliness matters in many scenarios: alerting on aggregated values, viewing historical trend charts, real-time reports (comparing data day over day, week over week, quarter over quarter, or year over year), trend anomaly detection, and offline analysis of historical data. Query volume, data distribution, and timeliness requirements differ across these scenarios, and a time series database must retrieve data efficiently in all of them.

### Algorithm Overview

Similar to the `completeness` function, after filling missing values the `timeliness` function scans the data to compute the delay count `latecnt`. Timeliness is computed as `1 - latecnt / cnt`. The result lies between 0 and 1; values closer to 1 indicate better quality, and values closer to 0 indicate poorer quality.
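A minimal sketch of this formula, assuming (hypothetically) that a point counts as late when the gap to its predecessor exceeds twice the median sampling interval; the actual delay criterion used by CnosDB is not stated in the text above:

```python
def timeliness(times, values):
    """Sketch of timeliness = 1 - latecnt / cnt.

    Assumption (hypothetical): a point is late when the gap to its
    predecessor exceeds twice the median sampling interval.
    """
    cnt = len(values)
    gaps = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    interval = sorted(gaps)[len(gaps) // 2]  # median gap
    latecnt = sum(1 for g in gaps if g > 2 * interval)
    return 1 - latecnt / cnt

# a regular 10-tick series in which one point arrives 25 ticks late
print(timeliness([0, 10, 20, 45, 55, 65], [1.0] * 6))  # → 0.8333333333333334
```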

### Function

```
timeliness(time, value)
```

### Parameters

time: Timestamp

value: BIGINT / BIGINT UNSIGNED / DOUBLE

### Return Type

DOUBLE

### Example

The table created is the same as the one used in the example for the `completeness` function, named `wzz`. To query the timeliness of the time series data:

```sql
SELECT timeliness(time, value) FROM wzz;
```

```
+--------------------------------+
| timeliness(wzz.time,wzz.value) |
+--------------------------------+
| 0.9333333333333333             |
+--------------------------------+
```

## 4. Validity

The `validity` function calculates the validity of a time series, measuring the proportion of data that satisfies constraint conditions. For example, if the monthly average precipitation in a region is negative, something is clearly wrong.

### Algorithm Overview

The function first counts the number of rows (`cnt`), fills missing values, and removes NaN and Inf values. It then applies custom rules to obtain four counts: `valuecnt`, `variationcnt`, `speedcnt`, and `speedchangecnt`. Validity is computed as `1 - 0.25 * (valuecnt + variationcnt + speedcnt + speedchangecnt) / cnt`. The result lies between 0 and 1; values closer to 1 indicate better quality, and values closer to 0 indicate poorer quality.
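A minimal sketch of this formula. The four counts are stated to come from a custom calculation; as a stand-in, the sketch below (hypothetically) flags points whose value, variation, speed, or speed change deviates from the series mean by more than three standard deviations. Only the final formula comes from the description above.

```python
import statistics

def validity(times, values):
    """Sketch of validity = 1 - 0.25 * (valuecnt + variationcnt
    + speedcnt + speedchangecnt) / cnt.

    Assumption (hypothetical): each count flags points more than
    3 standard deviations from the mean of the corresponding series.
    """
    cnt = len(values)

    def outliers(series):
        if len(series) < 2:
            return 0
        mean = statistics.fmean(series)
        sd = statistics.pstdev(series)
        return sum(1 for x in series if abs(x - mean) > 3 * sd)

    dts = [t2 - t1 for t1, t2 in zip(times, times[1:])]
    variations = [v2 - v1 for v1, v2 in zip(values, values[1:])]
    speeds = [dv / dt for dv, dt in zip(variations, dts)]
    speedchanges = [s2 - s1 for s1, s2 in zip(speeds, speeds[1:])]
    bad = (outliers(values) + outliers(variations)
           + outliers(speeds) + outliers(speedchanges))
    return 1 - 0.25 * bad / cnt

# a strictly linear series has no outliers at any derivative order
print(validity([0, 1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0, 50.0]))  # → 1.0
```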

### Function

```
validity(time, value)
```

### Parameters

time: Timestamp

value: BIGINT / BIGINT UNSIGNED / DOUBLE

### Return Type

DOUBLE

### Example

The table created is the same as the one used in the example for the `completeness` function, named `wzz`. To query the validity of the time series data:

```sql
SELECT validity(time, value) FROM wzz;
```

```
+------------------------------+
| validity(wzz.time,wzz.value) |
+------------------------------+
| 0.8                          |
+------------------------------+
```