
Using CnosDB and TensorFlow for Time Series Prediction

CnosDB is a high-performance, distributed time-series database, and TensorFlow is one of the most popular deep learning frameworks. This article shows how to use the two together to make predictions from time-series data. Time-series data is autocorrelated: each observation depends on the ones before it, so many data science algorithms that assume independent samples cannot be applied to it directly. Machine learning on time-series data therefore requires methods that differ slightly from those used in other fields.
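To make the autocorrelation point concrete, the sketch below compares the lag-1 autocorrelation of a smooth synthetic series with that of white noise. Both series and the helper name `lag_autocorr` are illustrative assumptions, not part of the workflow described in this article.

```python
import numpy as np

def lag_autocorr(x, lag=1):
    """Sample autocorrelation of a 1-D array at the given lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(0)
t = np.arange(600)
# Synthetic periodic series (period of 132 samples) plus a little noise
periodic = np.sin(2 * np.pi * t / 132) + 0.1 * rng.standard_normal(t.size)
# Independent samples for comparison
noise = rng.standard_normal(t.size)

print("periodic series, lag 1:", round(lag_autocorr(periodic), 3))
print("white noise, lag 1:", round(lag_autocorr(noise), 3))
```

The periodic series has a lag-1 autocorrelation close to 1, while the independent samples stay close to 0. Shuffling the first series would destroy exactly the structure a forecasting model needs.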

 

Table of Contents

  1. From Three-Body Motion to Sunspot Prediction
  2. Data Import
  3. Reading Data
  4. Splitting the Dataset into Train and Test Sets
  5. Defining the 1DConv+LSTM Neural Network Model
  6. Using the Trained Model to Predict MSSN
  7. Visualization of the Predicted Values Compared to Real Values

1 From Three-Body Motion to Sunspot Prediction

1.1 Introduction

Sunspots are solar activity phenomena that occur on the photosphere of the Sun, typically appearing in clusters. Predicting changes in sunspot activity is one of the most active fields in space meteorology research.

Sunspots have been observed over a long period, and the accumulated data helps reveal regularities in how they change. Long-term observations show that the number and area of sunspots are clearly periodic, although the period is irregular: it ranges from approximately 9 to 13 years, with an average of about 11 years. The peak values of sunspot number and area also vary from cycle to cycle.
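As a rough illustration of how such a period can be estimated, the sketch below recovers the dominant period of a synthetic monthly series with a known 11-year cycle using a discrete Fourier transform. The series is synthetic and the function name `dominant_period` is hypothetical; real sunspot data has a varying period, so its spectral peak would be broader.

```python
import numpy as np

def dominant_period(values, sample_spacing=1.0):
    """Return the period of the strongest non-zero frequency component."""
    values = np.asarray(values, dtype=float)
    spectrum = np.abs(np.fft.rfft(values - values.mean()))
    freqs = np.fft.rfftfreq(values.size, d=sample_spacing)
    peak = np.argmax(spectrum[1:]) + 1  # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Ten full 11-year cycles sampled monthly (132 months per cycle)
months = np.arange(132 * 10)
synthetic = 80 + 60 * np.sin(2 * np.pi * months / 132)

period_months = dominant_period(synthetic)
print(period_months / 12, "years")
```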

The latest data shows a clear downward trend in the number and area of sunspots in recent years.

Given the profound impact of sunspot activity on the Earth, predicting it is particularly important. Physics-based models, such as dynamic models, and statistical models, such as autoregressive moving average models, have been widely used for this purpose. To capture the nonlinear relationships in sunspot time-series data more efficiently, machine learning methods have been introduced.

It is worth noting that neural networks in machine learning are particularly good at mining nonlinear relationships in data.

Therefore, this article will introduce how to use the CnosDB time series database to store sunspot data and further use TensorFlow to implement a 1DConv+LSTM network to predict changes in sunspot numbers.

1.2 Sunspot Variation Observation Dataset

The sunspot dataset used in this article is version 2.0 released by the SILSO website (WDC-SILSO, Royal Observatory of Belgium, Brussels, http://sidc.be/silso/datafiles).

We will analyze and explore the monthly mean sunspot number (MSSN) from 1749 to 2023.

2 Data Import

Download the MSSN data file in CSV format, SN_m_tot_V2.0.csv (https://www.sidc.be/silso/infosnmtot), to your local machine.

The following is the CSV file description provided by the official website:

Filename: SN_m_tot_V2.0.csv
Format: Comma Separated values (adapted for import in spreadsheets)
The separator is the semicolon ';'.
Contents:
Column 1-2: Gregorian calendar date
- Year
- Month
Column 3: Date in fraction of year.
Column 4: Monthly mean total sunspot number.
Column 5: Monthly mean standard deviation of the input sunspot numbers.
Column 6: Number of observations used to compute the monthly mean total sunspot number.
Column 7: Definitive/provisional marker. '1' indicates that the value is definitive. '0' indicates that the value is still provisional.

We use pandas to load and preview the file:

import pandas as pd
df = pd.read_csv("SN_m_tot_V2.0.csv", sep=";", header=None)
df.columns = ["year", "month", "date_fraction", "mssn", "standard_deviation", "observations", "marker"]
# convert year and month to strings
df["year"] = df["year"].astype(str)
df["month"] = df["month"].astype(str)
# concatenate year and month
df["date"] = df["year"] + "-" + df["month"]
df.head()
import matplotlib.pyplot as plt
# parse the year-month strings and plot the series
df["Date"] = pd.to_datetime(df["date"], format="%Y-%m")
plt.plot(df["Date"], df["mssn"])
plt.xlabel("Date")
plt.ylabel("MSSN")
plt.title("Sunspot Activity Over Time")
plt.show()
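The string concatenation above can also be avoided. As a minimal sketch, pandas can assemble timestamps directly from numeric year and month columns. The three sample rows below are made-up values in the documented column layout, not real SILSO records.

```python
import io
import pandas as pd

# Made-up rows in the documented seven-column, semicolon-separated layout
sample = io.StringIO(
    "2000;01;2000.042;100.0;5.0;500;1\n"
    "2000;02;2000.123;110.0;6.0;480;1\n"
    "2000;03;2000.204;120.0;7.0;510;0\n"
)
columns = ["year", "month", "date_fraction", "mssn",
           "standard_deviation", "observations", "marker"]
sample_df = pd.read_csv(sample, sep=";", header=None, names=columns)

# pd.to_datetime accepts a frame with year/month/day columns
sample_df["Date"] = pd.to_datetime(
    sample_df[["year", "month"]].assign(day=1)
)
print(sample_df[["Date", "mssn"]])
```

This keeps the year and month numeric, so no string format has to be matched when the timestamps are built.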

2.1 Using Time-series Database CnosDB for MSSN Data Storage

CnosDB is an open source distributed time-series database with high performance, a high compression ratio, and high usability.

Official Website: http://www.cnosdb.com

Github Repo: https://github.com/cnosdb/cnosdb

Note: This article assumes that you have the ability to install and use CnosDB. For more information, please see the documentation at https://docs.cnosdb.com/.

(base) root@ecs-django-dev:~# docker run --restart=always --name cnosdb -d --env cpu=2 --env memory=4 -p 31007:31007 cnosdb/cnosdb:v2.0.2.1-beta
(base) root@ecs-django-dev:~# docker exec -it cnosdb sh
# cnosdb-cli
CnosDB CLI v2.0.0
Input arguments: Args { host: "0.0.0.0", port: 31007, user: "cnosdb", password: None, database: "public", target_partitions: None, data_path: None, file: [], rc: None, format: Table, quiet: false }

To simplify the analysis, we only store the observation time and the sunspot number. We therefore concatenate the year (Col 0) and month (Col 1) into the observation time (date, a string) and store the monthly mean sunspot number (Col 3) as is. The column indices here are zero-based.

We use SQL in the CnosDB CLI to create a table named “sunspot” for storing the MSSN dataset.

public ❯ CREATE TABLE sunspot (
    date STRING,
    mssn DOUBLE,
);
Query took 0.002 seconds.
public ❯ SHOW TABLES;
+---------+
| Table   |
+---------+
| sunspot |
+---------+
Query took 0.001 seconds.
public ❯ SELECT * FROM sunspot;
+------+------+------+
| time | date | mssn |
+------+------+------+
+------+------+------+
Query took 0.002 seconds.

2.2 Using CnosDB Python Connector to Connect, Write and Query in CnosDB Database

Github Repo: https://github.com/cnosdb/cnosdb-client-python

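# install the Python Connector first (run in a shell): pip install -U cnos-connector
from cnosdb_connector import connect
# the URL must point at the HTTP port your CnosDB instance exposes
# (the docker command above publishes port 31007)
conn = connect(url="http://127.0.0.1:31001/", user="root", password="")
cursor = conn.cursor()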

If you are unfamiliar with the CnosDB CLI, you can also create the table directly with the Python Connector:

# Create tf_demo database
conn.create_database("tf_demo")
# Use tf_demo database
conn.switch_database("tf_demo")
print(conn.list_database())
cursor.execute("CREATE TABLE sunspot (date STRING, mssn DOUBLE,);")
print(conn.list_table())

The output is shown below. The database list also includes the databases that CnosDB provides by default:

[{'Database': 'tf_demo'}, {'Database': 'usage_schema'}, {'Database': 'public'}]
[{'Table': 'sunspot'}]

Write the DataFrame generated earlier into CnosDB:

### "sunspot" is the table name in CnosDB, and ['date', 'mssn'] are the names of the columns to be written
### If the written columns do not include a time column, one is generated automatically from the current time
conn.write_dataframe(df, "sunspot", ['date', 'mssn'])

3 Reading Data

Reference: CHENG Shu, SHI Yaolin, ZHANG Huai. Predicting sunspot variations through neural network (2022). http://journal.ucas.ac.cn/CN/10.7523/j.ucas.2021.0068

df = pd.read_sql("select * from sunspot;", conn)
print(df.head())
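The split in the next section relies on the rows being in chronological order. Because `date` is stored as a string such as "1749-1", sorting it as text would place "1749-10" before "1749-2". A defensive sketch, assuming the query result has the `date` and `mssn` columns created above (the small frame here is a made-up stand-in for it):

```python
import pandas as pd

# Made-up stand-in for the frame returned by the query, deliberately out of order
result = pd.DataFrame({
    "date": ["1749-10", "1749-2", "1749-1"],
    "mssn": [3.0, 2.0, 1.0],
})

# Split "year-month" into integer columns and build real timestamps from them
parts = result["date"].str.split("-", expand=True).astype(int)
parts.columns = ["year", "month"]
result["parsed_date"] = pd.to_datetime(parts.assign(day=1))

# Sort by the parsed timestamps, not by the original strings
result = result.sort_values("parsed_date").reset_index(drop=True)
print(result)
```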

4 Splitting the Dataset into Train and Test Sets

import numpy as np
# Convert the data values to numpy for better and faster processing
time_index = np.array(df['date'])
data = np.array(df['mssn'])
# ratio to split the data
SPLIT_RATIO = 0.8
# Dividing into train-test split
split_index = int(SPLIT_RATIO * data.shape[0])
# Train-Test Split
train_data = data[:split_index]
train_time = time_index[:split_index]
test_data = data[split_index:]
test_time = time_index[split_index:]
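The split is chronological rather than random: every test sample lies later in time than every training sample, so the model is never evaluated on data that precedes what it was trained on. A toy sketch of the same slicing, with a hypothetical ten-point series:

```python
import numpy as np

toy_series = np.arange(10)          # stand-in for the monthly values, already in time order
toy_split_ratio = 0.8
toy_split_index = int(toy_split_ratio * toy_series.shape[0])

toy_train = toy_series[:toy_split_index]   # earliest 80% of the observations
toy_test = toy_series[toy_split_index:]    # most recent 20%

print(toy_train)   # [0 1 2 3 4 5 6 7]
print(toy_test)    # [8 9]
```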

Construct the training data using the sliding-window method:

import tensorflow as tf
## required parameters
WINDOW_SIZE = 60
BATCH_SIZE = 32
SHUFFLE_BUFFER = 1000
## function to create the input features
def ts_data_generator(data, window_size, batch_size, shuffle_buffer):
    '''
    Utility function for time series data generation in batches
    '''
    ts_data = tf.data.Dataset.from_tensor_slices(data)
    ts_data = ts_data.window(window_size + 1, shift=1, drop_remainder=True)
    ts_data = ts_data.flat_map(lambda window: window.batch(window_size + 1))
    ts_data = ts_data.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
    ts_data = ts_data.batch(batch_size).prefetch(1)
    return ts_data
# Expanding data into tensors
tensor_train_data = tf.expand_dims(train_data, axis=-1)
tensor_test_data = tf.expand_dims(test_data, axis=-1)
## generate input and output features for training and testing set
tensor_train_dataset = ts_data_generator(tensor_train_data, WINDOW_SIZE, BATCH_SIZE, SHUFFLE_BUFFER)
tensor_test_dataset = ts_data_generator(tensor_test_data, WINDOW_SIZE, BATCH_SIZE, SHUFFLE_BUFFER)
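To see what the windowing produces, the following sketch builds the same kind of (input window, next value) pairs with NumPy on a toy series. It uses `numpy.lib.stride_tricks.sliding_window_view` (NumPy 1.20 or later) instead of `tf.data`, purely for illustration; the variable names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

toy = np.arange(6, dtype=float)      # [0. 1. 2. 3. 4. 5.]
toy_window = 3

# Each row holds toy_window inputs followed by the value to predict
windows = sliding_window_view(toy, toy_window + 1)
features, labels = windows[:, :-1], windows[:, -1]

for x, y in zip(features, labels):
    print(x, "->", y)
# [0. 1. 2.] -> 3.0
# [1. 2. 3.] -> 4.0
# [2. 3. 4.] -> 5.0
```

A series of length n yields n - window_size such pairs, which is why a longer window leaves fewer training examples.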

5 Defining the 1DConv+LSTM Neural Network Model

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv1D(filters=128, kernel_size=3, strides=1, input_shape=[None, 1]),
    tf.keras.layers.MaxPool1D(pool_size=2, strides=1),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(132, activation="relu"),
    tf.keras.layers.Dense(1)
])
## compile neural network model
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss="mse",
            optimizer=optimizer,
            metrics=["mae"])
## training neural network model
history = model.fit(tensor_train_dataset, epochs=20, validation_data=tensor_test_dataset)
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
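Because both LSTM layers use `return_sequences=True`, the network emits one value per remaining timestep rather than a single value per window. A small sketch of the length arithmetic, assuming the Keras default `padding="valid"` for `Conv1D` and `MaxPool1D` and a 60-step input window (the helper name is hypothetical):

```python
def valid_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution or pooling layer with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

steps = 60                                                    # WINDOW_SIZE
steps = valid_output_length(steps, kernel_size=3, stride=1)   # after Conv1D: 58
steps = valid_output_length(steps, kernel_size=2, stride=1)   # after MaxPool1D: 57
print(steps)  # 57 timesteps reach the LSTM and Dense layers
```

Under these assumptions, a batch of 60-step windows produces an output of shape (batch, 57, 1), and the forecast is read from the last timestep.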

6 Using the Trained Model to Predict MSSN

def model_forecast(model, data, window_size):
    ds = tf.data.Dataset.from_tensor_slices(data)
    ds = ds.window(window_size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size))
    ds = ds.batch(32).prefetch(1)
    forecast = model.predict(ds)
    return forecast
rnn_forecast = model_forecast(model, data[..., np.newaxis], WINDOW_SIZE)
rnn_forecast = rnn_forecast[split_index - WINDOW_SIZE:-1, -1, 0]
# Overall Error
error = tf.keras.metrics.mean_absolute_error(test_data, rnn_forecast).numpy()
print(error)
101/101 [==============================] - 2s 18ms/step
24.676455
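The slice `split_index - WINDOW_SIZE:-1` lines the forecasts up with `test_data`: window `i` covers observations `i` to `i + WINDOW_SIZE - 1`, so its forecast targets observation `i + WINDOW_SIZE`. The index arithmetic can be checked without a model, here with made-up sizes:

```python
import numpy as np

n_obs, window, split = 20, 5, 16          # made-up sizes for illustration

# A forecast is produced for every window: n_obs - window + 1 of them
window_starts = np.arange(n_obs - window + 1)
target_index = window_starts + window     # observation each forecast refers to

# Same slice as the one applied to rnn_forecast
kept_targets = target_index[split - window:-1]
print(kept_targets)                       # [16 17 18 19]
```

The kept forecasts target exactly the observations from the split index to the end of the series. The final window is dropped because it would predict a month beyond the available data.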

7 Visualization of the Predicted Values Compared to Real Values

plt.plot(test_data)
plt.plot(rnn_forecast)
plt.title('MSSN Forecast')
plt.ylabel('MSSN')
plt.xlabel('Month')
plt.legend(['Ground Truth', 'Predictions'], loc='upper right')
plt.show()
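An error figure is easier to judge against a simple reference. One common choice is a persistence baseline, which predicts each month as the previous month's value. The sketch below computes the MAE of that baseline on a made-up array; applying it to the real series is an assumption about how the evaluation might be extended, not something done in the experiment above.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between two equally long arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up values standing in for a monthly series
series = np.array([10.0, 12.0, 15.0, 11.0, 9.0])

# Persistence: the forecast for month t is the observation at month t - 1
baseline_mae = mean_absolute_error(series[1:], series[:-1])
print(baseline_mae)  # 2.75
```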

Relevant CnosDB Documents:

1. CnosDB Quick Start: https://docs.cnosdb.com

2. CnosDB Official Website: https://www.cnosdb.com

3. CnosDB GitHub Repository: https://github.com/cnosdb/cnosdb

Reference:

CHENG Shu, SHI Yaolin, ZHANG Huai. Predicting sunspot variations through neural network (2022). http://journal.ucas.ac.cn/CN/10.7523/j.ucas.2021.0068