Generative AI for time-series
Concept
The time dependency inherent to time-series data introduces new levels of complexity to the process of synthetic data generation: keeping the trends and correlations across time is just as important as keeping the correlations between features or attributes (as we’re used to with tabular data). And the longer the temporal history, the harder it is to preserve those relationships.
Over the years, data science teams have been trying to define suitable ways to generate high-quality synthetic time series data, among which Generative Adversarial Networks (GANs) are currently extremely popular.
In this piece, we will delve into the peculiarities of a playfully named architecture, DoppelGANger (yes, pun intended), recently added to the ydata-synthetic package, and explore how we can use it to generate synthetic data from complex time series datasets.
How do you generate your time-series data twin? Even for a seasoned data scientist, the topic of Synthetic Data generation can be a little tricky because there are many nuances to consider. Indeed, determining how to best adjust data preparation and choose appropriate machine learning models for data synthesization is one of the main topics we’ve been discussing at the Data-Centric AI Community.
For educational purposes, ydata-synthetic has been our number one choice and we’ve tested it plenty with tabular data. This time, we’re experimenting with time-series data, using the most recent model for time-series synthetic data generation — DoppelGANger.
As you can tell by the (awesome) name, DoppelGANger makes a pun out of “Doppelganger” — a German word that refers to a look-alike or a double of a person — and “GAN”, the artificial intelligence model.
Idea
Being GAN-based, DoppelGANger follows the same general principles as other GANs: the generator and the discriminator are continuously optimized by comparing the synthetic data (created by the generator) with the real data (which the discriminator, or critic, tries to distinguish from the synthetic).
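To make this adversarial loop concrete, here is a minimal toy sketch in PyTorch: a vanilla GAN fitting a one-dimensional Gaussian, not DoppelGANger itself, but the same alternating optimization of generator and discriminator applies.

import torch
import torch.nn as nn

# Toy "real" data: samples from a 1-D Gaussian the generator must learn to imitate.
real_sampler = lambda n: torch.randn(n, 1) * 0.5 + 2.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = real_sampler(64)
    fake = generator(torch.randn(64, 8))

    # Discriminator step: learn to label real samples 1 and synthetic samples 0.
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + bce(discriminator(fake.detach()), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: learn to fool the discriminator into labelling fakes as real.
    g_loss = bce(discriminator(generator(torch.randn(64, 8))), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()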
Yet, as mentioned earlier, GANs have traditionally struggled with the peculiarities of time-series data:
GANs struggle to capture long-term temporal relationships. A common workaround is windowing, i.e., defining shorter segments of data and evaluating the models only on these segments (see the sketch after this list). This is the approach taken by TimeGAN. Yet, for use cases where we need to replicate an entire temporal series, which requires that long-term correlations are kept, this is unfortunately infeasible;
GANs struggle with mode collapse. If the data has multimodal distributions, the least represented modes will eventually collapse (disappear). Time-series data are particularly prone to this since the range of measurements is often highly variable;
GANs struggle to map complex relationships between measurements and attributes, as well as across different measurements. If you recall the structure of time-series data, this would be the case where it would be difficult to map the relationship between the account balance and the type of transaction (attribute), and between the account balance and other measures such as fee amounts.
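As referenced in the first point above, here is a minimal, library-agnostic sketch of the windowing workaround (the function name and parameters are illustrative, not from TimeGAN or ydata-synthetic); note that any correlation longer than the window length is simply never shown to the model.

import numpy as np

def sliding_windows(series, window_size=24, stride=1):
    """Split a long 1-D series into overlapping fixed-length segments."""
    n_windows = (len(series) - window_size) // stride + 1
    return np.stack([series[i * stride : i * stride + window_size] for i in range(n_windows)])

# A year of daily values becomes many 24-step training segments,
# so the model only ever sees (and learns) short-range structure.
daily = np.sin(np.linspace(0, 20 * np.pi, 365)) + np.random.normal(0, 0.1, 365)
segments = sliding_windows(daily, window_size=24)
print(segments.shape)  # (342, 24)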
To overcome these limitations, DoppelGANger introduces several modifications to better address the generation of synthetic time-series data:
To capture temporal correlations, DoppelGANger uses batch generation. Instead of generating one record at a time, it generates a batch of records. This is better for longer time series and helps to preserve the temporal correlations that would be forgotten when generating singleton records;
To tackle mode collapse, DoppelGANger uses auto-normalization. Rather than normalizing the input data by the global min and max values of the measurements, each time series signal is individually normalized. The min and max of each respective series are used as metadata, which the GAN learns to generate as well, alleviating mode collapse (a quick illustration follows this list);
To map relationships between measurements and attributes, DoppelGANger models the joint distribution between measurements and attributes: the generation of attributes is decoupled from the generation of measurements conditioned on attributes, where each uses a dedicated generator. Additionally, an auxiliary discriminator is introduced, discriminating only on the attributes.
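As a quick illustration of the auto-normalization idea referenced above, here is a minimal, library-agnostic sketch (the function names are illustrative and not part of the ydata-synthetic API): each signal is scaled by its own min/max, and those two values become extra attributes the GAN learns to generate alongside the scaled series.

import numpy as np

def auto_normalize(series):
    """Scale one signal by its own min/max; keep (min, max) as metadata attributes."""
    s_min, s_max = series.min(), series.max()
    scaled = (series - s_min) / (s_max - s_min + 1e-8)  # each signal now lives in [0, 1]
    return scaled, (s_min, s_max)

def denormalize(scaled, s_min, s_max):
    """Invert the scaling using the (possibly generated) min/max metadata."""
    return scaled * (s_max - s_min) + s_min

# Two signals with wildly different ranges end up on the same scale,
# so neither dominates (or disappears from) the learned distribution.
low_traffic = np.random.normal(10, 1, 100)
high_traffic = np.random.normal(10_000, 500, 100)
scaled_low, meta_low = auto_normalize(low_traffic)
scaled_high, meta_high = auto_normalize(high_traffic)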
Key benefits and when to use it
Now that we have a comprehensive overview of its architecture and how it surpasses some well-known challenges of time-series data, it’s easier to understand why DoppelGANger is such a popular model:
Accurate Temporal Patterns Generation: DoppelGANger is designed to cope with both short-term and long-term patterns in data, and therefore is able to mimic certain characteristics such as seasonal changes. For instance, capturing both weekly and annual correlations in a financial time series use case;
Realistic Temporal Correlations: DoppelGANger is suitable for use cases where keeping the correlations between measurements and attributes is paramount. For instance, replicating trends in e-commerce applications where consumption patterns are associated with other attributes (day of the week, personal factors such as age and gender, or marketing and advertising information);
Flexible Generation: DoppelGANger allows the generation of synthetic data according to different metadata/attribute distributions as defined by the users. In this way, it can accommodate the augmentation of rare events, such as those commonly found in healthcare devices or fraud detection transactions.
Hands-on example: Measuring Broadband America
To explore the application of DoppelGANger, we will use the Measuring Broadband America (MBA) Dataset, freely available on the Federal Communications Commission (FCC) website (Licensing Information). This is one of the datasets used to showcase the DoppelGANger model in the original paper and reports on several measurements, such as round-trip times and packet loss rates, from several homes in the US, as we’ll detail in a bit. You can follow along with the tutorial using this notebook, which contains the full flow.
pip install ydata-synthetic==1.3.1
import pandas as pd
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
# Read the data
mba_data = pd.read_csv("fcc_mba.csv")
numerical_cols = ["traffic_byte_counter", "ping_loss_rate"]
categorical_cols = [col for col in mba_data.columns if col not in numerical_cols]
# Defining model and training parameters
model_args = ModelParameters(batch_size=100,
                             lr=0.001,
                             betas=(0.2, 0.9),
                             latent_dim=20,
                             gp_lambda=2,
                             pac=1)

train_args = TrainParameters(epochs=400,
                             sequence_length=56,
                             sample_length=8,
                             rounds=1,
                             measurement_cols=["traffic_byte_counter", "ping_loss_rate"])
These parameters correspond to various settings used in the implementation of DoppelGANger:
batch_size specifies the number of samples (data points) that are processed together in each training iteration;
lr stands for “learning rate” and determines how much the weights of the model are updated in response to the estimated error during training;
betas are the exponential decay rates used by the Adam optimizer and control how the model adapts its gradient moment estimates during training;
latent_dim specifies the dimension of the latent space, which is a lower-dimensional space where the generator operates;
gp_lambda controls the strength of the gradient penalty term in the loss function, which helps stabilize the training;
pac is the packing degree designed to alleviate mode collapse;
epochs is the number of training epochs, i.e., full passes over the training data;
sequence_length is the length of each time sequence;
sample_length refers to the time series batch size, the number of time steps generated at each RNN rollout (parameter S in the original paper);
rounds is the number of steps per batch;
measurement_cols refers to the existing measurements in the dataset.
With the parameters defined, we can train the model and finally synthesize new data:
# Training the DoppelGANger synthesizer
model_dop_gan = TimeSeriesSynthesizer(modelname='doppelganger', model_parameters=model_args)
model_dop_gan.fit(mba_data, train_args, num_cols=numerical_cols, cat_cols=categorical_cols)
# Generating new synthetic samples
synth_data = model_dop_gan.sample(n_samples=600)
synth_df = pd.concat(synth_data, axis=0)
Now that we have our newly created synthetic data, we can resort to data visualization to quickly determine whether the synthetic data generally reproduces the behavior of the original data. We could either plot the entire sequence or a shorter time window of 56 points to check the preliminary results in more detail and decide whether the hyperparameters need some adjusting:
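As a minimal sketch of such a visual check, reusing the imports and objects defined above and assuming the synthetic frame keeps the original column names, one could overlay the first 56 steps of the real and synthetic measurements:

# Overlay original vs. synthetic measurements for one 56-step window.
cols = ["traffic_byte_counter", "ping_loss_rate"]
fig, axes = plt.subplots(len(cols), 1, figsize=(10, 6), sharex=True)
for ax, col in zip(axes, cols):
    ax.plot(mba_data[col].values[:56], label="real")
    ax.plot(synth_df[col].values[:56], label="synthetic")
    ax.set_ylabel(col)
    ax.legend()
plt.tight_layout()
plt.show()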
Overall, the results are quite promising: DoppelGANger is able to fully replicate our entire time sequence, including both short- and long-term characteristics. This is a simple dataset chosen for demonstration purposes, with low dimensionality and without critical issues such as missing values; for more complex scenarios, it might be wise to thoroughly explore your dataset beforehand.
Conclusion
As we’ve discussed throughout this article, time series data brings an added complexity to the process of synthetic data generation that is hard to map out using conventional GAN approaches.
DoppelGANger steps up to this challenge, overcoming some of this complexity with extremely promising advantages that make it so popular among data practitioners nowadays:
Robust Data Synthesization: Handling both short and long-term relationships, DoppelGANger is able to create new realistic datasets that closely mirror the original data. This is perhaps the most interesting use case of the approach, letting data scientists create a data-double of their datasets for safe and customizable experimentation during development cycles;
Flexible and Generalizable: Being able to cope with complex contexts comprising heterogeneous data (categorical, continuous, multi-dimensional datasets), DoppelGANger can handle the overall characteristics of real-world data without requiring extensive human effort in the transformation process, which naturally accelerates development as well;
Plethora of Applications and Industries: From data sharing and privacy preservation to high-fidelity synthetic data generation, DoppelGANger is applicable to a wide range of use cases. This makes it extremely interesting across many distinct industries, such as networks, finance, security, and much more.
In a world with an infinite hunger for data-driven insights, DoppelGANger arises as an invaluable approach, providing data scientists with the data they need, without compromising privacy or quality, and enabling them to make twin-intelligent decisions!