This blog provides some guidance on using a recent deep generative model developed by Microsoft researchers at Cambridge for missing-value imputation and shares the best practices to impute missing values in multivariate time-series datasets.
Background on Missing Value Preprocessing (MVP)
Unlike the curated datasets used in research, data in the wild is messy and, for various reasons, often incomplete. Collection infrastructure (such as sensors) fails, files (sometimes paper records) get corrupted, and respondents do not answer every question. Some data is sparse by nature; a clinical examination may cover only a fraction of the possible questions. These circumstances call for techniques that restore the data to a form usable by conventional machine learning models. The good news is that the missing value imputation package developed by Microsoft Research in Cambridge, UK, automates this by building a deep learning-based model of the missing data.
To model “missingness,” one needs to ask, “why is the data missing?” Consider the case of “Missing Completely At Random” (MCAR), where the fact that a datum is missing does not depend on the rest of the data and hence cannot be predicted from it. It's as if darts were thrown at random to eliminate some data. The missing values themselves, however, are assumed to be predictable from the rest of the data; that's why imputation makes sense. Imputation learns a model of the missing values from the existing data.
In other cases, the fact that the data is missing is meaningful and can be considered a separate binary-valued variable. See [Koller, 2009] Chapter 19 for a complete discussion. But back to MCAR.
A common practice in tabular data is filling the gaps in each column with zero, mean, or median of that data column. In time-series datasets, repeating or interpolating forward, backward, or both directions is also common. These algorithms, along with more advanced methods such as MICE, are suitable for a handful of cases, especially for missing-completely-at-random (MCAR) and missing-at-random (MAR) patterns.
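The simple baselines mentioned above can be sketched with pandas. This is a minimal illustration on a toy frame; the column names `s1` and `s2` and their values are made up for the example and are not taken from any dataset in this post.

```python
import numpy as np
import pandas as pd

# Toy readings from two hypothetical sensors, with gaps marked as NaN.
df = pd.DataFrame(
    {
        "s1": [1.0, np.nan, 3.0, 4.0, np.nan],
        "s2": [10.0, 20.0, np.nan, 40.0, 50.0],
    }
)

# Column-wise constant fills, common for tabular data.
zero_filled = df.fillna(0.0)
mean_filled = df.fillna(df.mean())
median_filled = df.fillna(df.median())

# Time-series fills: repeat the last or next observed value.
forward_filled = df.ffill()
backward_filled = df.bfill()

# Linear interpolation between neighbors; limit_direction="both" also
# fills gaps at the start and end of the series.
linear_filled = df.interpolate(method="linear", limit_direction="both")
```

None of these baselines uses information from other columns, which is the main thing a multivariate model can add.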
EDDI – Deep Learning for Missing Value Imputation
Microsoft's research team at Cambridge developed a technology based on a partial VAE (variational autoencoder) algorithm that predicts missing values using probabilistic deep learning. The code is open source as part of the Data-Efficient Decision-Making project at this link.
The team also developed an easy-to-use API, which is currently in private preview. If you are interested in evaluating the API for your scenario, please email the team at email@example.com. The API speeds up the process, is scalable, and reduces the need for deep domain expertise. It also works with different types of data (e.g., continuous values and categorical data) and can handle different missingness patterns.
Applying EDDI to Multivariate Timeseries
Multivariate time series data consist of multiple concurrent time-dependent variables, and each variable depends not only on its past values but also on the present and past values of other variables. We need to model this temporal aspect explicitly as predictive features. Compared with linear imputation, EDDI is a great choice for multivariate time series when the use case meets the following conditions:
- There is strong inter-variable predictability between input features
- The time series data can be characterized by time embedding/fold representations
- The missingness type is either missing-completely-at-random (MCAR) or missing-at-random (MAR)
Showcase: Soft Sensor Modeling
Soft sensor modeling is an interesting multimodal time series use case that aims to model the behavior of a physical sensor network mathematically. A solution template for soft-sensor modeling on Azure is discussed in this blog post. In this section, we add missingness to that scenario and use EDDI to do the imputation. The dataset originates from a sulfur recovery unit (SRU) of a refinery plant in Italy [paper]. You can find the complete explanation of the use case and dataset in this post and download the datasets from this link. The values are per-minute samples captured from five sensors.
The dataset we are working with is clean, with no missing parts. We chose it intentionally to have a clear ground truth for our experiments and evaluation. The following two are likely scenarios in a real sensor environment:
- Missing random values: a value is not captured due to an interruption, or the sensor reads a corrupted value. To imitate this, we randomly masked sensor values at a rate of 0.007 (0.7%).
- Missing a chunk: a sensor is corrupted for a period of time. On top of the random masking above, we simulate this scenario with a chunk mask on one of the sensors.
The missingness in the above scenarios is typically MCAR or MAR, which justifies the use of EDDI. There are other missingness scenarios that we do not discuss here; for example, when values are not saved due to compression, or when sensor readings are not aligned or have different sampling rates. EDDI helps us impute the missing values and obtain a complete dataset for the downstream task.
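The two masking scenarios can be reproduced on any clean table along the following lines. This is a sketch under stated assumptions: the data is synthetic noise standing in for the SRU readings, the helper names `mask_at_random` and `mask_chunk` are hypothetical, and apart from IN3 (used later in this post) the column names and the chunk start position are chosen only for illustration.

```python
import numpy as np
import pandas as pd


def mask_at_random(df, rate, rng):
    """Hide each cell independently with probability `rate` (an MCAR pattern)."""
    hide = rng.random(df.shape) < rate
    return df.mask(hide)


def mask_chunk(df, column, start, length):
    """Hide `length` consecutive rows of one column, starting at row `start`."""
    out = df.copy()
    out.iloc[start:start + length, out.columns.get_loc(column)] = np.nan
    return out


rng = np.random.default_rng(0)

# Synthetic stand-in for five per-minute sensor signals (not the real SRU data).
clean = pd.DataFrame(
    rng.normal(size=(10_000, 5)),
    columns=["IN1", "IN2", "IN3", "IN4", "IN5"],
)

# Scenario 1: random masking at a 0.007 rate.
masked = mask_at_random(clean, rate=0.007, rng=rng)

# Scenario 2: a chunk of consecutive records missing for one sensor,
# applied on top of the random mask.
masked = mask_chunk(masked, column="IN3", start=4_000, length=2_500)
```

Keeping the clean frame alongside the masked one is what makes it possible to score any imputation method against ground truth afterwards.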
- Train: Using a sample dataset (with its missing values):
- Batch Inference: Impute the missing values for the full dataset
- Input: model id, inference datafile (features, CSV format)
- Output: dataset with imputation on missing cells (CSV format)
In this repository, we have shared the code to use the EDDI API in two ways:
- EDDI Service API – This option requires minimal environment and hardware setup, and since it runs on an Azure server, it can be integrated easily with any production application. [This is in private preview, contact firstname.lastname@example.org for access]
- Azua Package [git-repo] – Using this code repository requires you to set up the required libraries, and when running locally, performance depends on your local memory and processor. However, this option is more useful for developers who want full control over the code to review and modify internal processes as needed.
[GitHub Repo: Softsensor_MVP_with_EDDI]: This code repository walks you through the data preparation, training, and batch-inference steps for using the EDDI API in the soft sensor modeling showcase.
Best Practices on using EDDI for Multivariate Timeseries
If you decide to use EDDI for multivariate timeseries MVP to prepare filled-in data for a subsequent prediction task, here is a list of best practices:
- How To Prevent Information Leakage: A typical ML prediction task involves input features predicting the output targets, where a test set is left out for evaluation. When you do imputation prior to a prediction task, the prediction model learns the pattern in the imputed data, not the original data. You should therefore be mindful of how you interpret errors and watch for data leakage. You can think of “imputation” followed by “prediction” as a pipeline, which you can evaluate either a) component-by-component or b) end-to-end. We choose the first option in this post to focus on evaluating the imputation component. Thus, you need to seal the prediction evaluation by:
- Training EDDI on the same data you are going to use to train your prediction model. EDDI inference can then impute missingness in the test data.
- Imputing input and output columns independently. The problem here is that imputation applied to the target values is effectively a “prediction” if the input features are used during imputation, which is a kind of data leakage. You can remove the contribution of the imputation step to the predictive evaluation by running separate imputation tasks for the set of input columns and for each of the output columns, treating them as separate datasets.
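The two rules above can be sketched as follows. This is not the EDDI API: a toy column-mean imputer (the hypothetical `MeanImputer` class) stands in for the imputation model so that the example is self-contained, and the helper name `impute_without_leakage` is likewise an assumption. With a mean imputer the column grouping makes no numerical difference; it matters for models such as EDDI that use cross-column signal.

```python
import numpy as np
import pandas as pd


class MeanImputer:
    """Toy stand-in for an imputation model: learns column means at fit time."""

    def fit(self, df):
        self.means_ = df.mean()
        return self

    def transform(self, df):
        return df.fillna(self.means_)


def impute_without_leakage(train, test, input_cols, target_cols):
    """Fit on train only; treat inputs and each target as separate datasets."""
    groups = [list(input_cols)] + [[col] for col in target_cols]
    train_parts, test_parts = [], []
    for cols in groups:
        imputer = MeanImputer().fit(train[cols])  # never sees test rows
        train_parts.append(imputer.transform(train[cols]))
        test_parts.append(imputer.transform(test[cols]))
    return pd.concat(train_parts, axis=1), pd.concat(test_parts, axis=1)


train = pd.DataFrame(
    {"x1": [1.0, np.nan, 3.0], "x2": [2.0, 4.0, np.nan], "y": [10.0, np.nan, 30.0]}
)
test = pd.DataFrame(
    {"x1": [np.nan, 100.0], "x2": [5.0, np.nan], "y": [np.nan, 50.0]}
)
train_filled, test_filled = impute_without_leakage(
    train, test, input_cols=["x1", "x2"], target_cols=["y"]
)
```

Because each group gets its own model, the target column is never filled in using the input features, and test rows never influence what the imputer learns.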
- How Imputation works with Temporal Features: We observed that by augmenting the dataset with a window of feature values, the model could capture temporal relationships. A simple yet sufficient solution for many use-cases is adding an immediate previous and next value of each feature as separate features:
With this setting and our random 0.7% missingness, EDDI performs better than linear imputation [Note: lower MAPE is better]:
In-depth insight: in the above example, we use immediately adjacent neighbors because the sensor values have fine-grained dependencies. We could choose a larger window size, i.e., Xt-k, …, Xt+k, if the temporal dependencies are expected to extend further. Also, if we need to capture coarser dependencies, we could use more distant neighbors, e.g., Xt-5 and Xt+5. Mutual information between features can give an initial indication of how predictable they are from one another.
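The windowing idea can be expressed as a small pandas helper. This is a sketch rather than the code from the repository; the function name `add_time_embedding` and the column-suffix convention are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_time_embedding(df, lags=(1,)):
    """Append X[t-k] and X[t+k] copies of every column for each k in `lags`."""
    parts = [df]
    for k in lags:
        parts.append(df.shift(k).add_suffix(f"_t-{k}"))   # value k steps earlier
        parts.append(df.shift(-k).add_suffix(f"_t+{k}"))  # value k steps later
    return pd.concat(parts, axis=1)


df = pd.DataFrame({"IN1": [1.0, 2.0, 3.0, 4.0]})

# Immediate previous and next values, as in the setting described above.
embedded = add_time_embedding(df, lags=(1,))

# A wider contiguous window (k = 1, 2, 3) or a single coarser lag (k = 5)
# would be lags=range(1, 4) or lags=(5,), respectively.
```

Note that shifting leaves NaN at the series edges and that a missing value at time t also shows up as missing in the shifted columns of neighboring rows, so the imputer sees a few more gaps than the original table had. After imputation, one option is to keep only the original columns.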
- Use EDDI to Impute Missed Chunks: Using EDDI is more interesting when we lose a chunk of data. In the sensor network example, a sensor may get corrupted or go offline for a while. To test this, we masked feature IN3 over a chunk of 2,500 consecutive records (around two days of per-minute samples). As shown below, EDDI recovered the missing chunk much better than linear interpolation. The figures, from left to right, show EDDI and linear imputation, respectively:
In-depth insight: Here, the chunk length of 2,500 elements is chosen intentionally to showcase EDDI. The relative performance of EDDI vs. linear imputation depends on the signal shape and on how well the other features predict the masked feature within the missed chunk. Linear imputation performs well for signals that vary linearly between the last and next observed values, which is more likely in smaller chunks. EDDI performs better the more predictable the masked feature is from the other features, but this is harder to know in advance. Domain knowledge is a key component in this case.
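To compare imputation methods on your own data, you can score each one only on the cells that were deliberately hidden. The sketch below computes MAPE for a linear-interpolation baseline on a toy signal; the helper name `masked_mape` and the numbers are illustrative assumptions, and the formula assumes the ground truth has no zeros in the masked cells.

```python
import numpy as np
import pandas as pd


def masked_mape(truth, imputed, mask):
    """MAPE (in percent) computed only over the cells that were hidden."""
    t = truth.to_numpy()[mask]
    p = imputed.to_numpy()[mask]
    return float(np.mean(np.abs((t - p) / t)) * 100.0)


# Toy signal with a bump in the middle; the bump is the hidden chunk.
truth = pd.DataFrame({"IN3": [10.0, 12.0, 16.0, 12.0, 10.0]})
mask = np.zeros(truth.shape, dtype=bool)
mask[1:4, 0] = True
observed = truth.mask(mask)

# Linear interpolation draws a straight line across the gap and misses the bump.
linear = observed.interpolate(method="linear", limit_direction="both")
linear_error = masked_mape(truth, linear, mask)
```

The same function can score the output of any other imputer, as long as it is given the same ground truth and mask.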
Machine learning with missing values is an old challenge, and EDDI is a novel deep learning-based solution for missing value imputation on multivariate datasets. However, imputing a multivariate time-series dataset requires some tweaks to take advantage of both temporal and multivariate signals, which we discussed in this post. Note that one imputation solution does not fit all missingness problems! 😊 For example, if predictability among variables is very limited or there is too much noise in the data, simpler imputation solutions may work better. Yet, EDDI works well if the missing values usually co-occur with visible ones that carry some predictive signal.
GitHub Repository: Softsensor_MVP_with_EDDI
- D. Koller and N. Friedman (2009), “Probabilistic Graphical Models,” MIT Press.