How to Extract Hundreds of Time Series Features for Machine Learning Using the Open-Source Python Package tsfresh

Introduction

Feature engineering is a crucial step in preparing data for machine learning, especially with time-series data, where patterns over time can significantly influence predictions. Generating features by hand can be tedious and inefficient. This is where tsfresh, an open-source Python package, comes in: it can generate hundreds of time-series features in an automated fashion. In this blog, we will first briefly explain Feature Engineering.

Once we are familiar with Feature Engineering, we will look at how we can use tsfresh to automate the process of generating time-series features. All the code used in this blog is available on the following GitHub repository.

What is Feature Engineering

A Machine Learning feature is any measurable value that can be used as an input for a Machine Learning task. In simplest terms, it can be considered a column of the input data to a Machine Learning model, where the rows represent different observations. For example, in the famous Iris dataset, where the goal is to predict the type of species, the input values of Sepal length, Sepal width, Petal length, and Petal width are called features. The task of the Machine Learning model is to predict the Species, given some feature values.

Infographic that shows features are input values that help a Machine Learning model better predict the target value

Feature Engineering, therefore, is the process of transforming raw data into useful features that better characterize the data, enabling the machine learning model to learn more effectively. An example of Feature Engineering on time-series sales data is given below. Here we have sales data over time, and we aim to predict future sales. Through Feature Engineering, we can add information such as ‘Mean Sales Last year’ or ‘Sales on the same day last year.’ The main advantage of adding these features is that they help the Machine Learning model forecast future sales more accurately.

Raw Data: Sales, Time
Engineered Features: Sales, Time, Mean Sales in last 7 days, Max Sales in last 7 days, Sales same date last year, Sales same date last month, Holiday Data, Temperature
Infographic that shows the raw data and the features extracted from it
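As a rough illustration of the engineered features listed above, the sketch below builds a few of them by hand with pandas. The sales values are synthetic and the column names are placeholders chosen for this example; Holiday Data and Temperature are left out because they would come from external sources.

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for two years (illustrative values only).
dates = pd.date_range("2022-01-01", "2023-12-31", freq="D")
rng = np.random.default_rng(seed=0)
raw = pd.DataFrame({"sales": rng.integers(50, 150, size=len(dates))}, index=dates)

features = raw.copy()

# shift(1) keeps each 7-day window strictly in the past, so a day's own
# sales never leak into its features.
past = features["sales"].shift(1)
features["mean_sales_last_7d"] = past.rolling(window=7).mean()
features["max_sales_last_7d"] = past.rolling(window=7).max()

# Look up the value recorded one year / one month earlier.
# Dates before the start of the data become NaN; DateOffset clips to the
# end of shorter months (e.g. 31 March maps to 28 February).
features["sales_same_date_last_year"] = (
    raw["sales"].reindex(raw.index - pd.DateOffset(years=1)).to_numpy()
)
features["sales_same_date_last_month"] = (
    raw["sales"].reindex(raw.index - pd.DateOffset(months=1)).to_numpy()
)

print(features.tail())
```

Writing each of these columns by hand is manageable for a handful of features, but it does not scale to hundreds of them, which is the gap tsfresh aims to fill.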

In the above figure, we have sequential raw data (ordered by time). Using tsfresh, we can extract time-series features such as the maximum, minimum, mean, median, number of peaks, etc. Once we have extracted these features, we can use tsfresh or any other suitable feature selection method to refine the feature set, retaining only the most impactful features.

Let’s look at the implementation of tsfresh in a Jupyter notebook using Python. First, we need to install the tsfresh module using pip, with the command pip install tsfresh. This can be run from the terminal or, prefixed with an exclamation mark, from a Jupyter notebook cell:

Infographic that shows installing the tsfresh module using pip

Next, we need to download sample data from the UCI Machine Learning Repository that we can use for our experiments.

The documentation for the dataset is available at: http://archive.ics.uci.edu/ml/datasets/Robot+Execution+Failures

The dataset represents Force and Torque measurements from sensors on robots. It contains 88 samples, each identified by a value in the ‘id’ column. The ‘time’ column gives the order of the readings within a sample.

Infographic that shows the dataset of Force and Torque measurements from sensors on robots

As you can see, the dataframe has 1320 rows (88 samples with 15 readings each) and 8 columns.

Infographic that shows the dataframe

The ‘y’ column represents whether or not the sensor data represents a robot failure. This is the target value, and our goal can be to classify torque and force measurements as either a failure or not.

Next, we need to import the ‘extract_features’ method and use it to extract features for our dataset. We pass the data to ‘extract_features’ along with two pieces of information: the column that gives the order of the readings, and the ‘id’ column that separates the individual time series. In our case, the readings from each sample form one time series, identified by the ‘id’ column and sorted on the ‘time’ column.

Infographic that shows 'extract_features' method

In the above snippet, we can see that tsfresh has returned 4722 columns of time-series features, covering every sample and every numeric column in our dataset (787 features for each of the six measurement columns).

You can use the following code to identify all the features calculated by tsfresh.

Infographic that shows code to identify all the features calculated by tsfresh

Suppose we need to extract a limited set of features for only one column, namely F_x. We can do that using a features dictionary as below:

Infographic that shows how to extract limited features for only one column

We can use the above dictionary to extract features from tsfresh. 

Infographic that shows we can use the above dictionary to extract features from tsfresh

As seen above, we now have 10 columns because we opted for 10 features in our dictionary and requested them only for the F_x column. Again, we can confirm the list of features extracted as below:

Infographic that shows the list of features extracted

In this way, we can see how easy it is to extract time-series features using the tsfresh package in an automated and customizable way.

Conclusion

To conclude, we have seen what Feature Engineering is and why it is important, especially for time-series data.

We have also seen how tsfresh simplifies the extraction of time-series features, making it a valuable tool for improving predictive models. It integrates with existing Machine Learning workflows, such as Scikit-learn, and saves us coding and processing time. For further details on tsfresh, you can visit the official documentation.
