Sensor Analysis Framework
When dealing with sensor data, especially from low-cost sensors, a large part of the effort needs to be dedicated to data analysis. After careful data collection, this stage of our experiments is fundamental to extract meaningful conclusions and prepare reports from them. For this reason, we have developed a data analysis framework that we call the Sensor Analysis Framework. In this section, we detail how this framework is built, how to install it, how to use it and how to build on top of it.
Image source: xkcd
The framework is written in Python, and can be run using Jupyter Notebooks or Jupyter Lab. It is intended to provide a state-of-the-art data analysis environment, adapted for the uses within the Smart Citizen Project, that can be easily expanded for other use cases.
More familiar with R?
How we use it
The Sensor Analysis Framework is mainly used to:
- Handle sensor data acquisition, either from local CSV files or the API
- Perform data cleaning and anomaly detection
- Apply sensor models for actual pollutant concentration calculations
- Create reports, plots and visualisations
The framework can be used interactively, using the example notebooks in the repository, but it can also be used to process data in batch. This is described in the batch analysis section of this documentation, and is a handy, fast and scalable way of processing the data.
Want to make a lot of plots, of a lot of SCKs?
A deeper look
Have a look at the features within the framework:
- Tools to retrieve data from the Smart Citizen's API or to load them from local sources (in csv format, compatible with the SCK SD card data)
- A data handling framework based on the well known Pandas package
- Exploratory data analysis tools to study sensor behaviour and correlations with different types of plots
- A sensor model calibration toolset with classical statistical methods such as linear regression, ARIMA, SARIMA-X, as well as more modern Machine Learning techniques with the use of LSTM networks, RF (Random Forest), SVR (Support Vector Regression) models for sequential data prediction and forecasting
- Methods to statistically validate and study the performance of these models, export and store them
- As a bonus, an interface to convert the Python objects into objects of the statistical analysis language R
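As an illustration of the data loading and handling features listed above, the following is a minimal sketch using only Pandas, not the framework's own loader. The column names (`TIME`, `TEMP`, `HUM`) and the sample values are hypothetical and do not necessarily match the SCK SD card format.

```python
import io

import pandas as pd

# Hypothetical sample of logged readings; column names are illustrative only.
RAW_CSV = """TIME,TEMP,HUM
2019-01-01T00:00:00Z,20.1,45.0
2019-01-01T00:01:00Z,20.3,44.8
2019-01-01T00:02:00Z,,44.9
2019-01-01T00:03:00Z,20.6,44.7
"""


def load_readings(source, time_column="TIME"):
    """Read a CSV source into a DataFrame indexed by UTC timestamp."""
    df = pd.read_csv(source)
    # Parse the timestamps and make them timezone-aware (UTC)
    df[time_column] = pd.to_datetime(df[time_column], utc=True)
    return df.set_index(time_column).sort_index()


readings = load_readings(io.StringIO(RAW_CSV))

# Put the readings on a regular 2-minute grid, averaging within each bin.
# Missing values are skipped by the mean.
resampled = readings.resample("2min").mean()
print(resampled)
```

A file path can be passed to `load_readings` instead of the in-memory buffer used here.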
Step by step guides
Loading and managing the data
Data can be downloaded from the Smart Citizen API using the kit IDs, or loaded from local CSV files. To keep the data tidy, the recordings are organised around the concept of a test: an entity containing all the kits' references, sensors and general information regarding the conditions under which the measurements were carried out:
- Test Location, date and author
- Kit type and reference
- Sensor calibration data or reference
- Availability of reference equipment measurement and type
A brief schema of the test structure is specified below:
All of this structure is filled in when the test is created with a dedicated script, which saves time later when trying to make sense of mismatched reading units, timestamp formats and so on.
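To make the idea of a test more concrete, here is a minimal sketch of how such a record could be represented in Python. The class names, field names and example values are all hypothetical; the framework's actual test description may be organised differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Device:
    """One kit taking part in a test (hypothetical structure)."""
    kit_type: str
    reference: str                 # e.g. a kit ID or a CSV file name
    source: str                    # "api" or "csv"
    calibration: Dict[str, float] = field(default_factory=dict)


@dataclass
class SensorTest:
    """General information about a test and the devices it contains."""
    name: str
    location: str
    date: str
    author: str
    devices: List[Device] = field(default_factory=list)
    reference_equipment: Optional[str] = None   # type of reference, if any

    def has_reference(self) -> bool:
        """Tell whether reference equipment measurements are available."""
        return self.reference_equipment is not None


# Illustrative values only
example = SensorTest(
    name="example_outdoor_test",
    location="Barcelona",
    date="2019-01-01",
    author="someone",
    devices=[Device(kit_type="SCK", reference="kit_a.csv", source="csv")],
    reference_equipment="CO analyser",
)
print(example.has_reference(), len(example.devices))
```

Keeping units, sources and calibration data next to the readings in one record is what avoids guessing them later.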
Create your tests
Visit the guide on organising your data
Exploratory data analysis
The device's data can be explored visually with different types of plots. Plots can also be generated in batch with descriptor files, as shown in the guide. Some of the functionalities implemented are:
- Time series visualisation
- Correlation plot and pairs plot
- Violin plots
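The numbers behind a correlation plot can be sketched with plain Pandas. The example below uses synthetic data with hypothetical channel names (`TEMP`, `HUM`, `NOISE`) and is not the framework's plotting code; it only computes the Pearson correlation matrix that such a plot would display.

```python
import numpy as np
import pandas as pd

# Synthetic readings: humidity is built to fall as temperature rises,
# while NOISE is unrelated to both. All values are made up.
rng = np.random.default_rng(0)
n = 500
index = pd.date_range("2019-01-01", periods=n, freq="min")
temp = 20 + 5 * np.sin(np.linspace(0, 6 * np.pi, n))
hum = 60 - 1.5 * (temp - 20) + rng.normal(0, 0.5, n)
noise = rng.normal(0, 1, n)

df = pd.DataFrame({"TEMP": temp, "HUM": hum, "NOISE": noise}, index=index)

# Pairwise Pearson correlation between all channels
corr = df.corr()
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting library to be drawn as a heatmap.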
The data models section includes tools to prepare, train and evaluate models using data from different devices within a test in order to calibrate your sensors. It provides an interface with common statistics and machine learning frameworks such as scikit-learn, TensorFlow, Keras and statsmodels. These frameworks provide tools to perform:
- Outlier detection with Holt-Winters methods (triple exponential smoothing) and XGBoost regressors
- Data study and analysis for multicollinearity and autocorrelation, in order to determine significant variables and avoid overfitting the model with non-significant exogenous variables
- Trend decomposition and seasonality analysis
- Baseline model estimations in order to assess minimum targets for model quality (using naive regression models)
- Ordinary Linear Regression techniques for univariate and multivariate linear and non-linear independent variables
- ARIMA-X (Autoregressive, Integrated, Moving Average) models with exogenous variables, using Box-Jenkins parameter selection methods
- Supervised learning techniques:
  - Single and multiple layer LSTM (Long Short-Term Memory) networks with configurable structure
  - Random Forest and Support Vector methods for regression
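The role of a baseline estimation in the list above can be sketched as follows. This example uses only NumPy on synthetic data. The naive model chosen here (always predicting the training mean) and the variable names are assumptions for illustration, not the framework's implementation.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic data: a reference concentration that depends linearly on a
# raw sensor reading and on temperature, plus noise. All values are made up.
rng = np.random.default_rng(42)
n = 400
raw = rng.uniform(0, 10, n)
temp = rng.uniform(10, 30, n)
reference = 2.0 * raw + 0.5 * temp + 1.0 + rng.normal(0, 0.3, n)

# Design matrix with an intercept column, then a train/test split
X = np.column_stack([np.ones(n), raw, temp])
split = 300
X_train, X_test = X[:split], X[split:]
y_train, y_test = reference[:split], reference[split:]

# Naive baseline: always predict the mean of the training target
baseline_pred = np.full_like(y_test, y_train.mean())

# Ordinary least squares fit on the training set
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
ols_pred = X_test @ coef

baseline_rmse = rmse(y_test, baseline_pred)
ols_rmse = rmse(y_test, ols_pred)
print(f"baseline RMSE: {baseline_rmse:.2f}, OLS RMSE: {ols_rmse:.2f}")
```

A candidate model is only worth keeping if its error on held-out data is clearly below that of the baseline.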
An example of the model is shown below: the SGX4514 CO reading is estimated from the rest of the kit's available sensors, using a single-layer LSTM network with only two weeks of training data:
Depending on the model selected, different validation techniques are implemented in order to verify the model's assumptions and avoid misinterpreting the data (e.g. the Durbin-Watson or Jarque-Bera tests for linear regression). Finally, it is important to carefully follow the instructions stated in the notebook in order to avoid low model quality.
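As an illustration of one of these checks, the Durbin-Watson statistic can be computed directly from the residuals of a regression. It is defined as the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation, while values near 0 suggest strong positive autocorrelation. The sketch below uses NumPy on synthetic residuals and is not the framework's validation code.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic of a residual series."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))


rng = np.random.default_rng(1)
n = 2000

# Uncorrelated residuals (white noise)
white = rng.normal(0, 1, n)

# Positively autocorrelated residuals: AR(1) process with coefficient 0.9
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.9 * correlated[t - 1] + white[t]

dw_white = durbin_watson(white)
dw_correlated = durbin_watson(correlated)
print(f"white noise: {dw_white:.2f}, AR(1): {dw_correlated:.2f}")
```

A low value on real residuals indicates that the linear model is missing time structure in the data, so its standard errors should not be trusted.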
Model import/export and storage
Once the model is analysed and validated, it can be saved and exported. This allows the model to be reused in other sensor studies, provided the same variables are available. The model objects are serialised with joblib and can be uploaded to the Models GitHub Repository for later use.
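The save and restore round trip with joblib can be sketched as follows. The model here is a plain dictionary with hypothetical contents, standing in for a trained model object. The file name and the idea of storing the feature names alongside the parameters are assumptions for illustration.

```python
import os
import tempfile

import joblib

# Stand-in for a trained model: parameters plus the variables it expects.
# The contents are hypothetical.
model = {
    "name": "example_linear_model",
    "features": ["RAW", "TEMP"],
    "coefficients": [2.0, 0.5],
    "intercept": 1.0,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.joblib")

    # Serialise the object to disk
    joblib.dump(model, path)

    # Load it back, as a later study would do
    restored = joblib.load(path)

print(restored["features"])
```

Storing the list of input variables with the model makes it easier to check, at load time, that a new dataset provides the same variables the model was trained on.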