
Sensor Analysis Framework


When dealing with sensor data, especially from low-cost sensors, a great deal of effort needs to be dedicated to data analysis. After careful data collection, this stage of our experiments is fundamental to extract meaningful conclusions and to prepare reports from them. For this reason, we have developed a data analysis framework that we call the Sensor Analysis Framework. In this section, we will detail how this framework is built, how to install it, and how to make the most of it.

We care for open science

The framework is written in Python, and can be run using Jupyter Notebooks or Jupyter Lab. It is intended to provide a state-of-the-art data analysis environment, adapted for use within the Smart Citizen Project, but easily expandable to other use cases. The ultimate purpose of the framework is to enable reproducible research by providing a set of tools that are replicable and expandable among researchers and users alike, contributing to the FAIR data principles.

Figure: FAIR data principles (by SangyaPundir, own work, CC BY-SA 4.0)

The framework integrates with the Smart Citizen API and helps analyse large amounts of data efficiently. It also includes functionality to generate reports in HTML or PDF format, and to publish datasets and documents to Zenodo.

More familiar with R?

R users won't be left stranded. rpy2 provides functionality to send data from Python to R quite easily.

Check the source code

How we use it

The main purpose of the framework is to make our lives easier when dealing with various sources of data. Let's see different use cases:

Get sensor data and visualise it

This is probably the most common use case: exploring data visually. The framework can download data from the Smart Citizen API or other sources, as well as load local CSV files. Different data exploration options are then readily available, and thanks to Python's visualisation tools you are not limited to them. Finally, you can generate HTML or PDF reports to share the results.
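As a minimal sketch of this use case, the snippet below loads a CSV export into pandas and computes summary statistics. The column names and timestamp format are illustrative of an SCK SD-card export, not the exact format:

```python
import io
import pandas as pd

# Hypothetical CSV in the SCK SD-card style: a timestamp column plus one
# column per sensor channel (the names here are illustrative).
csv_data = io.StringIO(
    "TIME,TEMP,HUM\n"
    "2021-01-01T00:00:00Z,21.5,45.0\n"
    "2021-01-01T00:01:00Z,21.6,44.8\n"
    "2021-01-01T00:02:00Z,21.4,45.2\n"
)

df = pd.read_csv(csv_data, parse_dates=["TIME"], index_col="TIME")

# A first look at the data; with matplotlib installed, df.plot() renders
# a time-series plot for a quick visual exploration.
print(df.describe().loc["mean"])
```

With data from the Smart Citizen API instead of a file, the resulting DataFrame can be explored in exactly the same way.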

Examples

Check the examples in the GitHub repository

Organise your data

When handling many different sensors, it can be difficult to stay organised and keep traceability. For this, we created the concept of a test, which groups a set of devices, potentially from various sources. This is convenient because metadata can be added to the test instance describing, for instance, what was done, the calibration data for the devices, any preprocessing the data needs, and so on. A test can later be loaded in a separate analysis session, modified or expanded, keeping all the data findable.

Some example metadata that can be stored would be:

  • Test Location, date and author
  • Kit type and reference
  • Sensor calibration data or reference
  • Availability of reference equipment measurement and type
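A test descriptor holding the metadata above could look like the following. This is a hypothetical sketch: the field names and structure are illustrative, not the framework's actual schema:

```python
# Hypothetical test descriptor grouping devices and their metadata;
# all field names are illustrative.
test = {
    "name": "2021-01_outdoor_comparison",
    "location": "Barcelona",
    "date": "2021-01-15",
    "author": "Jane Doe",
    "devices": [
        {
            "id": 10712,                          # Smart Citizen platform device ID
            "kit": "SCK 2.1",                     # kit type and reference
            "calibration": {"CO": "co_cal_v2"},   # reference to stored calibration data
        }
    ],
    "reference_equipment": {"available": True, "type": "official AQ station"},
}
print(test["name"])
```

Storing this alongside the recorded data keeps each test self-describing and findable in later sessions.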

A brief schema of the test structure is specified below:

Check the guide

Check the guide on how to organise sensor data

Clean sensor data

Sensor data never comes clean and tidy in the real world. For this reason, data can be cleaned with simple (and not so simple) algorithms before further processing. Several functions are already implemented (filtering by convolution, Kalman filters, anomaly detection, ...), and more can be added in the source files.
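As an example of the simplest of these approaches, a moving-average filter implemented by convolution smooths out spikes in a noisy signal. This is a generic NumPy sketch; the framework's own cleaning functions have their own names and signatures:

```python
import numpy as np

def smooth(signal, window=5):
    """Moving-average filter via convolution; a minimal illustration of
    the convolution-based filtering mentioned above."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output aligned with the input length
    return np.convolve(signal, kernel, mode="same")

noisy = np.array([1.0, 1.2, 0.8, 5.0, 1.1, 0.9, 1.0])  # 5.0 is a spike
clean = smooth(noisy, window=3)
```

Kalman filtering and anomaly detection follow the same pattern: a function takes the raw series and returns a cleaned one ready for later processing.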

Model sensor data

Low-cost sensor data needs calibration, with more or less complex regression algorithms. At times a simple linear regression is enough, but not always: sensors generally present non-linearities, and linear models may not handle the data robustly. For this, a set of models is readily implemented, using the power of common statistics and machine learning frameworks such as scikit-learn, TensorFlow, Keras, and statsmodels.
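The simplest case, a linear calibration against a reference instrument, can be sketched with scikit-learn as below. The numbers are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical calibration data: low-cost sensor readings alongside a
# co-located reference instrument (values are illustrative).
raw = np.array([[0.1], [0.5], [1.0], [1.5], [2.0]])  # low-cost sensor output
reference = np.array([0.3, 1.1, 2.1, 3.1, 4.1])      # reference measurements

model = LinearRegression().fit(raw, reference)
calibrated = model.predict(raw)
```

When a linear fit is not enough, the same fit/predict pattern applies to the non-linear and machine learning models mentioned above.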

Guidelines on sensor deployment

Check our guidelines on sensor deployment to see why this is important in some cases.

Batch analysis

Automating all of these tools can be very handy, since we want to spend less time programming analysis tools than actually doing analysis. Tasks can be batched for the framework to process autonomously. Some interesting use cases are:

  • Downloading data from many devices, doing something with it (e.g. cleaning it) and exporting it to .csv
  • Downloading data, generating plots, extracting metrics and generating reports for many devices
  • Testing calibration models with different hyperparameters, modelling approaches and datasets
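The first use case above reduces to a simple loop. In this sketch, `fetch_device_data` is a hypothetical stand-in for the framework's API download, returning canned data so the example is self-contained:

```python
import pandas as pd

def fetch_device_data(device_id):
    # Stand-in for downloading a device's readings from the API;
    # returns canned data with a gap to clean.
    return pd.DataFrame(
        {"TEMP": [21.0, None, 21.4]},
        index=pd.date_range("2021-01-01", periods=3, freq="min"),
    )

exports = {}
for device_id in [10712, 10713]:          # illustrative device IDs
    df = fetch_device_data(device_id)
    df = df.interpolate()                  # a minimal "do something" cleaning step
    exports[device_id] = df.to_csv()       # in practice: df.to_csv(f"{device_id}.csv")
```

Plot generation, metric extraction and model testing fit the same loop, swapping the cleaning step for the relevant task.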

Share data

One important aspect of our research is sharing the data so that others can work on it, build on top of our results, validate the conclusions or simply disseminate the work. For this, integration with Zenodo is provided to share datasets and reports:

Have a look at the features within the framework:

  • Tools to retrieve data from the Smart Citizen API or to load it from local sources (in CSV format, compatible with the SCK SD card data)
  • A data handling framework based on the well-known pandas package
  • Exploratory data analysis tools to study sensor behaviour and correlations with different types of plots
  • A sensor model calibration toolset with classical statistical methods such as linear regression, ARIMA and SARIMAX, as well as more modern machine learning techniques using LSTM networks, RF (Random Forest) and SVR (Support Vector Regression) models for sequential data prediction and forecasting
  • Methods to statistically validate and study the performance of these models, and to export and store them
  • As a bonus, an interface to move Python objects into the statistical analysis language R

Info

Check the guide on how to set it up here

Loading and managing the data

Data can be downloaded from the Smart Citizen API using kit IDs, or loaded from CSV files. To tidy up the data, recordings are organised around the concept of a test: an entity containing all the kits' references, sensors and general information regarding the conditions in which the measurements were carried out:

  • Test Location, date and author
  • Kit type and reference
  • Sensor calibration data or reference
  • Availability of reference equipment measurement and type

A brief schema of the test structure is specified below:

All of this structure is filled in at test creation with a dedicated script, saving future time spent untangling mismatched reading units, timestamp formats and so on.
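The kind of harmonisation such a script performs can be sketched with pandas: two hypothetical devices logging with different timestamp formats, units and rates are converted to a common unit and aligned on a shared time grid:

```python
import pandas as pd

# Device A: ISO timestamps, Celsius, every 2 minutes (illustrative data).
a = pd.DataFrame(
    {"TEMP": [21.0, 21.2]},
    index=pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 00:02:00"]),
)
# Device B: day-first timestamps, Fahrenheit (illustrative data).
b = pd.DataFrame(
    {"TEMP": [70.0, 70.5]},
    index=pd.to_datetime(["01/01/2021 00:01", "01/01/2021 00:03"], dayfirst=True),
)

b["TEMP"] = (b["TEMP"] - 32) * 5 / 9  # harmonise units to Celsius

# Align both devices on a common 1-minute grid.
combined = pd.concat({"dev_a": a, "dev_b": b}, axis=1).resample("1min").mean()
```

Doing this once, at test creation, means every later analysis session starts from consistent units and timestamps.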

Exploratory data analysis

Each device's data can be explored visually with different types of plots. Plots can also be generated in batch with descriptor files, as shown in the guide. Some of the functionalities implemented are:

  • Time series visualisation
  • Correlation plot and pairs plot
  • Correlogram
  • Heatmaps
  • Violin plots
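Two of the plot types above, a time series and a correlation view, can be sketched directly with matplotlib on synthetic data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated channels for illustration only.
t = np.arange(100)
temp = 20 + np.sin(t / 10)
hum = 50 - 2 * np.sin(t / 10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(t, temp)                 # time-series visualisation
ax1.set_title("Time series")
ax2.scatter(temp, hum)            # correlation plot between two channels
ax2.set_title("Correlation")
fig.savefig("exploration.png")
```

The framework wraps this kind of plotting (in matplotlib and Plotly) behind higher-level calls, so a descriptor file can generate the same figures in batch.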

This section uses interactive plotting frameworks such as Plotly, as well as the well-known matplotlib, to serve different exploratory analysis tools.

Data models

The data models section includes tools to prepare, train and evaluate models coming from different devices within a test in order to calibrate your sensors. It provides an interface to common statistics and machine learning frameworks such as scikit-learn, TensorFlow, Keras, and statsmodels. These frameworks provide tools to perform:

Pre-processing stage:

  • Outlier detection with Holt-Winters methods (triple exponential smoothing) and XGBoost regressors
  • Data study and analysis of multicollinearity and autocorrelation, to determine significant variables and avoid overfitting the model with non-significant exogenous variables
  • Trend decomposition and seasonality analysis
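To illustrate the idea behind the outlier-detection step with something much simpler than Holt-Winters or XGBoost (which the framework actually uses), the sketch below flags points that deviate strongly from a rolling median:

```python
import pandas as pd

# Illustrative series with one obvious outlier at position 4.
s = pd.Series([1.0, 1.1, 0.9, 1.0, 9.0, 1.1, 1.0, 0.9])

# Deviation from a centred rolling median; an illustrative fixed
# threshold flags the outlier. This is a simplified stand-in, not the
# framework's Holt-Winters/XGBoost approach.
resid = (s - s.rolling(3, center=True, min_periods=1).median()).abs()
outliers = resid > 3.0
```

The principle is the same in the real pre-processing stage: model the expected value, then flag points whose residual is too large.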

Model stage

  • Baseline model estimations to assess minimum targets for model quality (using naive regression models)
  • Ordinary linear regression techniques for univariate and multivariate, linear and non-linear independent variables
  • ARIMAX (Autoregressive Integrated Moving Average with exogenous variables) models using Box-Jenkins parameter selection methods
  • Supervised learning techniques:
    • Single- and multiple-layer LSTM (Long Short-Term Memory) networks with configurable structure
    • Random Forest and Support Vector methods for regression
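The baseline idea in the first bullet can be made concrete with a naive forecast, predicting each value from the previous one; its error sets the minimum bar any real model has to beat:

```python
import numpy as np

# Illustrative target series (e.g. a pollutant concentration).
y = np.array([2.0, 2.1, 2.3, 2.2, 2.5, 2.4])

# Naive baseline: predict y_t as y_{t-1}, then measure the mean
# absolute error. A calibration model is only useful if it beats this.
naive_pred = y[:-1]
mae_naive = np.mean(np.abs(y[1:] - naive_pred))
```

The same metric, computed for an ARIMAX or LSTM model on held-out data, tells you whether the extra complexity actually pays off.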

An example is shown below for the estimation of the SGX4514 CO with the rest of the kit's available sensors, using a single-layer LSTM network and only two weeks of training data:

Depending on the model selected, different validation techniques are implemented to verify the model's assumptions and avoid data misinterpretation (e.g. the Durbin-Watson or Jarque-Bera tests for linear regression). Finally, it is important to follow the instructions in the notebook carefully, to avoid low model quality.

Model import/export and storage

Once the model is analysed and validated, it can be saved and exported, allowing it to be reused with the same variables in other sensor studies. The model objects are serialised with joblib and can be uploaded to a Model Repository.
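A minimal round trip with joblib looks like the following; the file name and the fitted model are illustrative:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a toy calibration model (illustrative data: reference = 2*raw + 0.1).
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.1, 2.1, 4.1])
model = LinearRegression().fit(X, y)

# Serialise to disk; in practice the file would go to a Model Repository.
path = os.path.join(tempfile.mkdtemp(), "co_calibration.joblib")
joblib.dump(model, path)

# A later session reloads the model and uses it with the same variables.
restored = joblib.load(path)
pred = restored.predict([[3.0]])
```

Because the restored object is the fitted estimator itself, it expects exactly the input variables it was trained on, which is why keeping the test metadata alongside the model matters.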

Source files

Download

Check the source code