Data Science using Python



This course covers basic data handling with pandas and NumPy, as well as automated plotting with Matplotlib, using a practical example from the Industry 4.0 model factory at the FH Aachen University of Applied Sciences.

Learning Outcome

Introduction

This course teaches you the basics of processing and validating data, along with lessons learned from working with real measurement data. We'd like to provide a hands-on experience through a practical course in which you process data provided by us.

Requirements

If you're unfamiliar with Python or need a refresher, please prepare with the following courses:

What You Need

Hardware

Software

Sources and Resources


Task

Project Introduction

The practical course consists of:

The provided data used in this course was generated using a HackRF. The HackRF is a software-defined radio (SDR) that can transmit and receive radio signals. The measurement was conducted during a scientific project: the HackRF was used to measure the noise level as well as a periodically changing signal generated automatically by a WiFi access point.

Setting Up the Work Environment

All required data and the prepared Python modules are available in a GitLab repository: DataScience Basics Repository.

Clone the Git Repository

We recommend using a general entry point, for example, a module seminar/main.py that uses the modules you've created to fulfill the tasks presented in this course.

The modules seminar.main and seminar.preprocessing are already created in the Git project; however, their program logic is not yet implemented.

We recommend creating further modules that each fulfill a single task according to the project outline, e.g., seminar.plot, seminar.extraction, and seminar.final.

Use the module seminar as an entry point to execute the main function in seminar.main.
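As a sketch of such an entry point (assuming the package layout described above), a minimal seminar/__main__.py would let you start the package with python -m seminar:

    # seminar/__main__.py -- minimal sketch of a package entry point
    from seminar.main import main

    if __name__ == "__main__":
        main()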

Create Configuration

Create Seminar Config

Test Seminar Config

Flake8 Config Seminar

You can automatically check your programming style before executing the main program logic by adding the flake8 configuration to the seminar configuration.
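One way to do this (a sketch, assuming flake8 is installed; running it as a subprocess before main is our own choice, not the course's prescribed mechanism):

    # seminar/main.py -- sketch: run flake8 before the main program logic
    import subprocess
    import sys

    def main():
        # Check the coding style of the seminar package first.
        result = subprocess.run([sys.executable, "-m", "flake8", "seminar"])
        if result.returncode != 0:
            sys.exit("flake8 reported style violations -- aborting.")
        # ... main program logic follows here ...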

Autostart Flake8 Seminar

Understanding the Data Set

The data from data/measurement.csv was generated using hackrf_sweep and stored as a CSV file.

The CSV file does not contain a header; thus, documentation is needed to understand the columns of the measurement:

The first two columns contain the timestamp (date and time) of the measurement, followed by the low and high frequency, the bin width, and the number of samples. The remaining columns contain the activity of each frequency bin in decibels, starting with the first bin reaching from (low frequency) to (low frequency + bin width) and ending with the last bin from (high frequency - bin width) to (high frequency).
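To make this explicit when loading, you can read the CSV without a header and assign names yourself. A sketch, assuming the layout described above; the column names are our own choice, not prescribed by hackrf_sweep:

    import pandas as pd

    META_COLUMNS = ["date", "time", "hz_low", "hz_high", "hz_bin_width", "num_samples"]

    df = pd.read_csv("data/measurement.csv", header=None)
    n_bins = df.shape[1] - len(META_COLUMNS)
    # Name the metadata columns, then one column per frequency bin.
    df.columns = META_COLUMNS + [f"db_{i}" for i in range(n_bins)]
    print(df.head())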

Course Goal

The goal of this course is to create a dataset that contains the signal strength in dB, indexed by the frequency band starting points (lowest value of each frequency band) and the timestamps.

Extract:

datetime                      2400000   2400049
2020-06-02 10:38:38.852938    -31.13    -45.80
2020-06-02 10:38:38.872789    -30.19    -43.60
2020-06-02 10:38:38.892089    -30.75    -42.84

Based on this table, a waterfall diagram should be created which provides an overview of the data.

Waterfall Diagram

However, the obtained dataset presents some challenges:

Thus, several steps are required to obtain the cleaned dataset and the waterfall diagram.

Data Preparation

The following data preparation steps should be included in the module seminar.preprocessing.

Check your column names and your data using the pandas.DataFrame.head function.

Not all data is required in the final dataframe; however, some information is required for intermediate data processing.

For the data preparation, multi-level indexing (MultiIndex) will be used (Pandas MultiIndex Documentation). This way, the obtained signal strengths are categorized by the timestamp as well as the starting frequency of the sweep.
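A sketch of building such a MultiIndex, continuing the loading sketch above (the column names date, time, and hz_low are the assumptions made there):

    # Combine the two timestamp columns and index by (datetime, hz_low).
    df["datetime"] = pd.to_datetime(df["date"].str.strip() + " " + df["time"].str.strip())
    df = df.set_index(["datetime", "hz_low"])
    print(df.head())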

Check your new indices using the pandas.DataFrame.head function.

If you are already familiar with Pandas using a single index in your dataframes, be aware that some functions work differently. Check out pandas.DataFrame.xs to see how you can access data in your dataframe.
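For example, to select all rows of one frequency band from the MultiIndexed dataframe (hz_low being the level name assumed above; the value must match the units in your data, here taken from the example table):

    # Cross-section: all sweeps whose frequency band starts at 2400000.
    band = df.xs(2400000, level="hz_low")
    print(band.head())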

Now, duplicate data (same timestamp and frequency band) should be merged by replacing the duplicates with their mean value.
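A sketch of this step, assuming the MultiIndex built above:

    # Rows sharing (datetime, hz_low) are collapsed into their mean.
    # numeric_only=True drops the leftover string columns (date, time).
    df = df.groupby(level=["datetime", "hz_low"]).mean(numeric_only=True)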

Now, a new dataframe should be created that lists the signal strengths by timestamp and the starting frequency of each frequency bin.
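One possible reshaping (a sketch, assuming the column names introduced above; melt/pivot is one of several ways to do this):

    # Bring the bin columns into long form and compute each bin's start frequency.
    long = df.reset_index().melt(
        id_vars=["datetime", "hz_low", "hz_bin_width"],
        value_vars=[c for c in df.columns if c.startswith("db_")],
        var_name="bin", value_name="db",
    )
    long["bin_index"] = long["bin"].str.replace("db_", "", regex=False).astype(int)
    long["bin_start"] = long["hz_low"] + long["bin_index"] * long["hz_bin_width"]
    # Wide form: one row per timestamp, one column per bin start frequency.
    result = long.pivot(index="datetime", columns="bin_start", values="db")
    print(result.head())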

It can save time to write a function that checks for the existence of cache_df.pkl and reads your dataframe from this file.
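A sketch of such a caching helper; the file name cache_df.pkl is taken from above, the function name is our own:

    from pathlib import Path

    import pandas as pd

    CACHE_FILE = Path("cache_df.pkl")

    def load_or_build_dataframe(build):
        """Return the cached dataframe if present, otherwise build and cache it."""
        if CACHE_FILE.exists():
            return pd.read_pickle(CACHE_FILE)
        df = build()
        df.to_pickle(CACHE_FILE)
        return df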

Data Verification

A unit test is a smaller test, one that checks that a single component operates in the right way. A unit test helps you to isolate what is broken in your application and fix it faster (Python Testing).

There is a unit test in seminar.tests that uses this file to validate your dataframe.
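The actual test already exists in seminar.tests; purely as an illustration, a minimal pytest-style check of a cached dataframe could look like this (the assertions are our own assumptions, not the course's test):

    import pandas as pd

    def test_dataframe_is_valid():
        df = pd.read_pickle("cache_df.pkl")
        # The cleaned dataframe should not be empty ...
        assert not df.empty
        # ... and should contain only numeric signal strengths.
        assert df.dtypes.map(pd.api.types.is_numeric_dtype).all()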

Unit Test Configuration

Data Visualization

A waterfall diagram should be created to visualize the obtained dataset.

Use the Matplotlib examples at the bottom of the page (Matplotlib imshow Documentation) to see how you can adjust your plot.
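A minimal sketch of such a waterfall plot with imshow, assuming the wide dataframe result from the reshaping sketch above:

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))
    # Each row is one sweep, each column one frequency bin.
    im = ax.imshow(result.values, aspect="auto", origin="lower")
    ax.set_xlabel("Frequency bin")
    ax.set_ylabel("Sweep number")
    fig.colorbar(im, ax=ax, label="Signal strength [dB]")
    fig.savefig("waterfall.png")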


Summary and Outlook

Lessons Learned "Pandas"

Measurement Errors

Accuracy and Precision

Usually, you have a hypothesis and need to test whether it is right (or not). Thus, you need to collect data. However, most measurements have errors, and some calculations introduce errors of their own; for example, calculating an average also produces a standard deviation around that average. Initially, the errors are often so large that your hypothesis can be neither accepted nor rejected. However, adding more data results in sharper error distributions, which allows you to accept or reject the hypothesis with a specific confidence.
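A small illustration of this effect (a sketch with simulated dB readings, not the course data): the standard error of the mean shrinks with the square root of the number of samples:

    import numpy as np

    rng = np.random.default_rng(42)
    for n in (10, 100, 10_000):
        sample = rng.normal(loc=-30.0, scale=2.0, size=n)  # simulated noise readings
        sem = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        print(f"n={n:>6}: mean={sample.mean():.2f} dB, SEM={sem:.3f} dB")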

Error Propagation

Next Steps When Analyzing the Data

There are automated approaches to extract features from a dataset. These approaches categorize (often multidimensional) datasets into a number of categories and are covered by libraries such as SciPy.

First, we focus on the manual feature extraction, which starts by looking at the data:

Looking at the waterfall diagram, we see all the data and can extract features. For example, we are interested in the signal-to-noise ratio over the measured frequencies. Just by looking at the data, we see artifacts, heavily polluted areas, and areas where our transmitter seems to be the only source.

This leads to the following questions:

Thus, you need to look at all measured frequencies. A time-over-frequency plot results in a graph from which you can try to extract features. The difficulty here is that the data is polluted and that you cannot easily see the same features as in the waterfall diagram because the line weights are too thick.

One solution is to use a smoothing function (running averages); however, these average over the data, which is often not beneficial. A second solution is to use dots instead of lines, which might reveal a different picture. In many cases, however, a histogram of the data is a powerful method for finding features.

A histogram offers a time-independent look at the data, and we can use it to easily measure the noise level by fitting a function to the noise. Once the noise is known, the signals can easily be separated from it. Using a histogram and deriving parameters reduces the dimensionality and complexity of the dataset.

In our example, the classification of the noise condenses hundreds of data points into two numbers (the height and position of the fitted distribution). Thus, the next step would be to create histograms for each frequency band.
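As a sketch of this idea (assuming roughly Gaussian noise; the function name and the selected band are our own choices), the noise floor of one frequency band could be estimated by fitting a normal distribution to its dB values:

    import numpy as np
    from scipy import stats

    def noise_parameters(db_values):
        """Fit a normal distribution to the dB readings of one frequency band."""
        mu, sigma = stats.norm.fit(db_values)
        return mu, sigma

    # Usage sketch: values well above the fitted noise floor count as signal.
    db_values = np.asarray(result[2400000])  # one band from the wide dataframe above
    mu, sigma = noise_parameters(db_values)
    signal = db_values[db_values > mu + 3 * sigma]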