Before starting, I’d just like to say that this post is not a cookbook of algorithms for handling time series data. I will cite some, but there are too many, and listing them is not the purpose. This article is more of a REX (Return on Experience) on what I faced and how to simply avoid some basic issues.
Coming back to our topic, when collecting time series data one has to keep their characteristics in mind. Three parameters are especially important:
Resolution: number of data samples per unit of time.
Precision: certainty of the measurement at each time point.
Accuracy: relationship between the signal and the timing of the effect we want to predict.
On top of these come the signal’s own characteristics, mainly its amplitude and frequency. On the other hand, there are rules that have to be followed to capture enough data to render the desired effect. For that, there is a theorem from Nyquist and Shannon, called the sampling theorem, which comes from the telecom world.
Nyquist did theoretical work on bandwidth requirements and realised that to spot the relevant information in a signal, he needed to sample it at least at twice the frequency of the information he was looking for. This sampling theorem is fundamental to linking continuous and discrete signals.
In the picture below, one can see that sampling the signal at the orange dots does not render the signal’s information: the sampling is at the signal’s frequency. Adding the green dots doesn’t change anything (orange + green = 1.5 times the frequency). But adding the yellow dots to the orange ones renders the signal’s complexity.
These physical characteristics of the signal have to be taken into account at the very beginning of the project, when choosing the sensor used to measure our data.
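To see the theorem in action, here is a small numpy sketch (the 10 Hz signal and the sampling rates are arbitrary choices of mine): sampled fast enough, the dominant frequency found by an FFT is the true one; sampled too slowly, the signal aliases to a frequency that isn’t in it at all.

```python
import numpy as np

F_SIGNAL = 10.0  # Hz, the frequency of the information we are looking for

def dominant_frequency(fs):
    """Sample a 10 Hz sine for 1 s at rate fs, return the strongest frequency found."""
    t = np.arange(0.0, 1.0, 1.0 / fs)
    samples = np.sin(2 * np.pi * F_SIGNAL * t)
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / fs)
    return freqs[np.argmax(spectrum)]

# Sampling well above twice the signal's frequency recovers it...
print(dominant_frequency(fs=100.0))  # ~10 Hz
# ...while sampling below twice the frequency aliases it to a
# frequency that does not exist in the signal.
print(dominant_frequency(fs=12.0))   # ~2 Hz
```

The second call is exactly the orange-dots situation: the samples are real, but the frequency they suggest is wrong.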
Let’s take an example. An ECG, or electrocardiogram, is a measurement of the voltage in the heart; it is the reflection of heart contraction. Let’s say your heart beats every second (which is pretty good). Given what we just saw about the Nyquist theorem, you might think that sampling twice a second would be enough to catch the relevant information in the signal. If you think so, you’re wrong, and you’ll miss most of the relevant info. Indeed, an ECG is made of several peaks, each important to a cardiologist trying to identify a pathology. Below is a picture of a heartbeat and the duration of each phase composing it.
Looking at it, it seems the sampling rate shouldn’t be slower than 20 Hz, ten times the 2 Hz we were considering before. The sampling theorem addresses the “resolution” characteristic of the data.
Now that we have a sensor delivering data at the proper frequency, we should consider how the sensor delivers it, as this can affect both the precision of the signal and its accuracy.
Let’s consider precision first, reusing our heartbeat example. Looking at the picture again, we see that the difference between the highest and the lowest point is approximately 1 mV, while the smallest peak has an amplitude of about 0.03 mV. Knowing this, the sensor needs to be precise enough to capture this effect.
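As a back-of-the-envelope check, here is a tiny python sketch using the figures above (a ~1 mV range and a ~0.03 mV smallest peak, both read off the picture) to estimate the bit depth the sensor’s ADC would need:

```python
import math

# Figures from the ECG example: the full signal spans roughly 1 mV,
# and the smallest peak we care about is about 0.03 mV.
full_range_mv = 1.0
smallest_feature_mv = 0.03

# Number of distinguishable levels the ADC must resolve, and the
# bit depth needed to encode them.
levels = full_range_mv / smallest_feature_mv
bits = math.ceil(math.log2(levels))
print(bits)  # → 6
```

In practice you would take a comfortable margin for noise, which is why real acquisition front ends use far more bits than this bare minimum.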
Last but not least, the accuracy of the measurement. Sensors commonly embed pre-processing algorithms. These algorithms affect the data, and even if one cannot avoid them, they have to be taken into account when analysing it. This processing usually consists of various kinds of filtering (generally of the 50 Hz mains component, which can be considered noise) or normalization of the data. It is very important to read the manufacturer’s manual in order to be aware of the algorithms used and to take their effects into account when processing and interpreting the results.
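To make that concrete, here is a minimal numpy sketch (the signals and gains are made up) of what an on-board normalization does to your data: once the sensor has z-scored the signal, the absolute amplitudes are gone for good, and only the manual will tell you that this happened.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500, endpoint=False)

# Two hypothetical recordings of the same waveform at different gains.
weak = 0.5 * np.sin(2 * np.pi * 5 * t)
strong = 2.0 * np.sin(2 * np.pi * 5 * t)

def zscore(x):
    """The kind of normalization a sensor may silently apply on-board."""
    return (x - x.mean()) / x.std()

# After normalization the two signals are indistinguishable: the
# absolute amplitude (0.5 mV vs 2 mV, say) cannot be recovered.
print(np.allclose(zscore(weak), zscore(strong)))  # → True
```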
All this seems quite trivial and logical, but if I took the time to talk about it, it is because, despite the small number of projects I have had as a data scientist, I faced all these situations at least once.
Now that we know how the data are collected and everything is under control, it’s time to process them.
Despite all the efforts put into the collection phase, the signal is not pure and ready to be processed and analysed. Several things you cannot control can occur. When dealing with movements, it is almost certain that your sensors will record some unwanted motion. In this situation, it is you and the people you are doing the analyses for who will have to judge the importance of these movements. Are they part of the real conditions that will make your model generalize better later, or are they parasites that will prevent your model from fitting in the first place? This question is tricky to answer and has to be considered. Are the environmental alterations of the signal part of the relevant information? Do you need to filter them out? If so, some deeper analyses have to be done, involving among others the Fourier transform, in order to find the different frequencies carried by your signal and their power, and perhaps then apply some filtering to attenuate the unwanted components.
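Here is a hedged numpy sketch of that workflow (the 3 Hz “movement” component and the 50 Hz interference are invented): the Fourier transform reveals which frequencies carry power, and a crude frequency-domain filter removes the unwanted one.

```python
import numpy as np

fs = 500.0  # sampling rate, Hz
t = np.arange(0.0, 2.0, 1.0 / fs)

# Hypothetical recording: a 3 Hz movement component we care about,
# polluted by 50 Hz mains interference.
signal = 1.0 * np.sin(2 * np.pi * 3 * t) + 0.4 * np.sin(2 * np.pi * 50 * t)

# Fourier transform: which frequencies are present, and how strong?
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
power = np.abs(spectrum)

# The two dominant components show up at 3 Hz and 50 Hz.
top = freqs[np.argsort(power)[-2:]]
print(sorted(top))  # → [3.0, 50.0]

# Crude filtering: zero out a narrow band around 50 Hz and invert.
spectrum[np.abs(freqs - 50.0) < 2.0] = 0.0
cleaned = np.fft.irfft(spectrum, n=len(signal))
```

In a real project you would use a proper filter design instead of zeroing bins, but the decision of *what* to filter is the same, and it is yours, not the algorithm’s.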
Then again, maybe your signal doesn’t come from one sensor but from several. If you wish to monitor the heartbeat at the same time as the subject’s activity, you might have two, three or even more sensors. What then?
Assuming you’ve done everything we’ve already discussed while collecting and processing the data, there is one more step: synchronization. It can be thought of at collection time to ease the post-processing; if not, there are still solutions to synchronize the data afterwards. Then, there are several points to address. If you’re processing the data to build a model that will make predictions, it is likely that the way you’ve collected the data is the way your system (maybe real-time) will collect new data in order to make predictions. In both cases you should apply the same post-processing to your data for learning and inference, otherwise the model will not be able to perform properly. Nevertheless, there are things you can do in batch processing (understand: not real-time) that would be impossible in real time. Indeed, when synchronizing data for batch processing, you can rely on past and future data to estimate the value at a given time point. This is not possible for real-time data, where, usually, the last available value is kept until it is updated by a new one.
| timestamp | data1 | data2 | sync batch | sync real time |
| --- | --- | --- | --- | --- |
Looking at this table, it seems more obvious that the effect on the data is not negligible, and by extension that a model trained on batch-synchronized data will underperform on real-time data.
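As a toy illustration of the two synchronization modes (the timestamps and values are invented), here they are in plain python: batch mode may interpolate using a *future* sample, while real-time mode can only hold the last value received.

```python
# Hypothetical second sensor: it delivers (timestamp, value) pairs on its
# own clock, and we want its value at another sensor's timestamps.
data2 = [(0.4, 10.0), (1.4, 20.0), (2.4, 30.0)]
query_times = [1.0, 2.0]

def sync_batch(stream, t):
    """Batch mode: we can interpolate between the past AND the future sample."""
    for (t0, v0), (t1, v1) in zip(stream, stream[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t is outside the recorded range")

def sync_realtime(stream, t):
    """Real-time mode: the future does not exist yet; hold the last value seen."""
    last = None
    for ts, v in stream:
        if ts <= t:
            last = v
    return last

print([sync_batch(data2, t) for t in query_times])     # ≈ [16.0, 26.0]
print([sync_realtime(data2, t) for t in query_times])  # → [10.0, 20.0]
```

A model trained on the interpolated column will see smoother, better-aligned values than the held ones it will get in production, which is exactly the mismatch described above.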
The take-home message here is that as a data scientist, you will have to handle data coming from different sources. Even if some projects are close to each other, they all have their own characteristics. One essential point is to talk to the experts you are working with (generally your client) and to be sure you understand the physics behind the data and the purpose of what is asked of you. Which goal do you serve?
Hoping that you’ve survived this long post, feel free to share any insight about what you’ve encountered and your own experience. Next step? Trying to use all this in a machine that learns, either for prediction or for classification. 😉
This is the first of a series of 4 articles I am writing about time series. Why this topic? I come from neurosciences, and in this domain signal processing of EEG data is a really interesting though difficult topic. Then, turning into a developer and now an AI engineer, I deal with other kinds of time-series data (from body sensors for example) and realised that there was a gap in the understanding of these data. I made (and will probably keep on making) mistakes because of it, and writing these articles is a way to share and keep track of what I have learned so far.
The importance of time series is prodigious. We are no longer, and haven’t been for a while, in a static world. Everything is moving, and it is moving faster than most of us can even think. This makes our brains quite bad at analysing the data coming from connected sources: too much ordered information to compute. This information can come from IoT devices, stock markets, the surveys active on websites and so on. Fortunately, computing and machine learning methods are here to help us manipulate these data and give us insight into the information they carry.
Why is time series analysis important? Well, in life, what you are doing today can (and will) have an effect on what will happen tomorrow. As a data scientist you should know that with great power comes great responsibility. You have the duty to create models that properly explain your data. How will the data you keep preciously on your hard drive (because you don’t trust the cloud) influence the future? You have the power to predict which event the flight of a butterfly will produce! More seriously, time series are all around us, and analysing them involves understanding various aspects of the inherent nature of the world that surrounds us. Being able to manage them and get the best insight from them can really change our lives for the best… or not.
So before digging more seriously into this world, time has come for some definitions, and the most important one is: time series. We keep talking about it and, even though this notion is known by a lot of people (probably including you), what if I asked you to define it?
Let’s do this together. Imagine a quantitative value, say the accelerometer of that fancy sport watch you secretly want for Christmas. The value of its linear acceleration varies through time. Here we are: we’ve got time. What about series? If you’re familiar with the python library pandas, a Series is a fancy kind of list. So, easy shortcut: a series is a list of values. A time series is thus a list of a specific value varying through time. Easy!
What now? We are not done yet. This definition is very simplistic, so let’s go a little deeper.
First, this kind of data is supposed to be continuous, but to be honest, even if time is, we are not able to measure it, or any other value varying with it, in a continuous way. So what? We are catching the value of a continuous quantity, the time series, at discrete moments in time. The tricky part is that if these discrete moments are too far apart, we can miss useful information. On the contrary, if they are too close, we are likely to be overwhelmed with useless data. One can easily imagine that it is better to be in the second case than in the first one. This is addressed by the Nyquist theorem.
Time series are, by definition, ordered. This means that the position of a given point in time is driven by the position of the points before it. This can be due to different components (this is why you will often hear about component analysis):
- Trend component: it has no cycle, it is “just” increasing or decreasing. This is mainly found in stock market analysis.
- Seasonal component: this is an easy one. Its value depends on the season, like wood prices for fireplaces or symphonic orchestra prices for New Year’s Eve concerts.
- Cyclic component: the seasonal component is kind of cyclic, but here we are more likely to find data measured on a longer scale, such as stock crashes, epidemics and so on.
- Unpredictable: these data or events are, by nature, stochastic. It is difficult (nearly impossible) to predict them.
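To illustrate the components above, here is a small numpy sketch (the monthly series, its trend and its seasonality are all made up) of the classic moving-average way of splitting a series apart:

```python
import numpy as np

# A made-up monthly series: linear trend + yearly seasonality + noise.
rng = np.random.default_rng(42)
months = np.arange(120)                                   # 10 years, monthly
trend = 0.5 * months
seasonal = 10.0 * np.sin(2 * np.pi * months / 12)
series = trend + seasonal + rng.normal(0.0, 1.0, months.size)

# Trend component: a 12-month moving average, since averaging over a
# full cycle cancels the seasonal component out.
est_trend = np.convolve(series, np.ones(12) / 12, mode="same")

# Seasonal component: average the detrended values month by month,
# skipping the first and last year where the moving average is distorted.
detrended = series - est_trend
est_seasonal = detrended[12:108].reshape(-1, 12).mean(axis=0)
```

What is left after subtracting both estimates is the unpredictable part. Libraries such as statsmodels ship a ready-made version of this idea (`seasonal_decompose`), but the principle is the few lines above.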
These characteristics are closely linked to what we call the period of a signal. The period of a cyclic signal is the time it takes to complete a full cycle.
- Amplitude: the maximum displacement from a mean value.
- Frequency: this is 1/period. As we defined the period as the number of time units it takes to perform a cycle, the frequency is the number of cycles per unit of time. A good example of frequencies in daily life is sound: the notes composing more complex sounds have very specific frequencies.
Fig1: Illustration of some time-series characteristics.
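To make these definitions concrete, here is a small numpy sketch (the amplitude-3, 5 Hz signal is arbitrary) that recovers both characteristics of a sampled sine directly from the definitions above:

```python
import numpy as np

fs = 1000.0                        # sampling rate, Hz
t = np.arange(0.0, 2.0, 1.0 / fs)

# A made-up signal with amplitude 3 and period 0.2 s (frequency 5 Hz).
x = 3.0 * np.sin(2 * np.pi * 5.0 * t)

# Amplitude: the maximum displacement from the mean value.
amplitude = np.max(np.abs(x - x.mean()))

# Frequency: one full period contains two zero crossings, so count the
# sign changes and divide by twice the duration.
crossings = np.sum(np.diff(np.sign(x)) != 0)
frequency = crossings / 2 / (t.size / fs)
print(amplitude, frequency)        # ≈ 3.0 and 5.0
```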
There are other signal characteristics, such as wavelength, but I will address them when needed so as not to bother you with extra information too soon. If you want to play with the different characteristics of a time series, I invite you to go there in order to measure the effect of each of them on the data.
When talking about the medical field, lots of people agree that we are all different. Yet despite this, we all use the same pills, therapies, etc. What if AI or ML could adapt medicine to each one of us? Create the perfect pill to cure our headache considering our age, gender, lifestyle, medical history? We would thus end up with the perfect pill and the perfect treatment recommendation (number of doses, for how many days, and so on).
The very first application of mathematics for a medical purpose does not come from physicians but from an insurance company. Their goal: predict whether a customer would be likely to die in the year to come. Yes, this is not such a surprise, right? The surprise, on the other hand, comes from when it took place: the 17th century. This innovation came from a man called John Graunt. He is considered the creator of the life table and thus the originator of demography.
Figure 2: John Graunt’s actuarial tables.
John Graunt’s actuarial tables were one of the first results of time-series-style thinking applied to medical questions. The image is in the public domain and taken from the Wikipedia article on John Graunt.
Nowadays, time series are not the most studied data in medicine. Studies focus more on visual data to help medical teams detect cancer and the like. Moreover, the lack of sharing and the difficulty of working as distributed teams make it hard to aggregate sufficient quantities of data. In this context, clinical studies remain the norm. However, some experiments mixing visual and time-series data are being run. This is how an AI is capable of predicting the evolution of the blues more precisely than any practitioner. Lately, time series have been used as epidemiological predictors, and both local and international political decisions are being made to help this field develop. Nevertheless, we still have trouble anticipating the course of an epidemic.
The medical field in which time series have been widely used for more than a century now is neuroscience.
Indeed, physicians and researchers discovered the electrical activity of the brain (citation), and technologies like the EEG (electroencephalogram) have been in use since the first quarter of the 20th century (source). It is no surprise that attempts to match mathematical models to the brain’s behaviour were soon made.
One of the problems is that at first, EEG data mainly came from patients and were thus related to a disease. It became important to measure brain electrical activity on a priori healthy people in order to compare, and to understand that when a given function is lost, it relates to specific anatomical and electrical modifications. Since then, loads of students have been asked to have their brain activity recorded while doing numerous tasks, to compare with patients. Nowadays, healthcare is benefiting from the digital revolution of the late 20th/early 21st century. With the advent of wearable sensors and smart electronic medical devices, we are entering an era where healthy adults take routine measurements, either automatically or with minimal manual input. The issue with a lot of devices is the precision of the data: one may understate their weight a little and overstate their height. Plus, we cannot be sure a device is worn by its actual owner, nor of the owner’s overall health if they decide not to share that they have diabetes or any other personal information. The medical field is no longer a physician’s world; several different actors are trying to forecast people’s biometric data with more or less ethics.
The Western world has been shaken by several crises. These crises have left scars on our banking system, and in order to predict such big changes, models are applied to help bankers make the right decision at the right moment. Early banking systems relying on data forecasting and analysis gave rise to economic indicators, most of which are still in use today. In the banking system, almost every decision relies on time-series data management. One of the first approaches is data visualization. This technique helps human beings handle ordered data by transposing them into an unordered world. Indeed, if machines are very good at processing huge amounts of data stored in databases or endless Excel files containing trillions of rows and thousands of columns, our brain is not meant for this. Our brain is made for images, sounds, touch. Our body is analog when we try to feed it binary data. Another approach is found in expert models, or models that can adapt to tendencies, evolve and give good insight. Wait a minute, isn’t that artificial intelligence?
Human beings have always wished to predict (and why not act on) the weather. After being a philosopher’s affair in antiquity, weather forecasting was taken up seriously by scientists, and today thousands of recording stations are spread all over the world in order to understand the phenomena driven by Mother Nature. Weather forecasting is nothing more than a time-series prediction game. If the tools used at the very beginning of this field relied purely on complicated algorithms, the tendency has been to simplify them in the name of the principle of economy. Then some machine learning was added to these expert algorithms in order to assemble the results “automatically” and make a good decision. Finally, today’s attempts focus more on deep learning techniques. Beyond the fashion of deep learning, this has real scientific interest, but let’s keep that for later.
I’ve said a lot about using time series to make predictions, but this is not the only thing we can do with them. Time series can be used to encrypt or decrypt, or to add noise to a signal. This is mainly used in signal management systems such as communications or espionage. Finally, time-series data can be used simply to understand what is around us, trying to figure out the effect of one signal on another by simple statistical analysis of the properties of the data.
Ever felt stuck on a problem?
Not a daily-life problem but one about data. Then again, maybe you’re a data scientist and your daily-life problems are about data.
Often, when you struggle with this kind of problem, simply expressing it to other people makes the solution pop out like magic. Unfortunately, sometimes the discussion remains sterile and you can only keep following the way you’re already reasoning. This makes some sense, as people often surround themselves with people who think and reason like they do. It makes one feel more comfortable. The thing is, it is useless when one wants to think differently. So if you don’t have a weirdo friend around to get you out of the box, why not use a deep learning black box, and untangle it to find some creativity? Sounds a little blurry? Let me explain.
Once upon a time, a colleague of mine tried to solve a problem using a complex algorithm with several parameters. In fact, she solved the problem, but only for a specific case. Then she realised that in order to use her algorithm in various conditions, she would have to define a large number of values for the different cases. Even then, she wouldn’t cover some specific use cases. She was struggling with a generalization problem. Another issue my friend had was that in some cases she was tuning certain parameters of her algorithm, and in other cases, others.
Let’s take an example. Imagine you want to characterise the movement of a ball. Just a regular movement: let’s say I am playing with my colleague and we are throwing the ball at each other. To characterize it, the only parameter needed is velocity. This is what my colleague used. Now, imagine that I haven’t told her that I played basketball for 15 years at a decent level, and I want to play a trick on her by modulating the ball’s velocity and adding some spin to it. Linear velocity will no longer be sufficient, and angular velocity will probably be needed. So in some cases, solving the equation is “simple”; in others, it requires more expertise and time.
Let’s come back to our problem. Here there are not 2 but 12 parameters, and testing all the different parameter combinations and their proper values would take a lifetime. Then two ideas arose, and they both involve AI, but in very different ways.
The first one is quite classical. If a human is able to find all the parameters for different conditions, why not an AI? Give the model time-series data and the associated parameters used, then train the model to find the desired parameters in any other condition. Then we could open the “black box” and try to read the feature extraction inside. This could work, but the problem is that we would have oriented the training by doing the labelling. Thus we might have introduced a bias linked to the way we classified the data. During my years doing research, I was tracking biases linked to the protocol design and subject selection, or even the way we ran the experiment. Then I came to industry and at some point started to track results. Lately, I met Aurélie Jean and read her book “De l’autre côté du miroir” (“The other side of the mirror”). It was a good reminder of how much space biases can take up in data analysis, and that both in research and in industry we should focus more on data cleaning, balancing and understanding. Moreover, thinking about it, the original problem was less about finding the proper values of the parameters than about finding the proper parameters to use in order to generalize a solution.
To do so, we were too absorbed in the imperfect solution we had already found; we had blinkers on, and even though we knew it, it was really hard to move in any other direction without being helped by some weirdo “friend” who thinks differently from us. This friend is called unsupervised learning. Using this kind of learning, one can train an algorithm to find differences in our data, but with no a priori. Thus the model will extract features in a way even (and I would say especially) a very specialized data scientist wouldn’t have. This way, untangling the black box of feature extraction can have the same effect as a coffee conversation with one of your colleagues: it opens your eyes to a new scope of possibilities. The difference here is that the machine will bring a possible solution but with no explanation, and we, as human beings, need to understand the reasoning behind the theory. It is still up to us to build the story around the data, and the data scientist’s expertise should help explain what we were not able to see at first.
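As a minimal illustration of this idea (the recordings are synthetic, and the method, a plain PCA, is about the simplest unsupervised feature extractor there is), here is how an algorithm with no labels can split data into groups it was never told about:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100, endpoint=False)

# 40 hypothetical unlabelled recordings: half oscillate slowly, half fast.
slow = np.array([rng.uniform(0.8, 1.2) * np.sin(2 * np.pi * 2 * t) for _ in range(20)])
fast = np.array([rng.uniform(0.8, 1.2) * np.sin(2 * np.pi * 9 * t) for _ in range(20)])
X = np.vstack([slow, fast]) + rng.normal(0.0, 0.1, (40, 100))

# PCA: keep the directions of highest variance, with no a priori labels.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
features = Xc @ vt[:2].T           # each 100-point series becomes 2 numbers

# The first extracted feature alone splits the two behaviours:
# one group is all positive, the other all negative (the sign is arbitrary).
print(np.sign(features[:20, 0]).mean(), np.sign(features[20:, 0]).mean())
```

The point is not PCA itself but the posture: the features were found with no labels, so they are free of the biases our labelling would have baked in, and it is then our job to build the explanation around them.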
There is, in my opinion of course, a beautiful way to use deep learning. It pushes our creativity and helps us open our minds to a new way of solving problems by eliminating our a priori (biases) about the data. In this example, opening the “black box” of deep learning was the purpose of the exercise, but most of the time, the result doesn’t lie in the feature extraction but in the actual output of the model. One should then question the input data, the model’s architecture and the result. Lastly, one should look at what’s inside the model in order to understand the reasoning behind the solution and not apply it blindly.
Just a quick installation procedure for TensorFlow 2 (tf2).
If you’re like me and still on Ubuntu 16.04 with python 3.5, you might have experienced that a simple pip install does not work properly:
pip install tensorflow==2.0.0-alpha0
ends with a:
2.0.0-alpha0 not found
Before you start yelling at Google and crying at your computer, just relax and read what follows. Indeed, tf2 is available through pip only if you run python 3.7, so if you have an older version of python, you’re stuck…
Well, of course not: there is a very simple way to install tf2 alpha0.
- Create a virtual environment with python 3.6:
virtualenv -p /usr/bin/python3.6 venv
- Activate it:
source venv/bin/activate
- Install tensorflow:
pip install /home/mycomputer/Downloads/tensorflow_gpu-2.0.0-cp3.X-cp3.Xm-manylinux1_x86_64.whl
Great! Now you’re all set to work on TensorFlow 2 in a virtualenv.
Don’t forget to get out of your environment once done:
deactivate
More on virtual environments.
More on tf2:
This article aims to give an overview of what a neural network is in the context of computing. As a computer obviously contains no neurons, the goal is to demystify this concept. Since a lot of artificial intelligence (AI) and machine learning (ML) concepts are largely inspired by biology, there will first be a quick introduction to what a neuron is and how it works. Then we will define what perceptron neurons and sigmoid neurons are.
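As a tiny preview of the two neuron types mentioned above (the weights and inputs are arbitrary choices of mine), here they are side by side:

```python
import math

def perceptron(inputs, weights, bias):
    """Perceptron neuron: a hard threshold, output is 0 or 1."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def sigmoid_neuron(inputs, weights, bias):
    """Sigmoid neuron: same weighted sum, but a smooth output in (0, 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Same weights, slightly different inputs: the perceptron jumps from 0 to 1,
# while the sigmoid moves gradually, which is what makes it trainable.
w, b = [2.0, 2.0], -3.0
print(perceptron([1.0, 0.4], w, b), perceptron([1.0, 0.6], w, b))  # → 0 1
print(round(sigmoid_neuron([1.0, 0.4], w, b), 2),
      round(sigmoid_neuron([1.0, 0.6], w, b), 2))                  # → 0.45 0.55
```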
I am referencing here the websites I visit really often in my working (and geeking) journey. I imagine that you probably know most of them, but some others might be new to you. I do not reference some “obvious” resources such as stackoverflow (oops, just did it) or developer websites such as Apache, Tensorflow and so on.
Artificial intelligence (AI) is the ability of machines to reproduce human or animal capacities such as problem-solving. One of AI’s subdomains is machine learning (ML), whose goal is to make computers learn business rules without business knowledge, just by giving the computer the data. Among ML methods, deep learning (DL) is based on data representation. The philosophy behind this concept is to mimic the brain’s processing pattern in order to define the relationship between stimuli. This has led to a layered organisation of algorithms, particularly efficient in the computer vision field. In some cases, these algorithms are able to surpass human abilities.