All posts by alonzomagaly

Data health and … delusion

For once, I’ll write about data but not in a technical way.

For once, I’ll write about something many notice without really caring about because, after all, this is research.

I’m not a typical computer scientist. I didn’t learn it by the book but by doing. My very first code was in R, because I couldn’t afford statistical software, and … I loved it. Then I moved to Matlab and discovered object-oriented programming thanks to a post-doc for whom computers held few to zero secrets. At that time (we’re in 2013), I was a student in neurosciences, and even using the “cd” command in a terminal made me feel like a pro. Big data was the buzzword on everybody’s lips. What I could not see yet was that in research labs, big data had already been a reality for quite a while. Terabytes of EEG recordings stored on hard drives (sometimes on CDs), microscopy images, videos and CSV files of behavioral experiments on humans, rodents … Most of the time, if you were not working in the same lab as the team recording those data, or in collaboration with them, there was no way to access them. But all this seemed normal: big data tools were in their early stages, and data scientist was not even a job title yet.

Then I finished my master’s and traveled the world a little. Coming back to France, I decided to become a developer. I started to watch some webinars and got interested in databases. I have always loved imagining ways to store things efficiently so that I could recover them without any effort. Like a lot of people, I first heard of tables and SQL. Then curiosity led me to NoSQL and column databases. I also discovered cloud computing and all the sharing and storing opportunities it offers. It seemed like a real universe of infinite possibilities was opening. By that time I was working for Capgemini in Toulouse on projects with the French space agency (CNES). Big data was really big (from tera, I jumped to peta) and neuroscience research conditions were far behind.

While coming back to neurons (artificial ones) by developing artificial intelligence systems at Elter, I started to be a database user. As you might be aware, AI needs big data. I then realised that despite the huge amount of data available, chasing it was not that easy. On the other hand, this was understandable: data is money, so it seemed normal for a company to struggle to access other companies’ data. I then learned how to generate data in a way that makes it usable for data-science projects. I also learned how to parse the web looking for pertinent data that I could re-use. Most of the time those data were coming from research labs. And every time, the data were stored in a different way: different file formats, different folder architectures… This led me to the idea of a common way of storing things. There are so many formats for images (JPEG, PNG, TIF, BIN…) and for tables (CSV, XLSX, TXT…). TensorFlow released a way to store data ready for learning usage with its TFRecord format. This is interesting when doing AI with TensorFlow, but of no use when doing plain data science or using the Caffe framework. Soon a series of conversion tools arose on my computer: tif2bin, csv2tfrec … One of my new mantras when starting a project was to define naming rules and file architectures with the clients or colleagues, so as not to waste time refactoring everything. In the end, despite the cloud and despite the databases, everything was still stored back on hard drives when the time came to use it for processing (unless you use cloud computing, but that is not always possible).

Then a worldwide episode: COVID-19. If one event should have played a key role in helping human beings work in cooperation, this is it. While working remotely (like millions of people on the planet), I started to look for initiatives and ways to help with my poor data skills. What I found was quite amazing. Hundreds of initiatives all over the world, and none of them seemed to be aware of the others. Thousands of data websites, with no real way to ensure their veracity (one of the magic Vs of big data). A seed was planted in between two neurons: why is no organization, no institution, devoted to world-critical data management?

Lately, I’ve been working on a personal project: creating an AI that helps divers recognize the fish they saw during their dive. In addition to its playful side, this project aims at helping researchers who wish to know more about fish populations in different parts of the globe. The goal is to add information about diving depth and water temperature and make it available in a formatted way, so it can be used efficiently by any research team in the world. As I now work in a research institution, and research is a more open world, I thought I could access more data: data hidden from companies but shared among the community in order to make progress. Indeed, there is competition for financing, but once the research is done and the paper published, the data should be made available. Here again, I was quite disappointed. The data is there, but the ways to make it available are hard and expensive. Researchers are not computer scientists, and asking them to share their data on a platform that requires programming skills is a waste of time. Despite the storage capacity, when data is stored, there is no worldwide agreement on the how. So each lab that manages to make some of its data available often publishes raw data with a README file explaining how to use them (when there is such a file). This file is often harder to read than a book in ancient Greek. No homogeneity in the format, and then hours and hours to download the data, understand them, realize that the information inside does not match what we need, and start all over again. The seed that was planted a few months ago then started to sprout. As we face more and more world issues (ocean temperature rise, deforestation, CO2 and now viruses), research topics led on one side of the planet are directly linked to those conducted on the other side. Data science tools and techniques are meant to build bridges between research teams, but data accessibility is too poor.

The idea behind this is the creation of an institution, a worldwide institution, which would be the guardian of data. When a research team publishes its paper, its only duty would be to give the data to this institution, which would store it in a homogeneous way and make it accessible only to non-profit usage and institutions. Any lab member could query its database and, within minutes, access data they might not even be aware of, along with the name of the lab that generated it. This could lead to new collaborations and avoid wasting time redoing things that were unsuccessful. The lines are still blurry around what such an institution should and shouldn’t be, and I’m nobody to draw them. Through these lines, I just wanted to sow a seed in some people’s minds. Maybe this will spread into more concrete action.

Thanks for reading; I’ll be very interested to read your comments on this.

Is remote the new office?

Lately, due to the COVID-19 crisis, remote working has been the norm for most companies. Lots of people are writing articles about what remote work is, giving tips about how to perform while remote, etc. On the other hand, others are writing about how to get back to work after being home for a while, how to reinvent interactions, and so on. In the middle of this, some are talking about re-inventing work. This article is more an essay about why we have already re-invented work, how to benefit from de-confinement to develop new working habits, and the impact of all this on working performance and on ecology. Let’s go.

During the ongoing crisis, we have been asked to work from home (when possible, of course). Thus we had to adapt, using a lot of tools in a new way or sometimes for the first time. Communication within teams has changed. Communication with clients too. More efforts have been made to listen to each other in order to make things work. More patience too, and logically more kindness. All this is the key to an efficient and constructive way of working. And if this remained stressful for some of us due to the lack of social bonds, is that a reason to go back to our regular offices and good old (bad?) habits?

Since we have been forced to adopt new habits, we have already done the most difficult part of the job. What if, instead of hurrying back to our grey offices under neon lights, we decided to keep working from home and enjoy more time with our kids and friends? What if, when we need to work surrounded by people, our company offered to pay for a co-working space? What if we reduced face-to-face meetings to the strict minimum?

The first point looks like what we live today, right? But with the freedom to go wherever you want during your more frequent free time. With freedom over your working hours, even if restricted by some meetings or availability for some colleagues. The key to this is the discipline you probably already acquired during confinement. This is the part of the path to a new way of working that we have already traveled.

The second point is interesting. Indeed, as human beings, we are social animals, of the kind that really needs to form social bonds on an everyday basis to keep its social order stable (the book Sapiens was an excellent choice to read during my stay at home). If social bonding is what you need, why have it only with your colleagues? You probably like them, for sure, but why not do what was efficient for you during your most brain-demanding period: university? During that period, students work together whether or not they come from the same field; they mostly need to be friends. From time to time, we had to do team work on specific topics (those strict-minimum meetings). The rest of the time, we were surrounded by people we loved and respected, who had no qualms about challenging our thoughts and helped us see the world differently. Working in co-working spaces can have this effect. You keep having the coffee-discussion Eureka effect, you surround yourself with people who don’t mind telling you you’re wrong (which is not necessarily the case at work) and who can help you come up with new breakthrough ideas. On the other hand, when you need a specific technical solution, you can always reach your colleagues over Skype, Slack or any other tool.

The last part of this reflection is about the consequences of all this. Let’s start with the professional advantages. You spend fewer hours in traffic, you’re less stressed, your work/family balance is better. You work surrounded by people you choose, making it easier to handle challenges or issues in your work. This leads to better performance for your company and a better daily life for you. Your company needs less building infrastructure, so it spends less money on it. You’re also healthier, so you spend fewer days on medical leave. Sounds good, right?

What about the other effect: the ecological one. First, fewer people on the road logically leads to less pollution in big cities. If you go to co-working places, it’s more likely that you’ll choose one close to your home that you can reach easily. Fewer buildings dedicated to office work. What to do with all those buildings? Some can be turned into co-working places, some others into social housing, and the remainder can be demolished to increase the proportion of parks and green spaces in cities; why not go further and create downtown kitchen gardens? We were able to see the impact of a two-month lockdown; imagine the impact we could have if we kept going the same way.

To conclude this article, just a few lines about the people who cannot work remotely. Those people too will benefit from others working remotely, at least through their daily commute, which will become significantly shorter. Maybe, if there is less traffic, they will consider taking a bike instead of a car (having a proper shower at work also helps a lot). This change can seem hard, trivial or pointless to some. In my opinion, humanity needs to take some lessons and reminders from Mother Nature sometimes, and we would be idiots not to listen to Her. Feel free to comment; I’d be glad to have your opinion on this.

Time-series N°2: How to collect and process them?

Before starting, I’d just like to say that this post is not a cookbook of the algorithms that allow you to handle time series data. I will cite some, but there are too many and listing them is not the purpose; this article is more a REX (Return on Experience) on what I faced and how to simply avoid some basic issues.

Repetition is everywhere. Despite this, it was not possible to repeat a temporal object before 1877 and the invention of the phonograph by Thomas Edison. Even the type of music called “modern classical”, which is explicitly made to avoid repetition, has some in it. Moreover, if you artificially add some repetitions to this music, humans find it more enjoyable (it has to be said that contemporary classical music is quite hard on the ear). The brain is primed to search for meaning and patterns in the world, and repetition conveys meaning. This has to be linked to how the brain works. Indeed, information in the brain comes from neurons that have an electrical activity. Recordings of this activity using EEG techniques and analyses (based on time series, of course) have shown the presence of brain waves. Developing this concept is not the purpose here (maybe in a future post), but the main point is that we are driven (our whole body, from heart to brain) by rhythmic activities. Thus it seems natural that we are sensitive to the rhythmic structure of what surrounds us.

Contemporary classical music (not much structure)

Coming back to our topic: when collecting time series data, one has to keep their characteristics in mind. Three parameters in particular are very important:

Resolution: number of data samples per unit of time.

Precision: certainty of the measurement at each time point.

Accuracy: relationship between the signal and the timing of the effect we want to predict.

Signals themselves are mainly characterized by their amplitude and frequency. On top of that, there are rules that have to be followed in order to capture enough data to render the desired effect. For that, there is a theorem from Messrs. Nyquist and Shannon, called the sampling theorem, which comes from the telecom world.

Nyquist did theoretical work on bandwidth requirements and realised that if he wanted to spot the relevant information in a signal, he needed to sample it at at least twice the frequency of the actual information he was looking for. This sampling theorem is fundamental for linking continuous and discrete signals.

In the picture below, one can see that sampling the signal at the orange dots does not capture the signal’s information: the sampling is at the signal’s frequency. Adding the green dots doesn’t change anything (orange + green = 1.5 times the frequency). But adding the yellow dots to the orange ones renders the signal’s complexity.

Illustration of Shannon Nyquist Sampling theorem.
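
This effect is easy to reproduce numerically. Below is a small sketch (the 10 Hz signal frequency and the sampling rates are made-up values for illustration): sampling a sine at its own frequency always lands on the same phase, while sampling well above twice its frequency captures the oscillation.

```python
import numpy as np

# A 10 Hz sine observed for one second.
f_signal = 10.0

def sample(fs, duration=1.0):
    """Sample the 10 Hz sine at rate fs (in Hz)."""
    t = np.arange(0.0, duration, 1.0 / fs)
    return np.sin(2 * np.pi * f_signal * t)

# Sampling exactly at the signal's frequency: every sample lands on the
# same phase, so the recording looks constant (like the orange dots).
at_signal_rate = sample(fs=10.0)
print(np.allclose(at_signal_rate, at_signal_rate[0]))  # True: oscillation lost

# Sampling well above the Nyquist rate (2 * 10 Hz) captures the oscillation.
above_nyquist = sample(fs=50.0)
print(above_nyquist.max() > 0.9)  # True: the peaks are visible again
```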

These physical characteristics of the signal have to be taken into account at the very beginning of the project, when choosing the sensor used to measure the data.

Let’s take an example. An ECG, or electrocardiogram, is a measurement of the voltage in the heart; it is the reflection of heart contraction. Let’s assume that your heart beats every second (which is pretty good). Considering what we just saw about the Nyquist theorem, you might think that sampling twice a second would be enough to catch the relevant information in the signal. If you think so, you’re wrong, and you’ll miss most of the relevant info. Indeed, an ECG is made of several peaks, each of which matters to a cardiologist trying to identify a pathology. Below is a picture of a heartbeat and the duration of each phase composing it.

ECG from a healthy heart

Looking at it, it seems the sampling rate shouldn’t be slower than 20 Hz: ten times the 2 Hz we were considering before, right? The sampling theorem addresses the “resolution” characteristic of the data.

Now that we have a sensor delivering the data at the proper frequency, we should consider how this sensor delivers the data. This can affect the precision of the signal and its accuracy.

Let’s consider precision first, re-using our heartbeat example. Looking at the picture again, we see that the difference between the highest and the lowest point is approximately 1 mV. Moreover, the smallest peak amplitude seems to be around 0.03 mV. Knowing this, the sensor needs to be precise enough to capture this effect.

Last but not least, the accuracy of the measurement. It is common for sensors to embed some pre-processing algorithms. Those algorithms can thus have an effect on the data, and even if one cannot avoid it, this has to be taken into account when analysing the data. This processing usually consists of different kinds of filtering (generally of the 50 Hz mains frequency, which can be considered noise) or normalization of the data. It is very important to look at the manufacturer’s manual in order to be aware of the algorithms used and to take their effects into account when processing and interpreting the results.

All this may seem quite trivial and logical, but if I took the time to talk about it, it is because, despite the small number of projects I have had as a data scientist, I have faced all these situations at least once.

Now that we know how the data are collected and everything is under control, it’s time to process them.

Despite all the efforts put into the collection phase, the signal is not pure and ready to be processed and analysed. Indeed, several things you cannot control can occur. When dealing with movements, it is almost certain that your sensors will register some unwanted movements. In this case, it is you and the people for whom you are doing the analyses who will have to assess the importance of these movements. Are they part of the real conditions that will make your model generalize better later, or are they parasites that will prevent your model from fitting in the first place? This question is quite tricky and has to be considered. Are the environmental alterations of the signal part of the relevant information? Do you need to filter them out? If yes, some deeper analyses have to be done, involving the Fourier transform among others, in order to find the different frequencies held by your signal and their power, and maybe then perform some filtering to attenuate the unwanted effect of one component of the signal.
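
As a minimal sketch of this kind of frequency analysis (the 2 Hz movement signal, the 50 Hz interference and the 200 Hz sampling rate are all invented for illustration), one can use NumPy’s FFT to spot the dominant components and then zero out the unwanted band:

```python
import numpy as np

fs = 200.0  # assumed sampling rate in Hz
t = np.arange(0.0, 2.0, 1.0 / fs)

# A 2 Hz movement signal polluted by 50 Hz mains interference.
signal = np.sin(2 * np.pi * 2.0 * t) + 0.5 * np.sin(2 * np.pi * 50.0 * t)

# Fourier transform: which frequencies carry power?
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
power = np.abs(spectrum)

# The two dominant components show up at 2 Hz and 50 Hz.
peaks = freqs[np.argsort(power)[-2:]]
print(sorted(peaks.tolist()))  # [2.0, 50.0]

# Keep only the band below 10 Hz and invert the transform: the 50 Hz
# interference is gone and the 2 Hz component is recovered.
spectrum[freqs > 10.0] = 0.0
filtered = np.fft.irfft(spectrum, n=signal.size)
print(np.abs(filtered - np.sin(2 * np.pi * 2.0 * t)).max() < 0.01)  # True
```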

Then maybe your signal doesn’t come from one sensor but from several. If you wish to monitor heartbeat at the same time as the subject’s activity, you might have 2, 3 or even more sensors. What then?

Assuming you’ve done everything we’ve already discussed while collecting and processing the data, there is another step: synchronization. This can be thought through at collection time to ease the post-processing. If not, there are still solutions to synchronize the data afterwards. Then there are several points to address. Indeed, if you’re processing the data to build a model that will make predictions, it is likely that the way you’ve collected the data is the way you will use your system (maybe in real time) to collect new data in order to make predictions. In both cases you should do the same post-processing on your data for learning and for inference; otherwise the model will not be able to perform properly. Nevertheless, there are things you can do in batch processing (understand: not real-time) that would be impossible in real time. Indeed, when synchronizing data for batch processing, you can rely on past and future data to get the value at a given time point. This is not possible for real-time data, where, usually, the last available value is kept until updated by a new one.

Table: timestamp | data1 | data2 | sync batch | sync real-time
Difference between batch and real-time synchronization

Looking at this table, it seems more obvious that the effect on the data is not negligible. By extension, it seems more obvious that a model trained on batch-synchronized data will underperform on real-time data.
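
A tiny sketch can make the difference concrete (the sensor names, timestamps and values are all invented): to align sensor B’s sparse samples with sensor A’s timestamps, a real-time system can only hold the last value received, while a batch job can interpolate between past and future samples.

```python
# Two sensors with different timestamps (in seconds). We want sensor B's
# value at each of sensor A's timestamps.
a_times = [0.0, 1.0, 2.0, 3.0]
b_samples = [(0.5, 10.0), (2.5, 30.0)]  # (timestamp, value)

def sync_realtime(t):
    """Real-time: only the last value received before t is available."""
    value = None
    for ts, v in b_samples:
        if ts <= t:
            value = v
    return value

def sync_batch(t):
    """Batch: past AND future samples are known, so we can interpolate."""
    for (t0, v0), (t1, v1) in zip(b_samples, b_samples[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return sync_realtime(t)  # outside the range: fall back to last value

print([sync_realtime(t) for t in a_times])  # [None, 10.0, 10.0, 30.0]
print([sync_batch(t) for t in a_times])     # [None, 15.0, 25.0, 30.0]
```

The two columns differ at every interior timestamp, which is exactly why a model trained on one kind of synchronization should not be fed the other.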

The take-home message here is that, as a data scientist, you will have to handle data coming from different sources. Even if some projects are close to each other, they all have their own characteristics. One essential point is to talk to the experts you are working with (generally your client) and to be sure to understand the physics behind the data and the purpose of what is asked of you. Which goal do you serve?

Hoping that you’ve survived this long post, feel free to share any insight about what you’ve encountered and your own experience. Next step? Trying to use all this in a machine that learns, either for prediction or for classification. 😉

Time-series N°1

This is the first of a series of 4 articles I am writing about time series. Why this topic? I come from neurosciences, and in this domain, signal processing of EEG data is a really interesting though difficult topic. Then, becoming a developer and now an AI engineer, I have been dealing with other kinds of time series data (from body sensors, for example) and realised that there was a lack of understanding of these data. I made (and will probably keep on making) mistakes due to this lack of understanding, and writing these articles is a way to share and keep track of what I have learned so far.

In today’s article we will define time series. Then, we will review some application domains, underlining the discreet yet essential role of this data type.

The importance of time series is prodigious. We are no longer, and haven’t been for a while, in a static world. Everything is moving, and it is moving faster than most of us can even think. This makes our brain quite bad at analysing the data coming from connected sources: too much ordered information to compute. This information can come from IoT devices, stock markets, active online surveys on websites, and so on. Fortunately, computing and machine learning methods are here to help us manipulate these data and give us insight into the information they carry.

Why is time series analysis important? Well, in life, what you are doing today can (and will) have an effect on what will happen tomorrow. As a data scientist, you should know that with great power come great responsibilities. You have the duty to create models that properly explain your data. How will the data you keep preciously on your hard drive (because you don’t trust the cloud) influence the future? You have the power to predict which event the flight of a butterfly will produce! More seriously, time series are all around us; analysing them involves understanding various aspects of the inherent nature of the world that surrounds us. Being able to manage them and get the best insight from them can really change our life for the best… or not.


So before digging more seriously into this world, time has come for some definitions, and the most important one is: time series. We keep talking about it and, even though this notion is known by a lot of people (probably including you), what if I asked you to define it?

Let’s do this together. Imagine a quantitative value: the gyrometer of that fancy sport watch you secretly want for Christmas. The value of its linear acceleration varies through time. Here we are: we’ve got time. What about series? If you’re familiar with the Python library pandas, a Series is a fancy kind of list. So, easy shortcut: a series is a list of values. A time series is therefore a list of a specific value varying through time. Easy!

What now? We are not done yet. This definition is very simplistic, so let’s go a little deeper.

First, this kind of data is supposed to be continuous; but to be honest, even if time is, we are not able to measure it, or any other value varying with it, in a continuous way. So what? We capture the value of a continuous quantity, the time series, at discrete moments in time. The tricky part here is that if these discrete moments are too far apart in time, we can miss useful information. On the contrary, if they are too close, we are likely to be overwhelmed with useless data. One can easily imagine that it is better to be in the second case than in the first. This is addressed by the Nyquist theorem.

Time series are, by definition, ordered. This means that the position of a given point in time is driven by the position of the points before it. This can be due to different components (which is why you will often hear about component analysis):

  • Trend component: it has no cycle, it is “just” increasing or decreasing. This is mainly found in stock market analysis.
  • Seasonal component: this is an easy one. Its value depends on the season, like wood prices for fireplaces or symphonic orchestra prices for New Year’s Eve concerts.
  • Cyclic component: the seasonal component is kind of cyclic, but here we are more likely to find data measured on a long time scale, such as stock crashes, epidemics and so on.
  • Unpredictable: these data or events are, by nature, stochastic. It is difficult (nearly impossible) to predict them.
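
To make the trend and seasonal components concrete, here is a small NumPy sketch (all numbers are invented): a centred moving average over exactly one seasonal period averages the cycle out and recovers the underlying trend.

```python
import numpy as np

# 48 "months" of synthetic data: a rising trend plus a 12-month cycle.
months = np.arange(48)
trend = 0.5 * months
seasonal = 3.0 * np.sin(2 * np.pi * months / 12)
series = trend + seasonal

# Averaging over exactly one period (12 samples) cancels the seasonal
# component, leaving an estimate of the trend.
window = 12
trend_estimate = np.convolve(series, np.ones(window) / window, mode="valid")

# Each averaged window covers months i..i+11, centred on month i + 5.5,
# where the true trend is 0.5 * (i + 5.5).
expected = 0.5 * (np.arange(trend_estimate.size) + 5.5)
print(np.allclose(trend_estimate, expected))  # True: seasonality averaged out
```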

These characteristics are closely linked to what we call the period of a signal. The period of a cyclic signal is the time it takes to complete a full cycle.

  • Amplitude: the maximum displacement from a mean value.
  • Frequency: this is 1/period. As we defined the period as the number of time units it takes to perform a cycle, the frequency is the number of cycles per unit of time. A good example of frequencies in daily life is sound: the notes composing more complex sounds have very specific frequencies.
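
These definitions are easy to check numerically; in this sketch the amplitude, period and sampling rate are arbitrary values chosen for illustration.

```python
import numpy as np

# A sine with amplitude 2 and period 0.4 s, sampled at 100 Hz.
fs = 100.0
period = 0.4
amplitude = 2.0
t = np.arange(0.0, 2.0, 1.0 / fs)
x = amplitude * np.sin(2 * np.pi * t / period)

# Amplitude: maximum displacement from the mean value (0 here).
print(round(float(x.max()), 6))  # 2.0

# Frequency = 1 / period: 2.5 cycles per second.
print(1.0 / period)  # 2.5
```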


Fig1: Illustration of some time-series characteristics.

There are some other characteristics of signals, such as wavelength, but I will address them when needed, so as not to bother you with extra information too soon. If you want to play with the different characteristics of a time series, I invite you to go there in order to measure the effect of each of them on the data.

Application domains


When talking about the medical field, lots of people agree that we are all different. But despite this, we are using the same pills, therapies, etc. What if AI or ML could adapt medicine to each one of us? Create the perfect pill to cure our headache considering our age, gender, way of life, medical history? We would end up with the perfect pill and the perfect cure recommendation (number of doses, for how many days, and so on).

The very first application of math for a medical purpose did not come from physicians but from an insurance company. Their goal: predict whether a customer would be likely to die in the year to come. Yes, this is not such a surprise, right? The surprise, on the other hand, comes from the time in which it took place: the 17th century. This innovation came from a man called John Graunt. He is considered the creator of the life table and thus the originator of demography.


Figure 2: John Graunt’s actuarial tables.

John Graunt’s actuarial tables were one of the first results of time series style thinking applied to medical questions. The image is in the public domain and taken from the Wikipedia article on John Graunt.

Nowadays, time series are not the most studied data in medicine. Indeed, studies are more focused on visual data to help medical teams detect cancer and so on. Plus, the lack of sharing and the difficulty of working as spread-out teams make it hard to aggregate a sufficient quantity of data. In this context, clinical studies remain the norm. However, some experiments mixing visual and time series data are being run. This is how an AI can predict the evolution of the blues more precisely than any practitioner. Lately, time series have been used as epidemiological predictors, and both local and international political decisions are being made to help this field develop. Nevertheless, we still have trouble anticipating the course of an epidemic.

The medical field in which time series are widely used for more than a century now is neurosciences.

Indeed, physicians and researchers discovered the electrical activity of the brain (citation), and a technology like EEG (electroencephalography) has been in use since the first quarter of the 20th century (source). It is no surprise that attempts to match mathematical models to the brain’s behaviour were soon made.

One of the problems is that, at first, EEG data came mainly from patients and were thus related to a disease. It has been important to measure brain electrical activity in a priori healthy people in order to compare, to understand that when a given function is lost, it is related to these anatomical and these electrical modifications, etc. Since then, loads of students have been asked to have their brain activity recorded while doing numerous tasks, in order to compare with patients. Nowadays, healthcare is benefiting from the digital revolution of the late 20th/early 21st century. With the advent of wearable sensors and smart electronic medical devices, we are entering an era where healthy adults are taking routine measurements, either automatically or with minimal manual input. The issue with a lot of devices is the precision of the data: one user might understate their weight a little and overstate their height. Plus, we cannot be sure that the device is worn by its actual owner, nor of the owner’s overall health if they decide not to share that they have diabetes or any other personal information. The medical field is no longer a physician’s world: several different actors are trying to forecast people’s biometric data with more or less ethics.


The Western world has been shaken by several crises. These crises have left scars on our banking system, and in order to predict such big changes, models are applied to help bankers make the right decision at the right moment. Early banking systems relying on data forecasting and analysis gave rise to economic indicators, most of which are still in use today. In the banking system, almost every decision relies on time series data management. One of the first approaches is data visualization. This technique helps human beings handle ordered data by transposing them into an unordered world. Indeed, while machines are very good at processing huge amounts of data stored in databases or unbeatable Excel files containing trillions of rows and thousands of columns, our brain is not meant for this. Our brain is made for images, sounds, touch; our body is analogue when we try to feed it binary data. Another approach is found in expert models, or models that can adapt to tendencies, evolve and give good insight. Wait a minute, isn’t that artificial intelligence?

Weather Forecasting

Human beings have always wished to predict (and why not act on) the weather. After being a philosophers’ affair in antiquity, weather forecasting came to be taken seriously by scientists, and today thousands of recording stations are spread all over the world in order to understand the phenomena driven by Mother Nature. Weather forecasting is nothing more than a time series prediction game. If the tools used at the very beginning of this field relied purely on complicated algorithms, the tendency has been to simplify them in the name of the economy principle. Then some machine learning was added to these expert algorithms in order to assemble the results “automatically” and make a good decision. Finally, today’s attempts are more focused on deep learning techniques. Beyond the fashion for deep learning, this has a real scientific interest, but let’s keep that for later.

I’ve said a lot about using time series to make predictions, but that is not the only thing we can do with them. Time series can be used to encrypt or decrypt, or to add noise to a signal. This is mainly used in signal management systems such as communications or espionage. Finally, time series data can be used simply to understand what is around us: trying to figure out the effect of one signal on another by simple statistical analysis of the properties of the data.


Being original with DeepLearning


Ever felt stuck on a problem?

Not a daily life problem but one about data. Maybe you’re a data scientist and your daily life problems are about data.

Often, when you struggle with this kind of problem, the hard part is expressing it to other people, and then the solution pops out like magic. Unfortunately, sometimes the discussion remains sterile and you can only keep reasoning the way you already were. This makes some sense, as people often surround themselves with others who think and reason like they do. It makes oneself feel more comfortable. The thing is, that is useless when one wants to think differently. So if you don’t have a weirdo friend around to get you out of the box, why not use a deep learning black box, and untangle it to find some creativity? Sounds a little blurry? Let me explain.

Once upon a time, a colleague of mine tried to solve a problem using a complex algorithm with several parameters. In fact, she solved the problem, but only for a specific case. Then she realised that in order to use her algorithm in various conditions, she would have to define a large number of values for different cases. Even then, she wouldn’t address some specific use-cases. She was struggling with a generalization problem. Another issue my friend had was that in some cases she was playing on some parameters of her algorithm, and in others she was playing on different ones.

Let’s take an example. Imagine you want to characterise the movement of a ball. Just a regular movement: let’s say I am playing with my colleague and we are throwing the ball at each other. To characterize it, the only parameter needed is velocity. This is what my colleague did. Now, imagine that I hadn’t told her that I played basketball for 15 years at a decent level, and I want to play a trick on her by modulating the ball’s velocity and adding some spin to it. Linear velocity will no longer be sufficient, and angular velocity may be needed. So in some cases, resolving the equation is “simple”; in others, it requires more expertise and time.

Let’s come back to our problem. Here there are not 2 but 12 parameters, and testing all the different parameter combinations and their proper values would take a lifetime. Then two ideas arose, and they both involve AI, but in very different ways.

The first one is quite classical. Indeed, if a human is able to find all the parameters for different conditions, why not an AI? Give the model time series data and the associated parameters used, then train the model to find the desired parameters in any other condition. Then we could open the “black box” and try to read the feature extraction inside. This could work, but the problem is that we’ve been orienting the training by doing the labelling. Thus we might have introduced a bias linked to the way we’ve been classifying the data. During my years in research, I tracked biases linked to the protocol design and subject selection, or even the way we were running the experiment. Then I came to industry and at some point started to track results. Lately, I met Aurélie Jean and read her book “De l’autre côté du miroir” (“The other side of the mirror”). It was a good reminder of how much room biases can take up in data analysis, and that both in research and industry we should be focusing more on data cleaning, balancing and understanding. Moreover, thinking about it, the original problem was less about finding the proper values of the parameters than about finding the proper parameters to use in order to generalize a solution.
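A toy sketch of this supervised idea, with everything illustrative: the synthetic generator, the linear “model” fitted by least squares, and the parameter names are my own stand-ins, not the actual algorithm from the story. Series are generated from known parameters, and a model is trained to recover those parameters from the raw series:

```python
import numpy as np

# Generate labelled training data: each series comes from known
# parameters (here simply a slope and an intercept), plus noise.
rng = np.random.default_rng(2)
n_series, length = 200, 50
t = np.arange(length)

params = rng.uniform(-1, 1, size=(n_series, 2))  # (slope, intercept)
series = params[:, :1] * t + params[:, 1:] + rng.normal(0, 0.1, (n_series, length))

# "Training": least-squares fit mapping raw series to their labels.
X = np.hstack([series, np.ones((n_series, 1))])
W, *_ = np.linalg.lstsq(X, params, rcond=None)

# "Inference" on a fresh series with known ground truth.
true = np.array([0.3, -0.5])
new = true[0] * t + true[1] + rng.normal(0, 0.1, length)
pred = np.append(new, 1.0) @ W
print(pred)
```

The catch described above applies even here: the model can only ever recover the parameters we chose to label with, so any bias in that choice is baked in from the start.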

To do so, we were too deep into the imperfect solution we had already found; we had blinkers on, and even though we knew it, it was really hard to move in any other direction unless helped by some weirdo “friend” thinking differently from us. This friend is called unsupervised learning. Using this kind of learning, one can train an algorithm to find differences in our data, but with no a priori. Thus the model will extract features in a way even (and I would say especially) a very specialized data scientist wouldn’t have. This way, untangling the black box of feature extraction can have the same effect as a coffee conversation with one of your colleagues: opening your eyes to a new scope of possibilities. The difference here is that the machine will bring a possible solution but with no explanation, and we, as human beings, need to understand the reasoning behind the theory. We are still free, though, to build the story around the data, and the data scientist’s expertise should help us explain what we were not able to see at first.

There is, in my opinion of course, a beautiful way to use deep learning. This way pushes our creativity and helps us open our minds to a new way of solving problems by eliminating our a priori (biases) about data. In this example, opening the “black box” of deep learning was the purpose of the exercise, but most of the time the result doesn’t lie in the feature extraction but in the actual output of the model. Then one should question oneself about the input data, the model’s architecture and the result. Lastly, one should look at what’s inside the model in order to understand the reasoning behind the solution and not apply it blindly.

Install Tensorflow2

Just a quick installation procedure for tensorflow2 (tf2).

If you’re like me and still on Ubuntu 16.04 with Python 3.5, you might have experienced that a simple pip install does not work properly:

pip install tensorflow==2.0.0-alpha0

ends up with a:

2.0.0-alpha0 not found

Before you start yelling at Google and crying at your computer, just relax and read what follows. Indeed, tf2 is available through pip only if you run Python 3.7, so if you have a version of Python under 3.7, you’re stuck…

But of course there is a very simple way to install tf2 alpha0.

    1. Go there: GPU, or there: No GPU.
    2. Download the version corresponding to your OS (Linux, of course) and your Python version.
    3. Now you have a TensorFlow wheel.
    4. In order to test it, I advise you to do so in a virtual environment. I personally use virtualenv:

virtualenv -p /usr/bin/python3.6 venv

source venv/bin/activate

    5. Install tensorflow:

pip install /home/mycomputer/Downloads/tensorflow_gpu-2.0.0-cp3.X-cp3.Xm-manylinux1_x86_64.whl

Great ! Now you’re all set to work on tensorflow2 in a virtualenv.

Don’t forget to get out of your environment once done:

deactivate

More on virtual environments.

More on tf2:

What a neural network really is?

This article aims at giving an overview of what a neural network is in the context of computing. Indeed, as a computer obviously contains no neurons, the goal is to demystify this concept. Since a lot of artificial intelligence (AI) and Machine Learning (ML) concepts are largely inspired by biology, there will first be a quick introduction to what a neuron is and how it works. Then we will define what perceptron neurons and sigmoid neurons are.
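As a tiny preview, the difference between the two neuron types fits in a few lines of numpy (the weights, bias and input below are arbitrary illustrative values):

```python
import numpy as np

# Shared ingredients: weights w, bias b, input x.
w, b = np.array([0.6, -0.4]), 0.1
x = np.array([1.0, 0.5])

# Perceptron neuron: hard threshold, output is exactly 0 or 1.
perceptron = int(w @ x + b > 0)

# Sigmoid neuron: smooth squashing, output anywhere in (0, 1).
sigmoid = 1 / (1 + np.exp(-(w @ x + b)))

print(perceptron, sigmoid)
```

The smoothness of the sigmoid is what makes gradient-based training possible, which the full article explains.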

Continue reading What a neural network really is?

My web bible

I am referencing here the websites that I visit really often in my working (and geeking) journey. I imagine that you probably know most of them, but some others might be new to you. I do not reference some “obvious” references such as stackoverflow (oops, just did it) or developer websites such as Apache, Tensorflow and so on.

Continue reading My web bible

An overview of Machine Learning frameworks

Artificial intelligence (AI) is the ability of machines to reproduce human or animal capacities such as problem-solving. One of AI’s subdomains is Machine Learning (ML), whose goal is to make computers learn business rules without explicit business knowledge, simply by giving the computer data. Among ML methods, Deep Learning (DL) is based on data representation. The philosophy behind this concept is to mimic the brain’s processing pattern in order to define the relationship between stimuli. This has led to a layered organisation of algorithms, particularly efficient in the field of computer vision. In some cases, these algorithms are able to surpass human abilities.

Continue reading An overview of Machine Learning frameworks