Data health and … delusion


For once, I’ll write about data but not in a technical way.

For once, I’ll write about something many people notice without really caring about, because this is research after all.

I’m not a typical computer scientist. I didn’t learn it by the book but by doing. My very first code was in R, because I couldn’t afford statistical software, and … I loved it. Then I moved to MATLAB and discovered object-oriented programming thanks to a post-doc for whom computers had few to zero secrets. At that time (we’re in 2013), I was a student in neuroscience, and even using the “cd” command in a terminal made me feel like a pro. Big data was the buzzword everybody started to have on the tip of their tongue. What I could not yet see was that in research labs, big data had already been a reality for quite a while. Terabytes of EEG recordings stored on hard drives (sometimes on CDs), microscopy images, videos and CSV files of behavioral experiments on humans, rodents … Most of the time, if you were not working in the same lab as the team recording those data, or in collaboration with them, there was no way to access them. But all this seemed normal: the big data tools were in their early stages and data scientist was not even a job title yet.

Then I finished my master’s and traveled the world a little. Coming back to France, I decided to become a developer. I started watching webinars and got interested in databases. I have always loved imagining ways to store things efficiently so that I can retrieve them without any effort. Like a lot of people, I first heard of tables and SQL. Then curiosity led me to NoSQL and column databases. I also discovered cloud computing and all the sharing and storage opportunities it offers. It seemed like a whole universe of infinite possibilities was opening up. At that time I was working for Capgemini in Toulouse on projects with the French space agency (CNES). Big data was really big (from tera, I jumped to peta) and neuroscience research conditions were far behind.

While coming back to neurons (artificial ones this time) by developing artificial intelligence systems at Elter, I became a database user. As you might be aware, AI needs big data. I then realised that despite the huge amount of data available, chasing it was not that easy. On the other hand, it was understandable: data is money, so it seemed normal for a company to struggle to access other companies’ data. I learned how to generate data in a way that would make it usable for data-science projects. I also learned how to crawl the web looking for relevant data that I could reuse. Most of the time that data came from research labs. And every time it was stored in a different way: different file formats, different folder architectures… This led me to the idea of a common way of storing things. There are so many formats for images (JPEG, PNG, TIFF, BIN…) and for tables (CSV, XLSX, TXT…). TensorFlow released a way to store data ready for training with its TFRecord format. This is interesting when doing AI with TensorFlow but of no use when doing plain data science or using the Caffe framework. Soon a series of conversion tools appeared on my computer: tif2bin, csv2tfrec… One of my new mantras when starting a project was to define naming rules and file architectures with the clients or colleagues, so as not to waste time refactoring everything. In the end, despite the cloud, despite the databases, everything was still stored back on hard drives when the time came to process it (unless you use cloud computing, but that is not always possible).
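For readers who have never met TFRecord: it is basically a container of serialized records, and converting a dataset means packing each sample into one of those records. A minimal sketch of what one of my csv2tfrec-style conversions boiled down to might look like this (the file names, feature keys and label value are illustrative assumptions, not any standard):

```python
# Minimal sketch: packing one image plus a label into a TFRecord file.
# "fish_001.jpg", "fish.tfrecord" and the feature keys are hypothetical.
import tensorflow as tf

def _bytes_feature(value: bytes) -> tf.train.Feature:
    # Wrap raw bytes (e.g. an encoded image) as a TFRecord feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value: int) -> tf.train.Feature:
    # Wrap an integer (e.g. a class index) as a TFRecord feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter("fish.tfrecord") as writer:
    with open("fish_001.jpg", "rb") as f:   # hypothetical input image
        image_bytes = f.read()
    example = tf.train.Example(features=tf.train.Features(feature={
        "image": _bytes_feature(image_bytes),
        "label": _int64_feature(3),          # e.g. a species index
    }))
    writer.write(example.SerializeToString())
```

It is a handy format when the whole pipeline is TensorFlow, but as soon as a colleague works with another framework, you are back to writing yet another converter.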

Then came a worldwide episode: COVID-19. If any event should have played a key role in helping human beings work in cooperation, it is this one. While working remotely (like millions of people on the planet), I started to look for initiatives and ways to help with my modest data skills. What I found was quite amazing. Hundreds of initiatives all over the world, and none of them seemed to be aware of the others. Thousands of data websites, with no real way to ensure their veracity (one of the magic Vs of big data). A seed was planted between two neurons: why is no organization, no institution, devoted to world-critical data management?

Lately, I’ve been working on a personal project: creating an AI that helps divers recognize the fish they saw during their dives. Besides its playful side, this project aims at helping researchers who wish to know more about fish populations in different parts of the globe. The goal is to add information about diving depth and water temperature and make it available in a formatted way, so it can be used efficiently by any research team in the world. As I am now working in a research institution, and research is a more open world, I thought I could access more data: the data hidden from companies but shared within the community in order to make progress. Indeed, there is competition for funding, but once the research is done and the paper published, the data should be made available. Here again, I was quite disappointed. The data is there, but the ways to make it available are hard and expensive. Researchers are not computer scientists, and asking them to share their data on a platform that requires programming skills is a waste of their time. Even when storage is available and the data is stored, there is no worldwide agreement on how. So each lab that manages to make some of its data available often puts up raw data with a README file explaining how to use it (when there is such a file). That file is often harder to read than a book in ancient Greek. There is no homogeneity in the formats, and so you spend hours and hours downloading the data, understanding it, realizing that the information inside does not match what you need, and starting all over again. The seed planted a few months ago then started to sprout. As we face more and more global issues (rising ocean temperatures, deforestation, CO2 and now viruses), research topics pursued on one side of the planet are directly linked to those conducted on the other side. Data-science tools and techniques are meant to build bridges between research teams, but data accessibility is too poor.

The idea behind all this is the creation of an institution, a worldwide institution that would be the guardian of data. When a research team publishes its paper, its only duty would be to hand the data to this institution, which would store it in a homogeneous way. The institution would make it accessible only to non-profit usage and institutions. Any lab member could query its database and, within minutes, access data they might not even be aware of, along with the name of the lab that generated it; this could lead to new collaborations and avoid wasting time redoing things that were unsuccessful. The lines are still blurry around what such an institution should and shouldn’t be, and I’m nobody to draw them. Through these lines, I just wanted to sow a seed in some people’s minds. Maybe it will grow into more concrete action.

Thanks for reading; I’ll be very interested to read your comments on this.

