Adventures of a Data Detective
Tanel Peet, Data Detective at Proekspert
How frog sounds, birdsong, and other real-life exploits of a data scientist resulted in crafting a tool called Kogu.
The data detective has the superpower to predict the future
Being a data scientist often feels like being a police detective who needs to find out how and why something happened. The difference is that the data detective’s job can have a more significant impact: the data detective has the superpower to predict the future.
The coolest job in the world
The combination of challenging problems, innovation and the potentially significant impact of the work makes detecting data one of the coolest jobs in the world. The awesomeness of being a data detective also wakes the Instant Gratification Monkey, which is rewarded whenever creative, innovative work pays off in a discovery or an improvement in model accuracy.
For hobby projects done for your own gratification, the hunger for quick wins might not be a problem.
However, if you are working in a team or have a long-term project on your hands, the quick high you receive from discoveries can cause issues for the Future Data Detective. Like the universe moving towards chaos and disorder, so will your data science projects if you are not willing to put in some additional effort.
Frog sounds and the problem of reproducibility
The first data science project I worked on was to assist a colleague in classifying frog sounds using Deep Learning. My initial detective work went rather well and resulted in a conference paper. I was eager to earn more doses of gratification that the project was offering me, yet this resulted in messy structure and code.
It did not seem to be a problem: it was a short solo project, so I could vividly remember all the experiments I had run and the hyper-parameter sets I had tried.
This Present Data Detective was happy.
Fast-forward three months and I am happily working on a new project, with my first credited scientific publication behind me. I received an email asking me to share the code for the published paper.
I had forsaken best practices for instant gratification
I located the folder where I kept my code, ran the algorithm, and produced results which weren’t as good as those claimed in the conference paper. It took me half an hour to go through the code and understand why it behaved the way it did. I was furious at the Past Data Detective in me for not using a version control system (VCS) and not writing Clean Code.
It turned out that Past Data Detective had introduced changes to the code for testing a couple of new ideas and getting another fix of instant gratification after submitting the paper. As the process of reviewing and publishing the paper took several months, I had moved on and forgotten the changes I had introduced. I had forsaken best practices for instant gratification, leaving the Present Data Detective in a sticky situation.
It took me several days before I could restore the code so that it corresponded to the structure and results I described in the paper.
Birdsongs and learning from failure
My next adventure was to classify birds by their songs, which was a much harder task. However, this time I was prepared: I wrote cleaner code, my project had better structure, and I used VCS more often. It took more effort on my part, and a fight to resist the Instant Gratification Monkey, but I still ran into unexpected trouble.
There was a long summer between the end of my internship and the time I started to write my thesis. At the summer’s end, I could easily find the code, and I was much happier at how readable it was.
However, there was a problem with the way I logged the results of my various experiments. I used Google Spreadsheets to store some values about pre-processing and training parameters, but I also used free-form text comments.
The problem was that the experiments were not connected to the code, and for most of the experiments, I could not recall what the Past Data Detective’s comments meant. I was not sure what I had already tried, resulting in several days of computing power gone to waste (for example, one training took around eight hours).
Since I had learned from past failures, I wrote a script that would save the logs and all necessary files, including the code files used for the experiment, to a new directory. After completing the training, I then opened a log file, obtained the results and put these in the spreadsheet.
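A snapshot script of this kind can be quite small. The following is a minimal sketch, not the script I actually used: the function names, the `runs` directory, and the `params.json` and `results.log` file names are all assumptions made for illustration.

```python
import json
import shutil
import time
from pathlib import Path


def snapshot_experiment(code_files, params, runs_dir="runs"):
    """Copy the code files and parameters of one experiment into a new directory.

    Hypothetical helper: the directory layout and file names are illustrative.
    """
    # Timestamped directory name; a numeric suffix avoids collisions
    # when two experiments start within the same second.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    run_dir = Path(runs_dir) / stamp
    suffix = 1
    while run_dir.exists():
        run_dir = Path(runs_dir) / f"{stamp}-{suffix}"
        suffix += 1

    code_dir = run_dir / "code"
    code_dir.mkdir(parents=True)

    # Freeze the exact code that produced this experiment.
    for source in code_files:
        shutil.copy2(source, code_dir / Path(source).name)

    # Store the parameters next to the code so the two cannot drift apart.
    (run_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    return run_dir


def log_result(run_dir, name, value):
    """Append one named result to the experiment's log file."""
    with open(Path(run_dir) / "results.log", "a") as log_file:
        log_file.write(f"{name}={value}\n")
```

Calling `snapshot_experiment(["train.py"], {"learning_rate": 0.01})` before training and `log_result(run_dir, "accuracy", accuracy)` afterwards keeps code, parameters and results in one place. Copying the results into a spreadsheet remains a manual step, which is exactly the kind of chore that invites automation.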
When I began working for Proekspert, I discovered I was not alone in my problem.
Other data scientists were also conflicted between obtaining fast results and having a manageable project. Together, we decided to start automating the annoying parts of our detective work, so we could concentrate on the results that mattered to us.
This drive towards reducing energy and time spent on gaining order in data science projects has resulted in a tool we now call Kogu.
Kogu is born
Kogu, an Estonian word translating as “whole” or “entire” in English, is a tool for managing data science experiments. It gives every project a standard structure, so users can quickly find the relevant data, reports, source code, and figures.
Kogu manages multiple versions of code and links source code with experiments, which makes the experiments reproducible. The results of each experiment can be logged together with metadata such as interactive figures, tags and comments. This helps users determine what the Past Data Detective did, and it can also be used to check what fellow data detectives are doing at present.
On its surface, Kogu is merely another piece of software, but the philosophy behind it is much more visionary. It introduces best practice and processes to fellow data detectives, so we can work on things that matter whilst keeping projects sustainable.
Believing in Kogu is believing in:
Focus on execution and delivery, not on the past. Kogu logs the environment, data, script, parameters, and outputs so that you can more easily find, compare and reproduce experiments.
For data scientists and for managers. Kogu is your single point of truth where you can compare experiments and share them with your team.
A unified project structure combined with logging of experiments will not replace you, but it will make it easier for others to make sense of your work when you are gone.
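To make the idea of logging the environment, data, script, parameters, and outputs concrete, here is a minimal sketch in plain Python. It is not Kogu's API: the function names, the record layout, and the `experiments.jsonl` file name are assumptions chosen for illustration.

```python
import hashlib
import json
import platform
import sys


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(script, data_file, params, outputs):
    """Describe one experiment: environment, script, data, parameters, outputs."""
    return {
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        # Hashes tie the record to the exact script and data that were used.
        "script": {"path": str(script), "sha256": file_sha256(script)},
        "data": {"path": str(data_file), "sha256": file_sha256(data_file)},
        "parameters": params,
        "outputs": outputs,
    }


def append_record(record, log_path="experiments.jsonl"):
    """Append the record as one JSON line to a shared experiment log."""
    with open(log_path, "a") as log_file:
        log_file.write(json.dumps(record, sort_keys=True) + "\n")


def same_inputs(first, second):
    """True when two records used identical script, data and parameters."""
    return (
        first["script"]["sha256"] == second["script"]["sha256"]
        and first["data"]["sha256"] == second["data"]["sha256"]
        and first["parameters"] == second["parameters"]
    )
```

With records like these, `same_inputs` answers the question that cost me days of computing power: has this exact combination of code, data and parameters already been tried?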
If you share my vision of organising your data experiments with minimal overhead, collaborate with us by joining our beta community.