Machine Learning Reproducibility: A Kaggle Competition Use-Case
Even though reproducibility in Machine Learning is a theme people hear about now and then, it is still practiced only to a certain degree. Even among Kaggle competition winners, we still see a lot of hard-to-reproduce code in Notebooks. Our goal here is to outline the main elements of reproducibility and how we tackled them in a recent competition.
First, what does reproducibility mean in Machine Learning? During a Machine Learning project, we have to deal with several things. Unlike in Software Engineering, the code is not our only final artifact. We also have to deal with a dataset and the transformations applied to it, and we run a ton of experiments before arriving at our final model. In the end, our model is composed of its code and its weights (what the model learned), and it also depends on the data transformations being applied correctly. We need reproducibility in all of these steps; otherwise, it is much harder to get the same results and put our model into production.
In this blog post, we'll go over each of the parts you might need to change to improve your projects' reproducibility. At the end, we'll also give some examples from our participation in the MoA Challenge (a recent Kaggle Competition).
Seeding
The first and most basic thing, which you are probably already doing, is seeding everything. When training a Machine Learning model, there are many sources of randomness in play: we split the dataset randomly, initialize the weights randomly, present the batches in a random order, and so on. For that reason, to make an experiment reproducible, we need to set the seed for every stochastic operation. In general, you should be seeding Python's random module, NumPy, PyTorch, TensorFlow, and whatever other libraries you use, as in the sketch below.
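As a rough sketch, a single helper like the one below can be called at the top of every experiment. It assumes a NumPy/PyTorch stack; if you use TensorFlow, add tf.random.set_seed as well.

```python
# Minimal seeding helper (assumes a NumPy/PyTorch stack).
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness so runs are repeatable."""
    random.seed(seed)                 # Python's built-in RNG (shuffling, sampling)
    np.random.seed(seed)              # NumPy (data splits, array ops)
    torch.manual_seed(seed)           # PyTorch on CPU (weight init, dropout)
    torch.cuda.manual_seed_all(seed)  # PyTorch on every GPU
    # Trade a bit of speed for deterministic CUDA kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything(42)
```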
Data Processing
When dealing with a dataset, we will commonly transform it in many ways, including cleaning, selecting features, engineering new features, and so on. We have to create code for all these tasks, and we need to keep track of such code. After all, once your model goes into production, you need to reproduce the same steps to transform the input data, or your model output won’t make any sense.
We commonly see Data Scientists creating a Jupyter Notebook to preprocess their data, running it once, and then working with the output. Sometimes, they make a single Notebook containing both the code to transform the data and the code to train the model. Even though their work might be reproducible in a way (you can rerun the Notebook and get the same output), it is far from ideal. When experimenting with different data preprocessing, we change the code a lot. What if you achieved your best model with a previous version of the preprocessing pipeline? If you don't keep track of it, you may have lost that code. And keeping track of Notebook changes is painful.
Another problem with Notebooks arises when you want to reproduce your pipeline in production. People don't usually organize their code very well in Notebooks. They don't create reusable functions or classes. Instead, you get many cells transforming the data, and now you need to reproduce those steps precisely in production, which is nearly impossible without a lot of costly refactoring.
So, to increase reproducibility, you are highly encouraged to (i) put your data transformation code inside functions and (ii) keep track of your code changes using Git. You can still use Notebooks for your experiments (even though I don't advise it), but your Notebooks will at least be importing modules and reusing code. To make (i) even better, you should parameterize the data transformations and make them picklable. By parameterizing them, you can run different data transformations during the experiments without changing code. And by making them picklable, you can export your data transformations together with your model at the end, making it much easier to put everything into production. The well-known scikit-learn library already provides an API for that: Pipeline.
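Here is a small sketch of that idea. The function name, steps, and parameters are illustrative, not the actual MoA code; the point is that the whole transformation is built from parameters and comes out as a picklable scikit-learn Pipeline.

```python
# Illustrative sketch: a parameterized, picklable preprocessing pipeline.
import joblib
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_preprocessing(n_components: int = 50, with_scaling: bool = True) -> Pipeline:
    """Build the data transformation from parameters instead of hard-coded cells."""
    steps = []
    if with_scaling:
        steps.append(("scale", StandardScaler()))
    steps.append(("reduce", PCA(n_components=n_components)))
    return Pipeline(steps)


# Different experiments just pass different parameters...
preprocessing = build_preprocessing(n_components=100)

# ...and the fitted pipeline can be pickled together with the model:
# preprocessing.fit(X_train)
# joblib.dump(preprocessing, "preprocessing.pkl")
```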
Experiments
When training a model, we run a lot of experiments. During this process, we change not only hyperparameters but also a lot of code. It might be tempting to do all of that in a Notebook (as many people do), but again, it hurts reproducibility. When you are exploring new settings, you often discover that a previous experiment was your best one. To come back to it, you need the hyperparameters used and the exact version of the code, so you have to keep track of both, and you can't do that with Notebooks alone.
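Even without a dedicated tool, the minimum you need is a record of each run: the hyperparameters, the exact Git commit, and the resulting metrics. The hand-rolled sketch below illustrates the idea (names and values are made up); dedicated libraries, like the one we describe later, do this more thoroughly.

```python
# Hand-rolled experiment log: one JSON line per run with params, code version, metrics.
import json
import subprocess
from datetime import datetime, timezone


def log_experiment(params: dict, metrics: dict, path: str = "experiments.jsonl") -> None:
    """Append one experiment record (hyperparameters + Git commit + metrics)."""
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_commit": commit,  # the exact code version used for this run
        "params": params,
        "metrics": metrics,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")


# Example usage after a training run (values are illustrative):
log_experiment(
    params={"n_components": 100, "learning_rate": 1e-3, "epochs": 25},
    metrics={"val_log_loss": 0.0162},
)
```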
MoA Challenge
The MoA Challenge, which finished recently, was a three-month-long competition. It required a lot of data processing and many experiments. And since it was a Code Competition, participants couldn't generate the submission file on their own machines. Instead, they had to submit code that, without internet access, could be run against the test dataset on Kaggle's servers to generate the submission. That is not nearly as demanding as putting a model into production, but it raises the reproducibility requirements compared to regular competitions.
Still, at least in public Notebooks, we saw the usual spaghetti code processing the data on the fly and training a model. Since Kaggle allows you to submit such a Notebook, which trains the model and generates the submission, it works. But in most competitions, the winners use ensembles of many different models. Each of these models might require completely different data preprocessing, and running all of them together gets harder. Because of that, participants who wanted to compete seriously had to package their models and data processing steps to run on the Kaggle server.
Beyond that, as the experiments evolve, we keep changing the code of both the model and the data processing, and each experiment has different hyperparameters. In such a long competition, having some way to keep track of the experiments is very important.
To tackle all of these problems, we used two of our open source projects, the Pipeline API from scikit-learn, and Software Engineering good practices. First, we organized a Python project, splitting the code into several modules. To manage our experiments, we separated them into modular steps and used our Stripping library to organize the pipeline. We had a parameterized data processing step that decided which preprocessing should be applied (and with what parameters). The result was a scikit-learn Pipeline, which is picklable and can run smoothly on the Kaggle server.
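To give an idea of what that looks like on the Kaggle side, here is a sketch of an offline submission Notebook, assuming the fitted pipeline and model were pickled locally and uploaded as a Kaggle dataset. The paths and file names are illustrative, not our actual competition code.

```python
# Illustrative offline inference on the Kaggle server: no internet access,
# just load the pickled artifacts and reuse the exact training-time transformations.
import joblib
import pandas as pd

# Artifacts exported from the local experiments and uploaded as a Kaggle dataset
preprocessing = joblib.load("/kaggle/input/our-artifacts/preprocessing.pkl")
model = joblib.load("/kaggle/input/our-artifacts/model.pkl")

# Apply exactly the same transformations used during training, then predict
test = pd.read_csv("/kaggle/input/competition-data/test_features.csv")
X_test = preprocessing.transform(test)

submission = pd.DataFrame(model.predict(X_test))
submission.to_csv("submission.csv", index=False)
```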
Finally, to keep track of the experiments, we used another library of ours: Aurum. It builds on Git to keep track of everything: dataset versions, code, hyperparameters, and metrics. This way, we can quickly and precisely reproduce any previous experiment.
Conclusion
Even though we used a Kaggle competition as the example here, it is worth noting that these good practices brought our code much closer to production quality. If we wanted to put such a model into production, we wouldn't need to refactor the code, and the whole process would be smooth.