In recent years, the term Artificial Intelligence has gained traction, and with it professions such as Data Scientist and Machine Learning Engineer have emerged. Knowing and applying machine learning is attractive and appears to be a path to success. However, this path can be rocky and especially discouraging for those who are just starting out. Over the years working as a Data Scientist and Machine Learning Researcher, I have witnessed several common mistakes that made life difficult for those starting in the field, myself included. If you don’t want to waste your time and motivation while building machine learning models, here’s how to avoid these 7 mistakes.
Not spending enough time exploring your data
Don’t expect your model to work miracles. The model needs to be well built, it needs to be targeted, and often it also needs you to make things easier for it. For that, you need to analyze your data first. I know that many people do not explore at all and simply jump to the stage of creating the model. If you are one of them, I suggest you start over from scratch. But assuming you already know the importance of exploring the data, this tip is for you: exploring never hurts.
This is the stage where you come to understand your variables and how they correlate, discover the distribution of your data, and anticipate the distribution of the data to come. You will better understand the patterns within your data, detect outliers or anomalies and treat them, and remove the garbage that would only hinder the next steps. With all this understanding you are able to create new features, and believe me, most machine learning models benefit from well-chosen features.
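To make this concrete, here is a minimal sketch of a first exploration pass using pandas. The dataset, the column names (`area_m2`, `rooms`, `price`) and the 1.5 × IQR outlier rule are my own assumptions for illustration; your data will call for its own checks.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; replace it with your own DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "area_m2": rng.normal(80, 15, 200),
    "rooms": rng.integers(1, 6, 200),  # values from 1 to 5
})
df["price"] = 1500 * df["area_m2"] + 5000 * df["rooms"] + rng.normal(0, 10_000, 200)
df.loc[0, "price"] = 5_000_000  # inject one obvious outlier

# 1. Distribution of each variable (count, mean, std, quartiles, min/max).
summary = df.describe()

# 2. How the variables correlate with each other.
corr = df.corr()

# 3. Flag outliers with the 1.5 * IQR rule (one common heuristic, not the only one).
def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

mask = iqr_outliers(df["price"])
clean = df[~mask].copy()

# 4. Create a new feature from what you learned about the data.
clean["area_per_room"] = clean["area_m2"] / clean["rooms"]

print(summary)
print(corr)
print(f"Removed {mask.sum()} outlier row(s)")
```

Whether to drop, cap or keep an outlier depends on the problem, so treat the filtering step as something to decide case by case rather than a default.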
Underestimating the theory behind the models
Don’t get me wrong, I’m not saying you need a PhD to create machine learning models. However, it makes no sense to apply an algorithm without knowing what it was designed for, and machine learning and deep learning are no exception. Knowing whether the model you are choosing suits your data can save you a lot of time and possibly deliver better results. Even when using AutoML, it is important to know what each search parameter of the models means. So how about reading a little first?
Focusing entirely on theory
While it is important to study how the models work, it is just as important to mix theory with practice. Machine Learning is just one subfield of Artificial Intelligence, and yet there are already far too many things for a single person to study, some of them really complicated, I must say. If you focus too much on the theory, you’ll soon find yourself frustrated at not being able to apply any of it. The best way is to combine theory with practice: that way you’ll improve your coding skills together with your understanding of machine learning.
Underestimating the value of domain knowledge
Now you understand machine learning well and can also build powerful applications with it. You are able to solve any problem in the world. Said no one ever. Each problem has its own particularities, and knowing them is extremely valuable for getting the model to work properly. Do not underestimate the experience of people who have spent years studying or working in a certain field. They will have a lot to add and can even give you tips on how to better explore your data. And whenever possible, read articles on that specific problem, where you’ll find meaningful insights on how to make your model even better.
Not having a structured and organized approach to conduct experiments
That’s a tricky one. I think every Data Scientist has at some point been lost in their hundreds of experiments. I’ve been lost myself a few times. The point here is: try to be as organized as possible. Establish a methodology for building experiments and saving your results so that later on you will be able to replicate everything you have done. It is not uncommon for data scientists to use Jupyter Notebooks and sometimes not even bother to rename them, creating many untitled notebooks. In addition, the notebook itself makes it easy to accumulate garbage in the code, to execute cells in different orders, and even to keep in memory something that is no longer written anywhere in the notebook. After hundreds or even thousands of experiments (yes, it is common to run that many), it becomes impossible to keep track of everything you have done. This will not only make it difficult to identify what is improving your model or making it worse, but there is also a considerable chance that in the end you will not be able to replicate what you have done. And believe me, the frustration is huge!
With this in mind, and knowing that it is a problem all data scientists face and not just beginners, here at Amalgam we developed Aurum, which keeps track of all code and data changes and lets you easily reproduce any experiment as well as compare metrics across experiments. I suggest you take a look and spare yourself another headache.
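Whatever tool you use, the underlying habit is the same: record the parameters, the metrics and a fingerprint of the data for every run. The sketch below is not Aurum’s API; it is a bare-bones illustration using only the Python standard library, and the file name `experiments.jsonl` and the function names are hypothetical.

```python
import hashlib
import json
import time
from pathlib import Path

LOG_FILE = Path("experiments.jsonl")  # hypothetical location of the experiment log

def log_experiment(name, params, metrics, data_path=None, log_file=LOG_FILE):
    """Append one experiment record as a JSON line and return it."""
    data_hash = None
    if data_path is not None:
        # Hash the data file so a later run can verify it used the same data.
        data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    record = {
        "name": name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
        "data_sha256": data_hash,
    }
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def best_experiment(metric, log_file=LOG_FILE):
    """Return the logged record with the highest value for the given metric."""
    lines = Path(log_file).read_text(encoding="utf-8").splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    return max(records, key=lambda r: r["metrics"][metric])
```

A typical use would be to call `log_experiment("baseline", {"max_depth": 3}, {"accuracy": acc})` at the end of each run and `best_experiment("accuracy")` when comparing results. Committing your code before each run and storing the commit hash in the record is a natural extension.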
Performing different transformations during training and testing
I’ve seen this mistake happen many times with students and colleagues while I was teaching machine learning. It is important to keep in mind that everything, absolutely everything, you do to the training data must be done in the same way to the test data, and must also be added to the data processing pipeline that will run later when your model is in production. If you train your model on data with a certain distribution, processing and cleaning, and the same steps are not applied to the data it will predict on later, the model receives inputs in a format it was never trained on, and its predictions cannot be trusted.
If you are using a fitted transformation such as StandardScaler, make sure to save the fitted transformer as well so that you can apply exactly the same transformation again. Otherwise your future data will be scaled differently, and that will hurt the model.
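As a minimal sketch of this with scikit-learn, the scaler is fitted on the training data only, reused (not refitted) on the test data, and saved to disk for later use. The random data and the `scaler.joblib` path are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix; replace it with your own data.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the SAME fitted transformation to the test data (transform, not fit_transform).
X_test_scaled = scaler.transform(X_test)

# Persist the fitted scaler so production code can reuse it.
scaler_path = Path(tempfile.gettempdir()) / "scaler.joblib"  # hypothetical location
joblib.dump(scaler, scaler_path)

# Later, e.g. in production: load it and transform new data the same way.
loaded_scaler = joblib.load(scaler_path)
X_new_scaled = loaded_scaler.transform(X_test)
```

Calling `fit_transform` on the test or production data instead would compute a new mean and standard deviation, which is exactly the mismatch described above.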
Not building a good validation set
Last but not least (in fact, I consider this the most important step when working with machine learning), build a good validation set, otherwise all your experiments will be of no use. Seriously, if you can’t validate your model well, then no matter what score you get on whatever metric you’ve chosen, it won’t be representative of the real world.
We generally set aside a small percentage of our data to validate the model. The first thing is to ensure that this set has a distribution of variables similar to the complete data set, so that the evaluation reflects as well as possible how the model generalizes across the range of your data. In datasets with imbalanced classes, it is also important to ensure that your validation set contains all classes in the same proportions as the full data. This is called a stratified split, and it ensures that every class is fairly represented in your evaluation.
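A stratified split can be sketched with scikit-learn’s `train_test_split` and its `stratify` argument. The synthetic data below, with 10% of samples in the minority class, is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 900 samples of class 0, 100 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 900 + [1] * 100)

# stratify=y keeps the class proportions the same in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("positive rate, full data: ", y.mean())
print("positive rate, training:  ", y_train.mean())
print("positive rate, validation:", y_val.mean())
```

Without `stratify`, a random split of a small or heavily imbalanced dataset can leave the validation set with too few (or even zero) examples of the rare class.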
Whenever possible, use cross-validation, a simple method that ensures there is no overlap between the training and test sets, and no overlap between the k test sets generated, which avoids biased evaluations. Also, watch out for leaks between training and test data. When there is a leak, your model will appear to perform very well, and you will not be able to tell whether it is overfitting. In reality, the model has memorized the training data, and because that information leaked into the test data, you don’t notice it. In the end, the model lacks generalization and will most likely perform poorly in a real scenario.
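Here is a minimal sketch of stratified k-fold cross-validation with scikit-learn. Putting the preprocessing inside a pipeline is one common way to avoid a frequent leak: the scaler is refitted on the training part of each fold and never sees that fold’s test data. The synthetic dataset, the choice of logistic regression and the balanced accuracy metric are my assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced binary classification problem.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(cv.split(X, y))

# Within each fold, training and test indices never overlap.
train_test_disjoint = all(
    len(np.intersect1d(train_idx, test_idx)) == 0 for train_idx, test_idx in folds
)

# Across folds, every sample appears in exactly one test set.
all_test = np.concatenate([test_idx for _, test_idx in folds])
test_folds_disjoint = len(np.unique(all_test)) == len(all_test) == len(X)

# Preprocessing lives inside the pipeline, so it is fitted per training fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

print("train/test disjoint:", train_test_disjoint)
print("test folds disjoint:", test_folds_disjoint)
print("balanced accuracy per fold:", scores)
print("mean +/- std:", scores.mean(), scores.std())
```

Scaling the whole dataset before splitting would let statistics from the test folds influence training, which is a subtle form of the leak described above.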
Building a machine learning model can be quite challenging and has some tricky parts. You will certainly face several other problems, but if you keep in mind how to avoid these 7 mistakes you should be fine, with more time available for other challenges and without losing motivation.