7 common mistakes of a machine learning beginner

Igor Muniz
Director of Artificial Intelligence

In recent years, the term Artificial Intelligence has gained traction, and with it professions such as Data Scientist and Machine Learning Engineer have emerged. Knowing and applying machine learning is attractive and appears to be a path to success. However, this path can be rough and especially discouraging for those who are just starting out. Over my years working as a Data Scientist and Machine Learning Researcher, I have witnessed several common mistakes that made life difficult for people starting in the field, myself included. If you don’t want to waste your time and motivation while building machine learning models, here’s how to avoid these 7 mistakes.

Don’t spend enough time exploring your data.

Don’t expect your model to work miracles. The model needs to be well built and targeted at the problem, and it often needs you to make things easier for it. For that, you need to analyze your data first. I know that many people do not explore at all and simply jump to the stage of creating the model. If you are one of them, I suggest you start over from scratch. And if you already know the importance of exploring the data, this tip is still for you: exploring never hurts.

This is the stage where you will understand your variables and how they correlate, discover the distribution of your data, and anticipate the distribution of future data. You will better understand the patterns within your data, detect outliers or anomalies and treat them, and remove the garbage that would only hinder the next steps. With all this understanding you are able to create new features and, believe me, most machine learning models benefit from well-designed, informative features.
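To make this stage concrete, here is a minimal sketch of a first exploration pass, assuming your data fits in a pandas DataFrame. The toy columns (`age`, `income`) and the helper names `summarize` and `iqr_outliers` are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, number of missing and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical toy data: one missing value and one suspicious income.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 38, 30, 44],
    "income": [2100.0, 3500.0, 4200.0, 2900.0, 5100.0, 3900.0, np.nan, 90000.0],
})

overview = summarize(df)                    # what is in each column
correlations = df.corr()                    # pairwise Pearson correlations
outlier_mask = iqr_outliers(df["income"])   # rows worth a closer look

print(overview)
print(correlations)
print(df[outlier_mask])
```

Whether a flagged row is an error or a legitimate extreme value is a judgment call, which is exactly why you look at the data before modeling.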

Underestimate the theory behind the models

Don’t get me wrong, I’m not saying you need a PhD to create machine learning models. However, it makes no sense to apply any algorithm without knowing what it was created for, and this holds for machine learning and deep learning alike. Knowing whether the model you are choosing suits your data can save you a lot of time and possibly deliver better results. Even when using an AutoML tool, it is important to know what each search parameter in the models means. So how about reading a little bit first?

Your whole focus is on theory

While it is important to study how the models work, it is just as important to mix theory with practice. Machine Learning is just a subfield of Artificial Intelligence, and yet there are already far too many things for a single person to study. And some really complicated things, I must say. If you focus too much on the theory, you’ll soon find yourself frustrated at not being able to apply any of it. The best way is to combine theory with practice: that way you’ll improve your coding skills together with your understanding of machine learning.

Underestimating the value of domain knowledge

Now you understand machine learning well and can also build powerful applications with it. You are able to solve any problem in the world. Said no one ever. Each problem has its particularities, and knowing them is extremely valuable for making the model work properly. Do not underestimate the experience of people who have spent years studying or working in a certain field: they will have a lot to add and can even give tips on how to better explore your data. And whenever possible, read articles on that specific problem; you’ll find meaningful insights on how to make your model even better.

Not having a structured and organized approach to conduct experiments

That’s a tricky one. I think that every Data Scientist has at some point been lost in their hundreds of experiments. I’ve lost myself a few times. The point here is: try to be as organized as possible. Establish a methodology for building experiments and saving your results so that later on you will be able to replicate everything you have done. It is not uncommon for data scientists to use Jupyter Notebooks and sometimes not even bother to rename them, creating many untitled notebooks. In addition, the notebook itself makes it easy to accumulate garbage in the code, execute cells out of order, and even keep in memory variables whose defining code you have already deleted. After hundreds or even thousands of experiments (yes, it is common to run this many), it will be impossible to keep track of everything you have done in a clean way. This will not only make it difficult to identify what is improving your model or making it worse, but there is also a considerable chance that in the end you will not be able to replicate what you have done. And believe me, the frustration is huge!

With this in mind, and knowing that this is a problem all data scientists face, not just beginners, here at Amalgam we developed Aurum, which keeps track of all code and data changes and lets you easily reproduce any experiment and compare metrics across experiments. I suggest you take a look and save yourself another headache.
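Whatever tool you pick, the core habit is the same: every run leaves a record of what was tried and what came out. Below is a hand-rolled sketch of that idea using only the Python standard library. It is not Aurum’s API; the function names `log_experiment` and `best_experiment`, the JSON Lines file, and the metric values are all hypothetical placeholders for illustration.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path


def log_experiment(log_path, params, metrics, data_path=None):
    """Append one experiment record (parameters, metrics, data hash) as a JSON line."""
    data_hash = None
    if data_path is not None:
        # Hashing the data file lets you detect later that the data changed.
        data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "metrics": metrics,
        "data_sha256": data_hash,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_experiment(log_path, metric):
    """Return the logged record with the highest value for the given metric."""
    with open(log_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    return max(records, key=lambda r: r["metrics"][metric])


# Usage with made-up numbers (placeholders, not real results).
log_file = Path(tempfile.mkdtemp()) / "experiments.jsonl"
log_experiment(log_file, {"model": "logreg", "C": 1.0, "seed": 0}, {"f1": 0.71})
log_experiment(log_file, {"model": "logreg", "C": 0.1, "seed": 0}, {"f1": 0.74})
best = best_experiment(log_file, "f1")
print(best["params"])
```

Recording the random seed and a hash of the data alongside the parameters is what makes a run replicable later, and an append-only log is hard to mess up by executing notebook cells out of order.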

Perform different transformations during training and testing

I’ve seen this mistake happen many times with students and colleagues while I was teaching machine learning. It is important to keep in mind that everything, absolutely everything, you do to the training data must be done in the same way to the test data, and must also be added to the data treatment pipeline that will run later when your model is in production. If you train your model on data with a given distribution, processing and cleaning, and the same steps are not applied to the data it will predict on later, your model will receive data in a format it was never trained on, and its predictions will certainly not be what you expect.

If you are using fitted transformations such as StandardScaler, make sure to save the fitted transformer as well so you can apply the same transformation again; otherwise your future data will be scaled differently, and this will hurt the model.
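Here is a minimal sketch of that workflow with scikit-learn’s StandardScaler, assuming joblib for persistence (one common choice; pickle would also work). The random data and the temporary file path are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder data: "training" data and data that arrives later.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 2))

# Fit on the training data only: the mean and variance are learned here.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Persist the fitted scaler next to the model.
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, scaler_path)

# Later (test time or production): load and transform, never refit.
loaded_scaler = joblib.load(scaler_path)
X_new_scaled = loaded_scaler.transform(X_new)
```

The thing to avoid is calling `fit` or `fit_transform` again on the new data: that would compute a different mean and variance, so the same raw value would map to a different scaled value than the one the model saw during training.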

Don’t make a good validation set

Last but not least (in fact, I consider this the main step when working with machine learning), build a good validation set; otherwise all your experiments will be of no use. Seriously, if you can’t validate your model well, then no matter what score you get on the metric you’ve chosen, it won’t be representative of the real world.

We generally set aside a small percentage of our data to validate our model. The first thing is to ensure that this set has a distribution of variables similar to the complete dataset, so that we evaluate the generalization of our model across as many of the possibilities in our data as we can. In datasets with unbalanced classes, it is also important to ensure that your validation set contains all classes in the same proportions as the full dataset. This is called a stratified split, and it ensures that your evaluation reflects every class, including the rare ones.
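A stratified split can be sketched with scikit-learn’s `train_test_split` and its `stratify` argument. The 90/10 class imbalance and the 20% validation size below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fraction of the minority class in each subset.
train_ratio = y_train.mean()
val_ratio = y_val.mean()
print(train_ratio, val_ratio)
```

Without `stratify`, a random 20% sample of such a dataset could easily end up with zero or one minority examples, and any metric computed on that class would be close to meaningless.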

Whenever possible, use cross-validation. In k-fold cross-validation the data is split into k folds, and each fold serves once as the test set while the model is trained on the remaining folds. Within each round there is no overlap between the training and test sets, and there is no overlap between the k test sets either, which avoids biased evaluations. Also, watch out for leaks between training and test data. When there is a leak in your data, you will not be able to assess whether your model is overfitting, because it will appear to perform very well. In fact, your model has memorized the training data, and since that information leaked into the test set, you don’t notice. In the end, the model lacks generalization and will most likely perform very poorly in a real scenario.
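The sketch below shows one way to run k-fold cross-validation in scikit-learn while guarding against a common source of leakage, preprocessing fitted on the full dataset. The synthetic data, the choice of logistic regression, k = 5 and the balanced accuracy metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced placeholder data.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Putting the scaler inside the pipeline means it is refit on the training
# folds of each round, so test-fold statistics never reach the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

# Sanity check: the k test folds are disjoint and together cover every sample.
test_folds = [test_idx for _, test_idx in cv.split(X, y)]
all_test = np.concatenate(test_folds)
no_overlap = len(np.unique(all_test)) == len(all_test) == len(X)

print(scores.mean(), scores.std(), no_overlap)
```

Scaling the whole dataset first and cross-validating afterwards looks almost identical in code, but it lets information from the test folds influence training, which is exactly the kind of leak that makes scores look better than they are.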

Conclusion

Building a machine learning model can be quite challenging and has some tricky parts. You will certainly face several other problems, but keep in mind how to avoid these 7 mistakes and you should be fine, with more time available for other challenges and without losing motivation.
