Porto Seguro Challenge

Introduction:

Competition for marketing space is fierce. Companies that want even a slight advantage increasingly rely on AI to select the best customers and increase the ROI of their marketing campaigns. Our team at Amalgam.ai is developing solutions for this field.

The Challenge:

In this competition we were challenged to build a model that predicts the probability of a customer purchasing a product.

The metric chosen to measure the quality of the predictions was the F1 score, which is the harmonic mean of precision and recall.
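To make the metric concrete, here is a minimal sketch that computes precision, recall and F1 from binary labels. The function name `f1_score_binary` and the toy labels are ours for illustration only. They are not part of the competition code.

```python
def f1_score_binary(y_true, y_pred):
    """Return (precision, recall, f1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, made up for this example.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
print(precision, recall, f1)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, so a model cannot score well by optimizing only one of the two.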

The Dataset:

The dataset for this challenge had 70 columns, in the following order:

1. 1 target column;

2. 1 id column;

3. 68 columns with variables of different types.

All the columns of this dataset were anonymized for this challenge, which added an extra layer of complexity to the problem.

Our Approach:

Feature Engineering:

To tackle this problem we focused heavily on feature engineering. As a first step we tried to locate and analyze the variables that contributed the most to predicting the labels.

Once we had identified these variables, we generated new features from them by extracting statistics, creating relations with other features, and grouping with different sets of parameters.
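As a rough illustration of the kinds of features described above, the sketch below builds a group statistic, a deviation from that statistic, and a ratio between two columns. It uses only the standard library, and the column names (`var_a`, `var_b`, `var_c`) and values are hypothetical, since the real columns were anonymized. It is not the exact feature set we used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the anonymized dataset.
rows = [
    {"var_a": "x", "var_b": 10.0, "var_c": 2.0},
    {"var_a": "x", "var_b": 20.0, "var_c": 4.0},
    {"var_a": "y", "var_b": 30.0, "var_c": 5.0},
]


def add_group_features(rows, group_col, value_col):
    """Add the group mean of value_col and each row's deviation from it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    group_means = {key: mean(values) for key, values in groups.items()}

    mean_name = f"{value_col}_mean_by_{group_col}"
    for row in rows:
        row[mean_name] = group_means[row[group_col]]
        row[f"{value_col}_dev_from_group"] = row[value_col] - row[mean_name]
    return rows


def add_ratio_feature(rows, numerator, denominator):
    """Add a simple relation between two features (their ratio)."""
    for row in rows:
        denom = row[denominator]
        row[f"{numerator}_over_{denominator}"] = row[numerator] / denom if denom else 0.0
    return rows


rows = add_group_features(rows, "var_a", "var_b")
rows = add_ratio_feature(rows, "var_b", "var_c")
```

With a dataframe library the same grouping is usually a one-liner, but the idea is identical: summarize a strong variable within groups defined by another one, then let the model see both the summary and the row's position relative to it.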

Modeling:

Once we were satisfied with the feature engineering, we started the modeling part. This was done using classical machine learning models such as XGBoost, LightGBM and CatBoost.

One approach that was very effective was to initialize the same model with different seeds and average their predictions. This method produced very robust predictions in a simple manner.
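The seed-averaging idea can be sketched as follows. To keep the example self-contained, `ToyModel` is a made-up stand-in whose output depends on its seed. In practice the factory would return a LightGBM, XGBoost or CatBoost model instead. The names `average_over_seeds`, `make_model` and `predict_positive` are our own illustration, not a library API.

```python
import random


class ToyModel:
    """Stand-in for a real model: predicts the training base rate plus seed-dependent noise."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.base_rate = None

    def fit(self, X, y):
        self.base_rate = sum(y) / len(y)
        return self

    def predict_positive(self, X):
        # One positive-class probability per row, clipped to [0, 1].
        return [
            min(1.0, max(0.0, self.base_rate + self.rng.uniform(-0.1, 0.1)))
            for _ in X
        ]


def average_over_seeds(make_model, X_train, y_train, X_test, seeds):
    """Train one model per seed and average their predicted probabilities."""
    totals = [0.0] * len(X_test)
    for seed in seeds:
        model = make_model(seed).fit(X_train, y_train)
        for i, prob in enumerate(model.predict_positive(X_test)):
            totals[i] += prob
    return [total / len(seeds) for total in totals]


# Toy data, made up for this example.
X_train, y_train = [[0], [1], [2], [3]], [0, 1, 0, 1]
X_test = [[4], [5], [6]]
averaged = average_over_seeds(ToyModel, X_train, y_train, X_test, seeds=[0, 1, 2, 3, 4])
print(averaged)
```

Averaging over seeds reduces the variance that comes from random initialization and sampling inside each model, which is why the combined prediction tends to be more stable than any single run.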

Another approach that worked very well was to use AutoGluon, an AutoML library that creates stacked models and ensembles automatically.

Once the models were trained we combined all of them into an ensemble. In our solution we used an ensemble of LightGBM + XGBoost + CatBoost + AutoGluon stacking, combined by majority voting between all the models.
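A minimal sketch of majority voting over binary predictions is shown below. The per-model predictions are made up for illustration. With four models a 2-2 tie is possible, and this sketch assumes ties resolve to the negative class, which is our assumption rather than a detail of the actual solution.

```python
def majority_vote(predictions):
    """Combine per-model lists of 0/1 labels into one list by majority vote.

    A row is labeled 1 only if strictly more than half of the models vote 1,
    so ties fall back to 0 (an assumption made for this sketch).
    """
    n_models = len(predictions)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*predictions)]


# Hypothetical binary predictions from each model on four customers.
lightgbm_pred = [1, 0, 1, 0]
xgboost_pred = [1, 1, 0, 0]
catboost_pred = [1, 0, 1, 0]
autogluon_pred = [0, 1, 1, 1]

ensemble = majority_vote([lightgbm_pred, xgboost_pred, catboost_pred, autogluon_pred])
print(ensemble)
```

Voting works on hard labels, so each model's probabilities have to be converted to 0/1 predictions before this step.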

Conclusion:

This was a very fun and interesting challenge. We learned a lot, mostly about how to handle anonymized datasets. With the approach presented here we finished the competition in 15th place out of 174 teams.