Should you accept overfitting to gain performance?
Imagine someone pointing a gun at your head and forcing you to choose between:
Model 1:
Train AUC: 0.94
Test AUC: 0.85
Model 2:
Train AUC: 0.83
Test AUC: 0.81
Which model would you choose?
“Answer or I’ll shoot!!” BANG.
Even with no gun to their head, many people faced with this scenario don’t know what to do, while many others lean towards one model or the other.
I, too, am among those who lean, but first I want to make some clarifications.
What are we predicting? Are we predicting an objective truth or a behavioral phenomenon?
Take a picture of a dog: the photo itself contains all the information necessary to correctly predict whether the animal is a dog or a cat*. This doesn’t hold if you have customer data and want to train a model to predict a consumer’s propensity to purchase or to foresee their churn from a service.
This mainly depends on two factors:
1. The data is limited.
How many variables do we have on the consumer? Let’s play a game: think of a number of variables, and I’ll tell you some you haven’t thought of. Are you ready? The customer could:
- Have lost their job and can’t afford your product
- Have woken up in a bad mood and finally decided to cancel your service
- Be about to move abroad
- Have been persuaded by the competition to leave you
- Be deceased, etc.
What I’m telling you is that you don’t have all the information you would need.
You only have a little.
2. The phenomenon is mutable.
A dog will always be a dog and a cat will always be a cat, while people’s behavior changes. It changes with their habits, it changes with the market, and it changes over time. Nokia made great phones until… I rented a lot of movies from Blockbuster, then suddenly… BlackBerry was the undisputed market leader until…
The good news is that even if the phenomenon you want to predict is mutable, and even if the data at your disposal is limited, you can still do something good and useful for your business.
What does this reasoning have to do with the previous explosive problem? A lot!
Model 1 clearly suffers from overfitting and therefore generalizes poorly: it has learned something during training that it cannot replicate at test time, which is exactly what the 0.09 gap between its train and test AUC is telling us.
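To make the diagnosis concrete, here is a minimal sketch of the train/test gap check; the synthetic dataset and the random-forest choice are illustrative assumptions on my part, not the models from the riddle above.

```python
# A minimal sketch of the train/test AUC gap check, assuming a
# scikit-learn workflow; the dataset and model here are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for limited, noisy customer data.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=5, flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A deep, unconstrained forest tends to memorize the training set.
model = RandomForestClassifier(n_estimators=200, max_depth=None,
                               random_state=42).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# A large gap between the two numbers is the overfitting alarm bell.
print(f"Train AUC: {train_auc:.2f} | Test AUC: {test_auc:.2f} "
      f"| Gap: {train_auc - test_auc:.2f}")
```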
Someone might say, “Who cares! The test set is the closest thing to the production data**; even if Model 1 overfits, it still performs better on the test set than Model 2 (0.85 vs. 0.81), so Model 1 is the better one.”
Someone else might counter with: “No! I don’t trust an overfitting model. I prefer to play it safe and choose Model 2.”
Personally, I tend to lean towards this second reasoning. Pay attention to the word “lean”: this preference does not always hold, but it certainly does if I have to make predictions on a mutable phenomenon.
Whether it’s a fraud detection model, a churn model, or a propensity model, the only thing I know is that the phenomenon (fraud, churn, or propensity) is mutable, and therefore destined to change over time.
This leads me to reject Model 1: a model that has learned patterns not reflected in the test set is unlikely to be reliable on production data, where the phenomenon has drifted even further.
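One way to stress-test a model against this drift, rather than trusting a random split, is to validate out of time: train on older data and test on newer data, so the test set already carries some of the change the model will meet in production. Here is a minimal sketch, assuming a pandas DataFrame with a date column; all column names and dates are hypothetical.

```python
import pandas as pd

def out_of_time_split(df: pd.DataFrame, date_col: str, cutoff: str):
    """Train on observations before `cutoff`, test on those from `cutoff`
    onwards, so the test set already reflects some of the drift the model
    will face in production."""
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff_ts]
    test = df[df[date_col] >= cutoff_ts]
    return train, test

# Hypothetical churn data observed over time (names and dates made up).
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=8, freq="MS"),
    "monthly_spend": [50, 55, 40, 42, 60, 30, 20, 10],
    "churned": [0, 0, 0, 1, 0, 1, 1, 1],
})

train_df, test_df = out_of_time_split(df, "date", "2022-06-01")
print(len(train_df), "training rows,", len(test_df), "test rows")
```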
However, I might not make the same choice if I had to distinguish between Coca-Cola cans and Pepsi cans; in that case, I might even lean towards Model 1. A Coca-Cola can is unlikely to change, and the same goes for a Pepsi can.***
What I can’t stand are the annoying ones who, in the end, ask, “So which is better? Model 1 or Model 2?”
There is no universally better model, only different models with different characteristics that must be used wisely.
We are Data Scientists, and the most important word is the second one. We must draw on our toolbox to answer complex questions; there are no free lunches.
See you around. Stay Tuned.
*. Assuming the photos are taken correctly.
**. It would be interesting to discuss how to build a test set in a mutable case like the one described in the example. Maybe we’ll talk about it in a future episode.
***. In this specific case, it would still be interesting to verify whether, despite the overfitting, the model is using only information we deem relevant (the details of the can) and not external effects (the background).