Introduction
In our previous article, we discussed underfitting in machine learning. Today, we will focus on overfitting, explore the problems it causes, and look at situations where it may be acceptable or even desirable.
Overfitting is a well-known phenomenon in machine learning in which a model becomes too closely tailored to the training data, capturing not only the general patterns but also the anomalies and noise present in the data. This excessive adaptation reduces the model’s ability to generalize to new data, making its predictions less reliable outside the training set. However, in some situations, a certain degree of overfitting can be tolerated or even sought after.
What is overfitting?
Overfitting occurs when a model is too complex relative to the structure of the data. It learns not only the important features but also the specific details and random noise. This excessive complexity leads to poor performance on new data, as the model fails to generalize: it no longer produces reliable predictions when the input characteristics differ from those of the training data.
To illustrate overfitting, imagine you have a dataset of product sales over time and you are trying to predict future sales.
- If you use linear regression, the model will draw a simple straight line that captures the overall sales trend.
- In contrast, a high-degree polynomial regression (for example, degree 10) will fit a complex curve that passes through every data point, effectively capturing every fluctuation.
In the latter case, the model is likely to learn variations specific to the training data that will not appear in new data. This results in a model that appears to perform well on the training set but fails on unseen data, as the short sketch below illustrates.
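To make the contrast concrete, here is a minimal sketch using scikit-learn on synthetic sales data. Everything in it (the dataset, the noise level, the degree-10 choice) is an illustrative assumption, not data from a real product:

```python
# Minimal sketch of the sales example, on synthetic data;
# the dataset and the degree-10 choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
time = np.linspace(0, 1, 48).reshape(-1, 1)               # 4 years of monthly data, normalized
sales = 50 + 120 * time.ravel() + rng.normal(0, 10, 48)   # linear trend + noise

X_train, X_test, y_train, y_test = train_test_split(
    time, sales, test_size=0.25, random_state=0
)

for degree in (1, 10):
    # Degree 1 is the straight line; degree 10 is free to chase every fluctuation.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d} | train MSE {train_mse:6.1f} | test MSE {test_mse:6.1f}")
```

Typically, the degree-10 pipeline reaches a much lower training error than the straight line but a higher test error, which is exactly the overfitting signature described above.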
Another common example of overfitting occurs with deep neural networks. Imagine training a model to classify images of cats and dogs. If the model is too complex, it may overlearn irrelevant details from the training images, such as background color, image orientation, or the position of the animals, instead of focusing on global features like shapes or textures.
This leads to a model that performs well on the training set but fails on new images where these irrelevant details (background, orientation, position) differ.
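The underlying failure is memorization, and it can be reproduced without any images. The sketch below is a stand-in for the CNN scenario, using a decision tree on synthetic data rather than a real image model: the labels are pure noise, yet a sufficiently flexible model still fits them perfectly.

```python
# An overly flexible model memorizing random labels:
# perfect training accuracy, chance-level validation accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20))      # 500 samples, 20 meaningless features
y = rng.integers(0, 2, size=500)    # labels are pure noise

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier()     # unlimited depth: free to memorize
tree.fit(X_tr, y_tr)

print("train accuracy:", tree.score(X_tr, y_tr))    # 1.0: memorized
print("val accuracy:  ", tree.score(X_val, y_val))  # ~0.5: nothing generalizes
```

Training accuracy reaches 1.0 because the tree can carve out a leaf per sample, while validation accuracy hovers around chance (about 0.5), since there was never any signal to learn.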
Problems caused by overfitting
Overfitting raises several challenges in machine learning, including:
- Poor generalization: The overfitted model performs well on the training data but fails to make accurate predictions on new data.
- Learning noise: The model captures errors and anomalies specific to the training data, which negatively impacts its predictions.
- Unnecessary complexity: Overfitted models are often more expensive in terms of computational time and resources without improving prediction quality.
How can overfitting be detected?
Identifying overfitting involves comparing model performance on the training set with performance on a validation or test set. Two key indicators are:
- Performance gap: If the error on the training set is very low while the error on the validation set is high, this is a clear sign of overfitting.
- Learning curves: If, as training progresses, the training error keeps falling while the validation error stagnates or rises, the widening gap between the two curves indicates overfitting.
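Both checks are easy to automate. The sketch below uses scikit-learn's learning_curve on an assumed synthetic dataset, with a deliberately unconstrained decision tree so that the gap is visible:

```python
# Sketch: comparing train vs. validation accuracy as the training set grows.
# The dataset and the deliberately deep tree are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=None, random_state=0),  # prone to overfitting
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A persistent gap between the two scores is the overfitting signal.
    print(f"n={size:4d}  train acc={tr:.2f}  val acc={va:.2f}")
```

With the unconstrained tree, training accuracy stays pinned at 1.0 while validation accuracy lags well behind; constraining the model (for example, by setting max_depth) is the usual way to close the gap.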
When can overfitting be desirable?
In some situations, overfitting may be acceptable or even desirable—especially when generalization is not the main priority or when the data is highly specific.
In cases involving very limited datasets or highly specialized problems, overfitting can be tolerated because little comparable data exists beyond the training set. For example, when developing a model to predict a rare medical condition from very little available data, it may be acceptable for the model to overlearn the specific characteristics of that small dataset. Generalization is not always essential, as the model is unlikely to encounter many new data points.
Another example involves expert systems or well-defined environments. In tightly controlled environments where new data closely resembles the training data, an overfitted model can actually perform better. This is often the case in industrial quality inspection, where each item on a production line is very similar to the previous one. A model overfitted to small variations in the training data may be better at capturing subtle details and identifying defects.
Overfitting can also be useful in historical data modeling or when reproducing behaviors from a specific past period. In financial modeling, for instance, an overfitted model that closely reproduces market fluctuations over a specific time frame can provide more precise insights into past trends. In this case, generalization to other time periods is not the primary goal.
Finally, in personalization scenarios, where a model is designed for a single user or a very small group, overfitting can be beneficial. For example, in a recommendation system for a single user, the model may be overfitted to that user’s specific preferences to improve recommendation quality, without concern for generalization to other users.
Conclusion
Overfitting is generally considered undesirable in machine learning because it harms a model’s ability to generalize to new data. However, in certain specific cases—such as limited datasets, highly controlled environments, or historical modeling—a certain degree of overfitting can be tolerated or even desired. Understanding the context and objectives of a project is essential in deciding whether overfitting can provide added value.