Introduction
Underfitting occurs when a model is too simple to capture the relationships present in the data (here, the term model refers to a machine learning model, whether for supervised, unsupervised, or generative learning). In this situation, the model fails to learn the essential features, leading to significant errors on both the training dataset and the test dataset. Underfitting limits model performance from the outset because the model is unable to grasp the underlying structure of the data.
In this article, we will discuss the causes and consequences of underfitting, how to recognize it, and the strategies used to correct it. We will also explain in detail regularization techniques (such as L1 and L2), which are often used to prevent overfitting but, when misapplied, can also lead to underfitting.
What is underfitting?
Underfitting occurs when a model is too constrained or too simple to capture the relationships between input and output variables. It fails to fit even the training data properly, resulting in poor performance on both the training and test datasets.
Examples of underfitting
Consider using linear regression to model a non-linear relationship. If the data follows a curve (such as a quadratic function) but you apply linear regression, the model will not be able to capture the trend correctly. Underfitting is observed when the model attempts to draw a straight line through data that actually follows a curve. As a result, it will generate large errors on both the training and test data because it cannot accurately represent the underlying relationship.
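To make this concrete, here is a minimal sketch (assuming NumPy and scikit-learn, which the article does not prescribe) that fits a straight line to data generated from a quadratic function; the error stays large even on the training data, which is the hallmark of underfitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data that follows a quadratic curve, plus a little noise
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line cannot follow the curve, so even the training error stays large
linear = LinearRegression().fit(X, y)
print("Training MSE (linear fit):", mean_squared_error(y, linear.predict(X)))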
Another example in classification involves shallow neural networks. In an image classification task, a neural network that is too simple—with a small number of layers and neurons—can lead to underfitting. The model lacks the capacity to extract complex features such as shapes or textures, which are necessary for accurate image classification. Underfitting can also be detected when the model fails to capture subtle variations in the data, leading to poor performance on both the training set and new images.
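As an illustration of this second case, the sketch below (assuming scikit-learn's MLPClassifier and its bundled digits dataset, neither of which is specified in the article) trains a network with a single, very small hidden layer; its accuracy typically stays low on the training set and the test set alike.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden units are far too few to separate ten digit classes
tiny_net = MLPClassifier(hidden_layer_sizes=(2,), max_iter=1000, random_state=0)
tiny_net.fit(X_train, y_train)
print("Train accuracy:", tiny_net.score(X_train, y_train))
print("Test accuracy:", tiny_net.score(X_test, y_test))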
Problems caused by underfitting
Underfitting leads to several issues in machine learning. One key problem is poor performance on the training data: unlike overfitting, where performance is good on the training data but poor on the test data, underfitting results in poor performance everywhere. The model is unable to capture complex relationships in the data, even when they are clearly present.
Another issue is inaccurate predictions. Due to the model’s limited capacity, prediction errors remain high and consistent, even after prolonged training.
Regularization techniques: L1 and L2
Several techniques exist to mitigate underfitting, and understanding regularization is key among them: regularization methods are designed to limit model complexity in order to prevent overfitting, but they must be applied carefully, because excessive regularization can itself cause underfitting.
L1 regularization (Lasso) adds a penalty term to the cost function based on the sum of the absolute values of the model’s parameters.
L1 regularization formula:
J(θ) = error(θ) + λ Σ |θᵢ|,
where λ is a regularization parameter that controls the strength of the penalty.
The effect of L1 regularization is to force some coefficients to become exactly zero, leading to automatic feature selection. It is often used when the goal is to simplify a model by retaining only the most important features. However, overly strong regularization can make the model too simple, causing it to ignore useful information and resulting in underfitting.
For example, when applying L1 regularization to a regression problem with many variables, the model may reduce some coefficients to zero. While this simplifies the model, excessive regularization can eliminate important variables and cause underfitting.
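The sketch below (using scikit-learn's Lasso on a synthetic regression problem; the data and the values of λ are illustrative assumptions) shows this behaviour: as alpha, the library's name for λ, grows, more coefficients are driven to exactly zero and the training fit deteriorates.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, of which only 5 actually carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

for alpha in (0.1, 1.0, 100.0):
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of 20 coefficients are exactly zero, "
          f"training R^2 = {lasso.score(X, y):.3f}")
# A moderate alpha tends to discard only irrelevant features; a very large
# alpha also removes informative ones, and the model starts to underfit.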
L2 regularization (Ridge) is similar to L1, but it penalizes the sum of the squared coefficients rather than their absolute values.
L2 regularization formula:
J(θ) = error(θ) + λ Σ θᵢ².
Again, λ controls the strength of the penalty. Unlike L1, L2 regularization does not force coefficients to become zero; instead, it gradually shrinks them. This helps prevent very large coefficient values that could lead to overfitting. However, excessive L2 regularization can also shrink coefficients too much, making the model unable to capture data complexity and leading to underfitting.
For example, applying overly aggressive L2 regularization to a neural network can cause connection weights to become very small, preventing the model from learning the complex relationships needed for classification or prediction.
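Here is a comparable sketch for Ridge (again with scikit-learn and a synthetic dataset chosen purely for illustration): raising alpha shrinks the whole coefficient vector toward zero without zeroing individual entries, and an extreme value leaves the model unable to fit even the training data.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for alpha in (0.1, 10.0, 10000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: coefficient norm = {np.linalg.norm(ridge.coef_):.1f}, "
          f"training R^2 = {ridge.score(X, y):.3f}")
# Coefficients shrink smoothly rather than vanishing, but an extreme alpha is so
# constraining that the fit degrades: regularization has tipped into underfitting.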
Choosing between L1 and L2
A fundamental question is how to make an informed choice between L1 and L2 regularization. There is no single correct answer. Generally, L1 is more suitable when a sparse model is desired—that is, a model that uses only a small subset of features—such as when working with high-dimensional data. L2 is more effective at reducing the impact of highly correlated features without eliminating them entirely, which can be preferable when all features carry some importance.
However, overly strong regularization (whether L1 or L2) can make the model too simple and lead to underfitting. It is therefore crucial to find the right balance by tuning the regularization parameter λ using techniques such as cross-validation.
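A minimal sketch of this tuning step, assuming scikit-learn's GridSearchCV (the grid of alpha values and the synthetic data are illustrative choices):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=30, noise=15.0, random_state=0)

# Try a range of regularization strengths and keep the one with the best
# cross-validated error; values that underfit or overfit score worse.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("Best alpha:", search.best_params_["alpha"])

The same pattern applies to Lasso, or to any estimator whose regularization strength is exposed as a hyperparameter.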
Recognizing and correcting underfitting
In summary, underfitting is characterized by high errors on both the training and test datasets. The model shows poor performance on both because it has not learned enough from the data. Learning curves may show that the error remains high even after extended training, indicating that the model is unable to capture data complexity.
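This diagnostic can be sketched with scikit-learn's learning_curve helper (an assumption; the article does not tie the diagnostic to a particular tool). For an underfitting model, the training and validation errors stay high and close to each other as the amount of training data grows.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# A linear model on data generated from a quadratic relationship
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=300)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_squared_error",
)
print("Training MSE:", -train_scores.mean(axis=1))
print("Validation MSE:", -val_scores.mean(axis=1))
# Both errors plateau at a high value: more data alone will not help,
# because the model itself lacks capacity.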
Possible corrective actions include reducing regularization if underfitting is caused by excessive penalization—by decreasing λ, the model can learn more freely and better fit the data. Another approach is to add relevant features to enrich the information available to the model, allowing it to learn more complex relationships and avoid remaining too simple.
Additionally, increasing the number of training iterations or epochs can help when the model has not been trained long enough. Finally, using a more complex model—such as moving from linear to polynomial regression, or from a shallow neural network to a deeper one—can help capture more complex patterns.
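To illustrate the last of these actions, the sketch below (reusing the quadratic example from earlier, with scikit-learn's PolynomialFeatures as one assumed way to add capacity) compares the linear fit with a degree-2 polynomial fit on the same data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("Linear training R^2:  ", round(linear.score(X, y), 3))   # low: the model underfits
print("Degree-2 training R^2:", round(poly.score(X, y), 3))     # high: capacity now matches the data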
Finding the right balance between model complexity and regularization is crucial for building a performant model that can generalize well to new data.