model training underfitting, overfitting and more

Common challenges in model training are:

1. overfitting - high train accuracy, terrible production performance

Red flag signals

Train accuracy 98%+, val accuracy 65–70%

Train loss keeps falling, val loss starts rising (divergence point)

Large gap between train F1 and val F1

Model memorises noise — shuffling labels barely changes train loss

Primary metrics to watch

Train/val loss gap, generalisation gap, val accuracy, learning curves, val F1

Watch the gap, not the absolute numbers. Train acc 98% is fine if val acc is also 94%. The gap is the signal.
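A minimal sketch of that check, assuming accuracy-style metrics where higher is better. The function name and the 0.15 red-flag threshold are illustrative assumptions, not a fixed rule:

```python
def generalisation_gap(train_metric: float, val_metric: float) -> float:
    """Gap between training and validation score (higher-is-better metrics)."""
    return train_metric - val_metric

GAP_THRESHOLD = 0.15  # assumed red-flag level; tune it for your own problem

for name, train_acc, val_acc in [("healthy", 0.98, 0.94), ("overfit", 0.98, 0.67)]:
    gap = generalisation_gap(train_acc, val_acc)
    status = "red flag" if gap > GAP_THRESHOLD else "ok"
    print(f"{name}: gap={gap:.2f} -> {status}")
```

Both runs have the same 98% train accuracy; only the gap separates them.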

This is what an overfitting learning curve looks like. A generalisation gap above 0.15 is a red flag; in our case it is 0.426.

Note also that the red line diverges, so the gap grows larger over time.
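One way to locate the divergence point is to find the epoch with the lowest validation loss once the loss has stopped improving for a few epochs. This is a rough sketch; the `patience` parameter and the loss values are made up for illustration:

```python
def divergence_epoch(val_losses, patience=3):
    """Return the index of the lowest validation loss once the loss has
    failed to improve for `patience` consecutive epochs, else None."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            return best_idx
    return None

# Hypothetical validation losses: falling at first, then rising.
val = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66, 0.75]
print(divergence_epoch(val))
```

This is the same idea early stopping uses: keep the weights from the best epoch and stop once validation loss has clearly turned around.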

And this is what it might look like once the gap has become this large:

Possible fixes are:

Dropout

Dropout in plain English:

Imagine you have a team of 10 employees. Every morning, you randomly tell 3 of them to stay home. The remaining 7 have to do the full job without knowing who's absent tomorrow. Over time, every employee is forced to become genuinely useful on their own — nobody gets lazy by relying on a colleague to always cover for them.

That's dropout. Each training batch, random neurons are switched off. The surviving neurons can't specialise by co-depending on each other, so every neuron learns to be independently meaningful. At inference (production), everyone shows up to work — but because each neuron learned to stand alone, the full team is now stronger and more robust.
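To make the "stay home" idea concrete, here is a minimal pure-Python sketch of inverted dropout, the variant where surviving activations are scaled by 1/(1-p) during training so that nothing needs rescaling at inference. The function name and inputs are illustrative assumptions, not a library API:

```python
import random

def dropout(activations, p=0.3, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p during
    training and scale survivors by 1/(1-p); identity at inference."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0] * 10
train_out = dropout(acts, p=0.3, training=True, rng=random.Random(0))
eval_out = dropout(acts, p=0.3, training=False)
print(train_out)  # a mix of zeros and scaled survivors
print(eval_out)   # unchanged: everyone shows up to work
```

Which neurons are dropped changes on every call, so no neuron can count on a particular colleague being present.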

L2 → punishes the optimizer: "your weights are getting too big, here's a tax on that"

Dropout → changes the architecture mid-training: "some of you don't exist this batch, figure it out"

Why are we doing that?

In typical training, neurons A and B always fire together, and they come to specialise in 'remembering' rather than learning (we will talk about learning rate later). Because they always fire as a pair, they specialise in memorising a specific quirk in the training data, not a real pattern. When that quirk isn't in new data, they fail together.

By randomly removing neurons each batch, A can never guarantee B will be there. So A has to learn to be useful on its own. Same for every neuron. The network stops leaning on fixed partnerships.

Once the constant pairing between A and B is broken, the model is forced to learn real patterns rather than memorise.

The trade-offs with dropout are:

+ Forces every neuron to be independently useful

+ Free ensemble: averages many sub-networks

+ Works even when L2 struggles (very deep nets)

+ No assumption about weight magnitude

- Adds training noise, so it needs more epochs

- Useless on single-layer models

- Can hurt small datasets (high variance)

Example of Dropout

✓ Deep neural networks (2+ hidden layers)

✓ Large model with many parameters

✓ Co-adaptation between neurons is the problem

✓ NLP, vision, tabular deep learning

✗ Shallow models (logistic regression, linear)

✗ Very small datasets (increases variance)

✗ Convolutional layers (use spatial dropout instead)

import torch.nn as nn

model = nn.Sequential(
  nn.Linear(256, 128),
  nn.ReLU(),
  nn.Dropout(p=0.3),   # dropout after the activation
  nn.Linear(128, 64),
  nn.ReLU(),
  nn.Dropout(p=0.5),   # higher rate near the output
  nn.Linear(64, 1),    # single output value
)

Example of L2 in use

✓ Any model (universal regulariser)

✓ Logistic/linear regression always

✓ You need interpretable, small weights

✓ Tree models (min_samples_leaf acts similarly)

✓ Want to penalise large weights uniformly

✗ When you want sparse features (use L1 instead)

✗ When features are on very different scales (normalise first)

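import torch
from sklearn.linear_model import LogisticRegression

# `model` is the network being trained
optimizer = torch.optim.AdamW(
  model.parameters(),
  lr=1e-3,
  weight_decay=1e-4   # decoupled weight decay, AdamW's L2-style penalty
)

# sklearn equivalent:
clf = LogisticRegression(C=1.0)  # C = 1/lambda
# lower C means stronger regularisation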
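To see what the "tax" on large weights amounts to, here is a small sketch of a plain L2 penalty added to a loss value. The helper names and the example weights are hypothetical, and lam plays the role of lambda above:

```python
def l2_penalty(weights, lam=1e-4):
    """L2 term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularised_loss(data_loss, weights, lam=1e-4):
    """Total loss = data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, lam)

small = [0.1, -0.2, 0.05]
large = [3.0, -4.0, 2.5]
print(regularised_loss(0.30, small, lam=0.01))
print(regularised_loss(0.30, large, lam=0.01))  # same data loss, bigger tax
```

With identical data loss, the model with larger weights pays more, so the optimizer is nudged towards smaller weights.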

2. underfitting - model is too simple and not learning from the data.

What does an underfit look like?

Shape: Both curves start high, descend very slowly, and plateau together at a high loss. No gap between train and val.

Diagnosis: Structural underfit. The model lacks the capacity to learn this problem. Training longer will barely help.

Fix: Add model capacity (more layers/units), add better features, or reduce over-regularisation. Do NOT just add epochs.

A case where we needed more epochs

Shape: Both curves are actively descending with slope still visible at the epoch limit. Small gap. Not plateaued.

Diagnosis: Undertrained. Learning is happening but not finished. The curve still has slope at the cutoff.

Fix: Increase max_epochs. Check learning rate (too low = very slow). Use a LR scheduler for the late-training plateau.

And this is what a good training run looks like:

Shape: Both curves descend together and plateau close to each other at a low loss. Small stable gap. Healthy.

Diagnosis: Good fit. Train and val converge together at low loss. A small consistent gap is normal and expected.

Fix: No action needed. Monitor for drift. Could try slight LR reduction or more data if you want marginal gains.
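The curve shapes described above can be turned into a rough triage function. This is only a sketch: the thresholds (`high_loss`, `gap_limit`, `flat`) and the loss values are illustrative assumptions, and real curves are noisier than these.

```python
def diagnose(train_losses, val_losses, high_loss=0.5, gap_limit=0.15, flat=0.01):
    """Rough learning-curve triage; needs at least 4 epochs of losses."""
    final_train, final_val = train_losses[-1], val_losses[-1]
    gap = final_val - final_train
    # Average per-epoch drop in training loss over the last 3 epochs.
    slope = (train_losses[-4] - final_train) / 3
    if gap > gap_limit:
        return "overfit"
    if slope > flat:
        return "undertrained"   # still descending at the cutoff
    if final_train > high_loss:
        return "underfit"       # plateaued, but at a high loss
    return "good fit"

underfit = diagnose([1.0, 0.95, 0.92, 0.91, 0.905, 0.90],
                    [1.02, 0.97, 0.94, 0.93, 0.925, 0.92])
undertrained = diagnose([1.0, 0.9, 0.8, 0.7, 0.6, 0.5],
                        [1.05, 0.95, 0.85, 0.75, 0.65, 0.55])
good = diagnose([1.0, 0.5, 0.25, 0.15, 0.12, 0.11, 0.105, 0.10],
                [1.0, 0.55, 0.3, 0.2, 0.17, 0.16, 0.155, 0.15])
print(underfit, undertrained, good)
```

The order of the checks matters: a large gap is tested first, then whether the curve is still falling, and only then whether it has flattened at a high loss.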

Red flag signals

Both train AND val accuracy are low (no gap)

Loss barely decreases across epochs

Residuals show clear structure (regression)

Decision boundary is a straight line on clearly non-linear data

Primary metrics to watch

Train loss (absolute), residual plot, learning curve plateau, bias/variance

Unlike overfitting, train performance itself is the signal — not just the gap. High bias = underfitting.

Fixes

1. Increase model capacity (more layers/units)

2. Train longer / reduce early stopping patience

3. Add features or polynomial terms

4. Reduce regularisation strength

5. Try a more expressive model family

3. class imbalance - Let's say we have fraud training data where 90% of the cases are non-fraud and only 10% are actual fraud. This is what we call imbalanced data. It affects how the model learns and whether it can correctly identify actual fraud. The metrics to rely on are listed under "Primary metrics to watch" below.

Red flag signals

Accuracy 95% but model predicts majority class always

Recall for minority class near 0%

Confusion matrix: minority class column nearly empty

AUC looks fine (0.85+) but precision-recall AUC is terrible

Primary metrics to watch

Recall (minority), precision-recall AUC, F1 per class, confusion matrix, Matthews correlation coefficient (MCC)

Fraud / medical diagnosis: prioritise recall (catching all positives). Spam filter: prioritise precision (avoid false positives). AUC-ROC can look deceptively good on imbalanced data.

How do we know if our data is imbalanced? We can do an exploratory analysis and count the number of fraud vs non-fraud cases. But if we skipped that step, we need to rely on precision, recall and F1. For example, when the data is imbalanced, we will notice:

Class 0 (The Majority data or non-fraud data): Precision, Recall, and F1 will all be near 0.99.

Class 1 (The Minority data or actual Fraud data): Precision, Recall, and F1 will be terrible—often 0.10, 0.05, or literally 0.00.

An imbalanced dataset will often drag down F1 and precision. In the diagram below, notice that "false negative" is 452: the model is getting a large number of transactions wrong.

A better-balanced dataset (which doesn't necessarily mean the model is a good one): precision is 41.9%, accuracy went up to 63.3% and F1 went up to 57.6%. The model still produces 347 false positives (a reduction from the above). False negatives went up to 20.

Fixes

1. SMOTE oversampling or class_weight='balanced'

2. Adjust classification threshold (default 0.5 is rarely optimal)

3. Use PR-AUC, not ROC-AUC as primary metric

4. Focal loss for deep learning

SMOTE stands for Synthetic Minority Over-sampling Technique.

When your data is severely imbalanced (like the fraud example above), your model is starving for examples of the minority class. SMOTE is an over-sampling algorithm that feeds the model by generating brand-new synthetic data points for the minority class, instead of merely duplicating existing ones.

Imagine plotting your data on a scatter plot. Here is exactly what SMOTE does:

Pick a point: It selects an existing minority data point.
Find the neighbors: It looks at the k-nearest neighbors of that point (other minority points that are most similar to it).
Draw a line: It draws an imaginary line connecting your point to one of those neighbors.
Create a fake point: It randomly drops a brand new, synthetic data point somewhere along that imaginary line.

It repeats this process until the minority class is as large as you need it to be.
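The steps above can be sketched in a few lines of pure Python. This is an illustration of the interpolation idea only, with made-up 2D points and a hypothetical function name; in practice you would more likely use a library implementation such as the `SMOTE` class in imbalanced-learn.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)                      # pick a point
        others = [q for q in minority if q is not point]
        others.sort(key=lambda q: math.dist(point, q))    # find the neighbours
        neighbour = rng.choice(others[:k])
        t = rng.random()                                  # position along the line
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(point, neighbour)))
    return synthetic

# Hypothetical minority-class points (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_points = smote_sketch(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two real minority points, so the new data stays inside the region the minority class already occupies. Apply this to the training split only, never to validation or test data.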

Let's look at an example dataset where the class imbalance hurts the model.

Fraud detection, 200 transactions. The model predicts everything as "not fraud". Look at precision, recall and F1: none of them is above zero.

Accuracy: 190+0 correct out of 200 = 95%. Sounds great. It is not great. The model did nothing useful.

Recall: Of the 10 real fraud cases, how many did we catch? 0 out of 10 = 0%. This is your red flag metric.

Precision: Of everything we called fraud, how many were real? We called nothing fraud, so precision is undefined (0/0).

F1: Combines precision and recall. When recall is 0, F1 is 0. This is the metric that saves you from the accuracy lie.

Same data. We add class_weight='balanced' to tell the model: "a fraud mistake costs 19× more than a legit mistake."
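The 19× figure follows from the class counts. A small sketch, assuming the heuristic scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * class_count); the helper name is hypothetical:

```python
def balanced_class_weights(counts):
    """Weight per class = n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

weights = balanced_class_weights({"legit": 190, "fraud": 10})
ratio = weights["fraud"] / weights["legit"]
print(weights)
print(ratio)  # how much more a fraud mistake costs than a legit one
```

With 190 legit and 10 fraud cases, the ratio of the two weights is 190/10, which is where the 19× comes from.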

Much better. Recall jumped from 0% to 80% — we're catching most fraud now. Accuracy actually dropped (from 95% to 92%) because we're generating false positives. That tradeoff is correct for fraud detection.

Accuracy: Dropped from 95% to 92%. This is GOOD. The 95% was fake. 92% reflects reality: we're actively making decisions now instead of always guessing "legit".

Recall: Caught 8 out of 10 real frauds = 80%. This is what class weights fixed: the model stopped ignoring the minority.

Precision: 8 true fraud / (8+14) flagged = 36%. We're flagging some innocent people. Acceptable for fraud: better to investigate 14 false alarms than miss 10 real frauds.

F1: 0.50. Not great but real. The model is doing something useful now. Room to improve via threshold tuning.
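The numbers in both scenarios can be reproduced from the confusion-matrix counts. A minimal sketch: the helper is hypothetical, and the true-negative count of 176 is derived here as 190 legit transactions minus 14 false alarms.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined if nothing was flagged
    recall = tp / (tp + fn) if (tp + fn) else None
    if not precision or not recall:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Always predicting "not fraud": 10 frauds missed, 190 legit correct.
before = metrics(tp=0, fp=0, fn=10, tn=190)
# With class weights: 8 frauds caught, 2 missed, 14 false alarms.
after = metrics(tp=8, fp=14, fn=2, tn=176)
print(before)
print(after)
```

Accuracy moves only slightly between the two runs while recall and F1 change completely, which is why accuracy alone cannot be trusted on imbalanced data.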

Well-balanced data: all four metrics are meaningful now

After applying class weights during training and tuning the threshold on the precision-recall curve, the model is now genuinely learning both classes and the four metrics tell a coherent, honest story.

All four metrics are high and honest. Accuracy (97%) is now trustworthy because it aligns with precision, recall, and F1. The model is genuinely working — not just guessing the majority class.

Accuracy: 484 correct out of 500 = 97%. This time it's real, because it matches the other metrics. Accuracy is only trustworthy when precision and recall are also high.

Precision: 43 spam caught / (43+9) flagged = 83%. Of every email we send to junk, 83% really is spam. Only 9 real emails wrongly junked.

Recall: 43 caught / 50 real spam = 86%. We stopped 86% of spam. 7 slipped through, which is acceptable for email.

F1: 0.84. When F1 is this close to precision and recall individually, it confirms neither metric is secretly broken. The model is balanced.

Why threshold 0.35? Default 0.50 gave recall of only 70%. Lowering to 0.35 means we flag more things as spam (higher recall, slightly lower precision). The PR curve showed 0.35 maximised F1 for this dataset.
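Picking the threshold can be sketched as a simple sweep that keeps the candidate with the best F1. The scores and labels below are invented toy values, not the dataset from this post, and the helper names are hypothetical:

```python
def f1_at(threshold, scores, labels):
    """F1 when everything scoring at or above the threshold is flagged positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at(t, scores, labels))

# Toy predicted probabilities and true labels (1 = spam).
scores = [0.05, 0.10, 0.20, 0.30, 0.38, 0.40, 0.45, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
candidates = [0.20, 0.35, 0.50, 0.65]
chosen = best_threshold(scores, labels, candidates)
print(chosen, f1_at(chosen, scores, labels))
```

Tune the threshold on a validation set, not the test set, otherwise the reported metrics will be optimistic.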

4. data drift - when a model is trained on fraud data from 2025, it might not be able to detect fraud in 2026 because fraud has evolved. The fix is pretty straightforward, but detection can be challenging.

Red flag signals

Prod accuracy was fine at launch, now drifting down month-by-month

Input feature distributions shift (PSI > 0.2)

Prediction distribution skews (suddenly predicting more class A)

Concept drift: relationship between X and Y changed (e.g. post-COVID)

Primary metrics to watch

PSI (Population Stability Index), KS test on feature distributions, prediction drift, rolling prod accuracy, data quality score

PSI < 0.1: no drift. 0.1–0.2: monitor. > 0.2: retrain. Concept drift is harder — input distribution looks fine but performance drops.
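For reference, PSI compares the share of values falling into each bin at training time with the share seen in production: the sum over bins of (actual - expected) * ln(actual / expected). A pure-Python sketch, with a hypothetical function name, made-up data, and fixed bin edges chosen for illustration (values outside the edges are simply ignored here):

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between a baseline and a production sample."""
    def fractions(values):
        counts = [0] * (len(bin_edges) - 1)
        for v in values:
            for i in range(len(counts)):
                # The last bin also includes its right edge.
                upper_ok = v < bin_edges[i + 1] or (i == len(counts) - 1 and v == bin_edges[-1])
                if bin_edges[i] <= v and upper_ok:
                    counts[i] += 1
                    break
        # eps avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [i / 100 for i in range(100)]           # feature values at training time
shifted = [min(v + 0.3, 1.0) for v in baseline]    # same feature, drifted upwards
print(psi(baseline, baseline, edges))
print(psi(baseline, shifted, edges))
```

An identical distribution scores 0, while the shifted one lands well above the 0.2 retrain level.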

Fixes

1. Scheduled retraining pipeline (weekly / monthly)

2. Monitor PSI on all features in production

3. Champion-challenger: shadow new model alongside current

4. Online learning if distribution changes rapidly

5. data leakage

Red flag signals

Val accuracy 99%+ on a hard problem — implausible

A single feature explains nearly all variance (feature importance)

Model uses future data that wouldn't exist at inference time

Performance collapses when you retrain on a proper time split

Primary metrics to watch

Feature importance audit, temporal split vs random split delta, permutation importance, SHAP values

Classic leakage: scaler/imputer fitted on full dataset before splitting. Always fit preprocessing on train set only.

Fixes

1. Use sklearn Pipeline, so preprocessing is fitted only on the train fold

2. For time series: always split by time, not randomly

3. Audit feature definitions: does this exist at prediction time?

4. SHAP analysis to catch proxy leakage features
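The scaler leak is easy to see with plain numbers. This is a pure-Python sketch of the principle that a sklearn Pipeline applies for you on each fold; the helper names and values are made up:

```python
import statistics

def fit_scaler(values):
    """Learn mean and standard deviation from the given values only."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 11.0, 13.0, 9.0]
val = [30.0, 28.0]  # later data on a different scale

# Leaky: statistics computed on train + val, so val information reaches training.
leaky_mean, leaky_std = fit_scaler(train + val)

# Correct: fit on train only, then reuse the same statistics on val.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
val_scaled = transform(val, mean, std)
print(leaky_mean, mean)
```

The leaky mean is pulled towards the validation values, which means the training features already "know" something about data the model is supposed to be evaluated on.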

6. llm specifics

Red flag signals

Model scores 95%+ on MMLU but fails simple real-world variants

Benchmark score improves but user satisfaction does not

Model has seen benchmark data in pre-training (contamination)

All teams using same public benchmarks — no signal differentiation

Primary metrics to watch

Private held-out eval set, task-specific domain evals, contamination audit, human preference rate

Build your own eval suite for your specific use case. Public benchmarks are shared across all models — they stop differentiating once everyone optimises for them.

Fixes

1. Create private domain eval set never exposed to training

2. Use dynamic or adversarial benchmarks

3. Audit training data for benchmark contamination

4. Prefer behavioral evals over static MCQ scores
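A contamination audit can start as a simple n-gram overlap check between benchmark items and training text. This is a rough sketch: the function names, the 5-gram window and the example strings are illustrative assumptions, and real audits need better tokenisation and far larger corpora.

```python
def ngrams(text, n=5):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, training_docs, n=5):
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items)

# Hypothetical training text and benchmark questions.
training_docs = ["forum post: what is the capital of france and why is it paris"]
benchmark = ["What is the capital of France", "Which gas do plants absorb from the air"]
rate = contamination_rate(benchmark, training_docs)
print(rate)
```

Items that get flagged are candidates for removal from the eval set, or for replacement with freshly written variants.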

mitzen
