Model training: underfitting, overfitting and more

Common challenges in model training are:

1. overfitting - high train accuracy, terrible production performance 

Red flag signals

Train accuracy 98%+, val accuracy 65–70%
Train loss keeps falling, val loss starts rising (divergence point)
Large gap between train F1 and val F1
Model memorises noise — shuffling labels barely changes train loss

Primary metrics to watch

Train/val loss gap, generalisation gap, val accuracy, learning curves, val F1

Watch the gap, not the absolute numbers. Train acc 98% is fine if val acc is also 94%. The gap is the signal.

This is what an overfitting learning curve looks like. A generalisation gap above 0.15 is a red flag; in our case it is 0.426.
Note also that the red line keeps diverging, so the gap grows over time.
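
The gap check above can be sketched in a few lines of Python. This is only a minimal illustration, assuming you already log train and validation accuracy and the validation loss per epoch. The function names are made up for this post, and the validation accuracy of 0.554 and the loss trace are hypothetical values chosen so the gap matches the 0.426 mentioned above.

```python
def generalisation_gap(train_acc, val_acc):
    """Difference between train and validation accuracy (both in 0..1)."""
    return train_acc - val_acc


def is_overfitting(train_acc, val_acc, threshold=0.15):
    """Flag a run when the gap exceeds the threshold (0.15 as in the text)."""
    return generalisation_gap(train_acc, val_acc) > threshold


def divergence_epoch(val_losses):
    """Index of the epoch with the lowest validation loss.

    In this sketch that epoch is treated as the divergence point: later
    epochs did not improve the validation loss any further.
    """
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Hypothetical numbers, chosen only for illustration.
print(round(generalisation_gap(0.98, 0.554), 3))   # 0.426
print(is_overfitting(0.98, 0.94))                  # False: small gap is fine
print(divergence_epoch([0.90, 0.62, 0.51, 0.55, 0.63, 0.74]))   # 2
```

The second call is the "98% train, 94% val" case from earlier: high train accuracy alone does not trip the check, only the gap does.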


And this is what it might look like once the train/val gap becomes this large:



Possible fixes are:
Dropout
Dropout in plain English:

Imagine you have a team of 10 employees. Every morning, you randomly tell 3 of them to stay home. The remaining 7 have to do the full job without knowing who's absent tomorrow. Over time, every employee is forced to become genuinely useful on their own — nobody gets lazy by relying on a colleague to always cover for them.

That's dropout. Each training batch, random neurons are switched off. The surviving neurons can't specialise by co-depending on each other, so every neuron learns to be independently meaningful. At inference (production), everyone shows up to work — but because each neuron learned to stand alone, the full team is now stronger and more robust.
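
To make the "some employees stay home" idea concrete, here is a small pure-Python sketch of inverted dropout. It is an illustration rather than how any particular framework implements it internally: the function name and the list-of-floats representation are assumptions for this post. The 1 / (1 - p) scaling of the surviving values is the same scaling PyTorch documents for nn.Dropout during training.

```python
import random


def dropout(activations, p=0.3, training=True, rng=random):
    """Inverted dropout on a list of activations.

    During training each value is zeroed with probability p and the
    survivors are scaled by 1 / (1 - p), so the expected output matches
    the input. At inference the values pass through unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


team = [1.0] * 10                      # ten "employees"
rng = random.Random(0)                 # fixed seed so runs are repeatable
print(dropout(team, p=0.3, rng=rng))   # some entries are 0.0, the rest 1 / 0.7
print(dropout(team, training=False))   # everyone shows up: all 1.0
```

Which entries are zeroed changes on every call during training, which is exactly why no neuron can count on a specific partner being present.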

  • L2 → punishes the optimizer: "your weights are getting too big, here's a tax on that"
  • Dropout → changes the architecture mid-training: "some of you don't exist this batch, figure it out"
    Why are we doing that?

    In typical training, neurons A and B learn to always fire as a pair, so they specialise in 'remembering' instead of learning (we will talk about learning rate later). They memorise a specific quirk in the training data, not a real pattern. When that quirk isn't in new data, they fail together.



    By randomly removing neurons each batch, A can never guarantee B will be there. So A has to learn to be useful on its own. Same for every neuron. The network stops leaning on fixed partnerships.

    Once the constant pairing between A and B is broken, the model is forced to learn real patterns instead of memorising quirks.


    Pros and cons of dropout:
    + Forces every neuron to be independently useful
    + Free ensemble: averages many sub-networks
    + Works even when L2 struggles (very deep nets)
    + No assumption about weight magnitude
    - Adds training noise, so it needs more epochs
    - Useless on single-layer models
    - Can hurt small datasets (high variance)

    Example of Dropout
    Use when:
    - Deep neural networks (2+ hidden layers)
    - Large model with many parameters
    - Co-adaptation between neurons is the problem
    - NLP, vision, tabular deep learning
    Avoid when:
    - Shallow models (logistic regression, linear)
    - Very small datasets, where it increases variance
    - Convolutional layers (use spatial dropout instead)

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Dropout(p=0.3),  # dropout goes after the activation
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),  # higher near output
        nn.Linear(64, 1),
    )

    Example of L2 in use
    Use when:
    - Any model: it is a universal regulariser
    - Logistic/linear regression, always
    - You need interpretable, small weights
    - Tree models (min_samples_leaf acts similarly)
    - You want to penalise large weights uniformly
    Avoid when:
    - You want sparse features (use L1 instead)
    - Features are on very different scales (normalise first)

    import torch

    # `model` is assumed to be an existing nn.Module
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,
        weight_decay=1e-4,  # decoupled weight decay, AdamW's L2-style penalty
    )

    # sklearn equivalent: LogisticRegression(C=1.0)
    # C = 1/lambda, so a lower C means stronger regularisation
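
To see what the "tax" on large weights does numerically, here is a tiny sketch of the classic L2 penalty with a plain SGD step. It is a simplified illustration, not what AdamW does internally (AdamW applies its weight decay separately from the gradient). The function names, `lam`, and the example weights are all made up for this post.

```python
def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)


def sgd_step_with_l2(weights, grads, lr, lam):
    """One plain SGD step on loss + lam * sum(w^2).

    The penalty contributes 2 * lam * w to each gradient, so large
    weights are pulled towards zero harder than small ones.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


weights = [4.0, -0.5]
print(l2_penalty(weights, lam=0.01))     # 0.01 * (16 + 0.25)

# With a zero data gradient, only the penalty moves the weights.
print(sgd_step_with_l2(weights, [0.0, 0.0], lr=0.1, lam=0.01))
```

The large weight (4.0) shrinks by 0.008 in one step while the small one (-0.5) moves by only 0.001, which is the "bigger weights pay more tax" behaviour.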

    2. underfitting - model is too simple and not learning from the data 

    Red flag signals
    Both train AND val accuracy are low (no gap)
    Loss barely decreases across epochs
    Residuals show clear structure (regression)
    Decision boundary is a straight line on clearly non-linear data
    Primary metrics to watch

    Train loss (absolute), residual plot, learning curve plateau, bias vs variance

    Unlike overfitting, train performance itself is the signal — not just the gap. High bias = underfitting.

    Fixes

    1. Increase model capacity (more layers/units)
    2. Train longer / reduce early stopping patience
    3. Add features or polynomial terms
    4. Reduce regularisation strength
    5. Try a more expressive model family
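
Here is a small sketch of fix 3 (adding a polynomial term), using a hand-written least-squares fit so it runs without any libraries. The data and helper names are invented for illustration: a straight line is fitted to a clearly curved target, first on the raw feature and then on its square.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]            # clearly non-linear target

# Straight line on the raw feature: underfits, train error stays high.
slope, intercept = fit_line(xs, ys)
print(mse(xs, ys, slope, intercept))

# Same linear model on a polynomial feature x^2: the train error vanishes.
xs_sq = [x * x for x in xs]
slope_sq, intercept_sq = fit_line(xs_sq, ys)
print(mse(xs_sq, ys, slope_sq, intercept_sq))
```

Note that the first model has a high error on the training data itself, which is the underfitting signal: there is no train/val gap to look at, the train number is already bad.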


    3. class imbalance - Let's say we have fraud training data where 90% of the cases are non-fraud and only 10% are actual fraud. This is what we call imbalanced data. It affects the way our model learns and its ability to correctly identify actual fraud. The metrics to rely on here are listed under "Primary metrics to watch".

    Red flag signals

    Accuracy 95% but model predicts majority class always
    Recall for minority class near 0%
    Confusion matrix: minority class column nearly empty
    AUC looks fine (0.85+) but precision-recall AUC is terrible

    Primary metrics to watch

    Recall (minority), precision-recall AUC, F1 per class, confusion matrix, Matthews correlation coefficient (MCC)

    Fraud / medical diagnosis: prioritise recall (catching all positives). Spam filter: prioritise precision (avoid false positives). ROC-AUC can look deceptively good on imbalanced data, so do not rely on it alone.
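
To show why accuracy alone is misleading here, this sketch computes the metrics above from the four confusion matrix counts. The helper name is made up, and the example is a hypothetical model that always predicts "non-fraud" on 1,000 transactions with the 90/10 split described earlier.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and MCC for the positive (fraud) class.

    Undefined ratios (zero denominator) are reported as 0.0.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical model that always predicts "non-fraud" on 900 legit / 100 fraud.
always_majority = binary_metrics(tp=0, fp=0, fn=100, tn=900)
print(always_majority)   # accuracy is 0.9, yet recall, F1 and MCC are all 0.0
```

A model that never catches a single fraud case still scores 90% accuracy, which is why recall, F1 and MCC for the minority class are the numbers to watch.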

    How do we know if our data is imbalanced? We can do an exploratory analysis and count the number of fraud vs non-fraud cases. If we skipped that step, we need to rely on per-class precision, recall and F1. For example, when the data is imbalanced, what we will notice is:
    Class 0 (the majority, non-fraud data): precision, recall and F1 will all be near 0.99.
    Class 1 (the minority, actual fraud data): precision, recall and F1 will be terrible, often 0.10, 0.05, or literally 0.00.
    An imbalanced dataset will often drag down F1 and precision. In the diagram below, notice the 452 false positives: the model wrongly flags legitimate transactions as fraud.

    With a better balanced dataset (which does not by itself mean the model is a good one), precision is 41.9%, accuracy rises to 63.3% and F1 rises to 57.6%. The model still produces 347 false positives (fewer than above), while false negatives rise to 20.

    Fixes
    1. SMOTE oversampling or class_weight='balanced'
    2. Adjust classification threshold (default 0.5 is rarely optimal)
    3. Use PR-AUC, not ROC-AUC as primary metric
    4. Focal loss for deep learning
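
Fix 2 (adjusting the threshold) can be sketched as a simple sweep over candidate thresholds on a validation set, keeping the one with the best F1 for the fraud class. The scores and labels below are hypothetical, the helper names are made up, and F1 is just one possible target: if recall matters most, optimise for that instead.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical validation scores: both fraud cases score below 0.5.
scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.45]
labels = [0, 0, 0, 0, 0, 1, 0, 1]
candidates = [i / 20 for i in range(1, 20)]      # 0.05, 0.10, ..., 0.95

print(f1_at_threshold(scores, labels, 0.5))      # 0.0, every fraud case is missed
print(best_threshold(scores, labels, candidates))
```

With the default 0.5 cut-off the model never predicts fraud on this data, while a lower threshold recovers both fraud cases at the cost of one false positive. Pick the threshold on validation data, not on the test set.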

    SMOTE stands for Synthetic Minority Over-sampling Technique.

    When your data is severely imbalanced (for example 1% fraud vs. 99% normal), your model is starving for examples of the minority class. SMOTE feeds the model by generating brand new, synthetic data points for that minority class, instead of simply duplicating the few minority rows you already have.

    Imagine plotting your data on a scatter plot. Here is exactly what SMOTE does:

    1. Pick a point: It selects an existing minority data point.

    2. Find the neighbors: It looks at the k-nearest neighbors of that point (other minority points that are most similar to it).

    3. Draw a line: It draws an imaginary line connecting your point to one of those neighbors.

    4. Create a fake point: It randomly drops a brand new, synthetic data point somewhere along that imaginary line.

    It repeats this process until the minority class is as large as you need it to be.
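
The four steps above can be sketched in plain Python. This is a simplified illustration of the idea and not the reference implementation; in a real project you would more likely reach for the SMOTE class in the imbalanced-learn package. The function name, the 2-D points and the parameter names below are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=3, rng=random):
    """Generate n_new synthetic minority points by interpolation.

    For each new point: pick a random minority sample, pick one of its
    k nearest minority neighbours, then place the new point at a random
    position on the straight line between the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)                               # step 1
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]  # step 2
        neighbour = rng.choice(neighbours)                        # step 3
        t = rng.random()                     # position along the line, 0..1
        synthetic.append(                                         # step 4
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D fraud samples (e.g. scaled amount and hour of day).
fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(fraud, n_new=6, k=2, rng=random.Random(42))
print(len(fraud) + len(new_points))          # 10 minority points in total
```

Because every synthetic point sits between two real minority points, it stays inside the region the minority class already occupies. Apply this to the training split only, so that validation and test data contain real cases only.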


    4. data drift - a model trained on fraud data from 2025 might not be able to detect fraud in 2026 because fraud has evolved. The fix is fairly straightforward, but detecting the drift can be challenging.


    Red flag signals

    Prod accuracy was fine at launch, now drifting down month-by-month
    Input feature distributions shift (PSI > 0.2)
    Prediction distribution skews (suddenly predicting more class A)
    Concept drift: relationship between X and Y changed (e.g. post-COVID)

    Primary metrics to watch

    PSI (Population Stability Index), KS test on feature distributions, prediction drift, rolling prod accuracy, data quality score

    PSI < 0.1: no drift. 0.1–0.2: monitor. > 0.2: retrain. Concept drift is harder — input distribution looks fine but performance drops.
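
For reference, PSI compares the share of values falling in each bin at training time (expected) with the share in production (actual): PSI = sum over bins of (actual - expected) * ln(actual / expected). Below is a minimal sketch for one numeric feature. The bin edges, sample values and function names are hypothetical, and the small `eps` used for empty bins is a common workaround rather than part of the definition.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    edges are the bin boundaries (len(edges) - 1 bins); values outside
    the outer edges are clipped into the first or last bin. eps avoids
    log(0) when a bin is empty.
    """
    def bin_shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            idx = sum(1 for e in edges[1:-1] if v >= e)   # bin index
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


def drift_status(value):
    """Thresholds from the text: below 0.1 no drift, up to 0.2 monitor, above 0.2 retrain."""
    if value < 0.1:
        return "no drift"
    return "monitor" if value <= 0.2 else "retrain"


# Hypothetical transaction amounts at training time vs in production.
train_amounts = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
prod_amounts = [60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
edges = [0, 25, 50, 75, 100, 1000]

print(drift_status(psi(train_amounts, train_amounts, edges)))   # no drift
print(drift_status(psi(train_amounts, prod_amounts, edges)))    # retrain
```

This only catches shifts in the inputs. Concept drift, where the inputs look the same but the correct answer has changed, will not show up in PSI and needs labelled production data to detect.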

    Fixes
    1. Scheduled retraining pipeline (weekly / monthly)
    2. Monitor PSI on all features in production
    3. Champion-challenger: shadow new model alongside current
    4. Online learning if distribution changes rapidly

    5. data leakage 

    Red flag signals
    Val accuracy 99%+ on a hard problem — implausible
    A single feature explains nearly all variance (feature importance)
    Model uses future data that wouldn't exist at inference time
    Performance collapses when you retrain on a proper time split

    Primary metrics to watch

    Feature importance audit, temporal split vs random split delta, permutation importance, SHAP values

    Classic leakage: scaler/imputer fitted on full dataset before splitting. Always fit preprocessing on train set only.
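
Here is a toy version of that scaler mistake, written without any libraries so the leak is easy to see. The `Standardiser` class is a made-up stand-in for a real scaler and the numbers are hypothetical. In practice an sklearn Pipeline takes care of fitting on the train fold only.

```python
class Standardiser:
    """Tiny stand-in for a scaler: learns mean and std, then rescales."""

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        # Fall back to 1.0 when all values are identical (std of zero).
        self.std = (sum((v - self.mean) ** 2 for v in values) / n) ** 0.5 or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


train = [10.0, 12.0, 11.0, 13.0]
val = [50.0, 55.0]                      # hypothetical held-out rows

# Leaky: statistics computed on train + val, so val influences training inputs.
leaky = Standardiser().fit(train + val)

# Correct: statistics come from the train split only, then get reused for val.
clean = Standardiser().fit(train)
train_scaled = clean.transform(train)
val_scaled = clean.transform(val)

print(leaky.mean, clean.mean)           # the leaky mean is pulled towards val
```

The leaky scaler has already "seen" the validation rows before training starts, so any validation score computed afterwards is optimistic.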

    Fixes
    1. Use sklearn Pipeline, so preprocessing is fitted only on the train fold
    2. For time series: always split by time, not randomly
    3. Audit feature definitions: does this exist at prediction time?
    4. SHAP analysis to catch proxy leakage features


    6. LLM specifics


    Red flag signals
    Model scores 95%+ on MMLU but fails simple real-world variants
    Benchmark score improves but user satisfaction does not
    Model has seen benchmark data in pre-training (contamination)
    All teams using same public benchmarks — no signal differentiation

    Primary metrics to watch

    Private held-out eval set, task-specific domain evals, contamination audit, human preference rate

    Build your own eval suite for your specific use case. Public benchmarks are shared across all models — they stop differentiating once everyone optimises for them.

    Fixes
    1. Create private domain eval set never exposed to training
    2. Use dynamic or adversarial benchmarks
    3. Audit training data for benchmark contamination
    4. Prefer behavioral evals over static MCQ scores
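
A contamination audit (fix 3) can start as something very crude: check whether any word n-gram from a benchmark item also appears in the training text. The sketch below is only a rough heuristic under that assumption. The function names, the choice of 5-word n-grams and the sample data are all made up for illustration, and real audits usually need better text normalisation and far more efficient indexing.

```python
def ngrams(text, n):
    """Set of word n-grams in lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=5):
    """Share of benchmark items sharing at least one n-gram with training data."""
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & seen]
    return len(flagged) / len(benchmark_items), flagged


# Hypothetical data, for illustration only.
training_docs = ["forum post: what is the capital of france? it is paris of course"]
benchmark_items = [
    "What is the capital of France?",
    "Which planet has the most moons?",
]
rate, flagged = contamination_rate(benchmark_items, training_docs)
print(rate, flagged)
```

Items that get flagged are candidates for manual review or removal from the eval set; a high rate suggests the benchmark score says more about memorisation than ability.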
















