Model training papers of interest

Interesting papers for model optimization

Paper: Polar Express

The paper introduces Polar Express, a GPU-friendly polynomial method for computing the polar decomposition of a matrix, designed for fast convergence and low approximation error. It adapts its polynomial from one iteration to the next and outperforms classical methods in deep learning applications such as Muon, GPT-2 training, and image classification. It also shows robust finite-precision stability and has potential for large-scale, aspect-ratio-optimized, spectrum-aware acceleration.

https://arxiv.org/pdf/2505.16932
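To make "polynomial method for the polar decomposition" concrete, here is a minimal NumPy sketch of the classical Newton-Schulz iteration, one of the fixed-polynomial baselines that this kind of method is compared against. It is not Polar Express itself: the paper adapts the polynomial at each step, and those coefficients are not reproduced here. The function name, the step count, and the Frobenius-norm scaling are assumptions made for illustration.

```python
import numpy as np

def newton_schulz_polar(M, steps=40):
    """Approximate the orthogonal polar factor of M with the classical
    Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X (a fixed cubic)."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of this iteration.
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        # Only matrix multiplications are needed, which is why this family
        # of methods maps well onto GPUs.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
Q = newton_schulz_polar(M)

# Reference polar factor from the thin SVD: M = U S V^T  ->  Q = U V^T
U, _, Vt = np.linalg.svd(M, full_matrices=False)
print("max deviation from SVD reference:", np.max(np.abs(Q - U @ Vt)))
```

Each step applies the same odd polynomial to every singular value and pushes it toward 1. The small singular values are the slow ones, which is the inefficiency an adaptive polynomial schedule is meant to reduce.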


Paper: LowRA

"LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits". Stanford University: Zhou, Zhang, Kumbong, Olukotun. arXiv: 2502.08141 (Feb 2025, accepted ICLR 2026).

https://arxiv.org/abs/2502.08141

The problem it solves

QLoRA quantizes the base model to 4-bit but keeps the LoRA adapters in higher precision (bf16). LowRA asks: what if we also aggressively quantize the adapter weights?

LowRA is the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. It optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment.

How it works — three key pieces

1. Fine-grained mixed precision assignment

Rather than quantizing everything to the same bit-width, LowRA looks at each slice of each weight matrix and asks "how sensitive is this slice?", then assigns more bits where it matters and fewer where it doesn't. LowRA squeezes each parameter to about 2 bits, over 15× smaller than the 32-bit norm, while keeping accuracy nearly intact. It learns quantization encoders/decoders specific to each slice of parameters, assigns 1/2/4-bit budgets with a fast optimizer, and dequantizes on the fly with lightweight CUDA kernels, so there is virtually no runtime cost.

2. Mappings/thresholds learner

Instead of using fixed quantization bins (like standard INT4, which divides the range evenly), LowRA learns where to place the boundaries between quantization levels, specific to each layer. This is why it can go so low without collapsing.
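The difference between fixed and learned bins can be sketched with a 1-D Lloyd (k-means) quantizer, a standard way to fit quantization levels to data. This is a stand-in chosen for illustration, not necessarily the procedure LowRA uses. The function names, the synthetic Gaussian weights, and the 2-bit setting are all assumptions.

```python
import numpy as np

def uniform_levels(w, bits):
    """Evenly spaced levels across the weight range (the fixed-bin baseline)."""
    return np.linspace(w.min(), w.max(), 2 ** bits)

def lloyd_levels(w, bits, iters=50):
    """Refine the levels with 1-D Lloyd iterations so they follow the data."""
    levels = uniform_levels(w, bits)
    for _ in range(iters):
        thresholds = (levels[:-1] + levels[1:]) / 2  # boundaries between levels
        idx = np.searchsorted(thresholds, w)         # bin index for each weight
        for k in range(levels.size):
            members = w[idx == k]
            if members.size:                         # leave empty bins unchanged
                levels[k] = members.mean()
    return levels

def quantize(w, levels):
    """Map each weight to its nearest level."""
    thresholds = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(thresholds, w)]

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000) * 0.02               # synthetic stand-in for one layer

mse_uniform = np.mean((w - quantize(w, uniform_levels(w, 2))) ** 2)
mse_learned = np.mean((w - quantize(w, lloyd_levels(w, 2))) ** 2)
print(f"uniform MSE: {mse_uniform:.3e}   learned MSE: {mse_learned:.3e}")
```

With only four levels, evenly spaced bins spend two of them on the sparse tails of a bell-shaped weight distribution. Fitted levels move toward where the weights actually are, which lowers the reconstruction error at the same bit-width.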

3. Precision assigner

A small optimizer runs alongside training that continuously reassigns bit budgets across the adapter. A layer that's actively changing gets more bits; a layer that's stabilized gets fewer.
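As a rough illustration of budgeted precision assignment, the sketch below hands out 1, 2, or 4 bits per slice under an average-bit budget, serving the most sensitive slices first. The greedy rule, the `assign_bits` name, and the sensitivity scores are hypothetical. The paper's actual optimizer is not reproduced here, and this sketch performs a single assignment rather than reassigning during training.

```python
def assign_bits(sensitivity, avg_bits_budget, choices=(1, 2, 4)):
    """Greedy bit allocation: every slice starts at the smallest width, then
    slices are upgraded in order of decreasing sensitivity while budget remains."""
    n = len(sensitivity)
    bits = [min(choices)] * n
    budget = avg_bits_budget * n - sum(bits)     # spare bits left to hand out
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    for i in order:
        # Try the widest option first and fall back to narrower ones.
        for level in sorted(choices, reverse=True):
            cost = level - bits[i]
            if 0 < cost <= budget:
                bits[i] = level
                budget -= cost
                break
    return bits

# Hypothetical per-slice sensitivity scores (higher = more sensitive)
scores = [0.9, 0.1, 0.5, 0.05]
allocation = assign_bits(scores, avg_bits_budget=2.0)
print(allocation, "average bits:", sum(allocation) / len(allocation))
```

In this toy case the most sensitive slice ends up at 4 bits and the least sensitive ones stay at 1 bit, while the average stays at the requested 2 bits per parameter.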

What the results look like

LowRA cuts memory usage by 30–50% during fine-tuning and deployment with minimal performance loss, and enables fine-tuning and deploying LLMs in ultra-resource-constrained settings at as low as 1.15 bits.

In practice, this means:

| Method | Base model bits | Adapter bits | Total memory (14B) |
| --- | --- | --- | --- |
| Standard LoRA | 16-bit | 16-bit | ~28GB |
| QLoRA | 4-bit | 16-bit | ~12GB |
| LowRA | 4-bit | 1–2 bit | ~9GB |
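The 16-bit figure in the table can be sanity-checked with simple arithmetic: parameters × bits per parameter ÷ 8 gives bytes. The sketch below covers weight storage only. The ~12GB and ~9GB totals presumably include overheads (adapters, activations, optimizer state) that this sketch does not model, so it will not reproduce those rows. The helper name is an assumption.

```python
def weight_gb(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 14e9  # the 14B model size used in the table
for bits in (16, 4, 2, 1.15):
    print(f"{bits:>5} bits/param -> {weight_gb(params, bits):.2f} GB of weights")
```

At 16 bits per parameter, a 14B model needs 28 GB for its weights, matching the first row. At 4 bits the weights alone take 7 GB, so the rest of the QLoRA total comes from everything else held in memory during fine-tuning.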


Paper: Token-Efficient RL for LLM Reasoning

This paper introduces S-GRPO (Stochastic GRPO), which extends GRPO to low-memory settings by letting only a subset of the tokens in each response contribute to the gradient, rather than the full response trajectory. This makes reinforcement learning fine-tuning viable on modest hardware. It is directly relevant if you want your coding model to reason through problems rather than just pattern-match, which matters for harder debugging or architecture tasks.

https://arxiv.org/pdf/2504.20834
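The idea of restricting which tokens contribute to the gradient can be sketched as a masked loss on top of GRPO's group-normalized advantages. This is a simplified forward-pass illustration in NumPy, under several assumptions: the uniform Bernoulli mask and the `keep_prob` parameter are hypothetical, the paper's actual sampling rule may differ, and the clipping and KL terms of full GRPO are omitted.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward normalized within its sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def sampled_token_loss(logps, rewards, keep_prob, rng):
    """Policy-gradient surrogate computed over a random subset of tokens.

    logps: array of shape (group_size, seq_len) holding the log-probability
    the policy gave to each generated token.
    """
    adv = group_advantages(rewards)[:, None]        # one advantage per response
    mask = rng.random(logps.shape) < keep_prob      # tokens that enter the loss
    # Masked-out tokens contribute nothing, so in a real training framework
    # they would not need gradients or stored activations.
    loss = -(adv * logps * mask).sum() / max(mask.sum(), 1)
    return loss, mask

rng = np.random.default_rng(0)
logps = -rng.random((4, 16))                        # synthetic log-probs for 4 responses
rewards = [1.0, 0.0, 0.0, 1.0]                      # e.g. pass/fail per response
loss, mask = sampled_token_loss(logps, rewards, keep_prob=0.25, rng=rng)
print("tokens used:", int(mask.sum()), "of", mask.size, " loss:", float(loss))
```

The memory saving comes from the mask: the fewer tokens that enter the loss, the less has to be kept around for the backward pass.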


Paper: Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

https://arxiv.org/html/2602.24283v1
