Model training papers of interest
Interesting papers for model optimizations
Paper: Polar Express
The paper introduces Polar Express, a GPU-friendly polynomial method for computing the matrix polar decomposition, designed for fast convergence and low error. It adapts its polynomial at each iteration and outperforms classical methods in deep learning applications like Muon, GPT-2 training, and image classification. It is stable in finite precision and has potential for large-scale, aspect-ratio-optimized, spectrum-aware acceleration.
https://arxiv.org/pdf/2505.16932
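For context, here is a minimal sketch of the kind of classical method the paper compares against: the cubic Newton-Schulz iteration, which approximates the orthogonal polar factor using only matrix multiplications. This is not the Polar Express update (the paper adapts its polynomial per iteration, while the coefficients here are fixed), and the function name, step count, and test matrix are illustrative assumptions.

```python
import numpy as np

def newton_schulz_polar(a: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor of `a` (classical baseline, fixed coefficients)."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the iteration.
    x = a / np.linalg.norm(a)
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X pushes each singular value toward 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

a = np.array([[4.0, 1.0], [2.0, 3.0]])  # small, well-conditioned example matrix
q = newton_schulz_polar(a)

# Reference polar factor from the SVD: A = U S V^T  =>  Q = U V^T.
u, _, vt = np.linalg.svd(a)
q_ref = u @ vt
orthogonality_error = np.linalg.norm(q.T @ q - np.eye(2))
```

Because the loop is just matrix products, it maps well to GPUs. The fixed coefficients converge slowly when singular values are small, which is the regime an adapted polynomial is meant to speed up.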
Paper: LowRA
Paper: "LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits"
Stanford University — Zhou, Zhang, Kumbong, Olukotun
arXiv: 2502.08141 (Feb 2025, accepted ICLR 2026)
https://arxiv.org/abs/2502.08141
The problem it solves
QLoRA (what you'd use in the training code above) quantizes the base model to 4-bit but keeps the LoRA adapters themselves in full precision (bf16). LowRA asks: what if we also aggressively quantize the adapter weights themselves?
LowRA is the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. It optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment.
How it works — three key pieces
1. Fine-grained mixed precision assignment
Rather than quantizing everything to the same bit-width, LowRA looks at each slice of each weight matrix, asks "how sensitive is this slice?", and then assigns more bits where it matters and fewer where it doesn't. LowRA squeezes each parameter to about 2 bits, over 15× smaller than the 32-bit norm, while keeping accuracy nearly intact. It learns quantization encoders/decoders specific to each slice of parameters, assigns 1/2/4-bit budgets with a fast optimizer, and dequantizes on the fly with lightweight CUDA kernels, so there is virtually no runtime cost.
2. Mappings/thresholds learner
Instead of using fixed quantization bins (like standard INT4 which divides the range evenly), LowRA learns where to place the boundaries between quantization levels, specific to each layer. This is why it can go so low without collapsing.
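To illustrate why learned level placement helps, here is a rough sketch that compares an evenly spaced 2-bit grid with levels fitted to the data by Lloyd-Max (1-D k-means). Lloyd-Max is a stand-in chosen for illustration, not LowRA's actual learner, and the random "weights" and function names are assumptions.

```python
import numpy as np

def quantize(w, levels):
    """Map each weight to its nearest quantization level."""
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx], idx

def lloyd_max_levels(w, bits, iters=25):
    """Fit 2**bits levels to `w`, starting from an evenly spaced grid."""
    levels = np.linspace(w.min(), w.max(), 2 ** bits)
    for _ in range(iters):
        _, idx = quantize(w, levels)
        for k in range(len(levels)):
            members = w[idx == k]
            if members.size:  # leave a level in place if nothing maps to it
                levels[k] = members.mean()
    return levels

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)  # bell-shaped values standing in for one weight slice

uniform_levels = np.linspace(w.min(), w.max(), 4)  # evenly spaced 2-bit grid
learned_levels = lloyd_max_levels(w, bits=2)

mse_uniform = np.mean((w - quantize(w, uniform_levels)[0]) ** 2)
mse_learned = np.mean((w - quantize(w, learned_levels)[0]) ** 2)
```

The implied thresholds sit at the midpoints between adjacent levels. On bell-shaped data the fitted levels move toward the dense center, so the reconstruction error drops relative to the even grid.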
3. Precision assigner
A small optimizer runs alongside training that continuously reassigns bit budgets across the adapter. A layer that's actively changing gets more bits; a layer that's stabilized gets fewer.
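A single assignment step of this kind could look like the sketch below, which greedily hands the 1/2/4-bit options to slices in order of sensitivity while keeping the average under a budget. The greedy rule, the `assign_bits` name, and the sensitivity scores are all hypothetical; the paper's own assigner may work differently.

```python
def assign_bits(sensitivity, avg_bits_budget, choices=(1, 2, 4)):
    """Greedily give more bits to the most sensitive slices within an average budget."""
    bits = [choices[0]] * len(sensitivity)  # every slice starts at the lowest precision
    total_budget = avg_bits_budget * len(sensitivity)
    # Visit slices from most to least sensitive; upgrade each as far as the budget allows.
    for i in sorted(range(len(sensitivity)), key=lambda j: -sensitivity[j]):
        for b in reversed(choices):
            if sum(bits) - bits[i] + b <= total_budget:
                bits[i] = b
                break
    return bits

sensitivity = [0.9, 0.1, 0.5, 0.05, 0.3, 0.02, 0.7, 0.01]  # hypothetical per-slice scores
bits = assign_bits(sensitivity, avg_bits_budget=2.0)
print(bits)  # [4, 1, 2, 1, 2, 1, 4, 1]
```

Rerunning the function with fresh sensitivity scores would shift bits toward whichever slices currently matter most, at the same average cost.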
What the results look like
LowRA cuts memory usage by 30–50% during fine-tuning and deployment with minimal performance loss, and enables fine-tuning and deploying LLMs in ultra-resource-constrained settings at as low as 1.15 bits.
In practice, this means:
| Method | Base model bits | Adapter bits | Total memory (14B) |
|---|---|---|---|
| Standard LoRA | 16-bit | 16-bit | ~28GB |
| QLoRA | 4-bit | 16-bit | ~12GB |
| LowRA | 4-bit | 1–2 bit | ~9GB |
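
As a sanity check on the table, weight storage alone is parameters × bits / 8 bytes. The sketch below assumes decimal gigabytes and counts only the base model weights. The ~28GB row matches this arithmetic directly. The 4-bit rows sit above the 7 GB weights-only floor, presumably because they also include adapters, activations, and optimizer state; that breakdown is an assumption, not something stated above.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 14e9  # 14B-parameter base model, as in the table
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit base weights
print(fp16_gb, int4_gb)  # 28.0 7.0
```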
Paper: Token-Efficient RL for LLM Reasoning
This paper introduces S-GRPO (Stochastic GRPO), which extends GRPO to low-memory settings by letting only a subset of the tokens in each response trajectory contribute to the gradient, rather than the full trajectory. This makes reinforcement learning fine-tuning viable on modest hardware. It is directly relevant if you want your coding model to reason through problems rather than just pattern-match, which matters for harder debugging or architecture tasks.
https://arxiv.org/pdf/2504.20834
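The general idea can be sketched in plain Python: compute GRPO-style group-relative advantages, then average the policy-gradient term over a random subset of tokens. The uniform Bernoulli sampling, the `keep_prob` parameter, and the function names are illustrative assumptions and not necessarily the paper's selection rule.

```python
import random
import statistics

def group_advantages(rewards):
    """GRPO-style advantages: each reward relative to its group's mean and spread."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid dividing by zero for identical rewards
    return [(r - mean) / std for r in rewards]

def subsampled_pg_loss(token_logprobs, advantage, keep_prob, rng):
    """Policy-gradient loss averaged over a random subset of response tokens."""
    kept = [lp for lp in token_logprobs if rng.random() < keep_prob]
    if not kept:  # always keep at least one token so the loss is defined
        kept = [rng.choice(token_logprobs)]
    # In a real training loop, only the kept tokens would be backpropagated through.
    loss = -advantage * sum(kept) / len(kept)
    return loss, len(kept)

rng = random.Random(0)
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])  # rewards for 4 sampled responses
logprobs = [-0.5, -1.0, -1.5, -2.0]                  # hypothetical per-token log-probs

full_loss, full_n = subsampled_pg_loss(logprobs, advantages[0], keep_prob=1.0, rng=rng)
sub_loss, sub_n = subsampled_pg_loss(logprobs, advantages[0], keep_prob=0.5, rng=rng)
```

With `keep_prob=1.0` this reduces to the usual full-trajectory average. Lower values trade gradient variance for less memory, since fewer tokens need their activations retained.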
Paper: Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation
https://arxiv.org/html/2602.24283v1