Model training papers of interest

Interesting papers for model optimizations

Paper: Polar Express

The paper introduces Polar Express, a GPU-friendly polynomial method for computing the matrix polar decomposition, designed for fast convergence and low error. It adapts the polynomial at each iteration, outperforming classical methods in deep learning applications like Muon, GPT-2 training, and image classification, with robust finite-precision stability and potential for large-scale, aspect-ratio-optimized, spectrum-aware acceleration.

https://arxiv.org/pdf/2505.16932
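
For context on what a polynomial method for the polar factor looks like, here is a minimal sketch of the classical cubic Newton-Schulz iteration, the kind of baseline such methods are compared against. It is not the Polar Express update: the paper adapts the polynomial from one iteration to the next, whereas the fixed coefficients 1.5 and -0.5 below are the textbook choice. The function name, the example matrix, and the step count are illustrative assumptions.

```python
import numpy as np

def newton_schulz_polar(a: np.ndarray, steps: int = 20) -> np.ndarray:
    """Approximate the orthogonal polar factor of `a` with fixed cubic steps."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = a / np.linalg.norm(a)
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X, built from matrix multiplications only.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

# Small, well-conditioned example matrix (hypothetical data).
a = np.array([[3.0, 1.0, 0.0],
              [-1.0, 2.0, 1.0],
              [0.0, 1.0, 4.0]])
q = newton_schulz_polar(a)

# Reference polar factor from the SVD: A = U S V^T  =>  Q = U V^T.
u, _, vt = np.linalg.svd(a)
print(np.allclose(q, u @ vt, atol=1e-8))
```

Because each step is only matrix multiplications, the iteration maps well to GPUs, which is why this family of methods is attractive compared with an explicit SVD.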

Paper: LowRA

"LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits". Zhou, Zhang, Kumbong, Olukotun (Stanford University). arXiv: 2502.08141 (Feb 2025, accepted ICLR 2026).

https://arxiv.org/abs/2502.08141

The problem it solves

QLoRA quantizes the base model to 4-bit but keeps the LoRA adapters themselves in 16-bit precision (bf16). LowRA asks: what if we also aggressively quantize the adapter weights themselves?

LowRA is the first framework to enable LoRA fine-tuning below 2 bits per parameter with minimal performance loss. It optimizes fine-grained quantization (mapping, threshold selection, and precision assignment) while leveraging efficient CUDA kernels for scalable deployment.

How it works — three key pieces

1. Fine-grained mixed precision assignment

Rather than quantizing everything to the same bit-width, LowRA looks at each slice of each weight matrix and asks "how sensitive is this slice?", then assigns more bits where it matters and fewer where it doesn't. It squeezes each parameter to about 2 bits, over 15× smaller than the 32-bit norm, while keeping accuracy nearly intact. It learns quantization encoders/decoders specific to each slice of parameters, assigns 1/2/4-bit budgets with a fast optimizer, and dequantizes on the fly with lightweight CUDA kernels, so there is virtually no runtime cost.
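
To see how a mix of 1-, 2-, and 4-bit slices can land on a fractional average such as 1.15 bits per parameter, here is a small arithmetic sketch. The slice layout (40 slices of 64 parameters) is a made-up example, not the allocation used in the paper.

```python
def average_bits(slices):
    """Average bits per parameter over (num_params, bits) slices."""
    total_params = sum(n for n, _ in slices)
    total_bits = sum(n * b for n, b in slices)
    return total_bits / total_params

# Hypothetical layout: 36 slices at 1 bit, 3 at 2 bits, 1 at 4 bits.
slices = [(64, 1)] * 36 + [(64, 2)] * 3 + [(64, 4)] * 1
avg = average_bits(slices)
compression_vs_fp32 = 32 / avg

print(f"average bits per parameter: {avg:.2f}")
print(f"compression vs 32-bit: {compression_vs_fp32:.1f}x")
```

The point is that the headline bit-width is an average: most slices can sit at 1 bit as long as a few sensitive ones are kept at higher precision.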

2. Mappings/thresholds learner

Instead of using fixed quantization bins (like standard INT4 which divides the range evenly), LowRA learns where to place the boundaries between quantization levels, specific to each layer. This is why it can go so low without collapsing.
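
The difference between fixed and learned bin boundaries can be illustrated with the classic Lloyd-Max procedure, which alternates between placing thresholds midway between levels and moving each level to the mean of its bin. This is a stand-in technique chosen for illustration, not LowRA's learner, and the Gaussian sample data and function names are assumptions.

```python
import random

def uniform_codebook(values, levels):
    """Evenly spaced quantization levels across the value range."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]

def lloyd_max(values, levels, iters=50):
    """Fit quantization levels to the data with 1-D Lloyd iterations."""
    codebook = uniform_codebook(values, levels)
    for _ in range(iters):
        # Thresholds sit midway between neighbouring levels.
        thresholds = [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]
        buckets = [[] for _ in codebook]
        for v in values:
            buckets[sum(v > t for t in thresholds)].append(v)
        # Move each level to the mean of its bin; keep it if the bin is empty.
        codebook = [sum(b) / len(b) if b else c for b, c in zip(buckets, codebook)]
    return codebook

def quantization_error(values, codebook):
    """Mean squared error when each value snaps to its nearest level."""
    return sum(min((v - c) ** 2 for c in codebook) for v in values) / len(values)

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # hypothetical weight slice

uniform = uniform_codebook(weights, levels=4)  # 4 levels = 2 bits
learned = lloyd_max(weights, levels=4)
uniform_err = quantization_error(weights, uniform)
learned_err = quantization_error(weights, learned)
print(f"uniform MSE: {uniform_err:.4f}, learned MSE: {learned_err:.4f}")
```

With bell-shaped weights, evenly spaced bins waste levels on the sparse tails, so fitting the levels to the data lowers the error at the same bit-width.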

3. Precision assigner

A small optimizer runs alongside training that continuously reassigns bit budgets across the adapter. A layer that's actively changing gets more bits; a layer that's stabilized gets fewer.
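
As a rough illustration of budgeted precision assignment, the sketch below walks through slices from most to least sensitive and gives each the highest bit-width the remaining budget allows. This greedy rule is an assumption made for illustration, and the paper's optimizer may work quite differently. The sensitivity scores are invented.

```python
def assign_bits(sensitivities, avg_bits_budget, choices=(1, 2, 4)):
    """Greedy bit assignment under an average bits-per-slice budget."""
    # Every slice starts at the lowest precision.
    bits = [choices[0]] * len(sensitivities)
    budget = avg_bits_budget * len(sensitivities) - sum(bits)
    # Visit the most sensitive slices first.
    order = sorted(range(len(sensitivities)), key=lambda i: -sensitivities[i])
    for i in order:
        # Take the highest precision that still fits in the budget.
        for target in sorted(choices, reverse=True):
            cost = target - bits[i]
            if cost <= budget:
                bits[i] = target
                budget -= cost
                break
    return bits

# Hypothetical sensitivity scores for four equally sized slices.
sensitivities = [0.9, 0.1, 0.5, 0.05]
bits = assign_bits(sensitivities, avg_bits_budget=2.0)
print(bits)  # [4, 1, 2, 1]
```

Here the most sensitive slice gets 4 bits, the next gets 2, and the rest stay at 1, which averages to the 2-bit budget.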

What the results look like

LowRA cuts memory usage by 30–50% during fine-tuning and deployment with minimal performance loss, and enables fine-tuning and deploying LLMs in ultra-resource-constrained settings at as low as 1.15 bits.

In practice, this means:

Method	Base model bits	Adapter bits	Total memory (14B)
Standard LoRA	16-bit	16-bit	~28GB
QLoRA	4-bit	16-bit	~12GB
LowRA	4-bit	1–2 bit	~9GB
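
As a sanity check on the orders of magnitude above, here is a weights-only estimate of memory at a given bit-width. The helper name is an assumption. The 16-bit figure matches the table; the quantized rows in the table are higher than a weights-only figure, presumably because they include overhead that this sketch does not model.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params = 14e9  # 14B-parameter model, as in the table
for name, bits_per_param in [("16-bit", 16), ("4-bit", 4), ("2-bit", 2)]:
    print(f"{name:>6}: {weight_memory_gb(params, bits_per_param):.1f} GB")
```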

Paper: Token-Efficient RL for LLM Reasoning

This paper introduces S-GRPO (Stochastic GRPO), which extends GRPO to low-memory settings by letting only a subset of each response's tokens contribute to the gradient rather than the full trajectory, making reinforcement learning fine-tuning viable on modest hardware. This is directly relevant if you want your coding model to reason through problems rather than just pattern-match, which matters for harder debugging or architecture tasks.

https://arxiv.org/pdf/2504.20834
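
The idea of restricting which tokens feed the gradient can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it samples each token independently with a fixed probability (an assumed rule) and omits the clipping and KL terms of the full GRPO objective. All names and numbers are hypothetical.

```python
import random

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

def sample_token_mask(length, keep_prob, rng):
    """Choose which token positions contribute to the objective."""
    return [rng.random() < keep_prob for _ in range(length)]

def subsampled_objective(logprobs, rewards, keep_prob, rng):
    """Advantage-weighted log-likelihood over the sampled tokens only."""
    advantages = group_relative_advantages(rewards)
    total, kept = 0.0, 0
    for token_logprobs, adv in zip(logprobs, advantages):
        mask = sample_token_mask(len(token_logprobs), keep_prob, rng)
        for lp, keep in zip(token_logprobs, mask):
            if keep:
                total += adv * lp
                kept += 1
    return total / max(kept, 1), kept

# Four sampled responses to one prompt, with per-token log-probabilities.
logprobs = [[-0.2, -0.5, -0.1], [-1.0, -0.7], [-0.3, -0.9, -0.4, -0.6], [-0.8]]
rewards = [1.0, 0.0, 0.0, 1.0]

objective, kept = subsampled_objective(logprobs, rewards, 0.5, random.Random(0))
print(f"objective={objective:.4f} over {kept} of 10 tokens")
```

In a real training loop the skipped tokens would be left out of backpropagation, which is where the memory saving would come from.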

Paper: Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation

https://arxiv.org/html/2602.24283v1
