Training

Reusable training recipes, SFT methods, data selection, distillation, and optimization practice.


2 papers

4 resource links

Latest month: 2026.05

1 paper

Optimization

2026.05 Optimization

PowLU: An Activation Function for Stable Pre-Training of LLMs

This paper identifies SwiGLU's near-quadratic amplification on large positive inputs as a source of outliers and numerical instability in low-precision large-scale LLM pre-training. It proposes Power Linear Unit (PowLU), a rational-power activation that preserves adaptive nonlinearity while stabilizing spike regions, with scaling-law and Ling-model experiments showing competitive performance and improved training scalability.
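The near-quadratic growth mentioned above can be seen in a scalar toy version of SwiGLU. This is only a sketch under stated assumptions: the gate and value pre-activations are set to the same scalar `t` (in a real layer they come from two separate linear projections), and the helper names `silu` and `swiglu_scalar` are hypothetical. PowLU itself is not reproduced here because its formula is not given in this summary.

```python
import math


def silu(x: float) -> float:
    """SiLU/Swish: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_scalar(gate: float, value: float) -> float:
    """Scalar stand-in for SwiGLU: silu(gate) * value.

    Assumption: gate and value are plain scalars, not the outputs of
    learned projections as in a real feed-forward layer.
    """
    return silu(gate) * value


# When both branches carry the same large positive input t, sigmoid(t) is
# close to 1, so the output approaches t * t.
for t in (1.0, 4.0, 16.0, 64.0):
    out = swiglu_scalar(t, t)
    print(f"t={t:6.1f}  swiglu={out:10.3f}  swiglu / t^2={out / (t * t):.4f}")
```

The last column tends toward 1 as `t` grows, which is the amplification the paper links to activation outliers in low-precision training.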

Paper

1 paper

Distillation

2023.06 Distillation

Knowledge Distillation of Large Language Models

This paper studies white-box knowledge distillation for generative LLMs and proposes MiniLLM, which replaces the standard forward KLD objective with reverse KLD so that the student does not overestimate regions to which the teacher assigns low probability. The method derives an effective optimization procedure and improves instruction-following quality, calibration, and long-text generation while reducing exposure bias, across model families from 120M to 13B parameters.
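The difference between the two objectives can be illustrated on a toy next-token distribution. This is a minimal sketch, not MiniLLM's training procedure: the teacher distribution `p`, the two candidate students `q_cover` and `q_mode`, and the helper `kl` are all hypothetical values chosen for illustration.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as equal-length lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical teacher: two strong modes and three low-probability tokens.
p = [0.48, 0.01, 0.02, 0.01, 0.48]

# Two hypothetical students with limited capacity:
q_cover = [0.20, 0.20, 0.20, 0.20, 0.20]  # spreads mass over every token
q_mode = [0.96, 0.01, 0.01, 0.01, 0.01]   # commits to one teacher mode

forward = {"cover": kl(p, q_cover), "mode": kl(p, q_mode)}  # KL(teacher || student)
reverse = {"cover": kl(q_cover, p), "mode": kl(q_mode, p)}  # KL(student || teacher)

print("forward KLD prefers:", min(forward, key=forward.get))
print("reverse KLD prefers:", min(reverse, key=reverse.get))
```

In this toy setup, forward KLD scores the spread-out student better even though it places 0.20 on tokens the teacher rates at 0.01 or 0.02, while reverse KLD penalizes that mass and favors the student that stays on a single high-probability mode.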

Paper Code Hugging Face