Awesome LLM Research Collections

Training

Reusable training recipes, SFT methods, data selection, distillation, and optimization practice.

Research category

2 papers
4 resource links
Latest month: 2026.05
Categories: Optimization, Distillation

Optimization (1 paper)

2026.05 Optimization

PowLU: An Activation Function for Stable Pre-Training of LLMs

This paper identifies SwiGLU's near-quadratic amplification of large positive inputs as a source of outliers and numerical instability in low-precision, large-scale LLM pre-training. It proposes the Power Linear Unit (PowLU), a rational-power activation that preserves adaptive nonlinearity while stabilizing spike regions. Scaling-law and Ling-model experiments show competitive performance and improved training scalability.

Paper

Distillation (1 paper)

2023.06 Distillation

Knowledge Distillation of Large Language Models

This paper studies white-box knowledge distillation for generative LLMs and proposes MiniLLM, which replaces the standard forward Kullback-Leibler divergence (KLD) objective with reverse KLD so that the student does not overestimate low-probability regions of the teacher distribution. The method derives an effective optimization procedure and yields better instruction-following quality and calibration, lower exposure bias, and stronger long-text generation across model families from 120M to 13B parameters.

Paper Code Hugging Face
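The difference between forward and reverse KLD described above can be illustrated on a toy next-token distribution. This is a minimal sketch under stated assumptions, not MiniLLM's training procedure: the four-token `teacher` distribution and the two candidate students (`spread`, `focused`) are invented for illustration, and the divergences are computed exactly instead of being optimized.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical teacher with two likely tokens and two unlikely ones.
teacher = [0.48, 0.02, 0.48, 0.02]

# Two hypothetical limited-capacity students.
spread = [0.25, 0.25, 0.25, 0.25]   # covers every token, including unlikely ones
focused = [0.94, 0.02, 0.02, 0.02]  # commits to one of the teacher's modes

for name, student in (("spread", spread), ("focused", focused)):
    forward = kl(teacher, student)  # forward KLD: KL(teacher || student)
    reverse = kl(student, teacher)  # reverse KLD: KL(student || teacher)
    print(f"{name:8s} forward={forward:.3f} reverse={reverse:.3f}")
```

In this toy case forward KLD scores the `spread` student as the better fit even though it puts 0.25 of its mass on tokens the teacher rates at 0.02, while reverse KLD prefers the `focused` student that stays inside a high-probability teacher mode.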