Awesome LLM Research Collections

Attention

Transformer internals, attention variants, KV/cache behavior, and depth-wise information flow.

4 papers
8 resource links
Latest month: 2026.06

Attention Architecture

2026.06 Attention Architecture

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

This paper proposes Lookahead Sparse Attention, which uses a separately trained neural memory indexer to predict future context needs and keep only query-critical KV chunks on the GPU. On average, FlashMemory shrinks the physical KV cache to 13.5% of the full-context attention footprint while preserving or slightly improving long-context accuracy.
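
The chunk-retention idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the chunk size, and above all the scorer, which is a simple stand-in (dot product of the current query with each chunk's mean key). It is not the paper's trained indexer and does not model the lookahead prediction.

```python
import math

def chunk_scores(query, keys, chunk_size):
    """Score each KV chunk with a stand-in heuristic (NOT the paper's indexer):
    the dot product of the query with the mean key of the chunk."""
    scores = []
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        mean_key = [sum(col) / len(chunk) for col in zip(*chunk)]
        scores.append(sum(q * m for q, m in zip(query, mean_key)))
    return scores

def select_chunks(scores, keep_fraction):
    """Return the indices (in positional order) of the highest-scoring chunks.
    Only these chunks would stay resident in the hypothetical GPU cache."""
    keep = max(1, math.ceil(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Eight cached keys of dimension 2, grouped into four chunks of two keys each.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [2, 0], [2, 0]]
scores = chunk_scores([1.0, 0.0], keys, chunk_size=2)
kept = select_chunks(scores, keep_fraction=0.5)
```

Attention would then run over the retained chunks only, so the resident cache scales with `keep_fraction` rather than with the full context length.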

Resources: Paper, Code, Hugging Face

2026.05 Attention Architecture

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

This paper proposes Group-Query Latent Attention (GQLA), a minimal modification of Multi-head Latent Attention (MLA) that exposes both an MQA-absorb and a GQA decoding path from the same trained weights. The runtime can select the path that matches the target hardware without retraining or custom kernels, enabling H100-style compressed decoding, H20-oriented GQA combined with multi-token prediction (MTP), and up to 8-way zero-redundancy tensor parallelism.
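
As background for the "absorb" path, the sketch below checks the algebraic identity that weight absorption in MLA-style decoding relies on: scoring a query against an up-projected key gives the same result as folding the up-projection into the query and scoring against the shared latent directly. The names `w_up_k` and `latent` and the toy dimensions are assumptions for illustration. This shows only the general identity, not GQLA's specific modification or its GQA path.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [dot(row, v) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def score_expanded(q, w_up_k, latent):
    """Expand the cached latent into a per-head key, then take q . k."""
    key = matvec(w_up_k, latent)
    return dot(q, key)

def score_absorbed(q, w_up_k, latent):
    """Fold the key up-projection into the query, then score against the
    latent itself, so every head reads the same cached vector (MQA-like)."""
    absorbed_q = matvec(transpose(w_up_k), q)
    return dot(absorbed_q, latent)

# Toy sizes: head dimension 3, latent dimension 2 (hypothetical values).
w_up_k = [[1, 2], [0, 1], [3, 0]]
latent = [1, -1]
q = [2, 1, 0]
expanded = score_expanded(q, w_up_k, latent)
absorbed = score_absorbed(q, w_up_k, latent)
```

Both paths give the same attention score. The absorbed form never materializes per-head keys, which is why it behaves like multi-query attention over the latent cache.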

Resources: Paper, Code

2026.03 Attention Architecture

Attention Residuals

This work replaces fixed residual accumulation with attention over previous layer outputs, enabling input-dependent depth-wise aggregation and reducing PreNorm-induced representation dilution. It also introduces Block AttnRes for scalable training with lower memory and communication overhead.
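
A minimal sketch of the contrast between fixed and attention-based depth-wise aggregation, for a single token. The per-layer query vector `depth_query`, the scaling, and the function names are assumptions for illustration. The paper's exact parameterization (and its Block AttnRes variant) may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def standard_residual_input(layer_outputs):
    """Fixed accumulation: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(h[j] for h in layer_outputs) for j in range(dim)]

def attention_residual_input(layer_outputs, depth_query):
    """Input-dependent accumulation: softmax attention over depth.
    depth_query is a hypothetical learned vector for the current layer."""
    scale = 1.0 / math.sqrt(len(depth_query))
    scores = [scale * sum(q * x for q, x in zip(depth_query, h))
              for h in layer_outputs]
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    mixed = [sum(w * h[j] for w, h in zip(weights, layer_outputs))
             for j in range(dim)]
    return mixed, weights

# Outputs of three earlier layers for one token (toy values).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summed = standard_residual_input(outputs)
mixed, weights = attention_residual_input(outputs, depth_query=[10.0, 0.0])
```

Because the weights sum to one, the mixed input stays on the scale of a single layer output instead of growing with depth, which is one way to read the dilution argument in the summary above.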

Resources: Paper, Project

2019.11 Attention Architecture

Fast Transformer Decoding: One Write-Head is All You Need

This paper introduces multi-query attention, sharing keys and values across attention heads to reduce the memory-bandwidth cost of incremental Transformer decoding. The variant speeds up decoding substantially while incurring only minor quality degradation relative to multi-head attention baselines.
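
A minimal sketch of one incremental decoding step with multi-query attention, in plain Python for readability. The function names and toy shapes are illustrative assumptions. Input and output projections are omitted, so this shows only the shared key/value cache.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mqa_decode_step(queries, k_cache, v_cache):
    """One decoding step of multi-query attention for a single new token.

    queries: one query vector per head.
    k_cache, v_cache: ONE key and ONE value vector per cached position,
    shared by all heads (multi-head attention would store one per head).
    Returns one output vector per head.
    """
    scale = 1.0 / math.sqrt(len(k_cache[0]))
    value_dim = len(v_cache[0])
    outputs = []
    for q in queries:
        weights = softmax([scale * dot(q, k) for k in k_cache])
        outputs.append([sum(w * v[j] for w, v in zip(weights, v_cache))
                        for j in range(value_dim)])
    return outputs

def kv_cache_floats(num_kv_heads, seq_len, head_dim):
    """Number of cached values: keys plus values, per KV head, per position."""
    return 2 * num_kv_heads * seq_len * head_dim

k_cache = [[1.0, 0.0], [0.0, 1.0]]
v_cache = [[2.0, 0.0], [0.0, 4.0]]
out = mqa_decode_step([[0.0, 0.0], [1.0, 0.0]], k_cache, v_cache)
```

With 8 query heads, `kv_cache_floats(1, ...)` is one eighth of `kv_cache_floats(8, ...)`. That reduction in data read per step is the memory-bandwidth saving the summary above refers to.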

Resources: Paper