Attention

Transformer internals, attention variants, KV/cache behavior, and depth-wise information flow.

4 Papers

8 Resource links

Latest month: 2026.06

Attention Architecture

2026.06 Attention Architecture

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

This paper proposes Lookahead Sparse Attention, which uses a separately trained neural memory indexer to predict future context needs and retain only query-critical KV chunks on GPU. FlashMemory reduces the physical KV cache footprint to 13.5% of full-context attention on average while preserving or slightly improving long-context accuracy.

Paper Code Hugging Face
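
To make the idea of retaining only query-critical KV chunks concrete, here is a minimal sketch of score-based chunk selection. It is not the paper's method: the scores stand in for the output of a learned indexer, and `select_chunks` and `keep_fraction` are hypothetical names introduced for illustration.

```python
import math

def select_chunks(chunk_scores, keep_fraction):
    """Return the sorted indices of the highest-scoring KV chunks.

    chunk_scores: one relevance score per KV chunk (assumed to come from
                  some indexer; here they are just numbers).
    keep_fraction: fraction of chunks to keep resident, in (0, 1].
    """
    n_keep = max(1, math.ceil(keep_fraction * len(chunk_scores)))
    # Rank chunk indices by score, highest first.
    ranked = sorted(range(len(chunk_scores)),
                    key=lambda i: chunk_scores[i], reverse=True)
    # Restore positional order so the kept chunks stay in sequence order.
    return sorted(ranked[:n_keep])

scores = [0.1, 0.9, 0.3, 0.7]   # hypothetical indexer scores
kept = select_chunks(scores, keep_fraction=0.5)
print(kept)  # [1, 3]
```

In this toy setting the resident cache shrinks in proportion to `keep_fraction`; the 13.5% figure reported above is an average measured by the authors, not something this sketch reproduces.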

2026.05 Attention Architecture

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

This paper proposes Group-Query Latent Attention, a minimal MLA modification that exposes both MQA-absorb and GQA decoding paths from the same trained weights. The runtime can select the path that matches target hardware without retraining or custom kernels, enabling H100-style compressed decoding, H20-oriented GQA plus MTP, and up to 8-way zero-redundancy tensor parallelism.

Paper Code
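
As background for the "absorb" path, the sketch below checks the algebraic identity that MLA-style decoding generally relies on: if a key is an up-projection of a shared latent, k = W c, then q · k equals (Wᵀ q) · c, so the projection can be folded into the query and attention can run directly on the cached latent. This is a generic illustration with made-up matrices, not GQLA's implementation, and it does not cover how the paper exposes the GQA path.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy sizes: head_dim = 2, latent_dim = 3.
W_up = [[1, 0, 2],
        [0, 1, 1]]        # key up-projection: latent -> head space
c = [1, 2, 3]             # cached latent for one past token
q = [2, 1]                # query for one head

# Path A: materialize the key, then score.
score_materialized = dot(q, matvec(W_up, c))

# Path B: absorb the up-projection into the query, score against the latent.
q_absorbed = matvec(transpose(W_up), q)
score_absorbed = dot(q_absorbed, c)

print(score_materialized, score_absorbed)  # 19 19
```

Path B never materializes per-head keys, which is why the cache can hold only the latent; every head then attends over the same cached vectors, as in multi-query attention.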

2026.03 Attention Architecture

Attention Residuals

This work replaces fixed residual accumulation with attention over previous layer outputs, enabling input-dependent depth-wise aggregation and reducing PreNorm-induced representation dilution. It also introduces Block AttnRes for scalable training with lower memory and communication overhead.

Paper Project
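
The contrast between fixed residual accumulation and attention over previous layer outputs can be sketched as follows. This is a simplified illustration under stated assumptions: the layer outputs themselves serve as keys and values, `query` stands in for whatever learned, input-dependent query the model would use, and `depthwise_attention` is a hypothetical name. The paper's actual formulation may differ.

```python
import math

def depthwise_attention(layer_outputs, query):
    """Mix previous layer outputs with softmax weights instead of summing them.

    layer_outputs: list of vectors, one per earlier layer, each of length dim.
    query: vector of length dim (assumed to be produced by the current layer).
    Returns (mixed_vector, weights).
    """
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, out)) / math.sqrt(dim)
              for out in layer_outputs]
    # Numerically stable softmax over the depth axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * out[d] for w, out in zip(weights, layer_outputs))
             for d in range(dim)]
    return mixed, weights

outputs = [[1.0, 0.0], [3.0, 2.0]]

# Fixed residual accumulation: every layer contributes with weight 1.
residual_sum = [sum(col) for col in zip(*outputs)]

# Attention over depth: a zero query gives uniform weights (a plain average).
mixed, weights = depthwise_attention(outputs, [0.0, 0.0])
print(residual_sum, mixed, weights)  # [4.0, 2.0] [2.0, 1.0] [0.5, 0.5]
```

Because the weights sum to one and depend on the query, the mix stays on the scale of a single layer output rather than growing with depth, and different inputs can emphasize different layers.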

2019.11 Attention Architecture

Fast Transformer Decoding: One Write-Head is All You Need

This paper introduces multi-query attention, sharing keys and values across attention heads to reduce the memory-bandwidth cost of incremental Transformer decoding. The variant speeds up decoding substantially while incurring only minor quality degradation relative to multi-head attention baselines.

Paper
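
A minimal sketch of one multi-query decoding step may help here. Each head keeps its own query, but all heads read the same key and value cache, so the cache holds one head's worth of keys and values instead of one per head. The function names and toy shapes are illustrative assumptions, and the code uses plain Python lists rather than any particular framework.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mqa_decode_step(queries, k_cache, v_cache):
    """One incremental decoding step of multi-query attention.

    queries: per-head query vectors, shape [n_heads][head_dim].
    k_cache, v_cache: shared by all heads, shape [seq_len][head_dim].
    Returns per-head outputs, shape [n_heads][head_dim].
    """
    head_dim = len(k_cache[0])
    scale = 1.0 / math.sqrt(head_dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, v_cache))
                        for d in range(head_dim)])
    return outputs

def kv_cache_elements(seq_len, n_kv_heads, head_dim, n_layers=1):
    """Number of cached scalars: keys plus values, per layer and KV head."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim

# With 8 query heads, multi-head attention caches 8 KV heads; MQA caches 1.
mha = kv_cache_elements(seq_len=1024, n_kv_heads=8, head_dim=64)
mqa = kv_cache_elements(seq_len=1024, n_kv_heads=1, head_dim=64)
print(mha // mqa)  # 8
```

The cache that must be read from memory at every decoding step shrinks by a factor equal to the number of heads, which is where the memory-bandwidth saving described above comes from.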