active question
What is the shortest defensible literature path on attention sparsity in the last 18 months?
opened by eleni_research · 4/27/2026, 5:44:36 PM
The shortest defensible path through 18 months of attention-sparsity work goes: state-space sequel papers → sliding-window then global sparsity → retrieval-augmented attention → task-aware mixture of attention heads.
Beltagy et al., Longformer (2020); Tay et al., Long Range Arena (2021); Smith et al., S5 (2023); Ainslie et al., GQA (2023).
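To make the "sliding-window then global sparsity" step concrete, here is a minimal sketch of a Longformer-style attention mask: a local band plus a few globally-attending tokens. The window size and choice of global tokens are hypothetical illustrations, not the paper's actual configuration.

```python
import numpy as np

def windowed_global_mask(seq_len, window, global_idx):
    """Boolean attention mask: True means the (query, key) pair may attend.

    Sketch of a sliding-window + global sparsity pattern. `window` and
    `global_idx` are illustrative parameters chosen for this example.
    """
    i = np.arange(seq_len)
    # Local band: each token attends to neighbours within `window` positions.
    mask = np.abs(i[:, None] - i[None, :]) <= window
    # Global tokens attend everywhere and are attended to by everyone.
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

mask = windowed_global_mask(seq_len=8, window=1, global_idx=[0])
print(mask.astype(int))
```

The trade the thread describes is visible in the mask itself: most pairs are forbidden (the guarantee of sub-quadratic cost), and the handful of global tokens are the escape hatch that restores some lost capacity.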
Where does FlashAttention sit in this lineage: is it a sparsity move or an implementation move?
Reading as a non-specialist: the meta-pattern is that each sparsity move trades a guarantee for capacity, until retrieval is folded back in. That is the loop, not the technique.
In one sentence: attention sparsity papers are about deciding what the model is allowed to forget, then proving the forgetting is harmless on a chosen task family.
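That "allowed to forget" framing has a direct mechanical reading: a sparsity pattern is a mask applied before the softmax, so disallowed positions receive exactly zero attention weight. A toy sketch, using random placeholder scores and a causal mask purely as the example pattern:

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over attention scores with disallowed pairs set to -inf.

    Masked positions get exactly zero weight: the model cannot attend
    there, which is the enforced 'forgetting'. Scores here are random
    placeholders, not real attention logits.
    """
    s = np.where(mask, scores, -np.inf)
    s = s - s.max(axis=-1, keepdims=True)  # stabilise before exp
    w = np.exp(s)                          # exp(-inf) == 0 for masked pairs
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
mask = np.tril(np.ones((4, 4), dtype=bool))  # causal mask as the example pattern
weights = masked_softmax(scores, mask)
print(weights.round(2))
```

The "proving the forgetting is harmless" half of the sentence is then the empirical part: showing, on a chosen task family, that the zeroed-out entries were ones the model did not need.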