PyoSignal
Diffusion Model Attention Sparsity Video Generation Optimization Video Distillation

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Paper ID: 2602.13515 • 21 Upvotes

📝 Key Summary

Proposes SpargeAttention2, which cuts attention computation in video diffusion models by up to 95% while preserving generation quality, and outperforms existing sparse attention methods.

📖 Details

Sparse attention has recently become an active research direction for reducing the attention cost of diffusion models. Existing approaches are limited by their Top-k or Top-p masking rules, and fine-tuning with sparse attention can degrade generation quality. This work analyzes these problems and proposes a hybrid masking rule that combines Top-k and Top-p, which keeps masking stable even at high sparsity. It also introduces a distillation-based fine-tuning objective that preserves generation quality during sparse attention fine-tuning. In experiments, SpargeAttention2 reaches 95% attention sparsity on video diffusion models while maintaining generation quality, outperforming prior methods.
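To make the idea of combining the two rules concrete, here is a minimal NumPy sketch. It assumes the hybrid rule keeps an entry when either Top-k or Top-p selects it (a union); the summary does not state how the paper combines them, so that choice, the function name `hybrid_topk_topp_mask`, and the parameters `k` and `p` are all hypothetical. It also works on a plain score matrix, whereas a real sparse attention kernel would typically mask whole blocks.

```python
import numpy as np

def hybrid_topk_topp_mask(scores, k, p):
    """Boolean keep-mask over the last axis of `scores`.

    Assumption (not from the paper summary): an entry is kept if
    either the Top-k rule or the Top-p rule selects it.
    """
    # Numerically stable softmax over the key axis.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Sort each row by descending probability (stable for ties).
    order = np.argsort(-probs, axis=-1, kind="stable")
    sorted_probs = np.take_along_axis(probs, order, axis=-1)

    # Top-k: the first k positions in sorted order.
    ranks = np.arange(probs.shape[-1])
    keep_k = np.broadcast_to(ranks < k, probs.shape)

    # Top-p: smallest prefix whose cumulative mass reaches p.
    # An entry is kept while the mass strictly before it is below p.
    mass_before = np.cumsum(sorted_probs, axis=-1) - sorted_probs
    keep_p = mass_before < p

    # Union of the two rules, scattered back to original positions.
    keep_sorted = keep_k | keep_p
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

# One sharply peaked row and one completely flat row.
scores = np.array([
    [10.0, 0, 0, 0, 0, 0, 0, 0],
    [0.0] * 8,
])
mask = hybrid_topk_topp_mask(scores, k=2, p=0.5)
kept_per_row = mask.sum(axis=-1)  # -> [2, 4]
```

On the peaked row Top-p alone would keep a single entry and Top-k supplies the second; on the flat row Top-k alone would keep two entries covering only 25% of the probability mass, and Top-p extends that to four. Each rule on its own behaves very differently across the two rows, which is the kind of gap a combined rule is meant to close.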

🔑 Key Points

  • Analysis of why Top-k and Top-p masking fail, and a proposed hybrid masking rule
  • Generation quality preserved through a distillation-based fine-tuning objective
  • 95% attention sparsity and a 16.2x speedup on video diffusion models
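As a rough sanity check on the sparsity and speedup figures, assume attention cost scales linearly with the fraction of entries kept. This is an idealization that ignores mask-prediction overhead and kernel efficiency. Under it, a sparsity of s bounds the attention speedup as follows:

```latex
\[
\text{speedup}_{\text{ideal}} = \frac{1}{1 - s}, \qquad
s = 0.95 \;\Rightarrow\; \frac{1}{1 - 0.95} = 20\times, \qquad
\frac{16.2}{20} = 0.81
\]
```

Under that assumption the reported 16.2x would be about 81% of the ideal bound, provided the figure refers to the attention operation itself. The summary does not say whether it is attention-only or end-to-end.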

💡 Practical Value

The method can substantially speed up inference for video generation models and reduce memory usage, which may help when training or deploying larger models.

✅ Recommended Actions

  • Apply SpargeAttention2 to an existing video diffusion model and measure the performance gain and memory savings
  • Apply SpargeAttention2's hybrid masking rule to other attention-based models to verify its effect
  • Combine the distillation-based fine-tuning objective with other sparse attention methods to try to improve their performance
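For the last item, a distillation-style objective can be sketched in a few lines of NumPy. The summary does not give the exact loss, so this assumes the simplest form: a mean squared error between the output of a sparse-attention "student" and a dense-attention "teacher" on the same inputs. The function names, the Top-2 mask, and the toy shapes are hypothetical.

```python
import numpy as np

def attention(q, k, v, keep_mask=None):
    """Single-head attention; masked-out entries get zero weight.

    Each row of `keep_mask` must keep at least one key, otherwise
    the softmax for that row is undefined.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if keep_mask is not None:
        scores = np.where(keep_mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def distillation_loss(student_out, teacher_out):
    """Assumed objective: MSE between student and teacher outputs."""
    return float(np.mean((student_out - teacher_out) ** 2))

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))

# Toy sparse mask: keep the 2 highest-scoring keys per query.
raw = q @ k.T
top2 = np.argsort(-raw, axis=-1)[:, :2]
keep = np.zeros(raw.shape, dtype=bool)
np.put_along_axis(keep, top2, True, axis=-1)

teacher_out = attention(q, k, v)                   # dense attention
student_out = attention(q, k, v, keep_mask=keep)   # sparse attention
loss = distillation_loss(student_out, teacher_out)
```

In actual fine-tuning this scalar would be minimized with respect to the sparse model's weights while the teacher stays frozen. The toy above has no trainable weights and only shows what the two outputs are and how they are compared.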