Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Paper ID: 2606.16429 • 2 Upvotes
Model Distillation Linear Attention Transformer Efficient Inference Inference Distillation Safety

📝 Key Summary

A precise, Taylor-series-based initialization technique that addresses the initialization problem arising when a pretrained Transformer is converted into an efficient Hybrid Linear Attention model.

📖 Details

Converting Transformers into Hybrid Linear Attention models for long-context processing is increasingly common, but naively copying weights degrades performance because the dynamics of the two models do not match. As a result, existing approaches spend too many training tokens in the early phase of training just repairing the initialization. This paper proposes Taylor-Calibrate, which uses a Taylor series to analyze the teacher model's attention statistics and then sets the student model's weights and gates based on them. The method also includes a short alignment stage that fits each layer to the teacher model's output. In experiments, the proposed method substantially improved zero-shot performance and reached the target performance with far fewer training tokens than existing approaches.
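This summary does not reproduce the paper's formulas, so the following is only a minimal sketch of why a Taylor series is a natural bridge between softmax attention and linear attention. It compares exact softmax weights with weights built from a truncated series exp(s) ≈ 1 + s + s²/2; a polynomial in q·k factorizes into a feature map, which is what lets it be computed in linear-attention form. The function names and the example scores are illustrative assumptions, not taken from the paper.

```python
import math

def softmax_weights(scores):
    """Exact softmax over a list of attention scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def taylor_weights(scores, order=2):
    """Normalized weights from a truncated Taylor series of exp(s) around 0."""
    approx = [
        sum(s ** n / math.factorial(n) for n in range(order + 1))
        for s in scores
    ]
    total = sum(approx)
    return [a / total for a in approx]

def max_abs_gap(scores, order=2):
    """Largest absolute difference between exact and truncated weights."""
    exact = softmax_weights(scores)
    approx = taylor_weights(scores, order)
    return max(abs(e - a) for e, a in zip(exact, approx))

# Hypothetical small scores (e.g. scaled q.k values); the approximation
# is only expected to be good when the scores stay close to zero.
example_scores = [0.3, -0.2, 0.1, 0.5]
gap_order1 = max_abs_gap(example_scores, order=1)
gap_order2 = max_abs_gap(example_scores, order=2)
print(f"order 1 gap: {gap_order1:.5f}, order 2 gap: {gap_order2:.5f}")
```

The gap grows as scores move away from zero, which is one reason an initialization derived this way would plausibly still need a short alignment stage afterward.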

🔑 Key Points

  • Precise Taylor-series-based initialization that tunes the student model's dynamics (decay, write, output-gate)
  • An algorithm that sets weights and gates from the teacher model's attention statistics
  • Strong zero-shot and recovery performance with a sharply reduced training-token budget
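To make "setting a gate from teacher attention statistics" concrete, here is a hypothetical sketch for the decay gate only. It is not the paper's algorithm: it assumes that a head's average attention mass falls off roughly exponentially with token distance, fits that rate by least squares in log space, and then picks a bias so that a sigmoid gate outputs the fitted decay. The names `fit_decay` and `gate_bias_for` and the sigmoid parameterization are assumptions; a real Gated DeltaNet layer may parameterize its decay differently.

```python
import math

def fit_decay(attn_by_lag):
    """Fit a per-step decay in (0, 1) from attention mass indexed by lag.

    Assumes attn_by_lag[d] ~ c * decay**d, so log(mass) is linear in d
    and the least-squares slope equals log(decay).
    """
    points = [(d, math.log(a)) for d, a in enumerate(attn_by_lag) if a > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive attention values")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    decay = math.exp(cov_xy / var_x)
    # Clamp so the logit below stays finite.
    return min(max(decay, 1e-4), 1.0 - 1e-4)

def gate_bias_for(decay):
    """Return the bias b such that sigmoid(b) == decay (assumed gate form)."""
    return math.log(decay / (1.0 - decay))
```

In use, `attn_by_lag` would come from averaging a teacher head's attention weights over calibration sequences, grouped by query-key distance; the resulting bias would initialize the corresponding student head's decay gate before the alignment stage.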

💡 Relevance

Offers practical guidance for converting an existing large Transformer into a Linear Attention model built for efficient inference while reducing training cost and minimizing performance loss.

✅ Actionable Items

  • Run a performance comparison between naive weight copying and Taylor-Calibrate when converting an existing Transformer model to a Gated DeltaNet architecture
  • Test initialization stability under different retained-layer policies
  • Analyze how convergence speed and final performance vary with the number of training tokens
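For the comparison and token-budget experiments above, a small helper like the following could turn evaluation logs into a "tokens to reach target" number per initialization method. This is a sketch under assumptions: each curve is a list of `(tokens_seen, score)` pairs sorted by `tokens_seen`, higher scores are better, and both function names are hypothetical. The curves themselves must come from your own runs.

```python
import math

def tokens_to_target(curve, target):
    """Return the first tokens_seen at which score >= target, else None.

    curve: list of (tokens_seen, score) pairs sorted by tokens_seen.
    """
    for tokens_seen, score in curve:
        if score >= target:
            return tokens_seen
    return None

def token_savings(baseline_curve, candidate_curve, target):
    """Ratio of baseline tokens to candidate tokens needed to hit target.

    Returns None if either run never reaches the target, and math.inf
    if the candidate already meets it at zero training tokens.
    """
    base = tokens_to_target(baseline_curve, target)
    cand = tokens_to_target(candidate_curve, target)
    if base is None or cand is None:
        return None
    if cand == 0:
        return math.inf
    return base / cand
```

Logging a checkpoint at zero training tokens for each method also captures the zero-shot comparison in the same curve, so both claims can be checked from one set of logs.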