PyoSignal

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Paper ID: 2604.14258 • 19 Upvotes
LLM Fine-tuning Reinforcement Learning SFT Post-training Vision

📝 Key Summary

A new fine-tuning framework for LLM post-training that overcomes the limitations of SFT and strengthens its integration with RL, yielding models that are more stable and generalize better.

📖 Details

Large language models (LLMs) are typically post-trained with SFT (supervised fine-tuning) and RL (reinforcement learning), but achieving efficient knowledge injection and strong generalization at the same time is difficult. This work presents a training-dynamics analysis showing that SFT suffers from single-path dependence, entropy collapse, and gradient explosion, driven by sparse implicit rewards and unstable inverse-probability weighting. Building on this diagnosis, the authors propose GFT, a unified post-training framework that addresses SFT's inherent limitations through two mechanisms: Group Advantage Learning (constructing diverse response groups with normalized contrastive supervision) and Dynamic Coefficient Rectification (adaptively bounding the inverse-probability weights). Experiments show that GFT consistently outperforms SFT-based methods and produces policies that integrate more smoothly with subsequent RL training.
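The "normalized contrastive supervision" over response groups resembles group-relative advantage estimation as used in methods like GRPO: rewards for several responses to the same prompt are normalized against the group's own statistics, so each response is supervised relative to its peers rather than by a sparse absolute signal. The paper's exact formulation is not reproduced here; the sketch below only illustrates that normalization idea under this assumption.

```python
def group_advantages(rewards, eps=1e-6):
    """Normalize scalar rewards within one prompt's response group.

    rewards: list of scalar rewards, one per sampled response.
    Returns per-response advantages with zero mean and (roughly) unit
    variance, so supervision is relative to the group rather than sparse.
    Illustrative sketch only, not the paper's exact formulation.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group where half the responses are rewarded and half are not, the rewarded responses receive positive advantages and the rest negative ones, giving every sample in the group a contrastive learning signal.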

🔑 Key Points

  • A training-dynamics analysis of SFT that diagnoses its root problems: sparse rewards, unstable weighting, and entropy collapse.
  • Group Advantage Learning, which mitigates reward sparsity by constructing diverse response groups and applying normalized contrastive supervision.
  • Dynamic Coefficient Rectification, which adaptively bounds the inverse-probability weights to stabilize optimization while preserving efficient knowledge injection.
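On the last point: when SFT is viewed as reward-weighted learning, the per-token gradient coefficient behaves like an inverse probability 1/π(token | context), which blows up on tokens the model currently assigns low probability — the gradient-explosion failure mode described above. Dynamic Coefficient Rectification bounds this coefficient adaptively; the exact rule is not given in this summary, so the sketch below uses a simple fixed cap (`max_coeff` is a hypothetical parameter, standing in for the paper's adaptive bound).

```python
import math

def rectified_coefficient(token_logprob, max_coeff=10.0):
    """Inverse-probability weight 1 / pi(token | context), capped so that
    low-probability tokens cannot explode the gradient.
    `max_coeff` is illustrative; the paper uses an adaptive bound.
    """
    weight = math.exp(-token_logprob)  # 1 / probability of the token
    return min(weight, max_coeff)
```

A token with probability 0.5 gets weight 2, while a token with probability 1e-8 is capped at `max_coeff` instead of receiving a weight of 1e8.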

💡 Practical Relevance

From a developer's perspective, GFT offers a practical alternative that overcomes the instability and limitations of standard SFT and makes the hand-off to reinforcement learning stages such as RLHF smoother, improving the overall performance and stability of LLMs. This can contribute to building stronger LLMs with better generalization.

✅ Actionable Items

  • Apply GFT to an existing SFT-based LLM fine-tuning pipeline and verify whether it improves performance and stability over plain SFT.
  • Connect a GFT-trained model to a reinforcement learning setup such as RLHF (Reinforcement Learning from Human Feedback) and evaluate the resulting RL training efficiency and final model quality.
  • Test GFT's generalization and task fit across diverse domains and tasks (e.g., code generation, summarization, dialogue).