PyoSignal Logo
PyoSignal
Back to Research

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

Paper ID: 2604.20796 โ€ข 207 Upvotes
Diffusion Model Multimodal Image Generation LLM Reasoning Vision Inference Distillation
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

๐Ÿ“ ํ•ต์‹ฌ ์š”์•ฝ

LLaDA2.0-Uni๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๋ฅผ ํ†ตํ•ฉ์ ์œผ๋กœ ์ดํ•ดํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š” diffusion LLM์œผ๋กœ, ์ฐจ์„ธ๋Œ€ ํ†ตํ•ฉ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์˜ ๊ฐ€๋Šฅ์„ฑ์„ ์ œ์‹œํ•˜๋ฉฐ, ํŠนํžˆ ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ฐ ํŽธ์ง‘ ์ž‘์—…์—์„œ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋‚ด์šฉ

๊ธฐ์กด์˜ VLM์€ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ดํ•ด์™€ ์ƒ์„ฑ์„ ํ†ตํ•ฉ์ ์œผ๋กœ ์ง€์›ํ•˜๋Š” ๋ฐ ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด LLaDA2.0-Uni๋Š” discrete diffusion LLM์„ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ์ž…๋ ฅ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ, ์ด ๋ชจ๋ธ์€ semantic discrete tokenizer, MoE ๊ธฐ๋ฐ˜ dLLM ๋ฐฑ๋ณธ, diffusion ๋””์ฝ”๋”๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์™€ ๋งž์ถคํ˜• ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•ด LLaDA2.0-Uni๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ดํ•ด๋Š” ๋ฌผ๋ก  ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ฐ ํŽธ์ง‘์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ๋ชจ๋ธ์€ interleaved ์ƒ์„ฑ ๋ฐ ์ถ”๋ก ์„ ๊ธฐ๋ณธ์ ์œผ๋กœ ์ง€์›ํ•˜์—ฌ ์ฐจ์„ธ๋Œ€ ํ†ตํ•ฉ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์„ ์œ„ํ•œ ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ”‘ ์ฃผ์š” ๋‚ด์šฉ (Key Points)

  • ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€์˜ ํ†ตํ•ฉ์  ์ดํ•ด ๋ฐ ์ƒ์„ฑ์„ ์œ„ํ•œ dLLM ๊ตฌ์กฐ ์ œ์•ˆ
  • SigLIP-VQ๋ฅผ ํ†ตํ•œ ์‹œ๊ฐ์  ์ž…๋ ฅ์˜ ์ด์‚ฐํ™” ๋ฐ diffusion decoder๋ฅผ ํ†ตํ•œ ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€ ๋ณต์›
  • prefix-aware ์ตœ์ ํ™” ๋ฐ few-step distillation์„ ํ†ตํ•œ ์ถ”๋ก  ํšจ์œจ์„ฑ ํ–ฅ์ƒ

๐Ÿ’ก ์‹ค๋ฌด์  ๊ฐ€์น˜ (Relevance)

LLaDA2.0-Uni๋Š” ์ด๋ฏธ์ง€ ์ƒ์„ฑ, ํŽธ์ง‘, ๊ทธ๋ฆฌ๊ณ  ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๊ฐ€ ํ˜ผํ•ฉ๋œ ์ฝ˜ํ…์ธ ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฐœ๋ฐœ์ž์—๊ฒŒ ์œ ์šฉํ•˜๋ฉฐ, ํŠนํžˆ ๋ณต์žกํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž‘์—…์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

โœ… ์ถ”์ฒœ ์•ก์…˜ (Actionable Items)

  • ์ œ๊ณต๋œ GitHub ์ €์žฅ์†Œ์—์„œ ๋ชจ๋ธ ๋ฐ ์ฝ”๋“œ ํ™•์ธ
  • ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ฐ ํŽธ์ง‘ ๊ด€๋ จ task์— ์ ์šฉํ•ด๋ณด๊ณ  ์„ฑ๋Šฅ ํ…Œ์ŠคํŠธ
  • ๋ชจ๋ธ ๊ตฌ์กฐ ๋ฐ ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ์„ ๋ถ„์„ํ•˜์—ฌ ํŠน์ • ์‚ฌ์šฉ ์‚ฌ๋ก€์— ๋งž๊ฒŒ ์กฐ์ •