PyoSignal
Tags: Diffusion Model, Latent Representation, Image Generation, Video Generation, Vision

Unified Latents (UL): How to train your latents

Paper ID: 2602.17270

๐Ÿ“ ํ•ต์‹ฌ ์š”์•ฝ

The Unified Latents (UL) framework uses a diffusion prior and a diffusion model to improve latent representation learning. It reaches an FID of 1.4 on ImageNet-512 and a new state-of-the-art FVD of 1.3 on Kinetics-600, improving both the efficiency and the quality of image and video generation models.

📖 Details

์ž ์žฌ ํ‘œํ˜„ ํ•™์Šต์€ ์ด๋ฏธ์ง€ ๋ฐ ๋น„๋””์˜ค ์ƒ์„ฑ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค. ๊ธฐ์กด ๋ฐฉ์‹์€ ํ•™์Šต ํšจ์œจ์„ฑ์ด ๋‚ฎ๊ฑฐ๋‚˜ ์ƒ์„ฑ ํ’ˆ์งˆ์ด ๋–จ์–ด์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” diffusion prior์™€ diffusion model๋กœ ๊ณต๋™ ์ •๊ทœํ™”๋œ ์ž ์žฌ ํ‘œํ˜„์„ ํ•™์Šตํ•˜๋Š” Unified Latents (UL) ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. UL์€ ์ธ์ฝ”๋”์˜ ์ถœ๋ ฅ ๋…ธ์ด์ฆˆ๋ฅผ prior์˜ ์ตœ์†Œ ๋…ธ์ด์ฆˆ ๋ ˆ๋ฒจ์— ์—ฐ๊ฒฐํ•˜์—ฌ ๊ฐ„๋‹จํ•˜๋ฉด์„œ๋„ tightํ•œ bitrate upper bound๋ฅผ ์ œ๊ณตํ•˜๋Š” ํ•™์Šต ๋ชฉํ‘œ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ImageNet-512์—์„œ FID 1.4๋ฅผ ๋‹ฌ์„ฑํ•˜๋ฉฐ, Stable Diffusion ์ž ์žฌ ๊ณต๊ฐ„ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋ณด๋‹ค ์ ์€ FLOPs๋กœ ๋†’์€ ์žฌ๊ตฌ์„ฑ ํ’ˆ์งˆ(PSNR)์„ ๋ณด์ธ๋‹ค. Kinetics-600์—์„œ๋Š” FVD 1.3์œผ๋กœ ์ƒˆ๋กœ์šด SOTA๋ฅผ ๋‹ฌ์„ฑํ–ˆ๋‹ค.

🔑 Key Points

  • Proposes UL, a new latent representation learning framework built on a diffusion prior and a diffusion model
  • Develops a simple training objective that gives a tight upper bound on the latent bitrate
  • Reaches FID 1.4 on ImageNet-512 with fewer FLOPs than Stable Diffusion latent-based models, and a new state-of-the-art FVD of 1.3 on Kinetics-600

💡 Practical Relevance

UL ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์ด๋ฏธ์ง€ ๋ฐ ๋น„๋””์˜ค ์ƒ์„ฑ ๋ชจ๋ธ ๊ฐœ๋ฐœ ์‹œ ํ•™์Šต ํšจ์œจ์„ฑ์„ ๋†’์ด๊ณ  ์ƒ์„ฑ ํ’ˆ์งˆ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๋ฐ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ํŠนํžˆ ์ œํ•œ๋œ ์ปดํ“จํŒ… ์ž์› ํ™˜๊ฒฝ์—์„œ ๋”์šฑ ์œ ์šฉํ•˜๋‹ค.

✅ Recommended Actions

  • UL ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ธฐ์กด ์ด๋ฏธ์ง€/๋น„๋””์˜ค ์ƒ์„ฑ ๋ชจ๋ธ์— ์ ์šฉํ•˜์—ฌ ์„ฑ๋Šฅ ํ–ฅ์ƒ ์‹คํ—˜
  • UL ํ•™์Šต ๋ชฉํ‘œ๋ฅผ ๋‹ค๋ฅธ ์ข…๋ฅ˜์˜ ์ž ์žฌ ํ‘œํ˜„ ํ•™์Šต ๋ฐฉ์‹๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋ก  ํƒ์ƒ‰
  • UL์˜ hyperparameter (noise level ๋“ฑ) ํŠœ๋‹์„ ํ†ตํ•ด ํŠน์ • ๋ฐ์ดํ„ฐ์…‹์— ์ตœ์ ํ™”