ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Paper ID: 2606.19531 • 8 Upvotes
Robot Control Diffusion Models World Models Efficiency Vision Video Inference

📝 Key Summary

A World Action Model (WAM) framework for robot control that replaces video generation with a pretrained image editing model, improving both computational efficiency and control accuracy.

📖 Details

Existing World Action Models (WAMs) try to predict the future through video generation, but they suffer from excessive compute cost, generation of unnecessary detail, and errors that arise in long-horizon prediction. This paper proposes ImageWAM, a framework that uses an image editing model instead of video generation. Image editing focuses only on the visual changes that are relevant to the action, and it brings pretrained knowledge that is well suited to mapping task instructions to localized visual changes. At inference time, the target frame is not generated directly. Instead, the KV cache produced during the image-editing denoising process serves as context for a flow-matching-based action expert. In the experiments, ImageWAM outperforms existing VLA and competing models while substantially reducing compute (FLOPs) and latency.
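To make the inference path above more concrete, here is a minimal sketch of flow-matching action sampling: start from Gaussian noise and integrate a velocity field from t=0 to t=1 with Euler steps. It is an illustration under stated assumptions, not the paper's implementation. The names `euler_flow_sample`, `toy_velocity`, and the `target_action` key are hypothetical, and the toy velocity reads its target from a plain dictionary, whereas the action expert described above would condition on the KV cache from the image editing model.

```python
import random


def euler_flow_sample(velocity_fn, context, action_dim, num_steps=10, seed=0):
    """Integrate da/dt = v(a, t, context) from t=0 (noise) to t=1 (action)."""
    rng = random.Random(seed)
    # Start from standard Gaussian noise, as is usual for flow matching.
    action = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1, so (1 - t) is never zero
        velocity = velocity_fn(action, t, context)
        action = [a + dt * v for a, v in zip(action, velocity)]
    return action


def toy_velocity(action, t, context):
    """Hypothetical stand-in for a learned action expert.

    For straight-line probability paths, the velocity that carries the
    current point to a known endpoint x1 is (x1 - a) / (1 - t).
    """
    target = context["target_action"]
    return [(x1 - a) / (1.0 - t) for a, x1 in zip(action, target)]


target = [0.5, -0.25, 1.0]
sampled = euler_flow_sample(toy_velocity, {"target_action": target}, action_dim=3)
print(sampled)
```

A trained action expert would replace `toy_velocity` with a network that predicts the velocity from the noisy action, the time step, and the cached context. The integration loop itself would stay the same.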

🔑 Key Points

  • Uses the pretrained knowledge of an image editing model, rather than video generation, to focus on action-relevant visual changes
  • Uses the KV cache from the image editing process as context for the action expert, maximizing inference efficiency
  • Cuts compute to roughly 1/6 and latency to roughly 1/4 of existing video-based WAMs
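The KV-cache point in the list above can be illustrated with plain single-head attention: the backbone's tokens are projected to keys and values once, and a query from the action side then attends over that cache without rerunning the backbone. This is a toy sketch in pure Python. The function names, the identity projection matrices, and the two-dimensional tokens are all assumptions made for illustration and do not reflect the actual ImageWAM architecture.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_kv_cache(tokens, w_k, w_v):
    """Project each backbone token once and keep keys/values for reuse."""
    return {
        "keys": [matvec(w_k, tok) for tok in tokens],
        "values": [matvec(w_v, tok) for tok in tokens],
    }


def attend(query, cache):
    """Scaled dot-product attention of one query over the cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache["keys"]]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(cache["values"][0])
    return [sum(w * v[i] for w, v in zip(weights, cache["values"])) for i in range(dim)]


identity = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical projection weights
image_tokens = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical backbone tokens
cache = build_kv_cache(image_tokens, identity, identity)

# A query from the action side that aligns with the first token's key.
context_vector = attend([10.0, 0.0], cache)
print(context_vector)
```

Because the cache is built once per observation, each denoising step of the action head only pays for the attention lookup, which is the efficiency argument the list makes.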

💡 Relevance

By skipping the heavy video generation step in robot control models and using only the KV cache, the paper presents an efficient architecture that can be applied to robot systems where real-time operation matters.

✅ Actionable Items

  • Implement and test KV cache extraction logic for a pretrained diffusion-based image editing model
  • Run alignment experiments between a flow-matching action head and the image editing model
  • Benchmark latency and inference speed across a range of simulation environments
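For the benchmarking item, a small timing harness along the following lines may be a reasonable starting point. It is a sketch under assumptions: `benchmark_latency` and `dummy_policy_step` are hypothetical names, and the dummy step only stands in for one policy inference call. When timing a GPU model, the device would also need to be synchronized before each clock read, otherwise asynchronous kernels make the numbers look too good.

```python
import statistics
import time


def benchmark_latency(step_fn, warmup=3, repeats=20):
    """Time repeated calls to step_fn and summarize latency in milliseconds."""
    for _ in range(warmup):
        step_fn()  # warm-up calls are excluded from the statistics
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        step_fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    median_ms = statistics.median(samples_ms)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": median_ms,
        "p95_ms": samples_ms[p95_index],
        # Upper bound on the control rate if inference is the only cost.
        "max_hz": 1000.0 / median_ms if median_ms > 0 else float("inf"),
    }


def dummy_policy_step():
    """Hypothetical placeholder for one policy inference call."""
    return sum(i * i for i in range(10_000))


report = benchmark_latency(dummy_policy_step)
print(report)
```

Reporting the median and a tail percentile rather than only the mean is a deliberate choice here, since a control loop is constrained by its slow steps and not by its average step.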