CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Paper ID: 2604.19636 • 58 Upvotes
Diffusion Model Video Synthesis HOI Transformer Vision Video Audio Inference

๐Ÿ“ ํ•ต์‹ฌ ์š”์•ฝ

CoInteract๋Š” Diffusion Transformer ๊ธฐ๋ฐ˜์œผ๋กœ ์‚ฌ๋žŒ-๊ฐ์ฒด ์ƒํ˜ธ์ž‘์šฉ ๋น„๋””์˜ค ์ƒ์„ฑ ์‹œ ์†๊ณผ ์–ผ๊ตด์˜ ๊ตฌ์กฐ์  ์•ˆ์ •์„ฑ ๋ฐ ๋ฌผ๋ฆฌ์  ํ˜„์‹ค๊ฐ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ์ƒˆ๋กœ์šด ํ”„๋ ˆ์ž„์›Œํฌ์ž…๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋‚ด์šฉ

์‚ฌ๋žŒ-๊ฐ์ฒด ์ƒํ˜ธ์ž‘์šฉ(HOI) ๋น„๋””์˜ค ํ•ฉ์„ฑ์€ ์ „์ž ์ƒ๊ฑฐ๋ž˜, ๋””์ง€ํ„ธ ๊ด‘๊ณ  ๋“ฑ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ์—์„œ ์ค‘์š”ํ•˜์ง€๋งŒ, ๊ธฐ์กด diffusion ๋ชจ๋ธ์€ ๊ตฌ์กฐ์  ์•ˆ์ •์„ฑ๊ณผ ๋ฌผ๋ฆฌ์  ํ˜„์‹ค๊ฐ ์ธก๋ฉด์—์„œ ํ•œ๊ณ„๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด CoInteract๋Š” ์‚ฌ๋žŒ ์ฐธ์กฐ ์ด๋ฏธ์ง€, ์ œํ’ˆ ์ฐธ์กฐ ์ด๋ฏธ์ง€, ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ, ์Œ์„ฑ ์˜ค๋””์˜ค๋ฅผ ์กฐ๊ฑด์œผ๋กœ HOI ๋น„๋””์˜ค๋ฅผ ์ƒ์„ฑํ•˜๋Š” ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. CoInteract๋Š” Human-Aware MoE๋ฅผ ํ†ตํ•ด ์˜์—ญ๋ณ„ ์ „๋ฌธ๊ฐ€์—๊ฒŒ ํ† ํฐ์„ ๋ผ์šฐํŒ…ํ•˜์—ฌ ๊ตฌ์กฐ์  ์ •ํ™•๋„๋ฅผ ๋†’์ด๊ณ , Spatially-Structured Co-Generation์„ ํ†ตํ•ด RGB ์ŠคํŠธ๋ฆผ๊ณผ HOI ๊ตฌ์กฐ ์ŠคํŠธ๋ฆผ์„ ๊ณต๋™์œผ๋กœ ๋ชจ๋ธ๋งํ•˜์—ฌ ์ƒํ˜ธ์ž‘์šฉ ๊ธฐํ•˜ํ•™์  ์‚ฌ์ „ ์ •๋ณด๋ฅผ ์ฃผ์ž…ํ•ฉ๋‹ˆ๋‹ค. ์‹คํ—˜ ๊ฒฐ๊ณผ, CoInteract๋Š” ๊ตฌ์กฐ์  ์•ˆ์ •์„ฑ, ๋…ผ๋ฆฌ์  ์ผ๊ด€์„ฑ, ์ƒํ˜ธ์ž‘์šฉ ํ˜„์‹ค๊ฐ ์ธก๋ฉด์—์„œ ๊ธฐ์กด ๋ฐฉ๋ฒ•๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค.

🔑 Key Points

  • Uses region-specialized experts through a Human-Aware Mixture-of-Experts (MoE)
  • Jointly models the RGB and HOI structure streams through Spatially-Structured Co-Generation
  • Uses the HOI stream only during training, so it adds no overhead at inference

💡 Practical Value (Relevance)

By raising structural stability and physical realism in HOI video generation, CoInteract can contribute to building more natural, realistic virtual environments and to content production.

✅ Recommended Actions (Actionable Items)

  • Check CoInteract's performance firsthand through the provided demo videos
  • Analyze in detail how Human-Aware MoE and Spatially-Structured Co-Generation are implemented
  • Apply CoInteract to your own dataset and assess the potential for performance gains