
Thinking with Visual Grounding

Paper ID: 2606.16122 • 4 Upvotes
VLM Reasoning Reinforcement Learning Object Grounding Agent Vision Benchmark Distillation

📝 Key Summary

Proposes a new visual thinking approach that combines text with visual grounding (Point/Box) during reasoning, improving both the verifiability of the model's logic and its performance.

📖 Details

Recent VLMs generate natural-language reasoning, but they do not state which image regions that reasoning rests on, which makes it hard to verify and supervise. To address this, the authors introduce "Visually Grounded Thinking", in which the model marks objects in the image with points or boxes alongside its textual reasoning. For training, they propose a scalable synthetic data pipeline built on a SAM3-based agent, and "Grounding-aware RL", which considers answer correctness and grounding consistency at the same time. In experiments, applying the approach to Gemma3-4B outperformed both the baseline model and non-grounded models on counting and spatial reasoning benchmarks. On spatial reasoning tasks in particular, the small model exceeded the performance of larger models.
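The idea of rewarding answer correctness and grounding consistency together can be sketched as below. This summary does not give the paper's actual reward formula, so everything here is an assumption made for illustration: the function names, the use of box IoU as the consistency measure, the exact-match answer check, and the additive `weight` of 0.5 are all hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_score(pred: List[Box], ref: List[Box]) -> float:
    """Mean best-match IoU of each reference box against the predictions."""
    if not ref:
        # Nothing to ground: full score only if the model predicted nothing.
        return 1.0 if not pred else 0.0
    if not pred:
        return 0.0
    return sum(max(iou(r, p) for p in pred) for r in ref) / len(ref)


def grounding_aware_reward(
    pred_answer: str,
    ref_answer: str,
    pred_boxes: List[Box],
    ref_boxes: List[Box],
    weight: float = 0.5,  # hypothetical trade-off, not taken from the paper
) -> float:
    """Answer reward (0 or 1) plus a weighted grounding-consistency term."""
    correct = 1.0 if pred_answer.strip().lower() == ref_answer.strip().lower() else 0.0
    return correct + weight * grounding_score(pred_boxes, ref_boxes)


# Correct answer with a perfectly matching box: 1.0 + 0.5 * 1.0
print(grounding_aware_reward("3", " 3 ", [(0, 0, 2, 2)], [(0, 0, 2, 2)]))  # 1.5
```

An additive form keeps a partial reward for well-grounded reasoning even when the final answer is wrong. A multiplicative form would instead give credit only when both are right. Which of these, if either, matches the paper is not stated here.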

🔑 Key Points

  • Proposes a new reasoning framework that interleaves textual reasoning with visual grounding (Point/Box)
  • Builds large-scale visual reasoning data through a SAM3-based synthetic data pipeline
  • Introduces a reinforcement learning technique (Grounding-aware RL) that combines answer accuracy with visual grounding consistency
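To make the interleaving of text and Point/Box references concrete, here is a minimal sketch of a parser for such a reasoning trace. The `<box>x1,y1,x2,y2</box>` and `<point>x,y</point>` tag syntax is an assumption made for this example. The summary does not specify the paper's actual output format, and `parse_trace` is a hypothetical helper.

```python
import re
from typing import Dict, List

# Hypothetical tag syntax; the backreference \1 forces matching open/close tags.
TAG = re.compile(r"<(box|point)>([^<]*)</\1>")


def parse_trace(trace: str) -> List[Dict]:
    """Split a trace into ordered text, box, and point segments."""
    segments: List[Dict] = []
    pos = 0
    for m in TAG.finditer(trace):
        text = trace[pos:m.start()].strip()
        if text:
            segments.append({"type": "text", "value": text})
        kind = m.group(1)
        coords = [float(v) for v in m.group(2).split(",")]
        expected = 4 if kind == "box" else 2
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} numbers, got {len(coords)}")
        segments.append({"type": kind, "value": coords})
        pos = m.end()
    tail = trace[pos:].strip()
    if tail:
        segments.append({"type": "text", "value": tail})
    return segments


trace = (
    "The first cup <box>10,20,50,80</box> is left of the plate "
    "<point>120,60</point>, so the answer is 1."
)
for segment in parse_trace(trace):
    print(segment["type"], segment["value"])
```

Keeping the segments in order preserves which statement each region supports, and that ordering is what lets a reviewer check the reasoning step by step.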

💡 Practical Relevance

Because the model states which part of the image each step of its reasoning is based on, results become more trustworthy, and it becomes possible to build AI agents that are easier to debug and verify.

✅ Recommended Actions

  • Test building a data synthesis pipeline that uses a segmentation model such as SAM
  • Experiment with designing a visual grounding reward function for RLHF, in addition to the reward for the text answer
  • Apply visual grounding training to a small language model (SLM) and measure the size of the performance gain
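As a starting point for the first action item, the sketch below turns a binary segmentation mask into the two annotation types discussed here, a box and a point. It assumes the mask has already been produced by some segmentation model and is given as rows of 0/1 values. It calls no SAM API, and `mask_to_annotations` is a hypothetical helper, not part of the paper's pipeline.

```python
from typing import Dict, List, Optional


def mask_to_annotations(mask: List[List[int]]) -> Optional[Dict]:
    """Derive a bounding box and a centroid point from a binary mask.

    Coordinates are pixel indices; the box is (x_min, y_min, x_max, y_max)
    with inclusive bounds. Returns None for an empty mask.
    """
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) for v in row if v]
    if not xs:
        return None
    box = (min(xs), min(ys), max(xs), max(ys))
    point = (sum(xs) / len(xs), sum(ys) / len(ys))
    return {"box": box, "point": point}


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(mask_to_annotations(mask))  # {'box': (1, 1, 2, 2), 'point': (1.5, 1.5)}
```

For non-convex shapes the centroid can fall outside the mask, so a real pipeline would likely need a different rule for choosing the point, such as picking a pixel that lies inside the mask.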