
S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Paper ID: 2606.20515 • 25 Upvotes
Agent Spatial-Intelligence VLM 3D-Vision Reasoning Vision Video Benchmark Inference

📝 Key Summary

An agent framework that moves beyond static image perception and achieves 3D spatial intelligence by accumulating spatio-temporal evidence.

📖 Details

Existing VLM-based agents rely on static, frame-level observations, which limits their ability to reason about continuous 3D environments. To address this, S-Agent proposes a new paradigm that recasts spatial reasoning as a process of accumulating spatio-temporal evidence. A VLM acts as a semantic planner that decides which evidence is needed, while hierarchical spatial tools and expert models convert 2D objects into 3D geometric information and integrate it. Scene Memory and Agent Memory maintain the evolving scene state and the reasoning context, respectively. In experiments, S-Agent substantially improved the performance of existing VLMs without any additional training, and S-Agent-8B, trained with supervised fine-tuning (SFT) on the generated data, showed strong performance despite being a small model.

🔑 Key Points

  • Introduces spatio-temporal evidence accumulation in place of frame-centric perception
  • Designs an agent architecture that combines a VLM (planner) with hierarchical spatial tools (experts)
  • Uses Scene Memory to maintain the scene state and Agent Memory to maintain the reasoning context
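The three points above can be sketched as a small loop in which a planner decides whether more evidence is needed, a tool supplies 3D observations, and two memories persist state. Everything here is a toy illustration under stated assumptions: `plan` is a rule-based stand-in for the VLM planner, `locate_tool` is a stub for an expert model, and the `SceneMemory` and `AgentMemory` classes borrow the paper's terms but not its implementation, which the summary does not describe.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class SceneMemory:
    """Accumulates per-object 3D positions across frames (running mean)."""
    sums: Dict[str, List[float]] = field(default_factory=dict)
    counts: Dict[str, int] = field(default_factory=dict)

    def add(self, name: str, p: Point) -> None:
        s = self.sums.setdefault(name, [0.0, 0.0, 0.0])
        for i in range(3):
            s[i] += p[i]
        self.counts[name] = self.counts.get(name, 0) + 1

    def position(self, name: str) -> Optional[Point]:
        n = self.counts.get(name, 0)
        if n == 0:
            return None
        s = self.sums[name]
        return (s[0] / n, s[1] / n, s[2] / n)


@dataclass
class AgentMemory:
    """Keeps the reasoning trace: which actions were taken and why."""
    steps: List[str] = field(default_factory=list)


def locate_tool(frame: Dict[str, Point]) -> Dict[str, Point]:
    """Stub expert tool: a real system would run detection plus 3D lifting."""
    return dict(frame)


def plan(scene: SceneMemory, a: str, b: str, min_views: int) -> str:
    """Rule-based stand-in for the VLM planner (hypothetical policy)."""
    seen = min(scene.counts.get(a, 0), scene.counts.get(b, 0))
    return "answer" if seen >= min_views else "observe"


def run_agent(frames, a: str, b: str, min_views: int = 2):
    """Answer 'how far apart are a and b?' by accumulating evidence over frames."""
    scene, log = SceneMemory(), AgentMemory()
    for idx, frame in enumerate(frames):
        if plan(scene, a, b, min_views) == "answer":
            log.steps.append("enough evidence, answering")
            break
        log.steps.append(f"call locate_tool on frame {idx}")
        for name, p in locate_tool(frame).items():
            scene.add(name, p)
    pa, pb = scene.position(a), scene.position(b)
    if pa is None or pb is None:
        return None, log  # an object was never observed
    return math.dist(pa, pb), log


frames = [
    {"chair": (0.0, 0.0, 2.0)},
    {"table": (3.0, 0.0, 2.0), "chair": (0.0, 0.0, 2.0)},
    {"table": (3.0, 0.0, 2.0)},
    {"chair": (9.0, 9.0, 9.0)},  # never processed: planner stops before this frame
]
distance, log = run_agent(frames, "chair", "table")
print(distance)  # 3.0
```

Keeping the scene state separate from the reasoning trace means the planner can be swapped (for example, for a real VLM call) without touching how geometric evidence is stored.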

💡 Practical Value (Relevance)

Goes beyond simple image captioning and presents an agent design pattern for interacting with real 3D environments, such as robot control and digital twins.

✅ Recommended Actions (Actionable Items)

  • Implement and test a workflow that pairs an existing VLM with 2D-to-3D geometry tools
  • Verify the effect of the memory mechanisms (Scene/Agent Memory) on multi-view image input
  • Run SFT experiments on small models using high-quality spatial reasoning datasets such as S-300K
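When testing with multi-view input, observations from different cameras have to be expressed in one shared frame before any memory can merge them. The sketch below uses the standard rigid transform p_world = R · p_cam + t; the function names and the assumption that each view's camera pose is known are illustrative only and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def yaw_rotation(theta: float):
    """3x3 rotation about the y-axis by `theta` radians (row-major tuples)."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))


def cam_to_world(p_cam: Point, rotation: Sequence[Sequence[float]], translation: Point) -> Point:
    """Apply p_world = R @ p_cam + t using plain Python arithmetic."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# View A: camera at the origin looking along +z, object 2 m straight ahead.
p_a = cam_to_world((0.0, 0.0, 2.0), yaw_rotation(0.0), (0.0, 0.0, 0.0))

# View B: camera at (2, 0, 2), turned to look along -x, same object 2 m ahead.
p_b = cam_to_world((0.0, 0.0, 2.0), yaw_rotation(-math.pi / 2), (2.0, 0.0, 2.0))

# Both views should place the object at (approximately) the same world point.
print(p_a, p_b)
```

If the two world-frame points agree, they can be merged into a single scene entry; a large disagreement suggests a pose or depth error worth checking before evaluating any memory mechanism.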