
ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

Paper ID: 2607.00466 • 19 Upvotes
LLM Serving • MoE • Load Balancing • Distributed Systems • Evaluation • Inference
📝 Key Summary

A technique that reduces decode latency in PD-disaggregated serving by routing requests according to the expert activation patterns of MoE models.

📖 Details

In Prefill-Decode (PD) disaggregated LLM serving, existing load-balancing schemes do not adequately account for the characteristics of MoE models. Balancing worker load alone cannot resolve the latency differences caused by loading the weights of the experts that each request activates. To address this, ELDR builds an "expert signature" from the expert activation pattern observed during prefill. Offline, it partitions the signature space with K-means; online, it performs locality-band routing, which sends each request to a worker whose signature is similar and whose load is low. A signature cache coupled to the KV cache keeps signatures accurate even when prefix caching is enabled. In experiments, ELDR implemented in vLLM improves TPOT (Time Per Output Token) by up to 13.9% over existing load-balancing schemes.
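The routing idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ELDR's actual implementation: the signature is assumed to be a normalized histogram of the expert IDs activated during prefill, each worker is assumed to own one centroid produced by the offline K-means step, and the "band" is interpreted as a distance tolerance around the closest centroid. The names `expert_signature`, `Worker`, `route`, and `band` are hypothetical.

```python
import math
from dataclasses import dataclass


def expert_signature(expert_ids, num_experts):
    """Normalized histogram of the experts activated during prefill (assumed form)."""
    counts = [0.0] * num_experts
    for e in expert_ids:
        counts[e] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]


def distance(a, b):
    """Euclidean distance between two signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class Worker:
    name: str
    centroid: list  # signature-space centroid, assumed to come from offline K-means
    load: int = 0   # number of in-flight decode requests


def route(signature, workers, band=0.2):
    """Pick the least-loaded worker among those within `band` of the closest centroid."""
    dists = {w.name: distance(signature, w.centroid) for w in workers}
    best = min(dists.values())
    # The locality band: workers whose centroid is almost as close as the best one.
    candidates = [w for w in workers if dists[w.name] <= best + band]
    # Inside the band, prefer low load; break ties by distance.
    chosen = min(candidates, key=lambda w: (w.load, dists[w.name]))
    chosen.load += 1
    return chosen


workers = [
    Worker("decode-0", [0.7, 0.3, 0.0, 0.0], load=5),
    Worker("decode-1", [0.6, 0.3, 0.1, 0.0], load=1),
    Worker("decode-2", [0.0, 0.0, 0.5, 0.5], load=0),
]
sig = expert_signature([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], num_experts=4)
print(route(sig, workers).name)  # prints "decode-1"
```

In this toy setup the closest worker (`decode-0`) is busy and the idle worker (`decode-2`) serves a very different expert mix, so the request lands on `decode-1`, which is both nearby in signature space and lightly loaded. Setting `band=0.0` degenerates to pure locality routing, while a very large `band` degenerates to plain least-loaded balancing.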

🔑 Key Points

  • Extracts an "expert signature" from the expert activation pattern of the prefill phase
  • Locality-band routing algorithm that considers signature similarity and worker load together
  • Handles prefix caching by caching signatures at the granularity of KV cache blocks
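The block-level signature cache can be pictured as follows. The motivation assumed here is that tokens served from the prefix cache skip prefill, so their expert activations would otherwise be missing from the signature. This is a sketch under that assumption, not the paper's code or vLLM's API: `SignatureCache`, `block_hashes`, `full_signature`, and `BLOCK_SIZE` are hypothetical names, and the chained hash only imitates the idea of keying a block by its whole prefix.

```python
from collections import Counter

BLOCK_SIZE = 4  # tokens per KV cache block (hypothetical; real block sizes are larger)


class SignatureCache:
    """Stores per-block expert activation counts under the same key as the KV block."""

    def __init__(self):
        self._counts = {}

    def put(self, block_hash, expert_counts):
        self._counts[block_hash] = Counter(expert_counts)

    def get(self, block_hash):
        return self._counts.get(block_hash)

    def evict(self, block_hash):
        # Call when the matching KV block is evicted so both caches stay in sync.
        self._counts.pop(block_hash, None)


def block_hashes(token_ids):
    """Chain hashes over full blocks so each key depends on the entire prefix."""
    hashes, prev = [], None
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        prev = hash((prev, tuple(token_ids[i:i + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


def full_signature(cache, cached_hashes, fresh_counts, num_experts):
    """Merge cached per-block counts with counts from newly prefilled tokens."""
    total = Counter(fresh_counts)
    for h in cached_hashes:
        total.update(cache.get(h) or {})
    n = sum(total.values()) or 1
    return [total[e] / n for e in range(num_experts)]


cache = SignatureCache()
(h,) = block_hashes([11, 12, 13, 14])
cache.put(h, {0: 6, 1: 2})  # counts recorded when this prefix was first prefilled
# A later request reuses the prefix; only its suffix activates experts during prefill.
sig = full_signature(cache, [h], {2: 2}, num_experts=4)
print(sig)  # [0.6, 0.2, 0.2, 0.0]
```

Without the cached counts, the same request would yield the signature `[0.0, 0.0, 1.0, 0.0]`, which reflects only the short suffix and could send the request to a poorly matched worker.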

💡 Practical Value (Relevance)

For large-scale distributed serving of MoE models, the paper proposes an efficient scheduling strategy that goes beyond simple load balancing by taking the model's internal computation characteristics into account.

✅ Recommended Actions (Actionable Items)

  • Run comparative experiments on load-balancing algorithms in vLLM using MoE models (e.g., Mixtral)
  • Measure how signature accuracy and performance change when prefix caching is enabled
  • Test scalability across GPU clusters of different sizes
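For the comparison experiments listed above, TPOT can be computed from per-token arrival timestamps. The helper below is a small sketch using the common definition (mean inter-token time after the first output token); the function names are hypothetical and the timestamps are synthetic placeholders chosen for easy arithmetic, not measured results.

```python
def tpot(token_timestamps_ms):
    """Mean time per output token after the first token, in milliseconds."""
    if len(token_timestamps_ms) < 2:
        raise ValueError("need at least two output tokens")
    span = token_timestamps_ms[-1] - token_timestamps_ms[0]
    return span / (len(token_timestamps_ms) - 1)


def improvement_pct(baseline_tpot, candidate_tpot):
    """Relative TPOT reduction of the candidate router versus the baseline."""
    return 100.0 * (baseline_tpot - candidate_tpot) / baseline_tpot


# Synthetic arrival times (ms) of output tokens for one request under two routers.
baseline = tpot([0, 50, 100, 150])
candidate = tpot([0, 40, 80, 120])
print(baseline, candidate, improvement_pct(baseline, candidate))  # 50.0 40.0 20.0
```

In a real run, the per-request values would be aggregated (for example as mean and tail percentiles) over the same workload for each routing policy, with and without prefix caching.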