PyoSignal
🤖 Reddit r/LocalLLaMA

Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

309 upvotes, 90 comments
Release LLM Benchmark Qwen3.6 Python

📝 AI Summary

Luce DFlash is a GGUF port that uses speculative decoding to run Qwen3.6-27B at up to 2x throughput on a single RTX 3090. It runs on a C++/CUDA stack, with no separate Python runtime or llama.cpp installation required. The community reacted positively to the advance in local AI inference and also showed interest in Docker support.
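For readers new to the idea, speculative decoding in general can be sketched as follows. This is a generic greedy variant with toy stand-in models, not the DFlash drafting scheme or Luce's C++/CUDA code; `toy_target`, `toy_draft`, `speculative_decode`, and `greedy_decode` are hypothetical names used only for illustration.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Stand-in for the large model (assumption: any deterministic rule works)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Stand-in for the small draft model: agrees with the target most of the time."""
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; the output matches target-only greedy decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target verifies the proposal. A real engine scores all k
        #    positions in one batched forward pass; this sketch uses a loop.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:limit]
```

Because every accepted token is one the target would have chosen anyway, the result is identical to plain greedy decoding. Any speedup comes from verifying several drafted tokens per target pass, so it depends on how often the draft agrees with the target.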

🔑 Key Discussion Points

  • Provides a DFlash speculative decoding GGUF port for Qwen3.6-27B, with up to 2x higher throughput on a single RTX 3090.
  • Built on a C++/CUDA stack, so no Python runtime or llama.cpp installation is required. It links against the libggml*.a libraries.
  • The KV cache is compressed to TQ3_0 to reduce memory usage, and sliding-window Flash Attention keeps decoding fast even at long context lengths.
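The sliding-window point can be illustrated with a single decode step in plain Python. This is a minimal sketch of windowed attention in general, assuming one head and unquantized keys and values; it is not Luce DFlash's kernel, and `sliding_window_attend` and its parameters are hypothetical names.

```python
import math
from typing import List, Sequence


def sliding_window_attend(q: Sequence[float],
                          keys: Sequence[Sequence[float]],
                          values: Sequence[Sequence[float]],
                          window: int) -> List[float]:
    """One decode step: the newest query attends only to the last `window` positions."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Only the most recent `window` cache entries are read, so the per-token
    # cost is bounded by the window size rather than the full context length.
    ks = keys[-window:]
    vs = values[-window:]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in ks]
    # Numerically stable softmax over the windowed scores.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(vs[0])
    return [sum(w * v[d] for w, v in zip(weights, vs)) / total for d in range(dim)]
```

With `window=1` the output is just the most recent value vector, and entries older than the window never influence the result. That is the trade-off behind the speed: less work per token in exchange for ignoring distant context in the windowed layers.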