Tested some popular GGUFs for 16GB VRAM target
I got interested in local LLMs recently, so I decided to run a coding benchmark to see which of the popular GGUF quantizations work well enough on my 16GB RTX 4070 Ti SUPER. I haven't found similar tests; people mostly compare unquantized models, which isn't very realistic for local use in my opinion. I served the models via the LM Studio server and ran the can-ai-code benchmark locally inside WSL2 on Windows 11.
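For anyone wanting to reproduce this: the scoring itself comes from can-ai-code's own interview/evaluate scripts, but before a full run it's worth checking that the LM Studio server is reachable from inside WSL2. A minimal sketch, assuming LM Studio's OpenAI-compatible API on its default port 1234 and a placeholder model id (depending on your WSL networking mode you may need the Windows host IP instead of localhost):

```python
# Minimal smoke test for the LM Studio server before running can-ai-code.
# Assumes LM Studio's OpenAI-compatible API on its default port (1234);
# from WSL2 you may need the Windows host IP instead of localhost.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's local server endpoint
    api_key="lm-studio",                  # any non-empty string works locally
)

resp = client.chat.completions.create(
    model="qwen2.5-coder-14b-instruct",   # placeholder; use the id of the loaded model
    messages=[{"role": "user",
               "content": "Write a Python function that reverses a string."}],
    temperature=0.2,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```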
LLM (16K context, all layers on GPU) | tok/sec | Passed (120+ is good) | Max context that fits |
---|---|---|---|
bartowski/Qwen2.5-Coder-32B-Instruct-IQ3_XXS.gguf | 13.71 | 147 | 8K will fit at ~25 t/s |
chatpdflocal/Qwen2.5.1-Coder-14B-Instruct-Q4_K_M.gguf | 48.67 | 146 | 28K |
bartowski/Qwen2.5-Coder-14B-Instruct-Q5_K_M.gguf | 45.13 | 146 | 16K, all 14B |
unsloth/phi-4-Q5_K_M.gguf | 51.04 | 143 | 16K, all phi-4 |
bartowski/Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf | 50.79 | 143 | 24K |
bartowski/phi-4-IQ3_M.gguf | 49.35 | 143 | |
bartowski/Mistral-Small-24B-Instruct-2501-IQ3_XS.gguf | 40.86 | 143 | 24K |
bartowski/phi-4-Q5_K_M.gguf | 48.04 | 142 | |
bartowski/Mistral-Small-24B-Instruct-2501-Q3_K_L.gguf | 36.48 | 141 | 16K |
bartowski/Qwen2.5.1-Coder-7B-Instruct-Q8_0.gguf | 60.5 | 140 | 32K, max |
bartowski/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf | 60.06 | 139 | 32K, max |
bartowski/Qwen2.5-Coder-14B-Q5_K_M.gguf | 46.27 | 139 | |
unsloth/Qwen2.5-Coder-14B-Instruct-Q5_K_M.gguf | 38.96 | 139 | |
unsloth/Qwen2.5-Coder-14B-Instruct-Q8_0.gguf | 10.33 | 139 | |
bartowski/Qwen2.5-Coder-14B-Instruct-IQ3_M.gguf | 58.74 | 137 | 32K |
bartowski/Qwen2.5-Coder-14B-Instruct-IQ3_XS.gguf | 47.22 | 135 | 32K |
bartowski/Codestral-22B-v0.1-IQ3_M.gguf | 40.79 | 135 | 16K |
bartowski/Yi-Coder-9B-Chat-Q8_0.gguf | 50.39 | 131 | 40K |
bartowski/Yi-Coder-9B-Chat-Q6_K.gguf | 57.13 | 126 | 50K |
bartowski/codegeex4-all-9b-Q6_K.gguf | 57.12 | 124 | 70K |
bartowski/gemma-2-27b-it-IQ3_XS.gguf | 33.21 | 118 | 8K Context limit! |
bartowski/Qwen2.5-Coder-7B-Instruct-Q6_K.gguf | 70.52 | 115 | |
bartowski/Qwen2.5-Coder-7B-Instruct-Q6_K_L.gguf | 69.67 | 113 | |
bartowski/Mistral-Small-Instruct-2409-22B-Q4_K_M.gguf | 12.96 | 107 | |
unsloth/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf | 51.77 | 105 | 64K |
tensorblock/code-millenials-13b-Q5_K_M.gguf | 17.15 | 102 | |
bartowski/codegeex4-all-9b-Q8_0.gguf | 46.55 | 97 | |
bartowski/Mistral-Small-Instruct-2409-22B-IQ3_M.gguf | 45.26 | 91 | |
starble-dev/Mistral-Nemo-12B-Instruct-2407-GGUF | 51.51 | 82 | 28K |
bartowski/SuperNova-Medius-14.8B-Q5_K_M.gguf | 39.09 | 82 | |
bartowski/DeepSeek-Coder-V2-Lite-Instruct-Q5_K_M.gguf | 29.21 | 73 | |
bartowski/EXAONE-3.5-7.8B-Instruct-Q6_K.gguf | 73.7 | 42 | |
bartowski/EXAONE-3.5-7.8B-Instruct-GGUF | 54.86 | 16 | |
bartowski/EXAONE-3.5-32B-Instruct-IQ3_XS.gguf | 11.09 | 16 | |
bartowski/DeepSeek-R1-Distill-Qwen-14B-IQ3_M.gguf | 49.11 | 3 | |
bartowski/DeepSeek-R1-Distill-Qwen-14B-Q5_K_M.gguf | 40.52 | 3 |
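On the "Max context that fits" column: those numbers came from trial and error in LM Studio, but you can roughly sanity-check them as GGUF file size plus KV cache, which grows linearly with context. A minimal sketch; the layer/head numbers below are assumptions roughly matching a Qwen2.5-14B-style config, not exact values:

```python
# Rough VRAM estimate: GGUF weight size + fp16 KV cache. Ignores llama.cpp's
# extra runtime buffers, so treat the result as a lower bound.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer at the given context length."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem / 2**30

# Assumed config: roughly a Qwen2.5-14B-style model with GQA (8 KV heads).
weights_gib = 10.5  # a 14B Q5_K_M GGUF is roughly this size
for ctx in (8_192, 16_384, 24_576, 32_768):
    total = weights_gib + kv_cache_gib(n_layers=48, n_kv_heads=8,
                                       head_dim=128, context=ctx)
    print(f"{ctx:>6} tokens -> ~{total:.1f} GiB")
```

With those assumptions, 16K context adds about 3 GiB of KV cache on top of the weights, which lines up with the 14B Q5_K_M quant topping out around 16K on a 16GB card.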
`bartowski/codegeex4-all-9b-Q6_K.gguf` and `bartowski/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf` worked surprisingly well in my testing. I think the 16GB VRAM limit will stay very relevant for the next few years. What do you think?
Edit: updated the table with a few fixes.
Edit 2: replaced the image with a text table, added Qwen 2.5.1 and Mistral Small 3 2501 24B.