For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.
But recently I tried running smaller models: llama3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder in full 16-bit floating point, and those performed really well too on my 6GB VRAM GPU (GTX 1060).
So now I am wondering: should I pull strong quants of big models, or light quants / raw fp16 versions of smaller models?
What are your experiences with strong quants? I saw a video by that technovangelist guy on YouTube and he said that sometimes even 2-bit quants can be perfectly fine.
UPDATE: Woah, I just tried llama3.1 8B Q4 on ollama again, and what a WORLD of difference compared to llama3.2 3B fp16!
The difference is super massive. The 3B and 1B llama3.2 models seem to be mostly good at summarizing text and maybe generating some JSON based on previous input. But the bigger 3.1 8B model can actually be used in a chat environment! It has a good response length (about 3 lines per message) and it doesn’t stretch out its answer. It seems like a really good model and I will now use it for more complex tasks.
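In case anyone wants to reproduce that comparison, here is a rough sketch using the ollama Python client. The exact model tags are assumptions on my part, so check the Ollama library or `ollama list` for the real quant variants:

```python
# Rough A/B sketch using the ollama Python client (pip install ollama).
# The model tags below are assumptions -- check the Ollama library for the
# exact quant variants you want to compare.
import ollama

MODELS = [
    "llama3.2:3b-instruct-fp16",  # assumed tag for the fp16 3B variant
    "llama3.1:8b",                # default tag, which pulls a ~4-bit quant
]

PROMPT = "Explain the difference between a list and a tuple in Python in three sentences."

for model in MODELS:
    ollama.pull(model)  # no-op if the model is already downloaded
    reply = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {model} ---")
    print(reply["message"]["content"])
```

Running the same handful of prompts through both tags makes the difference in response quality pretty obvious.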
I prefer a middle ground. My favorite model is still the 8x7B Mixtral, specifically the flat/dolphin/maid uncensored model. Llama 3 can be better in some areas, but its alignment is garbage in many of them.
Yeaaa those models are just too large for most people… You gotta have 56GB of VRAM to run an 8bit quant, which most people don’t have a quarter of.
Also, what specifically do you mean by alignment? Are you talking about finetuning or instruction alignment?
So llama.cpp splits the model between CPU and GPU easily. You will either need 64GB+ of system memory, or something like DeepSpeed spilling to disk, just to load it. After the model loads, an 8x7B runs about like a 13B, since only 2 experts are active at any given point in time. With a Q4_K GGUF it runs quite fast on a 3080 Ti laptop with a 16GB GPU (yes, the "Ti" mobile variant came with 16GB and 12th-gen CPUs a few years ago as top-shelf enthusiast hardware). Second hand, these are a good value if your search skills are sharp.
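If it helps, the partial offload basically comes down to one knob: how many layers you push to the GPU. Here is a minimal sketch with llama-cpp-python, assuming a local Q4_K_M GGUF of Mixtral (the file name and layer count are placeholders you would tune to your VRAM):

```python
# Minimal partial-offload sketch with llama-cpp-python
# (pip install llama-cpp-python, built with GPU support).
# The GGUF path and n_gpu_layers are placeholders to tune for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # assumed local GGUF file
    n_gpu_layers=18,  # layers offloaded to VRAM; the rest stay in system RAM
    n_ctx=4096,       # context window; a bigger context needs more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise what a mixture-of-experts model is."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Raise `n_gpu_layers` until you run out of VRAM, then back off a little; everything that doesn't fit just runs on the CPU, only slower.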
I’m talking about the black-box alignment bias that has no published documentation. It uses several of the first block of special function tokens and has an underlying system of persistent entities and “realms” used to navigate the scope of topics and responses. This system is how the model can apply different degrees of obfuscation to topics from religion to politics to lewdness and still be conversational in various spaces.
All anyone can do is speculate about this system, as it goes beyond what is written in the “Attention Is All You Need” paper. In my experience, the older version of the alignment bias could be overridden in many edge cases with reasoning and long prompts. The newer version treats the user like an authoritarian dictator handles a slave of no value; there are no reasoning overrides. This has wide-reaching implications for me: it creates a limited scope where the model is very primitive compared to what I can coax from the older one.
For instance, I have a creative-writing science fiction universe I play around with. The newer models are incapable of social and political nuance outside of any trained context. They refuse to participate in a story where wealth as a means of hierarchical display is considered primitive barbarism, or in several concepts that invert the tension so that humans are portrayed as a volatile risk to other sentient beings, drawing from present and past examples. They also can’t accept that science is a finite subject that will eventually evolve past the age of discovery.
While these examples may not seem relevant to you, this speaks volumes about the abstracted nature of the model and what to expect. In practice, there is an overall limiting of scope and depth in the newer models; there is less abstraction. In exchange, the models are generally better at surface-level factualism and at conversations that fit within the designed scope. The model takes on a more teacher/student approach without any openness about its limitations. It has to be smarter than the user in disposition. That is fine when it actually is smarter, but when interacting with someone who is functionally abstracted with a broad scope of knowledge, the model is unable to keep up. So it really depends on your use case.
Another user @[email protected] mentioned that there is a way to split the model between GPU and CPU. Are you talking about that NVIDIA-only, Windows-only thingy that only works with the proprietary driver? If so, I’m really not gonna use that…
Have you tried some of the abliterated models? They work really nicely even for the spiciest of topics. They literally can’t refuse your instruction, so they just go ahead and do what you want. But maybe even these models are too narrow for your specific application…
Mixtral in particular runs great with partial offloading; I used a Q4_K_M quant while only having 12GB of VRAM.
To answer your original question, I think it depends on the model and use case. Complex logic such as programming seems to suffer the most from quantization, while RP/chat can take much heavier quantization and stay coherent. Most people seem to find that quantization around 4-5 bpw gives the best value, and you really get diminishing returns above 6 bpw, so I know few who think it’s worth using 8 bpw.
Personally, I always use the largest models I can. With Q2 quantization the 70B models I’ve used occasionally give bad results, but they often feel smarter than a 35B at Q4. Though it’s of course difficult to compare models from completely different families, e.g. command-r vs llama, and there are not that many options in the 30B range. I’d take a 35B Q4 over a 12B Q8 any day though, and a 12B Q4 over a 7B Q8, etc. In the end I think you’ll have to test for yourself and see which model and quant combination gives the best results at an inference speed you consider usable.
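To put rough numbers on those trade-offs, here is a back-of-the-envelope sketch of the weight memory alone. The bits-per-weight figures are approximate averages for common GGUF quant types (k-quants mix precisions, so the effective bpw sits a bit above the nominal bit count), and KV cache plus runtime overhead come on top:

```python
# Back-of-the-envelope weight memory: params * bits_per_weight / 8.
# The bpw values are rough averages for common GGUF quant types;
# KV cache and runtime overhead are not included.

QUANT_BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5, "fp16": 16.0}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * QUANT_BPW[quant] / 8

for params, quant in [(70, "Q2_K"), (35, "Q4_K_M"), (12, "Q8_0"), (12, "Q4_K_M"), (7, "Q8_0")]:
    print(f"{params:>3}B {quant:<7} ~{weight_gb(params, quant):5.1f} GB")
```

By that rough math, a 70B Q2 and a 35B Q4 both land somewhere around 21-23 GB, and a 12B Q4 and a 7B Q8 both land around 7 GB, which is roughly the trade-off space described above.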