This is an 8-bit GPTQ (not to be confused with 8-bit RTN) version of Vicuna 13B v1.1 HF.
Q. Why quantize in 8-bit instead of 4-bit? A. For evaluation purposes. In theory, an 8-bit GPTQ model should provide slightly better perplexity than a 4-bit quantized version (possibly not noticeable; see the benchmark results below). If you have more than 15GB of GPU VRAM available, you may want to try this out. Note that 8-bit GPTQ quantization is not the same as loading the model in 8-bit precision: loading in 8-bit precision (--load-in-8bit) comes with a noticeable quality (perplexity) degradation.
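To make the distinction concrete, here is a minimal sketch of how the two loading modes are selected in text-generation-webui. The flag names and model folder names are assumptions based on the text-generation-webui of that period; check them against your version.

```
# 8-bit GPTQ (this model): pre-quantized weights loaded through GPTQ-for-LLaMa
# (folder name models/Thireus_Vicuna13B-v1.1-8bit-128g is assumed)
python server.py --model Thireus_Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama

# 8-bit precision loading (RTN-style, via bitsandbytes) of the original fp16 model - noticeably worse perplexity
# (the fp16 Vicuna folder name is a placeholder)
python server.py --model vicuna-13b-v1.1 --load-in-8bit
```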
Refs:
This model is an 8-bit quantization of Vicuna 13B v1.1, produced with https://github.com/qwopqwop200/GPTQ-for-LLaMa/.
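For reference, an 8-bit, group size 128 GPTQ quantization of this kind is typically produced with GPTQ-for-LLaMa's llama.py along the lines below. This is a sketch only: the input/output paths are placeholders, and whether --act-order and --true-sequential were used for this exact checkpoint is an assumption.

```
# Quantize the fp16 HF checkpoint to 8-bit GPTQ, group size 128, calibrating on c4
# (paths are placeholders; --true-sequential / --act-order usage for this checkpoint is assumed)
CUDA_VISIBLE_DEVICES=0 python llama.py ./Vicuna13B-v1.1-HF c4 \
    --wbits 8 --groupsize 128 --true-sequential --act-order \
    --save_safetensors vicuna13b-v1.1-8bit-128g.safetensors
```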
Best results are shown in bold.

`--benchmark 2048 --check` results (see the reproduction commands below the tables):
| Model | wikitext2 PPL | ptb PPL | c4 PPL | VRAM Utilization (MiB) |
|---|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 8.517391204833984 | 20.888103485107422 | **7.058407783508301** | 8670.26953125 |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **8.508771896362305** | **20.75649070739746** | 7.105874538421631 | 14840.26171875 |
`--eval` results:
| Model | wikitext2 PPL | ptb PPL | c4 PPL |
|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 25.692861557006836 | 9.06746768951416 |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **6.988043308258057** | **24.882535934448242** | **8.991846084594727** |
`--new-eval --eval` results:
| Model | wikitext2 PPL | ptb-new PPL | c4-new PPL |
|---|---|---|---|
| 4bit-GPTQ - TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g | 7.119165420532227 | 35.637290954589844 | 9.550592422485352 |
| 8bit-GPTQ - Thireus/Vicuna13B-v1.1-8bit-128g | **6.988043308258057** | **34.264320373535156** | **9.426002502441406** |
PPL = Perplexity (lower is better) - https://huggingface.co/docs/transformers/perplexity
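Numbers like the above come from GPTQ-for-LLaMa's llama.py and can be reproduced with commands along the following lines. This is a sketch: the checkpoint filename and model path are placeholders, and the exact invocation used for the results above is not recorded here.

```
# Perplexity on wikitext2 / ptb / c4
python llama.py ./Vicuna13B-v1.1-HF c4 --wbits 8 --groupsize 128 \
    --load vicuna13b-v1.1-8bit-128g.safetensors --eval

# Same, but using the newer ptb/c4 test splits
python llama.py ./Vicuna13B-v1.1-HF c4 --wbits 8 --groupsize 128 \
    --load vicuna13b-v1.1-8bit-128g.safetensors --eval --new-eval

# Generation benchmark over 2048 tokens, with output checking and max VRAM reporting
python llama.py ./Vicuna13B-v1.1-HF c4 --wbits 8 --groupsize 128 \
    --load vicuna13b-v1.1-8bit-128g.safetensors --benchmark 2048 --check
```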
RECOMMENDED - Triton (Fast tokens/s) - Works on Windows with WSL (what I've used) or Linux:
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# The Triton branch (https://github.com/oobabooga/text-generation-webui/pull/1229) has since been merged,
# so fetching and checking it out is no longer needed:
#git fetch origin pull/1229/head:triton
#git checkout triton
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git # do NOT add -b cuda here; that is only for the CUDA setup below
cd GPTQ-for-LLaMa
#git checkout 508de42 # Since been fixed # Older workaround from before https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/183 was resolved
git checkout 210c379 # Optional - a commit I have verified; you may want to try the latest commit instead, and revert to an older one such as this if the latest doesn't work
pip install -r requirements.txt
```
DISCOURAGED - CUDA (slow tokens/s) with known output issues (https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/128):
```
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda # Make sure you obtain the qwopqwop200 version, not the oobabooga one! (because "act-order: yes")
cd GPTQ-for-LLaMa
git checkout 505c2c7 # Optional - a commit I have verified; you may want to try the latest commit instead, and revert to an older one such as this if the latest doesn't work
pip install -r requirements.txt
python setup_cuda.py install
```
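Whichever setup you use, downloading and launching the model from text-generation-webui looks roughly like this. A sketch only: the destination folder name produced by download-model.py and the exact flags depend on your webui version.

```
# Download the quantized model into models/Thireus_Vicuna13B-v1.1-8bit-128g (folder name assumed)
python download-model.py Thireus/Vicuna13B-v1.1-8bit-128g

# Launch in chat mode, telling the loader this is an 8-bit GPTQ LLaMA model with group size 128
python server.py --chat --model Thireus_Vicuna13B-v1.1-8bit-128g --wbits 8 --groupsize 128 --model_type llama
```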
Latest version of oobabooga's text-generation-webui + https://github.com/oobabooga/text-generation-webui/pull/1229
NVIDIA RTX 3090
32GB DDR4
i9-7980XE OC @ 4.6GHz
11 tokens/s on average with Triton
Tokens/s roughly equivalent to those observed with the 4-bit version
Pending preliminary observation: better quality results than 8-bit RTN / --load-in-8bit (to be confirmed)
Observation: better quality results than 4-bit GPTQ (cf. the PPL benchmarks above) at the cost of about +71% VRAM usage.
Tested and working in both chat mode and text generation mode
Research only - non-commercial research purposes; other restrictions apply. See the LICENSE file inherited from LLaMA.
LLaMA-13B converted to work with Transformers/Hugging Face is under a special license; please see the LICENSE file for details.
https://www.reddit.com/r/LocalLLaMA/comments/12kl68j/comment/jg31ufe/
Model type: Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. It is an auto-regressive language model, based on the transformer architecture.
Model date: Vicuna was trained between March 2023 and April 2023.
Organizations developing the model: The Vicuna team with members from UC Berkeley, CMU, Stanford, and UC San Diego.
Paper or resources for more information: https://vicuna.lmsys.org/
License: Apache License 2.0
Where to send questions or comments about the model: https://github.com/lm-sys/FastChat/issues
Primary intended uses: The primary use of Vicuna is research on large language models and chatbots.
Primary intended users: The primary intended users of the model are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
Training dataset: 70K conversations collected from ShareGPT.com.
Evaluation dataset: A preliminary evaluation of the model quality is conducted by creating a set of 80 diverse questions and utilizing GPT-4 to judge the model outputs. See https://vicuna.lmsys.org/ for more details.