Model:
ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
A GPTQ quantization of RedPajama-INCITE-Chat-3B-v1 created via auto-gptq (4 bits, group size 128). The model file is only 2 GB.
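For reference, a checkpoint like this can be produced with auto-gptq roughly as follows. This is a minimal sketch, not the exact script used to build this repo: the calibration text and the output directory name are assumptions, though bits=4 and group_size=128 match the "-4bit-128g" suffix in the repo name.

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base_model = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 4-bit weights with a quantization group size of 128
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# calibration examples for GPTQ; any representative text works (this sample is made up)
examples = [
    tokenizer("GPTQ is a post-training quantization method for large language models.")
]

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)
model.save_quantized("RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g", use_safetensors=True)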
Note that you cannot load this model directly from the Hub with auto_gptq yet - if needed, first download the repo locally by name (e.g. as sketched below), then load it from the local path.
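One way to do the download is with huggingface_hub.snapshot_download. This is a hedged stand-in for whatever helper the original card pointed to, not necessarily the same function:

from pathlib import Path
from huggingface_hub import snapshot_download

repo_id = "ethzanalytics/RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"
# fetch all files in the repo into ./RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g
local_dir = Path.cwd() / repo_id.split("/")[-1]
snapshot_download(repo_id=repo_id, local_dir=local_dir)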
First, install auto-gptq:
pip install ninja "auto-gptq[triton]"
Load:
import torch
from pathlib import Path

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# path to the locally downloaded repo (see the download note above)
model_repo = Path.cwd() / "RedPajama-INCITE-Chat-3B-v1-GPTQ-4bit-128g"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_repo)
model = AutoGPTQForCausalLM.from_quantized(
    model_repo,
    device=device,
    use_safetensors=True,
    use_triton=device != "cpu",  # Triton kernels require Linux; comment/remove if not on Linux
).to(device)
Inference:
import re

prompt = "How can I further strive to increase shareholder value even further?"
prompt = f"<human>: {prompt}\n<bot>:"  # the chat checkpoint expects this turn format

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    penalty_alpha=0.6,
    top_k=4,
    temperature=0.7,
    do_sample=True,
    max_new_tokens=192,
    length_penalty=0.9,
    pad_token_id=model.config.eos_token_id,
)
result = tokenizer.batch_decode(
    outputs,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

# keep only the bot's reply: text between "<bot>:" and the next "<human>" (or end of string)
bot_responses = re.findall(r"<bot>:(.*?)(<human>|$)", result[0], re.DOTALL)
bot_responses = [response[0].strip() for response in bot_responses]
print(bot_responses[0])
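If you call the model repeatedly, it can be convenient to wrap the prompt formatting, generation, and reply extraction into one helper. generate_reply below is a hypothetical convenience function built from the snippet above; it is not part of auto-gptq or this repo:

import re

def generate_reply(model, tokenizer, user_message: str, max_new_tokens: int = 192) -> str:
    """Format a single-turn <human>/<bot> prompt, generate, and return only the bot's reply."""
    prompt = f"<human>: {user_message}\n<bot>:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        penalty_alpha=0.6,
        top_k=4,
        temperature=0.7,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        pad_token_id=model.config.eos_token_id,
    )
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    matches = re.findall(r"<bot>:(.*?)(<human>|$)", text, re.DOTALL)
    # fall back to the raw decode if the expected markers are missing
    return matches[0][0].strip() if matches else text

print(generate_reply(model, tokenizer, "What is GPTQ quantization?"))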