Model:
h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
This model was trained using H2O LLM Studio.
To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers`, `accelerate`, and `torch` libraries installed.
```bash
pip install transformers==4.29.2
pip install bitsandbytes==0.39.0
pip install accelerate==0.19.0
pip install torch==2.0.0
pip install einops==0.6.1
```
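Since the 40B-parameter model requires one or more large GPUs, it can help to confirm that CUDA devices are actually visible to `torch` before loading it. This quick check is an optional addition, not part of the original instructions:

```python
import torch

# Optional sanity check (not from the original card): confirm that CUDA GPUs
# are visible before attempting to load the 40B-parameter model.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```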
```python
import torch
from transformers import pipeline, BitsAndBytesConfig, AutoTokenizer

model_kwargs = {}

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)
model_kwargs["quantization_config"] = quantization_config

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)

generate_text = pipeline(
    model="h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    use_fast=False,
    device_map={"": "cuda:0"},
    model_kwargs=model_kwargs,
)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)
print(res[0]["generated_text"])
```
You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:
```python
print(generate_text.preprocess("Why is drinking water so healthy?")["prompt_text"])
```
```
<|prompt|>Why is drinking water so healthy?<|endoftext|><|answer|>
```
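If you prefer to assemble the prompt string yourself, a minimal helper like the following reproduces the single-turn format shown above. This is a sketch rather than part of the original card, and the function name is hypothetical:

```python
def build_prompt(question: str) -> str:
    # Hypothetical helper: reproduces the single-turn template shown above,
    # i.e. <|prompt|>...<|endoftext|><|answer|>.
    return f"<|prompt|>{question}<|endoftext|><|answer|>"

print(build_prompt("Why is drinking water so healthy?"))
```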
Alternatively, you can download `h2oai_pipeline.py`, store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer:
```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config,
).eval()

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)
print(res[0]["generated_text"])
```
You may also construct the pipeline from the loaded model and tokenizer yourself and handle the prompt preprocessing on your own:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Important: The prompt needs to be in the same format the model was trained with.
# You can find an example prompt in the experiment logs.
prompt = "<|prompt|>How are you?<|endoftext|><|answer|>"

quantization_config = None
# optional quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
)

tokenizer = AutoTokenizer.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    use_fast=False,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map={"": "cuda:0"},
    quantization_config=quantization_config,
).eval()

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# generation configuration can be modified to your needs
tokens = model.generate(
    **inputs,
    min_new_tokens=2,
    max_new_tokens=1024,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True,
)[0]

# strip the prompt tokens and decode only the generated answer
tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
```
Model architecture:

```
RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 8192)
    (h): ModuleList(
      (0-59): 60 x DecoderLayer(
        (ln_attn): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (ln_mlp): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear(in_features=8192, out_features=9216, bias=False)
          (dense): Linear(in_features=8192, out_features=8192, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear(in_features=8192, out_features=32768, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear(in_features=32768, out_features=8192, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((8192,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=8192, out_features=65024, bias=False)
)
```
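This summary is simply the printed module tree of the loaded model. Assuming `model` refers to the `AutoModelForCausalLM` loaded in the snippets above, you can reproduce it, and get a rough parameter count, with something like:

```python
# Assumes `model` is the AutoModelForCausalLM loaded in the snippets above.
print(model)  # prints the module tree shown above

# Rough parameter count across all weights (embeddings, attention, MLP, head).
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
```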
This model was trained using H2O LLM Studio with the configuration in `cfg.yaml`. Visit H2O LLM Studio to learn how to train your own large language models.
Please read this disclaimer carefully before using the large language model provided in this repository. Your use of the model signifies your agreement to the following terms and conditions.
By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.