TheBloke/falcon-40b-instruct-GPTQ
Chat & support: my new Discord server
Want to contribute? TheBloke's Patreon page
This repo contains an experimental GPTQ 4bit model for Falcon-40B-Instruct.
It is the result of quantising to 4bit using AutoGPTQ.
Prompt template:

```
A helpful assistant who helps the user with any questions asked.
User: prompt
Assistant:
```
Please note this is an experimental GPTQ model. Support for it is currently quite limited.
It is also expected to be VERY SLOW. This is currently unavoidable, but is being looked at.
This 4bit model requires at least 35GB VRAM to load. It can be used on 40GB or 48GB cards, but not less.
Please be aware that you should currently expect around 0.7 tokens/s on 40B Falcon GPTQ.
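As a rough, unofficial sketch of where that VRAM goes: the 4-bit weights alone account for roughly half of the 35GB, with activations, non-quantised tensors and CUDA overhead making up the rest.

```python
# Back-of-the-envelope arithmetic only; not an official memory breakdown.
n_params = 40e9                            # ~40B parameters
weight_gib = n_params * 4 / 8 / 1024**3    # 4 bits = 0.5 bytes per parameter
print(f"~{weight_gib:.1f} GiB for the quantised weights alone")  # ~18.6 GiB
```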
AutoGPTQ is required:

```
pip install auto-gptq
```
AutoGPTQ provides pre-compiled wheels for Windows and Linux, with CUDA toolkit 11.7 or 11.8.
If you are running CUDA toolkit 12.x, you will need to compile your own by following these instructions:
```
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip install .
```
These manual steps will require that you have the Nvidia CUDA toolkit installed.
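Before compiling, it can help to confirm that PyTorch sees your GPU and to check which CUDA version PyTorch itself was built against (a minimal check, assuming PyTorch is already installed):

```python
import torch

# True if a CUDA device is visible to PyTorch.
print(torch.cuda.is_available())
# The CUDA version PyTorch was built with, e.g. "11.8"; ideally this
# matches the toolkit you compile AutoGPTQ against.
print(torch.version.cuda)
```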
There is provisional AutoGPTQ support in text-generation-webui.
This requires text-generation-webui as of commit 204731952ae59d79ea3805a425c73dd171d943c3.
So please first update text-generation-webui to the latest version.
Please note that the trust-remote-code argument this requires will cause Python code provided by Falcon to be executed on your machine.

This code is currently required because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then trusting remote code will no longer be needed.

In this repo you can see two .py files; these are the files that get executed. They are copied from the base repo, Falcon-7B-Instruct.
To run this code you need to install AutoGPTQ and einops:

```
pip install auto-gptq
pip install einops
```

You can then run this example code:
```python
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/falcon-40b-instruct-GPTQ"
# You could also download the model locally, and access it there
# model_name_or_path = "/path/to/TheBloke_falcon-40b-instruct-GPTQ"

model_basename = "gptq_model-4bit--1g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

prompt = "Tell me about AI"
prompt_template = f'''A helpful assistant who helps the user with any questions asked.
User: {prompt}
Assistant:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline.
# Note that if you use pipeline, you will see a spurious error message
# saying the model type is not supported. This can be ignored, or hidden
# with the following logging line:

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
```
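If you want the "download the model locally" option mentioned in the comment above, one possible approach is a snapshot download via huggingface_hub (a sketch; any download method that yields a local copy of the repo works):

```python
from huggingface_hub import snapshot_download

# Downloads the full repo into the local Hugging Face cache and returns
# the local path, which can then be used as model_name_or_path above.
local_path = snapshot_download(repo_id="TheBloke/falcon-40b-instruct-GPTQ")
print(local_path)
```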
Provided files:

gptq_model-4bit--1g.safetensors

This file is compatible with AutoGPTQ 0.2.0 and later. It was created without group_size to reduce VRAM requirements, and with desc_act (act-order) to improve inference quality.
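If AutoGPTQ cannot read the quantisation settings from the repo (the example above passes quantize_config=None so they are picked up automatically), they can be spelled out explicitly. This is a hedged sketch of that fallback: group_size=-1 (no grouping) matches the "-1g" in the filename, and desc_act=True is act-order.

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Explicit quantisation settings matching the description above:
# 4-bit, no group size, desc_act (act-order) enabled.
quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/falcon-40b-instruct-GPTQ",
    model_basename="gptq_model-4bit--1g",
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    quantize_config=quantize_config,  # instead of None
)
```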
For further support, and discussions on these models and AI in general, join us on Discord:

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute, it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donaters will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to my generous patrons and donaters!
Falcon-40B-Instruct is a 40B parameter causal decoder-only model built by TII based on Falcon-40B and finetuned on a mixture of Baize. It is made available under the TII Falcon LLM License.

Paper coming soon.

This is an instruct model, which may not be ideal for further finetuning. If you are interested in building your own instruct/chat model, we recommend starting from Falcon-40B.

Looking for a smaller, less expensive model? Falcon-7B-Instruct is Falcon-40B-Instruct's little brother!
```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
Falcon-40B-Instruct has been finetuned on a chat dataset.

Out-of-scope use: production use without adequate assessment of risks and mitigation, and any use cases which may be considered irresponsible or harmful.

Falcon-40B-Instruct is mostly trained on English data, and will not generalise appropriately to other languages. Furthermore, as it is trained on a large-scale corpus representative of the web, it will carry the stereotypes and biases commonly encountered online.

We recommend users of Falcon-40B-Instruct to develop guardrails and to take appropriate precautions for any production use.
Falcon-40B-Instruct was finetuned on 150M tokens from Baize mixed with 5% of RefinedWeb data.

The data was tokenized with the Falcon-7B/40B tokenizer.
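As a quick, informal way to inspect that tokenizer (a sketch assuming transformers is installed; the vocabulary size should match the hyperparameter table below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")
print(len(tokenizer))  # vocabulary size, expected 65024
print(tokenizer.tokenize("Falcon-40B-Instruct was finetuned on Baize."))
```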
Paper coming soon.

See the OpenLLM Leaderboard for early results.

For more details about pretraining, see Falcon-40B.

Falcon-40B is a causal decoder-only model trained on a causal language modelling task (i.e., predict the next token).
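As a minimal, generic illustration of that objective (not Falcon's actual training code): the logits at each position are scored against the token one step ahead.

```python
import torch
import torch.nn.functional as F

vocab_size = 65024                              # Falcon's vocabulary size
logits = torch.randn(1, 8, vocab_size)          # (batch, seq_len, vocab)
labels = torch.randint(0, vocab_size, (1, 8))   # the input token ids
# Shift by one: predictions at positions 0..T-2 are scored against the
# tokens at positions 1..T-1, i.e. "predict the next token".
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
)
print(loss)
```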
The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following differences:

For multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree (a generic multiquery sketch follows below).
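For illustration, here is a minimal sketch of plain multiquery attention, in which all query heads share a single key/value head to shrink the KV cache. This is the textbook form, not TII's internal per-tensor-parallel-degree variant, and the causal mask is omitted for brevity:

```python
import torch

def multiquery_attention(x, w_q, w_k, w_v, n_heads):
    B, T, d = x.shape
    hd = d // n_heads
    q = (x @ w_q).view(B, T, n_heads, hd).transpose(1, 2)  # (B, H, T, hd)
    k = (x @ w_k).view(B, T, 1, hd).transpose(1, 2)        # one shared K head
    v = (x @ w_v).view(B, T, 1, hd).transpose(1, 2)        # one shared V head
    scores = (q @ k.transpose(-2, -1)) / hd ** 0.5         # broadcasts over H
    out = scores.softmax(dim=-1) @ v                       # (B, H, T, hd)
    return out.transpose(1, 2).reshape(B, T, d)

# Shapes match Falcon-40B's d_model=8192, head_dim=64, i.e. 128 query heads.
x = torch.randn(2, 16, 8192)
w_q = torch.randn(8192, 8192)
w_k = torch.randn(8192, 64)   # projects to a single 64-dim K head
w_v = torch.randn(8192, 64)
print(multiquery_attention(x, w_q, w_k, w_v, n_heads=128).shape)  # (2, 16, 8192)
```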
| Hyperparameter | Value | Comment |
|---|---|---|
| Layers | 60 | |
| d_model | 8192 | |
| head_dim | 64 | Reduced to optimise for FlashAttention |
| Vocabulary | 65024 | |
| Sequence length | 2048 | |
Falcon-40B-Instruct was trained on AWS SageMaker, using 64 A100 40GB GPUs in P4d instances.

Software: Falcon-40B-Instruct was trained with a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.).
Paper coming soon.

Falcon-40B-Instruct is made available under the TII Falcon LLM License.
Contact: falconllm@tii.ae