Model:
dfurman/mpt-7b-instruct-orca
This short-form instruction-following model was built by finetuning MPT-7B on the first 100k rows of the ehartford/dolphin dataset (an open-source implementation of Microsoft's Orca).
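For reference, a minimal sketch of pulling the first 100k rows of ehartford/dolphin with the datasets library is shown below. The split handling is an assumption: the repo's actual file layout may require an explicit data_files argument.

```python
from datasets import load_dataset

# Assumption: the repo resolves to a single default "train" split; if it does
# not, pass data_files= pointing at the desired jsonl file(s).
dolphin = load_dataset("ehartford/dolphin", split="train[:100000]")
print(dolphin[0])
```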
Example 1:
User:
You are a helpful assistant. Write me a numbered list of things to do in New York City.\n
mpt-7b-instruct-orca:
Example 2:
User:
You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out in a step by step fashion to make sure we have the right answer.\n
mpt-7b-instruct-orca:
Step 1: Identify the days of the week when the barber works.
The barber works on Mondays, Wednesdays, and Fridays.
Step 2: Determine the day of the week on which Daniel went for a haircut.
Daniel went for a haircut on Sunday.
Step 3: Compare the day of the week on which Daniel went for a haircut to the days when the barber works.
Sunday is not one of the days when the barber works.
Step 4: Determine if the statement makes logical sense.
The statement does not make logical sense because Daniel went for a haircut on a day when the barber does not work.
Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day when the barber works, such as Monday, Wednesday, or Friday.
Example 3:
User:
You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n
mpt-7b-instruct-orca:
Subject: Invitation to Dinner Party on Friday
Dear Friend,
I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.
The dinner will be held at my home, and the time is from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.
The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.
Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!
Yours,
Your Friendly Assistant
The architecture is a modification of a standard decoder-only transformer. The model departs from a standard transformer in several ways; notably, it replaces learned positional embeddings with ALiBi and uses performance-optimized attention implementations such as FlashAttention (see MPT-7B for the full list of changes). Its hyperparameters are:
| Hyperparameter | Value |
|---|---|
| n_parameters | 6.65B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
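If you want to confirm these values programmatically, one option is to read them off the model's config. This is a sketch: the attribute names assume the custom MPT config class bundled with the repo (max_seq_len is the only one used elsewhere in this card).

```python
import transformers

name = "dfurman/mpt-7b-instruct-orca"
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)

# attribute names assume the custom MPT config class shipped with the repo
print("n_layers:", config.n_layers)
print("n_heads:", config.n_heads)
print("d_model:", config.d_model)
print("vocab_size:", config.vocab_size)
print("max_seq_len:", config.max_seq_len)
```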
This model was trained on a single H100 (80 GB PCIe) for about 12 hours using the Lambda Labs platform.
Run: July 5, 2023 (link)
Plot derived from the tfevents log at runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca (link).
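If you want to regenerate a plot like this from the tfevents log, one option is TensorBoard's EventAccumulator. This is a sketch; the scalar tag name below is a placeholder and should be replaced with one of the tags it prints.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# path to the run directory containing the tfevents file
ea = EventAccumulator("runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca")
ea.Reload()

print(ea.Tags()["scalars"])  # list the available scalar tags

# "train/loss" is a placeholder tag; substitute one printed above
events = ea.Scalars("train/loss")
steps = [e.step for e in events]
values = [e.value for e in events]
```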
For more details on the pretraining process, see MPT-7B.
The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.
The following language is modified from EleutherAI's GPT-NeoX-20B:
This model can produce factually incorrect output, and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Basic Usage: Notebook
Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.
It includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.
First, install the package dependencies:
```python
!pip install -q -U transformers einops accelerate torch
!pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
```
Basic model loading:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'dfurman/mpt-7b-instruct-orca',
    trust_remote_code=True,
    device_map="auto",
)
```
To use the optimized triton implementation of FlashAttention, you can load the model on GPU with attn_impl='triton' and with bfloat16 precision:
```python
import torch
import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'meta'

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example:
```python
import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096  # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    device_map="auto",
)
```
This model was trained with the EleutherAI/gpt-neox-20b tokenizer. It can be invoked directly from this model's repo:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-instruct-orca')
```
Once loaded, the model and tokenizer can be used with the following code:
```python
import torch
import transformers


# text generation function
def mpt_generate(
    model: transformers.AutoModelForCausalLM,
    tokenizer: transformers.AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
) -> str:
    """
    Generate a response to a prompt.

    Uses Hugging Face GenerationConfig defaults:
    https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig

    Args:
        model (transformers.AutoModelForCausalLM): Model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities. Defaults to 1.0.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(device)

    # when running torch modules in lower precision, it is best practice
    # to use the torch.autocast context manager
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )

    # return the generated text only, with the prompt removed
    return decoded_output[len(prompt):]
```
We can now generate text! For example:
prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n" response = mpt_generate( model, tokenizer, prompt, max_new_tokens=150, temperature=0.92, ) print(response)
| runtime / 50 tokens (sec) | GPU | attn | torch dtype | VRAM (GB) |
|---|---|---|---|---|
| 0.61 | 1x H100 (80 GB PCIe) | triton | bfloat16 | 12 |
| 0.67 | 1x H100 (80 GB PCIe) | torch | bfloat16 | 12 |
| 1.17 | 1x A100 (40 GB SXM) | triton | bfloat16 | 13 |
| 1.36 | 1x A100 (40 GB SXM) | torch | bfloat16 | 13 |
| 2.25 | 1x V100 (16 GB SXM) | torch | float16 | 13 |
| 3.75 | 1x V100 (16 GB SXM) | torch | fp4 | 4 |
| 4.84 | 1x Tesla T4 (15 GB) | torch | float16 | 13 |
| 8.77 | 1x Tesla T4 (15 GB) | torch | fp4 | 4 |
The runtime statistics above (leftmost column) were generated with the following code for each test, as per the corresponding notebook.
prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n" runtimes = [] for i in tqdm.tqdm(range(100)): start = time.time() response = mpt_generate( model, tokenizer, prompt, max_new_tokens=50, temperature=0.92, ) end = time.time() runtimes.append(end - start) assert len(tokenizer.encode(response)) == 50 avg_runtime = torch.mean(torch.tensor(runtimes)).item() print(f"Runtime avg in seconds: {avg_runtime}") # time in seconds
This model was finetuned by Daniel Furman on July 5, 2023, and is intended primarily for research purposes.
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
```bibtex
@online{MosaicML2023Introducing,
    author  = {MosaicML NLP Team},
    title   = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year    = {2023},
    url     = {www.mosaicml.com/blog/mpt-7b},
    note    = {Accessed: 2023-07-02},
    urldate = {2023-07-02}
}
```