Model:
dfurman/mpt-7b-instruct-orca
This short-form instruction-following model was built by finetuning MPT-7B on the first 100k rows of the ehartford/dolphin dataset (an open-source implementation of Microsoft's Orca).
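For reference, a minimal sketch of pulling the first 100k rows of ehartford/dolphin with the datasets library is shown below. The split handling is an assumption: the repo's actual file layout may require an explicit data_files argument.

```python
from datasets import load_dataset

# Assumption: the repo resolves to a single default "train" split; if it does
# not, pass data_files= pointing at the desired jsonl file(s).
dolphin = load_dataset("ehartford/dolphin", split="train[:100000]")
print(dolphin[0])
```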
Example 1:
User:
You are a helpful assistant. Write me a numbered list of things to do in New York City.\n
mpt-7b-instruct-orca:
Example 2:
User:
You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out in a step by step fashion to make sure we have the right answer.\n
mpt-7b-instruct-orca:
Step 1: Identify the days of the week when the barber works.
The barber works on Mondays, Wednesdays, and Fridays.
Step 2: Determine the day of the week on which Daniel went for a haircut.
Daniel went for a haircut on Sunday.
Step 3: Compare the day of the week on which Daniel went for a haircut to the days when the barber works.
Sunday is not one of the days when the barber works.
Step 4: Determine if the statement makes logical sense.
The statement does not make logical sense because Daniel went for a haircut on a day when the barber does not work.
Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day when the barber works, such as Monday, Wednesday, or Friday.
Example 3:
User:
You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n
mpt-7b-instruct-orca:
Subject: Invitation to Dinner Party on Friday
Dear Friend,
I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.
The dinner will be held at my home, and the time is from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.
The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.
Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!
Yours,
Your Friendly Assistant
The architecture is a modification of a standard decoder-only transformer. The model departs from a standard transformer in several ways; notably, it replaces learned positional embeddings with ALiBi and uses performance-optimized attention implementations such as FlashAttention (see MPT-7B for the full list of changes). Its hyperparameters are:
| Hyperparameter | Value |
|---|---|
| n_parameters | 6.65B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
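If you want to confirm these values programmatically, one option is to read them off the model's config. This is a sketch: the attribute names assume the custom MPT config class bundled with the repo (max_seq_len is the only one used elsewhere in this card).

```python
import transformers

name = "dfurman/mpt-7b-instruct-orca"
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)

# attribute names assume the custom MPT config class shipped with the repo
print("n_layers:", config.n_layers)
print("n_heads:", config.n_heads)
print("d_model:", config.d_model)
print("vocab_size:", config.vocab_size)
print("max_seq_len:", config.max_seq_len)
```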
This model was trained on a single H100 (80 GB PCIe) for about 12 hours using the Lambda Labs platform.
Run: July 5, 2023 (link)
Plot derived from the tfevents log at runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca (link).
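If you want to regenerate a plot like this from the tfevents log, one option is TensorBoard's EventAccumulator. This is a sketch; the scalar tag name below is a placeholder and should be replaced with one of the tags it prints.

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# path to the run directory containing the tfevents file
ea = EventAccumulator("runs/jul_5_23_3_15_00_sft-instruction-mpt-7b-orca")
ea.Reload()

print(ea.Tags()["scalars"])  # list the available scalar tags

# "train/loss" is a placeholder tag; substitute one printed above
events = ea.Scalars("train/loss")
steps = [e.step for e in events]
values = [e.value for e in events]
```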
For more details on the pretraining process, see MPT-7B.
The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.
The following language is modified from EleutherAI's GPT-NeoX-20B:
This model can produce factually incorrect output, and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Basic Usage: Notebook
Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.
It includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.
First, install the package dependencies:
```python
!pip install -q -U transformers einops accelerate torch
!pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
```
Basic model loading:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    'dfurman/mpt-7b-instruct-orca',
    trust_remote_code=True,
    device_map="auto",
)
```
To use the optimized triton implementation of FlashAttention, you can load the model on GPU with attn_impl='triton' and with bfloat16 precision:
```python
import torch
import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'meta'

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example:
```python
import transformers

name = 'dfurman/mpt-7b-instruct-orca'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096  # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    device_map="auto",
)
```
This model was trained with the EleutherAI/gpt-neox-20b tokenizer. It can be invoked directly from this model's repo:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-instruct-orca')
```
Once loaded, the model and tokenizer can be used with the following code:
```python
import torch
import transformers


# text generation function
def mpt_generate(
    model: transformers.AutoModelForCausalLM,
    tokenizer: transformers.AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 1.0,
) -> str:
    """
    Generate a response to a prompt.

    Uses Hugging Face GenerationConfig defaults:
    https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig

    Args:
        model (transformers.AutoModelForCausalLM): Model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities. Defaults to 1.0.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(device)

    # when running torch modules in lower precision, it is best practice
    # to use the torch.autocast context manager
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )

    # return the generated text only, with the prompt removed
    return decoded_output[len(prompt):]
```
We can now generate text! For example:
prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n" response = mpt_generate( model, tokenizer, prompt, max_new_tokens=150, temperature=0.92, ) print(response)
| runtime / 50 tokens (sec) | GPU | attn | torch dtype | VRAM (GB) |
|---|---|---|---|---|
| 0.61 | 1x H100 (80 GB PCIe) | triton | bfloat16 | 12 |
| 0.67 | 1x H100 (80 GB PCIe) | torch | bfloat16 | 12 |
| 1.17 | 1x A100 (40 GB SXM) | triton | bfloat16 | 13 |
| 1.36 | 1x A100 (40 GB SXM) | torch | bfloat16 | 13 |
| 2.25 | 1x V100 (16 GB SXM) | torch | float16 | 13 |
| 3.75 | 1x V100 (16 GB SXM) | torch | fp4 | 4 |
| 4.84 | 1x Tesla T4 (15 GB) | torch | float16 | 13 |
| 8.77 | 1x Tesla T4 (15 GB) | torch | fp4 | 4 |
The runtime statistics above (leftmost column) were generated with the following code for each test, as per the corresponding notebook.
prompt = "You are a helpful assistant. Write me a long list of things to do in San Francisco:\n" runtimes = [] for i in tqdm.tqdm(range(100)): start = time.time() response = mpt_generate( model, tokenizer, prompt, max_new_tokens=50, temperature=0.92, ) end = time.time() runtimes.append(end - start) assert len(tokenizer.encode(response)) == 50 avg_runtime = torch.mean(torch.tensor(runtimes)).item() print(f"Runtime avg in seconds: {avg_runtime}") # time in seconds
This model was finetuned by Daniel Furman on July 5, 2023, and is intended primarily for research purposes.
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
```bibtex
@online{MosaicML2023Introducing,
    author  = {MosaicML NLP Team},
    title   = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year    = {2023},
    url     = {www.mosaicml.com/blog/mpt-7b},
    note    = {Accessed: 2023-07-02},
    urldate = {2023-07-02}
}
```