Model:
pszemraj/long-t5-tglobal-base-16384-book-summary
Task:
Summarization
DOI:
10.57967/hf/0100
Summarize long text and get a SparkNotes-esque summary of arbitrary topics!
A summary of the infamous Navy SEALs copypasta:
The narrator tells us that he's graduated from the Navy seals and has been involved in many secret raids. He's also one of the best snipers in the entire U.S. military. He promises to "wipe you out with precision" when they meet again.
A fine-tuned version of google/long-t5-tglobal-base on the kmfoda/booksum dataset:
Read the paper by Guo et al. here: LongT5: Efficient Text-To-Text Transformer for Long Sequences
Install/update transformers:
    pip install -U transformers
Summarize text with pipeline:
    import torch
    from transformers import pipeline

    summarizer = pipeline(
        "summarization",
        "pszemraj/long-t5-tglobal-base-16384-book-summary",
        device=0 if torch.cuda.is_available() else -1,
    )
    long_text = "Here is a lot of text I don't want to read. Replace me"

    result = summarizer(long_text)
    print(result[0]["summary_text"])
Pass additional beam search and text-generation parameters when calling summarizer to get even higher-quality results.
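As a rough sketch of what that can look like: the pipeline forwards extra keyword arguments to the model's generate() method, so standard generation options such as num_beams, no_repeat_ngram_size, and early_stopping can be supplied at call time. The specific values below, and the helper name summarize_with_beam_search, are illustrative assumptions rather than tuned settings from this model card.

```python
# Illustrative beam search settings (assumed values, not the author's tuned ones).
BEAM_SEARCH_KWARGS = {
    "max_length": 256,
    "min_length": 8,
    "num_beams": 4,
    "no_repeat_ngram_size": 3,
    "repetition_penalty": 2.5,
    "early_stopping": True,
}


def summarize_with_beam_search(
    text,
    model_name="pszemraj/long-t5-tglobal-base-16384-book-summary",
):
    """Summarize `text`, forwarding the beam search settings to generation."""
    # Imported lazily so the settings above can be inspected without loading the model.
    import torch
    from transformers import pipeline

    summarizer = pipeline(
        "summarization",
        model_name,
        device=0 if torch.cuda.is_available() else -1,
    )
    # Extra keyword arguments are passed through to model.generate().
    result = summarizer(text, **BEAM_SEARCH_KWARGS)
    return result[0]["summary_text"]
```

Calling summarize_with_beam_search(long_text) downloads the model on first use. More beams generally trade speed for quality, so it may be worth starting small and increasing num_beams only if the summaries look weak.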
kmfoda/booksum dataset on Hugging Face - read the original paper here. Summaries longer than 1024 LongT5 tokens were filtered out to prevent the model from learning to generate "partial" summaries.
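The length filter described above can be sketched roughly as follows. This is not the author's preprocessing code: the function name filter_long_summaries, the summary_text key, and the whitespace-based token counter are all assumptions made so the example runs without downloading anything.

```python
def filter_long_summaries(examples, count_tokens, max_tokens=1024,
                          summary_key="summary_text"):
    """Keep only examples whose summary has at most `max_tokens` tokens."""
    return [ex for ex in examples if count_tokens(ex[summary_key]) <= max_tokens]


def whitespace_token_count(text):
    """Stand-in token counter for demonstration: counts whitespace-separated words."""
    return len(text.split())


# Two toy examples: one short summary, one far over the limit.
examples = [
    {"summary_text": "short summary"},
    {"summary_text": "word " * 2000},
]
kept = filter_long_summaries(examples, whitespace_token_count)
print(len(kept))  # 1
```

To count real LongT5 tokens instead, count_tokens could wrap a tokenizer loaded with AutoTokenizer.from_pretrained("google/long-t5-tglobal-base"), for example lambda s: len(tokenizer(s).input_ids).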
See summarize.py in the code for my hf space Document Summarization :)
You can also use the same code to split a document into batches of, say, 4096 tokens and run the model over each batch in turn. This is useful when CUDA memory is limited.
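A minimal sketch of the batching idea, assuming the document has already been tokenized into a flat list of token ids. This is not the summarize.py implementation; the function name split_into_batches is hypothetical, and a real pipeline might also overlap batches or pad the last one.

```python
def split_into_batches(token_ids, batch_length=4096):
    """Split a flat list of token ids into consecutive chunks of at most `batch_length`."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    return [
        token_ids[i:i + batch_length]
        for i in range(0, len(token_ids), batch_length)
    ]


# A fake 10,000-token document, represented by integer ids.
token_ids = list(range(10000))
batches = split_into_batches(token_ids)
print([len(b) for b in batches])  # [4096, 4096, 1808]
```

Each batch can then be decoded and summarized separately, which keeps peak memory roughly bounded by the batch length instead of the full document length.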
See train with a script and the summarization scripts.
This model was originally tuned on Google Colab with a heavily modified variant of the longformer training notebook, the key enabler being DeepSpeed. You can try this as an alternate route to fine-tuning the model without using the command line.
I created a Python package utility called textsum; you can use it to load models and summarize text in a few lines of code.
pip install textsum
Use textsum in Python with this model:
    from textsum.summarize import Summarizer

    summarizer = Summarizer(
        model_name_or_path="pszemraj/long-t5-tglobal-base-16384-book-summary"
    )
    long_string = "This is a long string of text that will be summarized."

    out_str = summarizer.summarize_string(long_string)
    print(f"summary: {out_str}")
This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.
For details, explanations, and documentation, see the README (linked above) or the wiki.
NOTE: early checkpoints of this model were trained on a smaller subset of the dataset, because it had been filtered to summaries of at most 1024 characters. This was subsequently caught and corrected to 1024 tokens, and the model was then trained further for 10+ epochs.
The following hyperparameters were used during the most recent training round*:
* Prior training sessions used roughly similar parameters; multiple sessions were required because the model takes a very long time to train.
If you find pszemraj/long-t5-tglobal-base-16384-book-summary useful in your work, please consider citing this model :)
    @misc{peter_szemraj_2022,
        author    = {{Peter Szemraj}},
        title     = {long-t5-tglobal-base-16384-book-summary (Revision 4b12bce)},
        year      = 2022,
        url       = {https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary},
        doi       = {10.57967/hf/0100},
        publisher = {Hugging Face}
    }