Model:
pszemraj/led-base-book-summary
The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I fine-tuned from allenai/led-base-16384 to condense extensive technical, academic, and narrative content in a fairly generalizable way.
Note: The API widget has a max length of ~96 tokens due to inference timeout constraints.
The model was trained on the BookSum dataset released by Salesforce, which is why this checkpoint carries the bsd-3-clause license. Training ran for 16 epochs with hyperparameters chosen for gentle fine-tuning (a very low learning rate).
Model checkpoint: pszemraj/led-base-16384-finetuned-booksum.
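As a rough illustration of that kind of gentle fine-tuning setup, here is a minimal sketch using `Seq2SeqTrainingArguments`; only the epoch count and the idea of a very low learning rate come from the description above, and every other value (the exact rate, batch size, accumulation steps) is an assumption, not the configuration actually used for this checkpoint.

```python
# Sketch only: aside from num_train_epochs, these values are illustrative assumptions,
# not the hyperparameters actually used to train this checkpoint.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="led-base-booksum",
    num_train_epochs=16,              # from the description above
    learning_rate=1e-5,               # assumed: a "super low" rate for gentle fine-tuning
    per_device_train_batch_size=1,    # assumed: long 16k-token inputs force small batches
    gradient_accumulation_steps=16,   # assumed
    predict_with_generate=True,
)
```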
This model is the smallest/fastest booksum-tuned model I have worked on. If you're looking for higher-quality summaries, check out the larger booksum-tuned checkpoints on my HF profile.
There are also variants fine-tuned on other datasets there; feel free to try them out :)
I recommend setting encoder_no_repeat_ngram_size=3 when calling the pipeline object, as it improves summary quality by encouraging the model to use new vocabulary and produce a genuinely abstractive summary.
Create the pipeline object:
```python
import torch
from transformers import pipeline

hf_name = "pszemraj/led-base-book-summary"

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)
```
Feed the text into the pipeline object:
```python
wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
print(result[0]["summary_text"])
```
To streamline the process of using this and other models, I've developed a Python utility package named textsum. It offers simple interfaces for applying summarization models to text documents of arbitrary length.
Install TextSum:
```bash
pip install textsum
```
Then use it in Python with this model:
```python
from textsum.summarize import Summarizer

model_name = "pszemraj/led-base-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,        # how many tokens to batch summarize at a time
)

long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")
```
Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a shareable demo/web UI.
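For reference, the CLI and demo can be invoked roughly as follows; the command names are as I recall them from the textsum documentation, so treat this as a sketch and check `--help` (and the README/wiki linked below) for the exact options and how to select a model.

```bash
# summarize every text file in a directory via the CLI (run textsum-dir --help for options)
textsum-dir /path/to/docs

# launch the shareable gradio demo/web UI
textsum-ui
```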
For detailed explanations and documentation, check the README or the wiki.