LED-Based Summarization Model: Condensing Long and Technical Information

The Longformer Encoder-Decoder (LED) for Narrative-Esque Long Text Summarization is a model I fine-tuned from allenai/led-base-16384 to condense extensive technical, academic, and narrative content in a fairly generalizable way.

Key Features and Use Cases

- Ideal for summarizing long narratives, articles, papers, textbooks, and other documents.
  - The SparkNotes-esque style leads to 'explanations' in the summarized content, offering insightful output.
- High capacity: handles up to 16,384 tokens per input.
- Demos: try it out in the notebook linked above or in the demo on Spaces.

Note: The API widget has a max length of ~96 tokens due to inference timeout constraints.
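
Before sending a document to the model, it can help to check whether it fits in the 16,384-token window. The sketch below is a minimal illustration, not part of this model's tooling: the helper names (`count_tokens`, `fits_in_context`) are hypothetical, and the whitespace tokenizer is a stand-in assumption so the example runs without downloading anything. Real token counts from the model's tokenizer will differ.

```python
from typing import Callable, List

MAX_INPUT_TOKENS = 16_384  # context limit stated in this model card


def count_tokens(text: str, tokenize: Callable[[str], List]) -> int:
    """Return the number of tokens `tokenize` produces for `text`."""
    return len(tokenize(text))


def fits_in_context(
    text: str,
    tokenize: Callable[[str], List],
    limit: int = MAX_INPUT_TOKENS,
) -> bool:
    """Return True if `text` tokenizes to at most `limit` tokens."""
    return count_tokens(text, tokenize) <= limit


# Stand-in tokenizer (assumption): plain whitespace splitting.
# For real counts, pass the model tokenizer's `encode` method instead, e.g.
#   tok = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")
#   fits_in_context(text, tok.encode)
whitespace_tokenize = str.split

short_doc = "a short document " * 10  # 30 whitespace tokens
long_doc = "word " * 20_000           # 20,000 whitespace tokens

print(fits_in_context(short_doc, whitespace_tokenize))  # True
print(fits_in_context(long_doc, whitespace_tokenize))   # False
```

Documents that do not fit need to be truncated or split into pieces before summarization.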

Training Details

The model was trained on the BookSum dataset released by Salesforce, which is why it carries the BSD-3-Clause license. Training ran for 16 epochs, with parameters tuned for very gentle fine-tuning (a very low learning rate).

Model checkpoint: pszemraj/led-base-16384-finetuned-booksum

Other Related Checkpoints

This model is the smallest/fastest booksum-tuned model I have worked on. If you're looking for higher quality summaries, check out:

Other variants trained on other datasets are also available on my Hugging Face profile; feel free to try them out :)

Basic Usage

I recommend passing encoder_no_repeat_ngram_size=3 when calling the pipeline object. It prevents the output from copying any 3-token sequence verbatim from the input, which pushes the model toward new vocabulary and a more abstractive summary.

Create the pipeline object:

import torch
from transformers import pipeline

hf_name = "pszemraj/led-base-book-summary"

summarizer = pipeline(
    "summarization",
    hf_name,
    device=0 if torch.cuda.is_available() else -1,
)

Feed the text into the pipeline object:

wall_of_text = "your words here"

result = summarizer(
    wall_of_text,
    min_length=8,
    max_length=256,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=3,
    repetition_penalty=3.5,
    num_beams=4,
    do_sample=False,
    early_stopping=True,
)
print(result[0]["summary_text"])  # the summarization pipeline returns "summary_text"
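
To make the effect of encoder_no_repeat_ngram_size=3 in the call above more concrete, here is a conceptual sketch of the rule it applies: at each decoding step, any token that would complete a 3-gram already present in the input is banned. This is an illustration in plain Python, not the transformers implementation (which applies the same idea to the model's logits), and the function names `build_ngram_index` and `banned_next_tokens` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def build_ngram_index(
    source_ids: Sequence[int], n: int
) -> Dict[Tuple[int, ...], Set[int]]:
    """Map each (n-1)-token prefix found in the source to the tokens that follow it."""
    index: Dict[Tuple[int, ...], Set[int]] = defaultdict(set)
    for i in range(len(source_ids) - n + 1):
        *prefix, last = source_ids[i : i + n]
        index[tuple(prefix)].add(last)
    return index


def banned_next_tokens(
    index: Dict[Tuple[int, ...], Set[int]],
    generated_ids: Sequence[int],
    n: int,
) -> Set[int]:
    """Return tokens that would reproduce a source n-gram if generated next."""
    if len(generated_ids) < n - 1:
        return set()  # not enough context yet to complete an n-gram
    start = len(generated_ids) - (n - 1)
    prefix = tuple(generated_ids[start:])
    return set(index.get(prefix, set()))


# Toy token ids (made up for illustration).
source = [5, 6, 7, 8, 6, 7, 9]
index = build_ngram_index(source, n=3)

# The summary so far ends in (6, 7); the source contains (6, 7, 8) and (6, 7, 9),
# so both 8 and 9 are banned as the next token.
print(sorted(banned_next_tokens(index, [1, 6, 7], n=3)))  # [8, 9]

# The pair (1, 2) never occurs in the source, so nothing is banned.
print(sorted(banned_next_tokens(index, [1, 2], n=3)))  # []
```

Because the summary can never reuse three consecutive input tokens, it has to rephrase rather than extract, which is the abstractive behavior described above.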

Simplified Usage with TextSum

To streamline the process of using this and other models, I've developed a Python package utility named textsum. This package offers simple interfaces for applying summarization models to text documents of arbitrary length.

Install TextSum:

pip install textsum

Then use it in Python with this model:

from textsum.summarize import Summarizer

model_name = "pszemraj/led-base-book-summary"
summarizer = Summarizer(
    model_name_or_path=model_name,  # you can use any Seq2Seq model on the Hub
    token_batch_length=4096,  # how many tokens to batch summarize at a time
)
long_string = "This is a long string of text that will be summarized."
out_str = summarizer.summarize_string(long_string)
print(f"summary: {out_str}")

Currently implemented interfaces include a Python API, a Command-Line Interface (CLI), and a shareable demo/web UI.

For detailed explanations and documentation, check the README or the wiki.
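
The token_batch_length setting in the TextSum example above implies that a long document is processed in fixed-size windows of tokens. The sketch below shows that general idea only; it is not textsum's actual implementation, and the names `batch_tokens` and `summarize_in_batches`, along with the placeholder summarizer, are hypothetical.

```python
from typing import Callable, List, Sequence


def batch_tokens(token_ids: Sequence[int], batch_length: int) -> List[List[int]]:
    """Split token ids into consecutive windows of at most `batch_length` tokens."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    return [
        list(token_ids[i : i + batch_length])
        for i in range(0, len(token_ids), batch_length)
    ]


def summarize_in_batches(
    token_ids: Sequence[int],
    batch_length: int,
    summarize_batch: Callable[[List[int]], str],
) -> List[str]:
    """Summarize each window independently and return the partial summaries in order."""
    return [summarize_batch(batch) for batch in batch_tokens(token_ids, batch_length)]


# Toy input (assumption): 10,000 fake token ids standing in for a tokenized document.
doc_ids = list(range(10_000))
batches = batch_tokens(doc_ids, batch_length=4096)
print([len(b) for b in batches])  # [4096, 4096, 1808]

# Placeholder summarizer (assumption): a real one would decode the ids and run the model.
partials = summarize_in_batches(doc_ids, 4096, lambda b: f"<summary of {len(b)} tokens>")
print(partials[-1])  # <summary of 1808 tokens>
```

A smaller batch length lowers memory use per call at the cost of giving the model less context for each partial summary.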

Author: Peter Szemraj

Dataset size: 1.21 GB