This is the "latest" version of the model, i.e., the one that has been trained the longest, currently at 70k steps.
GOAL:
A summarization model that 1) summarizes the source content accurately and 2) (more important, IMO) produces summaries that are easy to read and understand (*cough* unlike arXiv *cough*).
This model attempts to help with that by using the booksum dataset to provide explanatory summarization.

Explanatory Summary - A summary that both consolidates information and also explains why said consolidated information is important.
This model was trained for seven epochs total (approx. 70,000 steps) and is closer to finished than the earlier checkpoints. It will continue to improve (slowly, now that it has already been trained for a long time) based on any findings/feedback.
The starting checkpoint was google/bigbird-pegasus-large-bigpatent.
example usage
An extended example, including a demo of batch summarization, is here.
create the summarizer object:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import pipeline

# load the booksum-tuned checkpoint and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "pszemraj/bigbird-pegasus-large-K-booksum",
)

# wrap model + tokenizer in a summarization pipeline
summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
)
define the text to be summarized and pass it through the pipeline. Boom, done.
wall_of_text = "your text to be summarized goes here."

result = summarizer(
    wall_of_text,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
)
# the pipeline returns a list of dicts; the summary is under "summary_text"
print(result[0]["summary_text"])
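The same pipeline also handles several documents at once: pass a list of strings and it returns one result dict per input. Below is a minimal sketch of batch summarization; the document strings and batch_size are placeholders to adapt to your own data and hardware.

# minimal batch summarization sketch -- texts below are placeholders
documents = [
    "first long document to be summarized goes here.",
    "second long document to be summarized goes here.",
]

results = summarizer(
    documents,
    min_length=16,
    max_length=256,
    no_repeat_ngram_size=3,
    clean_up_tokenization_spaces=True,
    batch_size=2,  # adjust to fit available memory
)

for res in results:
    print(res["summary_text"])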
Alternate Checkpoint
if experiencing runtime/memory issues, try this earlier checkpoint at 40,000 steps, which is almost as good at the explanatory summarization task but runs faster.
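Switching to the earlier checkpoint only requires changing the repo ID passed to from_pretrained; the ID below is a hypothetical placeholder, since the actual name is behind the link above.

# NOTE: placeholder repo ID -- replace with the actual ID of the linked 40k-step checkpoint
alt_checkpoint = "pszemraj/earlier-40k-step-checkpoint"

model = AutoModelForSeq2SeqLM.from_pretrained(alt_checkpoint, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(alt_checkpoint)
summarizer = pipeline("summarization", model=model, tokenizer=tokenizer)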
see similar summarization models fine-tuned on booksum but using different architectures: long-t5 base and LED-Large.