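Dataset: big_patent
Task: summarization
Language: en
Multilinguality: monolingual
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1906.03741
License: cc-by-4.0

BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:

- A: Human Necessities
- B: Performing Operations; Transporting
- C: Chemistry; Metallurgy
- D: Textiles; Paper
- E: Fixed Constructions
- F: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- G: Physics
- H: Electricity
- Y: General tagging of new or cross-sectional technology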
The current defaults are version 2.1.2 (an update that fixes the data to use cased raw strings) and 'all' CPC codes:
```python
from datasets import load_dataset

ds = load_dataset("big_patent")  # default is 'all' CPC codes
ds = load_dataset("big_patent", "all")  # the same as above
ds = load_dataset("big_patent", "a")  # only 'a' CPC codes
ds = load_dataset("big_patent", codes=["a", "b"])  # only 'a' and 'b' CPC codes
```
To use version 1.0.0 (lowercased, tokenized words), pass both the `codes` and `version` parameters:
```python
from datasets import load_dataset

ds = load_dataset("big_patent", codes="all", version="1.0.0")
ds = load_dataset("big_patent", codes="a", version="1.0.0")
ds = load_dataset("big_patent", codes=["a", "b"], version="1.0.0")
```
The text in the dataset is in English.
Each instance contains a pair of `description` and `abstract`. The `description` is extracted from the Description section of the patent, while the `abstract` is extracted from the Abstract section.
```
{
  'description': 'FIELD OF THE INVENTION \n [0001] This invention relates to novel calcium phosphate-coated implantable medical devices and processes of making same. The unique calcium-phosphate coated implantable medical devices minimize...',
  'abstract': 'This invention relates to novel calcium phosphate-coated implantable medical devices...'
}
```
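To get a feel for how much shorter an abstract is than its description, a record of the shape shown above can be inspected with plain Python. This is a minimal sketch: the `record` literal reuses the truncated strings from the example, and `compression_ratio` is a hypothetical helper name, not part of the `datasets` API.

```python
def compression_ratio(record: dict) -> float:
    """Return description length divided by abstract length, in whitespace tokens."""
    n_description = len(record["description"].split())
    n_abstract = len(record["abstract"].split())
    if n_abstract == 0:
        raise ValueError("abstract is empty")
    return n_description / n_abstract


# Truncated strings copied from the example instance; a real record is far longer.
record = {
    "description": (
        "FIELD OF THE INVENTION \n [0001] This invention relates to novel calcium "
        "phosphate-coated implantable medical devices and processes of making same. "
        "The unique calcium-phosphate coated implantable medical devices minimize..."
    ),
    "abstract": (
        "This invention relates to novel calcium phosphate-coated implantable "
        "medical devices..."
    ),
}

ratio = compression_ratio(record)
print(f"description is {ratio:.1f}x longer than the abstract")
```

Because both strings are truncated here, the printed ratio only illustrates the calculation; it says nothing about the ratios in the full dataset.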
CPC code | train | validation | test |
---|---|---|---|
all | 1207222 | 67068 | 67072 |
a | 174134 | 9674 | 9675 |
b | 161520 | 8973 | 8974 |
c | 101042 | 5613 | 5614 |
d | 10164 | 565 | 565 |
e | 34443 | 1914 | 1914 |
f | 85568 | 4754 | 4754 |
g | 258935 | 14385 | 14386 |
h | 257019 | 14279 | 14279 |
y | 124397 | 6911 | 6911 |
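The per-code rows can be checked against the `all` row, and the split proportions computed, with a few lines of Python. This sketch uses only the numbers from the table above; the variable names are illustrative.

```python
# (train, validation, test) sizes copied from the table above.
SPLITS = {
    "a": (174134, 9674, 9675),
    "b": (161520, 8973, 8974),
    "c": (101042, 5613, 5614),
    "d": (10164, 565, 565),
    "e": (34443, 1914, 1914),
    "f": (85568, 4754, 4754),
    "g": (258935, 14385, 14386),
    "h": (257019, 14279, 14279),
    "y": (124397, 6911, 6911),
}
ALL = (1207222, 67068, 67072)

# Sum each split column across the nine CPC codes.
totals = tuple(sum(sizes[i] for sizes in SPLITS.values()) for i in range(3))
assert totals == ALL, "per-code rows should add up to the 'all' row"

# Share of each split in the full dataset.
grand_total = sum(ALL)
fractions = [n / grand_total for n in ALL]
for name, fraction in zip(("train", "validation", "test"), fractions):
    print(f"{name}: {fraction:.1%}")
```

The per-code rows add up to the `all` row, and the split works out to roughly 90% / 5% / 5% of about 1.34 million records, in line with the "1.3 million records" figure quoted in the description.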
Who are the source language producers? [More Information Needed]
Who are the annotators? [More Information Needed]
```bibtex
@article{DBLP:journals/corr/abs-1906-03741,
  author     = {Eva Sharma and Chen Li and Lu Wang},
  title      = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent Summarization},
  journal    = {CoRR},
  volume     = {abs/1906.03741},
  year       = {2019},
  url        = {http://arxiv.org/abs/1906.03741},
  eprinttype = {arXiv},
  eprint     = {1906.03741},
  timestamp  = {Wed, 26 Jun 2019 07:14:58 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
Thanks to @mattbui for adding this dataset.