Dataset: big_patent
Tasks:
Languages:
Multilinguality: monolingual
Language creators: found
Annotation creators: no-annotation
Source datasets: original
Preprint: arxiv:1906.03741
License:
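BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:
- a: Human Necessities
- b: Performing Operations; Transporting
- c: Chemistry; Metallurgy
- d: Textiles; Paper
- e: Fixed Constructions
- f: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
- g: Physics
- h: Electricity
- y: General tagging of new or cross-sectional technology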
The current defaults are version 2.1.2 (which updates the data to cased raw strings) and 'all' CPC codes:
from datasets import load_dataset
ds = load_dataset("big_patent") # default is 'all' CPC codes
ds = load_dataset("big_patent", "all") # the same as above
ds = load_dataset("big_patent", "a") # only 'a' CPC codes
ds = load_dataset("big_patent", codes=["a", "b"])
To use version 1.0.0 (lowercased, tokenized words), pass both the codes and version parameters:
ds = load_dataset("big_patent", codes="all", version="1.0.0")
ds = load_dataset("big_patent", codes="a", version="1.0.0")
ds = load_dataset("big_patent", codes=["a", "b"], version="1.0.0")
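The calls above accept 'all', a single code, or a list of codes. As a minimal sketch, a selection could be validated before it is passed to load_dataset. The helper name normalize_codes and the VALID_CODES constant are hypothetical and are not part of the datasets library; the code letters are taken from the split table in this card.

```python
# Hypothetical helper, not part of the `datasets` library.
# The nine CPC code letters used as configuration names in this card.
VALID_CODES = ("a", "b", "c", "d", "e", "f", "g", "h", "y")


def normalize_codes(codes="all"):
    """Return a list of CPC codes for 'all', a single code, or a list of codes."""
    if codes == "all":
        return list(VALID_CODES)
    if isinstance(codes, str):
        codes = [codes]
    normalized = [c.lower() for c in codes]
    unknown = sorted(set(normalized) - set(VALID_CODES))
    if unknown:
        raise ValueError(f"Unknown CPC codes: {unknown}")
    # Deduplicate while keeping the order of VALID_CODES.
    return [c for c in VALID_CODES if c in normalized]


selection = normalize_codes(["B", "a"])
print(selection)
```

The resulting list could then be passed as the codes argument shown above, so that a typo in a code letter fails early rather than during the download.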
The text in the dataset is in English.
Each instance contains a pair of fields, description and abstract. The description is extracted from the Description section of the patent, while the abstract is extracted from the Abstract section.
{
'description': 'FIELD OF THE INVENTION \n [0001] This invention relates to novel calcium phosphate-coated implantable medical devices and processes of making same. The unique calcium-phosphate coated implantable medical devices minimize...',
'abstract': 'This invention relates to novel calcium phosphate-coated implantable medical devices...'
}
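To illustrate how the two fields relate, here is a rough sketch that compares their lengths in whitespace-separated words. The function name length_stats is hypothetical, and the sample record reuses the truncated strings from the example above, so the numbers it yields do not reflect full patent documents.

```python
def length_stats(example):
    """Count whitespace-separated words in both fields of one instance."""
    description_words = len(example["description"].split())
    abstract_words = len(example["abstract"].split())
    # Guard against an empty abstract to avoid division by zero.
    ratio = description_words / abstract_words if abstract_words else float("inf")
    return {
        "description_words": description_words,
        "abstract_words": abstract_words,
        "ratio": ratio,
    }


# Truncated strings copied from the example instance in this card.
sample = {
    "description": "FIELD OF THE INVENTION \n [0001] This invention relates to novel "
    "calcium phosphate-coated implantable medical devices and processes of making same. "
    "The unique calcium-phosphate coated implantable medical devices minimize...",
    "abstract": "This invention relates to novel calcium phosphate-coated implantable "
    "medical devices...",
}

stats = length_stats(sample)
print(stats)
```

The same function could be applied to instances loaded with load_dataset, since each one exposes the same two keys.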
| | train | validation | test |
|---|---|---|---|
| all | 1207222 | 67068 | 67072 |
| a | 174134 | 9674 | 9675 |
| b | 161520 | 8973 | 8974 |
| c | 101042 | 5613 | 5614 |
| d | 10164 | 565 | 565 |
| e | 34443 | 1914 | 1914 |
| f | 85568 | 4754 | 4754 |
| g | 258935 | 14385 | 14386 |
| h | 257019 | 14279 | 14279 |
| y | 124397 | 6911 | 6911 |
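As a quick consistency check on the table above, the per-code rows can be summed and compared with the 'all' row. This is a small sketch with the figures copied from the table; the names SPLIT_SIZES and summed_by_code are illustrative only.

```python
# (train, validation, test) sizes copied from the table in this card.
SPLIT_SIZES = {
    "all": (1207222, 67068, 67072),
    "a": (174134, 9674, 9675),
    "b": (161520, 8973, 8974),
    "c": (101042, 5613, 5614),
    "d": (10164, 565, 565),
    "e": (34443, 1914, 1914),
    "f": (85568, 4754, 4754),
    "g": (258935, 14385, 14386),
    "h": (257019, 14279, 14279),
    "y": (124397, 6911, 6911),
}


def summed_by_code(sizes):
    """Sum each split column over every configuration except 'all'."""
    per_code = [counts for code, counts in sizes.items() if code != "all"]
    return tuple(sum(column) for column in zip(*per_code))


totals = summed_by_code(SPLIT_SIZES)
print(totals == SPLIT_SIZES["all"])  # True
```

The check passes, which indicates that the 'all' configuration is the union of the nine per-code configurations.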
@article{DBLP:journals/corr/abs-1906-03741,
author = {Eva Sharma and
Chen Li and
Lu Wang},
title = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent
Summarization},
journal = {CoRR},
volume = {abs/1906.03741},
year = {2019},
url = {http://arxiv.org/abs/1906.03741},
eprinttype = {arXiv},
eprint = {1906.03741},
timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Thanks to @mattbui for adding this dataset.