Dataset:

big_patent

Task:

Summarization

Language:

Multilinguality:

monolingual

Size:

100K<n<1M 10K<n<100K 1M<n<10M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

Preprint:

arxiv:1906.03741

Other:

patent-summarization

License:

cc-by-4.0

Dataset Card for Big Patent

Dataset Summary

BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:

a: Human Necessities
b: Performing Operations; Transporting
c: Chemistry; Metallurgy
d: Textiles; Paper
e: Fixed Constructions
f: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
g: Physics
h: Electricity
y: General tagging of new or cross-sectional technology

The current defaults are version 2.1.2 (updated to cased raw strings) and 'all' CPC codes:

from datasets import load_dataset

ds = load_dataset("big_patent")  # default is 'all' CPC codes
ds = load_dataset("big_patent", "all")  # the same as above
ds = load_dataset("big_patent", "a")  # only 'a' CPC codes
ds = load_dataset("big_patent", codes=["a", "b"])  # only 'a' and 'b' CPC codes

To use version 1.0.0 (lowercased, tokenized words), pass both the codes and version parameters:

ds = load_dataset("big_patent", codes="all", version="1.0.0")
ds = load_dataset("big_patent", codes="a", version="1.0.0")
ds = load_dataset("big_patent", codes=["a", "b"], version="1.0.0")
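
The calling patterns above can be wrapped in a small helper. The sketch below is hypothetical: `big_patent_kwargs` and `VALID_CPC_CODES` are illustrative names, not part of the `datasets` library. It only assembles keyword arguments from the `codes` and `version` parameters described above and checks the codes against the nine CPC categories. It assumes the dataset builder accepts those two parameters as this card describes, and it downloads nothing.

```python
from typing import Dict, List, Optional, Union

# The nine CPC category codes listed in the Dataset Summary.
VALID_CPC_CODES = frozenset("abcdefghy")


def big_patent_kwargs(
    codes: Union[str, List[str]] = "all",
    version: Optional[str] = None,
) -> Dict[str, object]:
    """Hypothetical helper: build keyword arguments for load_dataset."""
    requested = [codes] if isinstance(codes, str) else list(codes)
    for code in requested:
        if code != "all" and code not in VALID_CPC_CODES:
            raise ValueError(f"unknown CPC code: {code!r}")
    kwargs: Dict[str, object] = {"path": "big_patent", "codes": codes}
    # Only pass a version when one is requested, so the default stays in effect.
    if version is not None:
        kwargs["version"] = version
    return kwargs
```

The result could then be unpacked into the real call, for example `load_dataset(**big_patent_kwargs(["a", "b"], version="1.0.0"))`. Validating the codes first surfaces a typo before any download starts.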

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Data Instances

Each instance contains a pair of fields, description and abstract. The description field is extracted from the Description section of the patent, and the abstract field is extracted from the Abstract section.

{
  'description': 'FIELD OF THE INVENTION  \n       [0001]     This invention relates to novel calcium phosphate-coated implantable medical devices and processes of making same. The unique calcium-phosphate coated implantable medical devices minimize...',
  'abstract': 'This invention relates to novel calcium phosphate-coated implantable medical devices...'
}

Data Fields

description: detailed description of the patent.
abstract: patent abstract.
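
As a minimal sketch of working with these two fields, the snippet below types a record and measures how much shorter the abstract is than the description. `PatentRecord` and `word_compression_ratio` are hypothetical names introduced for illustration, and the sample is a shortened copy of the instance shown under Data Instances, so the resulting ratio says nothing about the dataset as a whole.

```python
from typing import TypedDict


class PatentRecord(TypedDict):
    """One instance, with the two fields documented in this card."""
    description: str
    abstract: str


def word_compression_ratio(record: PatentRecord) -> float:
    """Description length divided by abstract length, in whitespace-separated tokens."""
    description_words = len(record["description"].split())
    abstract_words = len(record["abstract"].split())
    if abstract_words == 0:
        raise ValueError("abstract is empty")
    return description_words / abstract_words


# Shortened sample based on the Data Instances example (not a full record).
sample: PatentRecord = {
    "description": (
        "FIELD OF THE INVENTION  \n       [0001]     This invention relates to "
        "novel calcium phosphate-coated implantable medical devices and "
        "processes of making same."
    ),
    "abstract": (
        "This invention relates to novel calcium phosphate-coated "
        "implantable medical devices..."
    ),
}
ratio = word_compression_ratio(sample)
```

Whitespace splitting is a deliberately crude tokenizer here; a real analysis would likely use the tokenizer of the summarization model being trained.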

Data Splits

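|     | train   | validation | test  |
|-----|---------|------------|-------|
| all | 1207222 | 67068      | 67072 |
| a   | 174134  | 9674       | 9675  |
| b   | 161520  | 8973       | 8974  |
| c   | 101042  | 5613       | 5614  |
| d   | 10164   | 565        | 565   |
| e   | 34443   | 1914       | 1914  |
| f   | 85568   | 4754       | 4754  |
| g   | 258935  | 14385      | 14386 |
| h   | 257019  | 14279      | 14279 |
| y   | 124397  | 6911       | 6911  |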
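
The 'all' row is the sum of the nine per-code rows, which can be checked directly. The sketch below copies the numbers from the table above; `SPLIT_SIZES` and `ALL_SIZES` are illustrative names only.

```python
# Per-code split sizes copied from the Data Splits table: (train, validation, test).
SPLIT_SIZES = {
    "a": (174134, 9674, 9675),
    "b": (161520, 8973, 8974),
    "c": (101042, 5613, 5614),
    "d": (10164, 565, 565),
    "e": (34443, 1914, 1914),
    "f": (85568, 4754, 4754),
    "g": (258935, 14385, 14386),
    "h": (257019, 14279, 14279),
    "y": (124397, 6911, 6911),
}
ALL_SIZES = (1207222, 67068, 67072)

# Sum each column (train, validation, test) across the nine CPC codes.
totals = tuple(sum(column) for column in zip(*SPLIT_SIZES.values()))
assert totals == ALL_SIZES

# Total number of records across all three splits.
grand_total = sum(ALL_SIZES)
```

The grand total comes to 1,341,362 records, consistent with the 1.3 million figure in the Dataset Summary, and the splits work out to roughly 90% train, 5% validation, and 5% test.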

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{DBLP:journals/corr/abs-1906-03741,
  author    = {Eva Sharma and
               Chen Li and
               Lu Wang},
  title     = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent
               Summarization},
  journal   = {CoRR},
  volume    = {abs/1906.03741},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.03741},
  eprinttype = {arXiv},
  eprint    = {1906.03741},
  timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @mattbui for adding this dataset.

Author:

Anonymous

Dataset size:

15.45 GB