Datasets:

big_patent

Languages:

en

Multilinguality:

monolingual

Language Creators:

found

Annotations Creators:

no-annotation

Source Datasets:

original

ArXiv:

arxiv:1906.03741

License:

cc-by-4.0

Dataset Card for Big Patent

Dataset Summary

BIGPATENT consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Each U.S. patent application is filed under a Cooperative Patent Classification (CPC) code. There are nine such classification categories:

  • a: Human Necessities
  • b: Performing Operations; Transporting
  • c: Chemistry; Metallurgy
  • d: Textiles; Paper
  • e: Fixed Constructions
  • f: Mechanical Engineering; Lighting; Heating; Weapons; Blasting
  • g: Physics
  • h: Electricity
  • y: General tagging of new or cross-sectional technology

The current defaults are version 2.1.2 (a fix that updates the data to cased raw strings) and 'all' CPC codes:

from datasets import load_dataset

ds = load_dataset("big_patent")  # default is 'all' CPC codes
ds = load_dataset("big_patent", "all")  # the same as above
ds = load_dataset("big_patent", "a")  # only 'a' CPC codes
ds = load_dataset("big_patent", codes=["a", "b"])  # only 'a' and 'b' CPC codes

To use version 1.0.0 (lowercased, tokenized words), pass both the codes and version parameters:

ds = load_dataset("big_patent", codes="all", version="1.0.0")
ds = load_dataset("big_patent", codes="a", version="1.0.0")
ds = load_dataset("big_patent", codes=["a", "b"], version="1.0.0")
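
The calls above accept either 'all', a single CPC code, or a list of codes. As a minimal sketch, a small wrapper can check the codes against the nine categories listed earlier before anything is downloaded. The names CPC_CATEGORIES and big_patent_kwargs are hypothetical helpers introduced here for illustration; they are not part of the datasets library.

```python
# Hypothetical helper, not part of the `datasets` library.
# Keys are the nine CPC codes listed in the Dataset Summary.
CPC_CATEGORIES = {
    "a": "Human Necessities",
    "b": "Performing Operations; Transporting",
    "c": "Chemistry; Metallurgy",
    "d": "Textiles; Paper",
    "e": "Fixed Constructions",
    "f": "Mechanical Engineering; Lighting; Heating; Weapons; Blasting",
    "g": "Physics",
    "h": "Electricity",
    "y": "General tagging of new or cross-sectional technology",
}


def big_patent_kwargs(codes="all", version=None):
    """Validate CPC codes and build keyword arguments for load_dataset."""
    if codes != "all":
        requested = [codes] if isinstance(codes, str) else list(codes)
        unknown = sorted(set(requested) - set(CPC_CATEGORIES))
        if unknown:
            raise ValueError(f"Unknown CPC codes: {unknown}")
    kwargs = {"codes": codes}
    if version is not None:
        # Only pass a version when one is requested, so the default applies otherwise.
        kwargs["version"] = version
    return kwargs


# Usage (commented out because it downloads data):
# from datasets import load_dataset
# ds = load_dataset("big_patent", **big_patent_kwargs(["a", "b"], version="1.0.0"))
```

Failing early on an unknown code avoids starting a large download with a mistyped configuration.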

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English

Dataset Structure

Data Instances

Each instance contains a description and abstract pair. The description is extracted from the Description section of the patent, while the abstract is extracted from the Abstract section.

{
  'description': 'FIELD OF THE INVENTION  \n       [0001]     This invention relates to novel calcium phosphate-coated implantable medical devices and processes of making same. The unique calcium-phosphate coated implantable medical devices minimize...',
  'abstract': 'This invention relates to novel calcium phosphate-coated implantable medical devices...'
}
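
Since each instance is a dictionary with these two string fields, simple per-record statistics are easy to compute. The sketch below measures how much shorter an abstract is than its description, using whitespace tokens as a rough proxy. The function name compression_ratio and the toy record are assumptions made for illustration; the toy text is not taken from the dataset.

```python
def compression_ratio(instance):
    """Return description length divided by abstract length, in whitespace tokens."""
    description_tokens = instance["description"].split()
    abstract_tokens = instance["abstract"].split()
    if not abstract_tokens:
        raise ValueError("abstract is empty")
    return len(description_tokens) / len(abstract_tokens)


# Toy record with the same two keys as a BIGPATENT instance (made-up text).
example = {
    "description": "FIELD OF THE INVENTION This invention relates to coated medical devices .",
    "abstract": "Coated medical devices .",
}

ratio = compression_ratio(example)
print(ratio)
```

With a loaded split, the same function could be applied to each record, for example compression_ratio(ds["train"][0]).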

Data Fields

  • description: detailed description of the patent.
  • abstract: patent abstract.

Data Splits

codes      train  validation    test
all      1207222       67068   67072
a         174134        9674    9675
b         161520        8973    8974
c         101042        5613    5614
d          10164         565     565
e          34443        1914    1914
f          85568        4754    4754
g         258935       14385   14386
h         257019       14279   14279
y         124397        6911    6911
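
As a quick sanity check on the table, the per-code rows should add up to the 'all' row in every split. The sketch below copies the numbers from the table into a plain dictionary and verifies this. The names SPLIT_SIZES and per_code_totals are illustrative and not part of any library.

```python
# Split sizes copied from the table above: (train, validation, test).
SPLIT_SIZES = {
    "all": (1207222, 67068, 67072),
    "a": (174134, 9674, 9675),
    "b": (161520, 8973, 8974),
    "c": (101042, 5613, 5614),
    "d": (10164, 565, 565),
    "e": (34443, 1914, 1914),
    "f": (85568, 4754, 4754),
    "g": (258935, 14385, 14386),
    "h": (257019, 14279, 14279),
    "y": (124397, 6911, 6911),
}


def per_code_totals(sizes):
    """Sum train, validation, and test counts over every configuration except 'all'."""
    rows = [counts for code, counts in sizes.items() if code != "all"]
    return tuple(sum(column) for column in zip(*rows))


totals = per_code_totals(SPLIT_SIZES)
assert totals == SPLIT_SIZES["all"]

# Total number of records across the three splits of the 'all' configuration.
total_records = sum(SPLIT_SIZES["all"])
print(total_records)
```

The total across the three splits is consistent with the roughly 1.3 million records stated in the Dataset Summary.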

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{DBLP:journals/corr/abs-1906-03741,
  author    = {Eva Sharma and
               Chen Li and
               Lu Wang},
  title     = {{BIGPATENT:} {A} Large-Scale Dataset for Abstractive and Coherent
               Summarization},
  journal   = {CoRR},
  volume    = {abs/1906.03741},
  year      = {2019},
  url       = {http://arxiv.org/abs/1906.03741},
  eprinttype = {arXiv},
  eprint    = {1906.03741},
  timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Contributions

Thanks to @mattbui for adding this dataset.