Dataset: imvladikon/knesset_meetings_corpus
Tasks: text-generation
Sub-tasks: language-modeling
Language: he
Multilinguality: monolingual
Size: n<1K
Language creators: crowdsourced
Annotation creators: no-annotation
Source datasets: original
License: pddl

An example of a sample:
{ "text": <text content of given document>, "path": <file path to docx> }
Dataset usage
Available configurations: "kneset16", "kneset17", and "knesset_tagged". Only a train split is provided.
from datasets import load_dataset
train_ds = load_dataset("imvladikon/knesset_meetings_corpus", "kneset16", split="train")
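Once loaded, each record exposes the "text" and "path" fields shown in the sample above. The following is a minimal sketch of how such records might be summarized. The `summarize` helper and the `demo_records` list are hypothetical names introduced for illustration, and the demo uses in-memory stand-in records rather than the real corpus.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(records: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count documents, lines, and whitespace-separated tokens across records."""
    totals = Counter()
    for record in records:
        text = record["text"]
        totals["documents"] += 1
        totals["lines"] += len(text.splitlines())
        totals["tokens"] += len(text.split())
    return dict(totals)


# Hypothetical stand-in records with the same two fields as a dataset sample.
demo_records = [
    {"text": "שלום רב\nישיבה ראשונה", "path": "demo/a.docx"},
    {"text": "הצעת חוק", "path": "demo/b.docx"},
]
demo_summary = summarize(demo_records)
print(demo_summary)
```

Since iterating over a `datasets` split yields dictionaries, passing `train_ds` to `summarize` should work the same way, assuming the records carry the fields described above.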
The Knesset Meetings Corpus 2004-2005 is made up of two components:
Raw texts - 282 files made up of 867,725 lines together. These can be downloaded in two formats:
As doc files, encoded using windows-1255 encoding:
kneset16.zip - Contains 164 text files made up of 543,228 lines together. [MILA host] <http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/docs/kneset16.zip> [Github Mirror] <https://github.com/NLPH/knesset-2004-2005/blob/master/kneset16.zip?raw=true>
kneset17.zip - Contains 118 text files made up of 324,497 lines together. [MILA host] <http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/docs/kneset17.zip> [Github Mirror] <https://github.com/NLPH/knesset-2004-2005/blob/master/kneset17.zip?raw=true>
As txt files, encoded using utf8 encoding:
kneset.tar.gz - An archive of all the raw text files, divided into two folders: [Github mirror] <https://github.com/NLPH/knesset-2004-2005/blob/master/kneset.tar.gz>
16 - Contains 164 text files made up of 543,228 lines together.
17 - Contains 118 text files made up of 324,497 lines together.
knesset_txt_16.tar.gz - Contains 164 text files made up of 543,228 lines together. [MILA host] <http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/utf8/knesset_txt_16.tar.gz> [Github Mirror] <https://github.com/NLPH/knesset-2004-2005/blob/master/knesset_txt_16.tar.gz?raw=true>
knesset_txt_17.zip - Contains 118 text files made up of 324,497 lines together. [MILA host] <http://yeda.cs.technion.ac.il:8088/corpus/software/corpora/knesset/txt/utf8/knesset_txt_17.zip> [Github Mirror] <https://github.com/NLPH/knesset-2004-2005/blob/master/knesset_txt_17.zip?raw=true>
Tokenized and morphologically tagged texts - Tagged versions exist only for the files in the 16 folder. The texts are encoded using MILA's XML schema for corpora <http://www.mila.cs.technion.ac.il/eng/resources_standards.html>. These can be downloaded in two ways:
This repository is a mirror of this dataset found on MILA's website <http://www.mila.cs.technion.ac.il/eng/resources_corpora_haknesset.html>.
Zenodo mirror: https://zenodo.org/record/2707356
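The raw texts listed above come in both windows-1255 and utf8 encodings. Converting between the two could look roughly like the sketch below. It assumes the downloaded files contain plain windows-1255 encoded text bytes. The function name `transcode_cp1255_to_utf8` and the demo string are hypothetical, and `cp1255` is the name of Python's codec for windows-1255.

```python
def transcode_cp1255_to_utf8(raw: bytes) -> bytes:
    """Decode windows-1255 bytes and re-encode them as UTF-8."""
    # Assumption: the input is plain text in windows-1255; invalid bytes raise an error.
    text = raw.decode("cp1255")
    return text.encode("utf-8")


# Hypothetical demo: build windows-1255 bytes in memory instead of reading a corpus file.
demo_text = "שלום\nכנסת"
legacy_bytes = demo_text.encode("cp1255")
converted = transcode_cp1255_to_utf8(legacy_bytes)
print(converted.decode("utf-8"))
```

Each Hebrew letter takes one byte in windows-1255 and two bytes in UTF-8, so converted files are expected to be larger than the originals.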
All Knesset meeting protocols are in the public domain <https://en.wikipedia.org/wiki/Public_domain> (רשות הציבור <https://he.wikipedia.org/wiki/%D7%A8%D7%A9%D7%95%D7%AA_%D7%94%D7%A6%D7%99%D7%91%D7%95%D7%A8>) by law. These files are thus in the public domain and do not require any license or public domain dedication to set their status.
The dataset is available under the Open Data Commons Public Domain Dedication & License 1.0.