Dataset Card for ParlamentoPT

Dataset Summary

ParlamentoPT is a Portuguese-language dataset obtained by collecting publicly available documents containing transcriptions of debates in the Portuguese Parliament. The data was collected from the Portuguese Parliament portal in accordance with its open data policy.

This dataset was collected to create the Albertina-PT* language model and serves as training data for its development. The development of the model is a collaborative effort between the University of Lisbon and the University of Porto, in Portugal.

Citation

When using or citing this dataset, please cite the following publication:

@misc{albertina-pt,
      title={Advancing Neural Encoding of Portuguese
             with Transformer Albertina PT-*}, 
      author={João Rodrigues and Luís Gomes and João Silva and
              António Branco and Rodrigo Santos and
              Henrique Lopes Cardoso and Tomás Osório},
      year={2023},
      eprint={2305.06721},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgments

The research reported here was partially supported by: PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016; research project ALBERTINA - Foundation Encoder Model for Portuguese and AI, funded by FCT—Fundação para a Ciência e Tecnologia under the grant CPCA-IAC/AV/478394/2022; innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização; and LIACC - Laboratory for AI and Computer Science, funded by FCT—Fundação para a Ciência e Tecnologia under the grant FCT/UID/CEC/0027/2020.

Author:

PORTULAN

Dataset size:

2.52 GB