Dataset:
Aisha/BAAD6
BAAD6 is an authorship attribution dataset for Bengali literature. It was collected and analyzed by Hemayet et al. [1]. The data was obtained from online posts and blogs. The dataset is balanced across its 6 authors, with 350 sample texts per author. It is relatively small, and it is noisy because of the sources it was collected from and the cleaning procedure applied to it. Nonetheless, it may help evaluate authorship attribution systems, since it resembles the kind of text commonly found on the Internet. Details about the dataset are given in the table below.
Author | Samples | Word count | Unique words |
---|---|---|---|
fe | 350 | 357k | 53k |
ij | 350 | 391k | 72k |
mk | 350 | 377k | 47k |
rn | 350 | 231k | 50k |
hm | 350 | 555k | 72k |
rg | 350 | 391k | 58k |
Total | 2,100 | 2,304,338 | 230,075 |
Average | 350 | 384,056.33 | 59,006.67 |
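The Total and Average rows can be checked with a few lines of arithmetic. The sketch below is only an illustration built from the figures in the table; the variable names are hypothetical, and the per-author unique-word counts are the rounded values shown above. It also suggests, as an inference rather than something stated by the authors, that the total of 230,075 unique words is counted over the combined corpus, since it is smaller than the sum of the per-author vocabularies.

```python
# Figures copied from the table above; names are illustrative only.
AUTHORS = ["fe", "ij", "mk", "rn", "hm", "rg"]
SAMPLES_PER_AUTHOR = 350
TOTAL_WORDS = 2_304_338
TOTAL_UNIQUE_WORDS = 230_075

# Rounded per-author unique-word counts, in thousands (as printed in the table).
unique_words_k = {"fe": 53, "ij": 72, "mk": 47, "rn": 50, "hm": 72, "rg": 58}

# Balanced dataset: every author contributes the same number of samples.
total_samples = len(AUTHORS) * SAMPLES_PER_AUTHOR

# Average word count per author, rounded as in the table.
avg_words_per_author = round(TOTAL_WORDS / len(AUTHORS), 2)

# With balanced classes, always guessing one author is right 1 time in 6.
chance_accuracy = SAMPLES_PER_AUTHOR / total_samples

# Sum of per-author vocabularies (approximate, in thousands).
sum_unique_k = sum(unique_words_k.values())

print("total samples:", total_samples)
print("average words per author:", avg_words_per_author)
print("chance-level accuracy:", round(chance_accuracy, 4))
print("sum of per-author unique words (k):", sum_unique_k)
print("authors share vocabulary:", sum_unique_k * 1000 > TOTAL_UNIQUE_WORDS)
```

Because the classes are balanced, plain accuracy is a reasonable metric here, and any attribution model should be compared against the chance level of roughly 16.7%.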
If you use this dataset, please cite the paper "A Comparative Analysis of Word Embedding Representations in Authorship Attribution of Bengali Literature".
@INPROCEEDINGS{BAAD6Dataset, author={Ahmed Chowdhury, Hemayet and Haque Imon, Md. Azizul and Islam, Md. Saiful}, booktitle={2018 21st International Conference of Computer and Information Technology (ICCIT)}, title={A Comparative Analysis of Word Embedding Representations in Authorship Attribution of Bengali Literature}, year={2018}, volume={}, number={}, pages={1-6}, doi={10.1109/ICCITECHN.2018.8631977} }
This dataset is also available on Mendeley: BAAD6 dataset. Always make sure to use the latest version of the dataset. To cite the dataset directly, use:
@misc{BAAD6Dataset, author = {Ahmed Chowdhury, Hemayet and Haque Imon, Md. Azizul and Khatun, Aisha and Islam, Md. Saiful}, title = {BAAD6: Bangla Authorship Attribution Dataset}, year={2018}, doi = {10.17632/w9wkd7g43f.5}, howpublished= {\url{https://data.mendeley.com/datasets/w9wkd7g43f/5}} }