Dataset:
cjvt/janes_preklop
Janes-Preklop is a corpus of Slovene tweets that is manually annotated for code-switching: the use of words from two or more languages within one sentence or utterance.
Code-switched Slovenian.
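A sample instance from the dataset. Each word is annotated with its language: "default" (Slovenian or unclassifiable), "en" (English), "de" (German), "hbs" (Serbo-Croatian), "sp" (Spanish), "la" (Latin), "ar" (Arabic), "fr" (French), "it" (Italian), or "pt" (Portuguese). In the instance shown, the first word of a code-switched span carries a "B-" prefix and the following words of the same span carry an "I-" prefix.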
{
    'id': 'tid.397447931558895616',
    'words': ['Brad', 'Pitt', 'na', 'Planet', 'TV', '.', 'U', 'are', 'welcome', ';)'],
    'language': ['default', 'default', 'default', 'default', 'default', 'default', 'B-en', 'I-en', 'I-en', 'I-en']
}
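The word-level tags in the sample instance can be grouped into contiguous code-switched spans. The following is a minimal sketch, assuming the "B-"/"I-" prefix scheme seen in the sample holds for the whole corpus; the helper name `extract_spans` is hypothetical and not part of the dataset or any library.

```python
from typing import List, Tuple

# The sample instance from this card, copied verbatim.
instance = {
    "id": "tid.397447931558895616",
    "words": ["Brad", "Pitt", "na", "Planet", "TV", ".", "U", "are", "welcome", ";)"],
    "language": ["default", "default", "default", "default", "default", "default",
                 "B-en", "I-en", "I-en", "I-en"],
}


def extract_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged words into (language, text) spans.

    Hypothetical helper: assumes a span starts at "B-<lang>" and continues
    over consecutive "I-<lang>" tags; "default" words end any open span.
    """
    spans: List[Tuple[str, str]] = []
    current_lang = None
    current_words: List[str] = []

    def flush() -> None:
        # Store the open span, if any, and reset the accumulator.
        nonlocal current_lang, current_words
        if current_words:
            spans.append((current_lang, " ".join(current_words)))
        current_lang, current_words = None, []

    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            flush()
            current_lang, current_words = tag[2:], [word]
        elif tag.startswith("I-") and tag[2:] == current_lang:
            current_words.append(word)
        elif tag.startswith("I-"):
            # An "I-" tag without a matching open span: treat it as a new span.
            flush()
            current_lang, current_words = tag[2:], [word]
        else:
            flush()
    flush()
    return spans


spans = extract_spans(instance["words"], instance["language"])
print(spans)
```

For the sample instance, this yields a single English span covering the last four tokens; the first six tokens are tagged "default" and are skipped.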
Špela Reher, Tomaž Erjavec, Darja Fišer.
CC BY-SA 4.0.
@misc{janes_preklop,
    title = {Tweet code-switching corpus Janes-Preklop 1.0},
    author = {Reher, {\v S}pela and Erjavec, Toma{\v z} and Fi{\v s}er, Darja},
    url = {http://hdl.handle.net/11356/1154},
    note = {Slovenian language resource repository {CLARIN}.{SI}},
    copyright = {Creative Commons - Attribution-{ShareAlike} 4.0 International ({CC} {BY}-{SA} 4.0)},
    issn = {2820-4042},
    year = {2017}
}
Thanks to @matejklemen for adding this dataset.