Dataset:
alexandrainst/scandi-reddit
ScandiReddit is a filtered and post-processed corpus consisting of comments from Reddit.
All Reddit comments from December 2005 up until October 2022 were downloaded through PushShift, after which they were filtered using the FastText language detection model. Any comment classified as Danish (da), Norwegian (no), Swedish (sv) or Icelandic (is) with a confidence score above 70% was kept.
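The language filter described above can be sketched as a small predicate. This is an illustration only, not the curation script itself: the function name `kept_language` is hypothetical, and it assumes the classifier returns labels in the `__label__xx` form used by fastText's pretrained language identification models, together with a confidence between 0 and 1.

```python
from typing import Optional

# Languages retained in the corpus and the confidence cut-off, as stated above.
KEPT_LANGUAGES = {"da", "no", "sv", "is"}
MIN_CONFIDENCE = 0.7

# Assumption: labels arrive in fastText's "__label__xx" form.
LABEL_PREFIX = "__label__"


def kept_language(label: str, confidence: float) -> Optional[str]:
    """Return the language code if the comment passes the filter, else None."""
    code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
    if code in KEPT_LANGUAGES and confidence > MIN_CONFIDENCE:
        return code
    return None
```

A comment would be kept whenever `kept_language(label, confidence)` returns a language code rather than `None`.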
The resulting comments were then deduplicated, removing roughly 438,000 comments. 5,000 comments written by Reddit bots were removed, and roughly 189,000 comments belonging to inappropriate subreddits (explicit and drug-related) were also removed.
Lastly, roughly 40,000 near-duplicate comments were removed from the resulting corpus, where two comments count as near-duplicates if they have more than 80% of their word 5-grams in common.
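The near-duplicate criterion can be illustrated with a minimal sketch. Several details here are assumptions, since the text does not specify them: words are obtained by whitespace splitting, the shared fraction is measured against the comment with fewer distinct 5-grams, and comments shorter than five words are never treated as near-duplicates. The helper names are hypothetical.

```python
def word_ngrams(text: str, n: int = 5) -> set:
    """Return the set of distinct word n-grams in `text` (whitespace tokens)."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_near_duplicate(a: str, b: str, threshold: float = 0.8, n: int = 5) -> bool:
    """True if more than `threshold` of the word n-grams are shared.

    Assumption: the fraction is taken relative to the smaller n-gram set.
    """
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        # Too short to form a single n-gram; not comparable under this criterion.
        return False
    shared = len(grams_a & grams_b)
    return shared / min(len(grams_a), len(grams_b)) > threshold
```

Comparing every pair of comments this way is quadratic in the corpus size, so a corpus of this scale would in practice need an approximate scheme such as MinHash; the sketch only shows the criterion itself.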
Training language models is the intended task for this dataset. No leaderboard is active at this point.
The dataset is available in Danish (da), Swedish (sv), Norwegian (no) and Icelandic (is).
An example from the dataset looks as follows.
{ 'doc': 'Bergen er ødelagt. Det er ikke moro mer.', 'subreddit': 'Norway', 'language': 'da', 'language_confidence': 0.7472341656684875 }
The data fields are the same among all splits.
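Because every record carries `language` and `language_confidence`, a stricter subset can be selected after loading. The following is a minimal sketch over plain dictionaries shaped like the example above; the `select` helper and the second record are hypothetical and only serve the illustration.

```python
records = [
    {
        "doc": "Bergen er ødelagt. Det er ikke moro mer.",
        "subreddit": "Norway",
        "language": "da",
        "language_confidence": 0.7472341656684875,
    },
    # Hypothetical record, included only so the filter has something to keep.
    {
        "doc": "(placeholder text)",
        "subreddit": "sweden",
        "language": "sv",
        "language_confidence": 0.95,
    },
]


def select(records, language, min_confidence=0.9):
    """Keep records in `language` whose confidence is at least `min_confidence`."""
    return [
        r for r in records
        if r["language"] == language and r["language_confidence"] >= min_confidence
    ]


high_confidence_swedish = select(records, "sv")
```

Raising the threshold trades corpus size for fewer misclassified comments; the right value depends on the downstream use.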
language | count |
---|---|
sv | 6,967,420 |
da | 4,965,195 |
no | 1,340,470 |
is | 206,689 |
total | 13,479,774 |
subreddit | count |
---|---|
sweden | 4,881,483 |
Denmark | 3,579,178 |
norge | 1,281,655 |
svenskpolitik | 771,960 |
InfluencergossipDK | 649,910 |
swedishproblems | 339,683 |
Iceland | 183,488 |
dkfinance | 113,860 |
unket | 81,077 |
DanishEnts | 69,055 |
dankmark | 62,928 |
swedents | 58,576 |
scandinavia | 57,136 |
Allsvenskan | 56,006 |
Gothenburg | 54,395 |
stockholm | 51,016 |
ISKbets | 47,944 |
Sverige | 39,552 |
SWARJE | 34,691 |
GossipDK | 29,332 |
NorskFotball | 28,571 |
Superligaen | 23,641 |
Aarhus | 22,516 |
Svenska | 20,561 |
newsdk | 19,893 |
AskReddit | 16,672 |
copenhagen | 16,668 |
okpolarncp | 16,583 |
SwedditUniversalis | 15,990 |
Sveriges_politik | 15,058 |
intresseklubben | 13,246 |
Aktiemarknaden | 13,202 |
soccer | 12,637 |
teenagers | 10,845 |
Norway | 10,680 |
europe | 10,247 |
Matinbum | 9,792 |
oslo | 9,650 |
iksdagen | 9,232 |
Asksweddit | 8,851 |
Forsvaret | 8,641 |
Sverigesforsvarsmakt | 8,469 |
memes | 8,299 |
Danish | 8,268 |
DANMAG | 8,214 |
PewdiepieSubmissions | 7,800 |
sweddpolitik | 7,646 |
pinsamt | 7,318 |
arbetarrorelsen | 7,317 |
Ishockey | 6,824 |
The Scandinavian languages do not have many open source social media datasets.
The raw Reddit data was collected through PushShift.
Dan Saattrup Nielsen from The Alexandra Institute curated this dataset.
The dataset is licensed under the CC BY 4.0 license.