Dataset:

alexandrainst/scandi-reddit

Tasks:

Text generation

Fill-mask

Sub-tasks:

language-modeling

Languages:

da, sv, no, is

Multilinguality:

multilingual

Size:

10M<n<100M

License:

cc-by-4.0

Dataset Card for ScandiReddit

Dataset Summary

ScandiReddit is a filtered and post-processed corpus consisting of comments from Reddit.

All Reddit comments from December 2005 up until October 2022 were downloaded through PushShift and then filtered with the FastText language detection model. Any comment classified as Danish (da), Norwegian (no), Swedish (sv) or Icelandic (is) with a confidence score above 70% was kept.
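The keep/drop rule described above can be sketched as follows. This is a minimal illustration, not the curator's actual pipeline: the names `keep_comment`, `filter_comments` and the `detect` callback are hypothetical, and `detect` stands in for whatever wraps the FastText model and returns a language code with its confidence.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Language codes and threshold taken from the dataset card.
KEPT_LANGUAGES = {"da", "no", "sv", "is"}
MIN_CONFIDENCE = 0.70

# Hypothetical detector signature: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def keep_comment(language: str, confidence: float) -> bool:
    """Return True if the comment is in a kept language with confidence above 70%."""
    return language in KEPT_LANGUAGES and confidence > MIN_CONFIDENCE


def filter_comments(comments: Iterable[str], detect: Detector) -> Iterator[Dict[str, object]]:
    """Yield records for comments that pass the language filter."""
    for text in comments:
        language, confidence = detect(text)
        if keep_comment(language, confidence):
            yield {
                "doc": text,
                "language": language,
                "language_confidence": confidence,
            }
```

Note that "above 70%" is read here as a strict inequality, so a comment scoring exactly 0.70 would be dropped; the card does not say how ties were handled.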

The resulting comments were then deduplicated, removing roughly 438,000 comments. A further 5,000 comments written by Reddit bots were removed, along with roughly 189,000 comments belonging to inappropriate subreddits (explicit and drug-related).
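A rough sketch of these clean-up steps is shown below. The card does not describe how bots or inappropriate subreddits were identified, so the block lists and the `author` field are purely hypothetical placeholders (the released dataset has no `author` column), and exact string equality is assumed for deduplication.

```python
from typing import Dict, Iterable, Iterator, Set


def drop_exact_duplicates(comments: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each comment whose text has not been seen before (first occurrence wins)."""
    seen: Set[str] = set()
    for comment in comments:
        text = comment["doc"]
        if text in seen:
            continue
        seen.add(text)
        yield comment


def drop_blocked(
    comments: Iterable[Dict[str, str]],
    blocked_subreddits: Set[str],  # hypothetical list, lower-cased subreddit names
    bot_authors: Set[str],         # hypothetical list of bot account names
) -> Iterator[Dict[str, str]]:
    """Yield comments that are neither from a blocked subreddit nor written by a known bot."""
    for comment in comments:
        if comment["subreddit"].lower() in blocked_subreddits:
            continue
        if comment.get("author") in bot_authors:
            continue
        yield comment
```

Both helpers are generators, so they can be chained over a stream of comments without holding the whole corpus in memory; only the set of seen texts grows.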

Lastly, roughly 40,000 near-duplicate comments were removed from the resulting corpus, where near-duplicate means that two comments have more than 80% of their word 5-grams in common.
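The near-duplicate criterion can be illustrated with a small sketch. The card does not state how "in common" is measured, so this example assumes Jaccard similarity over the sets of word 5-grams and whitespace tokenisation; the function names are illustrative only.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in a text, splitting on whitespace."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two texts' word n-gram sets (0.0 if either is empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_near_duplicate(a: str, b: str, threshold: float = 0.8, n: int = 5) -> bool:
    """Return True if the texts share more than `threshold` of their word n-grams."""
    return ngram_jaccard(a, b, n) > threshold
```

Comparing every pair of comments this way is quadratic in the corpus size, so a corpus of this scale would in practice need an approximate scheme such as MinHash; the pairwise check above only shows the criterion itself. Comments shorter than five words have no 5-grams and are never flagged under this definition.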

Supported Tasks and Leaderboards

Training language models is the intended task for this dataset. No leaderboard is active at this point.

Languages

The dataset is available in Danish (da), Swedish (sv), Norwegian (no) and Icelandic (is).

Dataset Structure

Data Instances

Size of downloaded dataset files: 2341 MB
Size of the generated dataset: 3594 MB
Total amount of disk used: 5935 MB

An example from the dataset looks as follows.

{
    'doc': 'Bergen er ødelagt. Det er ikke moro mer.',
    'subreddit': 'Norway',
    'language': 'da',
    'language_confidence': 0.7472341656684875
}

Data Fields

The data fields are the same among all splits.

doc: a string feature.
subreddit: a string feature.
language: a string feature.
language_confidence: a float64 feature.
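As a sketch of how these fields might be used, the snippet below types a record and selects comments by language and confidence. The `Comment` type, `select` and `stream_scandi_reddit` are illustrative names. The loader uses the Hugging Face `datasets` library in streaming mode, and the split name `"train"` is an assumption that is not confirmed by this card.

```python
from typing import Iterable, Iterator, TypedDict


class Comment(TypedDict):
    """One record, mirroring the four fields listed in the card."""
    doc: str
    subreddit: str
    language: str
    language_confidence: float


def select(
    comments: Iterable[Comment], language: str, min_confidence: float = 0.0
) -> Iterator[Comment]:
    """Yield comments in `language` whose confidence is at least `min_confidence`."""
    for comment in comments:
        if comment["language"] == language and comment["language_confidence"] >= min_confidence:
            yield comment


def stream_scandi_reddit(split: str = "train"):
    """Stream the corpus without downloading it in full (split name is an assumption)."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`

    return load_dataset("alexandrainst/scandi-reddit", split=split, streaming=True)
```

For instance, `select(stream_scandi_reddit(), "is", 0.9)` would lazily iterate over high-confidence Icelandic comments, assuming the split name holds. Raising the confidence threshold is one way to reduce mislabelled records such as the example instance above, where a comment from the Norway subreddit is tagged `da`.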

Language Distribution

name	count
sv	6,967,420
da	4,965,195
no	1,340,470
is	206,689
total	13,479,774
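The counts above can be turned into relative shares with a few lines of arithmetic. The sketch below only restates the table; the variable names are illustrative.

```python
# Comment counts per language, copied from the table above.
LANGUAGE_COUNTS = {
    "sv": 6_967_420,
    "da": 4_965_195,
    "no": 1_340_470,
    "is": 206_689,
}

total = sum(LANGUAGE_COUNTS.values())

# Fraction of the corpus written in each language.
shares = {language: count / total for language, count in LANGUAGE_COUNTS.items()}

for language, share in shares.items():
    print(f"{language}: {share:.1%}")
```

Swedish and Danish together make up close to nine tenths of the corpus, so a model trained on the full dataset without reweighting will see comparatively little Norwegian and very little Icelandic.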

Top-50 Subreddit Distribution

name	count
sweden	4,881,483
Denmark	3,579,178
norge	1,281,655
svenskpolitik	771,960
InfluencergossipDK	649,910
swedishproblems	339,683
Iceland	183,488
dkfinance	113,860
unket	81,077
DanishEnts	69,055
dankmark	62,928
swedents	58,576
scandinavia	57,136
Allsvenskan	56,006
Gothenburg	54,395
stockholm	51,016
ISKbets	47,944
Sverige	39,552
SWARJE	34,691
GossipDK	29,332
NorskFotball	28,571
Superligaen	23,641
Aarhus	22,516
Svenska	20,561
newsdk	19,893
AskReddit	16,672
copenhagen	16,668
okpolarncp	16,583
SwedditUniversalis	15,990
Sveriges_politik	15,058
intresseklubben	13,246
Aktiemarknaden	13,202
soccer	12,637
teenagers	10,845
Norway	10,680
europe	10,247
Matinbum	9,792
oslo	9,650
iksdagen	9,232
Asksweddit	8,851
Forsvaret	8,641
Sverigesforsvarsmakt	8,469
memes	8,299
Danish	8,268
DANMAG	8,214
PewdiepieSubmissions	7,800
sweddpolitik	7,646
pinsamt	7,318
arbetarrorelsen	7,317
Ishockey	6,824

Dataset Creation

Curation Rationale

Few open-source social media datasets exist for the Scandinavian languages.

Source Data

The raw Reddit data was collected through PushShift.

Additional Information

Dataset Curators

Dan Saattrup Nielsen from The Alexandra Institute curated this dataset.

Licensing Information

The dataset is licensed under the CC BY 4.0 license.

Author:

alexandrainst

Dataset size:

2.18 GB