Dataset: TheGreatRambler/mm2_level_comments
Other: text-mining
License: cc-by-nc-sa-4.0
Source datasets: original
Size: 10M<n<100M
Multilinguality: multilingual
Language: multilingual
Part of the Mario Maker 2 Dataset Collection
The Mario Maker 2 level comment dataset consists of 31.9 million level comments from Nintendo's online service, totaling around 20GB of data. The dataset was created using the self-hosted Mario Maker 2 API over the course of about one month in February 2022.
The dataset is very large, so for most use cases it is recommended to use the streaming API of the datasets library. You can load and iterate through the dataset with the following code:
    from datasets import load_dataset

    ds = load_dataset("TheGreatRambler/mm2_level_comments", streaming=True, split="train")
    print(next(iter(ds)))

    #OUTPUT:
    {
        'data_id': 3000006,
        'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
        'type': 2,
        'pid': '3471680967096518562',
        'posted': 1561652887,
        'clear_required': 0,
        'text': '',
        'reaction_image_id': 10,
        'custom_image': [some binary data],
        'has_beaten': 0,
        'x': 557,
        'y': 64,
        'reaction_face': 0,
        'unk8': 0,
        'unk10': 0,
        'unk12': 0,
        'unk14': [some binary data],
        'unk17': 0
    }
Comments can be one of three types: text, reaction image or custom image. type can be used with the enum below to identify different kinds of comments. Custom images are binary PNGs.
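As a minimal sketch of how the type field might be used in practice, the snippet below keeps only non-empty text comments from any iterable of records. The helper name iter_text_comments and the stand-in sample records are hypothetical; the value 1 for text comments is taken from the CommentType enum given in this card.

```python
from typing import Any, Dict, Iterable, Iterator

# Per the CommentType enum in this card: 0 = custom image, 1 = text, 2 = reaction image.
TEXT_COMMENT = 1


def iter_text_comments(records: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield the text of every non-empty text comment (hypothetical helper)."""
    for record in records:
        if record["type"] == TEXT_COMMENT and record["text"]:
            yield record["text"]


# Stand-in records with made-up text; only the field names mirror the dataset.
sample = [
    {"type": 2, "text": "", "reaction_image_id": 10},
    {"type": 1, "text": "Great level!", "reaction_image_id": 0},
    {"type": 1, "text": "", "reaction_image_id": 0},
]
texts = list(iter_text_comments(sample))
print(texts)
```

Because the function only iterates, the streaming dataset object from the loading example could presumably be passed in place of sample, so that no more than one record needs to be held in memory at a time.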
You can also download the full dataset. Note that this will download ~20GB:
ds = load_dataset("TheGreatRambler/mm2_level_comments", split="train")
    {
        'data_id': 3000006,
        'comment_id': '20200430072710528979_302de3722145c7a2_2dc6c6',
        'type': 2,
        'pid': '3471680967096518562',
        'posted': 1561652887,
        'clear_required': 0,
        'text': '',
        'reaction_image_id': 10,
        'custom_image': [some binary data],
        'has_beaten': 0,
        'x': 557,
        'y': 64,
        'reaction_face': 0,
        'unk8': 0,
        'unk10': 0,
        'unk12': 0,
        'unk14': [some binary data],
        'unk17': 0
    }
Field | Type | Description |
---|---|---|
data_id | int | The data ID of the level this comment appears on |
comment_id | string | Comment ID |
type | int | Type of comment, enum below |
pid | string | Player ID of the comment creator |
posted | int | UTC timestamp of when this comment was created |
clear_required | bool | Whether this comment requires a clear to view |
text | string | If the comment type is text, the text of the comment |
reaction_image_id | int | If this comment is a reaction image, the id of the reaction image, enum below |
custom_image | bytes | If this comment is a custom drawing, the custom drawing as a PNG binary |
has_beaten | int | Whether the user had beaten the level when they created the comment |
x | int | The X position of the comment in game |
y | int | The Y position of the comment in game |
reaction_face | int | The reaction face of the mii of this user, enum below |
unk8 | int | Unknown |
unk10 | int | Unknown |
unk12 | int | Unknown |
unk14 | bytes | Unknown |
unk17 | int | Unknown |
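Two of the fields above usually need a small conversion before use. The sketch below assumes that posted is a Unix timestamp in seconds (the table only says "UTC timestamp") and that a custom drawing, when present, starts with the standard 8-byte PNG signature. The helper names are hypothetical.

```python
from datetime import datetime, timezone

# Standard 8-byte signature that begins every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def posted_as_datetime(record: dict) -> datetime:
    """Interpret 'posted' as Unix seconds (an assumption) and return a UTC datetime."""
    return datetime.fromtimestamp(record["posted"], tz=timezone.utc)


def has_png_image(record: dict) -> bool:
    """Return True if 'custom_image' holds bytes that start with the PNG signature."""
    image = record.get("custom_image") or b""
    return bytes(image).startswith(PNG_SIGNATURE)


# Values taken from the sample instance; the empty image is a stand-in.
sample = {"posted": 1561652887, "custom_image": b""}
print(posted_as_datetime(sample).isoformat())  # 2019-06-27T16:28:07+00:00
print(has_png_image(sample))                   # False
```

Records that pass the signature check could then be written to disk with a .png extension or handed to an image library of your choice.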
The dataset only contains a train split.
The dataset contains some integer fields that encode enums. The following mappings can be used to convert them back to their string equivalents:
    CommentType = {
        0: "Custom Image",
        1: "Text",
        2: "Reaction Image"
    }

    CommentReactionImage = {
        0: "Nice!",
        1: "Good stuff!",
        2: "So tough...",
        3: "EASY",
        4: "Seriously?!",
        5: "Wow!",
        6: "Cool idea!",
        7: "SPEEDRUN!",
        8: "How?!",
        9: "Be careful!",
        10: "So close!",
        11: "Beat it!"
    }

    CommentReactionFace = {
        0: "Normal",
        16: "Wink",
        1: "Happy",
        4: "Surprised",
        18: "Scared",
        3: "Confused"
    }
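One way these mappings might be applied is sketched below: a hypothetical describe_comment helper turns a record into a short readable summary, falling back to a placeholder for values the mappings do not cover. The mappings are repeated so that the snippet runs on its own, and the sample values come from the instance shown earlier in this card.

```python
COMMENT_TYPE = {0: "Custom Image", 1: "Text", 2: "Reaction Image"}
COMMENT_REACTION_IMAGE = {
    0: "Nice!", 1: "Good stuff!", 2: "So tough...", 3: "EASY",
    4: "Seriously?!", 5: "Wow!", 6: "Cool idea!", 7: "SPEEDRUN!",
    8: "How?!", 9: "Be careful!", 10: "So close!", 11: "Beat it!",
}
COMMENT_REACTION_FACE = {
    0: "Normal", 16: "Wink", 1: "Happy",
    4: "Surprised", 18: "Scared", 3: "Confused",
}


def describe_comment(record: dict) -> dict:
    """Map the enum-coded integer fields of one record to readable labels."""
    kind = COMMENT_TYPE.get(record["type"], "Unknown")
    summary = {
        "type": kind,
        "face": COMMENT_REACTION_FACE.get(record["reaction_face"], "Unknown"),
    }
    # The reaction image ID is only meaningful for reaction image comments.
    if kind == "Reaction Image":
        summary["reaction"] = COMMENT_REACTION_IMAGE.get(
            record["reaction_image_id"], "Unknown"
        )
    return summary


sample = {"type": 2, "reaction_image_id": 10, "reaction_face": 0}
print(describe_comment(sample))
# {'type': 'Reaction Image', 'face': 'Normal', 'reaction': 'So close!'}
```

Using dict.get with a fallback is a deliberate choice here: the card does not state that the listed values are exhaustive, so an unlisted integer yields "Unknown" instead of raising a KeyError.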
The dataset was created over a little more than a month in February 2022 using the self-hosted Mario Maker 2 API. Because requests to Nintendo's servers require authentication, the process had to be done with the utmost care, limiting download speed so as not to overload the API and risk a ban. There are no plans to create an updated release of this dataset.
The dataset consists of comments from many different Mario Maker 2 players globally and as such their text could contain harmful language. Harmful depictions could also be present in the custom images.