Dataset: so_stacksample
Tasks: text-to-text generation
Language: en
Multilinguality: monolingual
Size: 1M<n<10M
Language creators: crowdsourced
Annotations creators: no-annotation
Source datasets: original
License: cc-by-sa-3.0
Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.
The data is organized as three tables:
- Questions table contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
- Answers table contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
- Tags table contains the tags on each of these questions.
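The relationship between the three tables can be sketched in plain Python. This is only an illustration under stated assumptions: the rows below are made-up values shaped like the tables, and `build_threads` is a hypothetical helper, not part of the dataset or any library.

```python
from collections import defaultdict

# Hypothetical rows shaped like the three tables; all values are invented.
questions = [
    {"Id": 80, "Title": "How do I parse a date?", "Score": 5},
    {"Id": 90, "Title": "What is a closure?", "Score": 2},
]
answers = [
    {"Id": 124, "ParentId": 80, "Score": 3, "Body": "Use a date parser."},
    {"Id": 131, "ParentId": 80, "Score": -1, "Body": "Split on dashes."},
    {"Id": 215, "ParentId": 90, "Score": 7, "Body": "A function plus its scope."},
]
tags = [
    {"Id": 80, "Tag": "datetime"},
    {"Id": 80, "Tag": "parsing"},
    {"Id": 90, "Tag": "closures"},
]


def build_threads(questions, answers, tags):
    """Group answers (via ParentId) and tags (via Id) under each question."""
    answers_by_parent = defaultdict(list)
    for answer in answers:
        answers_by_parent[answer["ParentId"]].append(answer)

    tags_by_question = defaultdict(list)
    for row in tags:
        tags_by_question[row["Id"]].append(row["Tag"])

    threads = {}
    for question in questions:
        qid = question["Id"]
        threads[qid] = {
            "question": question,
            # Highest-scored answers first; scores can be negative.
            "answers": sorted(
                answers_by_parent.get(qid, []),
                key=lambda a: a["Score"],
                reverse=True,
            ),
            "tags": tags_by_question.get(qid, []),
        }
    return threads


threads = build_threads(questions, answers, tags)
```

Keying both lookups on the question Id keeps the join to a single pass per table, which matters little for toy rows but helps when the tables hold millions of records.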
Example projects include:
The text is in English (en) and contains code in various programming languages.
For Answers:
{
  "Id": {  # Unique ID given to the Answer post
    "feature_type": "Value",
    "dtype": "int32"
  },
  "OwnerUserId": {  # The UserID of the person who generated the Answer on StackOverflow. -1 means NA.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "CreationDate": {  # The date the Answer was generated. Follows standard datetime format.
    "feature_type": "Value",
    "dtype": "string"
  },
  "ParentId": {  # Refers to the `Id` of the Question the Answer belongs to.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Score": {  # The sum of up and down votes given to the Answer. Can be negative.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Body": {  # The body content of the Answer.
    "feature_type": "Value",
    "dtype": "string"
  }
}
For Questions:
{
  "Id": {  # Unique ID given to the Question post
    "feature_type": "Value",
    "dtype": "int32"
  },
  "OwnerUserId": {  # The UserID of the person who generated the Question on StackOverflow. -1 means NA.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "CreationDate": {  # The date the Question was generated. Follows standard datetime format.
    "feature_type": "Value",
    "dtype": "string"
  },
  "ClosedDate": {  # The date the Question was closed. Follows standard datetime format. Can be NA.
    "feature_type": "Value",
    "dtype": "string"
  },
  "Score": {  # The sum of up and down votes given to the Question. Can be negative.
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Title": {  # The title of the Question.
    "feature_type": "Value",
    "dtype": "string"
  },
  "Body": {  # The body content of the Question.
    "feature_type": "Value",
    "dtype": "string"
  }
}
For Tags:
{
  "Id": {  # ID of the Question the tag belongs to
    "feature_type": "Value",
    "dtype": "int32"
  },
  "Tag": {  # The tag name
    "feature_type": "Value",
    "dtype": "string"
  }
}
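The schemas store missing values as sentinels: `OwnerUserId` uses -1, and `ClosedDate` can be NA. A minimal sketch of normalizing a Question row follows. It assumes, without confirmation from the card, that dates are ISO 8601 strings such as `2008-08-01T13:57:07Z` and that a missing date appears as the literal string `NA`; `parse_timestamp` and `normalize_question` are hypothetical helper names.

```python
from datetime import datetime, timezone

# Assumption: timestamps look like "2008-08-01T13:57:07Z" (UTC).
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Return a timezone-aware datetime, or None when the value is missing.

    Assumption: missing dates are None, an empty string, or the literal "NA".
    """
    if value is None or value in ("", "NA"):
        return None
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def normalize_question(row):
    """Copy a Question row, replacing sentinel values with None."""
    owner = row["OwnerUserId"]
    return {
        **row,
        "OwnerUserId": None if owner == -1 else owner,  # -1 means NA
        "CreationDate": parse_timestamp(row["CreationDate"]),
        "ClosedDate": parse_timestamp(row["ClosedDate"]),
    }


# Hypothetical row; the values are invented for illustration.
example = {
    "Id": 80,
    "OwnerUserId": -1,
    "CreationDate": "2008-08-01T13:57:07Z",
    "ClosedDate": "NA",
    "Score": 26,
    "Title": "Example title",
    "Body": "<p>Example body</p>",
}
clean = normalize_question(example)
```

If the actual files use a different date layout, only `TIMESTAMP_FORMAT` needs to change.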
For Answers:
- Id: Unique ID given to the Answer post.
- OwnerUserId: The UserID of the person who generated the Answer on StackOverflow. -1 means NA.
- CreationDate: The date the Answer was generated. Follows standard datetime format.
- ParentId: Refers to the Id of the Question the Answer belongs to.
- Score: The sum of up and down votes given to the Answer. Can be negative.
- Body: The body content of the Answer.
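For Questions:
- Id: Unique ID given to the Question post.
- OwnerUserId: The UserID of the person who generated the Question on StackOverflow. -1 means NA.
- CreationDate: The date the Question was generated. Follows standard datetime format.
- ClosedDate: The date the Question was closed. Follows standard datetime format. Can be NA.
- Score: The sum of up and down votes given to the Question. Can be negative.
- Title: The title of the Question.
- Body: The body content of the Question.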
For Tags:
- Id: ID of the Question the tag belongs to.
- Tag: The tag name.
The dataset has 3 splits, one per table: Answers, Questions, and Tags.
Datasets of all R questions and all Python questions are also available on Kaggle, but this dataset is especially useful for analyses that span many languages.
[More Information Needed]
Who are the source language producers? Stack Overflow users.
[More Information Needed]
Who are the annotators? [More Information Needed]
This data contains information that can identify individual users of StackOverflow. The information is self-reported.
[Needs More Information]
Stack Overflow answers are not guaranteed to be safe, secure, or correct. Some answers are purposefully insecure, such as this answer from user zys (https://stackoverflow.com/a/35571883/5768407), which shows how to bypass Google Play store security checks. Models trained on such answers can become biased toward unsafe and insecure programming practices and propagate them further.
All Stack Overflow user contributions are licensed under CC-BY-SA 3.0 with attribution required.
The content is from Stack Overflow.
Thanks to @ncoop57 for adding this dataset.