You can find the main data card on the GEM Website.
The MLB dataset is an English sport-related data-to-text dataset in the baseball domain. The input is a large table with results of a game and the output is a description of the game.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/mlb_data_to_text')
The data loader can be found here.
Authors: Ratish Puduppully, Li Dong, Mirella Lapata
@inproceedings{puduppully-etal-2019-data,
    title = "Data-to-text Generation with Entity Modeling",
    author = "Puduppully, Ratish and Dong, Li and Lapata, Mirella",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1195",
    doi = "10.18653/v1/P19-1195",
    pages = "2023--2035",
}
Contact Name: Ratish Puduppully
Contact Email: ratishpuduppully@gmail.com
Has a Leaderboard? no
no
Covered Languages: English
License: other: Other license
Intended Use: The dataset can be used to study data-to-text generation in the sports domain. It pairs the statistics of a Major League Baseball (MLB) game with a summary of that game. Each summary is a document averaging 540 tokens, which makes the dataset useful for studying long document generation.
Add. License Info: Restricted to non-commercial research purposes.
Primary Task: Data-to-Text
Communicative Goal: Produce a summary of an MLB game from its statistics.
academic
Curation Organization(s): University of Edinburgh
Dataset Creators: Ratish Puduppully, Li Dong, Mirella Lapata
features = datasets.Features(
    {
        "home_name": datasets.Value("string"),
        # One entry per player; all statistics are stored as strings.
        "box_score": [
            {
                "p_l": datasets.Value("string"),
                "last_name": datasets.Value("string"),
                "p_h": datasets.Value("string"),
                "sac": datasets.Value("string"),
                "p_bb": datasets.Value("string"),
                "pos": datasets.Value("string"),
                "ao": datasets.Value("string"),
                "p_bf": datasets.Value("string"),
                "cs": datasets.Value("string"),
                "hbp": datasets.Value("string"),
                "ab": datasets.Value("string"),
                "full_name": datasets.Value("string"),
                "p_w": datasets.Value("string"),
                "go": datasets.Value("string"),
                "fldg": datasets.Value("string"),
                "p_bs": datasets.Value("string"),
                "avg": datasets.Value("string"),
                "p_r": datasets.Value("string"),
                "p_s": datasets.Value("string"),
                "lob": datasets.Value("string"),
                "first_name": datasets.Value("string"),
                "p_sv": datasets.Value("string"),
                "p_so": datasets.Value("string"),
                "p_save": datasets.Value("string"),
                "p_hr": datasets.Value("string"),
                "po": datasets.Value("string"),
                "p_ip1": datasets.Value("string"),
                "p_ip2": datasets.Value("string"),
                "bb": datasets.Value("string"),
                "ops": datasets.Value("string"),
                "p_hld": datasets.Value("string"),
                "bo": datasets.Value("string"),
                "p_loss": datasets.Value("string"),
                "e": datasets.Value("string"),
                "p_game_score": datasets.Value("string"),
                "p_win": datasets.Value("string"),
                "a": datasets.Value("string"),
                "p_era": datasets.Value("string"),
                "d": datasets.Value("string"),
                "p_out": datasets.Value("string"),
                "h": datasets.Value("string"),
                "p_er": datasets.Value("string"),
                "p_np": datasets.Value("string"),
                "hr": datasets.Value("string"),
                "r": datasets.Value("string"),
                "so": datasets.Value("string"),
                "t": datasets.Value("string"),
                "rbi": datasets.Value("string"),
                "team": datasets.Value("string"),
                "sb": datasets.Value("string"),
                "slg": datasets.Value("string"),
                "sf": datasets.Value("string"),
                "obp": datasets.Value("string"),
            }
        ],
        "home_city": datasets.Value("string"),
        "vis_name": datasets.Value("string"),
        # One entry per inning, each with the plays of the top and bottom halves.
        "play_by_play": [
            {
                "top": [
                    {
                        "runs": datasets.Value("string"),
                        "scorers": [datasets.Value("string")],
                        "pitcher": datasets.Value("string"),
                        "o": datasets.Value("string"),
                        "b": datasets.Value("string"),
                        "s": datasets.Value("string"),
                        "batter": datasets.Value("string"),
                        "b1": [datasets.Value("string")],
                        "b2": [datasets.Value("string")],
                        "b3": [datasets.Value("string")],
                        "event": datasets.Value("string"),
                        "event2": datasets.Value("string"),
                        "home_team_runs": datasets.Value("string"),
                        "away_team_runs": datasets.Value("string"),
                        "rbi": datasets.Value("string"),
                        "error_runs": datasets.Value("string"),
                        "fielder_error": datasets.Value("string"),
                    }
                ],
                "bottom": [
                    {
                        "runs": datasets.Value("string"),
                        "scorers": [datasets.Value("string")],
                        "pitcher": datasets.Value("string"),
                        "o": datasets.Value("string"),
                        "b": datasets.Value("string"),
                        "s": datasets.Value("string"),
                        "batter": datasets.Value("string"),
                        "b1": [datasets.Value("string")],
                        "b2": [datasets.Value("string")],
                        "b3": [datasets.Value("string")],
                        "event": datasets.Value("string"),
                        "event2": datasets.Value("string"),
                        "home_team_runs": datasets.Value("string"),
                        "away_team_runs": datasets.Value("string"),
                        "rbi": datasets.Value("string"),
                        "error_runs": datasets.Value("string"),
                        "fielder_error": datasets.Value("string"),
                    }
                ],
                "inning": datasets.Value("string"),
            }
        ],
        # Line scores of the visiting and home teams.
        "vis_line": {
            "innings": [
                {
                    "inn": datasets.Value("string"),
                    "runs": datasets.Value("string"),
                }
            ],
            "result": datasets.Value("string"),
            "team_runs": datasets.Value("string"),
            "team_hits": datasets.Value("string"),
            "team_errors": datasets.Value("string"),
            "team_name": datasets.Value("string"),
            "team_city": datasets.Value("string"),
        },
        "home_line": {
            "innings": [
                {
                    "inn": datasets.Value("string"),
                    "runs": datasets.Value("string"),
                }
            ],
            "result": datasets.Value("string"),
            "team_runs": datasets.Value("string"),
            "team_hits": datasets.Value("string"),
            "team_errors": datasets.Value("string"),
            "team_name": datasets.Value("string"),
            "team_city": datasets.Value("string"),
        },
        "vis_city": datasets.Value("string"),
        "day": datasets.Value("string"),
        "summary": [
            datasets.Value("string"),
        ],
        "gem_id": datasets.Value("string"),
    }
)
Reason for Structure
The high-level structure contains the following attributes: home_name, vis_name, home_city, vis_city, summary, summary_eval, day, gem_id, box_score, play_by_play, home_line, vis_line. The attributes home_name, vis_name, home_city, vis_city and day are string values. The attribute "summary" contains the summary as a list of tokens. The attribute "summary_eval" contains the summary as a single string of tokens; unlike "summary", it does not contain the " NEWPARAGRAPH " delimiters that separate paragraphs. The "summary_eval" field should be used to evaluate model outputs, while the "summary" field may be used during training. box_score contains the box score statistics of the players in the game. It is a list (of average size 90) in which each element describes the statistics of one player. The box score statistics contain 53 attributes. The description of the attributes is given below. The descriptions of most of the attributes are obtained from mlb.com.
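The relationship between "summary" and "summary_eval" described above can be sketched in a few lines. This is a minimal illustration rather than the loader's actual code: the helper names are hypothetical, the toy tokens are invented, and it assumes the delimiter appears in "summary" as a standalone token spelled NEWPARAGRAPH.

```python
# Hypothetical helpers; the delimiter spelling follows the description above.
PARAGRAPH_DELIMITER = "NEWPARAGRAPH"


def split_paragraphs(summary_tokens):
    """Group summary tokens into paragraphs, splitting on the delimiter token."""
    paragraphs, current = [], []
    for token in summary_tokens:
        if token == PARAGRAPH_DELIMITER:
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append(token)
    if current:
        paragraphs.append(current)
    return paragraphs


def to_eval_string(summary_tokens):
    """Join the tokens into one string with the paragraph delimiters dropped."""
    return " ".join(t for t in summary_tokens if t != PARAGRAPH_DELIMITER)


# Toy tokens, not taken from the dataset.
toy_summary = ["The", "Cubs", "won", ".", "NEWPARAGRAPH", "Rizzo", "homered", "."]
toy_paragraphs = split_paragraphs(toy_summary)
toy_eval = to_eval_string(toy_summary)
```

Keeping the delimiter in the training target lets a model learn paragraph structure, while the delimiter-free string keeps evaluation scores from being affected by a token that is not part of the text.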
The description of attributes in play-by-play is below:
home_line and vis_line contain string-valued pairs for team_name, team_city, team_runs, team_hits, team_errors, result, and a list of runs scored in each inning.
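Because every value in a line score is stored as a string, numeric use requires an explicit cast. The sketch below shows one way to read such a record; the function names and the toy record (including its "result" value) are hypothetical, and treating non-numeric inning entries as zero is an assumption, not documented behavior.

```python
def inning_runs(line):
    """Return the runs per inning as integers, ordered by inning number."""
    ordered = sorted(line["innings"], key=lambda entry: int(entry["inn"]))
    # Assumption: a non-numeric entry (for example an unplayed half inning) counts as 0.
    return [int(e["runs"]) if e["runs"].isdigit() else 0 for e in ordered]


def format_line(line):
    """Render a line score as a single human-readable string."""
    per_inning = " ".join(str(r) for r in inning_runs(line))
    return (
        f'{line["team_city"]} {line["team_name"]}: {per_inning} | '
        f'R {line["team_runs"]} H {line["team_hits"]} E {line["team_errors"]}'
    )


# Toy record shaped like home_line; the values are invented.
toy_home_line = {
    "innings": [
        {"inn": "1", "runs": "0"},
        {"inn": "2", "runs": "2"},
        {"inn": "3", "runs": "1"},
    ],
    "result": "win",
    "team_runs": "3",
    "team_hits": "7",
    "team_errors": "0",
    "team_name": "Cubs",
    "team_city": "Chicago",
}
toy_formatted = format_line(toy_home_line)
```

Summing the per-inning runs and comparing the total with team_runs is a cheap consistency check on a record.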
Data Splits: There are three splits in the dataset: train, validation and test.
Splitting Criteria: The splits are random.
This dataset can be used to verify whether models are capable of long document generation. The challenges in long document generation conditioned on input tables include ensuring coherent output, staying faithful to the input, ensuring fluent output and avoiding repetition of text. These aspects can be verified on models trained on this dataset.
Similar Datasets: yes
Unique Language Coverage: no
Difference from other GEM datasets: Compared to the existing RotoWire (Wiseman et al. 2017) dataset, MLB summaries are longer (by approximately 50%) and the input records are richer and more structured (with the addition of play-by-play). Moreover, the MLB dataset is five times larger in terms of data size (i.e., pairs of tables and game summaries).
Ability that the Dataset measures: Long document generation, coherent ordering of information, faithfulness to the input statistics, fluency in generation and avoiding repetition of text.
yes
GEM Modifications: data points removed
Modification Details: Some examples that satisfied the criteria below have been removed from the training dataset:
no
The research paper is a good resource.
Automatic evaluation measures can assess the factuality, content selection, content ordering and fluency of the model output. Factuality, content selection and content ordering are measured using the Information Extraction-based evaluation approach introduced by Wiseman et al. (2017). Fluency is measured using BLEU.
Metrics: Other: Other Metrics
Other Metrics: Wiseman et al. (2017) define three metrics induced from the outputs of an Information Extraction model which is run on the model/human-written game summaries. Let ŷ be the gold summary and y the model output.
• Relation Generation (RG) measures the precision and count of relations extracted from y that also appear in records r.
• Content Selection (CS) measures the precision and recall of relations extracted from y that are also extracted from ŷ.
• Content Ordering (CO) measures the complement of the normalized Damerau-Levenshtein distance (Brill and Moore, 2000) between the sequences of relations extracted from y and ŷ.
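Once relations have been extracted, the three metrics above reduce to simple set and sequence comparisons. The following is a rough sketch under stated assumptions, not the reference evaluation code: relations are modeled as hypothetical (entity, value, type) tuples, duplicates are matched as multisets, the distance is the optimal string alignment variant of Damerau-Levenshtein, and it is normalized by the length of the longer sequence. The Information Extraction model itself is out of scope, and all toy relations are invented.

```python
from collections import Counter


def relation_generation(predicted, records):
    """RG: precision and count of predicted relations that appear in the input records."""
    supported = [rel for rel in predicted if rel in records]
    precision = len(supported) / len(predicted) if predicted else 0.0
    return precision, len(supported)


def content_selection(predicted, gold):
    """CS: multiset precision and recall of predicted relations against gold relations."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant) between sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)  # transposition
    return dist[-1][-1]


def content_ordering(predicted, gold):
    """CO: one minus the distance normalized by the longer sequence (an assumption)."""
    longest = max(len(predicted), len(gold))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(predicted, gold) / longest


# Toy relations, invented for illustration.
toy_records = {
    ("Rizzo", "2", "hr"),
    ("Rizzo", "3", "rbi"),
    ("Lester", "7", "p_so"),
    ("Cubs", "5", "team_runs"),
}
toy_gold = [("Cubs", "5", "team_runs"), ("Rizzo", "2", "hr"), ("Lester", "7", "p_so")]
toy_pred = [("Rizzo", "2", "hr"), ("Cubs", "5", "team_runs"), ("Rizzo", "4", "rbi")]

rg_precision, rg_count = relation_generation(toy_pred, toy_records)
cs_precision, cs_recall = content_selection(toy_pred, toy_gold)
co_score = content_ordering(toy_pred, toy_gold)
```

In the toy example the third predicted relation contradicts the records, so it lowers RG precision and CS precision, while the swapped order of the first two relations costs one transposition in CO.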
Proposed Evaluation: We have reused the automatic metrics based on Information Extraction evaluation introduced by Wiseman et al. (2017). For human evaluation, we conducted studies to evaluate factuality, coherence, grammaticality and conciseness.
Previous results available? yes
Relevant Previous Results: The most relevant previous results for the dataset are in the TACL 2021 paper on Data-to-text Generation with Macro Planning.
This dataset was curated to complement an existing data-to-text generation dataset (RotoWire by Wiseman et al. 2017) which focuses on long document generation. Compared to RotoWire, MLB summaries are longer (by approximately 50%) and the input records are richer and more structured (with the addition of play-by-play). Moreover, the MLB dataset is five times larger in terms of data size (i.e., pairs of tables and game summaries).
Communicative Goal: The goal is to study automatic generation of long documents in a data-to-text setting. The generated summaries should exhibit coherent ordering of content, be faithful to the input statistics, be fluent and avoid repetition of text.
Sourced from Different Sources: no
Found
Where was it found? Single website
Language Producers: The game summaries are produced by professional writers.
Topics Covered: The language focuses on the sports domain.
Data Validation: not validated
Data Preprocessing: Game summaries were tokenized using NLTK (Bird et al., 2009) and hyphenated words were separated. Sentences containing quotes were removed as they included opinions and non-factual statements unrelated to the input tables. Sometimes MLB summaries contain a "Game notes" section with incidental information which was also removed.
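The preprocessing steps above can be approximated as follows. This is an illustrative stand-in, not the authors' pipeline: it swaps NLTK for plain regular expressions so that it runs without extra downloads, uses a naive sentence splitter, assumes that quotes are marked by double-quote characters, that the notes section starts at the literal text "Game notes", and that separating a hyphenated word means spacing out the hyphen. All function names and the toy text are hypothetical.

```python
import re

QUOTE_CHARS = ('"', "\u201c", "\u201d")  # straight and curly double quotes


def strip_game_notes(text, marker="Game notes"):
    """Drop everything from the assumed section marker onwards."""
    index = text.find(marker)
    return text if index == -1 else text[:index]


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def separate_hyphens(sentence):
    """Space out hyphens inside words, e.g. 'two-run' becomes 'two - run'."""
    return re.sub(r"(?<=\w)-(?=\w)", " - ", sentence)


def tokenize(sentence):
    """Regex stand-in for a real tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def preprocess(text):
    """Remove the notes section and quoted sentences, then tokenize what is left."""
    sentences = split_sentences(strip_game_notes(text))
    kept = [s for s in sentences if not any(q in s for q in QUOTE_CHARS)]
    return [token for s in kept for token in tokenize(separate_hyphens(s))]


# Toy text, invented for illustration.
toy_text = (
    'Rizzo hit a two-run homer. "We played well," Maddon said. '
    "The Cubs won at home. Game notes Lester is day-to-day."
)
toy_tokens = preprocess(toy_text)
```

Removing the notes section before sentence splitting means that nothing after the marker can leak back in, whatever the splitter does with it.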
Was Data Filtered? not filtered
none
Annotation Service? no
no
Justification for Using the Data: The copyright remains with the original data creators and the usage permission is restricted to non-commercial uses.
yes/very likely
Categories of PII: sensitive information, generic PII
Any PII Identification? no identification
no
no
no
unsure
research use only
Copyright Restrictions on the Language Data: research use only