Dataset: nell
License: unknown
Source datasets: original
Annotations creators: machine-generated
Language creators: crowdsourced
Multilinguality: monolingual
Language: en
Task: text retrieval
This dataset provides version 1115 of the beliefs extracted by CMU's Never Ending Language Learner (NELL) and version 1110 of the candidate beliefs extracted by NELL. See http://rtw.ml.cmu.edu/rtw/overview . NELL is an open information extraction system that attempts to read the ClueWeb09 corpus of 500 million web pages ( http://boston.lti.cs.cmu.edu/Data/clueweb09/ ) and the results of general web searches.
The dataset has 4 configurations: nell_belief, nell_candidate, nell_belief_sentences, and nell_candidate_sentences. nell_belief contains the beliefs NELL has promoted; nell_candidate contains candidate beliefs, whose certainties are lower. The two sentence configurations extract the CPL sentence patterns, filled with the applicable 'best' literal string for the entities, and also provide sentences found through web searches that contain the entities and relationships.
There are roughly 21M entries for nell_belief_sentences, and 100M sentences for nell_candidate_sentences.
From the NELL website:
Research Goal: To build a never-ending machine learning system that acquires the ability to extract structured information from unstructured web pages. If successful, this will result in a knowledge base (i.e., a relational database) of structured information that mirrors the content of the Web. We call this system NELL (Never-Ending Language Learner).
Approach: The inputs to NELL include (1) an initial ontology defining hundreds of categories (e.g., person, sportsTeam, fruit, emotion) and relations (e.g., playsOnTeam(athlete,sportsTeam), playsInstrument(musician,instrument)) that NELL is expected to read about, and (2) 10 to 15 seed examples of each category and relation.
Given these inputs, plus a collection of 500 million web pages and access to the remainder of the web through search engine APIs, NELL runs 24 hours per day, continuously, to perform two ongoing tasks:
(1) Extract new instances of categories and relations. In other words, find noun phrases that represent new examples of the input categories (e.g., "Barack Obama" is a person and politician), and find pairs of noun phrases that correspond to instances of the input relations (e.g., the pair "Jason Giambi" and "Yankees" is an instance of the playsOnTeam relation). These new instances are added to the growing knowledge base of structured beliefs.
(2) Learn to read better than yesterday. NELL uses a variety of methods to extract beliefs from the web. These are retrained, using the growing knowledge base as a self-supervised collection of training examples. The result is a semi-supervised learning method that couples the training of hundreds of different extraction methods for a wide range of categories and relations. Much of NELL’s current success is due to its algorithm for coupling the simultaneous training of many extraction methods.
For more information, see: http://rtw.ml.cmu.edu/rtw/resources
English (en), and perhaps some other languages.
There are four configurations for the dataset: nell_belief, nell_candidate, nell_belief_sentences, nell_candidate_sentences.
An instance of nell_belief or nell_candidate looks like:
{'best_entity_literal_string': 'Aspect Medical Systems', 'best_value_literal_string': '', 'candidate_source': '%5BSEAL-Iter%3A215-2011%2F02%2F26-04%3A27%3A09-%3Ctoken%3Daspect_medical_systems%2Cbiotechcompany%3E-From%3ACategory%3Abiotechcompany-using-KB+http%3A%2F%2Fwww.unionegroup.com%2Fhealthcare%2Fmfg_info.htm+http%3A%2F%2Fwww.conventionspc.com%2Fcompanies.html%2C+CPL-Iter%3A1103-2018%2F03%2F08-15%3A32%3A34-%3Ctoken%3Daspect_medical_systems%2Cbiotechcompany%3E-grant+support+from+_%092%09research+support+from+_%094%09unrestricted+educational+grant+from+_%092%09educational+grant+from+_%092%09research+grant+support+from+_%091%09various+financial+management+positions+at+_%091%5D', 'categories_for_entity': 'concept:biotechcompany', 'categories_for_value': 'concept:company', 'entity': 'concept:biotechcompany:aspect_medical_systems', 'entity_literal_strings': '"Aspect Medical Systems" "aspect medical systems"', 'iteration_of_promotion': '1103', 'relation': 'generalizations', 'score': '0.9244426550775064', 'source': 'MBL-Iter%3A1103-2018%2F03%2F18-01%3A35%3A42-From+ErrorBasedIntegrator+%28SEAL%28aspect_medical_systems%2Cbiotechcompany%29%2C+CPL%28aspect_medical_systems%2Cbiotechcompany%29%29', 'value': 'concept:biotechcompany', 'value_literal_strings': ''}
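The source and candidate_source fields in the instance above appear to be URL-encoded, and score and the literal-string fields are stored as plain strings. A minimal sketch of decoding such a row follows. It assumes form-style encoding ('+' for a space, %XX escapes) and double-quoted literals separated by spaces, as the example suggests; `decode_belief` is a hypothetical helper, not part of the dataset or of NELL, and the sample row is abbreviated to a few fields copied from the instance.

```python
import re
from urllib.parse import unquote_plus


def decode_belief(row):
    """Return a copy of a belief row with a few fields decoded.

    Hypothetical helper for illustration only.
    """
    decoded = dict(row)
    # Assumption: the provenance fields use form-style URL encoding.
    for key in ("source", "candidate_source"):
        if key in decoded:
            decoded[key] = unquote_plus(decoded[key])
    # The score is stored as a string; convert it to a float.
    decoded["score"] = float(decoded["score"])
    # Assumption: literal strings are double-quoted and space-separated.
    for key in ("entity_literal_strings", "value_literal_strings"):
        decoded[key] = re.findall(r'"([^"]*)"', decoded[key])
    return decoded


# Abbreviated row: values copied from the example instance.
row = {
    "entity": "concept:biotechcompany:aspect_medical_systems",
    "entity_literal_strings": '"Aspect Medical Systems" "aspect medical systems"',
    "value_literal_strings": "",
    "score": "0.9244426550775064",
    "source": (
        "MBL-Iter%3A1103-2018%2F03%2F18-01%3A35%3A42-From+ErrorBasedIntegrator+"
        "%28SEAL%28aspect_medical_systems%2Cbiotechcompany%29%2C+"
        "CPL%28aspect_medical_systems%2Cbiotechcompany%29%29"
    ),
}

decoded = decode_belief(row)
print(decoded["source"])
print(decoded["entity_literal_strings"])
```

The decoded source string names the NELL modules that contributed to the belief (here SEAL and CPL), which is easier to inspect than the encoded form.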
An instance of nell_belief_sentences or nell_candidate_sentences looks like:
{'count': 4, 'entity': 'biotechcompany:aspect_medical_systems', 'relation': 'generalizations', 'score': '0.9244426550775064', 'sentence': 'research support from [[ Aspect Medical Systems ]]', 'sentence_type': 'CPL', 'url': '', 'value': 'biotechcompany'}
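In the example instance, the entity's literal string is wrapped in `[[ ... ]]` inside the CPL sentence. A small sketch of pulling out the tagged span, or removing the markers, is given below. The marker syntax is inferred from this single example, and `tagged_spans` and `strip_tags` are hypothetical helper names.

```python
import re

# Assumption: tagged spans look like "[[ text ]]", as in the example instance.
TAG = re.compile(r"\[\[\s*(.*?)\s*\]\]")


def tagged_spans(sentence):
    """Return the text of every [[ ... ]] span in the sentence."""
    return TAG.findall(sentence)


def strip_tags(sentence):
    """Return the sentence with the [[ ]] markers removed."""
    return TAG.sub(r"\1", sentence)


sentence = "research support from [[ Aspect Medical Systems ]]"
print(tagged_spans(sentence))  # ['Aspect Medical Systems']
print(strip_tags(sentence))    # research support from Aspect Medical Systems
```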
For the nell_belief and nell_candidate configurations, see the field descriptions at http://rtw.ml.cmu.edu/rtw/faq .
For nell_belief_sentences and nell_candidate_sentences, we have extracted the underlying sentences, sentence counts and URLs, and provided shortened versions of the entity, relation and value fields by removing the strings "concept:" and "candidate:". There are two types of sentences, 'CPL' and 'OE', which are generated by two of NELL's modules: pattern matching and open web searching, respectively. There may be duplicates. The fields, as in the example instance above, are count, entity, relation, score, sentence, sentence_type, url, and value.
There are no splits.
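The shortening of the entity, relation and value fields described above can be sketched as follows. This is an illustration, not the code used to build the dataset: `shorten` is a hypothetical name, and removing every occurrence of the two strings (rather than only a leading prefix) is an assumption.

```python
def shorten(field):
    """Remove the strings "concept:" and "candidate:" from a field value.

    Hypothetical helper; assumes every occurrence is removed.
    """
    for marker in ("concept:", "candidate:"):
        field = field.replace(marker, "")
    return field


# Values taken from the nell_belief example instance.
print(shorten("concept:biotechcompany:aspect_medical_systems"))
print(shorten("concept:biotechcompany"))
print(shorten("generalizations"))
```

Applied to the nell_belief example, this yields the entity and value shown in the nell_belief_sentences example ('biotechcompany:aspect_medical_systems' and 'biotechcompany'), while the relation 'generalizations' is unchanged.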
This dataset was gathered and created over many years of running the NELL system on web data.
See the research paper on NELL. NELL searches a subset of the web (ClueWeb09) and the open web using various open information extraction algorithms, including pattern matching.
Who are the source language producers?
The NELL authors at Carnegie Mellon University, with data from ClueWeb09 and the open web.
The various open information extraction modules of NELL.
Who are the annotators?
Machine annotated.
Unknown, but the data likely contains the names of famous individuals.
The goal of the work is to help machines learn to read and understand the web.
Since the data is gathered from the web, there is likely to be biased text and relationships.
The relationships and concepts gathered by NELL are not 100% accurate and may contain errors (the error rate may be as high as 30%). See https://en.wikipedia.org/wiki/Never-Ending_Language_Learning
We did not 'tag' the entity and value in the 'OE' sentences, and this might be an extension in the future.
The authors of NELL at Carnegie Mellon University.
There does not appear to be a license on http://rtw.ml.cmu.edu/rtw/resources . The data is made available by CMU on the web.
@inproceedings{mitchell2015,
  added-at = {2015-01-27T15:35:24.000+0100},
  author = {Mitchell, T. and Cohen, W. and Hruschka, E. and Talukdar, P. and Betteridge, J. and Carlson, A. and Dalvi, B. and Gardner, M. and Kisiel, B. and Krishnamurthy, J. and Lao, N. and Mazaitis, K. and Mohamed, T. and Nakashole, N. and Platanios, E. and Ritter, A. and Samadi, M. and Settles, B. and Wang, R. and Wijaya, D. and Gupta, A. and Chen, X. and Saparov, A. and Greaves, M. and Welling, J.},
  biburl = {https://www.bibsonomy.org/bibtex/263070703e6bb812852cca56574aed093/hotho},
  booktitle = {AAAI},
  description = {Papers by William W. Cohen},
  interhash = {52d0d71f6f5b332dabc1412f18e3a93d},
  intrahash = {63070703e6bb812852cca56574aed093},
  keywords = {learning nell ontology semantic toread},
  note = {: Never-Ending Learning in AAAI-2015},
  timestamp = {2015-01-27T15:35:24.000+0100},
  title = {Never-Ending Learning},
  url = {http://www.cs.cmu.edu/~wcohen/pubs.html},
  year = 2015
}
Thanks to @ontocord for adding this dataset.