CTMatch Information Retrieval Dataset
This is a dataset of processed clinical trial documents. It largely duplicates the data found in datasets/ir_datasets, except that these documents have been preprocessed with ctproc to clean them and extract useful fields.
Note: They are currently saved as text files because of the downstream task in ctmatch, though in the future they may be converted to .csv.
Each .txt file has exactly 374648 lines of corresponding data:
doc_texts.txt: texts extracted from documents processed with ctproc using and eligibility criteria fields only, structured as this example from NCT00000102:
"Inclusion Criteria: diagnosed with Congenital Adrenal Hyperplasia (CAH) normal ECG during baseline evaluation, Exclusion Criteria: history of liver disease, or elevated liver function tests history of cardiovascular disease"
doc_categories.txt: 1 x 14 vectors of somewhat arbitrarily chosen topic probabilities (softmax output) generated by the zero-shot classification model facebook/bart-large-mnli, CTMatch.category_model(doc['condition']), lexically ordered as such:
cancer,cardiac,endocrine,gastrointestinal,genetic,healthy,infection,neurological,other,pediatric,psychological,pulmonary,renal,reproductive
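As a rough illustration of how a row of doc_categories.txt could be read back into labelled probabilities, the sketch below pairs the 14 values with the category names listed above. The on-disk number format is not specified here, so the parser assumes comma- or whitespace-separated floats, optionally wrapped in square brackets. `parse_vector`, `top_categories`, and the sample row are hypothetical and not part of ctmatch.

```python
# Category names in the lexical order given above.
CATEGORIES = (
    "cancer,cardiac,endocrine,gastrointestinal,genetic,healthy,infection,"
    "neurological,other,pediatric,psychological,pulmonary,renal,reproductive"
).split(",")


def parse_vector(line):
    """Parse one line of numbers into a list of floats.

    Assumption: values are separated by commas and/or whitespace and may be
    wrapped in square brackets; adjust this if the real files differ.
    """
    cleaned = line.strip().strip("[]").replace(",", " ")
    return [float(value) for value in cleaned.split()]


def top_categories(line, k=3):
    """Return the k (category, probability) pairs with the highest probability."""
    probs = parse_vector(line)
    if len(probs) != len(CATEGORIES):
        raise ValueError(f"expected {len(CATEGORIES)} values, got {len(probs)}")
    ranked = sorted(zip(CATEGORIES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up row for illustration only (not taken from the dataset).
sample_row = "0.02 0.01 0.70 0.02 0.10 0.01 0.02 0.02 0.03 0.03 0.01 0.01 0.01 0.01"
```

Calling `top_categories(sample_row, k=2)` on the made-up row ranks `endocrine` first and `genetic` second. The length check guards against reading a row from the wrong file.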
doc_embeddings.txt: 1 x 384 vectors of embeddings taken from the last hidden state of the model-encoded doc_text, using SentenceTransformers(sentence-transformers/all-MiniLM-L6-v2)
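One common way to use such embeddings for retrieval is cosine-similarity ranking. Whether ctmatch ranks documents exactly this way is not stated here, so treat the following as a sketch under that assumption. It uses tiny 3-dimensional toy vectors in place of the 384-dimensional rows, and `cosine` and `rank_documents` are hypothetical helper names. A real query vector would need to come from the same all-MiniLM-L6-v2 model for the comparison to be meaningful.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, k=2):
    """Return the k (row_index, score) pairs most similar to the query."""
    scored = [(index, cosine(query_vec, vec)) for index, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for rows of doc_embeddings.txt (illustrative values only).
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [1.0, 0.1, 0.0]
top = rank_documents(query_vec, doc_vecs, k=2)
```

The returned row indices are the same line indices used by the other files, so they can be mapped back to NCT IDs through index2docid.txt.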
index2docid.txt: simple mapping of index to NCT IDs for filtering/reference throughout the IR program, in the same order as the vector and text representations
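Because the files are aligned line by line, a row index in one file identifies the same trial in the others. The sketch below builds an NCT ID to text lookup from two aligned lists of lines. It assumes the NCT ID is the last comma- or whitespace-separated token on each line of index2docid.txt, since the exact layout is not specified here. `read_lines`, `parse_docid`, and `build_lookup` are hypothetical names, and the second sample record is invented.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def parse_docid(line):
    """Extract the NCT ID from one line of index2docid.txt.

    Assumption: the ID is the last comma- or whitespace-separated token,
    which covers both "NCT00000102" and "0,NCT00000102" style lines.
    """
    return line.replace(",", " ").split()[-1]


def build_lookup(id_lines, text_lines):
    """Map each NCT ID to the text stored at the same line index."""
    if len(id_lines) != len(text_lines):
        raise ValueError("files are not aligned: line counts differ")
    return {parse_docid(i): t for i, t in zip(id_lines, text_lines)}


# In-memory stand-ins for the two files; the second record is invented.
id_lines = ["0,NCT00000102", "1,NCT99999999"]
text_lines = ["text for the first trial", "text for the second trial"]
lookup = build_lookup(id_lines, text_lines)
```

With the real files, the same call would be `build_lookup(read_lines("index2docid.txt"), read_lines("doc_texts.txt"))`. The length check fails early if the two files do not have the same number of lines.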