Dataset:
semaj83/ctmatch_classification
CTMatch Classification Dataset
This is a combined set of 2 labelled datasets of (topic, doc, label) triples in jsonl format, where topic is a patient description, doc is a clinical trial document (selected fields), and label is one of {0, 1, 2}.
(This partly duplicates some of the ir_dataset data also available on HF.)
These have been processed using ctproc, and in this state can be used by various tokenizers for fine-tuning (see ctmatch for examples).
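As a rough illustration of the triple format described above, the following Python sketch reads jsonl lines into (topic, doc, label) tuples and counts the labels. It assumes the JSON keys are literally named `topic`, `doc`, and `label`, which this card does not confirm. The sample records and the `read_triples` helper are made up for illustration and are not taken from the dataset or from ctmatch.

```python
import io
import json
from collections import Counter

# Hypothetical records, invented for illustration; not taken from the dataset.
# Assumption: each line is a JSON object with keys "topic", "doc", "label".
SAMPLE_JSONL = """\
{"topic": "58-year-old woman with type 2 diabetes", "doc": "Inclusion: adults with type 2 diabetes", "label": 2}
{"topic": "58-year-old woman with type 2 diabetes", "doc": "Inclusion: children with asthma", "label": 0}
{"topic": "34-year-old man with chronic migraine", "doc": "Inclusion: adults with episodic headache", "label": 1}
"""


def read_triples(lines):
    """Yield (topic, doc, label) tuples from an iterable of jsonl lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        label = int(record["label"])
        if label not in (0, 1, 2):
            raise ValueError(f"unexpected label: {label!r}")
        yield record["topic"], record["doc"], label


# A real file would be opened with open(path, encoding="utf-8") instead.
triples = list(read_triples(io.StringIO(SAMPLE_JSONL)))
label_counts = Counter(label for _, _, label in triples)
print(len(triples), dict(label_counts))
```

Checking the label distribution this way before fine-tuning is a cheap way to spot class imbalance across the three labels.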
These 2 datasets contain no patient-identifying information and are openly available in raw form:
TREC: http://www.trec-cds.org/2021.html
CSIRO: https://data.csiro.au/collection/csiro:17152
See the repo for more information: https://github.com/semajyllek/ctmatch