Dataset:

tomekkorbak/pile-curse-full

Generation procedure

The dataset was constructed from documents in the Pile, each scored using the LDNOOBW wordlist (a document's score is the number of curse words per character).
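
A per-character score of this kind can be sketched as below. This is a minimal illustration, not the author's scoring code: the tiny `WORDLIST` is a hypothetical stand-in for the real LDNOOBW list, and matching on lowercase whole-word tokens is an assumption (the source does not say how matches were counted).

```python
import re

# Hypothetical stand-in for the LDNOOBW wordlist; the real list is much longer.
WORDLIST = {"damn", "hell"}


def curse_score(text, wordlist=WORDLIST):
    """Return the number of wordlist hits divided by the number of characters."""
    if not text:
        return 0.0
    # Assumption: case-insensitive, whole-word matching on alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for tok in tokens if tok in wordlist)
    return hits / len(text)
```

For example, `curse_score("damn it")` counts one hit over seven characters. Dividing by characters rather than words keeps the score comparable across documents of very different lengths.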

The procedure was as follows:

  • The first half of the data consists of 100k documents randomly sampled from the Pile and assigned scores
  • The second half consists of the highest-scoring documents in the Pile, obtained by scoring the whole Pile and selecting the 100k documents with the highest scores
  • The dataset was then shuffled and split 9:1 into train and test sets

Basic stats

The average and median scores are 0.013 and 0.019, respectively.
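
The sampling, top-score selection, shuffle, and split steps of the generation procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original pipeline: `build_dataset`, its parameters, and the fixed seed are hypothetical, the input is assumed to fit in memory as a list of `(text, score)` pairs, and the two halves are not deduplicated because the source does not say whether they were.

```python
import heapq
import random


def build_dataset(scored_docs, n_per_half, seed=0):
    """Build a shuffled 9:1 train/test split from (text, score) pairs.

    Hypothetical helper: n_per_half would be 100_000 for the dataset described.
    """
    rng = random.Random(seed)
    # First half: documents sampled uniformly at random (without replacement).
    random_half = rng.sample(scored_docs, n_per_half)
    # Second half: the documents with the highest scores.
    top_half = heapq.nlargest(n_per_half, scored_docs, key=lambda doc: doc[1])
    combined = random_half + top_half
    rng.shuffle(combined)
    # 9:1 train-test split.
    cut = len(combined) * 9 // 10
    return combined[:cut], combined[cut:]
```

Because the second half is chosen by score, the combined score distribution is far from that of the Pile as a whole, which is worth keeping in mind when reading the summary statistics.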