
The model was trained to predict whether an English sentence is formal or informal.

Base model: roberta-base

Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus (Pavlick and Tetreault, 2016).

Data augmentation: converting texts to upper or lower case, removing all punctuation, and adding a period at the end of a sentence. It was applied because the model otherwise relies too heavily on punctuation and capitalization and pays too little attention to other features.
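The augmentations listed above could be sketched roughly as follows. This is only an illustration, not the training code: the function names, the choice to apply at most one transformation per text, and the probability `p` are assumptions, and only ASCII punctuation is handled.

```python
import random
import string


def to_upper(text: str) -> str:
    return text.upper()


def to_lower(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    # Assumption: only ASCII punctuation (string.punctuation) is removed.
    return text.translate(str.maketrans("", "", string.punctuation))


def add_final_dot(text: str) -> str:
    # Append a dot unless the text is empty or already ends a sentence.
    stripped = text.rstrip()
    if stripped and stripped[-1] not in ".!?":
        return stripped + "."
    return stripped


# Hypothetical list of transformations, one per augmentation in the text.
AUGMENTATIONS = [to_upper, to_lower, remove_punctuation, add_final_dot]


def augment(text: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p (an assumed value), apply one random transformation."""
    if rng.random() < p:
        return rng.choice(AUGMENTATIONS)(text)
    return text
```

For example, `remove_punctuation("Hello, world!")` gives `"Hello world"`, and `add_final_dot("see you soon")` gives `"see you soon."`. The label of a sentence is left unchanged by these transformations, which is what pushes the model to look beyond casing and punctuation.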

Loss: binary classification (on GYAFC), in-batch ranking (on the P&T data).
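The exact form of the two losses is not specified here, so the following is only one plausible reading, written in plain Python for clarity: binary cross-entropy on a single logit for GYAFC, and a pairwise logistic loss over all pairs in a batch whose human ratings differ for the P&T data. The function names, the label convention (1 for one class, 0 for the other), and the pairwise formulation are assumptions.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def binary_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Mean binary cross-entropy; labels are assumed to be 0 or 1."""
    losses = [
        -log_sigmoid(z) if y == 1 else -log_sigmoid(-z)
        for z, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)


def in_batch_ranking_loss(scores: Sequence[float], ratings: Sequence[float]) -> float:
    """Assumed pairwise form: for every pair in the batch where item i is
    rated above item j, penalize the model unless score_i exceeds score_j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ratings[i] > ratings[j]:
                losses.append(-log_sigmoid(scores[i] - scores[j]))
    # Batches with no ordered pair (all ratings tied) contribute nothing.
    return sum(losses) / len(losses) if losses else 0.0
```

A ranking objective of this kind only uses the relative order of ratings within a batch, which suits continuous formality ratings better than forcing them into two classes.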

Performance metrics on the test data:

| dataset | ROC AUC | precision | recall | F-score | accuracy | Spearman |
|---|---|---|---|---|---|---|
| GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
| GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |

| P&T subset | Spearman R |
|---|---|
| news | 0.4003 |
| answers | 0.7500 |
| blog | 0.7334 |
| email | 0.7606 |