The model has been trained to predict whether English sentences are formal or informal.
Base model: roberta-base
Datasets: GYAFC (Rao and Tetreault, 2018) and the online formality corpus (Pavlick and Tetreault, 2016).
Data augmentation: changing texts to upper or lower case, removing all punctuation, and adding a period at the end of a sentence. It was applied because the model otherwise over-relies on punctuation and capitalization and pays too little attention to other features.
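
The augmentations listed above could be sketched roughly as follows. This is a minimal illustration, not the training code: the function names (`remove_punctuation`, `add_final_dot`, `augment`) and the choice to apply exactly one random perturbation per sentence are assumptions made for the example.

```python
import random
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Strip all ASCII punctuation from the text."""
    return text.translate(_PUNCT_TABLE)


def add_final_dot(text: str) -> str:
    """Append a period unless the text already ends with one."""
    stripped = text.rstrip()
    return stripped if stripped.endswith(".") else stripped + "."


def augment(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen surface perturbation (assumed policy)."""
    transforms = [str.upper, str.lower, remove_punctuation, add_final_dot]
    return rng.choice(transforms)(text)


if __name__ == "__main__":
    rng = random.Random(0)
    print(augment("Could you please send me the report?", rng))
```

Training on both the original and the perturbed sentences is one way to keep the classifier from treating capitalization or a trailing period as the main formality signal.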
Loss: binary classification (on GYAFC), in-batch ranking (on P&T data).
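
The exact form of the in-batch ranking loss is not specified here. One common formulation, shown below purely as an assumption, is a pairwise logistic (RankNet-style) loss: for every pair of sentences in a batch where one has a higher annotated formality score, the model is penalized if it does not score that sentence higher. The function name `in_batch_ranking_loss` is hypothetical, and the sketch uses plain Python floats rather than tensors to stay self-contained.

```python
import math


def in_batch_ranking_loss(scores, targets):
    """Pairwise logistic ranking loss over all ordered pairs in a batch.

    scores:  model outputs, one float per sentence
    targets: annotated formality scores, one float per sentence
    This is an assumed formulation, not necessarily the one used in training.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            # Only pairs where sentence i is annotated as more formal than j.
            if targets[i] > targets[j]:
                margin = scores[i] - scores[j]
                # log(1 + exp(-margin)): small when i is scored above j.
                total += math.log1p(math.exp(-margin))
                n_pairs += 1
    # A batch with no comparable pairs contributes zero loss.
    return total / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    print(in_batch_ranking_loss([2.0, 0.0, -2.0], [1.0, 0.0, -1.0]))
```

A ranking objective of this kind only depends on the relative order of the scores, which matches the use of Spearman correlation as the evaluation metric on the P&T data.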
Performance metrics on the test data:
dataset | ROC AUC | precision | recall | fscore | accuracy | Spearman |
---|---|---|---|---|---|---|
GYAFC | 0.9779 | 0.90 | 0.91 | 0.90 | 0.9087 | 0.8233 |
GYAFC normalized (lowercase + remove punct.) | 0.9234 | 0.85 | 0.81 | 0.82 | 0.8218 | 0.7294 |

P&T subset | Spearman R |
---|---|
news | 0.4003 |
answers | 0.7500 |
blog | 0.7334 |
email | 0.7606 |