Legal Longformer (base)
This is a derivative of the LexLM (base) RoBERTa model.
All model parameters were cloned from the original model, while the positional embeddings were extended by cloning the original embeddings multiple times, following Beltagy et al. (2020), using a Python script similar to this one (https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb).
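The cloning of positional embeddings can be illustrated with a minimal sketch. This is not the script used for this model: the function name, the toy sizes, and the use of plain Python lists instead of tensors are illustrative assumptions, and `num_reserved=2` assumes the RoBERTa convention of two reserved leading rows in the position embedding table.

```python
def extend_position_embeddings(original, new_num_positions, num_reserved=2):
    """Build a longer position-embedding table by repeating the learned rows.

    original: list of rows (each a list of floats). The first `num_reserved`
    rows are copied once (assumed RoBERTa-style padding offset); the remaining
    rows are cloned cyclically until `new_num_positions` positions exist.
    """
    reserved = original[:num_reserved]
    learned = original[num_reserved:]
    if not learned:
        raise ValueError("no learned position rows to clone")
    extended = [list(row) for row in reserved]
    for i in range(new_num_positions):
        # Position i in the long model reuses original position i mod len(learned).
        extended.append(list(learned[i % len(learned)]))
    return extended


# Toy table: 2 reserved rows + 4 learned positions, embedding size 3.
original = [[float(row)] * 3 for row in range(6)]
# Extend from 4 to 8 usable positions (hypothetical sizes for illustration).
extended = extend_position_embeddings(original, new_num_positions=8)
```

In this toy case the four learned rows appear twice in `extended`, so positions 4 to 7 start from the same values as positions 0 to 3.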
Model description
LexLM (Base/Large) are our newly released RoBERTa models. We follow a series of best practices in language model development:
- We warm-start (initialize) our models from the original RoBERTa checkpoints (base or large) of Liu et al. (2019).
- We train a new tokenizer of 50k BPEs, but we reuse the original embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021).
- We continue pre-training our models on the diverse LeXFiles corpus for an additional 1M steps with batches of 512 samples and a masking rate of 20% for the base model and 30% for the large model (Wettig et al., 2022).
- We use a sentence sampler with exponential smoothing of the sub-corpora sampling rate, following Conneau et al. (2019), because the proportion of tokens varies widely across sub-corpora and we aim to preserve per-corpus capacity (avoid overfitting).
- We consider mixed-cased models, similar to all recently developed large PLMs.
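The exponential smoothing of sub-corpus sampling rates mentioned above can be sketched as follows. This is an illustration rather than the training code: the function name, the sub-corpus names and sizes, and the smoothing exponent `alpha=0.5` are assumptions, since the exact value used is not stated here. Each sub-corpus share p_i is raised to the power alpha and the results are renormalized, which moves probability mass from large sub-corpora to small ones.

```python
def smoothed_sampling_rates(token_counts, alpha=0.5):
    """Return sampling rates q_i proportional to p_i ** alpha.

    token_counts: mapping from sub-corpus name to its number of tokens.
    alpha: smoothing exponent in (0, 1]; alpha=1 keeps the raw proportions,
    smaller values flatten the distribution (the value here is an assumption).
    """
    total = sum(token_counts.values())
    weights = {name: (count / total) ** alpha for name, count in token_counts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Hypothetical sub-corpus sizes in tokens, for illustration only.
token_counts = {"large_corpus": 9_000_000, "small_corpus": 1_000_000}
rates = smoothed_sampling_rates(token_counts, alpha=0.5)
```

With these toy numbers the small sub-corpus holds 10% of the tokens but is sampled 25% of the time, while the large one drops from 90% to 75%.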
Citation
Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard.
LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development.
2023. In the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
author = "Chalkidis*, Ilias and
Garneau*, Nicolas and
Goanta, Catalina and
Katz, Daniel Martin and
Søgaard, Anders",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2305.07507",
}