Model and tokenizer files for the InCaseLawBERT model from the paper Pre-training Transformers on Indian Legal Text.
For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India. The court cases in our dataset range from 1950 to 2019, and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on. In total, our dataset contains around 5.4 million Indian legal documents (all in the English language). The raw text corpus size is around 27 GB.
InCaseLawBERT is initialized from the Legal-BERT model from the paper When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In our work, we refer to that model as CaseLawBERT, and to our re-trained model as InCaseLawBERT. We further train CaseLawBERT on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
InCaseLawBERT uses the same tokenizer as CaseLawBERT. It has the same configuration as the bert-base-uncased model: 12 hidden layers, a hidden size of 768, 12 attention heads, and ~110M parameters.
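The ~110M figure can be checked with a little arithmetic. The sketch below counts the weights of a BERT-base encoder from its layer shapes. The vocabulary size (30522), maximum position count (512), and feed-forward size (3072) are the bert-base-uncased values and are assumptions here: the CaseLawBERT tokenizer may use a different vocabulary size, which would shift the embedding count slightly. The number of attention heads does not change the total, since the heads split the same 768-dimensional projections.

```python
# Parameter count for a BERT-base encoder, derived from its layer shapes.
# ASSUMPTION: values not stated above are taken from bert-base-uncased.
VOCAB_SIZE = 30522      # assumption: the model's own tokenizer may differ
MAX_POSITIONS = 512     # assumption: bert-base-uncased value
TYPE_VOCAB = 2          # segment A / segment B embeddings
HIDDEN = 768
INTERMEDIATE = 3072     # assumption: bert-base-uncased feed-forward size
LAYERS = 12


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


# Word, position, and segment embedding tables, followed by one LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)

# Query, key, value, and output projections, followed by one LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)

# Two dense layers (expand, then project back), followed by one LayerNorm.
feed_forward = (
    linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
)

encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)  # dense layer applied to the [CLS] vector

total = embeddings + encoder + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions the encoder layers account for about 85M of the parameters and the embedding tables for about 24M.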
Using the model to get embeddings/representations for a piece of text
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
text = "Replace this string with yours"
encoded_input = tokenizer(text, return_tensors="pt")
model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
output = model(**encoded_input)
last_hidden_state = output.last_hidden_state
```
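`last_hidden_state` holds one 768-dimensional vector per token. If a single vector per text is needed, one common option is to average the token vectors while ignoring padding positions. The sketch below shows that generic technique; it is not necessarily how the authors derive representations in the paper, and `mean_pool` is a hypothetical helper name. It uses random stand-in tensors so that it runs without downloading the model.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over non-padding positions (hypothetical helper).

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Stand-in tensors with the shapes the model would produce for a batch of two
# texts, the second padded from 3 to 5 tokens.
hidden = torch.randn(2, 5, 768)
attn = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])

pooled = mean_pool(hidden, attn)
print(pooled.shape)  # torch.Size([2, 768])
```

With the real model, the call would be `mean_pool(output.last_hidden_state, encoded_input["attention_mask"])`.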
We have fine-tuned all pre-trained models on 3 legal tasks with Indian datasets:
InCaseLawBERT performs close to CaseLawBERT across the three tasks, but not as well as InLegalBERT. For details, see our paper.
```bibtex
@inproceedings{paul-2022-pretraining,
  url = {https://arxiv.org/abs/2209.06049},
  author = {Paul, Shounak and Mandal, Arpan and Goyal, Pawan and Ghosh, Saptarshi},
  title = {Pre-trained Language Models for the Legal Domain: A Case Study on Indian Law},
  booktitle = {Proceedings of 19th International Conference on Artificial Intelligence and Law - ICAIL 2023},
  year = {2023},
}
```
We are a group of researchers from the Department of Computer Science and Technology, Indian Institute of Technology, Kharagpur. Our research interests are primarily ML and NLP applications for the legal domain, with a special focus on the challenges and opportunities for the Indian legal scenario. We have worked on, and are currently working on, several legal tasks such as: