Poster in Workshop: Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)
BudgetLongformer: Can we Cheaply Pretrain a SOTA Legal Language Model From Scratch?
Joel Niklaus · Daniele Giofrè
Keywords: [ Efficient Graphs for NLP ] [ ENLSP-Main ]
Pretrained transformer models have recently achieved state-of-the-art results in many tasks and benchmarks. Many state-of-the-art (SOTA) Language Models (LMs), however, do not scale well beyond the threshold of 512 input tokens. In specialized domains (such as legal, scientific, or biomedical), models often need to process very long texts (sometimes well above 10,000 tokens). Even though many efficient transformers have been proposed (such as Longformer, BigBird, or FNet), so far only very few such efficient models are available for specialized domains. Additionally, since the pretraining process is extremely costly in general, and even more so as the sequence length increases, it is often only within reach of large research labs. One way of making pretraining cheaper is the Replaced Token Detection (RTD) task, which provides more signal during training because the loss can be computed over all tokens. In this work, we train Longformer models with the efficient RTD task on legal data to showcase that pretraining efficient LMs is possible using much less compute. We evaluate the trained models on challenging summarization tasks that require the model to summarize long texts, to show to what extent the models can achieve good performance on downstream tasks. We find that both the small and base models outperform their baselines on the in-domain BillSum and out-of-domain PubMed tasks within their respective parameter ranges. We publish our code and models for research purposes.
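To make concrete why RTD yields a training signal at every position (unlike masked language modeling, where only the masked ~15% of tokens contribute to the loss), here is a minimal sketch of an ELECTRA-style RTD loss in PyTorch. This is not the authors' implementation; the function and variable names (`rtd_loss`, `discriminator_logits`, `is_replaced`) are illustrative, and the discriminator is assumed to be a long-context encoder such as Longformer with a binary per-token head.

```python
# Minimal sketch of the Replaced Token Detection (RTD) objective (ELECTRA-style).
# Illustrative only; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn


def rtd_loss(discriminator_logits: torch.Tensor,
             is_replaced: torch.Tensor,
             attention_mask: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy averaged over all non-padding positions.

    discriminator_logits: (batch, seq_len) scores from the discriminator
        (e.g. a Longformer encoder with a binary head) indicating whether
        each token was replaced by the small generator.
    is_replaced: (batch, seq_len) 0/1 labels marking replaced tokens.
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding.
    """
    per_token = nn.BCEWithLogitsLoss(reduction="none")(
        discriminator_logits, is_replaced.float()
    )
    # Every non-padding token contributes to the loss, which is the source
    # of the denser training signal compared to MLM.
    mask = attention_mask.float()
    return (per_token * mask).sum() / mask.sum()


# Toy usage with a long sequence, as would occur with legal documents.
logits = torch.randn(2, 4096)
labels = torch.randint(0, 2, (2, 4096))  # which tokens the generator replaced
mask = torch.ones(2, 4096)
print(rtd_loss(logits, labels, mask))
```

Because the loss is dense in this way, the discriminator can be trained with fewer steps or less compute for a comparable amount of learning signal, which is the property the paper exploits for cheap pretraining on long legal texts.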