VICTOR: a Dataset for Brazilian Legal Documents Classification

This paper describes VICTOR, a novel dataset built from digitized legal documents of Brazil's Supreme Court. It comprises more than 45 thousand appeals, which include roughly 692 thousand documents (about 4.6 million pages). The dataset contains labeled text data and supports two tasks: document type classification and theme assignment, the latter a multi-label problem. We present baseline results using bag-of-words models, convolutional neural networks, recurrent neural networks and boosting algorithms. We also experiment with linear-chain Conditional Random Fields to leverage the sequential nature of the lawsuits, which we find improves document type classification. Finally, we compare two theme classification approaches: one that uses domain knowledge to filter out the less informative document pages, and the default one that uses all pages. Contrary to the Court experts' expectations, we find that using all available data is the better method. We make the dataset available in three versions of different sizes and contents to encourage exploration of better models and techniques.

LREC 2020

| Task | Dataset | Model | Weighted F1 | Rank (Weighted F1) | Average F1 | Rank (Average F1) |
|---|---|---|---|---|---|---|
| Multi-Label Text Classification | BVICTOR | SVM | 0.8235 | # 2 | 0.7761 | # 2 |
| Multi-Label Text Classification | BVICTOR | XGBoost | 0.8957 | # 1 | 0.8843 | # 1 |
| Multi-Label Text Classification | BVICTOR | NB | 0.6955 | # 3 | 0.6335 | # 3 |
| Multi-Label Text Classification | MVICTOR (theme) | NB | 0.6062 | # 3 | 0.3797 | # 3 |
| Multi-Label Text Classification | MVICTOR (theme) | SVM | 0.8137 | # 2 | 0.6642 | # 2 |
| Multi-Label Text Classification | MVICTOR (theme) | XGBoost | 0.9072 | # 1 | 0.8882 | # 1 |
| Text Classification | MVICTOR (type) | SVM | 0.9288 | # 4 | 0.6792 | # 4 |
| Text Classification | MVICTOR (type) | NB | 0.8477 | # 5 | 0.4772 | # 5 |
| Text Classification | MVICTOR (type) | CNN + CRF | 0.9537 | # 1 | 0.7505 | # 1 |
| Text Classification | MVICTOR (type) | CNN | 0.9464 | # 2 | 0.7061 | # 3 |
| Text Classification | MVICTOR (type) | BiLSTM | 0.9433 | # 3 | 0.7092 | # 2 |
| Multi-Label Text Classification | SVICTOR (theme) | XGBoost | 0.8634 | # 1 | 0.8887 | # 1 |
| Multi-Label Text Classification | SVICTOR (theme) | SVM | 0.8231 | # 2 | 0.8246 | # 2 |
| Multi-Label Text Classification | SVICTOR (theme) | NB | 0.4875 | # 3 | 0.5121 | # 3 |
| Text Classification | SVICTOR (type) | CNN | 0.9472 | # 2 | 0.7584 | # 3 |
| Text Classification | SVICTOR (type) | CNN + CRF | 0.9533 | # 1 | 0.7740 | # 1 |
| Text Classification | SVICTOR (type) | BiLSTM | 0.9465 | # 3 | 0.7281 | # 4 |
| Text Classification | SVICTOR (type) | SVM | 0.9425 | # 4 | 0.7632 | # 2 |
| Text Classification | SVICTOR (type) | NB | 0.8893 | # 5 | 0.5979 | # 5 |
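
The results report two F1 variants per model, and the gap between them can be large (for example 0.9288 versus 0.6792 for SVM on MVICTOR (type)). The sketch below shows how such a gap can arise. It assumes that "Average F1" means the unweighted (macro) mean of per-class F1 and that "Weighted F1" weights each class by its support; the page does not define either metric, so both readings are assumptions. The function names and the toy labels are hypothetical, and the code covers only the single-label case.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed meaning of 'Average F1')."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[l] * support[l] for l in scores) / len(y_true)


# Toy, made-up labels: one frequent class and one rare class.
y_true = ["other"] * 8 + ["ruling"] * 2
y_pred = ["other"] * 8 + ["other", "ruling"]

print(f"macro F1:    {macro_f1(y_true, y_pred):.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

In this toy example the rare class is recognized only half the time, which pulls the macro score down while barely affecting the weighted one. A similar pattern in the table would suggest that models with a high Weighted F1 but a much lower Average F1 struggle mainly on infrequent classes.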
