Abstract
In decision-making settings such as medical diagnosis, underwriting, or sentencing in a court of
law, decisions are often influenced by multiple factors such as the decision-maker’s background, experience, and
personal biases. Variability in decisions across individuals is thus inevitable, and so are disparities
across systems. Every system has its own way of addressing prevailing disparities, be it through
noise audits, standardized guidelines, or other approaches. In the medical field, for instance, prior
studies have documented significant inter-observer variations in the interpretation of clinical images
such as MRIs and X-rays, leading to inconsistencies in diagnoses and treatments. To address the
issue, the medical domain is witnessing active development of AI tools intended to support clinical data
analysis and reduce disparities in healthcare outcomes. In this thesis, we investigate a data cube-based
methodology for examining disparities in the legal domain.
In the legal domain, decisions related to parole or bail grants, child custody, or sentence imposition are often left to the judges’ discretion. Specifically for sentence imposition, guidelines in many
countries around the world allow for subjectivity. For example, in India, while sentencing guidelines
prescribe minimum and maximum punishments for different offences, the weights assigned to various aggravating and mitigating circumstances are left to the judge’s discretion. This flexibility, though
intended to accommodate case-specific nuances, increases variance, leading to inconsistencies in trial
outcomes, sentence lengths, and penalties across courts and judges. The literature widely acknowledges that anomalies and disparities exist in sentencing and other legal decisions, often stemming from
personal beliefs, biases, and contextual factors. Over the years, several efforts have also been made
to analyse and assess sentencing anomalies and disparities in India and other parts of the world, particularly with respect to individual factors such as gender, race, and socioeconomic background, through
surveys, case studies, and machine learning techniques.
Notably, the Online Analytical Processing (OLAP) methodology has been widely employed in the literature to analyse multidimensional data in different domains. The concept of a data cube was proposed to
summarise a table along every combination of its dimensions (its subcubes), enabling multidimensional analysis and derivation of
insights from diverse perspectives. It is commonly adopted to extract interesting trends and anomalies
from multidimensional data in domains such as sales and marketing. However, to the best of our knowledge, no effort has so far been
made to extend the data cube-based framework to explore anomalies and disparities in the legal domain.
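As an illustration of the data cube idea described above, the following minimal Python sketch computes one aggregate table per subset of dimensions. The records, the dimension names (`state`, `offence`), the measure (`fine`), and the helper `build_cube` are all hypothetical; they are not drawn from the thesis dataset or its implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; names and values are illustrative only.
ROWS = [
    {"state": "A", "offence": "theft", "fine": 1000},
    {"state": "A", "offence": "assault", "fine": 3000},
    {"state": "B", "offence": "theft", "fine": 2000},
    {"state": "B", "offence": "theft", "fine": 4000},
]
DIMENSIONS = ("state", "offence")


def build_cube(rows, dimensions, measure):
    """Aggregate `measure` over every subset of `dimensions`.

    Returns a dict mapping each group-by (a tuple of dimension names)
    to a dict of cells: {cell key: (case count, mean of the measure)}.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cells = defaultdict(list)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cells[key].append(row[measure])
            cube[dims] = {
                key: (len(values), sum(values) / len(values))
                for key, values in cells.items()
            }
    return cube


cube = build_cube(ROWS, DIMENSIONS, "fine")
```

With two dimensions the sketch yields four group-bys, from the grand total `()` to the finest level `("state", "offence")`. Comparing a cell such as `cube[("state",)][("B",)]` against its siblings or its parent aggregate is one simple way to flag a candidate anomaly.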
In this thesis, we leverage the OLAP framework and propose a data cube-based approach to explore
potential trends and anomalies in judicial decisions, particularly sentences. A major bottleneck in this domain has been the lack of structured datasets. To address this, we employed a large language model
(LLM) to curate a structured dataset from unstructured data extracted manually from Indian criminal
case judgments. We designed a conceptual schema by identifying relevant attributes and hierarchies and
defining appropriate aggregate measures. We used this schema to build a data cube on the curated
dataset and facilitate anomaly detection. This approach enabled us to uncover potential anomalies in
court sentences, particularly in terms of sentence quantum and monetary penalties for similar offences across
Indian states. For instance, we observed that courts in Kerala imposed higher monetary penalties in
cases of rape and murder than courts in other states. Several additional trends were also identified.
Our experiments demonstrate that the proposed framework can surface anomalies whose further study
could help in understanding the causal factors, such as bias, that contribute to sentencing
disparities.
We also provide a structured dataset annotated by domain experts, which captures sentencing-related
information along with the verdict rationale from judgments of around 10,000 criminal cases adjudicated in the trial courts, High Courts, and the Supreme Court of India during 2000-2010. We make this
dataset public to encourage further research.