Indian Legal Dataset Analysis

Indian Institute of Technology, Delhi


Frequently Asked Questions
We have scraped a large database of court case summaries from here. For each district, a sample of cases is created, ranging from roughly 1,000 cases for some districts to more than 500,000 for others. Each of these samples is analysed across several different metrics to identify anomalies. Additionally, we have created an aggregate performance metric using KL divergence.
Case duration time is the time gap (in number of days) between the first hearing date and the final decision date.
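As a minimal sketch (assuming hypothetical column names first_hearing_date and decision_date; the real dataset's schema may differ), this could be computed with pandas:

```python
import pandas as pd

# Hypothetical columns; the real dataset's field names may differ.
cases = pd.DataFrame({
    "first_hearing_date": ["2015-01-10", "2016-03-02"],
    "decision_date": ["2015-06-20", "2016-03-02"],
})

for col in ("first_hearing_date", "decision_date"):
    cases[col] = pd.to_datetime(cases[col])

# Case duration time: gap in days between first hearing and final decision.
cases["duration_days"] = (cases["decision_date"] - cases["first_hearing_date"]).dt.days
print(cases["duration_days"].tolist())  # [161, 0]
```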
This is the ratio of the standard deviation of case duration time to its mean (the coefficient of variation). The higher the value, the greater the variability in case duration time for a district (or for a state). This is a statistical parameter used to compare across distributions with different means.
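For illustration, a small sketch of this ratio on made-up duration samples:

```python
import numpy as np

def coefficient_of_variation(durations):
    """Ratio of the standard deviation to the mean of case duration times."""
    durations = np.asarray(durations, dtype=float)
    return durations.std() / durations.mean()

# Two hypothetical districts with the same mean duration but different spread.
print(coefficient_of_variation([100, 110, 90, 105, 95]))  # ~0.07 (low variability)
print(coefficient_of_variation([10, 300, 50, 120, 20]))   # ~1.07 (high variability)
```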
For all the disposed cases in a district's sample, the case duration time is calculated, and the percentage of cases taking more than X units of time is computed. This was done for four different thresholds: 3 months, 6 months, 1 year and 3 years.
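A sketch of this computation, assuming durations measured in days and the approximations of 90, 180, 365 and 1095 days for the four thresholds (the actual day counts used in the analysis may differ):

```python
import numpy as np

# Approximate day counts for the four reporting thresholds (illustrative only).
THRESHOLDS_DAYS = {"3 months": 90, "6 months": 180, "1 year": 365, "3 years": 1095}

def pct_exceeding(durations_days, thresholds=THRESHOLDS_DAYS):
    """Percentage of disposed cases whose duration exceeds each threshold."""
    d = np.asarray(durations_days)
    return {label: 100.0 * (d > limit).mean() for label, limit in thresholds.items()}

print(pct_exceeding([30, 200, 400, 1500, 90]))
# {'3 months': 60.0, '6 months': 60.0, '1 year': 40.0, '3 years': 20.0}
```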
This is the percentage of cases in the sample for which the final decision date and the first hearing date are the same.
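A quick check of this metric on hypothetical durations (a duration of 0 days means the decision date equals the first hearing date):

```python
import numpy as np

def pct_same_day(durations_days):
    """Percentage of cases decided on the same day as the first hearing."""
    d = np.asarray(durations_days)
    return 100.0 * (d == 0).mean()

print(pct_same_day([0, 161, 0, 45]))  # 50.0
```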
The aggregate score of a district is calculated as the KL divergence between the histogram of case duration times for that district and the corresponding histogram for the entire nation, treating both histograms as probability densities. If the same process is applied at the state level, the distributional differences that drive the divergence scores get averaged out, and we get nearly identical divergence scores for different states. This also makes sense, as we would expect behaviour at the state granularity to be broadly similar across states.
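A rough sketch of this score, assuming a shared set of histogram bins for the district and the nation and a small epsilon so that empty national bins do not produce infinite divergence (the bin count and epsilon are illustrative choices, not the project's actual settings); scipy.stats.entropy(p, q) returns the KL divergence KL(p || q):

```python
import numpy as np
from scipy.stats import entropy

def aggregate_score(district_durations, national_durations, bins=50, eps=1e-9):
    """KL divergence between a district's case-duration histogram and the
    national histogram, with both treated as probability densities."""
    # Shared bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(national_durations, bins=bins)
    p, _ = np.histogram(district_durations, bins=edges)
    q, _ = np.histogram(national_durations, bins=edges)
    # Epsilon keeps empty bins from producing an infinite divergence.
    p = p + eps
    q = q + eps
    return entropy(p, q)  # normalises the counts and returns KL(p || q)

# Hypothetical samples: a slower district compared against a national baseline.
rng = np.random.default_rng(0)
national = rng.exponential(scale=300, size=100_000)
district = rng.exponential(scale=600, size=5_000)
print(aggregate_score(district, national))
```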