
Criminal Justice Algorithm Predicts Risk of Biased Sentencing

Researchers created an algorithm that predicts risks of biased, overly punitive sentencing. The tool performs with similar accuracy — and similar limits — to risk assessment algorithms already used to influence pretrial and parole decisions, authors say.

A new algorithm aims to assess the likelihood of defendants being treated unfairly in court.

The tool considers details that ought to be immaterial to the ruling — such as the judge’s and defendant’s gender and race — and then predicts how likely the judge is to award an unusually long sentence. This can suggest when socio-demographic details may be swaying judgments, resulting in especially punitive treatment.

Members of the American Civil Liberties Union (ACLU), Carnegie Mellon University (CMU), Idaho Justice Project and University of Pennsylvania created the algorithm for U.S. district court cases. They presented it in a report during the June Association for Computing Machinery Conference on Fairness, Accountability and Transparency (ACM FAccT).

“The risk assessment instrument we develop in this paper aims to predict disproportionately harsh sentences prior to sentencing with the goal of hopefully avoiding these disproportionate sentences and reducing the disparities in the system going forward,” the authors state in their report.

According to the authors, theirs are the first criminal justice algorithms that take a defendant’s perspective.

“To date, there exists no risk assessment instrument that considers the risk the system poses to the individual,” they write in their report.

Other algorithms instead focus on the risks that individuals who’ve been accused or incarcerated will behave in undesirable ways. Some algorithms, for example, are intended to assess the likelihood that arrestees will flee or be rearrested if released on bail before their court dates.


Before the ACLU, CMU, Idaho Justice Project and UPenn team could develop an algorithm predicting unusually punitive sentencing, they had to determine what usual sentencing looked like. To achieve this, they first created an algorithm that estimates the length of sentence a judge is likely to give based on relevant case details, like the kind of offense and the accused’s criminal history.

ACLU Chief Data Scientist and report co-author Aaron Horowitz told Government Technology that this earlier algorithm might help defense attorneys get additional perspective on their cases by seeing how comparable ones have been sentenced.

“That’s a pretty hard task for public defenders to do right now,” Horowitz said, pointing to the challenges of navigating available data and determining which cases are “similarly situated.”

The report also suggests that potentially wronged defendants could use the second algorithm — the one assessing the likelihood that bias played a role — to argue for reducing sentences that may be unfair. But Horowitz said there may be a ways to go before the tool can be put into play.

“We’re not so sure that this algorithm will or should be used,” Horowitz said.

There are several reasons for that uncertainty, including that judges may be resistant to being told they’re likely biased, he said. The project is also in an early stage, with the algorithm still a prototype.

Plus, “we’re critics of a lot of these algorithms,” Horowitz said.

That leads to another goal of the study: to prompt those who operate in the criminal justice space to think critically about how they’re using algorithms and about the limits and assumptions baked into those tools.

The team’s algorithm for predicting sentencing bias has various accuracy limits, but other risk assessment tools already used in the criminal justice system face similar limitations, report authors wrote. Arguments against their algorithm on those grounds, then, would apply equally to the tools already in use.

“Our instrument performs comparably to other risk assessment instruments used in the criminal justice setting, and the predictive accuracy it achieves is considered ‘good’ by the standards of the field,” the report states.


The team drew on roughly 57,000 federal district court sentences from 2016-2017 and created an algorithmic model to identify sentences that were “especially long.” The model considers the details of a case that judges should pay attention to, like mandatory minimum sentencing requirements and the nature of the offense, to see what rulings are normal.

“We compared that predicted sentence length for similarly situated people to the actual sentence length that someone received,” explained report co-author Mikaela Meyer, a CMU doctoral student and a National Science Foundation (NSF) Graduate Research Fellow, during a conversation with GovTech.

If a defendant received a sentence longer than those issued in 90 percent of the other cases with “identical legally relevant factors,” the team flagged the ruling as “especially long.” Six percent of the examined sentences fell into this category.
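The flagging rule described above can be sketched in a few lines. The function name, the toy sentence lengths and the simple empirical-quantile cutoff are illustrative assumptions; the report’s actual model estimates expected sentences from case features rather than from a hand-picked list of comparable cases.

```python
# Hypothetical sketch of the "especially long" flagging rule: compare a
# defendant's actual sentence to the distribution of sentences in cases
# with identical legally relevant factors, and flag anything above the
# 90th percentile. All names and numbers here are invented for illustration.

def flag_especially_long(actual_months, comparable_months, quantile=0.90):
    """Flag a sentence as especially long if it exceeds the given
    quantile of sentences received in comparable cases."""
    ordered = sorted(comparable_months)
    # index of the 90th-percentile sentence among the comparable cases
    cutoff = ordered[int(quantile * (len(ordered) - 1))]
    return actual_months > cutoff

# Toy example: comparable cases drew sentences of 24 to 60 months
comparable = [24, 30, 36, 36, 42, 48, 48, 54, 58, 60]
print(flag_especially_long(72, comparable))  # True — flagged as especially long
print(flag_especially_long(40, comparable))  # False — within the normal range
```

Under this rule, roughly 10 percent of cases would be flagged by construction if sentences matched the comparison distribution exactly; the report’s observed 6 percent reflects flagging against model-predicted, not raw, distributions.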

The team then created a second model that considers other information that should have no impact on the rulings, such as the time of day at which the case was heard and judge and defendants’ races and genders. Other “legally irrelevant” details could include the political party of the president who appointed the judge and the defendants’ education levels and citizenship status. The algorithm predicts how likely a defendant is to receive an unusually long sentence, given these legally immaterial details about the case.
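As a rough illustration of the second step, one could tabulate how often flagged sentences occur across values of a legally irrelevant detail. This grouped-rate sketch, the function name and the toy records are assumptions for illustration only; the report’s second model is a full predictive algorithm combining many such features, not a simple cross-tabulation.

```python
# Hedged sketch of the idea behind the second model: does the rate of
# "especially long" sentences vary with a legally irrelevant detail,
# such as the time of day the case was heard? Records are invented.

from collections import defaultdict

def flag_rate_by_group(records, feature):
    """Rate of flagged (especially long) sentences for each value of one
    legally irrelevant feature."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for rec in records:
        key = rec[feature]
        totals[key] += 1
        flagged[key] += rec["flagged"]
    return {k: flagged[k] / totals[k] for k in totals}

# Toy records: hearing time of day vs. whether the sentence was flagged
records = [
    {"hearing": "morning", "flagged": 0},
    {"hearing": "morning", "flagged": 0},
    {"hearing": "morning", "flagged": 1},
    {"hearing": "morning", "flagged": 0},
    {"hearing": "afternoon", "flagged": 0},
    {"hearing": "afternoon", "flagged": 0},
]
print(flag_rate_by_group(records, "hearing"))
# {'morning': 0.25, 'afternoon': 0.0}
```

A gap between groups in such a table is what the real model formalizes: a prediction of unusually long sentences from details that should carry no legal weight.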


Subjectivity weaves its way into any algorithm, as developers make choices around what data to use and how to approximate measurements for what may be intangible ideas. Report authors say their algorithm has limits — but so do other ones already in use.

“Interrogating the limitations and values choices inherent to the construction of our own model, we have highlighted many parallel issues with traditional risk assessment instruments,” they write. “To the extent that one may reasonably believe that these limitations make our model unsuitable for use to estimate the risk to defendants, we would hope that such objections would be equally applied to the question of the suitability of similar such models to estimate the risk posed by defendants.”

Report authors had to decide how to select and clean up data to create data sets for training the models, for example, and how to define what counts as an “especially lengthy” sentence. Like other predictive algorithms, the team’s tool drew on historical data, which could limit how accurately it reflects today’s landscape, and Meyer noted that they examined only a sample — not all — of federal sentencing during that period.

Other tools are similarly constrained. Meyer said pretrial risk assessment tools often draw on limited data, because they are informed only by whether people who were released appeared for their court dates; they contain no outcome data at all for people who were detained.

“You’re only observing outcomes for people who are released pretrial; you don’t get to observe outcomes for people who are not released pretrial,” Meyer said. “[So] those people that you are observing outcomes for are not necessarily representative of all people who have pretrial hearings.”

These algorithms make subjective judgments, too. For example, developers choose what counts as “failure to appear,” often defining this in a way that lumps together people who willfully flee with those who faced logistical hurdles, such as an inability to get transportation or time off work, Meyer said.

Predictive models are often rated by a measure known as area under the receiver operating characteristic curve — or AUC. Models with low AUCs are closer to random guessing, while those with high AUCs are regarded as performing well.

Report authors say their algorithm has an AUC of 0.64, and AUCs within the range of 0.64-0.7 are commonly regarded as “good.” COMPAS — a tool used to predict an individual’s chance of recidivating — and Public Safety Assessment (PSA) — a tool for predicting arrestees’ likelihood of missing their hearings — also perform in the “good” range, per the report.
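The AUC figures quoted above can be made concrete with a small sketch: AUC equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one, so a chance-level model lands near 0.5. The scores and labels below are invented for illustration and do not come from the report.

```python
# Minimal pairwise definition of AUC: the fraction of positive/negative
# pairs the model ranks correctly, counting ties as half credit.

def auc(scores, labels):
    """AUC computed directly from its pairwise-ranking definition."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0]
# A model that ranks most positives above most negatives scores well...
print(auc([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1], labels))  # 0.9166666666666666
# ...while a model that cannot separate the classes sits at chance level.
print(auc([0.5] * 7, labels))  # 0.5
```

On this scale, the team’s 0.64 sits modestly above the 0.5 chance baseline — comparable, per the report, to tools like COMPAS and the PSA.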


With the algorithms now published, Horowitz said he’s starting conversations with public defense workers to see if these models can help them or if a different kind of support would be more useful.

Meyer said the team also wants to create experiments examining how putting the bias prediction algorithm into play might affect sentencing lengths and disparities over time. The idea is to assume that some judges would be swayed by algorithms to correct otherwise overly long sentences, and then to assess that impact.

Jule Pattison-Gordon is a senior staff writer for Government Technology. She previously wrote for PYMNTS and The Bay State Banner, and holds a B.A. in creative writing from Carnegie Mellon. She’s based outside Boston.