A team of data and social scientists at the University of Chicago recently announced that they had developed an algorithm that can predict crime a week in advance. The announcement is the latest development in what has become law enforcement agencies' white whale - using AI to prevent crime before it happens.
The chase to better predict and prevent crime through technology isn't new. In fact, there is already a long list of AI programs that were scrapped, most of them because the tech perpetuated racial bias.
Can the latest AI development deliver predictive policing without bias to help save lives?
The new algorithm
The scientists at the University of Chicago say that their algorithm learns patterns of the timing and locations of both violent and property crimes from public data. Learning those patterns enables the AI to “predict future crimes one week in advance with about 90% accuracy,” the team reports.
The team differentiates this AI from previous, controversial predictive policing tech by how crime location data is analyzed. This model divides cities into uniform tiles about 1,000 feet across to avoid the bias that comes with neighborhood borders, while past tech relied on boundaries already loaded with stereotypes.
The model was tested with data from eight cities, including Philadelphia, San Francisco, and Chicago. Interestingly, the scientists warn that the tech shouldn't be used to direct law enforcement, but rather serve as one tool in the policing toolbox for addressing criminal activity.
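To make the tiling idea concrete, here is a minimal sketch - not the researchers' actual code - of how incident records might be bucketed into uniform grid cells roughly 1,000 feet across and counted per week. The record format (latitude, longitude, ISO timestamp) and the degree-to-feet conversion are illustrative assumptions.

```python
import math
from collections import Counter
from datetime import datetime

# Rough conversion: 1,000 feet expressed in degrees of latitude.
# (One degree of latitude is about 364,000 feet; longitude spacing would
# really need a cos(latitude) correction, ignored here for simplicity.)
CELL_DEG = 1000 / 364000

def cell_id(lat, lon):
    """Map a coordinate onto a uniform grid cell ~1,000 feet across."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def weekly_counts(incidents):
    """Count incidents per (grid cell, ISO week) from (lat, lon, timestamp) records."""
    counts = Counter()
    for lat, lon, ts in incidents:
        year_week = datetime.fromisoformat(ts).isocalendar()[:2]  # (year, week)
        counts[(cell_id(lat, lon), year_week)] += 1
    return counts

# Hypothetical example records: (latitude, longitude, ISO timestamp)
incidents = [
    (41.8781, -87.6298, "2022-06-01T14:30:00"),
    (41.8790, -87.6305, "2022-06-02T22:10:00"),
    (41.9000, -87.6500, "2022-06-08T03:45:00"),
]
print(weekly_counts(incidents))
```

A real system would then feed these per-cell weekly counts into a forecasting model; this sketch only shows the spatial and temporal bucketing step.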
Biased tech
How did past predictive policing AI fare? Two years ago, the LAPD scrapped the predictive policing program PredPol after civil liberties advocates argued that it perpetuated biased policing toward people of color.
Nearby Santa Cruz even banned predictive policing not long after. Another tactic for law enforcement agencies is to use facial recognition tech to catch criminals - but in 2018, a London trial of the technology performed very poorly due to biased inputs.
Is it the tech’s fault?
Researchers argue that police data in the US has historically been biased because law enforcement disproportionately arrests individuals in low-income areas predominantly housing people of color. That suggests that predictive policing AI leverages a flawed dataset, with biased inputs creating biased outputs.
With that in mind, it’s unclear whether development of predictive policing AI is worth advancing until the data used to teach it is improved. In fact, this appears to be a flaw that the University of Chicago algorithm doesn’t solve.
Human oversight
Biased inputs create one problem, and automated tech creates another. Perhaps the historical failures in predictive policing stem from the belief that a perfect AI can exist - or that the tech we're capable of developing now is a panacea for crime prevention.
Writers at JSTOR point to a symptom of the problem: there's a gap in clarity and accountability when the tech fails. Tools that support law enforcement can be developed either in-house or by third-party builders, and especially in the case of contracted work, accountability is hard to come by.
The power of humans + tech
In reality, all AI ought to have some level of human oversight. It’s a belief we preach at Invisible, where we use both humans and machines to get the most out of each other (we call it worksharing).
We combine a human workforce and automation to carry out business processes for our clients. Here’s an example of how we used that combo to help a company find rare leads.
Interested in how we can leverage both humans and technology to help you meet business goals? Get a custom demo.
Tune in next week for more tech fails.
| LLM Task | Benchmark Dataset/Corpus | Common Metric | Dataset available at |
| --- | --- | --- | --- |
| Sentiment Analysis | SST-1/SST-2 | Accuracy | https://huggingface.co/datasets/sst2 |
| Natural Language Inference / Recognizing Textual Entailment | Stanford Natural Language Inference Corpus (SNLI) | Accuracy | https://nlp.stanford.edu/projects/snli/ |
| Named Entity Recognition | CoNLL-2003 | F1 Score | https://huggingface.co/datasets/conll2003 |
| Question Answering | SQuAD | F1 Score, Exact Match, ROUGE | https://rajpurkar.github.io/SQuAD-explorer/ |
| Machine Translation | WMT | BLEU, METEOR | https://machinetranslate.org/wmt |
| Text Summarization | CNN/Daily Mail Dataset | ROUGE | https://www.tensorflow.org/datasets/catalog/cnn_dailymail |
| Text Generation | WikiText | BLEU, ROUGE | |
| Paraphrasing | MRPC | ROUGE, BLEU | https://www.microsoft.com/en-us/download/details.aspx?id=52398 |
| Language Modelling | Penn Treebank | Perplexity | https://zenodo.org/record/3910021#.ZB3qdHbP23A |
| Bias Detection | StereoSet | Bias Score, Differential Performance | |
Table 1 - Examples of LLM tasks with common benchmark datasets and their respective metrics. Note that many of these tasks have multiple benchmark datasets, only some of which are mentioned here.
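As a concrete illustration, the sketch below loads one of the benchmark datasets from Table 1 (SST-2, via the Hugging Face `datasets` library linked above) and scores a trivial placeholder model with accuracy. The split and column names ("sentence", "label") are taken from the dataset card and should be treated as assumptions to verify against the current Hub page.

```python
# Requires: pip install datasets
from datasets import load_dataset

# SST-2 as hosted on the Hugging Face Hub (https://huggingface.co/datasets/sst2).
# Assumed columns: "sentence" (text) and "label" (0 = negative, 1 = positive).
sst2 = load_dataset("sst2", split="validation")

def predict(sentence):
    """Placeholder model: always predicts the positive class.
    Swap in a real sentiment classifier here."""
    return 1

correct = sum(predict(ex["sentence"]) == ex["label"] for ex in sst2)
accuracy = correct / len(sst2)
print(f"Accuracy on {len(sst2)} validation examples: {accuracy:.3f}")
```

The same pattern - load the benchmark split, generate predictions, compare against the gold labels with the task's standard metric - applies to the other rows in Table 1, with the metric swapped out as appropriate.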
| Metric | Usage | Pros | Cons |
| --- | --- | --- | --- |
| Accuracy | Measures the proportion of correct predictions made by the model compared to the total number of predictions. | Simple interpretability. Provides an overall measure of model performance. | Sensitive to dataset imbalances, which can make it uninformative. Does not take into account false positives and false negatives. |
| Precision | Measures the proportion of true positives out of all positive predictions. | Useful when the cost of false positives is high. Measures the accuracy of positive predictions. | Does not take into account false negatives. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| Recall | Measures the proportion of true positives out of all actual positive instances. | Useful when the cost of false negatives is high. | Does not take into account false positives. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| F1 Score | Measures the harmonic mean of precision and recall. | More robust to imbalanced datasets than accuracy. | Assumes equal importance of precision and recall. May not be suitable for multi-class classification problems with different class distributions. |
| Perplexity | Measures the model's uncertainty in predicting the next token (common in language modelling and text generation tasks). | Interpretable, as it provides a single value for model performance. | May not directly correlate with human judgment. |
| BLEU | Measures the similarity between machine-generated text and reference text. | Correlates well with human judgment. Easily interpretable for measuring translation quality. | Does not directly explain performance on certain tasks (only correlates with human judgment). Lacks sensitivity to word order and semantic meaning. |
| ROUGE | Measures the similarity between machine-generated and human-generated text. | Has multiple variants to capture different aspects of similarity. | May not capture semantic similarity beyond n-grams or LCS. Limited to measuring surface-level overlap. |
| METEOR | Measures the similarity between machine-generated translations and reference translations. | Addresses some limitations of BLEU, such as recall and synonyms. | Higher computational complexity than BLEU or ROUGE. Requires linguistic resources for matching, which may not be available for all languages. |
Table 2 - Common LLM metrics, their usage as measurement tools, and their pros and cons. Note that for some of these metrics there exist different variants. For example, versions of ROUGE include ROUGE-N, ROUGE-L, and ROUGE-W. For context, ROUGE-N measures the overlap of n-word sequences between the reference text and the model-generated text. ROUGE-L measures the longest common subsequence of tokens shared by the reference and generated text - matched tokens must appear in the same order but need not be contiguous. ROUGE-W, on the other hand, assigns weights (relative importance) to longer runs of consecutive common tokens (similar to ROUGE-L but with added weights). A combination of the most relevant variants of a metric such as ROUGE is typically selected for a comprehensive evaluation.
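To ground the ROUGE discussion above, here is a minimal from-scratch sketch of recall-oriented ROUGE-N and ROUGE-L using simple whitespace tokenization. Production evaluations would normally rely on an established implementation (for example, the `rouge-score` package), and real scorers add stemming, precision/F-measure variants, and more careful tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams from a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=1):
    """ROUGE-N recall: overlapping n-grams / n-grams in the reference."""
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram overlap
    return overlap / max(sum(ref.values()), 1)

def lcs_length(a, b):
    """Length of the longest common subsequence (in order, not necessarily contiguous)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_recall(reference, candidate):
    """ROUGE-L recall: LCS length / number of reference tokens."""
    ref, cand = reference.split(), candidate.split()
    return lcs_length(ref, cand) / max(len(ref), 1)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(f"ROUGE-1 recall: {rouge_n_recall(reference, candidate, 1):.2f}")
print(f"ROUGE-2 recall: {rouge_n_recall(reference, candidate, 2):.2f}")
print(f"ROUGE-L recall: {rouge_l_recall(reference, candidate):.2f}")
```

The example pair illustrates the surface-level nature of these scores noted in Table 2: swapping a single content word lowers the n-gram and LCS overlap even when the overall meaning stays close.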