AI Fails

AUI: Artificial Un-Intelligence #6

AI errors happen, but they rarely result in a violent outcome. This week, a chess-playing robot in Russia made international news when it broke a 7-year-old boy’s finger. 

Video from the Moscow Chess Open shows the robot, an industrial arm built to play chess on three boards simultaneously, gripping the child’s finger for 15 seconds after the boy rushed his next move before the robot’s arm had retracted. NBC News gives a breakdown of the footage that caught the incident: 

“It shows the machine reaching for and grabbing one of the boy’s chess pieces, and quickly discarding it from the board. Before the robot’s arm retracted, the boy attempted to make another move, pushing one of his rooks into the same place as the recently removed piece. The video then shows the robot’s mechanical claws descending back toward the board, this time grabbing the boy’s index finger instead of a chess piece.” 

A group of bystanders wrestled with the machine and ultimately freed the child’s hand, though by their account the fracture could not be avoided. 

What went wrong

Sergey Lazarev, President of the Moscow Chess Federation, put it quite plainly when reacting to the news: “The robot broke the child’s finger — this, of course, is bad.” So what went wrong? 

The robot has been used for over a decade, often to challenge chess grandmasters in exhibition matches. In this case, it was playing against three children at once. 

Unfortunately, there isn’t much public information about how the robot was programmed, but after watching the video we can make some inferences about how this happened. 

In fact, we disagree with the version of events from the writers at NBC News. 

Inference #1: It appears that the robot was trying to place a piece on the board exactly where the child’s finger was. The robot likely couldn’t register that the piece had reached the board because the child’s finger was in the way, so it kept applying constant pressure as it tried to place the piece. 

Inference #2: The pressure was probably steady and not especially high, but it was clearly strong enough to fracture the boy’s finger and make it difficult to pull free. 

Inference #3: The robot did not let go of the piece or try to move it to a different area of the board, so its programming had likely broken down by that point. 

All of which leads us to…

Inference #4: There were no programming safeguards in place to prevent the incident from happening. 

See the video for yourself: 

Potential hotfixes 

Sergey Lazarev said of the incident, “we will communicate, try to sort it out and help in any way we can. And the robot operators, apparently, will have to think about strengthening protection so that such a situation does not happen again.” So what types of protections are needed? 

From a tech perspective, most of the answers here could be as simple as improving the robot’s software. Better discernment between chess pieces and non-chess-piece objects would be a start, or perhaps guardrails that only allow the robot to grab things it knows to be chess pieces. 
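
We don’t know how this robot’s controller was actually written, so the following is only a hypothetical sketch of the kind of software guardrails described above: a hard cap on grip force, a timeout that aborts the move instead of applying constant pressure, and a check that the gripped object really is a chess piece. Every method on the `arm` object here is invented for illustration.

```python
# Hypothetical sketch only -- the real robot's software is not public, and
# the `arm` interface below (grip_force, piece_seated, etc.) is invented.
import time

MAX_GRIP_FORCE_N = 5.0       # assumed safe ceiling on gripper force, in newtons
PLACEMENT_TIMEOUT_S = 2.0    # give up quickly if the target square is blocked

def place_piece(arm, square):
    """Try to place the held piece on `square`; release and retreat on any anomaly."""
    deadline = time.monotonic() + PLACEMENT_TIMEOUT_S
    while not arm.piece_seated(square):
        if arm.grip_force() > MAX_GRIP_FORCE_N:
            arm.release_and_retreat()       # never squeeze harder than the cap
            return False
        if time.monotonic() > deadline:
            arm.release_and_retreat()       # never keep pressing on a blocked square
            return False
        if not arm.holding_chess_piece():   # e.g. a finger instead of a rook
            arm.release_and_retreat()
            return False
        arm.lower_gently(square)
    return True
```

The details would differ on real hardware, but the principle stands: the default response to anything unexpected should be to let go and back off, not to keep trying.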

Humans play a part in safety here, too. Stricter rules around the robot, like only moving a piece when the arm isn’t active over the board, could be effective. 

Artie’s take 

Let’s ask our GPT-3 copywriter Artie for some additional guidance here. We asked, “how should we program a robot chess player to not injure humans?” 

Artie says, “We should program a robot chess player to not injure humans by ensuring that its movements are slow and deliberate.” This, of course, is not bad advice, Artie. 

At Invisible, we take pride in our vision that humans and machines work best together. Of course, errors still occur, which is why we have people QA all automated business processes. 

Interested in how we can leverage both humans and technology to help you meet business goals? Get in touch. 

Tune in next week for more tech fails. 


Andrew Hull

Overview

| LLM Task | Benchmark Dataset/Corpus | Common Metric | Dataset available at |
| --- | --- | --- | --- |
| Sentiment Analysis | SST-1/SST-2 | Accuracy | https://huggingface.co/datasets/sst2 |
| Natural Language Inference / Recognizing Textual Entailment | Stanford Natural Language Inference Corpus (SNLI) | Accuracy | https://nlp.stanford.edu/projects/snli/ |
| Named Entity Recognition | CoNLL-2003 | F1 Score | https://huggingface.co/datasets/conll2003 |
| Question Answering | SQuAD | F1 Score, Exact Match, ROUGE | https://rajpurkar.github.io/SQuAD-explorer/ |
| Machine Translation | WMT | BLEU, METEOR | https://machinetranslate.org/wmt |
| Text Summarization | CNN/Daily Mail Dataset | ROUGE | https://www.tensorflow.org/datasets/catalog/cnn_dailymail |
| Text Generation | WikiText | BLEU, ROUGE | https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/ |
| Paraphrasing | MRPC | ROUGE, BLEU | https://www.microsoft.com/en-us/download/details.aspx?id=52398 |
| Language Modelling | Penn Tree Bank | Perplexity | https://zenodo.org/record/3910021#.ZB3qdHbP23A |
| Bias Detection | StereoSet | Bias Score, Differential Performance | https://huggingface.co/datasets/stereoset |

Table 1 - Examples of LLM tasks with common benchmark datasets and their respective metrics. Please note that for many of these tasks there are multiple benchmark datasets, some of which are not mentioned here.
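
As a quick illustration of how these benchmarks are typically accessed, here is a minimal sketch that loads two of the datasets from Table 1 with the Hugging Face `datasets` library. It assumes the library is installed (`pip install datasets`); exact split and field names can vary between dataset versions.

```python
# Minimal sketch: loading two of the Table 1 benchmarks with the Hugging Face
# `datasets` library. Assumes `pip install datasets`; field names may vary.
from datasets import load_dataset

# SST-2 (sentiment analysis): single sentences labeled positive/negative.
sst2 = load_dataset("sst2", split="validation")
print(sst2[0])  # expected keys: 'idx', 'sentence', 'label'

# CoNLL-2003 (named entity recognition): tokenized sentences with NER tags.
conll = load_dataset("conll2003", split="validation")
print(conll[0]["tokens"])
print(conll[0]["ner_tags"])
```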

Metric Selection

| Metric | Usage | Pros | Cons |
| --- | --- | --- | --- |
| Accuracy | Measures the proportion of correct predictions made by the model compared to the total number of predictions. | Simple interpretability. Provides an overall measure of model performance. | Sensitive to dataset imbalances, which can make it uninformative. Does not take into account false positives and false negatives. |
| Precision | Measures the proportion of true positives out of all positive predictions. | Useful when the cost of false positives is high. Measures the accuracy of positive predictions. | Does not take into account false negatives. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| Recall | Measures the proportion of true positives out of all actual positive instances. | Useful when the cost of false negatives is high. | Does not take into account false positives. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| F1 Score | Measures the harmonic mean of precision and recall. | Robust to imbalanced datasets. | Assumes equal importance of precision and recall. May not be suitable for multi-class classification problems with different class distributions. |
| Perplexity | Measures the model's uncertainty in predicting the next token (common in text generation tasks). | Interpretable, as it provides a single value for model performance. | May not directly correlate with human judgment. |
| BLEU | Measures the similarity between machine-generated text and reference text. | Correlates well with human judgment. Easily interpretable for measuring translation quality. | Does not directly explain performance on certain tasks (though it correlates with human judgment). Lacks sensitivity to word order and semantic meaning. |
| ROUGE | Measures the similarity between machine-generated and human-generated text. | Has multiple variants to capture different aspects of similarity. | May not capture semantic similarity beyond n-grams or LCS. Limited to measuring surface-level overlap. |
| METEOR | Measures the similarity between machine-generated translations and reference translations. | Addresses some limitations of BLEU, such as recall and synonyms. | May have higher computational complexity compared to BLEU or ROUGE. Requires linguistic resources for matching, which may not be available for all languages. |

Table 2 - Common LLM metrics, their usage as a measurement tool, and their pros and cons. Note that several of these metrics have multiple variants. For example, ROUGE variants include ROUGE-N, ROUGE-L, and ROUGE-W: ROUGE-N measures the overlap of n-word sequences between the reference text and the model-generated text; ROUGE-L measures the overlap of the longest common subsequence of tokens in the reference and generated text, without requiring the matched tokens to be contiguous; and ROUGE-W assigns weights (relative importance) to longer runs of common tokens (similar to ROUGE-L, but weighted). A combination of the most relevant variants of a metric such as ROUGE is typically selected for comprehensive evaluation.
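
To make the first few rows of Table 2 concrete, here is a small, library-free sketch that computes accuracy, precision, recall, and F1 for a binary classification task, plus perplexity from per-token negative log-likelihoods. The example labels and log-likelihood values are made up purely for illustration.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
# -> accuracy 0.6, precision ~0.67, recall ~0.67, f1 ~0.67
print(round(perplexity([2.1, 1.7, 3.0, 2.4]), 2))  # -> 9.97 (mean NLL of 2.3)
```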
