AI errors happen, but rarely do they result in a violent outcome. This week, a chess-playing robot in Russia made international news when it broke a 7-year-old boy’s finger.
Video from the Moscow Chess Open shows the robot, an industrial arm built to play chess on three boards simultaneously, gripping the child’s finger for 15 seconds after the boy rushed to make his next move. NBC News gives a breakdown of the video that caught the incident:
“It shows the machine reaching for and grabbing one of the boy’s chess pieces, and quickly discarding it from the board. Before the robot’s arm retracted, the boy attempted to make another move, pushing one of his rooks into the same place as the recently removed piece. The video then shows the robot’s mechanical claws descending back toward the board, this time grabbing the boy’s index finger instead of a chess piece.”
A group of bystanders wrestled with the machine and ultimately freed the child’s hand, but they could not prevent the fracture.
What went wrong
Sergey Lazarev, President of the Moscow Chess Federation, put it plainly when reacting to the news: “The robot broke the child’s finger — this, of course, is bad.” So what went wrong?
The robot has been used for over a decade, often to challenge chess grandmasters in exhibition matches. In this case, it was playing against three children at once.
Unfortunately, there isn’t a lot of information about how the robot was programmed, but after watching the video we can make some inferences about how this happened.
In fact, we disagree with the version of events from the writers at NBC News.
Inference #1: It appears the robot was trying to place a piece on the board right where the child’s finger was. The robot likely couldn’t register that the piece was on the board because the child’s finger was in the way, so it kept applying constant pressure while trying to place the piece.
Inference #2: The pressure was probably steady and not especially high, but it was clearly strong enough to fracture the boy’s finger and make his hand difficult to free.
Inference #3: The robot did not let go of the piece or try to move it to a different area of the board, so its programming had likely broken down by that point.
All of which leads us to…
Inference #4: There were no programming safeguards in place to prevent the incident from happening.
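To make that concrete, here’s a purely hypothetical sketch of the failure mode we’re inferring. None of this is the tournament robot’s actual code; the classes and functions are illustrative stand-ins.

```python
# Purely hypothetical sketch of the inferred failure mode; these classes are
# illustrative stand-ins, not the tournament robot's real software.
import time


class Square:
    """One board square; a piece only 'registers' when nothing blocks it."""
    def __init__(self, blocked_by_finger: bool):
        self.blocked_by_finger = blocked_by_finger

    def registers_piece(self) -> bool:
        # With a finger in the way, the piece never seats properly.
        return not self.blocked_by_finger


class Arm:
    def press_down(self, force_newtons: float) -> None:
        print(f"pressing with {force_newtons:.1f} N")
        time.sleep(0.5)


def place_piece(arm: Arm, square: Square) -> None:
    # Inferences #1 and #2: keep applying steady pressure until the board
    # registers the piece. With no timeout or obstruction check (Inference #4),
    # a blocked square means the arm never stops pressing.
    while not square.registers_piece():
        arm.press_down(force_newtons=5.0)


# place_piece(Arm(), Square(blocked_by_finger=True))  # would press indefinitely
```

The point is the loop: with no exit condition other than “the piece registered,” anything blocking the square keeps the arm pressing.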
See the video for yourself:
Potential hotfixes
Sergey Lazarev said of the incident, “we will communicate, try to sort it out and help in any way we can. And the robot operators, apparently, will have to think about strengthening protection so that such a situation does not happen again.” So what types of protections are needed?
From a tech perspective, most of the answers here could be as simple as improving the robot’s software: better discernment between chess pieces and non-chess-piece objects would be a start, as would guardrails that only allow the robot to grab things it knows to be chess pieces.
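As a rough illustration of what those guardrails might look like, here’s a hedged sketch. The interface, labels, and thresholds below are our assumptions, not the operators’ actual software.

```python
# Hypothetical guardrails for a pick-and-place chess arm. The ArmStub interface,
# object labels, and thresholds are assumptions for illustration only.
import time
from dataclasses import dataclass

CHESS_PIECES = {"pawn", "rook", "knight", "bishop", "queen", "king"}


@dataclass
class DetectedObject:
    label: str          # e.g. "rook", "hand", "unknown"
    confidence: float   # classifier confidence in [0, 1]


class ArmStub:
    """Stand-in for the real robot controller."""
    def __init__(self):
        self._holding = False

    def close_gripper(self, max_force_newtons: float) -> None:
        self._holding = True  # a real controller would report measured force

    def is_holding(self) -> bool:
        return self._holding

    def retract(self) -> None:
        print("retracting arm")


def safe_grab(arm: ArmStub, target: DetectedObject,
              min_confidence: float = 0.95,
              max_force_newtons: float = 2.0,
              timeout_s: float = 2.0) -> bool:
    """Grab only objects confidently classified as chess pieces, with hard caps
    on grip force and on how long the attempt may last."""
    # Guardrail 1: never grab anything that isn't clearly a chess piece.
    if target.label not in CHESS_PIECES or target.confidence < min_confidence:
        arm.retract()
        return False

    # Guardrail 2: bound force and time; if the grab doesn't succeed, give up.
    deadline = time.monotonic() + timeout_s
    while not arm.is_holding() and time.monotonic() < deadline:
        arm.close_gripper(max_force_newtons)
    if not arm.is_holding():
        arm.retract()
        return False
    return True


print(safe_grab(ArmStub(), DetectedObject(label="finger", confidence=0.99)))  # False: arm retracts
```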
Humans play a part in safety here, too: stricter rules around the robot, like only moving a piece when the arm isn’t active over the board, could be effective.
Artie’s take
Let’s ask our GPT-3 copywriter Artie for some additional guidance here. We asked, “how should we program a robot chess player to not injure humans?”
Artie says, “We should program a robot chess player to not injure humans by ensuring that its movements are slow and deliberate.” This, of course, is not bad advice, Artie.
At Invisible, we take pride in our vision that humans and machines work best together. Of course, errors still occur, which is why we have people QA all automated business processes.
Interested in how we can leverage both humans and technology to help you meet business goals? Get in touch.
Tune in next week for more tech fails.
| LLM Task | Benchmark Dataset/Corpus | Common Metric | Dataset available at |
| --- | --- | --- | --- |
| Sentiment Analysis | SST-1/SST-2 | Accuracy | https://huggingface.co/datasets/sst2 |
| Natural Language Inference / Recognizing Textual Entailment | Stanford Natural Language Inference Corpus (SNLI) | Accuracy | https://nlp.stanford.edu/projects/snli/ |
| Named Entity Recognition | CoNLL-2003 | F1 Score | https://huggingface.co/datasets/conll2003 |
| Question Answering | SQuAD | F1 Score, Exact Match, ROUGE | https://rajpurkar.github.io/SQuAD-explorer/ |
| Machine Translation | WMT | BLEU, METEOR | https://machinetranslate.org/wmt |
| Text Summarization | CNN/Daily Mail Dataset | ROUGE | https://www.tensorflow.org/datasets/catalog/cnn_dailymail |
| Text Generation | WikiText | BLEU, ROUGE | |
| Paraphrasing | MRPC | ROUGE, BLEU | https://www.microsoft.com/en-us/download/details.aspx?id=52398 |
| Language Modelling | Penn Treebank | Perplexity | https://zenodo.org/record/3910021#.ZB3qdHbP23A |
| Bias Detection | StereoSet | Bias Score, Differential Performance | |
Table 1 - Examples of common LLM tasks with benchmark datasets and their respective metrics. Note that many of these tasks have multiple benchmark datasets, not all of which are mentioned here.
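If you want to poke at a couple of these benchmarks yourself, here is a minimal sketch using the Hugging Face `datasets` library, assuming it is installed; the dataset IDs are taken from the URLs in Table 1, and exact split and field names may vary by library version.

```python
# Minimal sketch of loading two Table 1 benchmarks with the Hugging Face
# `datasets` library (pip install datasets); dataset IDs come from the table's URLs.
from datasets import load_dataset

# SST-2 for sentiment analysis; accuracy is the usual metric.
sst2 = load_dataset("sst2", split="validation")
print(sst2[0])  # {'idx': ..., 'sentence': ..., 'label': 0 or 1}

# CoNLL-2003 for named entity recognition; F1 is the usual metric.
conll = load_dataset("conll2003", split="test")
print(conll[0]["tokens"][:5], conll[0]["ner_tags"][:5])
```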
| Metric | Usage | Pros | Cons |
| --- | --- | --- | --- |
| Accuracy | Measures the proportion of correct predictions made by the model out of the total number of predictions. | Simple to interpret. Provides an overall measure of model performance. | Sensitive to dataset imbalances, which can make it uninformative. Does not distinguish between false positives and false negatives. |
| Precision | Measures the proportion of true positives out of all positive predictions. | Useful when the cost of false positives is high. Measures the accuracy of positive predictions. | Does not take false negatives into account. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| Recall | Measures the proportion of true positives out of all actual positive instances. | Useful when the cost of false negatives is high. | Does not take false positives into account. Depends on other metrics to be informative (cannot be used alone). Sensitive to dataset imbalances. |
| F1 Score | Measures the harmonic mean of precision and recall. | Robust to imbalanced datasets. | Assumes equal importance of precision and recall. May not be suitable for multi-class classification problems with different class distributions. |
| Perplexity | Measures the model's uncertainty in predicting the next token (common in text generation tasks). | Interpretable, as it provides a single value for model performance. | May not directly correlate with human judgment. |
| BLEU | Measures the similarity between machine-generated text and reference text. | Correlates well with human judgment. Easily interpretable for measuring translation quality. | Does not directly explain performance on certain tasks (only correlates with human judgment). Lacks sensitivity to word order and semantic meaning. |
| ROUGE | Measures the similarity between machine-generated and human-generated text. | Has multiple variants to capture different aspects of similarity. | May not capture semantic similarity beyond n-grams or LCS. Limited to measuring surface-level overlap. |
| METEOR | Measures the similarity between machine-generated translations and reference translations. | Addresses some limitations of BLEU, such as recall and synonyms. | Higher computational complexity than BLEU or ROUGE. Requires linguistic resources for matching, which may not be available for all languages. |
Table 2 - Common LLM metrics, their usage as measurement tools, and their pros and cons. Note that some of these metrics have multiple variants. For example, variants of ROUGE include ROUGE-N, ROUGE-L, and ROUGE-W. ROUGE-N measures the overlap of n-word sequences between the reference text and the model-generated text. ROUGE-L measures overlap based on the longest common subsequence of tokens shared by the reference and generated text, allowing gaps between matched tokens. ROUGE-W is similar to ROUGE-L but assigns greater weight to longer contiguous runs of common tokens. In practice, a combination of the most relevant variants of a metric like ROUGE is selected for comprehensive evaluation.
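To ground a few of these definitions, here is a small from-scratch sketch of precision, recall, F1, and a simplified ROUGE-N recall. It is our own illustration rather than a reference implementation; production evaluations typically rely on an established library.

```python
# From-scratch sketch of a few Table 2 metrics, to make the definitions concrete.
from collections import Counter

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct positives / predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def rouge_n(reference: str, generated: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of the reference's n-grams found in the generated text."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, gen = ngrams(reference), ngrams(generated)
    if not ref:
        return 0.0
    overlap = sum(min(count, gen[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))  # 5/6 = 0.833...
```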