Machine learning image classifiers use context clues to help make sense of a scene: if a system confidently identifies a dining-room table, for example, that can help resolve ambiguity about nearby objects, nudging it to label them as chairs.
The downside of this powerful approach is that classifiers can be confounded by out-of-context elements in a scene, as demonstrated in The Elephant in the Room, a paper from a trio of Toronto-based computer science researchers.
The authors show that computer vision systems able to confidently identify a large number of items in a living-room scene (a man, a chair, a TV, a sofa, etc.) become fatally confused when the researchers paste an elephant into the room. The unexpected item throws the detectors into dire confusion: not only do they struggle to identify the elephant, they also struggle with everything else in the scene, including items they identified confidently when the elephant was absent.
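To get a feel for the setup, here's a minimal sketch (not the authors' code) of the kind of transplant experiment the paper describes: paste a cut-out elephant into a living-room photo and compare what an off-the-shelf detector reports before and after. The file names, paste position, and the choice of torchvision's pretrained Faster R-CNN are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch: transplant an out-of-context object into a scene and compare a
# pretrained detector's outputs before and after. "living_room.jpg" and
# "elephant_cutout.png" are placeholder file names.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A standard off-the-shelf detector; the paper evaluates several pretrained models.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect(img, threshold=0.5):
    """Run the detector on a PIL image; return (label, score) pairs above threshold."""
    with torch.no_grad():
        out = model([to_tensor(img)])[0]
    keep = out["scores"] > threshold
    return list(zip(out["labels"][keep].tolist(), out["scores"][keep].tolist()))

scene = Image.open("living_room.jpg").convert("RGB")
elephant = Image.open("elephant_cutout.png").convert("RGBA")  # cut-out with transparency

# Detections on the untouched scene.
before = detect(scene)

# Paste the elephant in and detect again; the paper reports that this kind of
# transplant can perturb detections even for objects far from the pasted region.
scene_with_elephant = scene.copy()
scene_with_elephant.paste(elephant, (50, 200), mask=elephant)
after = detect(scene_with_elephant)

print("before:", before)
print("after: ", after)
```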
It's a new wrinkle on the idea of adversarial examples, those minor, often human-imperceptible changes to inputs that can completely confuse machine-learning systems.
Contextual Reasoning: It is not common for current object detectors to explicitly take into account context on a semantic level, meaning that interplay between object categories and their relative spatial layout (or possibly additional relations) are encoded in the reasoning process of the network. Though many methods claim to incorporate contextual reasoning, this is done more at a feature-wise level, meaning that global image information is encoded somehow in each decision. This is in contrast to older works, in which explicit contextual reasoning was quite popular (see [3] for mention of many such works). Still, it is apparent that some implicit form of contextual reasoning does seem to take place. One such example is a person detected near the keyboard (Figure 6, last column, last row). Some of the created images contain pairs of objects that may never appear together in the same image in the training set, or otherwise give rise to scenes with unlikely configurations. For example, non co-occurring categories, such as elephants and books, or unlikely spatial / functional relations such as a large person (in terms of image area) above a small bus. Such scenes could cause misinterpretation due to contextual reasoning, whether it is learned explicitly or not.
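That "non co-occurring categories" point is easy to check against a detection dataset yourself. Here's a minimal sketch, assuming a COCO-style annotation file (the filename is a placeholder), that counts how often pairs of categories such as "elephant" and "book" actually share an image.

```python
# Hedged sketch: estimate category co-occurrence from a COCO-style annotation file,
# to see which category pairs rarely or never appear together in training images.
import json
from collections import defaultdict
from itertools import combinations

with open("instances_train2017.json") as f:  # placeholder path to COCO-format annotations
    coco = json.load(f)

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

# Collect the set of category names present in each image.
cats_per_image = defaultdict(set)
for ann in coco["annotations"]:
    cats_per_image[ann["image_id"]].add(id_to_name[ann["category_id"]])

# Count how often each unordered pair of categories co-occurs in one image.
pair_counts = defaultdict(int)
for cats in cats_per_image.values():
    for a, b in combinations(sorted(cats), 2):
        pair_counts[(a, b)] += 1

print("elephant & book co-occurrences:", pair_counts.get(("book", "elephant"), 0))
```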
The Elephant in the Room [Amir Rosenfeld, Richard Zemel and John K. Tsotsos/Arxiv]
Machine Learning Confronts the Elephant in the Room [Kevin Hartnett/Quanta]