March 19, 2021 – Akshay Jagadeesh

Akshay Jagadeesh, Psychology PhD Student, Stanford University

Texture-like representations support object perception in dCNNs and the human brain. 

Humans identify objects by the presence of complex visual features arranged in particular configurations. How does the visual system represent objects? In one widely held account (the “holistic object” hypothesis), the ventral temporal cortex (VTC) explicitly represents objects as a whole, such that it is sensitive not just to the features that make up an object but also to the particular configuration of those features. A contrasting viewpoint (the “feature space” hypothesis) proposes that objects are represented by a basis set of distinct complex visual features, from which an experience-dependent readout gives rise to configural object perception. To resolve this debate, we introduce a novel approach for image synthesis that allows us to independently control both the complexity of visual features and the spatial configuration of those features. We use this approach to compare the configural selectivity of human perception with that of representations in human VTC, macaque inferotemporal (IT) cortex, and deep convolutional neural networks (dCNNs) trained to perform object categorization. We demonstrate that human observers are highly selective for the natural configuration of object images. In contrast, dCNNs, human VTC, and macaque IT cortex are all insensitive to object configuration, although they do contain complex feature representations. This configural invariance suggests that these representations are texture-like and provides evidence for the “feature space” account of the visual system. How then do these configurally invariant representations give rise to object perception? We show that a category-specific linear transformation of the VTC representational space can generate a configurally selective object representation that is better able to explain human behavior.
Our results suggest that the human visual system lacks explicit object representations and instead should be thought of as representing a space of complex features, which might help to explain the robustness and flexibility of visual object perception.