%0 Journal Article %J Behavioral and Brain Sciences %D 2023 %T Let's move forward: Image-computable models and a common model evaluation scheme are prerequisites for a scientific understanding of human vision %A DiCarlo, James J. %A Yamins, Daniel L. K. %A Ferguson, Michael E. %A Fedorenko, Evelina %A Bethge, Matthias %A Bonnen, Tyler %A Schrimpf, Martin %X

In the target article, Bowers et al. dispute deep artificial neural network (ANN) models as the currently leading models of human vision without producing alternatives. They eschew the use of public benchmarking platforms to compare vision models with the brain and behavior, and they advocate for a fragmented, phenomenon-specific modeling approach. Both positions are unconstructive to scientific progress. We outline how the Brain-Score community is moving forward by adding new model-to-human comparisons to its community-transparent suite of benchmarks.

%B Behavioral and Brain Sciences %V 4634 %8 Jan-01-2023 %G eng %U https://www.cambridge.org/core/product/identifier/S0140525X23001607/type/journal_article %! Behav Brain Sci %R 10.1017/S0140525X23001607 %0 Journal Article %J bioRxiv %D 2022 %T Aligning Model and Macaque Inferior Temporal Cortex Representations Improves Model-to-Human Behavioral Alignment and Adversarial Robustness %A Dapello, Joel %A Kar, Kohitij %A Schrimpf, Martin %A Geary, Robert %A Ferguson, Michael %A Cox, David D. %A DiCarlo, James J. %X

While some state-of-the-art artificial neural network systems in computer vision are strikingly accurate models of the corresponding primate visual processing, there are still many discrepancies between these models and the behavior of primates on object recognition tasks. Many current models suffer from extreme sensitivity to adversarial attacks and often do not align well with the image-by-image behavioral error patterns observed in humans. Previous research has provided strong evidence that primate object recognition behavior can be very accurately predicted by neural population activity in the inferior temporal (IT) cortex, a brain area in the late stages of the visual processing hierarchy. Therefore, here we directly test whether making the late-stage representations of models more similar to those of macaque IT produces new models that exhibit more robust, primate-like behavior. We conducted chronic, large-scale multi-electrode recordings across the IT cortex in six non-human primates (rhesus macaques). We then used these data to fine-tune (end-to-end) the model “IT” representations such that they are more aligned with the biological IT representations, while preserving accuracy on object recognition tasks. We generated a cohort of models with a range of IT similarity scores, validated on held-out animals across two image sets with distinct statistics. Across a battery of optimization conditions, we observed a strong correlation between the models’ IT-likeness and their alignment with human behavior, as well as an increase in their adversarial robustness. We further assessed the limitations of this approach and found that the improvements in behavioral alignment and adversarial robustness generalize across different image statistics, but not to object categories outside of those covered in our IT training set.
Taken together, our results demonstrate that building models that are more aligned with the primate brain leads to more robust and human-like behavior, and they call for larger neural datasets to further augment these gains.

%B bioRxiv %8 July 4, 2022 %G eng %U https://www.biorxiv.org/content/10.1101/2022.07.01.498495v1.full.pdf %9 preprint %R https://doi.org/10.1101/2022.07.01.498495 %0 Conference Paper %B SVHRM Workshop at Neural Information Processing Systems (NeurIPS) %D 2022 %T Primate Inferotemporal Cortex Neurons Generalize Better to Novel Image Distributions Than Analogous Deep Neural Networks Units %A Bagus, Ayu Marliawaty I Gusti %A Marques, Tiago %A Sanghavi, Sachi %A DiCarlo, James J %A Schrimpf, Martin %X

Humans successfully recognize objects across a wide variety of image distributions. Today's artificial neural networks (ANNs), on the other hand, struggle to recognize objects in many image domains, especially those different from the training distribution. It is currently unclear which parts of the ANNs could be improved in order to close this generalization gap. In this work, we used recordings from primate high-level visual cortex (IT) to isolate whether ANNs lag behind primate generalization capabilities because of their encoder (transformations up to the penultimate layer) or their decoder (linear transformation into class labels). Specifically, we fit a linear decoder on images from one domain and evaluated transfer performance on twelve held-out domains, comparing fitting on primate IT representations vs. representations in ANN penultimate layers. For a fair comparison, we scaled the number of each ANN's units so that its in-domain performance matched that of the sampled IT population (i.e., 71 IT neural sites, 73% binary-choice accuracy). We find that the sampled primate population achieves, on average, 68% performance on the held-out domains. Comparably sampled populations of ANN model units generalize less well, maintaining on average 60%. This is independent of the number of sampled units: models' out-of-domain accuracies consistently lag behind primate IT. These results suggest that making ANN model units more like primate IT will improve the generalization performance of ANNs.
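The decoder-transfer analysis described above can be sketched in a few lines. This is a hypothetical toy version using synthetic unit responses as stand-ins for IT sites and image domains, not the paper's actual pipeline; only the number of units (71) is taken from the abstract.

```python
# Toy sketch: fit a linear decoder on one "domain" of unit responses,
# then measure binary-choice accuracy on domains with shifted statistics.
import numpy as np

rng = np.random.default_rng(0)
n_units, n_per_class = 71, 200   # 71 mirrors the sampled IT sites above

def make_domain(shift):
    """Two object classes; `shift` models a domain-specific appearance change."""
    X0 = rng.normal(0.0, 1.0, (n_per_class, n_units)) + shift
    X1 = rng.normal(0.5, 1.0, (n_per_class, n_units)) + shift
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

def accuracy(w, X, y):
    pred = np.hstack([X, np.ones((len(X), 1))]) @ w > 0
    return float((pred == y.astype(bool)).mean())

# Least-squares linear decoder fit on the training domain only.
X_tr, y_tr = make_domain(shift=0.0)
w, *_ = np.linalg.lstsq(np.hstack([X_tr, np.ones((len(X_tr), 1))]),
                        2.0 * y_tr - 1.0, rcond=None)

# Transfer: the frozen decoder is evaluated on held-out domains.
in_domain = accuracy(w, *make_domain(0.0))
transfer = [accuracy(w, *make_domain(s)) for s in (0.2, 0.5, 1.0)]
print(f"in-domain: {in_domain:.2f}, mean transfer: {np.mean(transfer):.2f}")
```

The key design point mirrored here is that only the decoder's input representation differs between conditions; in the paper, the same protocol is applied to IT recordings and to ANN penultimate-layer activations.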

%B SVHRM Workshop at Neural Information Processing Systems (NeurIPS) %C Lisbon, Portugal %8 2022 %G eng %U https://openreview.net/pdf?id=iPF7mhoWkOl %0 Conference Paper %B International Conference on Learning Representations 2022 Spotlight %D 2022 %T Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream %A Geiger, Franziska %A Schrimpf, Martin %A Marques, Tiago %A DiCarlo, James J %K biologically plausible learning %K computational neuroscience %K convolutional neural networks %K primate visual ventral stream %X

After training on large datasets, certain deep neural networks are surprisingly good models of the neural mechanisms of adult primate visual object recognition. Nevertheless, these models are considered poor models of the development of the visual system because they posit millions of sequential, precisely coordinated synaptic updates, each based on a labeled image. While ongoing research is pursuing the use of unsupervised proxies for labels, we here explore a complementary strategy of reducing the required number of supervised synaptic updates to produce an adult-like ventral visual stream (as judged by the match to V1, V2, V4, IT, and behavior). Such models might require less precise machinery and energy expenditure to coordinate these updates and would thus move us closer to viable neuroscientific hypotheses about how the visual system wires itself up. Relative to standard model training on labeled images in ImageNet, we here demonstrate that the total number of supervised weight updates can be substantially reduced using three complementary strategies: First, we find that only 2% of supervised updates (epochs and images) are needed to achieve ∼80% of a fully trained model’s match to the adult ventral stream. Specifically, training benefits predictions of higher visual cortex the most, whereas predictions of earlier areas improve only marginally over the course of training. Second, by improving the random distribution of synaptic connectivity, we find that 54% of the brain match can already be achieved “at birth” (i.e., no training at all). Third, we find that, by training only ∼5% of model synapses, we can still achieve nearly 80% of the match to the ventral stream. This approach also improves ImageNet performance relative to previous attempts in computer vision to minimize trained components, without substantially increasing the number of trained parameters.
These results reflect first steps in modeling not just primate adult visual processing during inference, but also how the ventral visual stream might be “wired up” by evolution (a model’s “birth” state) and by developmental learning (a model’s updates based on visual experience).

%B International Conference on Learning Representations 2022 Spotlight %8 April 25, 2022 %G eng %U https://openreview.net/pdf?id=g1SzIRLQXMM %9 preprint %R 10.1101/2020.06.08.140111 %0 Journal Article %J Journal of Vision %D 2021 %T Chemogenetic suppression of macaque V4 neurons produces retinotopically specific deficits in downstream IT neural activity patterns and core object recognition behavior %A Kar, Kohitij %A Schrimpf, Martin %A Schmidt, Kailyn %A DiCarlo, JJ %X

Distributed activity patterns across multiple brain areas (e.g., V4, IT) enable primates to accurately identify visual objects. To strengthen our inferences about the causal role of underlying brain circuits, it is necessary to develop targeted neural perturbation strategies that enable discrimination amongst competing models. To probe the role of area V4 in core object recognition, we expressed inhibitory DREADDs in neurons within a 5x5 mm subregion of V4 cortex via multiple viral injections (AAV8-hSyn-hM4Di-mCherry; two macaques). To assay for successful neural suppression, we recorded from a multi-electrode array implanted over the transfected V4. We also recorded from multi-electrode arrays in the IT cortex (the primary feedforward target of V4), while simultaneously measuring the monkeys’ behavior during object discrimination tasks. We found that systemic (intramuscular) injection of the DREADDs activator (CNO) produced reversible reductions (~20%) in image-evoked V4 responses compared to the control condition (saline injections). Monkeys showed significant behavioral performance deficits upon CNO injections (compared to saline), which were larger when the object position overlapped with the RF estimates of the transfected V4 neurons. This is consistent with the hypothesis that the suppressed V4 neurons are critical to this behavior. Furthermore, we observed commensurate deficits in the linearly-decoded estimates of object identity from the IT population activity (post-CNO). To model the perturbed brain circuitry, we used a primate brain-mapped artificial neural network (ANN) model (CORnet-S) that supports object recognition. We “lesioned” the model’s corresponding V4 subregion by modifying its weights such that the responses matched a subset of our experimental V4 measurements (post-CNO). Indeed, the lesioned model better predicted the measured (held-out) V4 and IT responses (post-CNO), compared to the model's non-lesioned version, validating our approach. 
Going forward, this approach will allow us to discriminate amongst competing mechanistic brain models, while the data provide constraints to guide more accurate alternatives.

%B Journal of Vision %V 21 %P 2489-2489 %G eng %N 9 %R https://doi.org/10.1167/jov.21.9.2489 %0 Conference Proceedings %B Champalimaud Research Symposium (CRS21) %D 2021 %T Topographic ANNs Predict the Behavioral Effects of Causal Perturbations in Primate Visual Ventral Stream IT %A Schrimpf, Martin %A Mc Grath, Paul %A DiCarlo, J J %B Champalimaud Research Symposium (CRS21) %C Lisbon, Portugal %G eng %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2021 %T Unsupervised neural network models of the ventral visual stream %A Zhuang, Chengxu %A Yan, Siming %A Nayebi, Aran %A Schrimpf, Martin %A Frank, Michael C. %A DiCarlo, James J. %A Yamins, Daniel L. K. %X

Deep neural networks currently provide the best quantitative models of the response patterns of neurons throughout the primate ventral visual stream. However, such networks have remained implausible as a model of the development of the ventral stream, in part because they are trained with supervised methods requiring many more labels than are accessible to infants during development. Here, we report that recent rapid progress in unsupervised learning has largely closed this gap. We find that neural network models learned with deep unsupervised contrastive embedding methods achieve neural prediction accuracy in multiple ventral visual cortical areas that equals or exceeds that of models derived using today’s best supervised methods and that the mapping of these neural network models’ hidden layers is neuroanatomically consistent across the ventral stream. Strikingly, we find that these methods produce brain-like representations even when trained solely with real human child developmental data collected from head-mounted cameras, despite the fact that these datasets are noisy and limited. We also find that semisupervised deep contrastive embeddings can leverage small numbers of labeled examples to produce representations with substantially improved error-pattern consistency to human behavior. Taken together, these results illustrate a use of unsupervised learning to provide a quantitative model of a multiarea cortical brain system and present a strong candidate for a biologically plausible computational theory of primate sensory learning.

%B Proceedings of the National Academy of Sciences %V 118 %P e2014196118 %8 Jul-01-2022 %G eng %U http://www.pnas.org/lookup/doi/10.1073/pnas.2014196118 %N 3 %! Proc Natl Acad Sci USA %R 10.1073/pnas.2014196118 %0 Journal Article %J Neuron %D 2020 %T Integrative Benchmarking to Advance Neurally Mechanistic Models of Human Intelligence %A Schrimpf, Martin %A Kubilius, Jonas %A Lee, Michael J. %A Murty, NAR %A Ajemian, Robert %A DiCarlo, James J. %X

A potentially organizing goal of the brain and cognitive sciences is to accurately explain domains of human intelligence as executable, neurally mechanistic models. Years of research have led to models that capture experimental results in individual behavioral tasks and individual brain regions. We here advocate for taking the next step: integrating experimental results from many laboratories into suites of benchmarks that, when considered together, push mechanistic models toward explaining entire domains of intelligence, such as vision, language, and motor control. Given recent successes of neurally mechanistic models and the surging availability of neural, anatomical, and behavioral data, we believe that now is the time to create integrative benchmarking platforms that incentivize ambitious, unified models. This perspective discusses the advantages and the challenges of this approach and proposes specific steps to achieve this goal in the domain of visual intelligence with the case study of an integrative benchmarking platform called Brain-Score.
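The integrative-benchmarking idea described above can be illustrated schematically: a benchmark couples experimental data with a comparison metric, and a model's overall score aggregates across benchmarks. This is a hypothetical minimal sketch with synthetic data, not the actual Brain-Score implementation or API.

```python
# Toy sketch of a benchmark suite: each benchmark pairs target measurements
# with a metric; a model is scored against all of them.
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between predictions and targets."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class Benchmark:
    """Couples a dataset of target measurements with a comparison metric."""
    def __init__(self, name, targets, metric):
        self.name, self.targets, self.metric = name, targets, metric

    def score(self, predictions):
        return self.metric(predictions, self.targets)

rng = np.random.default_rng(0)
# Stand-in "recordings" for two hypothetical benchmarks.
suite = [Benchmark("toy-V1", rng.normal(size=100), pearson_r),
         Benchmark("toy-IT", rng.normal(size=100), pearson_r)]

# A "model" whose predictions track the data with some noise.
scores = {b.name: b.score(b.targets + 0.5 * rng.normal(size=100)) for b in suite}
overall = float(np.mean(list(scores.values())))
print(scores, overall)
```

The design choice this mirrors is that models and data interact only through the benchmark interface, so new experimental datasets can be added without changing any model code.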

%B Neuron %8 Jan-09-2020 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S089662732030605X %! Neuron %R 10.1016/j.neuron.2020.07.040 %0 Journal Article %J Neural Information Processing Systems (NeurIPS; spotlight) %D 2020 %T Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations %A Dapello, Joel %A Marques, Tiago %A Schrimpf, Martin %A Geiger, Franziska %A Cox, David D %A DiCarlo, James J %X

Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system. However, these CNNs can be fooled by imperceptibly small, explicitly crafted perturbations, and struggle to recognize objects in corrupted images that are easily recognized by humans. Here, by making comparisons with primate neural data, we first observed that CNN models with a neural hidden layer that better matches primate primary visual cortex (V1) are also more robust to adversarial attacks. Inspired by this observation, we developed VOneNets, a new class of hybrid CNN vision models. Each VOneNet contains a fixed-weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models. The VOneBlock is based on a classical neuroscientific model of V1: the linear-nonlinear-Poisson model, consisting of a biologically constrained Gabor filter bank, simple and complex cell nonlinearities, and a V1 neuronal stochasticity generator. After training, VOneNets retain high ImageNet performance, but each is substantially more robust, outperforming the base CNNs and state-of-the-art methods by 18% and 3%, respectively, on a conglomerate benchmark of perturbations comprising white-box adversarial attacks and common image corruptions. Finally, we show that all components of the VOneBlock work in synergy to improve robustness. While current CNN architectures are arguably brain-inspired, the results presented here demonstrate that more precisely mimicking just one stage of the primate visual system leads to new gains in ImageNet-level computer vision applications.
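The core of the fixed-weight V1-like front-end described above can be sketched as a bank of oriented Gabor filters followed by a simple-cell nonlinearity. This is a hypothetical minimal illustration, not the VOneBlock itself: the filter parameters here are arbitrary, whereas the paper's are biologically constrained, and the complex-cell and stochasticity components are omitted.

```python
# Minimal V1-like front-end sketch: Gabor filter bank + rectification.
import numpy as np

def gabor(size=9, wavelength=4.0, sigma=2.0, theta=0.0):
    """Odd-phase Gabor kernel: a sinusoid windowed by a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return (np.exp(-(x**2 + y**2) / (2 * sigma**2))
            * np.sin(2 * np.pi * xr / wavelength))

def conv2d_valid(img, k):
    """Plain 'valid' 2-D correlation (no padding), implemented directly."""
    kh, kw = k.shape
    out = np.empty((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# Filter bank over four orientations; ReLU plays the simple-cell nonlinearity.
bank = [gabor(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
image = np.random.default_rng(1).random((32, 32))
responses = np.stack([np.maximum(conv2d_valid(image, k), 0.0) for k in bank])
print(responses.shape)  # one rectified orientation map per filter
```

Because the front-end's weights are fixed rather than learned, only the downstream CNN back-end is trained, which is what distinguishes this architecture from end-to-end training.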

%B Neural Information Processing Systems (NeurIPS; spotlight) %8 June 17, 2020 %G eng %U https://www.biorxiv.org/content/10.1101/2020.06.16.154542v27 %9 preprint %R 10.1101/2020.06.16.154542 %0 Journal Article %J arXiv %D 2020 %T ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation %A Gan, Chuang %A Schwartz, Jeremy %A Alter, Seth %A Schrimpf, Martin %A Traer, James %A De Freitas, Julian %A Kubilius, Jonas %A Bhandwaldar, Abhishek %A Haber, Nick %A Sano, Megumi %A Wang, Elias %A Mrowca, Damian %A Lingelbach, Michael %A Curtis, Aidan %A Figelis, Kevin %A Bear, Daniel M. %A Gutfreund, Dan %A Cox, David %A DiCarlo, James J. %A McDermott, Josh %A Tenenbaum, Joshua B. %A Yamins, Daniel L.K. %X

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments. TDW has several unique properties: 1) realtime near photo-realistic image rendering quality; 2) a library of objects and environments with materials for high-quality rendering, and routines enabling user customization of the asset library; 3) generative procedures for efficiently building classes of new environments; 4) high-fidelity audio rendering; 5) believable and realistic physical interactions for a wide variety of material types, including cloths, liquid, and deformable objects; 6) a range of "avatar" types that serve as embodiments of AI agents, with the option for user avatar customization; and 7) support for human interactions with VR devices. TDW also provides a rich API enabling multiple agents to interact within a simulation and return a range of sensor and physics data representing the state of the world. We present initial experiments enabled by the platform around emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, multi-agent interactions, models that "learn like a child", and attention studies in humans and neural networks. The simulation platform will be made publicly available.

%B arXiv %8 July 9, 2020 %G eng %U https://arxiv.org/abs/2007.04954 %9 preprint %0 Journal Article %J bioRxiv %D 2020 %T Unsupervised Neural Network Models of the Ventral Visual Stream %A Zhuang, Chengxu %A Yan, Siming %A Nayebi, Aran %A Schrimpf, Martin %A Frank, Michael %A DiCarlo, James J. %A Yamins, Daniel L.K. %X

Deep neural networks currently provide the best quantitative models of the response patterns of neurons throughout the primate ventral visual stream. However, such networks have remained implausible as a model of the development of the ventral stream, in part because they are trained with supervised methods requiring many more labels than are accessible to infants during development. Here, we report that recent rapid progress in unsupervised learning has largely closed this gap. We find that neural network models learned with deep unsupervised contrastive embedding methods achieve neural prediction accuracy in multiple ventral visual cortical areas that equals or exceeds that of models derived using today’s best supervised methods, and that the mapping of these neural network models’ hidden layers is neuroanatomically consistent across the ventral stream. Moreover, we find that these methods produce brain-like representations even when trained on noisy and limited data measured from real children’s developmental experience. We also find that semi-supervised deep contrastive embeddings can leverage small numbers of labelled examples to produce representations with substantially improved error-pattern consistency to human behavior. Taken together, these results suggest that deep contrastive embedding objectives may be a biologically-plausible computational theory of primate visual development.
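The family of contrastive-embedding objectives referenced above can be illustrated with a SimCLR-style normalized-temperature cross-entropy (NT-Xent) loss. This is a toy NumPy sketch on random vectors, not the paper's models, which apply such objectives to deep networks over large batches of augmented images.

```python
# Toy contrastive-embedding objective: each embedding's positive is its
# augmented partner; all other embeddings in the batch act as negatives.
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """SimCLR-style NT-Xent loss over a batch of paired embeddings."""
    z = np.vstack([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize
    sim = z @ z.T / tau                                # scaled cosine similarities
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner index
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
positives = anchors + 0.05 * rng.normal(size=(8, 16))  # nearby "augmented views"
mismatched = np.roll(anchors, 1, axis=0)               # wrong pairings

aligned_loss = nt_xent(anchors, positives)
mismatched_loss = nt_xent(anchors, mismatched)
print(f"aligned: {aligned_loss:.2f}, mismatched: {mismatched_loss:.2f}")
```

Minimizing this loss pulls augmented views of the same image together and pushes different images apart, which is the label-free learning signal the abstract contrasts with supervised training.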

%B bioRxiv %8 June 18, 2020 %G eng %U https://www.biorxiv.org/content/10.1101/2020.06.16.155556v1.abstract %9 preprint %R 10.1101/2020.06.16.155556 %0 Conference Paper %B Computational and Systems Neuroscience (COSYNE) %D 2019 %T Using Brain-Score to Evaluate and Build Neural Networks for Brain-Like Object Recognition %A Schrimpf, Martin %A Kubilius, Jonas %A Hong, Ha %A Majaj, Najib %A Rajalingham, Rishi %A Issa, Elias B %A Kar, Kohitij %A Ziemba, Corey M %A Bashivan, Pouya %A Prescott-Roy, Jonathan %A Schmidt, Kailyn %A Yamins, Daniel LK %A DiCarlo, James J %B Computational and Systems Neuroscience (COSYNE) %C Denver, CO %G eng