Learning only a handful of latent variables produces neural-aligned CNN models of the ventral stream

Title: Learning only a handful of latent variables produces neural-aligned CNN models of the ventral stream
Publication Type: Conference Paper
Year of Publication: 2024
Authors: Xie, Y, Alter, E, Schwartz, J, DiCarlo, JJ
Conference Name: Computational and Systems Neuroscience (COSYNE)
Conference Location: Lisbon, Portugal
Abstract

Image-computable modeling of primate ventral stream visual processing has made great strides via brain-mapped versions of convolutional neural networks (CNNs) that are optimized on thousands of object categories (ImageNet), and whose categorization performance strongly predicts their neural alignment. However, human and primate visual intelligence extends far beyond object categorization, encompassing a diverse range of tasks, such as estimating latent variables like object position or pose in the image. The influence of task choice on neural alignment in CNNs, compared to that of CNN architecture, remains underexplored, partly due to the scarcity of large-scale datasets with rich known labels beyond categories. 3D graphics engines, capable of creating training images with detailed information on various latent variables, offer a solution. Here, we asked how the choice of visual tasks used to train CNNs (i.e., the set of latent variables to be estimated) affects their ventral stream neural alignment. We focused on the estimation of variables such as object position and pose, and we tested the CNNs’ neural alignment via the Brain-Score open science platform. We found that some of these CNNs had neural alignment scores very close to those of CNNs trained on ImageNet, even though they were trained entirely on synthetic images. Additionally, we found that models trained on just a handful of latent variables achieved the same level of neural alignment as models trained on a much larger number of categories, suggesting that latent variable training is more efficient than category training in driving model-neural alignment. Moreover, we found that these models’ neural alignment scores scale with the amount of synthetic data used during training, suggesting that larger synthetic datasets could yield even more aligned models. This study highlights the effectiveness of using synthetic datasets and latent variables in advancing image-computable models of the ventral visual stream.

URL: https://hdl.handle.net/1721.1/153744
