ISSN: 0898-929X
E-ISSN: 1530-8898
2014 Impact factor: 4.69

Journal of Cognitive Neuroscience

October 2021, Vol. 33, No. 10, Pages 2044-2064
(doi: 10.1162/jocn_a_01755)
© 2021 Massachusetts Institute of Technology
Diverse Deep Neural Networks All Predict Human Inferior Temporal Cortex Well, After Training and Fitting
Abstract

Deep neural networks (DNNs) trained on object recognition provide the best current models of high-level visual cortex. What remains unclear is how strongly experimental choices, such as network architecture, training, and fitting to brain data, contribute to the observed similarities. Here, we compare a diverse set of nine DNN architectures on their ability to explain the representational geometry of 62 object images in human inferior temporal cortex (hIT), as measured with fMRI. We compare untrained networks to their task-trained counterparts and assess the effect of cross-validated fitting to hIT, by taking a weighted combination of the principal components of features within each layer and, subsequently, a weighted combination of layers. For each combination of training and fitting, we test all models for their correlation with the hIT representational dissimilarity matrix, using independent images and subjects. Trained models outperform untrained models (accounting for 57% more of the explainable variance), suggesting that structured visual features are important for explaining hIT. Model fitting further improves the alignment of DNN and hIT representations (by 124%), suggesting that the relative prevalence of different features in hIT does not readily emerge from the ImageNet object-recognition task used to train the networks. The same models can also explain the disparate representations in primary visual cortex (V1), where stronger weights are given to earlier layers. In each region, all architectures achieved equivalently high performance once trained and fitted. The models' shared properties (deep feedforward hierarchies of spatially restricted nonlinear filters) seem more important than their differences when modeling human visual representations.
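
The comparison described in the abstract, correlating a model's representational dissimilarity matrix (RDM) with the hIT RDM, can be illustrated with a minimal sketch. This is not the authors' pipeline: the dissimilarity measure (1 minus Pearson correlation), the function names, and the random stand-in data are all assumptions made for illustration, and the cross-validated fitting step is omitted.

```python
import numpy as np

def rdm(patterns):
    """Return an (n_images, n_images) dissimilarity matrix.

    patterns: array of shape (n_images, n_features).
    Assumption: dissimilarity is 1 - Pearson correlation between the
    response patterns of two images; the abstract does not name a measure.
    """
    return 1.0 - np.corrcoef(patterns)

def upper_triangle(matrix):
    """Return the off-diagonal upper-triangle entries as a flat vector."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]

def rdm_correlation(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    return float(np.corrcoef(upper_triangle(rdm_a), upper_triangle(rdm_b))[0, 1])

# Hypothetical stand-in data: 62 images, as in the abstract.
rng = np.random.default_rng(0)
n_images = 62
layer_features = rng.normal(size=(n_images, 100))      # stand-in for one DNN layer
mixing = rng.normal(size=(100, 50))                     # arbitrary linear readout
brain_patterns = layer_features @ mixing + rng.normal(size=(n_images, 50))  # stand-in for hIT voxels

model_rdm = rdm(layer_features)
brain_rdm = rdm(brain_patterns)
score = rdm_correlation(model_rdm, brain_rdm)
print(f"model-to-brain RDM correlation: {score:.3f}")
```

Only the upper triangle is compared because an RDM is symmetric with a zero diagonal, so the remaining entries carry no additional information.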