On learning associations of faces and voices

Changil Kim, Hijung Valentina Shin, Tae Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, Wojciech Matusik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations to certain demographic attributes and features obtained from either the visual or the aural modality alone. We release the dataset used in our studies, which consists of audiovisual recordings of people reading out short text, together with demographic annotations.

Original language: English
Title of host publication: Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers
Editors: Konrad Schindler, Greg Mori, C.V. Jawahar, Hongdong Li
Publisher: Springer Verlag
Pages: 276-292
Number of pages: 17
ISBN (Print): 9783030208721
DOIs: https://doi.org/10.1007/978-3-030-20873-8_18
Publication status: Published - 1 Jan 2019
Event: 14th Asian Conference on Computer Vision, ACCV 2018 - Perth, Australia
Duration: 2 Dec 2018 to 6 Dec 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11365 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 14th Asian Conference on Computer Vision, ACCV 2018
Country: Australia
City: Perth
Period: 2/12/18 to 6/12/18

Keywords

  • Face-voice association
  • Multi-modal representation learning

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Kim, C., Shin, H. V., Oh, T. H., Kaspar, A., Elgharib, M., & Matusik, W. (2019). On learning associations of faces and voices. In K. Schindler, G. Mori, C. V. Jawahar, & H. Li (Eds.), Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers (pp. 276-292). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11365 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-030-20873-8_18