On learning associations of faces and voices

Changil Kim, Hijung Valentina Shin, Tae Hyun Oh, Alexandre Kaspar, Mohamed Elgharib, Wojciech Matusik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release our dataset of audiovisual recordings and demographic annotations of people reading out the short texts used in our studies.

Original language: English
Title of host publication: Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers
Editors: Konrad Schindler, Greg Mori, C.V. Jawahar, Hongdong Li
Publisher: Springer Verlag
Pages: 276-292
Number of pages: 17
ISBN (Print): 9783030208721
DOI: 10.1007/978-3-030-20873-8_18
Publication status: Published - 1 Jan 2019
Event: 14th Asian Conference on Computer Vision, ACCV 2018 - Perth, Australia
Duration: 2 Dec 2018 – 6 Dec 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11365 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 14th Asian Conference on Computer Vision, ACCV 2018
Country: Australia
City: Perth
Period: 2/12/18 – 6/12/18

Keywords

  • Face-voice association
  • Multi-modal representation learning

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Kim, C., Shin, H. V., Oh, T. H., Kaspar, A., Elgharib, M., & Matusik, W. (2019). On learning associations of faces and voices. In K. Schindler, G. Mori, C. V. Jawahar, & H. Li (Eds.), Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers (pp. 276-292). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11365 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-030-20873-8_18

@inproceedings{4e7eb4477d5547e3828f83c3057409dc,
title = "On learning associations of faces and voices",
abstract = "In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release our dataset of audiovisual recordings and demographic annotations of people reading out the short texts used in our studies.",
keywords = "Face-voice association, Multi-modal representation learning",
author = "Changil Kim and Shin, {Hijung Valentina} and Oh, {Tae Hyun} and Alexandre Kaspar and Mohamed Elgharib and Wojciech Matusik",
year = "2019",
month = "1",
day = "1",
doi = "10.1007/978-3-030-20873-8_18",
language = "English",
isbn = "9783030208721",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "276--292",
editor = "Konrad Schindler and Greg Mori and C.V. Jawahar and Hongdong Li",
booktitle = "Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers",

}

TY - GEN

T1 - On learning associations of faces and voices

AU - Kim, Changil

AU - Shin, Hijung Valentina

AU - Oh, Tae Hyun

AU - Kaspar, Alexandre

AU - Elgharib, Mohamed

AU - Matusik, Wojciech

PY - 2019/1/1

Y1 - 2019/1/1

N2 - In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release our dataset of audiovisual recordings and demographic annotations of people reading out the short texts used in our studies.

AB - In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices, and vice versa, with greater-than-chance accuracy. We computationally model the overlapping information between faces and voices and show that the learned cross-modal representation contains enough information to identify matching faces and voices with performance similar to that of humans. Our representation exhibits correlations with certain demographic attributes and with features obtained from either the visual or the aural modality alone. We release our dataset of audiovisual recordings and demographic annotations of people reading out the short texts used in our studies.

KW - Face-voice association

KW - Multi-modal representation learning

UR - http://www.scopus.com/inward/record.url?scp=85066796718&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066796718&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-20873-8_18

DO - 10.1007/978-3-030-20873-8_18

M3 - Conference contribution

AN - SCOPUS:85066796718

SN - 9783030208721

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 276

EP - 292

BT - Computer Vision – ACCV 2018 - 14th Asian Conference on Computer Vision, Revised Selected Papers

A2 - Schindler, Konrad

A2 - Mori, Greg

A2 - Jawahar, C.V.

A2 - Li, Hongdong

PB - Springer Verlag

ER -