VQA-E

Explaining, elaborating, and enhancing your answers for visual questions

Qing Li, Qingyi Tao, Shafiq Rayhan Joty, Jianfei Cai, Jiebo Luo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.
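The abstract sketches the technical idea: a single multi-task model whose shared question-image embedding feeds both an answer classifier and an explanation decoder, so the explanation supervision also shapes the representation used for answer prediction. Below is a minimal PyTorch sketch of that kind of architecture, assuming a GRU question encoder, element-wise fusion, and a GRU explanation decoder; all module choices, dimensions, and the fusion scheme are illustrative assumptions rather than the authors' implementation.

```python
# Minimal multi-task VQA-E style sketch: one shared question-image embedding
# feeds both an answer classifier and an explanation decoder, so the
# explanation loss also supervises the shared representation used for
# answer prediction. All dimensions, module choices, and the fusion scheme
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQAESketch(nn.Module):
    def __init__(self, vocab_size, num_answers,
                 img_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.q_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Task 1: answer prediction over a fixed answer vocabulary.
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        # Task 2: explanation generation, conditioned on the joint embedding.
        self.exp_decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.exp_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, question_ids, explanation_in_ids):
        # Encode the question; take the final GRU state as its summary.
        _, q_state = self.q_encoder(self.word_embed(question_ids))
        # Fuse image and question (element-wise product is one common choice).
        joint = torch.relu(self.img_proj(img_feat)) * q_state.squeeze(0)
        answer_logits = self.answer_head(joint)
        # Decode the explanation with the joint embedding as the initial state.
        exp_states, _ = self.exp_decoder(self.word_embed(explanation_in_ids),
                                         joint.unsqueeze(0).contiguous())
        exp_logits = self.exp_head(exp_states)
        return answer_logits, exp_logits


def multitask_loss(answer_logits, answer_target, exp_logits, exp_target, alpha=1.0):
    # Joint objective: answer classification loss + weighted explanation loss.
    ans_loss = F.cross_entropy(answer_logits, answer_target)
    exp_loss = F.cross_entropy(exp_logits.reshape(-1, exp_logits.size(-1)),
                               exp_target.reshape(-1))
    return ans_loss + alpha * exp_loss
```

Training against this joint loss lets gradients from the explanation decoder flow back into the shared embedding, which is consistent with the abstract's claim that the additional explanation supervision also improves answer prediction.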

Original language: English
Title of host publication: Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
Editors: Vittorio Ferrari, Cristian Sminchisescu, Martial Hebert, Yair Weiss
Publisher: Springer Verlag
Pages: 570-586
Number of pages: 17
ISBN (Print): 9783030012335
DOIs: 10.1007/978-3-030-01234-2_34
Publication status: Published - 1 Jan 2018
Event: 15th European Conference on Computer Vision, ECCV 2018 - Munich, Germany
Duration: 8 Sep 2018 – 14 Sep 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11211 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 15th European Conference on Computer Vision, ECCV 2018
Country: Germany
City: Munich
Period: 8/9/18 – 14/9/18

Fingerprint

  • Question Answering
  • Multi-task Learning
  • User Studies
  • Vision
  • Justify
  • Margin
  • Prediction
  • Model

Keywords

  • Model with Explanation
  • Visual question answering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Li, Q., Tao, Q., Rayhan Joty, S., Cai, J., & Luo, J. (2018). VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. In V. Ferrari, C. Sminchisescu, M. Hebert, & Y. Weiss (Eds.), Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings (pp. 570-586). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11211 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-030-01234-2_34

VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. / Li, Qing; Tao, Qingyi; Rayhan Joty, Shafiq; Cai, Jianfei; Luo, Jiebo.

Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings. ed. / Vittorio Ferrari; Cristian Sminchisescu; Martial Hebert; Yair Weiss. Springer Verlag, 2018. p. 570-586 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11211 LNCS).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Li, Q, Tao, Q, Rayhan Joty, S, Cai, J & Luo, J 2018, VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. in V Ferrari, C Sminchisescu, M Hebert & Y Weiss (eds), Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11211 LNCS, Springer Verlag, pp. 570-586, 15th European Conference on Computer Vision, ECCV 2018, Munich, Germany, 8/9/18. https://doi.org/10.1007/978-3-030-01234-2_34
Li Q, Tao Q, Rayhan Joty S, Cai J, Luo J. VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. In Ferrari V, Sminchisescu C, Hebert M, Weiss Y, editors, Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings. Springer Verlag. 2018. p. 570-586. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-030-01234-2_34
Li, Qing ; Tao, Qingyi ; Rayhan Joty, Shafiq ; Cai, Jianfei ; Luo, Jiebo. / VQA-E: Explaining, elaborating, and enhancing your answers for visual questions. Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings. editor / Vittorio Ferrari ; Cristian Sminchisescu ; Martial Hebert ; Yair Weiss. Springer Verlag, 2018. pp. 570-586 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{b2cde0f56b034bcdaaba519ec77533df,
title = "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions",
abstract = "Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.",
keywords = "Model with Explanation, Visual question answering",
author = "Qing Li and Qingyi Tao and {Rayhan Joty}, Shafiq and Jianfei Cai and Jiebo Luo",
year = "2018",
month = "1",
day = "1",
doi = "10.1007/978-3-030-01234-2_34",
language = "English",
isbn = "9783030012335",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "570--586",
editor = "Vittorio Ferrari and Cristian Sminchisescu and Martial Hebert and Yair Weiss",
booktitle = "Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings",

}

TY - GEN

T1 - VQA-E

T2 - Explaining, elaborating, and enhancing your answers for visual questions

AU - Li, Qing

AU - Tao, Qingyi

AU - Rayhan Joty, Shafiq

AU - Cai, Jianfei

AU - Luo, Jiebo

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.

AB - Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers, while disregarding the explanations. We argue that the explanation for an answer is of the same or even more importance compared with the answer itself, since it makes the question answering process more understandable and traceable. To this end, we propose a new task of VQA-E (VQA with Explanation), where the models are required to generate an explanation with the predicted answer. We first construct a new dataset, and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations can not only produce insightful textual sentences to justify the answers, but also improve the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.

KW - Model with Explanation

KW - Visual question answering

UR - http://www.scopus.com/inward/record.url?scp=85055100716&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85055100716&partnerID=8YFLogxK

U2 - 10.1007/978-3-030-01234-2_34

DO - 10.1007/978-3-030-01234-2_34

M3 - Conference contribution

SN - 9783030012335

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 570

EP - 586

BT - Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings

A2 - Ferrari, Vittorio

A2 - Sminchisescu, Cristian

A2 - Hebert, Martial

A2 - Weiss, Yair

PB - Springer Verlag

ER -