Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

Jiuxiang Gu, Jianfei Cai, Shafiq Rayhan Joty, Li Niu, Gang Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Citations (Scopus)

Abstract

Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features. Extensive experiments show that our framework can well match images and sentences with complex content, and achieve the state-of-the-art cross-modal retrieval results on MSCOCO dataset.

Original language: English
Title of host publication: Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Publisher: IEEE Computer Society
Pages: 7181-7189
Number of pages: 9
ISBN (Electronic): 9781538664209
DOIs: 10.1109/CVPR.2018.00750
Publication status: Published - 14 Dec 2018
Event: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 - Salt Lake City, United States
Duration: 18 Jun 2018 - 22 Jun 2018

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919

Conference

Conference: 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
Country: United States
City: Salt Lake City
Period: 18/6/18 - 22/6/18

Fingerprint

  • Computer vision
  • Processing
  • Experiments

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition

Cite this

Gu, J., Cai, J., Rayhan Joty, S., Niu, L., & Wang, G. (2018). Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. In Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018 (pp. 7181-7189). [8578848] (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR.2018.00750

@inproceedings{21ffe54159d24707a5db2b62700d51f1,
title = "Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models",
abstract = "Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features. Extensive experiments show that our framework can well match images and sentences with complex content, and achieve the state-of-the-art cross-modal retrieval results on MSCOCO dataset.",
author = "Jiuxiang Gu and Jianfei Cai and {Rayhan Joty}, Shafiq and Li Niu and Gang Wang",
year = "2018",
month = "12",
day = "14",
doi = "10.1109/CVPR.2018.00750",
language = "English",
series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",
publisher = "IEEE Computer Society",
pages = "7181--7189",
booktitle = "Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018",
}

TY - GEN
T1 - Look, Imagine and Match
T2 - Improving Textual-Visual Cross-Modal Retrieval with Generative Models
AU - Gu, Jiuxiang
AU - Cai, Jianfei
AU - Rayhan Joty, Shafiq
AU - Niu, Li
AU - Wang, Gang
PY - 2018/12/14
Y1 - 2018/12/14
N2 - Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features. Extensive experiments show that our framework can well match images and sentences with complex content, and achieve the state-of-the-art cross-modal retrieval results on MSCOCO dataset.
AB - Textual-visual cross-modal retrieval has been a hot research topic in both computer vision and natural language processing communities. Learning appropriate representations for multi-modal data is crucial for the cross-modal retrieval performance. Unlike existing image-text retrieval approaches that embed image-text pairs as single feature vectors in a common representational space, we propose to incorporate generative processes into the cross-modal feature embedding, through which we are able to learn not only the global abstract features but also the local grounded features. Extensive experiments show that our framework can well match images and sentences with complex content, and achieve the state-of-the-art cross-modal retrieval results on MSCOCO dataset.
UR - http://www.scopus.com/inward/record.url?scp=85055116468&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055116468&partnerID=8YFLogxK
U2 - 10.1109/CVPR.2018.00750
DO - 10.1109/CVPR.2018.00750
M3 - Conference contribution
AN - SCOPUS:85055116468
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 7181
EP - 7189
BT - Proceedings - 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2018
PB - IEEE Computer Society
ER -