Watch it twice: Video captioning with a refocused video encoder

Xiangxi Shi, Shafiq Joty, Jianfei Cai, Jiuxiang Gu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With the rapid growth of video data and the increasing demand for cross-modal applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received considerable attention in the computer vision and natural language processing communities. State-of-the-art video captioning methods focus on encoding temporal information, but they lack effective ways to remove irrelevant temporal information and also neglect spatial details. In particular, current unidirectional video encoders can be negatively affected by irrelevant temporal information, especially at the beginning and end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices during decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, using a predicted key frame to avoid the irrelevant temporal information that often occurs at the beginning and end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show the superior performance of the proposed method.
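The two-pass ("watch it twice") encoding described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the simple tanh-RNN cell, the key-frame scoring vector, and all dimensions are hypothetical stand-ins for the paper's recurrent encoder and key-frame predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
FRAME_DIM, HID_DIM, N_FRAMES = 16, 32, 10

# A simple tanh-RNN cell standing in for the paper's recurrent encoder.
W_in = rng.normal(scale=0.1, size=(HID_DIM, FRAME_DIM))
W_h = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))
w_key = rng.normal(scale=0.1, size=HID_DIM)  # hypothetical key-frame scorer

def encode(frames, h0):
    """Run the recurrent encoder over the frames; return all hidden states."""
    h, states = h0, []
    for x in frames:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

frames = rng.normal(size=(N_FRAMES, FRAME_DIM))  # pre-extracted frame features

# Pass 1 ("watch it once"): encode the full clip and score every frame
# as a candidate key frame.
first_pass = encode(frames, np.zeros(HID_DIM))
key_idx = int(np.argmax(first_pass @ w_key))

# Pass 2 ("watch it twice"): re-encode starting from the predicted key
# frame, so the final state is dominated by the relevant content rather
# than by irrelevant material at the clip's beginning and end.
refocused = np.concatenate([frames[key_idx:], frames[:key_idx]])
second_pass = encode(refocused, first_pass[key_idx])

video_code = second_pass[-1]  # video representation fed to the caption decoder
print(video_code.shape)  # (32,)
```

The design point the sketch illustrates is that the second pass is initialized from, and reordered around, the predicted key frame, which is how the encoder "refocuses" away from irrelevant boundary frames.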

Original language: English
Title of host publication: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
Publisher: Association for Computing Machinery, Inc
Pages: 818-826
Number of pages: 9
ISBN (Electronic): 9781450368896
DOI: 10.1145/3343031.3351060
Publication status: Published - 15 Oct 2019
Event: 27th ACM International Conference on Multimedia, MM 2019 - Nice, France
Duration: 21 Oct 2019 - 25 Oct 2019

Publication series

Name: MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia

Conference

Conference: 27th ACM International Conference on Multimedia, MM 2019
Country: France
City: Nice
Period: 21/10/19 - 25/10/19


Keywords

  • Key frame
  • Recurrent video encoding
  • Reinforcement learning
  • Video captioning

ASJC Scopus subject areas

  • Media Technology
  • Computer Science (all)

Cite this

Shi, X., Joty, S., Cai, J., & Gu, J. (2019). Watch it twice: Video captioning with a refocused video encoder. In MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia (pp. 818-826). (MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia). Association for Computing Machinery, Inc. https://doi.org/10.1145/3343031.3351060

Watch it twice: Video captioning with a refocused video encoder. / Shi, Xiangxi; Joty, Shafiq; Cai, Jianfei; Gu, Jiuxiang.

MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2019. p. 818-826 (MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia).


Shi, X, Joty, S, Cai, J & Gu, J 2019, Watch it twice: Video captioning with a refocused video encoder. in MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia, Association for Computing Machinery, Inc, pp. 818-826, 27th ACM International Conference on Multimedia, MM 2019, Nice, France, 21/10/19. https://doi.org/10.1145/3343031.3351060
Shi X, Joty S, Cai J, Gu J. Watch it twice: Video captioning with a refocused video encoder. In MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. Association for Computing Machinery, Inc. 2019. p. 818-826. (MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia). https://doi.org/10.1145/3343031.3351060
Shi, Xiangxi ; Joty, Shafiq ; Cai, Jianfei ; Gu, Jiuxiang. / Watch it twice : Video captioning with a refocused video encoder. MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia. Association for Computing Machinery, Inc, 2019. pp. 818-826 (MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia).
@inproceedings{f040b11cf9714e1fa4068f137d4fd71b,
title = "Watch it twice: Video captioning with a refocused video encoder",
abstract = "With the rapid growth of video data and the increasing demands of various crossmodal applications such as intelligent video search and assistance towards visually-impaired people, video captioning task has received a lot of attention recently in computer vision and natural language processing fields. The state-of-the-art video captioning methods focus more on encoding the temporal information, while lacking effective ways to remove irrelevant temporal information and also neglecting the spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially the irrelevant information at the beginning and at the end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices in decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice with a predicted key frame to avoid irrelevant temporal information often occurring at the beginning and at the end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show superior performance of the proposed method.",
keywords = "Key frame, Recurrent video encoding, Reinforcement learning, Video captioning",
author = "Xiangxi Shi and Shafiq Joty and Jianfei Cai and Jiuxiang Gu",
year = "2019",
month = "10",
day = "15",
doi = "10.1145/3343031.3351060",
language = "English",
series = "MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia",
publisher = "Association for Computing Machinery, Inc",
pages = "818--826",
booktitle = "MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia",

}

TY - GEN

T1 - Watch it twice

T2 - Video captioning with a refocused video encoder

AU - Shi, Xiangxi

AU - Joty, Shafiq

AU - Cai, Jianfei

AU - Gu, Jiuxiang

PY - 2019/10/15

Y1 - 2019/10/15

N2 - With the rapid growth of video data and the increasing demands of various crossmodal applications such as intelligent video search and assistance towards visually-impaired people, video captioning task has received a lot of attention recently in computer vision and natural language processing fields. The state-of-the-art video captioning methods focus more on encoding the temporal information, while lacking effective ways to remove irrelevant temporal information and also neglecting the spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially the irrelevant information at the beginning and at the end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices in decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice with a predicted key frame to avoid irrelevant temporal information often occurring at the beginning and at the end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show superior performance of the proposed method.

AB - With the rapid growth of video data and the increasing demands of various crossmodal applications such as intelligent video search and assistance towards visually-impaired people, video captioning task has received a lot of attention recently in computer vision and natural language processing fields. The state-of-the-art video captioning methods focus more on encoding the temporal information, while lacking effective ways to remove irrelevant temporal information and also neglecting the spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially the irrelevant information at the beginning and at the end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices in decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice with a predicted key frame to avoid irrelevant temporal information often occurring at the beginning and at the end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show superior performance of the proposed method.

KW - Key frame

KW - Recurrent video encoding

KW - Reinforcement learning

KW - Video captioning

UR - http://www.scopus.com/inward/record.url?scp=85074832749&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074832749&partnerID=8YFLogxK

U2 - 10.1145/3343031.3351060

DO - 10.1145/3343031.3351060

M3 - Conference contribution

AN - SCOPUS:85074832749

T3 - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia

SP - 818

EP - 826

BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia

PB - Association for Computing Machinery, Inc

ER -