

Poster

Learnability Matters: Active Learning for Video Captioning

Yiqian Zhang · Buyu Liu · Jun Bao · Qiang Huang · Min Zhang · Jun Yu

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: This work focuses on active learning for video captioning. In particular, we propose to address the learnability problem in active learning, which is brought about by collective outliers in video captioning and has been neglected in the literature. To start with, we conduct a comprehensive study of collective outliers, exploring their hard-to-learn property and concluding that ground truth inconsistency is one of the main causes. Motivated by this, we design a novel active learning algorithm that takes three complementary aspects, namely learnability, diversity, and uncertainty, into account. Ideally, learnability is reflected by ground truth consistency. Under the active learning scenario, where ground truths are not available until humans are involved, we measure consistency on estimated ground truths, using predictions from off-the-shelf models as approximations to the ground truths. These predictions are further used to estimate sample frequency and reliability, evincing diversity and uncertainty respectively. With the help of our novel caption-wise active learning protocol, our algorithm is capable of leveraging knowledge from humans in a more effective and intelligent manner. Results on publicly available video captioning datasets with diverse video captioning models demonstrate that our algorithm outperforms state-of-the-art methods by a large margin, e.g., we achieve about $103\%$ of full performance on CIDEr with $25\%$ of human annotations on MSR-VTT across all models.
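To make the selection criteria concrete, the following is a minimal sketch of how the three aspects could be combined to pick samples for annotation. It is not the authors' released code: the helper names, the token-overlap consensus measure (standing in for a captioning metric such as CIDEr), the frequency/reliability inputs, and the weighted-sum combination are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): score unlabeled videos by
# combining learnability, diversity, and uncertainty, then pick the top-k to annotate.
from collections import Counter
from itertools import combinations

def similarity(cap_a, cap_b):
    """Token-overlap proxy for caption agreement (a real system might use CIDEr/BLEU)."""
    a, b = Counter(cap_a.lower().split()), Counter(cap_b.lower().split())
    overlap = sum((a & b).values())
    return 2 * overlap / max(sum(a.values()) + sum(b.values()), 1)

def learnability(pseudo_captions):
    """Consistency among pseudo ground truths produced by off-the-shelf captioners."""
    pairs = list(combinations(pseudo_captions, 2))
    if not pairs:
        return 1.0
    return sum(similarity(x, y) for x, y in pairs) / len(pairs)

def select_for_annotation(candidates, k, w_learn=1.0, w_div=1.0, w_unc=1.0):
    """candidates: dicts with 'pseudo_captions', 'frequency', 'reliability' (hypothetical fields)."""
    def score(c):
        learn = learnability(c["pseudo_captions"])   # higher = more consistent, more learnable
        diversity = 1.0 - c["frequency"]             # rarer content is preferred
        uncertainty = 1.0 - c["reliability"]         # less reliable predictions are preferred
        return w_learn * learn + w_div * diversity + w_unc * uncertainty
    return sorted(candidates, key=score, reverse=True)[:k]

if __name__ == "__main__":
    pool = [
        {"id": "vid1", "pseudo_captions": ["a man plays guitar", "a man is playing a guitar"],
         "frequency": 0.8, "reliability": 0.9},
        {"id": "vid2", "pseudo_captions": ["a cat sleeps", "people dance at a party"],
         "frequency": 0.2, "reliability": 0.4},
    ]
    for c in select_for_annotation(pool, k=1):
        print(c["id"])
```

The sketch mirrors the abstract's intuition only at a high level: samples whose estimated ground truths agree (learnable), that cover infrequent content (diverse), and that the models predict unreliably (uncertain) are prioritized for human annotation.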
