ANetQA: A Large-scale Benchmark for Fine-grained
Compositional Reasoning over Untrimmed Videos
Zhou Yu1 Lixiang Zheng1 Zhou Zhao2 Fei Wu2 Jianping Fan1,3 Kui Ren4 Jun Yu1*
1School of Computer Science, Hangzhou Dianzi University, China
2College of Computer Science and Technology, Zhejiang University, China
3AI Lab at Lenovo Research, China
4School of Cyber Science and Technology, Zhejiang University, China
*Corresponding author
Abstract
Building benchmarks to systematically analyze the different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark, AGQA, proposes a promising paradigm that generates QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions have limitations in reasoning about the fine-grained semantics in videos, as such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA contains 1.4 billion unbalanced and 13.4 million balanced QA pairs, an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed for state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.
Paper
ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun Yu
CVPR, 2023
[PDF]
Dataset
Videos
Scene Graphs
Question-Answer Pairs
More details of the dataset are provided here.
Evaluation
Evaluation on the test set is performed via the online EvalAI server.
Submit Format
[
    {
        "question_id": question_id,
        "answer": answer
    },
    ...
]
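For reference, below is a minimal Python sketch of producing a submission file in this format. It assumes model predictions are stored in a dict mapping question IDs to answer strings; the variable names, example IDs, and the output filename results.json are illustrative, not prescribed by the benchmark.

import json

# Hypothetical predictions: a dict mapping each question ID to the
# model's predicted answer string (the IDs and answers here are made up).
predictions = {
    "q_000001": "kitchen",
    "q_000002": "yes",
}

# Convert to the required list-of-dicts format and write it out as JSON.
submission = [
    {"question_id": qid, "answer": ans}
    for qid, ans in predictions.items()
]

with open("results.json", "w") as f:
    json.dump(submission, f)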
License
The annotations in this dataset belong to the ANetQA Team and are licensed under a CC BY-NC 4.0 License.
Bibtex
@inproceedings{yu2023anetqa,
  title={ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos},
  author={Yu, Zhou and Zheng, Lixiang and Zhao, Zhou and Wu, Fei and Fan, Jianping and Ren, Kui and Yu, Jun},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={23191--23200},
  year={2023}
}