ANetQA: A Large-scale Benchmark for Fine-grained
Compositional Reasoning over Untrimmed Videos


Zhou Yu1     Lixiang Zheng1     Zhou Zhao2     Fei Wu2     Jianping Fan1,3     Kui Ren4     Jun Yu1*

1School of Computer Science, Hangzhou Dianzi University, China
2College of Computer Science and Technology, Zhejiang University, China
3AI Lab at Lenovo Research, China
4School of Cyber Science and Technology, Zhejiang University, China
*Corresponding author


Abstract


Building benchmarks to systematically analyze the different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark, AGQA, provides a promising paradigm for generating QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about the fine-grained semantics in videos, as such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed on state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.


Paper


ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun Yu
CVPR, 2023
[PDF]


Dataset


Videos

  • Raw videos from ActivityNet v1.3
  • Meta information of all videos.

Scene Graphs

  • Train scene graphs from 9,155 videos.
  • Val scene graphs from 1,185 videos.
  • Meta information of all scene graphs.

Question-Answer Pairs

  • Train QA pairs (10,456,011 samples)
  • Val QA pairs (1,474,723 samples)
  • Test questions (1,503,510 samples)
  • Test-dev questions (300,694 samples)
  • Test-tiny questions (20,000 samples)

*The test-dev and test-tiny splits are two subsets of the test split.

More details of the dataset are provided here.
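
For reference, here is a minimal Python sketch of how the downloaded QA annotations could be loaded and inspected. The file name and the assumption that the file holds a list of QA records are hypothetical; see the dataset documentation for the exact schema.

    import json

    # Hypothetical path; point this at the downloaded split file.
    QA_FILE = "anetqa_train_qa.json"

    with open(QA_FILE, "r") as f:
        qa_pairs = json.load(f)  # assumed to be a list of QA records

    print(f"Loaded {len(qa_pairs)} QA pairs")

    # Inspect one record to see the available fields
    # (e.g., question id, video id, question text, answer).
    print(qa_pairs[0])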


Code


Code for the ANetQA baseline models is available here.


Evaluation


Evaluation on the test set is performed on the online EvalAI server.

Submission Format

    [...
        {
          "question_id": question_id,
          "answer": answer
        },
    ...]

We have provided an example result JSON file here.
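
For illustration, a minimal Python sketch of how model predictions could be serialized into this format (the prediction values, question ids, and output file name below are placeholders):

    import json

    # Placeholder predictions: a mapping from question id to predicted answer.
    predictions = {
        "q000001": "kitchen",
        "q000002": "yes",
    }

    # Convert to the required list-of-dicts format.
    results = [{"question_id": qid, "answer": ans} for qid, ans in predictions.items()]

    # Write the result file to be uploaded to the EvalAI server.
    with open("anetqa_results.json", "w") as f:
        json.dump(results, f)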


License


The annotations in this dataset belong to the ANetQA Team and are licensed under a CC BY-NC 4.0 License.


BibTeX


    @inproceedings{yu2023anetqa,
       title={ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos},
       author={Yu, Zhou and Zheng, Lixiang and Zhao, Zhou and Wu, Fei and Fan, Jianping and Ren, Kui and Yu, Jun},
       booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
       pages={23191--23200},
       year={2023}
    }