Some interesting recent work on video understanding, and the practitioners behind it, from computer science.
CMD Challenge: from the VGG group at Oxford. It uses a challenge version of the Condensed Movies Dataset, built from the same Movieclip collection I describe in Creanalytics. See their repo for challenge details, and the description below:
Focus: long-range understanding of high-level narrative structure in movies.
Overview: participants are invited to build a system that retrieves 2-3 minute video clips from movies using corresponding high-level natural language descriptions, together with a wide range of pre-computed visual features from several pre-trained expert models. Each clip is a key scene from a movie, representing an important part of the storyline, and is accompanied by a high-level semantic description covering character motivations, actions, scenes, objects, interactions, and relationships. Participants will use a new challenge version of the Condensed Movies Dataset (CMD) for both training and testing of their retrieval systems.
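To make the retrieval task concrete, here is a minimal sketch of the standard approach such systems take: embed the natural language description and each candidate clip into a shared vector space, then rank clips by cosine similarity to the query. The function names, embedding dimension, and random stand-in vectors below are all illustrative assumptions on my part, not the challenge's actual baseline or API.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_clips(text_embedding: np.ndarray,
                   clip_embeddings: np.ndarray,
                   clip_ids: list[str],
                   top_k: int = 5) -> list[tuple[str, float]]:
    """Rank candidate clips by cosine similarity to a query description.

    text_embedding:  (d,) vector for the high-level description.
    clip_embeddings: (n, d) matrix, one pooled vector per candidate clip
                     (e.g. averaged pre-computed expert features).
    """
    q = l2_normalize(text_embedding)
    c = l2_normalize(clip_embeddings)
    scores = c @ q                        # cosine similarity per clip
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return [(clip_ids[i], float(scores[i])) for i in order]

# Toy usage with random stand-in embeddings; a real system would get these
# from trained text/video encoders projected into a shared space.
rng = np.random.default_rng(0)
clip_ids = [f"clip_{i:03d}" for i in range(100)]
clip_feats = rng.normal(size=(100, 512))
query = rng.normal(size=512)
for cid, score in retrieve_clips(query, clip_feats, clip_ids):
    print(cid, round(score, 3))
```

The interesting research questions sit upstream of this ranking step: how to learn encoders that map a 2-3 minute clip and a story-level description into the same space, which is where the pre-computed expert features come in.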
This group/challenge is related to the Video as Data ICA workshop, and to a broader push to integrate natural language and visual understanding. See, for example, the work of Max Bain, Lisa Anne Hendricks at DeepMind, and Andrew Brown, formerly of VGG and now at Meta. See below: