Pose Estimation on Russian International News Media

Principal Investigators:
Prof. Dr. Peter Uhrig
Project Manager:
Prof. Dr. Peter Uhrig
Additional Affiliation:
TU Dresden
HPC Platform used:
NHR@FAU: Alex and Fritz
Project ID:
b105dc
Date published:
Researchers:
Ilia Burenko
Introduction:
As multimodal communication analysis continues to evolve, high-performance computing (HPC) is playing a transformative role in enabling large-scale annotation and data processing. In the context of the DFG/AHRC-funded research project "World Futures Multimodal Viewpoint Construction by Russian International Media", a research team led by Anna Wilson (University of Oxford) and Peter Uhrig (FAU Erlangen-Nürnberg) has developed an innovative framework for automating speech, text, and gesture annotation. This interdisciplinary effort leverages state-of-the-art AI techniques, supported by the scalable HPC infrastructure provided by the Erlangen National High-Performance Computing Centre (NHR@FAU) in the project "Pose Estimation on Russian International News Media". By combining expert-driven manual annotation with machine learning, the study pushes the boundaries of multimodal analysis, offering improved methods for the study of human multimodal communication at scale. In 2024, the paper summarizing the concepts behind the manual and automatic annotation, "World futures through RT's eyes: multimodal dataset and interdisciplinary methodology", was published in Frontiers in Communication.
Body:

Automatic Annotation: Harnessing AI for Scalable Multimodal Analysis
One of the main challenges in multimodal research is the labor-intensive process of annotation. The research team led by Wilson and Uhrig has addressed this challenge by integrating manual expertise with computational automation. OpenPose serves as the basis for gesture recognition and tracking, allowing for the automatic identification of movement direction and orientation. Additionally, an eyebrow movement detection tool has been developed to speed up annotation and capture intricate multimodal cues.
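OpenPose writes one JSON file per frame, with each person's skeleton stored as a flat list of (x, y, confidence) triples. As a minimal sketch, assuming the BODY_25 keypoint layout, a wrist trajectory read from that output can be turned into coarse movement-direction labels of the kind described above. The `movement_direction` helper and its thresholds are illustrative assumptions, not the project's actual rules:

```python
import json
import math

RWRIST = 4  # index of the right wrist in OpenPose's BODY_25 model


def wrist_position(frame_json):
    """Extract (x, y) of the right wrist from one OpenPose frame, or None."""
    people = json.loads(frame_json).get("people", [])
    if not people:
        return None
    kp = people[0]["pose_keypoints_2d"]  # flat [x0, y0, c0, x1, y1, c1, ...]
    x, y, conf = kp[3 * RWRIST: 3 * RWRIST + 3]
    return (x, y) if conf > 0.3 else None  # drop low-confidence detections


def movement_direction(p_prev, p_next):
    """Classify the coarse movement direction between two frames (illustrative thresholds)."""
    dx, dy = p_next[0] - p_prev[0], p_next[1] - p_prev[1]
    if math.hypot(dx, dy) < 2.0:  # below ~2 px of motion: treat as a hold
        return "hold"
    # image y grows downwards, so negative dy means an upward stroke
    if abs(dx) > abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"
```

In practice such per-frame labels would be smoothed over several frames before being written to an annotation tier.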

Automatic speech recognition (ASR) plays a crucial role in textual annotation. Initially, YouTube’s ASR was employed for speech-to-text conversion, but the study also tested OpenAI’s Whisper model, which demonstrated superior transcription accuracy. However, Whisper’s timestamping limitations necessitated manual refinements. By refining automatic annotation with human expertise, the research strikes a balance between efficiency and precision.
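Whisper's `transcribe()` returns a list of segments with `start`, `end`, and `text` fields, and neighbouring segment boundaries can overlap or drift. A minimal sketch of the kind of boundary clean-up that precedes manual refinement is shown below; the `clean_segments` helper and its midpoint rule are illustrative assumptions, not the team's actual pipeline:

```python
def clean_segments(segments, min_gap=0.02):
    """Resolve overlapping segment boundaries in Whisper-style output.

    `segments` follows the shape of whisper's transcribe() result:
    dicts with 'start' and 'end' in seconds plus 'text'.  Overlapping
    neighbours are snapped to the midpoint of the overlap, separated
    by a small gap, so the tier can be loaded into an annotation tool
    such as ELAN without conflicts.
    """
    cleaned = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        seg = dict(seg)  # copy so the caller's data is not mutated
        if cleaned and seg["start"] < cleaned[-1]["end"]:
            midpoint = (seg["start"] + cleaned[-1]["end"]) / 2
            cleaned[-1]["end"] = midpoint
            seg["start"] = midpoint + min_gap
        cleaned.append(seg)
    return cleaned
```

Boundaries that matter for the analysis would still be corrected by hand, as described above.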
One further central aspect of the toolset developed is the automatic classification of gestural zones. Using machine learning algorithms and a set of manually defined rules, the research team has devised a system that segments and categorizes gestures based on predefined spatial parameters. This enhances annotation reliability and scalability, paving the way for broader applications in multimodal research.
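A rule-based zone classifier of this kind can be sketched in a few lines. The grid below is a simplified, hypothetical variant of a McNeill-style gesture space, not the project's actual zone definitions; positions are normalised by shoulder width so the same rules transfer across speakers and camera distances:

```python
def gesture_zone(wrist, neck, shoulder_width):
    """Assign a wrist position to a coarse gesture-space zone.

    `wrist` and `neck` are (x, y) pixel coordinates (e.g. from
    OpenPose); `shoulder_width` is in pixels.  The cut-off values
    are illustrative placeholders for manually defined rules.
    """
    # express the wrist relative to the neck, in units of shoulder width
    nx = (wrist[0] - neck[0]) / shoulder_width
    ny = (wrist[1] - neck[1]) / shoulder_width  # image y grows downwards
    horiz = "center" if abs(nx) < 0.5 else ("right" if nx > 0 else "left")
    if ny < -0.25:
        vert = "upper"
    elif ny > 1.0:
        vert = "lower"
    else:
        vert = "middle"
    return f"{vert}-{horiz}"
```

Because the rules are explicit, annotators can inspect and adjust each threshold, which is harder with a purely learned classifier.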

High-Performance Computing: Enabling Large-Scale Multimodal Processing
This research is made possible only by the computational power provided by NHR@FAU. Given the vast computational demands of processing video, audio, and textual data, HPC accelerates annotation workflows by distributing tasks across multiple processing nodes. The video analysis with OpenPose in particular placed heavy demands on the Alex GPU cluster run by NHR@FAU, which was used to analyze many years of RT videos captured just before RT was blocked from YouTube.
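Distributing such a corpus across nodes is typically done with a scheduler job array, where each task processes its own slice of the video list. A minimal sketch of a round-robin split follows; the environment-variable lookup and helper name are illustrative, not the project's actual scripts:

```python
import os


def videos_for_task(all_videos, task_id, num_tasks):
    """Return the slice of the corpus one array task should process.

    A round-robin split keeps per-task workloads balanced even when
    the video list is sorted by channel or upload date.
    """
    return [v for i, v in enumerate(all_videos) if i % num_tasks == task_id]


# On the cluster, the task id would come from the scheduler, e.g.:
#   task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
# and each task would then run OpenPose only on videos_for_task(...).
```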
HPC resources also support deep learning experiments aimed at refining multimodal recognition models. The ability to train and test AI models on powerful computing clusters ensures that annotation algorithms continue to improve, enhancing the robustness of automated multimodal analysis. 

By integrating cutting-edge AI with HPC support from NHR@FAU, this study represents a significant advancement in multimodal communication research. The collaboration between cognitive linguists, computational linguists, and machine learning specialists exemplifies the potential of interdisciplinary research to drive innovation. The combination of scalable computing resources with both machine-learning and rule-based annotation shows great potential for efficient yet accurate annotation of video data in the humanities.

Publications resulting from this compute-time project:
Wilson, Anna, Irina Pavlova, Elinor Payne, Ilya Burenko, and Peter Uhrig (2024). World futures through RT's eyes: multimodal dataset and interdisciplinary methodology. Frontiers in Communication 9:1356702. doi: 10.3389/fcomm.2024.1356702

Affiliation:
FAU Erlangen-Nürnberg
Image:
Figure 1: Speech is naturally accompanied by gestures. The annotation system segments and categorizes gestures based on predefined spatial parameters; the underlying OpenPose analysis of many years of Russia Today (RT) videos, captured just before RT was blocked from YouTube, was run on the Alex GPU cluster at NHR@FAU.