Why PATS?

PATS was collected to study the correlation of co-speech gestures with audio and text signals. The dataset consists of a large and diverse collection of aligned pose, audio, and transcripts. With this dataset, we hope to provide a benchmark that helps develop technologies for virtual agents which generate natural and relevant gestures.

What can you find in this dataset?

Pose data with aligned Audio and Transcriptions
- 25 Speakers with different Styles
- Includes 10 speakers from Ginosar, et al. (CVPR 2019)
- 15 talk show hosts, 5 lecturers, 3 YouTubers, and 2 televangelists
251 hours of data
- Around 84,000 intervals
- Mean: 10.7s per interval
- Standard Deviation: 13.5s per interval
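
As a rough consistency check, the interval count and mean interval length listed above can be multiplied to recover the total duration. This is only a sketch using the rounded figures from the list; the variable names are illustrative and not part of the dataset's tooling.

```python
# Rounded figures quoted in the list above.
num_intervals = 84_000     # "around 84,000 intervals"
mean_interval_s = 10.7     # mean interval length, in seconds

# Total duration implied by the two rounded figures, in hours.
total_hours = num_intervals * mean_interval_s / 3600
print(f"{total_hours:.1f} hours")
```

The result lands close to 250 hours, which agrees with the reported 251 hours once rounding of the interval count and mean is taken into account.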

Tasks

The three modalities (Pose, Audio, and Transcriptions), available in many Styles in the PATS dataset, present a unique opportunity for the following tasks:

Cross-Modal Translation
Pose, Audio, and Transcripts are aligned, making PATS a good test-bench for cross-modal translation tasks:
- Audio to Pose
- Transcript to Pose
- Audio + Transcript to Pose

Style Transfer
As pose styles and language styles are often idiosyncratic to each speaker, PATS offers a unique dataset for research on style transfer across modalities:
- Gesture Style Transfer
- Language Style Transfer

Grounding
An important question is how to learn the pairing of multiple modalities, i.e. grounding. With aligned pose, audio, and transcripts, PATS provides a good platform for research on multimodal grounding of conversational language to hand gestures.

Features

Features	Available Representations	Collection Methodology
Audio	Log-mel Spectrograms	Audio scraped from YouTube
Transcripts	Word Tokens	Transcribed using Google ASR with an estimated word error rate of 0.29 (estimated on available transcripts for a subset of videos)
	BERT Embeddings	Pre-trained model 'bert-base-uncased' from HuggingFace, based on Devlin et al. (NAACL 2019)
	Word2Vec Embeddings	Word2Vec based on Mikolov et al. (NIPS 2013)
Pose	2D Skeletal Keypoints	Processed using OpenPose
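
To make the word error rate figure in the table concrete: WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the script used to produce the 0.29 estimate; the function name and the choice to lowercase and split on whitespace are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; the
    normalization actually used for PATS is not documented here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word and one substituted word.
wer = word_error_rate("we hope to provide a benchmark",
                      "we hope to provide benchmarks")
print(round(wer, 2))
```

Under this definition, a WER of 0.29 means that, on average, about 29 word-level edits are needed per 100 reference words.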

Navigating Through Speakers

The speakers in PATS have diverse lexical content in their transcripts along with diverse gestures. The following graphs will help you navigate through the speakers in the dataset, should you want to work with specific speakers of differing gestural and/or lexical diversity.

Fig. 1 shows speakers clustered hierarchically based on the content of their transcripts, and Fig. 2 shows each speaker's position on a lexical diversity vs. spatial extent plot.

As shown in Fig. 1, speakers in the same domain (e.g. TV show hosts) share similar language, as demonstrated by the clusters. Furthermore, in Fig. 2, we can see that TV show hosts are generally more expressive with their hands and words, while televangelists are less so. Speakers in the top right corner of Fig. 2 are more challenging to model in the task of gesture generation, as they have a greater diversity of both vocabulary and gestures.

Reference(s)

If you found this dataset helpful, please consider citing the following paper(s):

                    
No Gestures Left Behind: Learning Relationships between Spoken Language and Freeform Gestures
EMNLP Findings 2020 - [code]
@inproceedings{ahuja2020no,
  title={No Gestures Left Behind: Learning Relationships between Spoken Language and Freeform Gestures},
  author={Ahuja, Chaitanya and Lee, Dong Won and Ishii, Ryo and Morency, Louis-Philippe},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings},
  pages={1884--1895},
  year={2020}
}

Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
ECCV 2020 - [website][code]
@inproceedings{ahuja2020style,
  title={Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach},
  author={Chaitanya Ahuja and Dong Won Lee and Yukiko I. Nakano and Louis-Philippe Morency},
  booktitle={European Conference on Computer Vision (ECCV)},
  month={August},
  year={2020},
  url={https://arxiv.org/abs/2007.12553}
}

Learning individual styles of conversational gesture
CVPR 2019 - [website]
@inproceedings{ginosar2019learning,
  title={Learning individual styles of conversational gesture},
  author={Ginosar, Shiry and Bar, Amir and Kohavi, Gefen and Chan, Caroline and Owens, Andrew and Malik, Jitendra},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={3497--3506},
  year={2019}
}
We kindly ask you to cite Ginosar et al. as well, since the pose and audio files of their 10 speakers are used in our dataset.