Why PATS?

PATS was collected to study the correlation of co-speech gestures with audio and text signals. The dataset consists of a large, diverse set of aligned pose, audio, and transcripts. With it, we hope to provide a benchmark that helps develop technologies for virtual agents that generate natural and relevant gestures.

What can you find in this dataset?

Tasks

The three modalities in PATS (pose, audio, and transcriptions), available across many speaker styles, present a unique opportunity for the following tasks: cross-modal translation, style transfer, and grounding.

Cross-Modal Translation

Pose, audio, and transcripts are aligned, making PATS a good test bench for cross-modal translation tasks:
  • Audio to Pose
  • Transcript to Pose
  • Audio + Transcript to Pose

Style Transfer

As pose styles and language styles are often idiosyncratic to each speaker, PATS offers a unique dataset for research on style transfer of various modalities:
  • Gesture Style Transfer
  • Language Style Transfer

Grounding

An important question is learning the pairing of multiple modalities, i.e. grounding. With aligned pose, audio, and transcripts, PATS provides a good platform for research on multimodal grounding of conversational language in hand gestures.

Features

  • Audio: Log-Mel Spectrograms (audio scraped from YouTube)
  • Transcripts: Word Tokens (transcribed using Google ASR, with an estimated word error rate of 0.29, estimated on available transcripts for a subset of videos), BERT Embeddings (pre-trained 'bert_base_uncased' model from HuggingFace; Devlin et al., NAACL 2019), and Word2Vec Embeddings (Mikolov et al., NIPS 2013)
  • Pose: 2D Skeletal Keypoints (processed using OpenPose)
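As a concrete illustration of the audio representation listed above, the sketch below computes a log-mel spectrogram from a raw waveform with NumPy. This is a minimal sketch of the standard technique, not the exact extraction pipeline used for PATS; the sample rate, frame, hop, and mel-band sizes here are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Log-mel spectrogram of a mono waveform (illustrative parameters)."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

    # Triangular mel filterbank spanning 0 Hz .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)

    mel = power @ fbank.T                # (n_frames, n_mels)
    return np.log(mel + 1e-10)           # small floor avoids log(0)

# Usage: one second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (97, 64)
```

In practice a library such as librosa is typically used for this step; the hand-rolled version above only shows what the representation contains.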
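The word error rate quoted for the transcripts is conventionally the word-level Levenshtein distance (substitutions + deletions + insertions) divided by the reference length. A minimal, self-contained sketch of how such a figure is computed (illustrative only, not the exact evaluation script used for PATS):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quick brown fox"))  # 0.0
print(word_error_rate("the quick brown fox", "the quack brown"))      # 0.5
```

A WER of 0.29 thus means roughly 29 word-level errors per 100 reference words on the videos for which reference transcripts were available.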

Navigating Through Speakers

The speakers in PATS have diverse lexical content in their transcripts along with diverse gestures. The following graphs will help you navigate the speakers in the dataset, should you want to work with specific speakers of differing gestural and/or lexical diversity.

Fig. 1 shows speakers clustered hierarchically by the content of their transcripts, and Fig. 2 shows each speaker's position on a lexical diversity vs. spatial extent plot.

As shown in Fig. 1, speakers in the same domain (e.g. TV show hosts) share similar language, as demonstrated by the clusters. Furthermore, in Fig. 2, we can see that TV show hosts are generally more expressive with their hands and words, while televangelists are less so. Speakers in the top-right corner of Fig. 2 are more challenging to model for gesture generation, as they have a greater diversity of both vocabulary and gestures.

References

If you find this dataset helpful, please consider citing the following papers:

1. Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
   ECCV 2020 - [website][code]

   @inproceedings{ahuja2020style,
     title={Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach},
     author={Chaitanya Ahuja and Dong Won Lee and Yukiko I. Nakano and Louis-Philippe Morency},
     booktitle={European Conference on Computer Vision (ECCV)},
     month={August},
     year={2020},
     url={https://arxiv.org/abs/2007.12553}
   }

2. Learning individual styles of conversational gesture
   CVPR 2019 - [website]

   @inproceedings{ginosar2019learning,
     title={Learning individual styles of conversational gesture},
     author={Ginosar, Shiry and Bar, Amir and Kohavi, Gefen and Chan, Caroline and Owens, Andrew and Malik, Jitendra},
     booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
     pages={3497--3506},
     year={2019}
   }

   We kindly ask you to cite Ginosar et al. as well, as the pose and audio files of 10 of their speakers are used in our dataset.

Authors

Chaitanya Ahuja

Dong Won Lee

Yukiko Nakano

Louis-Philippe Morency