Andrew Owens


Assistant Professor at The University of Michigan in the Department of Electrical Engineering and Computer Science

Learning by Audio-Visual Analogy



Today’s machine perception systems rely extensively on supervision provided to them by humans, such as labels and natural language. In this talk, I will discuss our work on making systems that, instead, learn from cross-modal associations between images and sound. In particular, I’ll focus on our efforts to learn from “audio-visual analogies”: the problem of estimating a sound for a video that relates to it “in the same way” as the sound in another given audio-visual pair relates to its video. I will show that this framework can be used to create conditional Foley generation methods that use user-provided hints to add sound effects to video. It can also be used to create methods that jointly learn to solve two related audio and visual analysis tasks: localizing sounds and estimating camera rotation. I will also talk about how this work relates to the problem of aligning audio and visual signals, and to other problems in 3D sound perception. Finally, I will briefly discuss how the tools from audio-visual analysis can be applied to other sensory modalities.



Andrew Owens is an assistant professor at The University of Michigan in the Department of Electrical Engineering and Computer Science. Prior to that, he was a postdoctoral scholar at UC Berkeley. He received a Ph.D. in Electrical Engineering and Computer Science from MIT in 2016. He is a recipient of a Computer Vision and Pattern Recognition (CVPR) Best Paper Honorable Mention Award and a Microsoft Research Ph.D. Fellowship.

Björn Schuller


Professor of Artificial Intelligence, Imperial College London/UK & Professor of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany

The Paralinguistics Challenges: Sound as Seen Through the Speech Looking Glasses



Computational Paralinguistics, and in particular speech analysis, has been featured in competitive data challenges across the international conferences ACII, ACM Multimedia, ICMI, ICML, Interspeech, NeurIPS, and beyond, in foundation-laying series such as AVEC, ComParE, MuSe, or HEAR, (co-)organised by the presenter over the last decade and a half. This perspective talk presents the outcomes of these events to the DCASE community: both fields have grown considerably in recent years, and open data, public benchmarks, and data challenges have played an important role in the development of both. A key aim is to identify significant differences in the approaches, to spark ideas across the tasks involved. To this end, the challenges will be presented in a nutshell, including the field’s move from expert to deep representations and ultimately foundation models. In particular, insights on the most competitive approaches will be distilled from the participants’ results. On a final note, the talk will offer a potential future perspective on acoustic scene and event analysis in a “paralinguistic” style.



Björn W. Schuller received his diploma, doctoral degree, habilitation, and the title of Adjunct Teaching Professor in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich/Germany. He is Full Professor of Artificial Intelligence and the Head of GLAM - the Group on Language, Audio, & Music - at Imperial College London/UK, Full Professor and Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg/Germany, and co-founding CEO and current CSO of audEERING, an Audio Intelligence company based near Munich and in Berlin/Germany, amongst other Professorships and Affiliations. Previous stays include Full Professor at the University of Passau/Germany, Key Researcher at Joanneum Research in Graz/Austria, and the CNRS-LIMSI in Orsay/France. He is a Fellow of the IEEE and Golden Core Awardee of the IEEE Computer Society, Fellow of the BCS, Fellow of ELLIS, Fellow of the ISCA, Fellow and President-Emeritus of the AAAC, Elected Full Member of Sigma Xi, and Senior Member of the ACM. He has (co-)authored 1,200+ publications (50,000+ citations, h-index 100+), is Field Chief Editor of Frontiers in Digital Health, and was Editor in Chief of the IEEE Transactions on Affective Computing, among many further commitments and service to the community.