Democratizing Audio Machine Learning with the Huggingface Ecosystem


Presenters: Jose Omar Giraldo (Barcelona Supercomputing Center), Vaibhav Srivastav (Hugging Face)

Time: 14:00 - 15:30, Wednesday 20th of September

Tutorial content:

The Hugging Face ecosystem provides a powerful platform for hosting, sharing, and using deep learning (DL) models. In this tutorial, we will walk you through the process of uploading your audio datasets to the Hugging Face Datasets library, making them easily accessible to researchers and practitioners worldwide. Additionally, we will guide you in sharing your custom audio models on the Hugging Face Model Hub, enabling others to benefit from your trained models and promoting collaboration within the audio processing community.


  • Loading datasets: Learn techniques and best practices for efficiently loading and preprocessing large audio datasets.
  • Writing efficient dataloaders for audio data: Explore optimization techniques, including parallelization, caching, and memory management, to speed up the training process.
  • Uploading your audio dataset to Hugging Face: Walk through the process of sharing and making your audio datasets accessible to the wider research community.
  • Adding a custom model to Transformers: Discover how to develop and integrate your own audio models into the Hugging Face ecosystem.

This tutorial is aimed at beginners and intermediate users who want to benefit from the Hugging Face ecosystem and share their work with a vibrant community of ML practitioners.

Tutorial material

GitHub Repository

Datasets related to the tutorial in HF2 format:

Biographies of the presenters:

Jose Omar Giraldo is a sound engineer passionate about coding and AI. He is part of the speech team at the Barcelona Supercomputing Center but started his career in environmental audio classification and bioacoustics. He has collaborated with the Rainforest Connection and Orcasound projects to develop new methods of sound dataset exploration.

Vaibhav Srivastav is a Machine Learning Developer Advocate at Hugging Face focusing on democratising audio through open source. He has been a freelancer, tax analyst, consultant, tech speaker and advisor for five years. In the past three years, he has invested significant time volunteering for open source and science organisations like Hugging Face, EuroPython Society, PyCons across APAC, Google Cloud, and Facebook Developer Circles.

Monitoring Environmental Impact of DCASE Systems: Why and How?


Presenters: Constance Douwes (IRCAM), Francesca Ronchini (Politecnico di Milano), Romain Serizel (Université de Lorraine)

Time: 16:00 - 17:30, Wednesday 20th of September

Tutorial content:

With the increasingly complex models used in machine learning and the large amount of data needed to train these models, machine learning-based solutions can have a large environmental impact. Even if a few hundred experiments are sometimes needed to train a working model, the cost of the training phase represents only 10% to 20% of the total CO2 emissions of the related machine learning usage (the rest lying in the inference phase). Yet, for machine listening researchers, the largest part of our energy consumption lies in the training phase. Even though models used in machine listening are smaller than those used in natural language processing or image generation, they still present similar problems. Comparing the energy consumption of systems trained at different sites can be a complex task, and the relation between a system's performance and its energy footprint can be difficult to interpret. The aim of this tutorial is to present an overview of existing tools that can be used to monitor the energy consumption and computational costs of neural network-based models.

The tutorial will be structured as follows:

  • We will first present the different metrics that can be used to get an indication of a system's computational footprint.
  • We will address the usability of these metrics at a community level.
  • We will propose a case study on submissions to DCASE task 4 in 2022 and 2023.
  • We will close the tutorial with a “hands-on” demonstration of how to use some of the tools presented during the tutorial, including their limitations.
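
As a rough illustration of the simplest kind of footprint metric mentioned above, the sketch below converts an average power draw and a run time into energy (kWh), then into CO2-equivalent emissions using a grid carbon intensity. The function names and all numeric values are hypothetical and chosen only for illustration; they are not taken from the tutorial material, and real monitoring tools measure power rather than assume it.

```python
def energy_kwh(avg_power_watts: float, duration_hours: float) -> float:
    """Energy in kWh from an average power draw (W) and a run time (h)."""
    return avg_power_watts * duration_hours / 1000.0


def emissions_kg_co2e(energy: float, carbon_intensity_g_per_kwh: float) -> float:
    """CO2-equivalent emissions (kg) for a given energy use (kWh)."""
    return energy * carbon_intensity_g_per_kwh / 1000.0


# Hypothetical values: a 250 W average draw for 12 hours,
# on a grid emitting 400 g CO2e per kWh.
training_energy = energy_kwh(avg_power_watts=250.0, duration_hours=12.0)
training_emissions = emissions_kg_co2e(
    training_energy, carbon_intensity_g_per_kwh=400.0
)
print(f"{training_energy:.2f} kWh, {training_emissions:.2f} kg CO2e")
```

Because carbon intensity depends on where and when a system is trained, two runs with identical energy use can yield different emission figures, which is one reason comparisons across sites are difficult.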

Tutorial material

GitHub Repository

Biographies of the presenters:

Constance Douwes graduated with a PhD in Computer Science in 2023 at IRCAM (Institut de Recherche en Coordination Acoustique/Musique), where she worked on the energy and environmental impact of specialised deep learning models for audio generation.

Francesca Ronchini is a PhD student in the Image and Sound Processing Lab (ISPLab) at Politecnico di Milano. Her research explores sustainable deep learning approaches for machine listening applications and their interdisciplinary use. She has been co-organizing DCASE task 4 since 2021. Since 2023 she has been a member of the DCASE steering group.

Romain Serizel is an Associate Professor with Université de Lorraine (Nancy, France) doing research on robust speech communications and ambient sound analysis. He has been co-organizing DCASE tasks since 2018, including task 4, which has included the evaluation of the submissions' energy consumption since 2022. Since 2019 he has been general co-chair of the DCASE challenge together with Annamaria Mesaros, and he is a member of the DCASE steering group. He was DCASE workshop general co-chair in 2022.