Language-Based Audio Retrieval


Task description

Audio retrieval with human written captions.

Challenge has ended. Full results for this task can be found in the Results page.

If you are interested in the task, you can join us on the dedicated Slack channel.

Description

This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on their match with the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.

This subtask will allow using pre-trained models and external data for training models. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing like part-of-speech (POS) tagging. Additionally, the participants can use external audio and/or textual data, e.g., external text corpus for learning a language model or additional audio data like AudioSet, ESC-50.

Audio dataset

The development dataset for this subtask is Clotho v2, the same as in the DCASE 2021 Challenge (task 6), but repurposed for language-based retrieval. Specifically, each caption will be treated as a text query, and non-corresponding audio files as non-relevant retrieved items when measuring the retrieval performance.
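As a minimal sketch of this repurposing, the snippet below turns each caption into one query whose only relevant item is its own audio file. The table layout (a file_name column followed by caption_N columns) is an assumption to be checked against the downloaded CSV files, and the file names and captions are made up.

```python
import csv
import io

# Hypothetical two-clip captions table; the column names are an assumption.
SAMPLE_CSV = """file_name,caption_1,caption_2
rain.wav,Rain falls on a roof,Water drips steadily
door.wav,A door creaks open,Someone opens a squeaky door
"""


def captions_to_queries(csv_text):
    """Return a list of (query_caption, relevant_file_name) pairs."""
    queries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if column.startswith("caption_") and value:
                queries.append((value, row["file_name"]))
    return queries


queries = captions_to_queries(SAMPLE_CSV)
all_files = sorted({file_name for _, file_name in queries})

for caption, relevant in queries:
    # Every other file counts as non-relevant for this query.
    non_relevant = [f for f in all_files if f != relevant]
    print(f"{caption!r}: relevant={relevant}, non-relevant={non_relevant}")
```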

The Clotho v2 dataset consists of audio samples of 15 to 30 seconds duration, with each audio sample having five captions of eight to 20 words in length. There is a total of 6974 audio samples, with 34870 captions (i.e. 6974 audio samples * 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see

Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

Clotho: an Audio Captioning Dataset

Abstract

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).

The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.

Task setup

Development dataset

The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
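The first two steps of this procedure (per-clip sets of unique words, then dataset-wide word frequencies) can be sketched as below. This is an illustration under stated assumptions: captions are lower-cased and split on whitespace, "frequency" is taken to mean the number of clips containing a word, and the clips and captions are invented. The multi-label stratification step itself is not shown.

```python
from collections import Counter

# Hypothetical captions for three clips (made-up data).
clip_captions = {
    "a.wav": ["A dog barks loudly", "a dog is barking"],
    "b.wav": ["Rain falls on a roof", "heavy rain is falling"],
    "c.wav": ["A dog barks in the rain"],
}


def unique_words(captions):
    """Lower-cased set of whitespace-separated words over a clip's captions."""
    return {word for caption in captions for word in caption.lower().split()}


# One set of unique words per clip; these sets act as the clip's labels.
clip_words = {clip: unique_words(caps) for clip, caps in clip_captions.items()}

# Number of clips in which each word occurs.
word_frequency = Counter(word for words in clip_words.values() for word in words)

print(word_frequency.most_common(3))
```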

Please note that the names of the splits for Clotho v2 differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho v2 and the DCASE challenge is:

Clotho naming of splits | DCASE Challenge naming of splits
development | development-training
validation | development-validation
evaluation | development-testing

For the rest of this text, the DCASE challenge terminology will be used. For differentiating between Clotho development and evaluation, the terms development-training, development-validation, and development-testing will be used, wherever necessary. Development-training refers to Clotho development split, development-validation refers to Clotho validation split, and development-testing refers to Clotho evaluation split.

The development data can be found at the online Zenodo repository. Make sure that you download Clotho v2.1, as there were some minor fixes in the dataset (fixing of file naming and some corrupted files).


Development-training data are:

  • clotho_audio_development.7z: The development-training audio clips.
  • clotho_captions_development.csv: The captions of the development-training audio clips.
  • clotho_metadata_development.csv: The meta-data of the development-training audio clips.

Development-validation data are:

  • clotho_audio_validation.7z: The development-validation audio clips.
  • clotho_captions_validation.csv: The captions of the development-validation audio clips.
  • clotho_metadata_validation.csv: The meta-data of the development-validation audio clips.

Development-testing data are:

  • clotho_audio_evaluation.7z: The development-testing audio clips.
  • clotho_captions_evaluation.csv: The captions of the development-testing audio clips.
  • clotho_metadata_evaluation.csv: The meta-data of the development-testing audio clips.

Evaluation dataset

The evaluation dataset will consist of 1000 audio samples, with one caption each. The data are collected following the same procedure as Clotho v2.


Task rules

Participants are allowed to:

  • Use external data (e.g. audio files, text corpus, annotations).
  • Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
  • Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
  • Use all the provided metadata, as long as they explicitly state in their submission that they did so. This will not affect the rating of their method.

Participants are NOT allowed to:

  • Make subjective judgments of the evaluation data, nor to annotate it.
  • Use additional information about the evaluation data for their method, apart from the provided audio files and captions of the evaluation data.

Submission

All participants should submit:

  • the output of their audio retrieval with the evaluation data (*.csv file),
  • metadata for their submission (*.yaml file), and
  • a technical report for their submission (*.pdf file).

We allow up to 4 system output submissions per participant/team. For each system, metadata should be provided in a separate file, containing the task specific information. All files should be packaged into a zip file for submission. Please make a clear connection between the system name in the submitted metadata (the .yaml file), submitted system output (the .csv file), and the technical report (the .pdf file)! For indicating the connection of your files, you can consider using the following naming convention:

<author>_<institute>_task6b_submission_<submission_index>_<output or metadata or report>.<csv or yaml or pdf>

For example:

xie_tau_task6b_submission_1_output.csv
xie_tau_task6b_submission_1_metadata.yaml
xie_tau_task6b_submission_1_report.pdf

The field <submission_index> differentiates your submissions in case you have more than one.
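A small helper along these lines can keep the three file names of one submission consistent. It is only a sketch: the function name and its arguments are hypothetical, and it simply fills in the naming convention suggested above.

```python
def submission_file_names(author, institute, submission_index):
    """Build the three suggested file names for one submission."""
    stem = f"{author}_{institute}_task6b_submission_{submission_index}"
    return {
        "output": f"{stem}_output.csv",
        "metadata": f"{stem}_metadata.yaml",
        "report": f"{stem}_report.pdf",
    }


names = submission_file_names("xie", "tau", 1)
for kind, name in names.items():
    print(kind, name)
```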

System output file

The system output file should be a *.csv file, and should have the following 11 columns:

  1. caption: query caption.
  2. file_name_1: file name of the audio file that is most relevant to the query caption in caption.
  3. file_name_2: file name of the audio file that is 2nd most relevant to the query caption in caption.
  4. file_name_3: file name of the audio file that is 3rd most relevant to the query caption in caption.
  5. file_name_4: file name of the audio file that is 4th most relevant to the query caption in caption.
  6. file_name_5: file name of the audio file that is 5th most relevant to the query caption in caption.
  7. file_name_6: file name of the audio file that is 6th most relevant to the query caption in caption.
  8. file_name_7: file name of the audio file that is 7th most relevant to the query caption in caption.
  9. file_name_8: file name of the audio file that is 8th most relevant to the query caption in caption.
  10. file_name_9: file name of the audio file that is 9th most relevant to the query caption in caption.
  11. file_name_10: file name of the audio file that is 10th most relevant to the query caption in caption.

Example output:

caption,file_name_1,file_name_2,file_name_3,file_name_4,file_name_5,file_name_6,file_name_7,file_name_8,file_name_9,file_name_10
The person is rummaging through the pans while looking for something,fn1.wav,fn2.wav,fn3.wav,fn4.wav,fn5.wav,fn6.wav,fn7.wav,fn8.wav,fn9.wav,fn10.wav

In the example output, "fn2.wav" is the file name of the audio file that is 2nd most relevant to the query caption "The person is rummaging through the pans while looking for something".
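One way to produce a file in this layout is Python's standard csv module, sketched below. The function name and the in-memory results dictionary are illustrative assumptions; the header follows the 11 columns listed above.

```python
import csv
import io

HEADER = ["caption"] + [f"file_name_{rank}" for rank in range(1, 11)]


def write_system_output(results, stream):
    """Write one row per query: the caption followed by its top-10 files.

    `results` maps each caption to a list of file names ordered from most
    to least relevant.
    """
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(HEADER)
    for caption, ranked_files in results.items():
        if len(ranked_files) != 10:
            raise ValueError(f"expected 10 files for {caption!r}")
        writer.writerow([caption] + list(ranked_files))


# Made-up retrieval result for a single query.
results = {
    "The person is rummaging through the pans while looking for something":
        [f"fn{rank}.wav" for rank in range(1, 11)],
}

buffer = io.StringIO()
write_system_output(results, buffer)
print(buffer.getvalue())
```

Using csv.writer rather than manual string joining means that a caption containing a comma is quoted automatically, so the column count stays at 11.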

Metadata file

For each system, metadata should be provided in a separate file. The file format should be as indicated below.

# Submission information for task 6 - subtask B
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: xie_tau_task6b_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2022 baseline system
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the result table when space is tight.
  # Use maximum 10 characters.
  abbreviation: Baseline

  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Xie
      firstname: Huang
      email: huang.xie@tuni.fi                    # Contact email address
      corresponding: true                         # Mark true for one of the authors

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Second author
    - lastname: Lipping
      firstname: Samuel
      email: samuel.lipping@tuni.fi                # Contact email address

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences            # Optional
        location: Tampere, Finland

    # Third author
    - lastname: Virtanen
      firstname: Tuomas
      email: tuomas.virtanen@tuni.fi

      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland

# System information
system:
  # System description, meta-data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:

    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 44.1kHz

    # Acoustic representation
    # Here you should indicate what acoustic or audio representation
    # you used. If your system used hand-crafted features (e.g.
    # mel band energies), then you can do:
    #
    # `acoustic_features: mel energies`
    #
    # Else, if you used some pre-trained audio feature extractor, 
    # you can indicate the name of the system, for example:
    #
    # `acoustic_features: audioset`
    acoustic_features: log-mel energies

    # Word embeddings
    # Here you can indicate how you treated word embeddings.
    # If your method learned its own word embeddings (i.e. you
    # did not use any pre-trained word embeddings) then you can
    # do:
    #
    # `word_embeddings: learned`
    #  
    # Else, specify the pre-trained word embeddings that you used
    # (e.g. Word2Vec, BERT, etc).
    word_embeddings: Word2Vec

    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null

    # Method scheme
    # Here you should indicate the scheme of the method that you
    # used. For example:
    machine_learning_method: cross-modal alignment

    # Learning scheme
    # Here you should indicate the learning scheme. 
    # For example, you could specify either
    # supervised, self-supervised, or even 
    # reinforcement learning. 
    learning_scheme: self-supervised

    # Ensemble
    # Here you should indicate if you used ensemble
    # of systems or not.
    ensemble: No

    # Audio modelling
    # Here you should indicate the type of system used for
    # audio modelling. For example, if you used some stacked CNNs, then
    # you could do:
    #
    # audio_modelling: cnn
    #
    # If you used some pre-trained system for audio modelling,
    # then you should indicate the system used (e.g. COALA, COLA,
    # transformer).
    audio_modelling: crnn

    # Word modelling
    # Similarly, here you should indicate the type of system used
    # for word modelling. For example, if you used some RNNs,
    # then you could do: 
    #
    # word_modelling: rnn
    #
    # If you used some pre-trained system for word modelling,
    # then you should indicate the system used (e.g. transformer).
    word_modelling: word2vec

    # Loss function
    # Here you should indicate the loss function that you employed.
    loss_function: triplet loss

    # Optimizer
    # Here you should indicate the name of the optimizer that you
    # used. 
    optimizer: adam

    # Learning rate
    # Here you should indicate the learning rate of the optimizer
    # that you used.
    learning_rate: 1e-3

    # Gradient clipping
    # Here you should indicate if you used any gradient clipping. 
    # You do this by indicating the value used for clipping. Use
    # 0 for no clipping.
    gradient_clipping: 0

    # Gradient norm
    # Here you should indicate the norm of the gradient that you
    # used for gradient clipping. This field is used only when 
    # gradient clipping has been employed.
    gradient_norm: !!null

    # Metric monitored
    # Here you should report the monitored metric
    # for optimizing your method. For example, did you
    # monitor the loss on the validation data (i.e. validation
    # loss)? Or did you monitor the SPIDEr metric? Maybe the training
    # loss?
    metric_monitored: validation_loss

  # System complexity, meta-data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters: 732354

  # List of datasets used for training the system.
  # Development-training data is used here only as example.
  training_datasets:
    - name: Clotho-development

      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391

      # Has audio:
      has_audio: Yes

      # Has images
      has_images: No

      # Has video
      has_video: No

      # Has captions
      has_captions: Yes

      # Number of captions per audio
      nb_captions_per_audio: 5

      # Number of audio clips per caption
      nb_clips_per_caption: 1

      # Total amount durations (in seconds) of audio used
      total_audio_length: 86353

      # Total amount of captions used
      total_captions: 19195

  # List of datasets used for validating the system, for example, optimizing hyperparameter.
  # Development-validation data is used here only as example.
  validation_datasets:
    - name: Clotho-validation

      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391

      # Has audio:
      has_audio: Yes

      # Has images
      has_images: No

      # Has video
      has_video: No

      # Has captions
      has_captions: Yes

      # Number of captions per audio
      nb_captions_per_audio: 5

      # Number of audio clips per caption
      nb_clips_per_caption: 1

      # Total amount durations (in seconds) of audio used
      total_audio_length: 23636

      # Total amount of captions used
      total_captions: 5225

  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets
  external_datasets:
    # Dataset name
    - name: Clotho

      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391

      # Has audio:
      has_audio: Yes

      # Has images
      has_images: No

      # Has video
      has_video: No

      # Has captions
      has_captions: Yes

      # Number of captions per audio
      nb_captions_per_audio: 5

      # Number of audio clips per caption
      nb_clips_per_caption: 1

      # Total amount durations (in seconds) of audio used
      total_audio_length: 133442

      # Total amount of captions used
      total_captions: 29645

      # Used for (e.g. audio_modelling, word_modelling, audio_and_word_modelling)
      used_for: audio_and_word_modelling

      # URL to the source code of the system [optional]
      source_code: https://github.com/xieh97/dcase2022-audio-retrieval

# System results
results:
  development_testing:
    # System results for development testing split.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.
    R@1: 0.03
    R@5: 0.11
    R@10: 0.19
    mAP@10: 0.07
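Before packaging, the label and abbreviation fields can be sanity-checked with a few lines of Python. This is a rough sketch: the regular expression is an assumption derived from the label comment in the template (it would reject, for instance, hyphenated last names), and the function name is hypothetical.

```python
import re

# Pattern assumed from the label comment: lastname_institute_task<id>_<1-4>.
LABEL_PATTERN = re.compile(r"^[A-Za-z]+_[A-Za-z0-9]+_task\d+[a-z]?_[1-4]$")


def check_submission_fields(label, abbreviation):
    """Return a list of problems found in the two fields (empty if none)."""
    problems = []
    if not LABEL_PATTERN.match(label):
        problems.append(f"label {label!r} does not match the expected pattern")
    if len(abbreviation) > 10:
        problems.append(f"abbreviation {abbreviation!r} exceeds 10 characters")
    return problems


print(check_submission_fields("xie_tau_task6b_1", "Baseline"))
print(check_submission_fields("xie_tau_task6b_5", "BaselineSystem"))
```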

Open and reproducible research

Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g. on GitHub) and their pre-trained models available after the challenge is over.

Evaluation

The submitted systems will be evaluated according to their performance, i.e., recall at K (R@K) and mean average precision at K (mAP@K), on the withheld evaluation dataset. An explanation of these metrics can be found here. Specifically, the following metrics will be reported for every submitted method:

  1. R@1: Recall score among the top-1 retrieved result, averaged across all caption queries.
  2. R@5: Recall score among the top-5 retrieved results, averaged across all caption queries.
  3. R@10: Recall score among the top-10 retrieved results, averaged across all caption queries.
  4. mAP@10: Average precision among the top-10 retrieved results, averaged across all caption queries.

R@K is a rank-unaware evaluation metric, which measures the proportion of relevant audio files among the top-K retrieved results to all relevant audio files in the evaluation dataset for the caption query. mAP@K is a rank-aware metric, which gives the mean, over all caption queries, of the average precision of relevant audio files among the top-K retrieved results. Submitted methods will be ranked by the mAP@10 metric.
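The two metrics can be sketched in a few lines of Python. This is not the official evaluation script: the function names are illustrative, the example rankings are made up, and the average precision is normalised by min(K, number of relevant files), which is an assumption. With a single relevant file per caption, AP@K reduces to 1/rank when the file is in the top K and 0 otherwise.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k results."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked, relevant, k):
    """Sum of precision@i over ranks i <= k holding a relevant item,
    normalised by min(k, number of relevant items)."""
    hits = 0
    precision_sum = 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / min(k, len(relevant))


def evaluate(rankings, ground_truth):
    """Average R@1, R@5, R@10 and AP@10 over all caption queries."""
    n = len(rankings)
    scores = {"R@1": 0.0, "R@5": 0.0, "R@10": 0.0, "mAP@10": 0.0}
    for caption, ranked in rankings.items():
        relevant = ground_truth[caption]
        for k in (1, 5, 10):
            scores[f"R@{k}"] += recall_at_k(ranked, relevant, k) / n
        scores["mAP@10"] += average_precision_at_k(ranked, relevant, 10) / n
    return scores


# Made-up example: two queries, one relevant file each.
rankings = {
    "query a": ["x.wav", "a.wav", "y.wav"],  # relevant file at rank 2
    "query b": ["b.wav", "x.wav", "y.wav"],  # relevant file at rank 1
}
ground_truth = {"query a": {"a.wav"}, "query b": {"b.wav"}}

scores = evaluate(rankings, ground_truth)
print(scores)
```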

Results

Complete results and technical reports can be found on the results page.

Baseline system

To provide a starting point and some initial results for the challenge, there is a baseline system for the language-based audio retrieval subtask.

The baseline system is a simpler version of the audio-text aligning framework presented in this paper, which calculates relevance scores between encoded textual descriptions and encoded audio signals. A convolutional recurrent neural network (CRNN) is used as the audio encoder, which extracts frame-wise acoustic embeddings from audio signals. Then, an audio signal is represented by the average of its frame-wise acoustic embeddings. A pre-trained Word2Vec (published here by Google) is used as the text encoder, which converts textual descriptions into sequences of word embeddings. A textual description is represented by the average of its word embeddings. The relevance score between an audio signal and a textual description is calculated by the dot product of their vector representations. The baseline system is optimized with a triplet ranking loss criterion.
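The scoring step described above (average the embeddings, then take a dot product) and a generic triplet ranking loss can be illustrated in plain Python. This is a toy sketch, not the baseline implementation: the 2-dimensional vectors stand in for encoder outputs, and the margin value and the exact form of the hinge loss are assumptions, since the text does not specify them.

```python
def mean_pool(vectors):
    """Average a sequence of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that pushes score(anchor, positive) above
    score(anchor, negative) by at least `margin` (assumed value)."""
    return max(0.0, margin + dot(anchor, negative) - dot(anchor, positive))


# Made-up 2-dimensional embeddings standing in for encoder outputs.
audio_frames = [[1.0, 0.0], [3.0, 2.0]]    # frame-wise acoustic embeddings
matching_words = [[2.0, 0.0], [2.0, 2.0]]  # word embeddings, matching caption
other_words = [[0.0, 1.0], [0.0, 3.0]]     # word embeddings, unrelated caption

audio = mean_pool(audio_frames)
caption_pos = mean_pool(matching_words)
caption_neg = mean_pool(other_words)

score_pos = dot(audio, caption_pos)  # relevance score of the matching pair
score_neg = dot(audio, caption_neg)  # relevance score of the mismatched pair
loss = triplet_ranking_loss(audio, caption_pos, caption_neg)
print(score_pos, score_neg, loss)
```

At retrieval time, the same relevance score would be computed between a query caption and every audio file, and the 10 highest-scoring files returned in descending order.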

For more details on the audio-text aligning framework, please see

Publication

Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 8867–8871. 2022.

Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

Abstract

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.

For details on the CRNN audio encoder, please see

Publication

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.

A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning

Abstract

Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).

For the triplet ranking loss used to train the baseline system, please see

Publication

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.

Signature Verification Using a "Siamese" Time Delay Neural Network

Abstract

This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a 'Siamese' neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.

For the pre-trained Word2Vec:

Repository

The PyTorch implementation of the baseline system is freely available online, and can be found at GitHub.


Results for the development dataset

The results of the baseline system on the development-testing split are shown below. Additionally, jackknife resampling is used to estimate the 95% confidence intervals. For more information on jackknife resampling, please check this paper.

Metric | Value (95% confidence interval)
R@1 | 0.03 (0.03 - 0.04)
R@5 | 0.11 (0.10 - 0.12)
R@10 | 0.19 (0.18 - 0.20)
mAP@10 | 0.07 (0.06 - 0.07)
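For readers unfamiliar with jackknife resampling, the following sketch shows the textbook leave-one-out version applied to per-query scores, with a normal-approximation 95% interval. Whether the organizers leave out single queries or groups of them is not stated here, so the resampling unit, the function name, and the sample values are all assumptions.

```python
import math


def jackknife_mean_ci(values, z=1.96):
    """Leave-one-out jackknife estimate of the mean and a
    normal-approximation confidence interval (z=1.96 for 95%)."""
    n = len(values)
    total = sum(values)
    # Mean of the data with each observation left out in turn.
    loo_means = [(total - v) / (n - 1) for v in values]
    loo_avg = sum(loo_means) / n
    # Jackknife variance estimate of the mean.
    variance = (n - 1) / n * sum((m - loo_avg) ** 2 for m in loo_means)
    half_width = z * math.sqrt(variance)
    mean = total / n
    return mean, mean - half_width, mean + half_width


# Made-up per-query AP@10 values.
per_query_ap = [1.0, 0.5, 0.0, 0.25, 0.0, 1.0, 0.0, 0.2]
mean, low, high = jackknife_mean_ci(per_query_ap)
print(f"mAP@10 = {mean:.2f} ({low:.2f} - {high:.2f})")
```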

Citations

If you participate in this task, you might want to check the following papers. If you find a paper that needs to be cited here, please contact us.

  • The task summary:
Publication

Huang Xie, Samuel Lipping, and Tuomas Virtanen. Language-based audio retrieval task in dcase 2022 challenge. Accepted at DCASE 2022 Workshop, 2022.

Language-based Audio Retrieval Task in DCASE 2022 Challenge

Abstract

Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset. It has been first introduced into DCASE 2022 Challenge as Subtask 6B of task 6, which aims at developing computational systems to model relationships between audio signals and free-form textual descriptions. Compared with audio captioning (Subtask 6A), which is about generating audio captions for audio signals, language-based audio retrieval (Subtask 6B) focuses on ranking audio signals according to their relevance to natural language textual captions. In DCASE 2022 Challenge, the provided baseline system for Subtask 6B was significantly outperformed, with top performance being 0.276 in mAP@10. This paper presents the outcome of Subtask 6B in terms of submitted systems' performance and analysis.

  • The Clotho dataset:
Publication

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.

  • The original audio-text aligning framework:
Publication

Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 8867–8871. 2022.

  • The CRNN audio encoder, used for the baseline system:
Publication

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.

  • The triplet ranking loss, used for the baseline system:
Publication

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.

  • The jackknife resampling, used for estimating the 95% confidence intervals:
Publication

Annamaria Mesaros, Aleksandr Diment, Benjamin Elizalde, Toni Heittola, Emmanuel Vincent, Bhiksha Raj, and Tuomas Virtanen. Sound event detection in the dcase 2017 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pages 992–1006, 2019.
