Audio retrieval with human-written captions.
The challenge has ended. Full results for this task can be found on the Results page.
Description
This subtask is concerned with retrieving audio signals using textual descriptions of their sound content (i.e., audio captions). Human-written audio captions will be used as text queries. For each text query, the goal of this task is to retrieve 10 audio files from a given dataset and sort them based on how well they match the query. Through this subtask, we aim to inspire further research into language-based audio retrieval with unconstrained textual descriptions.
This subtask allows the use of pre-trained models and external data for training. This includes pre-trained models for embedding extraction from audio and/or captions, and pre-optimized methods for natural language processing, such as part-of-speech (POS) tagging. Additionally, participants can use external audio and/or textual data, e.g., an external text corpus for learning a language model, or additional audio data such as AudioSet or ESC-50.
Audio dataset
The development dataset for this subtask is Clotho v2, the same as in the DCASE 2021 Challenge (task 6), but repurposed for language-based retrieval. Specifically, each caption will be treated as a text query, and non-corresponding audio files as non-relevant retrieved items when measuring the retrieval performance.
The Clotho v2 dataset consists of audio samples of 15 to 30 seconds duration, with each audio sample having five captions of eight to 20 words in length. There is a total of 6974 audio samples, with 34870 captions (i.e., 6974 audio samples × 5 captions per sample). All audio samples are from the Freesound platform, and captions are crowd-sourced using Amazon Mechanical Turk and a three-step framework. For complete details on the data recording and processing, see:
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.
Clotho: an Audio Captioning Dataset
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
The data collection of Clotho received funding from the European Research Council, grant agreement 637422 EVERYSOUND.
Task setup
Development dataset
The Clotho v2 dataset is currently divided into a development split of 3839 audio clips with 19195 captions, a validation split of 1045 audio clips with 5225 captions, and an evaluation split of 1045 audio clips with 5225 captions. These splits are created by first constructing the sets of unique words of the captions of each audio clip. These sets of words are combined to form the bag of words of the whole dataset, from which the frequency of a given word can be derived. With the unique words of audio files as classes, multi-label stratification is applied. More information on the splits of Clotho v2 can be found here.
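As an illustration of this split construction, the sketch below (in Python, with hypothetical input data) collects the unique caption words of each clip and the word frequencies over the whole dataset, and builds the multi-label matrix that a stratification method would operate on. Clotho used its own multi-label stratification procedure, so this is not the exact code behind the splits.

```python
# Sketch of the first step of the split construction described above:
# per-clip sets of unique caption words and dataset-wide word frequencies.
from collections import Counter

# Hypothetical input: clip file name -> its captions.
captions_per_clip = {
    "clip_0001.wav": ["a dog barks in the distance", "barking echoes outdoors"],
}

# Set of unique words appearing in the captions of each audio clip.
unique_words_per_clip = {
    clip: {word for caption in captions for word in caption.lower().split()}
    for clip, captions in captions_per_clip.items()
}

# Bag of words of the whole dataset: in how many clips' captions each word appears.
word_frequency = Counter(
    word for words in unique_words_per_clip.values() for word in words
)

# With the unique words as multi-label classes, each clip gets a binary label
# vector, which a multi-label stratification method can then split.
vocabulary = sorted(word_frequency)
label_matrix = [
    [int(word in unique_words_per_clip[clip]) for word in vocabulary]
    for clip in unique_words_per_clip
]
```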
Please note that the name of the splits for Clotho v2 differ from the DCASE terminology. To avoid confusion for participants, the correspondence of splits between Clotho v2 and DCASE challenge is:
Clotho naming of splits | DCASE Challenge naming of splits | DCASE Challenge set
development | training | development
validation | validation | development
evaluation | testing | development
For the rest of this text, the DCASE challenge terminology will be used. To differentiate between the splits of the Clotho data, the terms development-training, development-validation, and development-testing will be used wherever necessary: development-training refers to the Clotho development split, development-validation to the Clotho validation split, and development-testing to the Clotho evaluation split.
The development data can be found in the online Zenodo repository. Make sure that you download Clotho v2.1, as it includes some minor fixes to the dataset (corrected file naming and some corrupted files). A short sketch of loading the caption files is given after the file lists below.
Development-training data are:
- clotho_audio_development.7z: the development-training audio clips.
- clotho_captions_development.csv: the captions of the development-training audio clips.
- clotho_metadata_development.csv: the meta-data of the development-training audio clips.
Development-validation data are:
- clotho_audio_validation.7z: the development-validation audio clips.
- clotho_captions_validation.csv: the captions of the development-validation audio clips.
- clotho_metadata_validation.csv: the meta-data of the development-validation audio clips.
Development-testing data are:
- clotho_audio_evaluation.7z: the development-testing audio clips.
- clotho_captions_evaluation.csv: the captions of the development-testing audio clips.
- clotho_metadata_evaluation.csv: the meta-data of the development-testing audio clips.
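The caption files listed above are plain CSV files; a minimal loading sketch with pandas is shown below. The column layout (a file_name column plus caption_1 to caption_5) matches the released Clotho CSVs, while the local paths are assumptions.

```python
# Minimal sketch of reading the development-training captions after
# extracting the archives; the local path is an assumption.
import pandas as pd

captions = pd.read_csv("clotho_captions_development.csv")

# One (file_name, caption) text-query pair per caption column.
pairs = captions.melt(
    id_vars="file_name",
    value_vars=[f"caption_{i}" for i in range(1, 6)],
    value_name="caption",
)[["file_name", "caption"]]

print(len(captions), "clips,", len(pairs), "caption queries")
```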
Evaluation dataset
The evaluation dataset will consist of 1000 audio samples, each with one caption. The data are collected following the same procedure as for Clotho v2.
Task rules
Participants are allowed to:
- Use external data (e.g. audio files, text corpus, annotations).
- Use pre-trained models (e.g. text models like Word2Vec, audio tagging models, sound event detection models).
- Augment the development dataset (i.e. development-training and development-testing) with or without the use of external data.
- Use all the available metadata provided, but they must explicitly state that they do so in their submission. This will not affect the rating of their method.
Participants are NOT allowed to:
- Make subjective judgments of the evaluation data, nor annotate it.
- Use additional information about the evaluation data for their method, apart from the audio files and captions provided in the evaluation data.
Submission
All participants should submit:
- the output of their audio retrieval with the evaluation data (*.csv file),
- metadata for their submission (*.yaml file), and
- a technical report for their submission (*.pdf file).
We allow up to 4 system output submissions per participant/team.
For each system, metadata should be provided in a separate file, containing the task specific information.
All files should be packaged into a zip file for submission.
Please make a clear connection between the system name in the submitted metadata (the .yaml file), the submitted system output (the .csv file), and the technical report (the .pdf file)!
For indicating the connection of your files, you can consider using the following naming convention:
<author>_<institute>_task6b_submission_<submission_index>_<output or metadata or report>.<csv or yaml or pdf>
For example:
xie_tau_task6b_submission_1_output.csv
xie_tau_task6b_submission_1_metadata.yaml
xie_tau_task6b_submission_1_report.pdf
The field <submission_index> is used to differentiate your submissions in case you have multiple submissions.
System output file
The system output file should be a *.csv file, and it should have the following 11 columns:
- caption: the query caption.
- file_name_1: file name of the audio file that is most relevant to the query caption in caption.
- file_name_2: file name of the audio file that is 2nd most relevant to the query caption in caption.
- file_name_3: file name of the audio file that is 3rd most relevant to the query caption in caption.
- file_name_4: file name of the audio file that is 4th most relevant to the query caption in caption.
- file_name_5: file name of the audio file that is 5th most relevant to the query caption in caption.
- file_name_6: file name of the audio file that is 6th most relevant to the query caption in caption.
- file_name_7: file name of the audio file that is 7th most relevant to the query caption in caption.
- file_name_8: file name of the audio file that is 8th most relevant to the query caption in caption.
- file_name_9: file name of the audio file that is 9th most relevant to the query caption in caption.
- file_name_10: file name of the audio file that is 10th most relevant to the query caption in caption.
Example output:
caption, file_name_1, file_name_2, file_name_3, file_name_4, file_name_5, file_name_6, file_name_7, file_name_8, file_name_9, file_name_10
The person is rummaging through the pans while looking for something, fn1.wav, fn2.wav, fn3.wav, fn4.wav, fn5.wav, fn6.wav, fn7.wav, fn8.wav, fn9.wav, fn10.wav
In the example output, "fn2.wav" is the file name of the audio file that is 2nd most relevant to the query caption "The person is rummaging through the pans while looking for something".
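As a quick illustration of producing this file programmatically, here is a minimal Python sketch; the retrieved dictionary and the output file name are hypothetical and only mirror the example above.

```python
# Minimal sketch of writing the system output file in the required
# 11-column format. `retrieved` (query caption -> ranked file names) is a
# hypothetical result produced by your retrieval system.
import csv

retrieved = {
    "The person is rummaging through the pans while looking for something":
        [f"fn{i}.wav" for i in range(1, 11)],
}

with open("xie_tau_task6b_submission_1_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["caption"] + [f"file_name_{i}" for i in range(1, 11)])
    for caption, file_names in retrieved.items():
        writer.writerow([caption] + file_names[:10])
```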
Metadata file
For each system, metadata should be provided in a separate file. The file format should be as indicated below.
# Submission information for task 6 - subtask B
submission:
  # Submission label
  # Label is used to index submissions.
  # Generate your label in the following way to avoid
  # overlapping codes among submissions:
  # [Last name of corresponding author]_[Abbreviation of institute of the corresponding author]_task[task number]_[index number of your submission (1-4)]
  label: xie_tau_task6b_1
  #
  # Submission name
  # This name will be used in the results tables when space permits
  name: DCASE2022 baseline system
  #
  # Submission name abbreviated
  # This abbreviated name will be used in the results table when space is tight.
  # Use a maximum of 10 characters.
  abbreviation: Baseline
  # Authors of the submitted system. Mark authors in
  # the order you want them to appear in submission lists.
  # One of the authors has to be marked as corresponding author,
  # this will be listed next to the submission in the results tables.
  authors:
    # First author
    - lastname: Xie
      firstname: Huang
      email: huang.xie@tuni.fi # Contact email address
      corresponding: true # Mark true for one of the authors
      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences # Optional
        location: Tampere, Finland
    # Second author
    - lastname: Lipping
      firstname: Samuel
      email: samuel.lipping@tuni.fi # Contact email address
      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences # Optional
        location: Tampere, Finland
    # Third author
    - lastname: Virtanen
      firstname: Tuomas
      email: tuomas.virtanen@tuni.fi
      # Affiliation information for the author
      affiliation:
        abbreviation: TAU
        institute: Tampere University
        department: Computing Sciences
        location: Tampere, Finland
# System information
system:
  # System description, meta-data provided here will be used to do
  # meta analysis of the submitted system.
  # Use general level tags, when possible use the tags provided in comments.
  # If information field is not applicable to the system, use "!!null".
  description:
    # Audio input / sampling rate
    # e.g. 16kHz, 22.05kHz, 44.1kHz, 48.0kHz
    input_sampling_rate: 44.1kHz
    # Acoustic representation
    # Here you should indicate what kind of audio representation
    # you used. If your system used hand-crafted features (e.g.
    # mel band energies), then you can do:
    #
    # `acoustic_features: mel energies`
    #
    # Else, if you used some pre-trained audio feature extractor,
    # you can indicate the name of the system, for example:
    #
    # `acoustic_features: audioset`
    acoustic_features: log-mel energies
    # Word embeddings
    # Here you can indicate how you treated word embeddings.
    # If your method learned its own word embeddings (i.e. you
    # did not use any pre-trained word embeddings) then you can
    # do:
    #
    # `word_embeddings: learned`
    #
    # Else, specify the pre-trained word embeddings that you used
    # (e.g. Word2Vec, BERT, etc).
    word_embeddings: Word2Vec
    # Data augmentation methods
    # e.g. mixup, time stretching, block mixing, pitch shifting, ...
    data_augmentation: !!null
    # Method scheme
    # Here you should indicate the scheme of the method that you
    # used. For example:
    machine_learning_method: cross-modal alignment
    # Learning scheme
    # Here you should indicate the learning scheme.
    # For example, you could specify either
    # supervised, self-supervised, or even
    # reinforcement learning.
    learning_scheme: self-supervised
    # Ensemble
    # Here you should indicate if you used an ensemble
    # of systems or not.
    ensemble: No
    # Audio modelling
    # Here you should indicate the type of system used for
    # audio modelling. For example, if you used some stacked CNNs, then
    # you could do:
    #
    # audio_modelling: cnn
    #
    # If you used some pre-trained system for audio modelling,
    # then you should indicate the system used (e.g. COALA, COLA,
    # transformer).
    audio_modelling: crnn
    # Word modelling
    # Similarly, here you should indicate the type of system used
    # for word modelling. For example, if you used some RNNs,
    # then you could do:
    #
    # word_modelling: rnn
    #
    # If you used some pre-trained system for word modelling,
    # then you should indicate the system used (e.g. transformer).
    word_modelling: word2vec
    # Loss function
    # Here you should indicate the loss function that you employed.
    loss_function: triplet loss
    # Optimizer
    # Here you should indicate the name of the optimizer that you
    # used.
    optimizer: adam
    # Learning rate
    # Here you should indicate the learning rate of the optimizer
    # that you used.
    learning_rate: 1e-3
    # Gradient clipping
    # Here you should indicate if you used any gradient clipping.
    # You do this by indicating the value used for clipping. Use
    # 0 for no clipping.
    gradient_clipping: 0
    # Gradient norm
    # Here you should indicate the norm of the gradient that you
    # used for gradient clipping. This field is used only when
    # gradient clipping has been employed.
    gradient_norm: !!null
    # Metric monitored
    # Here you should report the metric monitored
    # for optimizing your method. For example, did you
    # monitor the loss on the validation data (i.e. validation
    # loss)? Or did you monitor the SPIDEr metric? Maybe the training
    # loss?
    metric_monitored: validation_loss
  # System complexity, meta-data provided here will be used to evaluate
  # submitted systems from the computational load perspective.
  complexity:
    # Total amount of parameters used in the acoustic model.
    # For neural networks, this information is usually given before the training process
    # in the network summary.
    # For other than neural networks, if parameter count information is not directly
    # available, try estimating the count as accurately as possible.
    # In case of ensemble approaches, add up parameters for all subsystems.
    # In case embeddings are used, add up parameter count of the embedding
    # extraction networks and classification network.
    # Use numerical value (do not use comma for thousands-separator).
    total_parameters: 732354
  # List of datasets used for training the system.
  # Development-training data is used here only as example.
  training_datasets:
    - name: Clotho-development
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Number of audio clips per caption
      nb_clips_per_caption: 1
      # Total duration (in seconds) of audio used
      total_audio_length: 86353
      # Total amount of captions used
      total_captions: 3839
  # List of datasets used for validating the system, for example, optimizing hyperparameters.
  # Development-validation data is used here only as example.
  validation_datasets:
    - name: Clotho-validation
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Number of audio clips per caption
      nb_clips_per_caption: 1
      # Total duration (in seconds) of audio used
      total_audio_length: 23636
      # Total amount of captions used
      total_captions: 1045
  # List of external datasets used in the submission.
  # Development dataset is used here only as example, list only external datasets.
  external_datasets:
    # Dataset name
    - name: Clotho
      # Dataset access url
      url: https://doi.org/10.5281/zenodo.4783391
      # Has audio:
      has_audio: Yes
      # Has images
      has_images: No
      # Has video
      has_video: No
      # Has captions
      has_captions: Yes
      # Number of captions per audio
      nb_captions_per_audio: 5
      # Number of audio clips per caption
      nb_clips_per_caption: 1
      # Total duration (in seconds) of audio used
      total_audio_length: 133442
      # Total amount of captions used
      total_captions: 29645
      # Used for (e.g. audio_modelling, word_modelling, audio_and_word_modelling)
      used_for: audio_and_word_modelling
  # URL to the source code of the system [optional]
  source_code: https://github.com/xieh97/dcase2022-audio-retrieval
# System results
results:
  development_testing:
    # System results for the development-testing split.
    # Full results are not mandatory, however, they are highly recommended
    # as they are needed for thorough analysis of the challenge submissions.
    # If you are unable to provide all results, also incomplete
    # results can be reported.
    R@1: 0.03
    R@5: 0.11
    R@10: 0.19
    mAP@10: 0.07
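Before packaging your submission, it can be worth checking that the metadata file parses as valid YAML. The short sketch below does this with PyYAML (an assumption, not a requirement of the challenge), using the example file name from the naming convention above.

```python
# Sanity check that a submission metadata file is valid YAML and contains
# the expected top-level sections. Assumes PyYAML is installed
# (pip install pyyaml); the file name is the example from above.
import yaml

with open("xie_tau_task6b_submission_1_metadata.yaml", "r", encoding="utf-8") as f:
    meta = yaml.safe_load(f)

for section in ("submission", "system", "results"):
    assert section in meta, f"missing top-level section: {section}"

print(meta["submission"]["label"])                      # e.g. xie_tau_task6b_1
print(meta["system"]["description"]["loss_function"])   # e.g. triplet loss
```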
Open and reproducible research
Finally, to support open and reproducible research, we kindly ask each participant/team to consider making the code of their method (e.g., on GitHub) and their pre-trained models available after the challenge is over.
Evaluation
The submitted systems will be evaluated according to their retrieval performance, i.e., recall at K (R@K) and mean average precision at K (mAP@K), on the withheld evaluation dataset. An explanation of these metrics can be found here. Specifically, the following metrics will be reported for every submitted method:
- R@1: recall score among the top-1 retrieved result, averaged across all caption queries.
- R@5: recall score among the top-5 retrieved results, averaged across all caption queries.
- R@10: recall score among the top-10 retrieved results, averaged across all caption queries.
- mAP@10: average precision among the top-10 retrieved results, averaged across all caption queries.
R@K is a rank-unaware evaluation metric: it measures the proportion of the relevant audio files in the evaluation dataset for a caption query that appear among the top-K retrieved results. mAP@K is a rank-aware metric: it is the mean, over all caption queries, of the average precision of the relevant audio files among the top-K retrieved results. Submitted methods will be ranked by the mAP@10 metric.
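For concreteness, a minimal Python sketch of these metrics is given below. It assumes each query comes with a ranked list of file names and a set of relevant file names (for Clotho-style queries this set contains exactly one file), and it is an illustration rather than the official evaluation code.

```python
# Sketch of R@K and mAP@K for ranked retrieval results.
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Proportion of the relevant files that appear in the top-k results.
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Average of the precision values at the ranks where relevant files occur.
    hits, score = 0, 0.0
    for rank, file_name in enumerate(ranked[:k], start=1):
        if file_name in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


def evaluate(results: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]) -> Dict[str, float]:
    queries = list(results)
    return {
        "R@1": sum(recall_at_k(results[q], ground_truth[q], 1) for q in queries) / len(queries),
        "R@5": sum(recall_at_k(results[q], ground_truth[q], 5) for q in queries) / len(queries),
        "R@10": sum(recall_at_k(results[q], ground_truth[q], 10) for q in queries) / len(queries),
        "mAP@10": sum(average_precision_at_k(results[q], ground_truth[q], 10) for q in queries) / len(queries),
    }
```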
Results
Complete results and technical reports can be found on the results page.
Baseline system
To provide a starting point and some initial results for the challenge, there is a baseline system for the language-based audio retrieval subtask.
The baseline system is a simplified version of the audio-text aligning framework presented in this paper, and it calculates relevance scores between encoded textual descriptions and encoded audio signals. A convolutional recurrent neural network (CRNN) is used as the audio encoder; it extracts frame-wise acoustic embeddings from audio signals, and an audio signal is then represented by the average of its frame-wise acoustic embeddings. A pre-trained Word2Vec model (published here by Google) is used as the text encoder, converting textual descriptions into sequences of word embeddings; a textual description is represented by the average of its word embeddings. The relevance score between an audio signal and a textual description is calculated as the dot product of their vector representations. The baseline system is optimized with a triplet ranking loss criterion.
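The sketch below illustrates this scoring scheme in PyTorch. It is not the baseline implementation (see the repository linked further down); the CRNN audio encoder is replaced by a placeholder module so that only the mean-pooling and dot-product scoring are shown, and the dimensions are illustrative.

```python
# Minimal sketch of mean-pooled audio/text embeddings scored by dot product.
import torch
import torch.nn as nn


class DotProductRetrieval(nn.Module):
    def __init__(self, n_mels: int = 64, embed_dim: int = 300):
        super().__init__()
        # Placeholder for the CRNN audio encoder: maps log-mel frames
        # (batch, time, n_mels) to frame-wise embeddings (batch, time, embed_dim).
        self.audio_encoder = nn.Sequential(nn.Linear(n_mels, embed_dim), nn.ReLU())

    def forward(self, log_mel: torch.Tensor, word_embs: torch.Tensor) -> torch.Tensor:
        # Audio clip embedding: average of frame-wise embeddings.
        audio_vec = self.audio_encoder(log_mel).mean(dim=1)   # (batch, embed_dim)
        # Caption embedding: average of pre-trained word embeddings.
        text_vec = word_embs.mean(dim=1)                      # (batch, embed_dim)
        # Relevance scores for every audio-caption pair in the batch.
        return audio_vec @ text_vec.t()                       # (batch, batch)
```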
For more details on the audio-text aligning framework, please see
Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 8867–8871. 2022.
Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases
Abstract
We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.
For details on the CRNN audio encoder, please see
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.
A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning
Abstract
Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).
For the triplet ranking loss used to train the baseline system, please see
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.
Signature Verification Using a "Siamese" Time Delay Neural Network
Abstract
This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a 'Siamese' neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.
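As an illustration, a margin-based triplet ranking loss over a batch score matrix (such as the one produced by the earlier sketch) can be written as below; the exact formulation used by the baseline may differ, so treat this as a sketch.

```python
# Sketch of a margin-based triplet ranking loss over a batch score matrix,
# where diagonal entries are matched audio-caption pairs and off-diagonal
# entries are mismatched ones.
import torch


def triplet_ranking_loss(scores: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    pos = scores.diag().unsqueeze(1)                       # matched-pair scores, (batch, 1)
    # Hinge terms for mismatched captions (per audio) and mismatched audio (per caption).
    cost_caption = (margin + scores - pos).clamp(min=0)
    cost_audio = (margin + scores - pos.t()).clamp(min=0)
    # Do not penalize the matched pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_caption = cost_caption.masked_fill(mask, 0)
    cost_audio = cost_audio.masked_fill(mask, 0)
    return (cost_caption + cost_audio).mean()
```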
For the pre-trained Word2Vec, see the Repository.
The PyTorch implementation of the baseline system is freely available online and can be found on GitHub.
Results for the development dataset
The results of the baseline system on the development-testing split are shown below. Additionally, jackknife resampling is used to estimate the 95% confidence intervals. For more information on jackknife resampling, please check this paper.
Metric | Value with 95% confidence interval
R@1 | 0.03 (0.03 - 0.04)
R@5 | 0.11 (0.10 - 0.12)
R@10 | 0.19 (0.18 - 0.20)
mAP@10 | 0.07 (0.06 - 0.07)
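For reference, a minimal sketch of how jackknife (leave-one-out) resampling can produce such confidence intervals for a metric averaged over per-query scores is given below; it is illustrative only, and the organizers' exact procedure may differ.

```python
# Sketch of a jackknife (leave-one-out) 95% confidence interval for a
# metric that is the mean of per-query scores (e.g. per-query R@10).
import numpy as np
from scipy import stats


def jackknife_ci(per_query_scores: np.ndarray, confidence: float = 0.95):
    n = len(per_query_scores)
    total = per_query_scores.sum()
    # Leave-one-out means: the metric recomputed with each query removed.
    loo_means = (total - per_query_scores) / (n - 1)
    estimate = per_query_scores.mean()
    # Jackknife standard error of the mean estimate.
    std_err = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))
    t_val = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return estimate, (estimate - t_val * std_err, estimate + t_val * std_err)
```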
Citations
If you participate in this task, you might want to check the following papers. If you find a paper that needs to be cited here, please contact us and report it.
- The task summary:
Huang Xie, Samuel Lipping, and Tuomas Virtanen. Language-based audio retrieval task in DCASE 2022 Challenge. Accepted at DCASE 2022 Workshop, 2022.
Language-based Audio Retrieval Task in DCASE 2022 Challenge
Abstract
Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset. It has been first introduced into DCASE 2022 Challenge as Subtask 6B of task 6, which aims at developing computational systems to model relationships between audio signals and free-form textual descriptions. Compared with audio captioning (Subtask 6A), which is about generating audio captions for audio signals, language-based audio retrieval (Subtask 6B) focuses on ranking audio signals according to their relevance to natural language textual captions. In DCASE 2022 Challenge, the provided baseline system for Subtask 6B was significantly outperformed, with top performance being 0.276 in mAP@10. This paper presents the outcome of Subtask 6B in terms of submitted systems' performance and analysis.
- The Clotho dataset:
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 736–740. 2020.
Clotho: an Audio Captioning Dataset
Abstract
Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (https://zenodo.org/record/3490684).
- The original audio-text aligning framework:
Huang Xie, Okko Räsänen, Konstantinos Drossos, and Tuomas Virtanen. Unsupervised audio-caption aligning learns correspondences between individual sound events and textual phrases. In Proc. IEEE Int. Conf. Acoustic., Speech and Signal Process. (ICASSP), 8867–8871. 2022.
Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases
Abstract
We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip. We align originally unaligned and unannotated audio clips and their captions by scoring the similarities between audio frames and words, as encoded by modality-specific encoders and using a ranking-loss criterion to optimize the model. After training, we obtain clip-caption similarity by averaging frame-word similarities and estimate event-phrase correspondences by calculating frame-phrase similarities. We evaluate the method with two cross-modal tasks: audio-caption retrieval, and phrase-based sound event detection (SED). Experimental results show that the proposed method can globally associate audio clips with captions as well as locally learn correspondences between individual sound events and textual phrases in an unsupervised manner.
- The CRNN audio encoder, used for the baseline system:
Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. A CRNN-GRU based reinforcement learning approach to audio captioning. In Proc. Detect. Classif. Acoust. Scenes Events Work. (DCASE), 225–229. 2020.
A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning
Abstract
Audio captioning aims at generating a natural sentence to describe the content in an audio clip. This paper proposes the use of a powerful CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy, reinforcement learning is also investigated for generating richer and more accurate captions. Our approach significantly improves against the baseline model on all shown metrics achieving a relative improvement of at least 34%. Results indicate that our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, the performance is further boosted to 0.223. In the DCASE challenge Task 6 we ranked fourth based on SPIDEr, second on 5 metrics including BLEU, ROUGE-L and METEOR, without ensemble or data augmentation while maintaining a small model size (only 5 Million parameters).
- The triplet ranking loss, used for the baseline system:
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Proc. 6th Int. Conf. Neural Inf. Process. Syst. (NIPS), 737–744. 1993.
Signature Verification Using a "Siamese" Time Delay Neural Network
Abstract
This paper describes an algorithm for verification of signatures written on a pen-input tablet. The algorithm is based on a novel, artificial neural network, called a 'Siamese' neural network. This network consists of two identical sub-networks joined at their outputs. During training the two sub-networks extract features from two signatures, while the joining neuron measures the distance between the two feature vectors. Verification consists of comparing an extracted feature vector with a stored feature vector for the signer. Signatures closer to this stored representation than a chosen threshold are accepted, all other signatures are rejected as forgeries.
- The jackknife resampling, used for estimating the 95% confidence intervals: