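Audio Captioning-1: Task Introduction. 羊羊羊. Automatic Audio Captioning (AAC) is the task of describing audio in natural language. It may sound like a variant of Automatic Speech Recognition (ASR), but it is not. Take the utterance "你吃了没?" ("Have you eaten?") as an example: ASR targets the speech signal and converts it into natural-language text that corresponds to the speech word by word, syllable by syllable (...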
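In fact, when Konstantinos Drossos proposed the Automatic Audio Captioning task, he adopted this approach to provide a solution for the task as a baseline [1]. Likewise, for the Clotho dataset released by his team, Konstantinos Drossos used a GRU-based method to validate the dataset [2], and this method was provided to the DCASE challenge as the baseline for its Audio Captioning track. Method design: First, define the model's inputs and outputs...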
To start using the audio captioning DCASE 2020 baseline system, you first have to set up the code. Please note that the code in this repository is tested with Python 3.7. To set up the code, you have to do the following: Clone this repository. Use either pip or conda to install...
model = AutoModel.from_pretrained("wsntxxn/cnn14rnn-tempgru-audiocaps-captioning", trust_remote_code=True).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained("wsntxxn/audiocaps-simple-tokenizer")
wav, sr = torchaudio.load("/path/to/file.wav")
wav = torchaudio.functional.resample(wav, sr, model.confi...
Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using the semantic information improves the captioning performance. Keywords: Audio captioning; PANNs; VGGish...
Audio Captioning is the task of describing audio using text. The general approach is to use an audio encoder to encode the audio (example: PANN, CAV-MAE), and to use a decoder (example: transformer) to generate the text. To judge the quality of audio captions, though machine translation ...
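The encoder-decoder flow described above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `mean_pool_encoder`, `greedy_decode`, and `toy_step` are hypothetical names, the "encoder" is simple averaging rather than PANN or CAV-MAE, and the "decoder" is a fixed lookup table rather than a transformer. It shows the interface (audio frames in, tokens out one step at a time), not a real captioning model.

```python
from typing import Callable, Dict, List, Sequence

Frame = Sequence[float]


def mean_pool_encoder(frames: Sequence[Frame]) -> List[float]:
    """Toy stand-in for an audio encoder: average the frame features over time."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def greedy_decode(
    audio_embedding: List[float],
    step_fn: Callable[[List[float], List[str]], Dict[str, float]],
    max_len: int = 10,
) -> List[str]:
    """Generate tokens one at a time, always taking the highest-scoring one."""
    tokens: List[str] = ["<sos>"]
    for _ in range(max_len):
        scores = step_fn(audio_embedding, tokens)
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start-of-sequence marker


def toy_step(audio_embedding: List[float], prefix: List[str]) -> Dict[str, float]:
    """Hypothetical decoder step: a fixed lookup keyed on the previous token.

    A real decoder would condition on the audio embedding through attention
    or an initial hidden state; here the embedding only selects the noun.
    """
    noun = "dog" if audio_embedding[0] > 0.5 else "bird"
    table = {
        "<sos>": {"a": 1.0},
        "a": {noun: 1.0},
        "dog": {"barks": 1.0},
        "bird": {"sings": 1.0},
        "barks": {"<eos>": 1.0},
        "sings": {"<eos>": 1.0},
    }
    return table[prefix[-1]]


frames = [[0.9, 0.1], [0.7, 0.3]]  # two made-up 2-dimensional feature frames
embedding = mean_pool_encoder(frames)
caption = " ".join(greedy_decode(embedding, toy_step))
```

In a real system, `step_fn` would be a trained decoder returning scores over the full vocabulary, and beam search is often used in place of the greedy choice.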
However, CNNs can be limited in modelling temporal relationships among the time frames in an audio signal, while RNNs can be limited in modelling the long-range dependencies among the time frames. In this paper, we propose an Audio Captioning Transformer (ACT), which is a full Transformer ...
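To make the contrast with CNNs and RNNs concrete, the following is a minimal sketch of scaled dot-product self-attention, the standard building block of Transformers, applied to a short sequence of audio frames. It assumes identity projections (queries, keys, and values are the frames themselves) and is not the ACT authors' implementation, whose details are not given in this excerpt. The point it illustrates is that every output frame is a weighted mix of all input frames, regardless of how far apart they are in time.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames: Matrix) -> Matrix:
    """Scaled dot-product self-attention with Q = K = V = frames (an assumption)."""
    d = len(frames[0])
    out: Matrix = []
    for q in frames:
        # Score this frame against every frame in the sequence, near or far.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # The output is a weighted average of all frames.
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # made-up frame features
attended = self_attention(frames)
```

Here the first and third frames are identical, so they receive identical outputs even though they are not adjacent, which is the kind of long-range interaction the passage refers to.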
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need...
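The retrieval step described above, fetching captions similar to the input audio from a datastore, can be sketched as a nearest-neighbour search over embeddings. This is a hedged illustration only: the embeddings are made up, and `cosine`, `retrieve_captions`, and `build_prompt` are hypothetical helpers. In particular, the prompt format in `build_prompt` is an assumption, since the excerpt does not say how RECAP feeds the retrieved captions to its generator.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[str, Sequence[float]]  # (caption text, caption embedding)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_captions(audio_emb: Sequence[float], datastore: List[Entry], k: int = 2) -> List[str]:
    """Return the k captions whose embeddings are closest to the audio embedding."""
    ranked = sorted(datastore, key=lambda item: cosine(audio_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]


def build_prompt(retrieved: List[str]) -> str:
    """Hypothetical prompt format: the retrieved captions, then a slot for the new one."""
    return "Similar captions: " + "; ".join(retrieved) + "\nCaption for this audio:"


# Made-up datastore; a real one would hold embeddings from a shared audio-text space.
datastore: List[Entry] = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a roof", [0.0, 0.2, 0.9]),
    ("a puppy whines", [0.8, 0.3, 0.1]),
]
audio_emb = [1.0, 0.2, 0.0]  # made-up embedding of the input audio
retrieved = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(retrieved)
```

Because only the datastore changes between domains, swapping in captions from a new domain requires no retraining of the retrieval step, which is consistent with the transfer claim in the passage.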
Jan Berg: Continual Learning in Automated Audio Captioning. M.Sc. Thesis, Tampere University, Master's Degree Programme in Computer Science, November 2021. Teaching neural network models to classify new tasks and old tasks on new domains is a process in which a common problem is the forgetting of previou...
Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without any prior training for this task. Audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an a