This is necessary because in speech input and output are of different modalities meaning that they should not be treated by the same padding function. Analogous to the common data collators, the padding tokens in the labels with -100 so that those tokens are not taken into account...
We understand that accent markers have substantial meaning in some languages, but felt that the benefits of reducing the effective vocabulary make up for this. Generally the strong contextual models of BERT should make up for any ambiguity introduced by stripping accent markers. ### List of ...
This is necessary because in speech input and output are of different modalities meaning that they should not be treated by the same padding function. Analogous to the common data collators, the padding tokens in the labels with -100 so that those tokens are not taken into account ...
In contrast to most NLP models, Wav2Vec2-BERT has a much larger input length than output length. Given the large input sizes, it is much more efficient to pad the training batches dynamically meaning that all training samples should only be padded to the longest sample in their...
does not. Also in order to understand the meaning of a speech signal, it is usually not necessary to include special characters in the transcription. Let's simply remove all characters that don't contribute to the meaning of a word and cannot really be represented by an acoustic ...