As with the RP embeddings, the spectrograms of the three modes are combined at all four levels mentioned above. The input-level fusion uses stacked spectrograms of 432 \(\times\) 288 \(\times\) 3 pixels as input to the ResNet-50 model. For the feature-level fusion, ResNet-50 ...
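
The stacking step of the input-level fusion can be illustrated with a minimal sketch. It assumes that each mode yields one single-channel spectrogram image and that the three images are placed along the channel axis; the function name `stack_spectrograms`, the reading of 432 as height and 288 as width, and the random placeholder data are illustrative assumptions, not details taken from the text.

```python
import numpy as np

# Assumption: the text gives 432 x 288 x 3; which of 432 and 288 is the
# height and which is the width is not stated, so this ordering is a guess.
HEIGHT, WIDTH = 432, 288


def stack_spectrograms(spec_a, spec_b, spec_c):
    """Stack three single-channel spectrograms into one 3-channel array."""
    specs = [np.asarray(s, dtype=np.float32) for s in (spec_a, spec_b, spec_c)]
    for s in specs:
        if s.shape != (HEIGHT, WIDTH):
            raise ValueError(f"expected shape {(HEIGHT, WIDTH)}, got {s.shape}")
    # One mode per channel, channels on the last axis.
    return np.stack(specs, axis=-1)


# Placeholder data standing in for the spectrograms of the three modes.
rng = np.random.default_rng(0)
modes = [rng.random((HEIGHT, WIDTH)) for _ in range(3)]
fused = stack_spectrograms(*modes)
print(fused.shape)
```

The resulting array has one mode per channel, matching the three-channel input that an image network such as ResNet-50 expects; any resizing or normalization applied before the network is not specified here.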