Predicted utterances are then wrapped with special tokens, and a pre-trained summarizer is fine-tuned on the augmented input. As depicted in Fig. 1, we postulate that pointing out the meaningful source spans allows the generative model to better direct its attention to them; learning such ...
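The span-wrapping step above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the special-token names (`<hl>`, `</hl>`) and the helper name are assumptions.

```python
# Hypothetical sketch of wrapping predicted salient utterances with
# special tokens before fine-tuning the summarizer; token names are
# assumptions, not the paper's actual vocabulary.
def wrap_utterances(dialogue, salient_indices, open_tok="<hl>", close_tok="</hl>"):
    """Wrap the utterances predicted as salient with special tokens,
    leaving the remaining utterances unchanged."""
    return [
        f"{open_tok} {utt} {close_tok}" if i in salient_indices else utt
        for i, utt in enumerate(dialogue)
    ]

dialogue = ["A: hi", "B: the meeting moved to 3pm", "A: ok thanks"]
augmented = wrap_utterances(dialogue, salient_indices={1})
# augmented[1] == "<hl> B: the meeting moved to 3pm </hl>"
```

The augmented sequence is then fed to the pre-trained summarizer as its fine-tuning input.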
In addition, one can also directly concatenate the tokens of aligned features to the frame tokens, resulting in 2× the number of tokens in spatial attention. The results are shown in Table 9. It can be observed that both element-wise addition and direct concatenation perfor...
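The two fusion schemes compared above can be sketched as follows. This is a minimal illustration under assumed shapes (N frame tokens of dimension d, with one aligned feature token per frame token); the function names are hypothetical.

```python
import numpy as np

def fuse_add(frame_tokens, aligned):
    """Element-wise addition: the token count is unchanged (N tokens)."""
    return frame_tokens + aligned

def fuse_concat(frame_tokens, aligned):
    """Direct concatenation along the token axis: the sequence fed to
    spatial attention doubles in length (2N tokens)."""
    return np.concatenate([frame_tokens, aligned], axis=0)

frames = np.random.randn(16, 64)   # N=16 frame tokens, d=64
feats = np.random.randn(16, 64)    # aligned feature tokens

assert fuse_add(frames, feats).shape == (16, 64)     # same token count
assert fuse_concat(frames, feats).shape == (32, 64)  # 2x token count
```

The doubled sequence length under concatenation is what increases the cost of spatial attention relative to element-wise addition.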