The BERT-based pre-trained model has 12 layers with a hidden size of 768 and 12 self-attention heads, and it learns deep bidirectional representations (repr.) from unlabeled text by jointly conditioning on both the left and right contexts in every layer. Since BNShCNs consider multiple semantics simultaneously, a ...
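To make the stated configuration concrete, the following is a minimal sketch assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is named in the text); it only illustrates the 12-layer, 768-dimensional, 12-head setup and the bidirectional contextual outputs described above, not the BNShCN model itself.

```python
# A minimal sketch (assumed library: Hugging Face transformers; assumed checkpoint: bert-base-uncased).
from transformers import BertConfig, BertModel, BertTokenizer

# Inspect the standard BERT-base configuration described in the text.
config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)    # 12 Transformer encoder layers
print(config.hidden_size)          # 768-dimensional hidden states
print(config.num_attention_heads)  # 12 self-attention heads per layer

# Encode a sentence; every token receives a deep bidirectional representation
# conditioned on both its left and right context at each layer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT produces contextual token representations.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```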