intra-codeword parallelism to reduce the latency of the GPU-based BP decoder. To avoid synchronization overhead among thread blocks, each codeword is mapped to a single thread block rather than to multiple ones, since all L2R or R2L messages in the current stage should be up...
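
The one-codeword-per-thread-block mapping can be illustrated with a minimal CUDA sketch. It is not the decoder's actual kernel: the sizes `N` and `STAGES`, the names `stage_sweep` and `run_sweep`, and the averaging rule in `update` are hypothetical placeholders, and the real L2R/R2L message rules are not reproduced. The sketch assumes only that each stage reads messages written by the previous stage, so a block-level barrier (`__syncthreads()`) between stages is sufficient and no synchronization across thread blocks is needed.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sizes, chosen only for illustration.
constexpr int N = 8;       // codeword length (power of two)
constexpr int STAGES = 3;  // log2(N) stages in the factor graph

// Placeholder update with a butterfly dependency pattern. It stands in for
// the real L2R/R2L message rules, which this sketch does not implement.
__device__ float update(float a, float b) { return 0.5f * (a + b); }

// One thread block processes one codeword: blockIdx.x selects the codeword,
// and the threads of the block share the work inside each stage.
__global__ void stage_sweep(const float* in, float* out) {
    __shared__ float msg[2][N];  // double buffer: read one half, write the other
    const float* cw_in = in + blockIdx.x * N;
    float* cw_out = out + blockIdx.x * N;

    for (int i = threadIdx.x; i < N; i += blockDim.x) msg[0][i] = cw_in[i];
    __syncthreads();

    int cur = 0;
    for (int s = 0; s < STAGES; ++s) {
        for (int i = threadIdx.x; i < N; i += blockDim.x)
            msg[1 - cur][i] = update(msg[cur][i], msg[cur][i ^ (1 << s)]);
        // Block-level barrier: every message of stage s is written before
        // any thread starts stage s + 1. No inter-block barrier is required.
        __syncthreads();
        cur = 1 - cur;
    }

    for (int i = threadIdx.x; i < N; i += blockDim.x) cw_out[i] = msg[cur][i];
}

// Host helper: launches one block per codeword with N / 2 threads per block.
// host_in must hold num_codewords * N values.
std::vector<float> run_sweep(const std::vector<float>& host_in, int num_codewords) {
    size_t bytes = host_in.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice);
    stage_sweep<<<num_codewords, N / 2>>>(d_in, d_out);
    std::vector<float> host_out(host_in.size());
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Calling `run_sweep` with two codewords launches two independent thread blocks. Had a single codeword been split across several blocks, the barrier between stages would have to span the whole grid (for example a kernel relaunch per stage or a cooperative-groups grid synchronization), which is the overhead the single-block mapping avoids.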