...Popcount assembly / sum indexes of set bits - Stack Overflow
_MM_SHUFFLE(3,3, 1,1));  // pshufd or movhlps
weighted_nibblecounts = _mm_add_epi32(weighted_nibblecounts, counts01);  // add to the bottom dword of each qword, in parallel with pshufd latency
weighted_nibblecounts = _mm_add_...
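The fragment above is cut off at both ends, so the full SIMD routine is not recoverable from it. For checking any vectorized version, a plain scalar baseline is handy. The following is a minimal sketch, not the answer's code: the function names `sum_set_bit_indexes` and `sum_set_bit_indexes_selfcheck` are hypothetical, and it assumes bit index 0 is the least significant bit of a 64-bit input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: returns the sum of the positions of all
 * set bits in x, counting position 0 as the least significant bit
 * (an assumption; the original question may number bits differently). */
static unsigned sum_set_bit_indexes(uint64_t x)
{
    unsigned sum = 0;
    /* Walk the bits from low to high, stopping once no set bits remain. */
    for (unsigned idx = 0; x != 0; ++idx, x >>= 1) {
        if (x & 1)
            sum += idx;
    }
    return sum;
}

/* Small sanity checks for the reference itself. */
void sum_set_bit_indexes_selfcheck(void)
{
    assert(sum_set_bit_indexes(0) == 0);          /* no bits set */
    assert(sum_set_bit_indexes(0x5) == 2);        /* bits 0 and 2 */
    assert(sum_set_bit_indexes(UINT64_MAX) == 2016); /* 0 + 1 + ... + 63 */
}
```

A SIMD implementation can then be compared against `sum_set_bit_indexes` over random inputs. The largest possible result for a 64-bit input is 2016, so the total fits comfortably in a 32-bit element.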