we opted for the poclbm kernel. The optimized test used a modified CUDA-capable kernel. We let this kernel auto-configure for the Nvidia GeForce cards, but we also tested various manual settings for the number of threads and the grid size. Hand-tuning these options failed to meaningfully ...