Selecting targets
- Transferring data from scalar to vector registers is expensive (pre-AVX2)
- We get around this by separately reading values four at a time as vectors but also individually as scalars
- This naturally leads to a little loop unrolling: one loop iteration finds four targets, and the index (mod 4) is hardcoded