Copying between host and device: start with the cudaMallocAndMemcpy template.
The first part: allocate device memory for the pointers d_a and d_b.
The second part: copy h_a on the host to d_a on the device.
The third part: copy d_a to d_b, both on the device.
The fourth part: copy d_b on the device back to h_a on the host.
The fifth part: free the device memory that d_a and d_b point to.
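
The five parts above can be sketched roughly as follows. This is a minimal illustration, not the template's actual source: the helper names check and roundTripCopy, the element type int, and the step that clears h_a before the copy back are assumptions made for this example. Only the CUDA runtime calls cudaMalloc, cudaMemcpy, and cudaFree are taken as given.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the template): abort if a CUDA call fails.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical function: runs the five parts on n ints and returns the
// number of elements that did not survive the round trip.
int roundTripCopy(int n) {
    size_t bytes = (size_t)n * sizeof(int);

    // Host buffer h_a, filled with a known pattern.
    int *h_a = (int *)malloc(bytes);
    if (h_a == nullptr) {
        fprintf(stderr, "malloc failed\n");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < n; ++i) h_a[i] = i;

    // Part 1: allocate device memory for d_a and d_b.
    int *d_a = nullptr;
    int *d_b = nullptr;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, bytes), "cudaMalloc d_b");

    // Part 2: host to device, h_a -> d_a.
    check(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice), "copy h_a to d_a");

    // Part 3: device to device, d_a -> d_b.
    check(cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice), "copy d_a to d_b");

    // Assumed extra step: clear h_a so the check below proves the data
    // really came back from the device.
    for (int i = 0; i < n; ++i) h_a[i] = 0;

    // Part 4: device to host, d_b -> h_a.
    check(cudaMemcpy(h_a, d_b, bytes, cudaMemcpyDeviceToHost), "copy d_b to h_a");

    // Count elements that differ from the original pattern.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_a[i] != i) ++mismatches;
    }

    // Part 5: free the device memory, then the host buffer.
    check(cudaFree(d_a), "cudaFree d_a");
    check(cudaFree(d_b), "cudaFree d_b");
    free(h_a);

    return mismatches;
}

int main() {
    int bad = roundTripCopy(8);
    printf(bad == 0 ? "Correct!\n" : "Mismatch after round trip\n");
    return bad == 0 ? 0 : 1;
}
```

Assuming the file is saved as cudaMallocAndMemcpy.cu, it can be built with nvcc cudaMallocAndMemcpy.cu -o cudaMallocAndMemcpy. Note that the direction argument of each cudaMemcpy call must match where the source and destination pointers actually live.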