Hi,
I was able to write code in C and CUDA that allocates page-locked (pinned) memory and transfers data from host to device at a much higher speed. Now I'm trying to port it to Java using JCuda, but I'm not sure how. I have followed other posts here, like the one on Mapped Memory. The problem is how to use the pitch (stride) the way I did in my C code:
float * h_instances = NULL;
cudaMallocHost((void **) &h_instances, NUM_INSTANCES * NUM_ATT * sizeof(float));
//....
// Load some data into h_instances
//....
float *d_instances;
size_t d_pitchBytes;
size_t h_pitchBytes = NUM_INSTANCES * sizeof(float);
cudaMallocPitch((void **) &d_instances, &d_pitchBytes, NUM_INSTANCES * sizeof(float), NUM_ATT);
// Then copy to device
cudaMemcpy2D(d_instances, d_pitchBytes, h_instances, h_pitchBytes, NUM_INSTANCES * sizeof(float), NUM_ATT, cudaMemcpyHostToDevice);
For the C code I followed the example at: https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc
In Java I did:
Pointer h_instances = new Pointer();
JCuda.cudaHostAlloc(h_instances, NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT, JCuda.cudaHostAllocDefault);
Pointer d_instances = new Pointer();
JCuda.cudaHostGetDevicePointer(d_instances, h_instances, 0); // Here d_instances has the same size as h_instances, but I would like it padded for coalesced memory access.
FloatBuffer floatBuffer = h_instances.getByteBuffer(0, NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
// ....
// Add data to floatBuffer (will this data be available on the device???)
// ....
// The code below is what I did without using pinned memory...
// I need the pitch to properly coalesce memory accesses...
// long[] d_pitchBytes = new long[1];
// long h_pitchBytes = NUM_INSTANCES * Sizeof.FLOAT;
// Transposed to coalesce memory access
// JCuda.cudaMallocPitch(d_instances, d_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT);
// JCuda.cudaMemcpy2D(d_instances, d_pitchBytes[0], h_instances, h_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT, cudaMemcpyKind.cudaMemcpyHostToDevice);
But how can I do it? I mean, allocate pinned memory that takes the pitch (stride) into account? In the C code this is very transparent, since I don't need to consider the pitch outside the device code when iterating over the data. I'm thinking of padding the host array manually, but I'm not comfortable doing that. Is there another way to create pinned memory in JCuda that takes the stride into consideration when allocating? Am I doing something wrong?
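Just to make clear what I mean by "padding the host array manually", here is a small pure-Java sketch of it (no CUDA calls): each row of a tightly-packed NUM_INSTANCES x NUM_ATT array is copied into a buffer whose row stride is rounded up to an alignment. The 128-byte alignment and the helper names `paddedRowBytes`/`padRows` are just my own illustration; the real pitch would be whatever cudaMallocPitch reports in d_pitchBytes. This is the bookkeeping I would like to avoid doing by hand:

```java
// Sketch of manually padding a tightly-packed host array into a pitched
// (row-padded) layout. Alignment of 128 bytes is only an assumption here;
// the actual pitch comes from cudaMallocPitch.
public class PitchPadding {

    // Round a row of 'widthBytes' up to the next multiple of 'alignment' bytes.
    static int paddedRowBytes(int widthBytes, int alignment) {
        return ((widthBytes + alignment - 1) / alignment) * alignment;
    }

    // Copy a tightly-packed (width * height) array into a pitched array
    // where each row starts at a multiple of 'pitchElems' elements.
    static float[] padRows(float[] tight, int width, int height, int pitchElems) {
        float[] padded = new float[pitchElems * height];
        for (int row = 0; row < height; row++) {
            System.arraycopy(tight, row * width, padded, row * pitchElems, width);
        }
        return padded;
    }

    public static void main(String[] args) {
        final int NUM_INSTANCES = 100; // row width in elements (example values)
        final int NUM_ATT = 8;         // number of rows

        int widthBytes = NUM_INSTANCES * 4;               // 4 = sizeof(float)
        int pitchBytes = paddedRowBytes(widthBytes, 128); // 400 -> 512
        int pitchElems = pitchBytes / 4;

        float[] tight = new float[NUM_INSTANCES * NUM_ATT];
        for (int i = 0; i < tight.length; i++) tight[i] = i;

        float[] padded = padRows(tight, NUM_INSTANCES, NUM_ATT, pitchElems);
        // Element (row, col) now lives at row * pitchElems + col,
        // which is exactly the indexing a pitched kernel would use.
        System.out.println(pitchBytes);
        System.out.println(padded[1 * pitchElems]); // first element of row 1
    }
}
```

This is, of course, exactly the copy that cudaMemcpy2D does for me on the C side, which is why doing it by hand in Java feels wrong.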
Thanks.