Pinned memory in JCuda for a 2D array

Hi,

I have developed code in C and CUDA that allocates page-locked (pinned) memory and transfers data from host to device at a much higher speed. Now I’m trying to port it to Java using JCuda, but I’m not sure how to do it. I have followed other posts here, like the one on Mapped Memory. The problem is how to use the pitch (stride) the way I used it in the C code:

float * h_instances = NULL;
cudaMallocHost((void **) &h_instances, NUM_INSTANCES * NUM_ATT * sizeof(float));

//....
// Load some data into h_instances
//....

float *d_instances;
size_t d_pitchBytes;
size_t h_pitchBytes = NUM_INSTANCES * sizeof(float);
cudaMallocPitch((void **) &d_instances, &d_pitchBytes, NUM_INSTANCES * sizeof(float), NUM_ATT);
// Then copy to device
cudaMemcpy2D(d_instances, d_pitchBytes, h_instances, h_pitchBytes, NUM_INSTANCES * sizeof(float), NUM_ATT, cudaMemcpyHostToDevice);

I followed the example at https://developer.nvidia.com/content/how-optimize-data-transfers-cuda-cc for the C code.

In Java I did:

Pointer h_instances = new Pointer();
JCuda.cudaHostAlloc(h_instances, NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT, JCuda.cudaHostAllocDefault);

Pointer d_instances = new Pointer();
JCuda.cudaHostGetDevicePointer(d_instances, h_instances, 0); // Here d_instances has the same size as h_instances, but I would like it padded for coalesced memory access.

FloatBuffer floatBuffer = h_instances.getByteBuffer(0, NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT).order(ByteOrder.nativeOrder()).asFloatBuffer();
// ....
// Add data to floatBuffer (will this data be available on the device???) - see the filling sketch below
// ....

// The code below is what I did without using pinned memory...
// I need the pitch to properly coalesce memory accesses...

// long[] d_pitchBytes = new long[1];
// long h_pitchBytes = NUM_INSTANCES * Sizeof.FLOAT;
// Transposed to coalesce memory access
// JCuda.cudaMallocPitch(d_instances, d_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT);
// JCuda.cudaMemcpy2D(d_instances, d_pitchBytes[0], Pointer.to(h_instances), h_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT, cudaMemcpyKind.cudaMemcpyHostToDevice);
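Just to make the filling step concrete, this is roughly how I fill the buffer through the FloatBuffer view, using the same transposed layout as in the C code (NUM_ATT rows of NUM_INSTANCES floats each); loadValue is only a placeholder for wherever the data really comes from:

// Fill the pinned host buffer in transposed layout:
// row index = attribute, column index = instance
// (loadValue is a placeholder, not a real method)
for (int att = 0; att < NUM_ATT; att++)
{
    for (int inst = 0; inst < NUM_INSTANCES; inst++)
    {
        floatBuffer.put(att * NUM_INSTANCES + inst, loadValue(att, inst));
    }
}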

But how can I do it? I mean, allocate pinned memory that takes the pitch (stride) into account? In C this is very transparent, since I don’t need to consider the pitch outside the device code when iterating over the data… I’m thinking of padding the host array manually, but I’m not comfortable with doing that. Is there another way to create pinned memory in JCuda that takes the stride into account when allocating? Am I doing something wrong??

Thanks.

Hello

Sorry, I’m not sure whether I entirely understood the question.

The code that you commented out at the end should roughly correspond to the C code. However, in the C code, you are allocating new (pitched) device data, whereas in the Java code, you are mapping the host data to the device (and this data will not be pitched), AND allocating device data (which is pitched again) using the same pointer…

Apart from the fact that the third parameter of cudaMemcpy2D should probably be h_instances (and not a pointer to h_instances), I think that this part of the code should work as it is…

long[] d_pitchBytes = new long[1];
long h_pitchBytes = NUM_INSTANCES * Sizeof.FLOAT;
JCuda.cudaMallocPitch(d_instances, d_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT);
JCuda.cudaMemcpy2D(d_instances, d_pitchBytes[0], h_instances, h_pitchBytes, NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT, cudaMemcpyKind.cudaMemcpyHostToDevice);
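For reference, a complete (but untested) sketch of that combination (pinned host memory, pitched device memory and a 2D copy) could look roughly like this, with made-up sizes and dummy data:

import java.nio.ByteOrder;
import java.nio.FloatBuffer;

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.runtime.JCuda;
import jcuda.runtime.cudaMemcpyKind;

public class PinnedPitchedCopyExample
{
    public static void main(String args[])
    {
        final int NUM_INSTANCES = 1000;
        final int NUM_ATT = 16;

        // Pinned (page-locked) host memory
        Pointer h_instances = new Pointer();
        JCuda.cudaHostAlloc(h_instances,
            NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT, JCuda.cudaHostAllocDefault);

        // Fill the host buffer with some dummy data (transposed layout)
        FloatBuffer floatBuffer = h_instances
            .getByteBuffer(0, NUM_INSTANCES * NUM_ATT * Sizeof.FLOAT)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
        for (int i = 0; i < NUM_INSTANCES * NUM_ATT; i++)
        {
            floatBuffer.put(i, i);
        }

        // Pitched device memory: each row (attribute) may be padded
        Pointer d_instances = new Pointer();
        long d_pitchBytes[] = new long[1];
        long h_pitchBytes = NUM_INSTANCES * Sizeof.FLOAT;
        JCuda.cudaMallocPitch(d_instances, d_pitchBytes,
            NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT);

        // 2D copy: source rows are tightly packed, destination rows are pitched
        JCuda.cudaMemcpy2D(d_instances, d_pitchBytes[0],
            h_instances, h_pitchBytes,
            NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT,
            cudaMemcpyKind.cudaMemcpyHostToDevice);

        // ... launch kernels that use d_instances and d_pitchBytes[0] here ...

        JCuda.cudaFree(d_instances);
        JCuda.cudaFreeHost(h_instances);
    }
}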

Is your intention to allocate a host buffer with a pitch? I think this is not possible even in CUDA-C. Or is your question particularly about filling the host buffer?

I can try to have a closer look at all this at the beginning of next week. Maybe I can port the example from the link that you posted. It could be interesting as a benchmark anyhow.

bye
Marco

Hello,

thanks for the answer.

Just after posting the question, I realized that it is indeed not possible, as you said, to allocate a host buffer with a pitch, even in CUDA-C.
And what I was doing in C was reallocating memory and copying from the host buffer to the newly allocated memory, so it was not pinned memory anyway…

Well, thanks for the answer. I will use pitched memory without the pinned buffer allocation…

So the speedup that you observed was solely due to the pitched memcopy? I’ll definitely have to create some benchmarks (and look for existing ones). I have read several times that memory alignment is important, but until now I have not really taken it into account in practice.

EDIT: It could be particularly interesting to see details about how pitching, pinning and combinations thereof influence the performance. That is, to see whether manually padding the (pinned) host memory yields a higher performance for copy operations.
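A quick, untested sketch of how such a test could look: allocate the pinned host buffer with the device pitch already applied, so that a plain 1D copy suffices (the constants and someValue are just placeholders):

final int NUM_INSTANCES = 1000;
final int NUM_ATT = 16;

// Query the device pitch first
Pointer d_instances = new Pointer();
long d_pitchBytes[] = new long[1];
JCuda.cudaMallocPitch(d_instances, d_pitchBytes,
    NUM_INSTANCES * Sizeof.FLOAT, NUM_ATT);
int pitchInFloats = (int)(d_pitchBytes[0] / Sizeof.FLOAT);

// Allocate the pinned host buffer with the same (padded) row length
Pointer h_instances = new Pointer();
JCuda.cudaHostAlloc(h_instances,
    d_pitchBytes[0] * NUM_ATT, JCuda.cudaHostAllocDefault);
FloatBuffer floatBuffer = h_instances
    .getByteBuffer(0, d_pitchBytes[0] * NUM_ATT)
    .order(ByteOrder.nativeOrder()).asFloatBuffer();

// Fill only the first NUM_INSTANCES entries of each padded row
// (someValue is a placeholder for the real data source)
for (int att = 0; att < NUM_ATT; att++)
{
    for (int inst = 0; inst < NUM_INSTANCES; inst++)
    {
        floatBuffer.put(att * pitchInFloats + inst, someValue(att, inst));
    }
}

// Host and device now share the same layout, so a plain copy is enough
JCuda.cudaMemcpy(d_instances, h_instances,
    d_pitchBytes[0] * NUM_ATT, cudaMemcpyKind.cudaMemcpyHostToDevice);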