Whether you can achieve a speedup for a particular operation depends on several factors. A while ago, I wrote some general words about this on stackoverflow (along with some words of “warning”: parallel programming on the GPU is rather “low level” compared to other forms of parallel programming in Java. However, I’ve never actively used a message passing system, so I can’t say much about that…)
Matrix-Vector operations are, in general, very well suited for the GPU. There are still some guidelines to follow in order to achieve good performance. Most importantly, make sure that memory is not copied unnecessarily back and forth between the device (GPU memory) and the host (main memory).
But fortunately, you don’t have to write your own kernels for things like this (which makes using it much easier): CUBLAS (and in this case, JCublas) already offers the full set of BLAS Level 1, 2 and 3 routines. You might want to have a look at the “JCublas2Sample” (the second one on the http://jcuda.org/samples/samples.html page). It performs a Matrix-Matrix multiplication with CUBLAS. For Matrix-Vector operations, the speedup will not be as large as for the Matrix-Matrix case, but depending on the size of the matrices and vectors, it may still be considerable.
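To make it clearer what such a BLAS Level 3 routine actually computes, here is a plain-Java sketch of the GEMM operation C = alpha*A*B + beta*C, using the column-major storage that BLAS (and thus CUBLAS) expects. This is only a CPU-side illustration of the semantics, not the JCublas code itself; the class and method names are just for this example:

```java
import java.util.Arrays;

// Plain-Java illustration of the BLAS "sgemm" semantics:
//   C = alpha * A * B + beta * C
// All matrices are stored column-major, as CUBLAS expects:
// element (i, j) of an m-row matrix lives at index j * m + i.
public class SgemmDemo {

    // A is m x k, B is k x n, C is m x n, all column-major.
    static void sgemm(int m, int n, int k, float alpha,
                      float[] A, float[] B, float beta, float[] C) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                float sum = 0f;
                for (int p = 0; p < k; p++) {
                    // A(i,p) * B(p,j) in column-major indexing
                    sum += A[p * m + i] * B[j * k + p];
                }
                C[j * m + i] = alpha * sum + beta * C[j * m + i];
            }
        }
    }

    public static void main(String[] args) {
        // 2x2 example: A = [[1,2],[3,4]], B = identity, so C = A.
        float[] A = {1, 3, 2, 4}; // column-major: first column (1,3), second (2,4)
        float[] B = {1, 0, 0, 1};
        float[] C = new float[4];
        sgemm(2, 2, 2, 1f, A, B, 0f, C);
        System.out.println(Arrays.toString(C)); // prints [1.0, 3.0, 2.0, 4.0]
    }
}
```

The JCublas2Sample performs exactly this operation on the GPU via cublasSgemm; the column-major layout shown here is why the sample fills its arrays the way it does.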