23 points | by mikepapadim 2 days ago
5 comments
https://github.com/beehive-lab/GPULlama3.java
Does it support flash attention? Use tensor cores? Can I write custom kernels?
UPD: I found no evidence that it supports tensor cores, so it's going to be many times slower than implementations that do.
Yes, when you use the PTX backend it supports Tensor Cores. It also has an implementation of flash attention. You can also write your own kernels; have a look here: https://github.com/beehive-lab/GPULlama3.java/blob/main/src/... https://github.com/beehive-lab/GPULlama3.java/blob/main/src/...
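For reference, here is a minimal sketch of what a hand-written kernel looks like, assuming the TaskGraph API from TornadoVM 1.x; the class and variable names are illustrative, not taken from GPULlama3.java:

    import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
    import uk.ac.manchester.tornado.api.TaskGraph;
    import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
    import uk.ac.manchester.tornado.api.annotations.Parallel;
    import uk.ac.manchester.tornado.api.enums.DataTransferMode;
    import uk.ac.manchester.tornado.api.types.arrays.FloatArray;

    public class CustomKernelSketch {

        // The kernel is a plain Java method; TornadoVM JIT-compiles it
        // for the GPU. @Parallel marks the loop for parallel execution.
        public static void vectorAdd(FloatArray a, FloatArray b, FloatArray c) {
            for (@Parallel int i = 0; i < a.getSize(); i++) {
                c.set(i, a.get(i) + b.get(i));
            }
        }

        public static void main(String[] args) {
            int n = 1024;
            FloatArray a = new FloatArray(n);
            FloatArray b = new FloatArray(n);
            FloatArray c = new FloatArray(n);
            for (int i = 0; i < n; i++) {
                a.set(i, 1.0f);
                b.set(i, 2.0f);
            }

            // Build a task graph: copy inputs to the device on first run,
            // execute the kernel, copy the result back after every run.
            TaskGraph graph = new TaskGraph("s0")
                    .transferToDevice(DataTransferMode.FIRST_EXECUTION, a, b)
                    .task("t0", CustomKernelSketch::vectorAdd, a, b, c)
                    .transferToHost(DataTransferMode.EVERY_EXECUTION, c);

            ImmutableTaskGraph itg = graph.snapshot();
            TornadoExecutionPlan plan = new TornadoExecutionPlan(itg);
            plan.execute();

            System.out.println(c.get(0)); // expect 3.0
        }
    }

The method body stays ordinary Java; which code is generated (OpenCL, PTX, or SPIR-V) depends on the backend you select at run time.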
The TornadoVM GitHub repository has no mention of tensor cores or WMMA instructions in its code. The only reference to tensor cores is a 2024 discussion stating they are not used: https://github.com/beehive-lab/TornadoVM/discussions/393
https://github.com/beehive-lab/TornadoVM/pull/732 https://github.com/beehive-lab/TornadoVM/pull/313