Commit a30143fc by Jeffrey Phillips Freeman

Updated package to com.aparapi.

# Aparapi
[![License](http://img.shields.io/:license-apache-blue.svg?style=flat-square)](http://www.apache.org/licenses/LICENSE-2.0.html)
[![Javadocs](http://www.javadoc.io/badge/com.aparapi/aparapi.svg)](http://www.javadoc.io/doc/com.aparapi/aparapi)
[![Gitter](https://badges.gitter.im/Syncleus/aparapi.svg)](https://gitter.im/Syncleus/aparapi?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
A framework for executing native Java code on the GPU.
A GPU has a unique architecture that causes it to behave differently from a CPU.
Aparapi was originally a project conceived and developed by AMD corporation. It was later abandoned by AMD and sat mostly idle for several years. Despite some efforts by the community to keep the project alive, without a clear community leader no new releases ever came. Eventually we came along, rescued the project, and modernized it. After such a long wait, the first Aparapi release in 5 years was finally published, and the community continues to push forward with renewed excitement.
For detailed documentation see [Aparapi.com](http://Aparapi.com) or check out the [latest Javadocs](http://www.javadoc.io/doc/com.aparapi/aparapi).
For support please use [Gitter](https://gitter.im/Syncleus/aparapi) or the [official Aparapi mailing list](https://groups.google.com/a/syncleus.com/d/forum/aparapi-list).
To include Aparapi in your project of choice, include the following Maven dependency:
```xml
<dependency>
<groupId>com.aparapi</groupId>
<artifactId>aparapi</artifactId>
<version>1.0.0</version>
</dependency>
```
# AccessingMultiDimNDRangeProposal
*A proposal for accessing multi-dim ND range execution. Updated Dec 14, 2011 by frost.g...@gmail.com*
We can discuss this proposal either here (in comments) or via the discussion list here.
Note this has nothing to do with accessing Java 2D arrays in Aparapi. This discussion is focused on the ability to expose the execution of kernels over 1, 2 or 3 dimensions. The memory in each case is a single contiguous region (like a single-dimension primitive array).
At present an Aparapi kernel can only be executed using a single dimension. If we wish to represent execution over a WIDTH x HEIGHT element grid, we would execute over the range (WIDTH*HEIGHT) and manually divide/mod getGlobalId() by WIDTH to determine the x and y for each element.
Similarly we would multiply y by WIDTH and add x (y*WIDTH+x) to convert an X,Y location to a linear global id:
```java
final static int WIDTH=128;
final static int HEIGHT=64;
final int in[] = new int[WIDTH*HEIGHT];
final int out[] = new int[WIDTH*HEIGHT];

Kernel kernel = new Kernel(){
   public void run(){
      int x = getGlobalId()%WIDTH;
      int y = getGlobalId()/WIDTH;
      if (!(x==0 || x==(WIDTH-1) || y==0 || y==(HEIGHT-1))){
         int sum = 0;
         for (int dx =-1; dx<2; dx++){
            for (int dy =-1; dy<2; dy++){
               sum+=in[(y+dy)*WIDTH+(x+dx)];
            }
         }
         out[y*WIDTH+x] = sum/9;
         // or out[getGlobalId()] = sum/9;
      }
   }
};
kernel.execute(WIDTH*HEIGHT);
```
OpenCL natively allows the user to execute over 1, 2 or 3 dimension grids via the clEnqueueNDRangeKernel() method.
We chose not to expose this in Aparapi, but there have been requests for us to allow it.
There are a number of things to consider here:
1. Extending the syntax of kernel.execute() to allow multi dimensional grids.
1. Mapping Kernel methods to OpenCL's get_local_id(int dim), get_local_size(int dim), get_group_id(int dim), etc. At present we map kernel.getGlobalId() to get_global_id(0).
1. Handling all of these when an application drops back to JTP mode.
## Extending Kernel.execute(int range)
Sadly we can't overload Kernel.execute(int range), Kernel.execute(int xrange, int yrange) and Kernel.execute(int xrange, int yrange, int zrange), because we already have kernel.execute(int, int) mapped for executing multiple passes over the linear range.
Remember:
```java
for (int pass=0; pass<20; pass++){
   kernel.execute(1024);
}
```
is equivalent to:
```java
kernel.execute(1024, 20);
```
I think I would prefer:
```java
Kernel.execute(int range)
Kernel.execute(int range, int passes)
Kernel.executeXY(int xrange, int yrange)
Kernel.executeXY(int xrange, int yrange, int passes)
Kernel.executeXYZ(int xrange, int yrange, int zrange)
Kernel.executeXYZ(int xrange, int yrange, int zrange, int passes)
```
Obviously in the above calls we are only supplying the global bounds for the grid. We could also provide mappings allowing local ranges. I think I would prefer:
```java
Kernel.executeLocal(int range, int local)
Kernel.executeLocal(int range, int local, int passes)
Kernel.executeXYLocal(int xrange, int yrange, int xlocalrange, int ylocalrange)
Kernel.executeXYLocal(int xrange, int yrange, int xlocalrange, int ylocalrange, int passes)
Kernel.executeXYZLocal(int xrange, int yrange, int zrange, int xlocalrange, int ylocalrange, int zlocalrange)
Kernel.executeXYZLocal(int xrange, int yrange, int zrange, int xlocalrange, int ylocalrange, int zlocalrange, int passes)
```
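For illustration, here is a quick usage sketch of these proposed overloads (the grid and the 16x16 local size are arbitrary values chosen for this example):
```java
// Proposed-API sketch: a 1024x1024 global grid, 20 passes.
kernel.executeXY(1024, 1024, 20);

// The same global grid with an explicit 16x16 local size; these correspond
// to the global and local work sizes passed to OpenCL's clEnqueueNDRangeKernel().
kernel.executeXYLocal(1024, 1024, 16, 16);
```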
Another alternative may be to create Range classes:
```java
class Range{
   int passes;
   int width;
   static Range create(int width);
   static Range create(int width, int passes);
}

class Range2D extends Range{
   int height;
   static Range2D create(int width, int height);
   static Range2D create(int width, int height, int passes);
}

class Range3D extends Range2D{
   int depth;
   static Range3D create(int width, int height, int depth);
   static Range3D create(int width, int height, int depth, int passes);
}
```
With appropriate constructors (or factory methods) to allow:
```java
Kernel.execute(Range range)
```
Then execute would simply be:
```java
kernel.execute(Range.create(1,1));
```
We can also arrange for the group size to be placed in the base Range class:
```java
class Range{
   int groupSize;
   int passes;
   int width;
   static Range create(int width);
   static Range create(int width, int passes);
}
```
## Mapping to OpenCL multi-dim methods, i.e. get_global_id(1), get_local_size(2), etc.
We could just add getGlobalId(int dim), getLocalSize(int dim) etc. to replicate the OpenCL methods.
I would prefer to offer the following global mappings:
|Kernel | OpenCL|
|-----|------|
|getGlobalId()| get_global_id(0)|
|getGlobalX()| get_global_id(0)|
|getGlobalY()| get_global_id(1)|
|getGlobalZ()| get_global_id(2)|
|getGlobalSize()| get_global_size(0)|
|getGlobalWidth()| get_global_size(0)|
|getGlobalHeight()| get_global_size(1)|
|getGlobalDepth()| get_global_size(2)|
And the following local mappings:
|Kernel| OpenCL|
|-----|-------|
|getLocalId()| get_local_id(0)|
|getLocalX()| get_local_id(0)|
|getLocalY()| get_local_id(1)|
|getLocalZ()| get_local_id(2)|
|getLocalSize()| get_local_size(0)|
|getLocalWidth()| get_local_size(0)|
|getLocalHeight()| get_local_size(1)|
|getLocalDepth()| get_local_size(2)|
## An example
```java
final static int WIDTH=128;
final static int HEIGHT=64;
final int in[] = new int[WIDTH*HEIGHT];
final int out[] = new int[WIDTH*HEIGHT];

Kernel kernel = new Kernel(){
   public void run(){
      int x = getGlobalX();
      int y = getGlobalY();
      if (!(x==0 || x==(getGlobalWidth()-1) || y==0 || y==(getGlobalHeight()-1))){
         int sum = 0;
         for (int dx =-1; dx<2; dx++){
            for (int dy =-1; dy<2; dy++){
               sum+=in[(y+dy)*getGlobalWidth()+(x+dx)];
            }
         }
         out[y*getGlobalWidth()+x] = sum/9;
         // or out[getGlobalId()] = sum/9;
      }
   }
};
kernel.executeXY(WIDTH, HEIGHT);
```
Or, if we choose the Range class approach:
```java
final static int WIDTH=128;
final static int HEIGHT=64;
final int in[] = new int[WIDTH*HEIGHT];
final int out[] = new int[WIDTH*HEIGHT];

Kernel kernel = new Kernel(){
   public void run(){
      int x = getGlobalX();
      int y = getGlobalY();
      if (!(x==0 || x==(getGlobalWidth()-1) || y==0 || y==(getGlobalHeight()-1))){
         int sum = 0;
         for (int dx =-1; dx<2; dx++){
            for (int dy =-1; dy<2; dy++){
               sum+=in[(y+dy)*getGlobalWidth()+(x+dx)];
            }
         }
         out[y*getGlobalWidth()+x] = sum/9;
         // or out[getGlobalId()] = sum/9;
      }
   }
};
kernel.execute(Range2D.create(WIDTH, HEIGHT));
```
## Handling this from JTP mode
Mapping this to OpenCL is all fairly straightforward.
In Java JTP mode we will have to emulate it. For get_global_id(0..2) (getGlobalX(), getGlobalY() and getGlobalZ() using our proposed Aparapi Java mappings) we can of course easily offer reasonable implementations; this just requires the Java code to essentially nest three loops (or emulate doing so) and set globalX, globalY and globalZ inside each nesting.
For get_local_size(0..2) (getLocalWidth(), getLocalHeight() and getLocalDepth() using our proposed Aparapi Java mappings) we will need to break the globalWidth/globalHeight/globalDepth into some arbitrary equal 'chunks' (note I am avoiding the word 'groups' here to avoid confusion with get_group_id(0..2)!).
At present we always create a synthetic group in JTP mode which is the # of cores. This will need to change. If the user requests a grid (64,64,8,8) (global width 64, global height 64, local width 8, local height 8), then we will have to create a JTP group of 64 (8x8), and just in case the kernel code contains a barrier we will need to ensure we launch 64 threads for this group. From our experience it is best to launch one thread per core, so we may lose some JTP performance executing in this mode.
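Here is a minimal sketch of how such an emulation might derive the global and local ids with nested loops (the names and structure are illustrative assumptions only; a real implementation would distribute each group across a thread pool so barriers can be honored):
```java
// Hypothetical sketch: emulate a 2D dispatch in JTP mode with nested loops.
// This only shows how the values behind getGlobalX()/getGlobalY() could be
// derived; actual group execution would happen on worker threads.
static void dispatch2D(int globalWidth, int globalHeight,
                       int localWidth, int localHeight) {
   for (int groupY = 0; groupY < globalHeight / localHeight; groupY++) {
      for (int groupX = 0; groupX < globalWidth / localWidth; groupX++) {
         for (int localY = 0; localY < localHeight; localY++) {
            for (int localX = 0; localX < localWidth; localX++) {
               int globalX = groupX * localWidth + localX;  // get_global_id(0)
               int globalY = groupY * localHeight + localY; // get_global_id(1)
               // set the kernel's globalX/globalY/localX/localY state here
               // and invoke its run() method
            }
         }
      }
   }
}
```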
# AddingLambdasToAparapi
*Adding Java 8 Lambda Support to Aparapi. Updated Jun 24, 2013 by frost.g...@gmail.com*
In the recently added `lambda` branch we have been experimenting with adding lambda support to Aparapi. We believe that this upcoming Java 8 feature will be a natural way to express parallel algorithms which can be executed on the GPU.
A link to the branch can be found here.
You will need the latest binary build of Project Lambda to experiment with these new features. The Project Lambda preview can be found here.
Once you have a lambda-enabled Java 8 JDK, set JAVA_HOME to point at it and build Aparapi.
So from the root of SumatraExperiments just use:
```
$ ant
```
We are slowly walking through some of the Aparapi demos and converting them. At present NBody and Mandel have been converted.
With lambda-enabled Aparapi we remove the need to derive from a base Kernel class; instead we allow the user to express their code as a lambda using the following basic pattern:
```java
Device.bestGPU().forEach(int range, IntConsumer lambda);
```
The Java 8 stream API defines a type called java.util.function.IntConsumer. This is essentially an interface with a Single Abstract Method (these types are referred to as SAM types in the stream API code).
IntConsumer looks something like:
```java
interface IntConsumer{
   public void accept(int value);
}
```
So you can run the familiar 'squares' kernel using:
```java
int in[] = ...;
int out[] = ...;
Device.bestGPU().forEach(in.length, (i)->{
   out[i] = in[i] * in[i];
});
```
Instead of:
```java
int in[] = ...;
int out[] = ...;
Device.bestGPU().forEach(in.length, new IntConsumer(){
   public void accept(int i){
      out[i] = in[i] * in[i];
   }
});
```
To accommodate lambdas we created Device.forEach(int range, IntConsumer ic), which converts the bytecode of the ic parameter to OpenCL at runtime. The captured args (in, out and i in this case) are passed to the GPU and the kernel is executed.
During our early experiments we encountered an interesting issue. The new 'lambdafied' javac uses Java 7 method handles and invokedynamic instructions to dispatch the lambda code. It does this by injecting a call to a MethodHandle factory into the call site. At runtime, this factory creates a synthetic class (to capture call-site args) and passes this to our Device.forEach().
We needed to analyse this synthetically generated class in order to work out which args need to be sent to the GPU. Of course we have a bunch of tools already in Aparapi for analyzing bytecode, but this code expects to find bytecode in class files (either in a jar or on disk), so we had to find a way to expose these classfile bytes to Aparapi.
We have a couple of proposed solutions. The most promising is to turn the aparapi.dll/aparapi.so native library (used by Aparapi at runtime) into a JVMTI agent (like hprof). JVMTI agents are native libraries which have access to some aspects of a running JVM (via the JVM Tool Interface). We have a prototype JVMTI agent which 'listens' for classfiles which represent these 'synthetic lambda helpers' and allows us to get hold of the bytecode for these classes.
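The agent itself is native code, so it is hard to show here directly. As a rough Java analogue only (an illustration, not the proposed implementation), a java.lang.instrument agent can observe classfile bytes in a similar spirit; the class name filter below is an assumption about how synthetic lambda helpers are named:
```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Illustrative analogue: capture the bytes of synthetic lambda helper classes
// as they are defined, similar in spirit to the proposed JVMTI hook.
// Registered via -javaagent rather than -agentpath.
public class LambdaClassSniffer implements ClassFileTransformer {
   public static void premain(String agentArgs, Instrumentation inst) {
      inst.addTransformer(new LambdaClassSniffer());
   }

   @Override
   public byte[] transform(ClassLoader loader, String className, Class<?> redefined,
                           ProtectionDomain domain, byte[] classfileBytes) {
      // Synthetic lambda helpers typically contain "$$Lambda" in their names.
      if (className != null && className.contains("$$Lambda")) {
         // stash classfileBytes somewhere the bytecode analyzer can find them
      }
      return null; // null means "no transformation applied"
   }
}
```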
This will mean that in future we will change how Aparapi is launched.
Instead of:
```
$ java -Djava.library.path=path/to/aparapi -classpath path/to/aparapi/aparapi.jar:your.jar YourClass
```
We will use:
```
$ java -agentpath:path/to/aparapi/aparapi.dll -classpath path/to/aparapi/aparapi.jar:your.jar YourClass
```
We are also looking into the possibility of having this agent provide the bytecode for all Aparapi classes. We believe that this would enable us to ultimately remove MethodModel/ClassModel and even the InstructionSet classes, and handle all of this in JNI.
We would welcome comments on these proposals, either here or in the discussion list. Let us know what you think.
## Consequences of lambdification of Aparapi
* No support for local memory, group size or barriers in lambda form.
* Calls to Kernel base class methods (such as getGlobalId()) will not be allowed. The 'global id' will be passed as an arg to the lambda.
* We will need to add support for calling static methods (of course the bytecode for the called methods cannot violate Aparapi restrictions).
* We might need to drop support for multi-dimension dispatch. This is more a convergence story with Sumatra (which is unlikely to support this).
* It is unlikely that explicit buffer management will be simple.
* We can use lambdas for control as well as for the kernel itself. See the examples below.
## Alternate forms for kernel dispatch
This version would allow us to carry over Aparapi's device selection:
```java
Device.bestGPU().forEach(1024, i->{lambda});
```
This version would allow us to carry over Aparapi's Range selection:
```java
Device.bestGPU().range2D(width, height).forEach(rid->{lambda});
```
This version would allow us to mimic Kernel.execute(1024, 5):
```java
Device.bestGPU().forEach(1024, 5, (id, passid)->{lambda});
```
We could even have the range iterated over until some other lambda determines we are done:
```java
Device.bestGPU().forEachUntil(1024, id->{lambda}, ()->{predicate lambda});
```
Explicit buffer handling could be removed in many cases by allowing the bytecode of the 'until' predicate to be snooped for buffer references.
```java
int lotsOfData[] = ...;
boolean found[] = new boolean[1];
Device.bestGPU().forEachUntil(1024, 5,
   (id, passid)->{ /* mutate lotsOfData, set found[0]=true when done */ },
   ()->found[0]);
```
In the above case Aparapi can determine that between each pass it needs to *only* copy found[] back from the device.
There is no reason that the range itself needs to be constant; we can use a collection/iterable. This helps with some reductions.
```java
int range[] = new int[]{1024,512,128,64,32,16,8,4,2,1,0};
Device.bestGPU().forEach(range, id->{lambda});
```
Or the range can be a lambda itself; here we specify a start and an end value for the range, and a lambda to provide each step.
```java
Device.bestGPU().forEach(1024, 1, r->{return(r/2);}, (pass, r, id)->{lambda});
// or
Device.bestGPU().forEach(1, 1024, r->{return(r*2);}, (pass, r, id)->{lambda});
```
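To make the reduction point concrete, a pairwise sum under the proposed shrinking-range form might look something like this (a sketch against the hypothetical API above, not working code):
```java
final int data[] = new int[1024];
// ... fill data ...
// Each pass halves the range; work-item id folds the upper half into the lower,
// so after the final pass (r==1) data[0] holds the sum of all 1024 elements.
Device.bestGPU().forEach(512, 1, r->{return(r/2);}, (pass, r, id)->{
   data[id] += data[id + r];
});
```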
# AddressSpacesUsingBuffers
*Proposal for OpenCL address space support using Java Buffers instead of arrays. Updated Dec 8, 2011 by frost.g...@gmail.com*
The general idea is to have an AS_PRIMTYPE_Buffer for each address space (AS) and primitive type (PRIMTYPE). Here is an example of LocalFloatBuffer, which would be a buffer of floats mapped to the OpenCL local address space.
As with normal FloatBuffers, the float elements are accessed using get and put methods.
Although a LocalFloatBuffer conceptually exists only for the lifetime of a workgroup, it is still constructed in the enclosing Kernel, not in the Kernel.Entry.run method. (Aparapi does not support constructing new objects inside the Kernel.Entry.run method).
A typical declaration would be:
```java
LocalFloatBuffer locbuf = new LocalFloatBuffer(12);
```
The argument 12 here means that 12 floats would be used by each workitem in the workgroup, so the total buffer would be localSize*12 floats. Aparapi would at runtime allocate a local OpenCL buffer of this total size. Note how this removes the need for the programmer to specify localSize anywhere.
Note: For each Kernel.Entry.execute(globalSize) call, the runtime will determine an appropriate workgroup size, also called localSize, depending on the capabilities of the device and on the globalSize. The localSize will always evenly divide the globalSize; in other words, all workgroups for an execute context will be the same size. A workitem can determine localSize by calling getLocalSize().
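As an illustration of that note (a sketch only; the actual selection heuristics are not specified here), a runtime could pick localSize as the largest device-supported size that evenly divides globalSize:
```java
// Hypothetical sketch: choose the largest localSize <= maxGroupSize that
// evenly divides globalSize, so all workgroups are the same size.
static int chooseLocalSize(int globalSize, int maxGroupSize) {
   for (int local = Math.min(globalSize, maxGroupSize); local > 1; local--) {
      if (globalSize % local == 0) {
         return local;
      }
   }
   return 1;
}
// e.g. chooseLocalSize(1024, 256) == 256; chooseLocalSize(1000, 256) == 250
```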
Because workitems operate simultaneously and in an undetermined order, a workitem will generally only use put on its own portion of the LocalFloatBuffer between the LocalBarriers, and will generally only use get outside the LocalBarriers.
Some example code (from NBody) follows. Here each workitem copies a "BODY" consisting of 4 floats. The global array contains 4*globalSize floats, and we want to iterate through this global array, copying it into local memory and operating on it there. This will take globalSize/localSize "tiles". For each tile, each workitem fills in one "BODY"'s worth of 4 elements.
```java
// outside run method...
final int BODYSIZE = 4;
LocalFloatBuffer pos_xyzm_local = new LocalFloatBuffer(BODYSIZE);

// inside run method...
int numTiles = globalSize / localSize;
for (int i = 0; i < numTiles; ++i) {
   // load one tile into local memory
   int idx = i * localSize + localId; // index into a global memory array
   localBarrier();
   pos_xyzm_local.put(localId * BODYSIZE + 0, pos_xyzm[idx * BODYSIZE + 0]);
   pos_xyzm_local.put(localId * BODYSIZE + 1, pos_xyzm[idx * BODYSIZE + 1]);
   pos_xyzm_local.put(localId * BODYSIZE + 2, pos_xyzm[idx * BODYSIZE + 2]);
   pos_xyzm_local.put(localId * BODYSIZE + 3, pos_xyzm[idx * BODYSIZE + 3]);
   // Synchronize to make sure data is available for processing
   localBarrier();
   // now the entire LocalFloatBuffer has been filled.
   // each workitem might use the entire buffer,
   // which consists of localSize BODYs
   for (int j = 0; j < localSize; ++j) {
      float r_x = pos_xyzm_local.get(j * BODYSIZE + 0) - myPos_x;
      float r_y = pos_xyzm_local.get(j * BODYSIZE + 1) - myPos_y;
      float r_z = pos_xyzm_local.get(j * BODYSIZE + 2) - myPos_z;
      // ...etc
   }
}
```
# AparapiExtensionProposal
*A proposed Aparapi extension mechanism. Updated Feb 29, 2012 by frost.g...@gmail.com*
## Here is a proposed Aparapi extension mechanism
This would allow a developer to create a library that could be used by Aparapi Kernel code. The library would include both OpenCL and Java implementations.
We will treat this as a live document. Please join the discussions at http://groups.google.com/group/aparapi-discuss/browse_thread/thread/7ec81ecb2169aa4 and I will update this page to reflect what I think the latest decisions are:
Currently Aparapi allows Java bytecode to be converted to OpenCL at runtime. Only the OpenCL generated by this conversion process is made available. Sometimes, for performance reasons, we might want to allow hand-coded OpenCL to be called from Aparapi kernel code.
Here we will present a strawman API which would allow extension points to be added by an end user or by a library provider.
We will use an FFT use case to walk through the steps.
The FFT (Fast Fourier Transform) algorithm can be coded in Aparapi, but for performance reasons handcrafted OpenCL is likely to be more performant. The goal is to allow Aparapi to do what it does best, i.e. manage the host buffer allocations, and provide a mechanism for binding arbitrary OpenCL code at runtime.
So let's assume we wanted an Aparapi Kernel to be able to call an Aparapi extension for computing FFT (forward and reverse). The Kernel implementation might look like this:
```java
public static class BandStopFilter extends Kernel{
   FFT fft = new FFT(); // Create an instance of the extension point.
   float[] real;
   float[] imaginary;

   BandStopFilter (float[] _real){
      real = _real;
      imaginary = new float[_real.length];
   }

   @Override public void run() {
      fft.forward(real, imaginary);
   }
}
```
The main method would then just execute the Kernel using the familiar kernel.execute() method:
```java
public static void main(String[] args) {
   float[] data = new float[1024];
   BandStopFilter kernel = new BandStopFilter(data);
   kernel.execute(data.length);
}
```
Essentially we want the FFT.forward(float[] _real, float[] _imaginary) and FFT.reverse(float[] _real, float[] _imaginary) methods to be callable from Aparapi Kernel code. We want Aparapi to handle the call-forwarding and the argument/buffer mapping and transfers. We want Aparapi to call the Java methods normally if OpenCL is not available, but we would like Aparapi to use the implementor-provided OpenCL if it is. So the implementor will be required to provide both a Java and an OpenCL version of the callable methods, because Aparapi will decide which version needs to be called at runtime.
Any extension point is required to implement the AparapiExtensionPoint interface:
```java
public interface AparapiExtensionPoint {
   public String getOpenCL();
}
```
Here is a possible (although incomplete) FFT implementation:
```java
public class FFT implements AparapiExtensionPoint{
   @AparapiCallable public void forward(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }

   @AparapiCallable public void reverse(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }

   @Override public String getOpenCL() {
      return ""
         +"void my_package_FFT_forward(\n"
         +"   __global float* _real,\n"
         +"   __global float* _imaginary ){\n"
         +"   // OpenCL implementation\n"
         +"}\n"
         +"void my_package_FFT_reverse(\n"
         +"   __global float* _real,\n"
         +"   __global float* _imaginary ){\n"
         +"   // OpenCL implementation\n"
         +"}\n";
   }
}
```
The implementer's class will be required to define the callable Aparapi methods as well as implement the `getOpenCL()` method, so that the OpenCL implementation of those methods can be extracted at run-time.
Aparapi will provide annotations to decorate the methods and args/parameters of the exposed callable methods. These annotations provide information so that Aparapi can locate the callable methods, as well as parameter hints to help coordinate buffer types (global, local, constant) and transfer directions (read, write, readWrite) when executing the methods from a Kernel. This information is consulted during the normal bytecode analysis that Aparapi performs when it hits the call site.
Note that the Java code inside the `@AparapiCallable` functions (or code executed from them) is not constrained to the normal Aparapi subset. It can be any legitimate Java code, but it should be thread safe (because it will be called from JTP mode!).
Note also that the OpenCL code yielded from the `getOpenCL()` method is assumed to be complete; Aparapi does not attempt to parse this code. If the code fails to compile, Aparapi will fall back and execute the whole Kernel in JTP mode.
BTW we show getOpenCL() returning a String literal. This is most likely to be how code is returned; however, it could also be extracted from a file or a resource in the jar, or dynamically generated based on some state. For example an FFT implementation might choose to use different code for radix-2 or radix-4 implementations (based on a parameter passed to the `FFT()` constructor, say `FFT(FFT.RADIX2)`), in which case the getOpenCL() method might yield different code.
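For instance (a sketch; the fft.cl resource name is made up for this example, and error handling is elided), a getOpenCL() implementation could pull its source from a classpath resource:
```java
// Hypothetical sketch: load the OpenCL source for this extension point from
// a classpath resource instead of embedding it as a string literal.
@Override public String getOpenCL() {
   try (java.util.Scanner s = new java.util.Scanner(
         FFT.class.getResourceAsStream("fft.cl"), "UTF-8").useDelimiter("\\A")) {
      return s.hasNext() ? s.next() : "";
   }
}
```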
The above proposal covers the case where a third party might want to provide an Aparapi extension point as a library.
We might also consider allowing single methods within the Kernel to be optimized, where the OpenCL is made available via the AparapiCallable annotation. The method would still use the same annotations for the args (to allow buffer transfers to be optimized).
```java
Kernel k = new Kernel(){
   @AparapiCallable(" /* opencl code for sum() goes here */ ")
   int sum(@Global @ReadWrite int[] data, int length){
      int sum = 0;
      for (int v:data){
         sum+=v;
      }
      return sum;
   }

   @Override public void run(){
      sum(data, data.length);
   }
};
```
Here are the proposed new interfaces/annotations:
```java
public interface AparapiExtensionPoint {
   public String getOpenCL();
}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
public @interface AparapiCallable {
   String value() default ""; // annotation defaults may not be null in Java
}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface Global {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface Local {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface Constant {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface ReadWrite {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface ReadOnly {}

@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
public @interface WriteOnly {}
```
And here is the example code in one chunk:
```java
public class FFT implements AparapiExtensionPoint{
   @AparapiCallable public void forward(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }

   @AparapiCallable public void reverse(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }

   @Override public String getOpenCL() {
      return ""
         +"void my_package_FFT_forward(\n"
         +"   __global float* _real,\n"
         +"   __global float* _imaginary ){\n"
         +"   // OpenCL implementation\n"
         +"}\n"
         +"void my_package_FFT_reverse(\n"
         +"   __global float* _real,\n"
         +"   __global float* _imaginary ){\n"
         +"   // OpenCL implementation\n"
         +"}\n";
   }
}

public class BandStopFilter extends Kernel{
   FFT fft = new FFT();
   float[] real;
   float[] imaginary;

   BandStopFilter (float[] _real){
      real = _real;
      imaginary = new float[_real.length];
   }

   @Override public void run() {
      fft.forward(real, imaginary);
   }
}

public static void main(String[] args) {
   float[] data = new float[1024];
   BandStopFilter kernel = new BandStopFilter(data);
   kernel.execute(data.length);
}
```
After discussion I think we are converging on a less complex solution. This is based on Witold's feedback suggestion (see below), where we use OpenCL annotations rather than forcing the implementation of the interface and the `getOpenCL()` method as originally suggested.
So we will create an `@OpenCL` annotation for classes/methods.
The `@OpenCL` annotation on a method will contain the OpenCL source replacement for that specific method. The arg list will be created by Aparapi.
The `@OpenCL` annotation on a class allows us to optionally introduce common code (helper methods, #pragmas, constants) which will precede the method declarations in the OpenCL code.
So an FFT example whereby the forward() and reverse() methods both call a common foo() method might look like this:
```java
@OpenCL(common="/* common void foo(){} + maybe #pragmas + accessible global fields declared here */")
public class FFT {
   @OpenCL(signature="//function signature - OPTIONAL", body="{ /* uses foo(); */ }")
   public void forward(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }

   @OpenCL(body="{ /* uses foo(); */ }")
   public void reverse(
         @Global @ReadWrite float[] _data,
         @Global @ReadWrite float[] _imaginary) {
      // java implementation
   }
}
```
To invoke from an Aparapi kernel, we should be able to do something like:
```java
public class BandStopFilter extends Kernel{
   FFT fft = new FFT();
   float[] real;
   float[] imaginary;

   BandStopFilter (float[] _real){
      real = _real;
      imaginary = new float[_real.length];
   }

   @Override public void run() {
      fft.forward(this, real, imaginary);
   }
}

public static void main(String[] args) {
   float[] data = new float[1024];
   BandStopFilter kernel = new BandStopFilter(data);
   kernel.execute(data.length);
}
```
Ideally we would also like to invoke FFT directly (instead of via a Kernel). This is tricky because the forward() and reverse() methods will need to be invoked across a range, and of course the dispatch across the range needs to be initiated by Aparapi.
The only way I can see to do this is to force the creation of an interface so we can use Java's existing Proxy mechanism to create a wrapper:
```java
@OpenCL(wraps=FFT.class)
interface FFTInterface{
   public void forward( Range _range, float[] _data, float[] _imaginary);
   public void reverse( Range _range, float[] _data, float[] _imaginary);
}
```
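A hedged sketch of how such a Proxy-based wrapper might look (Aparapi.wrap and its dispatch logic are hypothetical illustrations, not an existing API):
```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical sketch: create an FFTInterface implementation whose methods
// dispatch the wrapped FFT's OpenCL (or Java fallback) across the given Range.
public class Aparapi {
   @SuppressWarnings("unchecked")
   public static <T> T wrap(Class<T> iface, Object impl) {
      InvocationHandler handler = (proxy, method, args) -> {
         Range range = (Range) args[0]; // first arg carries the dispatch range
         // locate the matching @OpenCL method on impl and enqueue it over range,
         // falling back to invoking the Java implementation if OpenCL is unavailable
         return null;
      };
      return (T) Proxy.newProxyInstance(iface.getClassLoader(),
            new Class<?>[]{iface}, handler);
   }
}

// usage: FFTInterface fft = Aparapi.wrap(FFTInterface.class, new FFT());
//        fft.forward(Range.create(1024), real, imaginary);
```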