diff --git a/doc/ExplicitBufferHandling.md b/doc/ExplicitBufferHandling.md index a30129205210e1d5a170fb5ee8819f48ea05dc82..5f0e70112dfda3c99d87b0e9103adcb20fec8045 100644 --- a/doc/ExplicitBufferHandling.md +++ b/doc/ExplicitBufferHandling.md @@ -1,6 +1,6 @@ #ExplicitBufferHandling -How to minimize buffer transfers Updated Jul 24, 2012 by frost.g...@gmail.com -Aparapi is designed to shield the Java developer from dealing with the underlying movement of data between the OpenCL host and device. Aparapi can analyze a kernel's run() method and run-reachable methods to determine which primitive arrays to transfer to the GPU prior to execution, and which arrays to transfer back when the GPU execution is complete. +*How to minimize buffer transfers Updated Jul 24, 2012 by frost.g...@gmail.com* +Aparapi is designed to shield the Java developer from dealing with the underlying movement of data between the OpenCL host and device. Aparapi can analyze a kernel's `run()` method and run-reachable methods to determine which primitive arrays to transfer to the GPU prior to execution, and which arrays to transfer back when the GPU execution is complete. Generally this strategy is both clean and performant. Aparapi will attempt to just do the right thing. @@ -16,7 +16,7 @@ However, occasionally the following code pattern is seen. This is a common pattern which unfortunately exposes an issue with Aparapi's normal buffer handling. -Although Aparapi does analyze the byte code of the Kernel.run() method (and any method reachable from Kernel.run()) Aparapi has no visibility to the call site. In the above code there is no way for Aparapi to detect that that hugeArray is not modified within the for loop body. Unfortunately, Aparapi must default to being 'safe' and copy the contents of hugeArray backwards and forwards to the GPU device. 
+Although Aparapi does analyze the byte code of the `Kernel.run()` method (and any method reachable from `Kernel.run()`), Aparapi has no visibility to the call site. In the above code there is no way for Aparapi to detect that `hugeArray` is not modified within the for loop body. Unfortunately, Aparapi must default to being 'safe' and copy the contents of `hugeArray` backwards and forwards to the GPU device. Here we add comments to indicate where the unnecessary buffer transfers take place. @@ -42,7 +42,7 @@ Here we use comments to indicate the 'optimal' transfers. for (int loop=0; loop <MAXLOOP; loop++){ kernel.execute(HUGE); } - // Ideally transfer hugeArray to GPU here + // Ideally transfer hugeArray back from GPU here Consider another common pattern @@ -58,9 +58,9 @@ Consider another common pattern This is a common pattern in reduce stages of map-reduce type problems. Essentially the developer wants to keep executing a kernel until some condition is met. For example, this may be seen in bitonic sort implementations and various financial applications. -From the code it can be seen that the kernel reads and writes hugeArray[] array and uses the single item done[] array to indicate some form of convergence or completion. +From the code it can be seen that the kernel reads and writes the `hugeArray[]` array and uses the single-item `done[]` array to indicate some form of convergence or completion. -As we demonstrated above, by default Aparapi will transfer done[] and hugeArray[] to and from the GPU device each time Kernel.execute(HUGE) is executed. +As we demonstrated above, by default Aparapi will transfer `done[]` and `hugeArray[]` to and from the GPU device each time `Kernel.execute(HUGE)` is executed. To demonstrate which buffers are being transferred, these copies are shown as comments in the following version of the code.
@@ -78,15 +78,15 @@ To demonstrate which buffers are being transferred, these copies are shown as com // Fetch hugeArray[] from GPU } -Further analysis of the code reveals that hugeArray[] is not accessed by the loop containing the kernel execution, so Aparapi is performing 999 unnecessary transfers to the device and 999 unnecessary transfers back. Only two transfers of hugeArray[] are needed; one to move the initial data to the GPU and one to move it back after the loop terminates. +Further analysis of the code reveals that `hugeArray[]` is not accessed by the loop containing the kernel execution, so Aparapi is performing 999 unnecessary transfers to the device and 999 unnecessary transfers back. Only two transfers of `hugeArray[]` are needed: one to move the initial data to the GPU and one to move it back after the loop terminates. -The done[] array is accessed during each iteration (although never written to within the loop), so it does need to be transferred back for each return from Kernel.execute(), however, it only needs to be sent once. +The `done[]` array is accessed during each iteration (although never written to within the loop), so it does need to be transferred back for each return from `Kernel.execute()`; however, it only needs to be sent once. -Clearly it is better to avoid unnecessary transfers, especially of large buffers like hugeArray[]. +Clearly it is better to avoid unnecessary transfers, especially of large buffers like `hugeArray[]`. Aparapi exposes a feature which allows the developer to control these situations and explicitly manage transfers. -To use this feature first the developer needs to 'turn on' explicit mode, using the kernel.setExplicit(true) method. Then the developer can request buffer/array transfers using either kernel.put() or kernel.get(). Kernel.put() forces a transfer to the GPU device and Kernel.get() transfers data back.
+To use this feature, the developer first needs to 'turn on' explicit mode, using the `kernel.setExplicit(true)` method. Then the developer can request buffer/array transfers using either `kernel.put()` or `kernel.get()`. `Kernel.put()` forces a transfer to the GPU device and `Kernel.get()` transfers data back. The following code illustrates the use of these new explicit buffer management APIs. @@ -107,7 +107,7 @@ The following code illustrates the use of these new explicit buffer management A Note that marking a kernel as explicit and failing to request the appropriate transfer is a programmer error. -We deliberately made Kernel.put(…), Kernel.get(…) and Kernel.execute(range) return an instance of the executing kernel to allow these calls be chained. Some may find this fluent style API more expressive. +We deliberately made `Kernel.put(...)`, `Kernel.get(...)` and `Kernel.execute(range)` return an instance of the executing kernel to allow these calls to be chained. Some may find this fluent-style API more expressive. final int[] hugeArray = new int[HUGE]; final int[] done = new int[]{0}; @@ -122,8 +122,8 @@ We deliberately made Kernel.put(…), Kernel.get(…) and Kernel.execute(range) } kernel.get(hugeArray); -An alternate approach for loops containing a single kernel.execute(range) call. -One variant of code which would normally suggest the use of Explicit Buffer Management can be handled differently. For cases where Kernel.execute(range) is the sole statement inside a loop and where the iteration count is known prior to the first iteration we offer an alternate (hopefully more elegant) way of minimizing buffer transfers. +An alternate approach for loops containing a single `kernel.execute(range)` call. +One variant of code which would normally suggest the use of Explicit Buffer Management can be handled differently.
For cases where `Kernel.execute(range)` is the sole statement inside a loop and where the iteration count is known prior to the first iteration we offer an alternate (hopefully more elegant) way of minimizing buffer transfers. So for cases like:- @@ -136,20 +136,20 @@ So for cases like:- kernel.execute(HUGE); } -The developer can request that Aparapi perform the outer loop rather than coding the loop. This is achieved explicitly by passing the iteration count as the second argument to Kernel.execute(range, iterations). +The developer can request that Aparapi perform the outer loop rather than coding the loop. This is achieved explicitly by passing the iteration count as the second argument to `Kernel.execute(range, iterations)`. Now any form of code that looks like :- - int range=1024; - int loopCount=64; - for (int passId=0; passId<loopCount; passId++){ + int range = 1024; + int loopCount = 64; + for (int passId = 0; passId < loopCount; passId++){ kernel.execute(range); } Can be replaced with - int range=1024; - int loopCount=64; + int range = 1024; + int loopCount = 64; kernel.execute(range, loopCount); @@ -159,62 +159,62 @@ Sometimes kernel code using this loop-pattern needs to track the current iterati The code for this would have looked something like -int range=1024; -int loopCount=64; -final int[] hugeArray = new int[HUGE]; -final int[] passId = new int[0]; -Kernel kernel= new Kernel(){ - @Override public void run(){ - int id=getGlobalId(); - if (passId[0] == 0){ - // perform some initialization! - } - ... // reads/writes hugeArray - } -}; -Kernel.setExplicit(true); -kernel.put(hugeArray); -for (passId[0]=0; passId[0]<loopCount; passId[0]++){ - - kernel.put(passId).execute(range); -} -In the current version of Aparapi we added Kernel.getPassId() to allow a Kernel to determine the current ‘pass’ through the outer loop without having to use explicit buffer management. 
+ int range = 1024; + int loopCount = 64; + final int[] hugeArray = new int[HUGE]; + final int[] passId = new int[]{0}; + Kernel kernel = new Kernel(){ + @Override public void run(){ + int id = getGlobalId(); + if (passId[0] == 0){ + // perform some initialization! + } + ... // reads/writes hugeArray + } + }; + kernel.setExplicit(true); + kernel.put(hugeArray); + for (passId[0] = 0; passId[0] < loopCount; passId[0]++){ + kernel.put(passId).execute(range); + } +In the current version of Aparapi we added `Kernel.getPassId()` to allow a kernel to determine the current 'pass' through the outer loop without having to use explicit buffer management. So the previous code can now be written without any explicit buffer management APIs:- -final int[] hugeArray = new int[HUGE]; -final int[] pass[] = new int[]{0}; -Kernel kernel= new Kernel(){ - @Override public void run(){ - int id=getGlobalId(); - int pass = getPassId(); - if (pass == 0){ - // perform some initialization! - } - ... // reads/writes both hugeArray - } -}; - -kernel.execute(HUGE, 1000); + final int[] hugeArray = new int[HUGE]; + Kernel kernel = new Kernel(){ + @Override public void run(){ + int id = getGlobalId(); + int pass = getPassId(); + if (pass == 0){ + // perform some initialization! + } + ... // reads/writes hugeArray + } + }; + + kernel.execute(HUGE, 1000); One common use for Kernel.getPassId() is to avoid flipping buffers in the outer loop. It is common for kernels to process data from one buffer to another, and in the next invocation process the data back the other way. Now these kernels can use the passId (odd or even) to determine the direction of data transfer.
-final int[] arr1 = new int[HUGE]; -final int[] arr2 = new int[HUGE]; -Kernel kernel= new Kernel(){ - int f(int v){ … } - - @Override public void run(){ - int id=getGlobalId(); - int pass = getPassId(); - if (pass%2==0){ - arr1[id] = f(arr2[id]); - }else{ - arr2[id] = f(arr1[id]); - - } - } -}; + final int[] arr1 = new int[HUGE]; + final int[] arr2 = new int[HUGE]; + Kernel kernel = new Kernel(){ + int f(int v){ … } + + @Override public void run(){ + int id = getGlobalId(); + int pass = getPassId(); + if (pass % 2 == 0){ + arr1[id] = f(arr2[id]); + }else{ + arr2[id] = f(arr1[id]); + + } + } + }; -kernel.execute(HUGE, 1000); \ No newline at end of file + kernel.execute(HUGE, 1000); \ No newline at end of file diff --git a/doc/privatememoryspace.md b/doc/privatememoryspace.md index 1901c02094e9973feef0ec51ed4cb4fe5f0b9191..51fee39e43f50ec6c2c93cbde20315c3345a041a 100644 --- a/doc/privatememoryspace.md +++ b/doc/privatememoryspace.md @@ -1,6 +1,8 @@ PrivateMemorySpace ================== +*Using `__private` memory space in Aparapi kernels. Phase-Implementation Updated Sep 14, 2014 by barneydp...@gmail.com* + ## Introduction The private memory space identifier (just "private" is also recognised) can be applied to struct fields in order to indicate that the data is not shared with/accessible to other kernel instances. Whilst this is the default for non-array data, it must be explicitly applied to array fields in order to make them private. Aparapi now supports arrays in the private memory space. 
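The introduction above can be made concrete with a short kernel sketch. This example is not part of the original page: it assumes the Aparapi jar is on the classpath (the package was `com.amd.aparapi` in the Google Code era, `com.aparapi` in later forks) and uses the `_$private$<size>` field-name suffix, which Aparapi recognises as a request to place an array in the `__private` memory space; the exact convention may differ between Aparapi releases.

```java
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

// Hypothetical sketch: each work-item gets its own copy of the scratch array,
// because the _$private$8 suffix asks Aparapi to emit it in __private memory.
public class PrivateScratchSketch {
    public static void main(String[] args) {
        final int size = 1024;
        final float[] in = new float[size];
        final float[] out = new float[size];

        Kernel kernel = new Kernel() {
            // Private (per work-item) scratch array of 8 elements.
            final float[] scratch_$private$8 = new float[8];

            @Override public void run() {
                int id = getGlobalId();
                // Fill the private scratch buffer; no other work-item can
                // see it, so no synchronization is required.
                for (int i = 0; i < 8; i++) {
                    scratch_$private$8[i] = in[id] * i;
                }
                float sum = 0f;
                for (int i = 0; i < 8; i++) {
                    sum += scratch_$private$8[i];
                }
                out[id] = sum;
            }
        };
        kernel.execute(Range.create(size));
        kernel.dispose();
    }
}
```

Because the array is private, it is never transferred between host and device and each kernel instance can mutate it freely, unlike a shared global array.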
diff --git a/doc/settinguplinuxhsamachineforaparapi.md b/doc/settinguplinuxhsamachineforaparapi.md index 14353654a23402946b6a3956a20d34ed2b7a0da8..8bf0981a091bfce0a09a1c6e6d395014eb1225e6 100644 --- a/doc/settinguplinuxhsamachineforaparapi.md +++ b/doc/settinguplinuxhsamachineforaparapi.md @@ -1,5 +1,5 @@ # Setting up Linux HSA machine for APARAPI - +*How to set up a Linux HSA machine for testing HSA enabled Aparapi Updated May 22, 2014 by frost.g...@gmail.com* ## Introduction Now that HSA hardware is generally available I figured it was time to describe how to set up an HSA enabled Linux platform so that it can run Aparapi.