diff --git a/.gitignore b/.gitignore
index 202ac33242205c08f7d20fafac35d38ee6024f83..8a8717d5df41d15bf703b40693a5711633bad079 100644
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,4 @@ hs_err_pid*
 **/classes/
 **/dist/
 **/include/
+**/nbproject/
diff --git a/README.md b/README.md
index f8a09db984eb52b698cef0753ac8738f253f8f25..f28ea63fba21a14b814954ad9ab769525c0aa51a 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ aparapi
 
 This is the new home of Aparapi.
 
-Until all code, issue tickets, wiki pages, etc. are moved please refer to [Aparapi Google Code](https://code.google.com/p/aparapi/) for documentation.
+Please refer to the [current documentation](doc/README.md) or to the older documentation at [Aparapi Google Code](https://code.google.com/p/aparapi/).
 
 We plan to implement Binary downloads at [GitHub Releases](https://help.github.com/articles/about-releases).
 
diff --git a/doc/AccessingMultiDimNDRangeProposal.md b/doc/AccessingMultiDimNDRangeProposal.md
new file mode 100644
index 0000000000000000000000000000000000000000..188cf9bab385a6267eb0723857ff5807c45609c4
--- /dev/null
+++ b/doc/AccessingMultiDimNDRangeProposal.md
@@ -0,0 +1,197 @@
+#AccessingMultiDimNDRangeProposal
+*A proposal for accessing multi-dim ND range execution. Updated Dec 14, 2011 by frost.g...@gmail.com*
+
+We can discuss this proposal either here (in comments) or via the discussion list here.
+
+Note this is nothing to do with accessing Java 2D arrays in Aparapi. This discussion is focused on the ability to expose the execution of kernels over 1, 2 or 3 dimensions. The memory in each case is a single contiguous region (like a single dimension primitive array).
+
+At present an Aparapi kernel can only be executed using a single dimension. If we wish to represent execution over a WIDTH x HEIGHT element grid we would execute over the range (WIDTH*HEIGHT) and manually divide/mod getGlobalId() by WIDTH to determine the x and y for each element.
+
+Similarly we would multiply y by WIDTH and add x (y*WIDTH+x) to convert an x,y location to a linear global id.
+
+    final static int WIDTH=128;
+    final static int HEIGHT=64;
+    final int in[] = new int[WIDTH*HEIGHT];
+    final int out[] = new int[WIDTH*HEIGHT];
+    Kernel kernel = new Kernel(){
+       public void run(){
+          int x = getGlobalId()%WIDTH;
+          int y = getGlobalId()/WIDTH;
+          if (!(x==1 || x==(WIDTH-1) || y==1 || y==(HEIGHT-1))){
+             int sum = 0;
+             for (int dx =-1; dx<2; dx++){
+               for (int dy =-1; dy<2; dy++){
+                 sum+=in[(y+dy)*WIDTH+(x+dx)];
+               }
+             }
+             out[y*WIDTH+x] = sum/9;
+             // or out[getGlobalID()] = sum/9;
+          }
+       }
+
+    };
+    kernel.execute(WIDTH*HEIGHT);
+
+OpenCL natively allows the user to execute over 1, 2 or 3 dimension grids via the clEnqueueNDRangeKernel() method.
+
+We chose not to expose this in Aparapi but there have been requests for us to allow it.
+
+There are a number of things to consider here:
+
+1. Extending the syntax of kernel.execute() to allow multi dimensional grids.
+1. Mapping Kernel methods to OpenCL's get_local_id(int dim), get_local_size(int dim), get_group_id(int dim), etc. At present we map kernel.getGlobalId() to get_global_id(0).
+1. Handling all of these when an application drops back to JTP mode.
+
+##Extending Kernel.execute(int range)
+Sadly we can't overload Kernel.execute(int range), Kernel.execute(int xrange, int yrange) and Kernel.execute(int xrange, int yrange, int zrange) because we already have kernel.execute(int, int) mapped for executing multiple passes over the linear range.
+
+Remember
+
+    for (int pass=0; pass<20; pass++){
+       kernel.execute(1024);
+    }
+Is equivalent to
+
+    kernel.execute(1024, 20);
+I think I would prefer
+
+    Kernel.execute(int range)
+    Kernel.execute(int range, int passes)
+    Kernel.executeXY(int xrange, int yrange)
+    Kernel.executeXY(int xrange, int yrange, int passes)
+    Kernel.executeXYZ(int xrange, int yrange, int zrange)
+    Kernel.executeXYZ(int xrange, int yrange, int zrange, int passes)
+Obviously in the above calls we are only supplying the global bounds for the grid. We could also provide mappings allowing local ranges. I think I would prefer
+
+    Kernel.executeLocal(int range, int local)
+    Kernel.executeLocal(int range, int local, int passes)
+    Kernel.executeXYLocal(int xrange, int yrange, int xlocalrange, int ylocalrange)
+    Kernel.executeXYLocal(int xrange, int yrange, int xlocalrange, int ylocalrange, int passes)
+    Kernel.executeXYZLocal(int xrange, int yrange, int zrange, int xlocalrange, int ylocalrange, int zlocalrange)
+    Kernel.executeXYZLocal(int xrange, int yrange, int zrange, int xlocalrange, int ylocalrange, int zlocalrange, int passes)
+Another alternative may be to create Range classes
+
+    class Range{
+      int passes;
+      int width;
+      static Range create(int width);
+      static Range create(int width, int passes);
+    }
+
+    class Range2D extends Range{
+       int height;
+       static Range create(int width, int height);
+       static Range create(int width, int height, int passes);
+
+    }
+
+    class Range3D extends Range2D{
+       int depth;
+       static Range create(int width, int height, int depth);
+       static Range create(int width, int height, int depth, int passes);
+    }
+With appropriate constructors (or factory methods) to allow
+
+    Kernel.execute(Range range)
+
+Then execution would simply be:
+
+    Kernel.execute(Range.create(1,1))
+
+We can also arrange for the group size to be placed in the base Range class.
+
+    class Range{
+      int groupSize;
+      int passes;
+      int width;
+      static Range create(int width);
+      static Range create(int width, int passes);
+    }
+
+##Mapping to OpenCL multi dim methods, e.g. get_global_id(1), get_local_size(2), etc.
+We could just add getGlobalId(int dim), getLocalSize(int dim) etc to replicate OpenCL methods.
+
+I would prefer to offer the following global mappings
+
+|Kernel	| OpenCL|
+|-----|------|
+|getGlobalId()|	get_global_id(0)|
+|getGlobalX()|	get_global_id(0)|
+|getGlobalY()|	get_global_id(1)|
+|getGlobalZ()|	get_global_id(2)|
+|getGlobalSize()|	get_global_size(0)|
+|getGlobalWidth()|	get_global_size(0)|
+|getGlobalHeight()|	get_global_size(1)|
+|getGlobalDepth()|	get_global_size(2)|
+
+And the following local mappings
+
+|Kernel|	OpenCL|
+|-----|-------|
+|getLocalId()|	get_local_id(0)|
+|getLocalX()|	get_local_id(0)|
+|getLocalY()|	get_local_id(1)|
+|getLocalZ()|	get_local_id(2)|
+|getLocalSize()|	get_local_size(0)|
+|getLocalWidth()|	get_local_size(0)|
+|getLocalHeight()|	get_local_size(1)|
+|getLocalDepth()|	get_local_size(2)|
+
+##An example
+
+    final static int WIDTH=128;
+    final static int HEIGHT=64;
+    final int in[] = new int[WIDTH*HEIGHT];
+    final int out[] = new int[WIDTH*HEIGHT];
+    Kernel kernel = new Kernel(){
+       public void run(){
+          int x = getGlobalX();
+          int y = getGlobalY();
+          if (!(x==1 || x==(getGlobalWidth()-1) || y==1 || y==(getGlobalHeight()-1))){
+             int sum = 0;
+             for (int dx =-1; dx<2; dx++){
+               for (int dy =-1; dy<2; dy++){
+                 sum+=in[(y+dy)*getGlobalWidth()+(x+dx)];
+               }
+             }
+             out[y*getGlobalWidth()+x] = sum/9;
+             // or out[getGlobalID()] = sum/9;
+          }
+       }
+
+    };
+    kernel.executeXY(WIDTH, HEIGHT);
+
+Or if we choose the Range class approach.
+
+    final static int WIDTH=128;
+    final static int HEIGHT=64;
+    final int in[] = new int[WIDTH*HEIGHT];
+    final int out[] = new int[WIDTH*HEIGHT];
+    Kernel kernel = new Kernel(){
+       public void run(){
+          int x = getGlobalX();
+          int y = getGlobalY();
+          if (!(x==1 || x==(getGlobalWidth()-1) || y==1 || y==(getGlobalHeight()-1))){
+             int sum = 0;
+             for (int dx =-1; dx<2; dx++){
+               for (int dy =-1; dy<2; dy++){
+                 sum+=in[(y+dy)*getGlobalWidth()+(x+dx)];
+               }
+             }
+             out[y*getGlobalWidth()+x] = sum/9;
+             // or out[getGlobalID()] = sum/9;
+          }
+       }
+
+    };
+    kernel.execute(Range2D.create(WIDTH, HEIGHT));
+
+##Handling this from JTP mode
+Mapping to OpenCL for this is all fairly straightforward.
+
+In Java JTP mode we will have to emulate this. For get_global_id(0..2) (getGlobalX(), getGlobalY() and getGlobalZ() using our proposed Aparapi Java mappings) we can of course offer reasonable implementations; this just requires the Java code to essentially nest three loops (or emulate that nesting) and set globalX, globalY and globalZ inside each nesting, as sketched below.
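+
+The following is only an illustrative sketch (not Aparapi's actual JTP dispatcher; KernelBody2D and dispatch2D are invented names for illustration) of how a 2D global range could be emulated with nested Java loops:
+
+    // Sketch only: emulating a 2D global range in Java by nesting loops.
+    // KernelBody2D and dispatch2D are hypothetical names, not Aparapi APIs.
+    interface KernelBody2D{
+       void invoke(int globalX, int globalY);
+    }
+
+    static void dispatch2D(int globalWidth, int globalHeight, KernelBody2D body){
+       for (int y = 0; y < globalHeight; y++){
+          for (int x = 0; x < globalWidth; x++){
+             body.invoke(x, y); // the kernel body would see these values via getGlobalX()/getGlobalY()
+          }
+       }
+    }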
+
+For get_local_size(0..2) (getLocalWidth(), getLocalHeight() and getLocalDepth() using our proposed Aparapi Java mappings) we will need to break the globalWidth/globalHeight/globalDepth into some arbitrary equal 'chunks' (note I am avoiding using the word 'groups' here to avoid confusion with get_group_size(0..2)).
+
+At present we always create a synthetic group in JTP mode which is the # of cores. This will need to be changed. If the user requests a grid (64,64,8,8) (global width 64, global height 64, local width 8, local height 8) then we will have to create a JTP group of 64 (8x8) and, just in case the kernel code contains a barrier, we will need to ensure we launch 64 threads for this group. From our experience it is best to launch one thread per core, so we may lose some JTP performance executing in this mode.
\ No newline at end of file
diff --git a/doc/AddingLambdasToAparapi.md b/doc/AddingLambdasToAparapi.md
new file mode 100644
index 0000000000000000000000000000000000000000..07e9ab9fed7d3bc3de78308393a13c2d8fd6c55e
--- /dev/null
+++ b/doc/AddingLambdasToAparapi.md
@@ -0,0 +1,106 @@
+#AddingLambdasToAparapi
+*Adding Java 8 Lambda Support to Aparapi. Updated Jun 24, 2013 by frost.g...@gmail.com*
+
+In the recently added *lambda* branch we have been experimenting with adding lambda support to Aparapi. We believe that this upcoming Java 8 feature will be a natural way to express parallel algorithms which can be executed on the GPU.
+
+A link to the branch preview can be found here.
+
+You will need to get the latest binary build of *Project Lambda* to experiment with these new features. The *Project Lambda* preview can be found here.
+
+Once you have a lambda-enabled Java 8 JDK, set JAVA_HOME to point at it and build Aparapi.
+
+So from the root of SumatraExperiments just use
+
+    $ ant
+We are slowly walking through some of the Aparapi demos and converting them. At present NBody and Mandel have been converted.
+
+With lambda-enabled Aparapi we remove the need to derive from a base Kernel class; we will allow the user to express their code as a lambda using the following basic pattern
+
+    Device.bestGPU().forEach(int range, IntConsumer lambda);
+The Java 8 stream API defines a type called java.util.function.IntConsumer. This is essentially an interface with a Single Abstract Method (these types are referred to as SAM types in the stream API code).
+
+IntConsumer looks something like....
+
+    interface IntConsumer{
+       public void accept(int Id);
+    }
+So you can run the familiar 'squares' kernel using
+
+    int in[] = ..//
+    int out[] = .../
+    Device.bestGPU().forEach(in.length, (i)->{
+       out[i] = in[i] * in[i];
+     });
+
+Instead of
+
+    int in[] = ..//
+    int out[] = .../
+    Device.bestGPU().forEach(in.length, new IntConsumer(){
+       public void accept(int i){
+           out[i] = in[i] * in[i];
+       }
+     });
+
+To accommodate lambdas we created Device.forEach(int range, IntConsumer ic), which converts the bytecode of the ic parameter to OpenCL at runtime. The captured args (in, out and i - in this case) are passed to the GPU and the kernel is executed.
+
+During our early experiments we encountered an interesting issue. The new 'lambdafied' javac uses Java 7 method handles and invoke dynamic instructions to dispatch the lambda code. It does this by injecting a call to a MethodHandle factory into the call site. At runtime, this factory creates a synthetic class (to capture call-site args) and passes this to our Device.forEach().
+
+We needed to analyse this synthetically generated class in order to work out which args need to be sent to the GPU. Of course we have a bunch of tools already in Aparapi for analyzing bytecode, but this code expects to find bytecode in class files (either in a jar or on disk), so we had to find a way to expose these classfile bytes to Aparapi.
+
+We have a couple of proposed solutions for solving this. The most promising is to turn the aparapi.dll/aparapi.so native library (used by Aparapi at runtime) into a JVMTI agent (like hprof). JVMTI agents are native libraries which have access to some aspects of a running JVM (via the JVM Tool Interface). We have a prototype JVMTI agent which 'listens' for classfiles which represent these 'synthetic lambda helpers' and allows us to get hold of the bytecode for these classes.
+
+This will mean that in future we will change how Aparapi is launched.
+
+Instead of
+
+    $ java -Djava.library.path=path/to/aparapi -classpath path/to/aparapi/aparapi.jar:your.jar YourClass
+    
+We will use
+
+    $ java -agentlib=path/to/aparapi/aparapi.dll -classpath path/to/aparapi/aparapi.jar:your.jar YourClass
+We are also looking into the possibility of having this agent provide the bytecode for all Aparapi classes. We believe that this will enable us to ultimately remove MethodModel/ClassModel and even the InstructionSet classes and handle all of this in JNI.
+
+We would welcome comments on these proposals. Either here, or in the discussion list. Let us know what you think.
+
+##Consequences of lambdification of Aparapi.
+
+* No support for local memory, group size or barriers in Lambda form
+* Calls to Kernel base class methods (such as getGlobalId()) will not be allowed. The 'global id' will be passed as an arg to the lambda.
+* We will need to add support for calling static methods (of course the bytecode for the called methods cannot violate Aparapi restrictions).
+* We might need to drop support for multi dimension dispatch. This is more a convergence story with Sumatra (which is unlikely to support this)
+* Unlikely that explicit buffer management will be simple.
+* We can use lambdas for control as well as for the kernel itself. See examples below.
+
+##Alternate forms for kernel dispatch
+
+This version would allow us to carry over Aparapi's device selection
+
+    Device.bestGPU().forEach(1024, i->{lambda});
+This version would allow us to carry over Aparapi's Range selection
+
+    Device.bestGPU().range2D(width, height).forEach(1024, rid->{lambda});
+This version would allow us to mimic Kernel.execute(1024, 5)
+
+    Device.bestGPU().forEach(1024, 5, (id, passid)->{lambda});
+We could even have the range iterated over until some other lambda determines we are done
+
+    Device.bestGPU().forEachUntil(1024, id->{lambda}, ()->{predicate lambda});
+Explicit buffer handling could be removed in many cases by allowing the bytecode of the 'until' predicate to be snooped for buffer references.
+
+    int lotsOfData[] = ...;
+    boolean found[] = new boolean[1];
+    Device.bestGPU().forEachUntil(1024, 5,
+       (id, passid)->{ /* mutate lotsOfData, found[0]=true when done */ },
+       ()->{return found[0];});
+In the above cases Aparapi can determine that between each pass it needs to *only* copy found[] back from the device.
+
+There is no reason that the range itself needs to be constant, we can use a collection/iterable. This helps with some reductions.
+
+    int range[] = new int[]{1024,512,128,64,32,16,8,4,2,1,0};
+    Device.bestGPU().forEach(range, id->{lambda});
+Or the range can be a lambda itself; here we specify a start and end value for the range, and a lambda to provide each step.
+
+    Device.bestGPU().forEach(1024, 1, r->{return(r/2);},(pass, r, id)->{lambda});
+    // or
+    Device.bestGPU().forEach(1, 1024, r->{return(r*2);},(pass, r, id)->{lambda});
diff --git a/doc/AddressSpacesUsingBuffers.md b/doc/AddressSpacesUsingBuffers.md
new file mode 100644
index 0000000000000000000000000000000000000000..a311db2f4e98dce9b71fec915c0b19c5859358ae
--- /dev/null
+++ b/doc/AddressSpacesUsingBuffers.md
@@ -0,0 +1,44 @@
+#AddressSpacesUsingBuffers
+*Proposal For OpenCL address space support using java Buffers instead of arrays. Updated Dec 8, 2011 by frost.g...@gmail.com*
+The general idea is to have an AS_PRIMTYPE_Buffer class for each address space (AS) and primitive type (PRIMTYPE). Here is an example for LocalFloatBuffer, which would be a buffer of floats mapped to the OpenCL local address space.
+
+As with normal FloatBuffers, the float elements are accessed using get and put methods
+
+Although a LocalFloatBuffer conceptually exists only for the lifetime of a workgroup, it is still constructed in the enclosing Kernel, not in the Kernel.Entry.run method. (Aparapi does not support constructing new objects inside the Kernel.Entry.run method).
+
+A typical declaration would be:
+
+    LocalFloatBuffer locbuf = new LocalFloatBuffer(12);
+The argument 12 here means that 12 floats would be used by each workitem in the workgroup. So the total buffer would be LocalSize*12 floats. Aparapi would at runtime allocate a total local OpenCL buffer to be this size. Note how this removes the need for the programmer to specify localSize anywhere.
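+
+As a rough sketch (this is not a committed API; the backing array and the allocate() hook the runtime would call are assumptions for illustration), LocalFloatBuffer might look something like:
+
+    public class LocalFloatBuffer{
+       private final int floatsPerWorkItem;
+       private float[] backing; // in JTP mode this could simply be a Java array
+
+       public LocalFloatBuffer(int _floatsPerWorkItem){
+          floatsPerWorkItem = _floatsPerWorkItem;
+       }
+
+       // called by the runtime once the workgroup (local) size is known
+       void allocate(int localSize){
+          backing = new float[localSize * floatsPerWorkItem];
+       }
+
+       public void put(int index, float value){ backing[index] = value; }
+       public float get(int index){ return backing[index]; }
+    }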
+
+Note: For each Kernel.Entry.execute(globalSize) call, the runtime will determine an appropriate workgroup size, also called localSize, depending on the capabilities of the device, and on the globalSize. The localSize will always evenly divide the globalSize, in other words all workgroups for an execute context will be the same size. A workitem can determine localSize by calling getLocalSize().
+
+Because workitems operate simultaneously and in an undetermined order, each workitem will generally only use put on its own portion of the LocalFloatBuffer between the localBarriers, and will generally only use get outside the localBarriers.
+
+Some example code (from NBody) follows. Here each workitem copies a "BODY" consisting of 4 floats. The global array contains 4*globalSize floats, and we want to iterate thru this global array, copying it into local memory and operating on it there. This will take globalSize/localSize "tiles". For each tile, each workitem fills in one "BODY"'s worth of 4 elements.
+
+      // outside run method...
+      final int BODYSIZE = 4;
+      LocalFloatBuffer pos_xyzm_local = new LocalFloatBuffer(BODYSIZE);
+      //
+      // inside run method...
+      int numTiles = globalSize / localSize;
+      for (int i = 0; i < numTiles; ++i) {
+         // load one tile into local memory
+         int idx = i * localSize + localId;  // index into a global memory array
+         localBarrier();
+         pos_xyzm_local.put(localId * BODYSIZE + 0, pos_xyzm[idx * BODYSIZE + 0]);
+         pos_xyzm_local.put(localId * BODYSIZE + 1, pos_xyzm[idx * BODYSIZE + 1]);
+         pos_xyzm_local.put(localId * BODYSIZE + 2, pos_xyzm[idx * BODYSIZE + 2]);
+         pos_xyzm_local.put(localId * BODYSIZE + 3, pos_xyzm[idx * BODYSIZE + 3]);
+         // Synchronize to make sure data is available for processing
+         localBarrier();
+
+         // now the entire LocalFloatBuffer has been filled.
+         // each workitem might use the entire Buffer
+         // which consists of localSize BODYs
+         for (int j = 0; j < localSize; ++j) {
+            float r_x = pos_xyzm_local.get(j * BODYSIZE + 0) - myPos_x;
+            float r_y = pos_xyzm_local.get(j * BODYSIZE + 1) - myPos_y;
+            float r_z = pos_xyzm_local.get(j * BODYSIZE + 2) - myPos_z;
+            // ...etc
\ No newline at end of file
diff --git a/doc/AparapiExtensionProposal.md b/doc/AparapiExtensionProposal.md
new file mode 100644
index 0000000000000000000000000000000000000000..087695cb29f920e57d084439b1a4cd616b8251a8
--- /dev/null
+++ b/doc/AparapiExtensionProposal.md
@@ -0,0 +1,258 @@
+#AparapiExtensionProposal
+*A proposed aparapi extension mechanism. Updated Feb 29, 2012 by frost.g...@gmail.com*
+
+##Here is a proposed Aparapi extension mechanism
+This would allow a developer to create a library that could be used by Aparapi Kernel code. The library would include OpenCL and Java implementations.
+
+We will treat this as a live document. Please join the discussions at http://groups.google.com/group/aparapi-discuss/browse_thread/thread/7ec81ecb2169aa4 and I will update this page to reflect what I think the latest decisions are:-
+
+Currently Aparapi allows Java bytecode to be converted to OpenCL at runtime. Only the OpenCL generated by this conversion process is made available. Sometimes for performance reasons we might want to allow hand coded OpenCL to be called from Aparapi kernel code.
+
+Here we will present a strawman API which would allow extension points to be added by an end user or by a library provider.
+
+We will use an FFT usecase to walk through the steps.
+
+The FFT (Fast Fourier Transform) algorithm can be coded in Aparapi, but handcrafted OpenCL is likely to be more performant. The goal is to allow Aparapi to do what it does best, i.e. manage the host buffer allocations, and provide a mechanism for binding arbitrary OpenCL code at runtime.
+
+So let's assume we wanted an Aparapi Kernel to be able to call an Aparapi extension for computing FFT (forward and reverse). The Kernel implementation might look like this.
+
+    public static class BandStopFilter extends Kernel{
+       FFT fft = new FFT(); // Create an instance of the Extension point.
+       float[] real;
+       float[] imaginary;
+
+      BandStopFilter (float[] _real){
+         real = _real;
+         imaginary = new float[_real.length];
+
+      }
+
+      @Override public void run() {
+         fft.forward(real, imaginary);
+      }
+    }
+
+The main method then would just execute the Kernel using the familiar kernel.execute() method :-
+
+    public static void main(String[] args) {
+       float[] data = new float[1024];
+       BandStopFilter  kernel = new BandStopFilter (data);
+       kernel.execute(data.length);
+    }
+
+Essentially we want the FFT.forward(float[] _real, float[] _imaginary) and FFT.reverse(float[] _real, float[] _imaginary) methods to be callable from Aparapi Kernel code. We want Aparapi to handle the call-forwarding and the argument/buffer mapping and transfers. We want Aparapi to call the Java methods normally if OpenCL is not available but would like Aparapi to use the implementor provided OpenCL if it is. So the implementor will be required to provide both a Java and an OpenCL version of the callable methods because Aparapi will decide which version needs to be called at runtime.
+
+Any extension point is required to implement the AparapiExtensionPoint interface.
+
+    public interface AparapiExtensionPoint {
+       public String getOpenCL();
+    }
+Here is a possible (although incomplete) FFT implementation.
+
+    public class FFT implements AparapiExtensionPoint{
+        @AparapiCallable public void forward(
+            @Global @ReadWrite float[] _data,
+            @Global @ReadWrite float[] _imaginary) {
+              // java implementation
+           }
+
+        @AparapiCallable public void reverse(
+            @Global @ReadWrite float[] _data,
+            @Global @ReadWrite float[] _imaginary) {
+              // java implementation
+            }
+
+        @Override public String getOpenCL() {
+              return ""
+              +"void my_package_FFT_forward("
+              +"   __global float* _real,"
+              +"   __global float* _imaginary )"
+              +"   {"
+              +"       // OpenCL implemention"
+              +"   }"
+              +"void my_package_FFT_reverse("
+              +"   __global float* _real,"
+              +"   __global float* _imaginary )"
+              +"   {"
+              +"       // OpenCL implemention"
+              +"   }";
+           }
+    }
+
+The implementer’s class will be required to define the callable aparapi methods as well as implement the `getOpenCL()` method so that the OpenCL implementation of those methods can be extracted at run-time.
+
+Aparapi will provide annotations to decorate the methods and args/parameters of the exposed callable methods. These annotations provide information so that Aparapi can locate the callable methods, as well as parameter hints to help coordinate buffer types (global, local, constant) and transfer directions (read, write, readWrite) when executing the methods from a Kernel. This information is consulted during the normal bytecode analysis that Aparapi performs when it hits the call site.
+
+Note that the Java code inside the `@AparapiCallable` functions (or code executed from it) is not constrained to the normal Aparapi subset. It can be any legitimate Java code, but should be thread safe (because it will be called from JTP mode!).
+
+Note also that the OpenCL code yielded from the `getOpenCL()` method is assumed to be complete; Aparapi does not attempt to parse this code. If the code fails to compile Aparapi will fall back and execute the whole Kernel in JTP mode.
+
+BTW we show getOpenCL() returning a String literal. This is most likely to be how code is returned. However, it could be extracted from a file, from a resource in the jar, or dynamically generated based on some state. For example an FFT implementation might choose to use different code for radix2 or radix4 implementations (based on a parameter passed to the `FFT()` constructor - say `FFT(FFT.RADIX2)`), in which case the getOpenCL() method might yield different code.
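+
+For example, a sketch of the resource-based variant might look like this (the resource name fft.cl is purely hypothetical; the OpenCL source would be bundled in the jar next to the class):
+
+    @Override public String getOpenCL() {
+       // Sketch only: load the OpenCL source from a resource bundled in the jar.
+       java.util.Scanner scanner = new java.util.Scanner(FFT.class.getResourceAsStream("fft.cl"), "UTF-8");
+       try {
+          return scanner.useDelimiter("\\A").next();
+       } finally {
+          scanner.close();
+       }
+    }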
+
+The above proposal covers the case where a third party might want to provide an Aparapi extension point as a library.
+
+We might also consider allowing single methods within the Kernel to be optimized, where the OpenCL is made available via the AparapiCallable annotation. The method would still use the same annotations for the args (to allow buffer transfers to be optimized).
+
+    Kernel k = new Kernel(){
+       @AparapiCallable(" /* opencl code for sum() goes here */ ")
+       int sum(@Global @ReadWrite int[] data, int length){
+          int sum = 0;
+          for (int v:data){
+             sum += v;
+          }
+          return sum;
+       }
+       @Override public void run(){
+          sum(data, data.length);
+       }
+    };
+
+Here are the proposed new interfaces/annotations
+
+    public interface AparapiExtensionPoint {
+       public String getOpenCL();
+    }
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
+    public @interface AparapiCallable {
+         String value() default "";
+    }
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface Global {}
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface Local {}
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface Constant {}
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface ReadWrite {}
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface ReadOnly {}
+
+    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER)
+    public @interface WriteOnly {}
+
+And here is the example code in one chunk
+
+    public class FFT implements AparapiExtensionPoint{
+        @AparapiCallable public void forward(
+            @Global @ReadWrite float[] _data,
+            @Global @ReadWrite float[] _imaginary) {
+              // java implementation
+           }
+
+      @AparapiCallable public void reverse(
+          @Global @ReadWrite float[] _data,
+          @Global @ReadWrite float[] _imaginary) {
+            // java implementation
+          }
+
+      @Override public String getOpenCL() {
+            return ""
+            +"void my_package_FFT_forward("
+            +"   __global float* _real,"
+            +"   __global float* _imaginary )"
+            +"   {"
+            +"       // OpenCL implemention"
+            +"   }"
+            +"void my_package_FFT_reverse("
+            +"   __global float* _real,"
+            +"   __global float* _imaginary )"
+            +"   {"
+            +"       // OpenCL implemention"
+            +"   }";
+         }
+    }
+
+    public class BandStopFilter extends Kernel{
+       FFT fft = new FFT();
+       float[] real;
+       float[] imaginary;
+
+       BandStopFilter (float[] _real){
+          real = _real;
+          imaginary = new float[_real.length];
+
+       }
+
+       @Override public void run() {
+          fft.forward(real, imaginary);
+       }
+    }
+
+    public static void main(String[] args) {
+       float[] data = new float[1024];
+       BandStopFilter  kernel = new BandStopFilter (data);
+       kernel.execute(data.length);
+    }
+
+After discussion I think we are converging on a less complex solution. This is based on Witold's feedback suggestion (see below) where we use OpenCL annotations rather than forcing the implementation of the interface and the `getOpenCL()` method as originally suggested.
+
+So we will create an `@OpenCL` annotation for classes/methods.
+
+The `@OpenCL` annotation on the methods will contain the OpenCL source replacement for a specific method. The arg list will be created by Aparapi.
+
+The @OpenCL annotation on a class allows us to optionally introduce common code (helper methods, #pragmas, constants) which will precede the method declarations in the OpenCL code.
+
+So an FFT example whereby forward() and reverse() methods both called a common foo() method might look like this.
+
+    @OpenCL(common="/* common void foo(){} + maybe #pragmas + accessable
+    global fields declared here */")
+    public class FFT extends AparapiExtensionPoint {
+          @OpenCL(signature="//function signature - OPTIONAL", body="{ /* uses foo(); */ }")
+          public void forward(
+              @Global @ReadWrite float[] _data,
+              @Global @ReadWrite float[] _imaginary) {
+                // java implementation
+             }
+          @OpenCL(function="{  /*uses foo(); */) }")
+          public void reverse(
+              @Global @ReadWrite float[] _data,
+              @Global @ReadWrite float[] _imaginary) {
+                // java implementation
+              }
+       }
+    }
+
+To invoke from an Aparapi kernel, we should be able to do something like
+
+    public class BandStopFilter extends Kernel{
+         FFT fft = new FFT();
+         float[] real;
+         float[] imaginary;
+
+         BandStopFilter (float[] _real){
+            real = _real;
+            imaginary = new float[_real.length];
+
+         }
+
+         @Override public void run() {
+            fft.forward(this, real, imaginary);
+         }
+      }
+
+      public static void main(String[] args) {
+         float[] data = new float[1024];
+         BandStopFilter  kernel = new BandStopFilter (data);
+         kernel.execute(data.length);
+      }
+
+Ideally we would also like to invoke FFT directly (instead of via a Kernel). This is tricky because the forward() and reverse() methods will need to be invoked across a range, and of course the dispatch across the range needs to be initiated from Aparapi.
+
+The only way I can see how to do this is to force the creation of an interface so we can use Java's existing Proxy mechanism to create a wrapper.
+
+    @OpenCL(wraps=FFT.class)
+    interface FFTInterface{
+       public void forward( Range _range, float[] _data, float[] _imaginary);
+       public void reverse( Range _range, float[] _data, float[] _imaginary);
+    }
+
+Then provide a mechanism for extracting a proxy and invoking it.
+
+    float[] real = //??
+    float[] imag = //??
+    Aparapi.wrap(FFTInterface.class).forward(range, real, imag);
+
+I can't see a cleaner solution.
diff --git a/doc/AparapiPatterns.md b/doc/AparapiPatterns.md
new file mode 100644
index 0000000000000000000000000000000000000000..7baf1cbb8589c533aae1724508974ad1c8ec08fe
--- /dev/null
+++ b/doc/AparapiPatterns.md
@@ -0,0 +1,129 @@
+#AparapiPatterns
+*Examples and code fragments to demonstrate Aparapi features. Updated Jul 24, 2012 by frost.g...@gmail.com*
+
+##Aparapi Patterns
+
+The following suggestions help solve some common problems found in using Aparapi.
+
+Additional suggestions and solutions to extend this list would be welcome.
+
+##How do I return data from a kernel if I can’t write to kernel fields?
+
+Use a small array buffer (possibly containing a single element) and assign it from the kernel.
+
+For example, the following kernel code detects whether the buffer[] contains the value 1234. The flag (true or false) is returned in found[0].
+
+    final int buffer[] = new int[HUGE];
+    final boolean found[] = new boolean[]{false};
+    // fill buffer somehow
+    Kernel kernel = new Kernel(){
+        @Override public void run(){
+              if (buffer[getGlobalId()]==1234){
+                    found[0]=true;
+              }
+        }
+    };
+    kernel.execute(buffer.length);
+
+This code does include a race condition, whereby more than one work item might find 1234 at its `Kernel.getGlobalId()` index and try to set `found[0]`. This is not a problem here, because we don't care if multiple kernel executions match, provided one flips the value of `found[0]`.
+
+##How can I use Aparapi and still maintain an object-oriented view of my data?
+
+See the NewFeatures page. Aparapi can now handle simple arrays of objects, which minimizes the amount of refactoring required to experiment with Aparapi. However, performance is still likely to be better if your algorithm operates on data held in parallel primitive arrays. To get higher performance from Aparapi with minimal exposure to data in this parallel primitive array form, we can (with a little work) allow both forms of data to co-exist. Let’s reconsider the NBody problem (http://en.wikipedia.org/wiki/N-body_problem) .
+
+A Java developer writing an NBody solution would most likely create a Body class:
+
+    class Body{
+      float x,y,z;
+      float getX(){return x;}
+      void setX(float _x){ x = _x;}
+      float getY(){return y;}
+      void setY(float _y){ y = _y;}
+      float getZ(){return z;}
+      void setZ(float _z){ z = _z;}
+
+
+      // other data related to Body unused by positioning calculations
+    }
+
+The developer would also likely create a container class (such as NBodyUniverse) that manages the positions of multiple Body instances.
+
+    class NBodyUniverse{
+         final Body[] bodies;
+         NBodyUniverse(final Body[] _bodies){
+            bodies = _bodies;
+            for (int i=0; i<bodies.length; i++){
+               bodies[i].setX((float)(Math.random()*100));
+               bodies[i].setY((float)(Math.random()*100));
+               bodies[i].setZ((float)(Math.random()*100));
+            }
+         }
+         void adjustPositions(){
+           // can use new array of object Aparapi features, but is not performant
+         }
+    }
+    Body[] bodies = new Body[BODIES];
+    for (int i=0; i<bodies.length; i++){
+        bodies[i] = new Body();
+    }
+    NBodyUniverse universe = new NBodyUniverse(bodies);
+    while (true){
+       universe.adjustPositions();
+       // display NBodyUniverse
+    }
+
+The NBodyUniverse.adjustPositions() method contains the nested loops (adjusting each body position based on forces impinging on it from all of the other bodies), making it an ideal Aparapi candidate.
+
+Even though this code can now be written by accessing the x, y and z ordinates of Body[] via getters/setters, the most performant Aparapi implementation is the one that operates on parallel arrays of floats containing x, y and z ordinates, with Body[10]’s state conceptually stored across x[10], y[10] and z[10].
+
+So for performance reasons, you can do something like this:
+
+    class Body{
+        int idx;
+        NBodyUniverse universe;
+        void setUniverseAndIndex(NBodyUniverse _universe, int _idx){
+            universe = _universe;
+            idx = _idx;
+        }
+
+        // other fields not used by layout
+
+        void setX(float _x){ universe.x[idx]=_x;}
+        void setY(float _y){ universe.y[idx]=_y;}
+        void setZ(float _z){ universe.z[idx]=_z;}
+        float getX(){ return universe.x[idx];}
+        float getY(){ return universe.y[idx];}
+        float getZ(){ return universe.z[idx];}
+    }
+    class NBodyUniverse {
+         final Body[] bodies;
+         final float[] x, y, z;
+         NBodyUniverse(Body[] _bodies){
+            bodies = _bodies;
+            x = new float[bodies.length];
+            y = new float[bodies.length];
+            z = new float[bodies.length];
+            for (int i=0; i<bodies.length; i++){
+               bodies[i].setUniverseAndIndex(this, i);
+               bodies[i].setX((float)(Math.random()*100));
+               bodies[i].setY((float)(Math.random()*100));
+               bodies[i].setZ((float)(Math.random()*100));
+            }
+         }
+         void adjustPositions(){
+             // can now more efficiently use Aparapi
+         }
+    }
+
+
+
+    Body[] bodies = new Body[BODIES];
+    for (int i=0; i<bodies.length; i++){
+        bodies[i] = new Body();
+    }
+    NBodyUniverse universe = new NBodyUniverse(bodies);
+    while (true){
+       universe.adjustPositions();
+       // display NBodyUniverse
+    }
+
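+For illustration only, adjustPositions() might then hand the parallel arrays straight to an Aparapi kernel. The force calculation itself is elided; this sketch just shows how the parallel primitive arrays line up with the kernel's global id:
+
+    void adjustPositions(){
+       final float[] px = x, py = y, pz = z; // capture the parallel arrays for the kernel
+       Kernel kernel = new Kernel(){
+          @Override public void run(){
+             int i = getGlobalId();
+             for (int j = 0; j < px.length; j++){
+                // accumulate the force of body j on body i using px[j]-px[i], py[j]-py[i], pz[j]-pz[i]
+             }
+             // integrate and write the new position back to px[i], py[i] and pz[i]
+          }
+       };
+       kernel.execute(px.length);
+    }
+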
+This example allows Java™ code to treat each Body in a traditional object-oriented fashion and also allows Aparapi kernels to act on the parallel primitive array form, in order to access/mutate the position of the bodies.
+
+[Attribution](Attribution.md)
\ No newline at end of file
diff --git a/doc/Attribution.md b/doc/Attribution.md
new file mode 100644
index 0000000000000000000000000000000000000000..52ab3813e7eec37fd5751196eae30b8a18c0ae59
--- /dev/null
+++ b/doc/Attribution.md
@@ -0,0 +1,26 @@
+#Attribution
+*Attribution Updated Sep 13, 2011 by frost.g...@gmail.com*
+
+##Attribution
+
+AMD, AMD Radeon, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc.
+
+OpenCL is a trademark of Apple Inc used under license to the Khronos Group, Inc.
+
+NVIDIA, the NVIDIA logo, and CUDA are trademarks or registered trademarks of NVIDIA Corporation.
+
+Java, JVM, JDK and "Write Once, Run Anywhere" are trademarks of Oracle and/or its affiliates.
+
+Eclipse and the related logos are a trademark of The Eclipse Foundation in the United States, other countries, or both.
+
+Microsoft, Windows, Visual Studio, Visual Studio Express Edition are trademarks of Microsoft Corporation in the United States, other countries, or both.
+
+Linux is a registered trademark of Linus Torvalds
+
+Ubuntu is a trademark of Canonical Ltd
+
+Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries.
+
+OpenGL® and the oval logo are trademarks or registered trademarks of Silicon Graphics, Inc. in the United States and/or other countries worldwide.
+
+All other names used in this documentation are for identification purposes only and may be trademarks of their respective owners.
diff --git a/doc/BuildingNBody.md b/doc/BuildingNBody.md
new file mode 100644
index 0000000000000000000000000000000000000000..092306ac1c0f1975d86b9c2beb6ddbcee596c02a
--- /dev/null
+++ b/doc/BuildingNBody.md
@@ -0,0 +1,40 @@
+#BuildingNBody
+*How to build the NBody example. Updated Nov 11, 2011 by frost.g...@gmail.com*
+##Building NBody
+The NBody example is located in the /examples subdirectory under the Aparapi trunk:
+
+    trunk/
+       ...
+       examples/
+         ...
+         nbody/
+            src/java/com.amd.aparapi.nbody/
+            build.xml
+            nbody.sh
+            nbody.bat
+The NBody example requires a little more work to build because it depends on a third party project named ‘JOGL’.
+
+JOGL is a set of OpenGL™ bindings for Java® and the NBody example uses this library to render the particles/bodies (potentially many thousands of them) at runtime. More information about JOGL can be found here http://jogamp.org/jogl/www.
+
+The build target in the build.xml file will download the jars required to build and run the NBody example if the files do not already exist.
+
+To build nbody, perform the following commands.
+
+    C:> ant clean build
+The NBody build.xml file includes a ‘run’ target so you can launch the application using:
+
+    C:> ant run
+Or, if you prefer, launch from either the nbody.sh or nbody.bat script.
+
+For Linux® we also need to chmod nbody.sh in order to execute it.
+
+    chmod +x nbody.sh
+The nbody scripts take the execution mode as the first argument, the number of bodies as the second argument, and then the height and width (in pixels).
+
+Windows example:
+
+    C:> nbody  GPU 32768 800 800
+Linux example:
+
+    $ ./nbody.sh  GPU 32768 800 800
+Attribution
\ No newline at end of file
diff --git a/doc/ByteCode2OpenCL.pdf b/doc/ByteCode2OpenCL.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..6d51d746f087b8e4fa74a42a1580e5fcb00f5336
Binary files /dev/null and b/doc/ByteCode2OpenCL.pdf differ
diff --git a/doc/ChoosingSpecificDevicesForExecution.md b/doc/ChoosingSpecificDevicesForExecution.md
new file mode 100644
index 0000000000000000000000000000000000000000..62b4ce089e982347ece86397f538793849c5918e
--- /dev/null
+++ b/doc/ChoosingSpecificDevicesForExecution.md
@@ -0,0 +1,58 @@
+#ChoosingSpecificDevicesForExecution
+*Using the new Device API's to choose Kernel execution on a specific device. Updated Sep 18, 2012 by frost.g...@gmail.com*
+
+Previously Aparapi chose the first GPU device when Kernel.execute() was called. This made it easy to execute simple Kernels, but was problematic when users wanted finer control over which device should be chosen, especially when the first device was unsuitable. We recently added new classes and APIs to allow the developer to specify exactly which device to target.
+
+A new Device class has been added. This allows the user to select a specific device, either by calling a helper method such as Device.firstGPU() or Device.best(), or by iterating through all devices and choosing one based on some other criteria (capabilities? vendor name?).
+
+So selecting the 'best' (most performant) device could be achieved using:
+
+    Device device = Device.best();
+
+Alternatively if I wanted the first AMD GPU device I might use:-
+
+    Device chosen=null;
+    for (Device device: devices.getAll()){
+       if (device.getVendor().contains("AMD") && device.isGPU()){
+          chosen = device;
+          break;
+       }
+    }
+
+A Device can be queried `(isGPU(), isOpenCL(), isGroup(), isJava(), getOpenCLPlatform(), getMaxMemory(), getLocalSizes())` to yield its characteristics.
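+
+For example, a small sketch that uses only the query methods listed above:
+
+    Device device = Device.best();
+    System.out.println("Is it a GPU?  " + device.isGPU());
+    System.out.println("Is it Java?   " + device.isJava());
+    System.out.println("Max memory    " + device.getMaxMemory());
+    System.out.println("Platform      " + device.getOpenCLPlatform());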
+
+To execute on a specific device we must use the device to create our range.
+
+    Range range = device.createRange2D(width, height);
+
+This allows the Range to be created with knowledge of the underlying device. So for example device.createRange3D(1024, 1024, 1024, 16, 16, 16) will fail if the device does not allow a local size of (16x16x16).
+
+A range created using a device factory method captures the device which created it; the range instance has a device field which is set by the device that creates it.
+
+It's as if we had this code
+
+    Range range = Range.create(width, height);
+    range.setDevice(device);
+
+So the Range locks the device that it can be used with.
+
+Now when we have a Kernel.
+
+    Kernel kernel = new Kernel(){
+        @Override public void run(){
+          ...
+        }
+    }
+
+And we then use a device created range.
+
+    Device device = Device.firstGPU();
+    Kernel kernel = new Kernel(){
+        @Override public void run(){
+          // uses input[];
+        }
+    };
+    range = device.createRange2D(1024, 1024);
+    kernel.execute(range);
+
+We have forced execution on the first GPU.
diff --git a/doc/ContributionGuide.md b/doc/ContributionGuide.md
new file mode 100644
index 0000000000000000000000000000000000000000..c3f8d219c8b20f53f627d53bebf05b00ffa63757
--- /dev/null
+++ b/doc/ContributionGuide.md
@@ -0,0 +1,48 @@
+#ContributionGuide
+*How to contribute (bug fix or features). Updated Sep 13, 2011 by frost.g...@gmail.com*
+##Contribution Guide
+We welcome all contributions to add new features to Aparapi and make Aparapi more useful and high performing. These guidelines are intended to describe and streamline the contribution process.
+
+A patch can be a bug fix, a new feature, a new JUnit test case or a documentation change.
+
+Only members of the commit team are able to commit changes to the SVN repository.
+
+Only patches submitted through the process described below will be committed to SVN.
+
+The commit team will only apply patches which are submitted via the Aparapi project’s issue list.
+
+http://code.google.com/p/aparapi/issues/list
+
+The current commit team members are:
+* Eric Caspole (AMD)
+* Tom Deneau (AMD)
+* Gary Frost (AMD)
+
+If you would like to be considered for inclusion in the commit team, please send an email to anyone on the team and let them know.
+
+##Submitting a patch
+If the bug or enhancement does not yet appear in the issues list, please take the time to add a new issue.
+
+Be sure to include sufficient detail to explain and recreate the bug or to justify the proposed enhancement.
+
+Ensure that your patch/fix does not regress any of the existing JUnit tests. The UnitTestGuide wiki page describes how to run the various Aparapi unit tests.
+
+Ensure that your patch does not break any sample or example. Create a patch file (using SVN’s diff command) against a recently updated trunk; do not submit patches against branches. Name your patch file using the following filename convention:
+
+     aparapi-<issue id>-<trunk revision id>.patch
+The following shows the sequence for creating a patch for issue number 1234.
+
+    $ cd aparapi-trunk
+    $ svn update
+    At revision 10339
+    $ svn diff > aparapi-1234-10339.patch
+
+Attach your patch file to the issue via Issue List.
+
+## Attribution of contributions
+We want to correctly attribute all contributions and will maintain a CREDITS.txt file at the head of the trunk. We discourage including attribution as comments in the code; instead we intend to let the history feature of SVN be the primary method for tracking attributions. When a patch is committed the commit team member will update the CREDITS.txt file, apply your patch, and include your name (and email if you desire) as part of the SVN commit history.
+
+## Contributions made under a different license than the existing BSD derived license
+We cannot accept contributions or patches which are subject to other licenses.
+
+Attribution
+
diff --git a/doc/ConvertingBytecodeToOpenCL.md b/doc/ConvertingBytecodeToOpenCL.md
new file mode 100644
index 0000000000000000000000000000000000000000..74210bb10bebce69f644f896fb8bdd80ea39a355
--- /dev/null
+++ b/doc/ConvertingBytecodeToOpenCL.md
@@ -0,0 +1,282 @@
+#ConvertingBytecodeToOpenCL
+
+*How Aparapi converts bytecode to OpenCL. Updated Aug 23, 2012 by frost.g...@gmail.com*
+
+##Introduction
+
+[try this](ByteCode2OpenCL.pdf)
+
+One of the unique Aparapi features is its ability to convert Java bytecode to OpenCL automatically.
+
+In this page we will try to describe the process used to perform this conversion. If you are unfamiliar with bytecode, consider visiting the WhatIsBytecode page.
+
+The command
+
+    javac Source.java
+
+Will compile the Java source file Source.java to Source.class.
+
+The classfile format is well documented elsewhere and we will not go into too much detail here; however, it should be noted that Aparapi must parse the classfile of each Kernel to extract the bytecode for Kernel.run() and any method reachable from Kernel.run().
+
+Let's start with a simple Kernel.
+
+    import com.amd.aparapi.Kernel;
+
+    public class Squarer extends Kernel{
+       int[] in;
+       int[] out;
+       @Override public void run(){
+          int gid = getGlobalId(0);
+          out[gid] = in[gid] * in[gid];
+       }
+    }
+
+We will compile this
+
+    javac -g -cp path/to/aparapi/aparapi.jar Squarer.java
+
+and then we can look at the bytecode using javap
+
+    javap -c -classpath path/to/aparapi/aparapi.jar;. Squarer
+
+    Compiled from "Squarer.java"
+
+    public class Squarer extends com.amd.aparapi.Kernel
+      SourceFile: "Squarer.java"
+      minor version: 0
+      major version: 50
+      Constant pool:
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+    const #2 = Method       #5.#18; //  Squarer.getGlobalId:(I)I
+    const #3 = Field        #5.#19; //  Squarer.out:[I
+    const #4 = Field        #5.#20; //  Squarer.in:[I
+    const #5 = class        #21;    //  Squarer
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+    const #7 = Asciz        in;
+    const #8 = Asciz        [I;
+    const #9 = Asciz        out;
+    const #10 = Asciz       <init>;
+    const #11 = Asciz       ()V;
+    const #12 = Asciz       Code;
+    const #13 = Asciz       LineNumberTable;
+    const #14 = Asciz       run;
+    const #15 = Asciz       SourceFile;
+    const #16 = Asciz       Squarer.java;
+    const #17 = NameAndType #10:#11;//  "<init>":()V
+    const #18 = NameAndType #23:#24;//  getGlobalId:(I)I
+    const #19 = NameAndType #9:#8;//  out:[I
+    const #20 = NameAndType #7:#8;//  in:[I
+    const #21 = Asciz       Squarer;
+    const #22 = Asciz       com/amd/aparapi/Kernel;
+    const #23 = Asciz       getGlobalId;
+    const #24 = Asciz       (I)I;
+
+    {
+    int[] in;
+
+    int[] out;
+
+    public Squarer();
+      Code:
+       Stack=1, Locals=1, Args_size=1
+       0:   aload_0
+       1:   invokespecial   #1; //Method com/amd/aparapi/Kernel."<init>":()V
+       4:   return
+
+
+    public void run();
+      Code:
+       Stack=5, Locals=2, Args_size=1
+       0:   aload_0
+       1:   iconst_0
+       2:   invokevirtual   #2; //Method getGlobalId:(I)I
+       5:   istore_1
+       6:   aload_0
+       7:   getfield        #3; //Field out:[I
+       10:  iload_1
+       11:  aload_0
+       12:  getfield        #4; //Field in:[I
+       15:  iload_1
+       16:  iaload
+       17:  aload_0
+       18:  getfield        #4; //Field in:[I
+       21:  iload_1
+       22:  iaload
+       23:  imul
+       24:  iastore
+       25:  return
+    }
+
+Here we see the constant pool of the class and the disassembled bytecode of the default constructor Squarer() and the Squarer.run() method.
+
+The constant pool is a table of constant values that can be accessed from the bytecode of any methods from within this class. Some of the constants are String literals defined within the source (or literals used to name classes, fields, methods, variables or signatures), other slots represent Classes, Methods, Fields or Type signatures. These latter constant pool entries cross-reference other constant pool entries to describe higher level artifacts.
+
+For example constant pool entry #1 is
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+So entry #1 defines a method. The class containing the method is defined in constant pool entry #6. So let's look at constant pool entry #6.
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+
+At constant pool entry #6 we find a class definition which refers to entry #22
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+
+    const #22 = Asciz       com/amd/aparapi/Kernel;
+
+Which just contains the String (Ascii) name of the class.
+
+Looking back at entry #1 again, we note that the Method also references entry #17 which contains a NameAndType entry for determining the method name and the signature.
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+
+
+    const #17 = NameAndType #10:#11;//  "<init>":()V
+
+    const #22 = Asciz       com/amd/aparapi/Kernel;
+
+Entry #17's "NameAndType" references #10 for the method name.
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+
+    const #10 = Asciz       <init>;
+
+    const #17 = NameAndType #10:#11;//  "<init>":()V
+
+    const #22 = Asciz       com/amd/aparapi/Kernel;
+
+And then references #11 to get the signature.
+
+    const #1 = Method       #6.#17; //  com/amd/aparapi/Kernel."<init>":()V
+
+    const #6 = class        #22;    //  com/amd/aparapi/Kernel
+
+    const #10 = Asciz       <init>;
+
+    const #11 = Asciz       ()V;
+
+    const #17 = NameAndType #10:#11;//  "<init>":()V
+
+    const #22 = Asciz       com/amd/aparapi/Kernel;
+
+So from constant pool entry #1 we ended up using slots 1, 6, 10, 11, 17 and 22 to fully resolve the method.
+
+This looks like a lot of work; however, breaking method and field references up like this allows the various slots to be reused by other field/method descriptions.
+
+So when we see disassembled bytecode which references a constant pool slot, the actual slot # (2 in the example below) will appear after the opcode, as it does for invokevirtual here.
+
+    2:   invokevirtual   #2; Method getGlobalId:(I)I
+
+Bytecode is basically able to access three things
+
+1. Constant pool entries
+2. Variable slots
+3. Stack operands
+
+Instructions are able to pop operands from the stack, push operands to the stack, load values from variable slots (to the stack), store values (from the stack) to variable slots, push the values of accessed fields (to the stack) and call methods (popping args from the stack).
+
+Some instructions can only handle specific types (int, float, double, and object instances - arrays are special forms of objects) and usually the first character of the instruction helps determine which type the instruction acts upon. So imul would be a multiply instruction that operates on integers, fmul would multiply two floats, dmul for doubles. Instructions that begin with 'a' operate on object instances.
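+
+As a small illustration (the disassembly below is roughly what javap would show for a trivial standalone method; exact offsets may differ), the float version of a multiply uses the f-prefixed instructions where an int version would use iload/imul:
+
+    static float squareMe(float value){
+       return value * value;
+    }
+
+compiles to something like
+
+    0:   fload_0
+    1:   fload_0
+    2:   fmul
+    3:   freturn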
+
+So let's look at the first instruction.
+
+    0:   aload_0
+
+This instruction loads an object (a is the first character) from variable slot 0 (we'll come back to the variable slots in a moment) and pushes it on the stack.
+
+Variables are held in 'slots' that are reserved at compile time.
+
+Consider this static method.
+
+    static int squareMe(int value){
+      value += value;
+      return(value);
+    }
+
+This method requires one variable slot. At any one time there is only one variable that is live; it just happens to be an argument to the method.
+
+The following method also requires only one slot.
+
+    static int squareMe(){
+      int value=4;
+      value += value;
+      return(value);
+    }
+
+Here we need two slots
+
+    static int squareMe(int arg){
+      int value=arg*arg;
+      return(value);
+    }
+
+Surprisingly the following also only requires two slots.
+
+    static int squareMe(int arg){
+      {
+        int temp = arg*arg;
+      }
+      int value=arg*arg;
+      return(value);
+    }
+
+Note that in the above example the temp variable loses scope before the local variable value is used. So only two slots are required. Both temp and value can share a slot.
+
+If we have an instance method we always require one extra slot (always slot 0) for the this reference.
+
+So
+
+    int squareMe(int arg){
+      int value=arg*arg;
+      return(value);
+    }
+
+Requires three slots.
+
+Anyway back to our bytecode
+
+    0:   aload_0
+
+This loads the object instance in slot 0 (this) and pushes it on the stack.
+
+Next we have
+
+    1:   iconst_0
+
+Which pushes the int constant 0 on the stack. So the stack contains {this,0}
+
+Next we have
+
+    2:   invokevirtual   #2; //Method getGlobalId:(I)I
+
+This is the bytecode for calling a method. Basically the instruction itself references the constant pool (we'll come back to this ;) ) and pulls the method description from constant pool entry #2, which happens to be the description for a method called `getGlobalId()` which takes an int and returns an `int`.
+
+So the VM will pop the top value `(int - const 0)` as the method arg, and then will pop an object reference (this!) and will call the method `this.getGlobalId(0)` and will push the result (an int) back on the stack.
+
+So our stack which contains `{this,0}` now contains the result of this.getGlobalId(0); let's assume it is {0}. We describe this invoke instruction as consuming two operands from the stack and producing one.
+
+Before we start executing, our stack is empty {} and the slots are initialized with 'this' (if an instance method) and any arguments passed to the method.
+
+                                                                0   1
+                                                       slots=[this, ?  ]    stack={}
+
+                                                                0   1
+    0:   aload_0                                        slots=[this, ?  ]    stack={this}
+                                                                0   1
+    1:   iconst_0                                       slots=[this, ?  ]    stack={this, 0}
+                                                                0   1
+    2:   invokevirtual   #2; Method getGlobalId:(I)I    slots=[this, ?  ]  stack={result of this.getGlobalId(0) lets say 0}
+
+    5:   istore_1                                       slots=[this, 0  ]    stack={}
+
+    6:   aload_0                                        slots=[this, 0  ]    stack={this}
+
+    7:   getfield        #3; //Field out:[I
diff --git a/doc/DevelopersGuide.md b/doc/DevelopersGuide.md
new file mode 100644
index 0000000000000000000000000000000000000000..350226f8b262c3aacb631debc0b2042769d92d5c
--- /dev/null
+++ b/doc/DevelopersGuide.md
@@ -0,0 +1,29 @@
+#DevelopersGuide
+*Aparapi developers guide. Updated Sep 13, 2011 by frost.g...@gmail.com*
+##Developer Guide
+Although the vast majority of the Aparapi code is Java® we do include some C++ code (accessed from Java™ via JNI) to interface with existing OpenCL™ C/C++ headers and libraries. Therefore to build Aparapi for a given platform (Microsoft® Windows® 32- or 64-bit and/or Linux® 32- or 64-bit) we require developers to set up a build environment containing both Java® and C++ development tools. In this documentation we describe the tools required to build Aparapi for the various supported platforms.
+
+##Supported Platforms
+In general Aparapi can be used on any platform currently supported by AMD APP SDK v2.5 or later. Please check the AMD APP SDK site for details on supported platforms and installation help.
+
+[http://developer.amd.com/sdks/amdappsdk/downloads/pages/default.aspx](http://developer.amd.com/sdks/amdappsdk/downloads/pages/default.aspx)
+
+[http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Installation_Notes.pdf](http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Installation_Notes.pdf)
+
+* 32-bit Microsoft® Windows® 7
+* 32-bit Microsoft® Windows Vista®
+* 64-bit Microsoft® Windows® 7
+* 64-bit Microsoft® Windows Vista®
+* 32-bit Linux®
+* 64-bit Linux®
+
+Clearly we will also depend on platform-specific Oracle® Java® JDK 6 components and C++ compilers, along with some platform-neutral tools (such as SVN, Ant and JUnit).
+
+## Platform Specific Developer Guides
+We have broken the Developer Guide into two separate docs. One for Linux® (32- and 64- bit) and another for Microsoft® Windows® (32- and 64- bit). Please follow the appropriate link below.
+
+[DevelopersGuideLinux](DevelopersGuideLinux.md)
+
+[DevelopersGuideWindows](DevelopersGuideWindows.md)
+
+[Attribution](Attribution.md)
\ No newline at end of file
diff --git a/doc/DevelopersGuideLinux.md b/doc/DevelopersGuideLinux.md
new file mode 100644
index 0000000000000000000000000000000000000000..c14da3a23fa09e6a72c55196b9937da13321e274
--- /dev/null
+++ b/doc/DevelopersGuideLinux.md
@@ -0,0 +1,181 @@
+#DevelopersGuideLinux
+
+*Developer guide for Linux. Updated Aug 23, 2012 by frost.g...@gmail.com*
+
+#Aparapi Developer Guide: Linux® 32- and 64-bit platforms
+
+##SVN Client
+
+To contribute to Aparapi you will need an SVN client to access the latest source code. This page lists a number of SVN client providers [http://subversion.apache.org/packages.html](http://subversion.apache.org/packages.html) Also you might want to consider one of the SVN-based plugins for Eclipse®. http://wiki.eclipse.org/SVN_Howto
+
+##OpenJDK or Oracle® Java JDK install (JDK1.6 or later)
+
+http://OpenJDK.java.net http://www.oracle.com/technetwork/java/javase/downloads/index.html
+
+Many Linux® distributions come with Java JDK pre-installed or available as an optional install component. Sometimes the version that comes pre-installed is GCJ (http://gcc.gnu.org/java/). For Aparapi you will need to ensure that you have a copy of the JDK from either the OpenJDK project or from Oracle®.
+
+The Oracle® J2SE JDK site contains downloads and documentation showing how to install for various Linux distributions.
+
+http://www.oracle.com/technetwork/java/javase/index-137561.html
+
+Here is an example for my Ubuntu system:
+
+    $ sudo apt-get install sun-java6-jdk sun-java6-jre
+
+When the installation is complete, ensure that your JAVA_HOME environment variable is pointing to the install location (such as /usr/lib/jvm/java-6-sun-1.6.0.26).
+
+    $ export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
+
+You should also add ${JAVA_HOME}/bin to your path.
+
+    $ export PATH=${PATH}:${JAVA_HOME}/bin
+
+Double-check your path and ensure that there is not another JDK/JRE in your path.
+
+    $ java -version
+    java version "1.6.0_26"
+    Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
+    Java HotSpot(TM) Client VM (build 20.1-b02, mixed mode, sharing)
+
+##Apache Ant
+
+Apache Ant® can be downloaded from the apache project page http://ant.apache.org
+
+Aparapi has been tested using the 1.7.1 version of Ant. It may work with earlier versions, but if you encounter issues we recommend updating to at least 1.7.1 before reporting them. Here is an example for installing Ant on Ubuntu:
+
+    $ apt-get install ant
+
+Ensure that ANT_HOME is set to the install dir.
+
+    $ export ANT_HOME=/usr/local/ant
+
+Add `${ANT_HOME}/bin` to your path.
+
+    $ export PATH=${PATH}:${ANT_HOME}/bin
+
+Double-check the installation and environment vars.
+
+    ant -version
+    Apache Ant version 1.7.1 compiled ...
+
+##AMD APP SDK
+
+To compile Aparapi JNI code you need access to OpenCL headers and libraries. The instructions below assume that there is an available AMD APP SDK v2.5® (or later) installed and that your platform supports the required device drivers for your GPU card. Install the Catalyst driver first, and then install AMD APP SDK v2.5® or later.
+
+See http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx for help locating the appropriate driver for your AMD card. Make sure you install the Catalyst driver that includes the OpenCL™ runtime components.
+
+* The OpenCL™ runtime is required for executing Aparapi or OpenCL™ on your CPU or GPU, but it is not necessary for building/compiling Aparapi.
+* The AMD APP SDK v2.5 is necessary for compiling the Aparapi JNI code against OpenCL™ APIs.
+
+Once you have a suitable driver, download a copy of AMD APP SDK v2.5 or later from http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx.
+
+Download the installation guide for Microsoft® Windows® (and Linux®) from http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Installation_Notes.pdf. Note that if you are updating from a previous version of the AMD APP SDK (or its predecessor ATI STREAM SDK), first uninstall the previous version.
+
+Download the release notes from: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Release_Notes_Developer.pdf
+
+##GCC compiler (G++) for your Linux 32-bit or 64-bit platform
+
+Aparapi has been tested with GCC 4.1.2 or later compilers on 32-bit and 64-bit Linux.
+
+Ensure you have the g++ toolchain installed:
+
+    $ g++
+    no input files
+
+##JUnit
+
+The initial Open Source drop includes a suite of JUnit tests for validating bytecode to OpenCL™ code generation. These tests require JUnit 4.
+
+Download JUnit from http://www.junit.org/ and note the location of your JUnit installation; the location is needed to configure the test/codegen/build.xml file. Please see the UnitTestGuide page.
+
+##Eclipse
+
+Eclipse is not required to build Aparapi; however, the developers of Aparapi do use Eclipse and have made the Eclipse artifacts (.classpath and .project files) available so that projects can be imported into Eclipse. The com.amd.aparapi.jni subproject (containing C++ JNI source) should be imported as a resource project. We do not recommend importing com.amd.aparapi.jni as a CDT project, nor trying to configure a CDT build; the existing build.xml file has been customized for multiplatform C++ compilation.
+
+##Building
+
+Check out the Aparapi SVN trunk:
+
+    $ svn checkout http://aparapi.googlecode.com/svn/trunk aparapi
+
+Checkout provides the following:
+
+    aparapi/
+       com.amd.aparapi/
+          src/java/com.amd.aparapi/*.java
+          build.xml
+       com.amd.aparapi.jni/
+          src/cpp/*.cpp
+          src/cpp/*.h
+          build.xml
+       test/
+          codegen/
+             src/java/
+                com.amd.aparapi/
+                com.amd.aparapi.test/
+             build.xml
+          runtime/
+             src/java/
+                com.amd.aparapi/
+                com.amd.aparapi.test/
+             build.xml
+       samples/
+          mandel
+             src/java/com.amd.aparapi.samples.mandel/*.java
+             build.xml
+             mandel.sh
+             mandel.bat
+          squares/
+             src/java/com.amd.aparapi.samples.squares/*.java
+             build.xml
+             squares.sh
+             squares.bat
+          convolution/
+             src/java/com.amd.aparapi.samples.convolution/*.java
+             build.xml
+             conv.sh
+             conv.bat
+       examples/
+          nbody/
+             src/java/com.amd.aparapi.nbody/
+             build.xml
+             nbody.sh
+             nbody.bat
+       build.xml
+       README.txt
+       LICENSE.txt
+       CREDITS.txt
+
+##Sub Directories
+
+The com.amd.aparapi and com.amd.aparapi.jni subdirectories contain the source for building and using Aparapi.
+
+The ant build.xml files in each folder accept the common 'clean' and 'build' targets. You can use the build.xml file at the root of the tree for two purposes:
+
+* To initiate a build of com.amd.aparapi and com.amd.aparapi.jni.
+* To create a binary ‘distribution’ directory and zip file. This zip file is the same as those available from the download section of the code.google.com/p/aparapi site.
+
+##Preparing for your first build
+
+Edit com.amd.aparapi.jni\build.properties and ensure that the properties are valid for your platform.
+
+View the comments in the properties file for assistance. The build.xml ant file contains some simple checks to help diagnose simple configuration errors in case something gets messed up.
+
+For Linux you should not need to edit build.xml unless your APP SDK install location differs from the default. The default for Linux® is /opt/AMDAPP
+
+    amd.app.sdk.dir=/opt/AMDAPP
+
+Perform a build from the root directory using the following command:
+
+    $ ant clean build dist
+
+Once your build has completed you should see an additional subdirectory named dist_linux_x86 or dist_linux_x86_64 (depending on the bitness of your platform).
+
+The distribution directory contains:
+
+* aparapi.jar containing Aparapi classes for all platforms.
+* the shared library for your platform (aparapi_x86.so or aparapi_x86_64.so).
+* an /api subdirectory containing the 'public' javadoc for Aparapi.
+* a samples directory containing the source and binaries for the mandel and squares sample projects.
+
+The root directory also contains either dist_linux_x86_64.zip or dist_linux_x86.zip containing a compressed archive of the distribution tree.
+
+[Attribution](Attribution.md)
diff --git a/doc/DevelopersGuideWindows.md b/doc/DevelopersGuideWindows.md
new file mode 100644
index 0000000000000000000000000000000000000000..dd0c386f7329e7c4255ca028c89fea23aca145b9
--- /dev/null
+++ b/doc/DevelopersGuideWindows.md
@@ -0,0 +1,187 @@
+#DevelopersGuideWindows
+*Developers guide for Windows. Updated Aug 23, 2012 by frost.g...@gmail.com*
+
+##Aparapi Developer Guide: Microsoft® Windows® 32- and 64-bit platforms
+
+##SVN Client
+
+To contribute to Aparapi you will need an SVN client to access the latest source code.
+
+This page lists a number of SVN client providers http://subversion.apache.org/packages.html
+
+For Microsoft Windows® users TortoiseSVN incorporates SVN functionality directly into Windows Explorer view and is often preferred http://tortoisesvn.tigris.org/
+
+Also you might want to consider one of the SVN-based plugins for Eclipse. http://wiki.eclipse.org/SVN_Howto
+
+##Oracle® Java JDK install (JDK1.6 or later)
+
+http://www.oracle.com/technetwork/java/javase/downloads/index.html
+
+The Oracle® J2SE JDK site contains downloads and documentation showing how to install for various platforms. http://www.oracle.com/technetwork/java/javase/index-137561.html
+
+When the installation is complete, ensure that your JAVA_HOME environment variable is pointing to the install location (such as c:\progra~1\java\jdk1.6.0_26) and that %JAVA_HOME%\bin is in your path.
+
+    C:> set JAVA_HOME=c:\progra~1\java\jdk1.6.0_26
+    C:> set PATH=%PATH%;%JAVA_HOME%\bin
+
+Note that we tend to use the 8.3 form of Microsoft® Windows® path variables; this avoids having to quote paths in scripts.
+
+Double check your path and ensure that there is not another JDK/JRE in your path.
+
+    C:> java -version
+    java version "1.6.0_26"
+    Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
+    Java HotSpot(TM) Client VM (build 20.1-b02, mixed mode, sharing)
+
+##Apache Ant
+
+Apache Ant™ can be downloaded from the apache project page http://ant.apache.org
+
+Aparapi has been tested using the 1.7.1 version of Ant. It may well work with earlier versions, but if you encounter issues we recommend updating to at least 1.7.1 before reporting them. Installation is straightforward: just unzip the ant.zip file and ensure that your ANT_HOME environment variable is pointing to your Ant installation and that `%ANT_HOME%\bin` is in your path.
+
+    C:> set ANT_HOME=C:\progra~1\apache\apache-ant-1.8.1
+    C:> set PATH=%PATH%;%ANT_HOME%\bin
+
+Double check the installation and environment vars.
+
+    ant -version
+    Apache Ant version 1.7.1 compiled ..
+
+##AMD APP SDK
+
+To compile Aparapi JNI code you need access to OpenCL headers and libraries. The instructions below assume that there is an available AMD APP SDK v2.5 (or later) installed and that your platform supports the required device drivers for your GPU card. Install the Catalyst driver first, and then install AMD APP SDK v2.5.
+
+See http://developer.amd.com/sdks/AMDAPPSDK/pages/DriverCompatibility.aspx for help locating the appropriate driver for your AMD card. Be sure you obtain the Catalyst driver that includes the OpenCL™ runtime components.
+
+* The OpenCL™ runtime is required for executing Aparapi or OpenCL™ on your CPU or GPU, but it is not necessary for building/compiling Aparapi.
+* The AMD APP SDK v2.5 is necessary for compiling the Aparapi JNI code against OpenCL™ APIs.
+
+Once you have a suitable driver, download a copy of AMD APP SDK v2.5 from http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx.
+
+Download the installation guide for Microsoft® Windows® (and Linux®) from http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Installation_Notes.pdf. Note that if you are updating from a previous version of the AMD APP SDK (or its predecessor ATI STREAM SDK), first uninstall the previous version. The release notes are available here: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_APP_SDK_Release_Notes_Developer.pdf
+
+##A C++ compiler
+
+For Microsoft® Windows® platforms the JNI build can support either the Microsoft® Visual Studio® 2008 or 2010 compiler or MinGW (Minimal GNU for Windows) from GNU. Now that Visual Studio Express is available for free, we recommend using Visual Studio. If you wish to use another compiler then you will have to tweak the com.amd.aparapi.jni/build.xml file to get your compiler to work.
+
+##Microsoft® Visual Studio® 2008/2010 for 32-bit or 64-bit platforms
+
+Aparapi has been tested with various versions of Microsoft® Visual Studio® 2008 and 2010, including Enterprise, Professional and Express editions. If you encounter any version-specific issues please let us know so we can address them and/or update this documentation.
+
+If you already have Microsoft® Visual Studio® installed you will need to know the location of the compiler and the SDK. These can vary depending upon the platform and version you are using. Typically an install results in a Visual Studio location such as c:\Program Files\Microsoft Visual Studio 9.0
+
+and an SDK location such as c:\Program Files\Microsoft SDKs\Windows\v6.0A
+
+Note the location of both of these as this information will be needed to configure the com.amd.aparapi.jni\build.properties file (later).
+
+##For Visual Studio Express 64-bit users
+
+Visual Studio Express does not include the 64-bit compiler or libraries. You will also need to install the SDK from Microsoft; this link should help.
+
+##MinGW – (Minimal GNU for Windows)
+
+As an alternative to installing Microsoft® Visual Studio® we have included support for the MinGW tool chain and Aparapi has been (minimally) tested with this compiler.
+
+MinGW can be downloaded from http://www.mingw.org/ by following the instructions on their Getting Started page. We recommend installing the mingw-get-inst msi installer and just taking the defaults.
+
+Note the install location, as this information will be needed to edit the build.xml file and uncomment the line referencing the MinGW install dir. Typically the install location is
+
+    C:\MinGW
+
+After a successful build, you will need to ensure that the bin subdirectory is in your path before attempting to run an Aparapi-enabled application built using MinGW. MinGW apps require access to the MinGW/GNU C++/C runtime at execution time.
+
+    set PATH=%PATH%;C:\MinGW\bin
+
+This is one reason the binary distribution is *not* built using MinGW.
+
+##JUnit
+
+The initial Open Source drop includes a suite of JUnit tests for validating bytecode to OpenCL code generation. These tests require JUnit 4.
+
+Download JUnit from http://www.junit.org/
+
+Note the location of your JUnit installation; the location is needed to configure the test\codegen\build.xml file. See the UnitTestGuide page for how to configure the JUnit build.
+
+##Eclipse
+
+Eclipse is not required to build Aparapi; however, the developers of Aparapi do use Eclipse and have made the Eclipse artifacts (.classpath and .project files) available so that projects can be imported into Eclipse.
+
+The com.amd.aparapi.jni subproject (containing C++ JNI source) should be imported as a resource project. We do not recommend importing com.amd.aparapi.jni as a CDT project, nor trying to configure a CDT build; the existing build.xml file has been customized for multiplatform C++ compilation.
+
+##Building
+
+Check out the Aparapi SVN trunk:
+
+    svn checkout http://aparapi.googlecode.com/svn/trunk
+
+You will end up with the following files/directories
+
+    aparapi/
+       com.amd.aparapi/
+          src/java/com.amd.aparapi/*.java
+          build.xml
+       com.amd.aparapi.jni/
+          src/cpp/*.cpp
+          src/cpp/*.h
+          build.xml
+       test/
+          codegen/
+             src/java/
+                com.amd.aparapi/
+                com.amd.aparapi.test/
+             build.xml
+          runtime/
+             src/java/
+                com.amd.aparapi/
+                com.amd.aparapi.test/
+             build.xml
+       samples/
+          mandel
+             src/java/com.amd.aparapi.samples.mandel/*.java
+             build.xml
+             mandel.sh
+             mandel.bat
+          squares/
+             src/java/com.amd.aparapi.samples.squares/*.java
+             build.xml
+             squares.sh
+             squares.bat
+          convolution/
+             src/java/com.amd.aparapi.samples.convolution/*.java
+             build.xml
+             conv.sh
+             conv.bat
+       examples/
+          nbody/
+             src/java/com.amd.aparapi.nbody/
+             build.xml
+             nbody.sh
+             nbody.bat
+       build.xml
+       README.txt
+       LICENSE.txt
+       CREDITS.txt
+
+##Sub Directories
+
+The com.amd.aparapi and com.amd.aparapi.jni subdirectories contain the source for building and using Aparapi.
+
+The ant build.xml files in each folder accept 'clean' and 'build' targets.
+
+Use the build.xml file at the root of the tree for two purposes:
+
+* To initiate a build of com.amd.aparapi and com.amd.aparapi.jni.
+* To create a binary distribution directory and zip file. This zip file is the same as those available from the download section of the code.google.com/p/aparapi site.
+
+##Preparing for your first build
+
+You should only need to edit the com.amd.aparapi.jni\build.xml file if you wish to use MinGW or if your Visual Studio or gcc compiler is in an unusual place.
+
+Perform a build from the root directory using the following command:
+
+    $ ant clean dist
+
+The jni build will perform some simple tests to check the configuration properties and hopefully also guide you to a possible solution.
+
+Once your build has completed you should see an additional subdirectory named dist_windows_x86 or dist_windows_x86_64 (depending upon your platform type).
+
+The distribution directory contains:
+
+* aparapi.jar containing Aparapi classes for all platforms.
+* the shared library for your platform (aparapi_x86.dll or aparapi_x86_64.dll).
+* an /api subdirectory containing the 'public' javadoc for Aparapi.
+* a samples directory containing the source and binaries for the mandel and squares sample projects.
+
+The root directory also contains either dist_windows_x86_64.zip or dist_windows_x86.zip containing a compressed archive of the distribution tree.
+
+[Attribution](Attribution.md)
diff --git a/doc/DeviceProposal.md b/doc/DeviceProposal.md
new file mode 100644
index 0000000000000000000000000000000000000000..cb91759b5e4a08f8e380941dcba5c0c1829b4b76
--- /dev/null
+++ b/doc/DeviceProposal.md
@@ -0,0 +1,65 @@
+#DeviceProposal
+*How we might use the extension mechanism devices for general Kernel execution. Updated May 9, 2012 by frost.g...@gmail.com*
+
+At present the first GPU or CPU device (depending on the Kernel.ExecutionMode value) is chosen at execution time. This makes it easy to execute simple Kernels, but is problematic when using some advanced features (barriers, local memory) or for sizing buffers appropriately for the target device. I propose that we add APIs to allow the developer to specify exactly which device we intend to target.
+
+In the extension proposal branch we needed to expose a Device class for binding arbitrary OpenCL to a Java interface. I suggest we also use this to query device information useful for allocating suitably sized global/local buffers, and for dispatching Kernels to specific devices.
+
+The general pattern would be that we ask Aparapi to give us a Device, probably via a Device factory method.
+
+Something like:-
+
+    Device device = Device.best();
+
+We would also offer other useful factory methods `getBestGPU(), getFirstCPU(), getJavaMultiThread(), getJavaSequential()` as well as a method to get all devices so that the developer can filter themselves.
+
+Note that as well as real OpenCL devices we also expose 'pseudo' devices such as JavaMultiThread and Sequential. We might also allow pseudo devices to group multiple devices. So getAllGPUDevices() might return a pseudo device for executing across devices.
+
+    Device chosen=null;
+    for (Device device: devices.getAll()){
+       if (device.getVendor().contains("AMD") && device.isGPU()){
+          chosen = device;
+          break;
+       }
+    }
+
+A Device can be queried `(isGPU(), isOpenCL(), isGroup(), isJava(), getOpenCLPlatform(), getMaxMemory(), getLocalSizes())` and may need to be cast to specific types.
+
+This would allow us to configure buffers.
+
+    Device device = Device.best();
+    if (device instanceof OpenCLDevice){
+       OpenCLDevice openCLDevice  = (OpenCLDevice)device;
+       char input[] = new char[openCLDevice.getMaxMemory()/4];
+    }
+We can also use the Device as a factory for creating Ranges.
+
+    Range range = device.createRange2D(width, height);
+This allows the Range to be created with knowledge of the underlying device. So for example `device.createRange3D(1024, 1024, 1024, 16, 16, 16)` will fail if the device does not allow a local size of (16x16x16).
+
+A range created using `device.createRangeXX()` would also capture the device that created it. As if we had
+
+    Range range = device.createRange2D(width, height);
+    // implied range.setDevice(device);
+
+This basically means that the Range locks the device that it can be used with.
+
+So when we have a Kernel:
+
+    Kernel kernel = new Kernel(){
+        @Override public void run(){
+          ...
+        }
+    }
+And we then use
+
+    Device device = Device.firstGPU();
+    final char input[] = new char[((OpenCLDevice)device).getMaxMemory()/4];
+    Kernel kernel = new Kernel(){
+        @Override public void run(){
+          // uses input[];
+        }
+    };
+    range = device.createRange2D(1024, 1024);
+    kernel.execute(range);
+We have forced execution on the first GPU. Java fallback would still be possible (should we forbid this?).
+
+    kernel.execute( Device.firstGPU().getRange2D(width, height));
diff --git a/doc/EmulatingMultipleEntrypointsUsingCurrentAPI.md b/doc/EmulatingMultipleEntrypointsUsingCurrentAPI.md
new file mode 100644
index 0000000000000000000000000000000000000000..b34051f5aadc8ba235098cbc088ce97eaa266d5d
--- /dev/null
+++ b/doc/EmulatingMultipleEntrypointsUsingCurrentAPI.md
@@ -0,0 +1,226 @@
+#EmulatingMultipleEntrypointsUsingCurrentAPI
+*How to emulate multiple entrypoints using existing Aparapi APIs Updated Jul 30, 2012 by frost.g...@gmail.com*
+
+##Emulating Multiple Entrypoints Using Existing Aparapi APIs
+
+Until we have support for multiple entrypoints in Aparapi, there are some tricks for emulating this feature.
+
+Follow the proposal for adding multiple entrypoints on this page [MultipleEntryPointSupportProposal](MultipleEntryPointSupportProposal.md).
+
+Suppose we wanted to create a general VectorMath kernel which might expose unary square and square root methods along with binary addition and subtraction functionality. With our current API limitations we can't easily do this, but we can approximate having separate methods by passing a separate arg to dictate the 'function' that we wish to perform.
+
+    class VectorKernel extends Kernel{
+        float[] lhsOperand;
+        float[] rhsOperand;
+        float[] unaryOperand;
+        float[] result;
+        final static int FUNC_ADD =0;
+        final static int FUNC_SUB =1;
+        final static int FUNC_SQR =2;
+        final static int FUNC_SQRT =3;
+        // other functions
+        int function;
+        @Override public void run(){
+            int gid = getGlobalId(0);
+            if (function==FUNC_ADD){
+               result[gid]=lhsOperand[gid]+rhsOperand[gid];
+            }else if (function==FUNC_SUB){
+               result[gid]=lhsOperand[gid]-rhsOperand[gid];
+            }else if (function==FUNC_SQR){
+               result[gid]=unaryOperand[gid]*unaryOperand[gid];
+            }else if (function==FUNC_SQRT){
+               result[gid]=sqrt(unaryOperand[gid]);
+            }else if ....
+        }
+    }
+
+To use this for adding two vectors and then taking the sqrt of the result we would use something like....
+
+    int SIZE=1024;
+    Range range = Range.create(SIZE);
+    VectorKernel vk = new VectorKernel();
+    vk.lhsOperand = new float[SIZE];
+    vk.rhsOperand = new float[SIZE];
+    vk.unaryOperand = new float[SIZE];
+    vk.result = new float[SIZE];
+
+    // fill lhsOperand omitted
+    // fill rhsOperand omitted
+    vk.function = VectorKernel.FUNC_ADD;
+    vk.execute(range);
+    System.arraycopy(vk.result, 0, vk.unaryOperand, 0, SIZE);
+    vk.function = VectorKernel.FUNC_SQRT;
+    vk.execute(range);
+
+This approach is fairly common and I have used it successfully to perform various pipeline stages for calculating FFTs, for example. Whilst this is functional it is not a great solution. First, the API is clumsy. We have to mutate the state of the kernel instance and then re-arrange the arrays manually to chain math operations. We could of course hide all of this behind helper methods. One could imagine, for example, an implementation which exposes helper methods `add(lhs, rhs)` or `sqrt()` which hide all the nasty stuff.
+
+    class VectorKernel extends Kernel{
+        float[] lhsOperand;
+        float[] rhsOperand;
+        float[] unaryOperand;
+        float[] result;
+        final static int FUNC_ADD =0;
+        final static int FUNC_SUB =1;
+        final static int FUNC_SQR =2;
+        final static int FUNC_SQRT =3;
+        // other functions
+        int function;
+        @Override public void run(){
+            int gid = getGlobalId(0);
+            if (function==FUNC_ADD){
+               result[gid]=lhsOperand[gid]+rhsOperand[gid];
+            }else if (function==FUNC_SUB){
+               result[gid]=lhsOperand[gid]-rhsOperand[gid];
+            }else if (function==FUNC_SQR){
+               result[gid]=unaryOperand[gid]*unaryOperand[gid];
+            }else if (function==FUNC_SQRT){
+               result[gid]=sqrt(unaryOperand[gid]);
+            }else if ....
+        }
+        private void binary(int operator, float[] lhs, float[] rhs){
+           lhsOperand = lhs;
+           rhsOperand = rhs;
+           function=operator;
+           execute(lhs.length);
+        }
+        public void add(float[] lhs, float[] rhs){
+           binary(FUNC_ADD, lhs, rhs);
+        }
+
+        public void sub(float[] lhs, float[] rhs){
+           binary(FUNC_SUB, lhs, rhs);
+        }
+
+        private void binary(int operator, float[] rhs){
+           System.arraycopy(result, 0, lhsOperand, 0, result.length);
+           rhsOperand = rhs;
+           function=operator;
+           execute(lhsOperand.length);
+        }
+
+        public void add(float[] rhs){
+           binary(FUNC_ADD,  rhs);
+        }
+
+        public void sub( float[] rhs){
+           binary(FUNC_SUB,  rhs);
+        }
+
+        private void unary(int operator, float[] unary){
+           unaryOperand = unary;
+           function=operator;
+           execute(unaryOperand.length);
+        }
+
+        public void sqrt(float[] unary){
+           unary(FUNC_SQRT, unary);
+        }
+
+        private void unary(int operator){
+           System.arraycopy(result, 0, unaryOperand, 0, result.length);
+           function=operator;
+           execute(unaryOperand.length);
+        }
+
+        public void sqrt(){
+           unary(FUNC_SQRT);
+        }
+
+    }
+
+    VectorKernel vk = new VectorKernel(SIZE);
+    vk.add(copyLhs, copyRhs);  // copies args to lhs and rhs operands
+                               // sets function type
+                               // and executes kernel
+    vk.sqrt();                 // because we have no arg
+                               // copies result to unary operand
+                               // sets function type
+                               // execute kernel
+
+However there is one more objection to this approach, namely that by default it will force unnecessary buffer copies.
+
+When the bytecode for the above Kernel.run() method is analyzed, Aparapi finds bytecode reading from the lhsOperand, rhsOperand and unaryOperand arrays/buffers. Obviously at this bytecode analysis stage we can't predict which 'function type' will be used, so on every execution (Kernel.run()) Aparapi must copy all three buffers to the GPU. For binary operations this is one wasted buffer copy (the unaryOperand); for the unary operations we copy two buffers unnecessarily (lhsOperand and rhsOperand). We can of course use explicit buffer management to help us reduce these costs. Ideally we add this to our helper methods.
+
+    class VectorKernel extends Kernel{
+        float[] lhsOperand;
+        float[] rhsOperand;
+        float[] unaryOperand;
+        float[] result;
+        final static int FUNC_ADD =0;
+        final static int FUNC_SUB =1;
+        final static int FUNC_SQR =2;
+        final static int FUNC_SQRT =3;
+        // other functions
+        int function;
+        @Override public void run(){
+            int gid = getGlobalId(0);
+            if (function==FUNC_ADD){
+               result[gid]=lhsOperand[gid]+rhsOperand[gid];
+            }else if (function==FUNC_SUB){
+               result[gid]=lhsOperand[gid]-rhsOperand[gid];
+            }else if (function==FUNC_SQR){
+               result[gid]=unaryOperand[gid]*unaryOperand[gid];
+            }else if (function==FUNC_SQRT){
+               result[gid]=sqrt(unaryOperand[gid]);
+            }else if ....
+        }
+        private void binary(int operator, float[] lhs, float[] rhs){
+           lhsOperand = lhs;
+           rhsOperand = rhs;
+           function=operator;
+           put(lhsOperand).put(rhsOperand);
+           execute(lhs.length);
+           get(result);
+        }
+        public void add(float[] lhs, float[] rhs){
+           binary(FUNC_ADD, lhs, rhs);
+        }
+
+        public void sub(float[] lhs, float[] rhs){
+           binary(FUNC_SUB, lhs, rhs);
+        }
+
+        private void binary(int operator, float[] rhs){
+           System.arraycopy(result, 0, lhsOperand, 0, result.length);
+           rhsOperand = rhs;
+           function=operator;
+           put(lhsOperand).put(rhsOperand);
+           execute(lhsOperand.length);
+           get(result);
+        }
+
+        public void add(float[] rhs){
+           binary(FUNC_ADD,  rhs);
+        }
+
+        public void sub( float[] rhs){
+           binary(FUNC_SUB,  rhs);
+        }
+
+        private void unary(int operator, float[] unary){
+           unaryOperand = unary;
+           function=operator;
+           put(unaryOperand);
+           execute(unaryOperand.length);
+           get(result);
+        }
+
+        public void sqrt(float[] unary){
+           unary(FUNC_SQRT, unary);
+        }
+
+        private void unary(int operator){
+           System.arraycopy(result, 0, unaryOperand, 0, result.length);
+           function=operator;
+           put(unaryOperand);
+           execute(unaryOperand.length);
+           get(result);
+
+        }
+
+        public void sqrt(){
+           unary(FUNC_SQRT);
+        }
+
+    }
+
diff --git a/doc/ExplicitBufferHandling.md b/doc/ExplicitBufferHandling.md
new file mode 100644
index 0000000000000000000000000000000000000000..5f0e70112dfda3c99d87b0e9103adcb20fec8045
--- /dev/null
+++ b/doc/ExplicitBufferHandling.md
@@ -0,0 +1,220 @@
+#ExplicitBufferHandling
+*How to minimize buffer transfers Updated Jul 24, 2012 by frost.g...@gmail.com*
+Aparapi is designed to shield the Java developer from dealing with the underlying movement of data between the OpenCL host and device. Aparapi can analyze a kernel's `run()` method and run-reachable methods to determine which primitive arrays to transfer to the GPU prior to execution, and which arrays to transfer back when the GPU execution is complete.
+
+Generally this strategy is both clean and performant. Aparapi will attempt to just do the right thing.
+
+However, occasionally the following code pattern is seen.
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+        ... // reads/writes hugeArray
+    };
+    for (int loop=0; loop <MAXLOOP; loop++){
+        kernel.execute(HUGE);
+    }
+
+This is a common pattern which unfortunately exposes an issue with Aparapi's normal buffer handling.
+
+Although Aparapi does analyze the bytecode of the `Kernel.run()` method (and any method reachable from `Kernel.run()`), Aparapi has no visibility to the call site. In the above code there is no way for Aparapi to detect that hugeArray is not modified within the for loop body. Unfortunately, Aparapi must default to being 'safe' and copy the contents of hugeArray backwards and forwards to the GPU device.
+
+Here we add comments to indicate where the unnecessary buffer transfers take place.
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray
+    };
+    for (int loop=0; loop <MAXLOOP; loop++){
+       // copy hugeArray to GPU
+       kernel.execute(HUGE);
+       // copy hugeArray back from the GPU
+    }
+
+In reality hugeArray only needs to be copied to the GPU once (prior to the loop) and then once again when the loop has terminated.
+
+Here we use comments to indicate the 'optimal' transfers.
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray
+    };
+    // Ideally transfer hugeArray to GPU here
+    for (int loop=0; loop <MAXLOOP; loop++){
+       kernel.execute(HUGE);
+    }
+    // Ideally transfer hugeArray back from GPU here
+
+Consider another common pattern
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    done[0]=0;
+    while (done[0] == 0){
+       kernel.execute(HUGE);
+    }
+
+This is a common pattern in reduce stages of map-reduce type problems. Essentially the developer wants to keep executing a kernel until some condition is met. For example, this may be seen in bitonic sort implementations and various financial applications.
+
+From the code it can be seen that the kernel reads and writes `hugeArray[]` array and uses the single item `done[]` array to indicate some form of convergence or completion.
+
+As we demonstrated above, by default Aparapi will transfer `done[]` and `hugeArray[]` to and from the GPU device each time `Kernel.execute(HUGE)` is executed.
+
+To demonstrate which buffers are being transferred, these copies are shown as comments in the following version of the code.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    done[0]=0;
+    while (done[0] == 0){
+       // Send done[] to GPU
+       // Send hugeArray[] to GPU
+       kernel.execute(HUGE);
+       // Fetch done[] from GPU
+       // Fetch hugeArray[] from GPU
+    }
+
+Further analysis of the code reveals that `hugeArray[]` is not accessed by the loop containing the kernel execution, so Aparapi is performing 999 unnecessary transfers to the device and 999 unnecessary transfers back. Only two transfers of `hugeArray[]` are needed; one to move the initial data to the GPU and one to move it back after the loop terminates.
+
+The `done[]` array is accessed during each iteration (although never written to within the loop), so it does need to be transferred back for each return from Kernel.execute(), however, it only needs to be sent once.
+
+Clearly it is better to avoid unnecessary transfers, especially of large buffers like `hugeArray[]`.
+
+Aparapi exposes a feature which allows the developer to control these situations and explicitly manage transfers.
+
+To use this feature first the developer needs to 'turn on' explicit mode, using the `kernel.setExplicit(true)` method. Then the developer can request buffer/array transfers using either `kernel.put()` or `kernel.get()`. `Kernel.put()` forces a transfer to the GPU device and Kernel.get() transfers data back.
+
+The following code illustrates the use of these new explicit buffer management APIs.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    kernel.setExplicit(true);
+    done[0]=0;
+    kernel.put(done);
+    kernel.put(hugeArray);
+    while (done[0] == 0){
+       kernel.execute(HUGE);
+       kernel.get(done);
+    }
+    kernel.get(hugeArray);
+
+Note that marking a kernel as explicit and failing to request the appropriate transfer is a programmer error.
+
+We deliberately made `Kernel.put(...)`, `Kernel.get(...)` and `Kernel.execute(range)` return an instance of the executing kernel so that these calls can be chained. Some may find this fluent-style API more expressive.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    kernel.setExplicit(true);
+    done[0]=0;
+    kernel.put(done).put(hugeArray);    // chained puts
+    while (done[0] == 0){
+       kernel.execute(HUGE).get(done);  // chained execute and put
+    }
+    kernel.get(hugeArray);
+
+##An alternate approach for loops containing a single kernel.execute(range) call
+
+One variant of code which would normally suggest the use of Explicit Buffer Management can be handled differently. For cases where `Kernel.execute(range)` is the sole statement inside a loop and where the iteration count is known prior to the first iteration, we offer an alternate (hopefully more elegant) way of minimizing buffer transfers.
+
+So for cases like:-
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+        ... // reads/writes hugeArray
+    };
+
+    for (int pass=0; pass<1000; pass++){
+       kernel.execute(HUGE);
+    }
+
+The developer can request that Aparapi perform the outer loop rather than coding the loop. This is achieved explicitly by passing the iteration count as the second argument to `Kernel.execute(range, iterations)`.
+
+Now any form of code that looks like :-
+
+    int range = 1024;
+    int loopCount = 64;
+    for (int passId = 0; passId < loopCount; passId++){
+       kernel.execute(range);
+    }
+
+Can be replaced with
+
+    int range = 1024;
+    int loopCount = 64;
+
+    kernel.execute(range, loopCount);
+
+Not only does this make the code more compact and avoid the use of explicit buffer management APIs, it also gives Aparapi visibility of the complete loop so that Aparapi can minimize the number of transfers. Aparapi will only transfer buffers to the GPU once and transfer them back once, resulting in improved performance.
+
+Sometimes kernel code using this loop pattern needs to track the current iteration number as the code passes through the outer loop. Previously we would be forced to use explicit buffer management to allow the kernel to do this.
+
+The code for this would have looked something like
+
+    int range = 1024;
+    int loopCount = 64;
+    final int[] hugeArray = new int[HUGE];
+    final int[] passId = new int[]{0};
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int id=getGlobalId();
+          if (passId[0] == 0){
+              // perform some initialization!
+          }
+          ... // reads/writes hugeArray
+       }
+    };
+    Kernel.setExplicit(true);
+    kernel.put(hugeArray);
+    for (passId[0]=0; passId[0]<loopCount; passId[0]++){
+
+       kernel.put(passId).execute(range);
+    }
+In the current version of Aparapi we added `Kernel.getPassId()` to allow a Kernel to determine the current ‘pass’ through the outer loop without having to use explicit buffer management.
+
+So the previous code can now be written without any explicit buffer management APIs:-
+
+    final int[] hugeArray = new int[HUGE];
+    final int pass[] = new int[]{0};
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId();
+          int pass = getPassId();
+          if (pass == 0){
+              // perform some initialization!
+          }
+          ... // reads/writes both hugeArray
+       }
+    };
+
+    kernel.execute(HUGE, 1000);
+One common use for Kernel.getPassId() is to avoid flipping buffers in the outer loop.
+
+It is common for kernels to process data from one buffer to another, and in the next invocation process the data back the other way. Now these kernels can use the passId (odd or even) to determine the direction of data transfer.
+
+    final int[] arr1 = new int[HUGE];
+    final int[] arr2 = new int[HUGE];
+    Kernel kernel = new Kernel(){
+       int f(int v){ … }
+
+       @Override public void run(){
+          int id = getGlobalId();
+          int pass = getPassId();
+          if (pass % 2 == 0){
+              arr1[id] = f(arr2[id]);
+          }else{
+              arr2[id] = f(arr1[id]);
+
+          }
+       }
+    };
+
+    kernel.execute(HUGE, 1000);
\ No newline at end of file
diff --git a/doc/FrequentlyAskedQuestions.md b/doc/FrequentlyAskedQuestions.md
new file mode 100644
index 0000000000000000000000000000000000000000..4092dff8f4fc173fdf75bd7fd08c8ab9708ae904
--- /dev/null
+++ b/doc/FrequentlyAskedQuestions.md
@@ -0,0 +1,134 @@
+#FrequentlyAskedQuestions
+*Frequently Asked Questions Updated Oct 17, 2012 by frost.g...@gmail.com*
+
+##Frequently Asked Questions
+
+##Why is this project called Aparapi and how is it pronounced?
+
+Aparapi is just a contraction of A PAR{allel} API and is pronounced (ap-per-rap-ee).
+
+##Does Aparapi only work with AMD graphics cards?
+
+No. Aparapi has been tested with AMD's OpenCL enabled drivers and devices as well as a limited set of NVidia devices and drivers on Windows, Linux and Mac OSX platforms. The minimal requirement at runtime is OpenCL 1.1. If you have a compatible OpenCL 1.1 runtime and supported devices Aparapi should work.
+
+Although the build is currently configured for AMD APP SDK, OpenCL is an open standard and we look forward to contributions which will allow Aparapi to be built against other OpenCL SDK's.
+
+Note that dll's built using AMD APP SDK will work on other platforms at runtime. So the binary builds are expected to work on all OpenCL 1.1 platforms.
+
+Witold Bolt has kindly supplied the patches to allow Mac OS support. The Mac OS build will run against OpenCL 1.1 and 1.0 runtimes, but we won't fix any issues reported against OpenCL 1.0; your code may or may not run.
+
+Aparapi may be used in JTP (Java Thread Pool) mode on any platform supported by Oracle®’s JDK.
+
+## Does Aparapi only support AMD CPUs?
+
+No, there is nothing restricting Aparapi to AMD CPUs. The JNI code that we use may run on any x86/x64 machine provided there is a compatible Java Virtual Machine® (JVM) implementation for your platform.
+
+##Will there be an Aparapi-like translator for .NET?
+
+This is still an early technology and Aparapi is currently focused on Java® enablement. There are similar projects targeting .NET (See www.tidepowerd.com)
+
+##How can I profile the OpenCL kernels that Aparapi generates? Can I get details on the latency of my kernel request?How do I optimize my kernel?
+
+AMD offers the ‘AMD APP Profiler’ which can be used to profile the kernels. With Aparapi, we recommend using the command line mode of the profiler, which is described in the release notes. Using the ‘AMD APP Profiler’ you can see how much time is taken by each kernel execution and buffer transfer. Also, in each kernel, you can get more detailed information on things like memory reads and writes, and other useful data.
+
+##Can I have multiple threads all using the GPU compute capabilities?
+
+Yes. There might be a performance impact if the device becomes a bottleneck. However, OpenCL and your GPU driver are designed to coordinate the various threads of execution.
+
+##Can I make method calls from the run method?
+
+You can generally only make calls to other methods declared in the same class as the initial run() method. Aparapi will follow this call chain to try to determine whether it can create OpenCL. If, for example, Aparapi encounters System.out.println("Hello World") (a call to a method not in the user's Kernel class) it will detect this and refuse to consider the call chain as an OpenCL candidate.
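+
+As an illustrative sketch of this rule (the in and out arrays and the scale() helper are hypothetical; in and out are assumed to be captured final primitive arrays):
+
+    final float[] in = new float[1024];
+    final float[] out = new float[1024];
+    Kernel kernel = new Kernel(){
+       // declared inside the same (anonymous) Kernel class, so Aparapi can follow the call chain
+       float scale(float v){
+          return v * 2f;
+       }
+       @Override public void run(){
+          int i = getGlobalId();
+          out[i] = scale(in[i]);         // OK: the call stays within the kernel class
+          // System.out.println(out[i]); // would prevent conversion to OpenCL
+       }
+    };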
+
+One exception to this rule allows a kernel to access or mutate the state of objects held in simple arrays via their setters/getters. For example a kernel can include :-
+
+    out[i].setValue(in[i].getValue()*5);
+
+##Does Aparapi support vectorized types?
+
+Due to Java's lack of vector types (float4 for example) Aparapi can't directly use them. Also, due to Java's lack of operator overloading, simulating these with Java abstractions could lead to very complex and unwieldy code.
+
+##Is there a way I can see the generated OpenCL?
+
+Yes, by adding -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true to your command line when you start your JVM.
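+
+For example, a typical invocation might look like the following; the jar names and main class are placeholders for your own application (on Microsoft® Windows® use ; as the classpath separator):
+
+    java -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true -cp aparapi.jar:myapp.jar com.example.MyAparapiApp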
+
+##Does Aparapi support sharing buffers with JOGL? Can I exploit the features of JOGAMP/glugen?
+
+Rather than only supporting display-oriented compute, we are pursuing general data parallel compute. Therefore, we have chosen not to bind Aparapi too closely with JOGL.
+
+##What is the performance delta from handcrafted OpenCL?
+
+This depends heavily on the application. Although we can currently show 20x performance improvement on some compute intensive Java applications compared with the same algorithm using a Java Thread Pool a developer who is prepared to handcraft and hand-tune OpenCL and write custom host code in C/C++ is likely to see better performance than Aparapi may achieve.
+
+We understand that some users may use Aparapi as a gateway technology to test their Java code before porting to hand-crafted/tuned OpenCL.
+
+##Are you working with Project Lambda for offloading/parallelizing suitable work?
+
+We are following the progress of Project Lambda (currently scheduled for inclusion in Java 8) and would like to be able to leverage the Lambda expression format in Aparapi, but no such support exists yet.
+
+##Can I select a specific GPU if I have more than one card?
+
+Under review. At present, Aparapi just looks for the first AMD GPU (or APU) device. If the community has feedback on its preference, let us know.
+
+##Can I get the demos/samples presented at JavaOne or ADFS?
+
+The Squares and Mandlebrot sample code is included in the binary download of Aparapi. The NBody source is not included in the binary (because of a dependency on JOGL). We have, however, included the NBody code as an example project in the Open Source tree (code.google.com/p/aparapi) and provide details on how to install the appropriate JOGL components.
+
+##Can Mersenne twister be ported as a random number function inside the kernel class?
+
+You can elect to implement your own Mersenne twister and use it in your own derived Kernel.
+
+##Does Aparapi use JNI?
+
+Yes, we do ship a small JNI shim to handle the host OpenCL calls.
+
+##How can I confirm that my code is actually executing on the GPU?
+
+From within the Java code itself you can query the execution mode after Kernel.execute(n) has returned.
+
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+       }
+    } ;
+    kernel.execute(1024);
+    System.out.println("Execution mode = "+kernel.getExecutionMode());
+
+The above code fragment will print either ‘GPU’ if the kernel executed on the GPU or JTP if Aparapi executed the Kernel in a Java Thread Pool.
+
+Alternatively, setting the property -Dcom.amd.aparapi.enableShowExecutionModes=true when you start your JVM will cause Aparapi to automatically report the execution mode of all kernels to stdout.
+
+##Why does Aparapi need me to compile my code with -g?
+
+Aparapi extracts most of the information required to create OpenCL from the bytecode of your Kernel.run() (and run-reachable) methods. We use the debug information to re-create the original variable name and to determine the local variable scope information.
+
+Of course only the derived Kernel class (or accessed Objects using new Arrays of Objects feature) need to be compiled using -g.
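+
+For example, if your kernel lives in MyKernel.java and you compile it directly with javac, something like this ensures the debug (local variable) information is present:
+
+    javac -g MyKernel.java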
+
+##Why does the Aparapi documentation suggest I use Oracle's JDK/JRE? Why can't I use any JVM/JDK?
+
+The documentation suggests using Oracle's JDK/JRE for coverage reasons and not as a requirement. AMD focused its testing on Oracle's JVM/JDK.
+
+There are two parts to this.
+
+1. Our bytecode to OpenCL engine is somewhat tuned to the bytecode structures created by javac supplied by Oracle®. Specifically, there are some optimizations that other javac implementation might perform that Aparapi won't recognize. Eclipse (for example) does not presently use Oracle's javac, and so we do have some experience handling Eclipse specific bytecode patterns.
+2. At runtime, we piggyback on the (aptly named) sun.misc.Unsafe class, which is included in rt.jar from Oracle®. This class is useful because it helps us avoid some JNI calls by providing low level routines for accessing object field addresses (in real memory) and useful routines for Atomic operations. All accesses to 'sun.misc.Unsafe' are handled by an Aparapi class called UnsafeWrapper with the intent that this could be refactored to avoid this dependency.
+
+##I am using a dynamic language (Clojure, Scala, Groovy, Beanshell, etc) will I be able to use Aparapi?
+
+No.
+
+To access the bytecode for a method Aparapi needs to parse the original class file. For Java code, Aparapi can use something like `YourClass.getClassLoader().getResourceAsStream(YourClass.getName().replace('.', '/')+".class")` to reload the class file bytes and parse the constant pool, attributes, fields, methods and method bytecode.
+
+It is unlikely that this process would work with a dynamically created class, based on the presumption that dynamic languages employ some form of custom classloader to make dynamically generated bytecode available to the JVM. Therefore, it is unlikely that these classloaders would yield the classfile bytes. However, we encourage contributors to investigate opportunities here. Even if the class bytes were loadable, Aparapi would also expect debug information to be available (see previous FAQ entry). Again, this is not impossible for a dynamic language to do; indeed it would probably even be desirable, as it would allow the code to be debugged using a JDB-compatible debugger.
+
+Finally, Aparapi recognizes bytecode patterns created by the javac supplied by Oracle® and it is possible that the code generated by a particular dynamic language may not be compatible with Aparapi current code analyzer.
+
+Therefore, at present this is unlikely to work. However, these would be excellent contributions to Aparapi. It would be great to see Aparapi being adopted by other JVM-based dynamic languages.
+
+##Why does Aparapi seems to be copying data unnecessarily back and forth between host and GPU. Can I stop Aparapi from doing this?
+
+Aparapi ensures that required data is moved to the GPU prior to kernel execution and returned to the appropriate array before Java execution resumes. Generally, this is what the Java user will expect. However, for some code patterns where multiple Kernel.execute() calls are made in succession (or more likely in a tight loop) Aparapi's approach may not be optimal.
+
+In the NewFeatures page we discuss a couple of Aparapi enhancements which allow developers to intervene to reduce unnecessary copies.
+
+##Do I have to refactor my code to use arrays of primitives? Why can’t Aparapi just work with Java Objects?
+
+Aparapi creates OpenCL from the bytecode. Generally, OpenCL constrains us to using parallel primitive arrays (OpenCL does indeed allow structs, but Java and OpenCL do not have comparable memory layouts for these structures). Therefore, you will probably need to refactor your code to use primitive arrays. In this initial contribution, we have included some limited support for arrays of simple Objects and hope contributors will extend it. Check the NewFeatures page which shows how you can use this feature.
\ No newline at end of file
diff --git a/doc/HSAEnablementOfLambdaBranch.md b/doc/HSAEnablementOfLambdaBranch.md
new file mode 100644
index 0000000000000000000000000000000000000000..15e7fe9c1b71cee17b796a38ab8a95688cafa2c0
--- /dev/null
+++ b/doc/HSAEnablementOfLambdaBranch.md
@@ -0,0 +1,32 @@
+#HSAEnablementOfLambdaBranch
+*Adding HSA Support to Aparapi lambda branch Updated Feb 28, 2014 by frost.g...@gmail.com*
+
+* [How to setup a HSA enabled Linux Platform](SettingUpLinuxHSAMachineForAparapi.md)
+* [How to setup a HSA simulator on a Linux Platform](UsingAparapiLambdaBranchWithHSASimulator.md)
+
+Recently the HSA Foundation released their ‘Programmers Reference Manual’. This manual is for developers wishing to write code for upcoming HSA compatible devices; it describes the HSA Intermediate Language (HSAIL) along with its binary form (BRIG) and describes how code is expected to execute on HSA enabled devices.
+
+In many ways we can think of HSAIL as we do Java bytecode. It is a common intermediate form that can be optimized at runtime to execute across a variety of future heterogeneous platforms. HSAIL will greatly simplify the development of software taking advantage of both sequential and parallel compute solutions.
+
+Now that the spec is out, we have started adding HSA support to the Aparapi lambda branch. We believe that HSA, combined with the upcoming Java 8 lambda feature, will be a natural way to express parallel algorithms which can be executed on the GPU via HSA.
+
+A HSA+Lambda enabled Aparapi will remove many of Aparapi's constraints. HSA allows all of the CPU's memory to be accessed directly from code running on the GPU. This means
+
+* We no longer need to move data from the host CPU to the GPU.
+* We are no longer limited to the memory addressable from the GPU.
+* We can access multi-dim arrays efficiently.
+* We can access Java objects directly from the GPU.
+
+These are all substantial benefits.
+
+In the existing code (early prototype) we provide access to HSA as a specific device type.
+
+So our ubiquitous 'squares' example will initially be written as:
+
+    int in[] = ....  // get the input array from somewhere
+    int out[] = .... // create the output array
+    Device.hsa().forEach(in.length, (i)->{
+       out[i] = in[i]*in[i];
+     });
+You will obviously need a Java 8 compatible JDK ([https://jdk8.java.net/download.html](https://jdk8.java.net/download.html)) in your path.
+
+We also recommend using IntelliJ which has preliminary support for Java 8 lambda features. You can download the community edition of IntelliJ from [http://www.jetbrains.com/idea/](http://www.jetbrains.com/idea/)
\ No newline at end of file
diff --git a/doc/HSAEnablementOfLambdaBranchSidebar.md b/doc/HSAEnablementOfLambdaBranchSidebar.md
new file mode 100644
index 0000000000000000000000000000000000000000..3275452280bd0f065af3f2f1e9e0291927ba490f
--- /dev/null
+++ b/doc/HSAEnablementOfLambdaBranchSidebar.md
@@ -0,0 +1,6 @@
+#HSAEnablementOfLambdaBranchSidebar
+*Sidebar for HSAEnablementOfLambdaBranchAparapi*
+
+[How to setup a HSA enabled Linux Platform](SettingUpLinuxHSAMachineForAparapi.md)
+
+[How to setup a HSA simulator on a Linux Platform](UsingAparapiLambdaBranchWithHSASimulator.md)
diff --git a/doc/HowToAddUML.md b/doc/HowToAddUML.md
new file mode 100644
index 0000000000000000000000000000000000000000..8c1c7f4997689f6f40a664e7bfd67b22e2726c8b
--- /dev/null
+++ b/doc/HowToAddUML.md
@@ -0,0 +1,39 @@
+#HowToAddUML
+*How to add plantuml docs to wiki pages Updated Apr 20, 2013 by frost.g...@gmail.com*
+
+Go to http://www.plantuml.com/plantuml and type in the text for your diagram.
+
+Hit submit and check out the diagram.
+
+Once you are happy with something like
+
+    start
+    :kernel.execute(range);
+    if (?) then (first call for this instance)
+        : Convert Kernel.run() to OpenCL;
+        note
+           We also convert all
+           methods reachable from
+           kernel.run()
+        end note
+        if (?) then (Conversion was successful)
+           : Compile OpenCL;
+           : Map compiled OpenCL to this Kernel;
+        else (Conversion unsuccessful)
+        endif
+    else (not first call)
+    endif
+    if (?) then (OpenCL mapped for this instance)
+       : Bind args (send to GPU);
+       : Execute kernel;
+    else (false)
+       : Execute using a Java Thread Pool;
+    endif
+    stop
+Paste the resulting URL into the wiki page but append %20as.png at the end of the URL
+
+http://www.plantuml.com:80/plantuml/img/BLAHBLAH%20as.png
+
+To get this!
+
+![Image of UML](uml.png)
\ No newline at end of file
diff --git a/doc/JavaKernelGuidelines.md b/doc/JavaKernelGuidelines.md
new file mode 100644
index 0000000000000000000000000000000000000000..89ab38dc789f31b1acf1f474c87b152a4ffbe4a6
--- /dev/null
+++ b/doc/JavaKernelGuidelines.md
@@ -0,0 +1,72 @@
+#JavaKernelGuidelines
+*What code can and can't be converted to OpenCL by Aparapi. Updated Sep 13, 2011 by frost.g...@gmail.com*
+##Aparapi Java Kernel Guidelines
+Certain practices can improve the chances of your Java kernel being converted to OpenCL and executing on a GPU.
+
+The following guidelines/restrictions only apply to the Kernel.run() method and any method reachable from run() (called "run-reachable methods" in this documentation); clearly, any methods executed via a normal Java execution path will not be subject to these restrictions.
+
+Some restrictions/guidelines may be removed or augmented in future Aparapi releases.
+
+##Data Types
+* Only the Java primitive data types boolean, byte, short, int, long, and float and one-dimensional arrays of these primitive data types are supported by Aparapi.
+* Aparapi support for the primitive data type double will depend on your graphics card, driver, and OpenCL version. Aparapi will query the device/platform to determine if double is supported (at runtime). If your platform does not support double, Aparapi will drop back to (Java Thread Pool) (JTP) mode.
+* The primitive data type char is not supported.
+
+##Fields
+* Elements of primitive array fields can be read from kernel code.
+* Elements of primitive array fields can be written to by kernel code.
+* Note that Java creates 'hidden' fields for captured final primitive arrays (from anonymous inner classes); these can be accessed as if they were fields of the kernel (see the sketch after this list).
+* Primitive scalar fields can only be read by the kernel code. Because kernel run-reachable methods execute in parallel in an indeterminate order, any reliance on the result of modifications to primitive scalar fields is discouraged even when executing in Java Thread Pool mode.
+* Static final fields can be read from kernel code.
+* Static non-final fields are not supported for either read or write. Try to make them final.
+
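+A minimal sketch (with illustrative names) of field access that follows these guidelines:
+
+    final int[] data = new int[1024]; // captured final primitive array: elements may be read and written
+    final float scale = 2.5f;         // captured primitive scalar: read-only inside the kernel
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId();
+          data[id] = (int)(data[id] * scale);
+       }
+    };
+    kernel.execute(data.length);
+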
+##Arrays
+* Only one-dimensional arrays are supported.
+* Arrays cannot be aliased, either by direct local assignment or by being passed as arguments to other methods (see the sketch after this list).
+* Java 5’s extended 'for' syntax for (int i: arrayOfInt){} is not supported, because it causes a shallow copy of the original array under the covers.
+
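+For example, the following (illustrative) kernel would not be converted to OpenCL because it aliases an array:
+
+    final int[] values = new int[1024];
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int[] alias = values;      // aliasing by local assignment - not supported
+          alias[getGlobalId()] = 0;  // conversion fails and Aparapi falls back to Java execution
+       }
+    };
+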
+##Methods
+* References to or through a Java Object other than your kernel instance will cause Aparapi to abandon attempting to create OpenCL (note the following exceptions).
+* There are a few very specific exceptions to the above rule to allow accesses through getters/setters of objects held in arrays of objects referenced from the kernel code.
+* Static methods are not supported by Aparapi.
+* Recursion is not supported, whether direct or indirect. Aparapi tries to detect this recursion statically, but the developer should not rely on Aparapi to do so.
+* Methods with varargs argument lists are not supported by Aparapi.
+* Overloaded methods (i.e. methods with the same name but different signatures) are not supported by Aparapi. OpenCL is C99 based so we are constrained by OpenCL's lack of support for overloading.
+* The Kernel base class contains wrappers around most of the functions offered by java.lang.Math. When run in a thread pool these wrappers delegate back to java.lang.Math; when executing in OpenCL they translate to the OpenCL equivalents (see the sketch after this list).
+
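+A small (illustrative) sketch using one of those wrappers:
+
+    final float[] in = new float[1024];
+    final float[] out = new float[in.length];
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId();
+          out[id] = sqrt(in[id]); // Kernel's sqrt() wrapper: java.lang.Math in JTP mode, OpenCL sqrt() on the GPU
+       }
+    };
+    kernel.execute(in.length);
+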
+##Other Restrictions
+
+* Exceptions are not supported (no throw, catch, or finally).
+* New is not supported, either for arrays or for objects.
+* Synchronized blocks and synchronized methods are not supported.
+* Only simple loops and conditionals are supported; switch, break, and continue are not supported.
+* A variable cannot have its first assignment be the side effect of an expression evaluation or a method call.  For example, the following will not be translated to run on the GPU.
+
+
+        int foo(int a) {
+           // . . .
+        }
+        public void run() {
+          int z;
+          foo(z = 3);
+        }
+
+
+* This should be regarded as an error which needs to be addressed; as a workaround, explicitly initialize variables (even to 0) when they are declared.
+
+## Beware Of Side Effects
+OpenCL is C99-based, and as such the result of expressions that depend on side effects of other expressions can differ from what one might expect in Java; please avoid code that assumes Java's tighter rules. Generally, code should be as simple as possible.
+For example, although Java explicitly defines
+
+    arra[i++] = arrb[i++];
+to be equivalent to
+
+    arra[i] = arrb[i+1];
+    i += 2;
+
+The C99/OpenCL standard does not define this and so the result would be undefined.
+
+##Runtime Exceptions
+* When run on the GPU, array accesses will not generate an ArrayIndexOutOfBoundsException.  Instead the behavior will be unspecified.
+* When run on the GPU, ArithmeticExceptions will not be generated, for example with integer division by zero. Instead the behavior will be unspecified.
+Attribution
diff --git a/doc/LIbraryAgentDuality.md b/doc/LIbraryAgentDuality.md
new file mode 100644
index 0000000000000000000000000000000000000000..88e164e679d635bdd00aebc83d57ff88cb82637c
--- /dev/null
+++ b/doc/LIbraryAgentDuality.md
@@ -0,0 +1,28 @@
+#LIbraryAgentDuality
+*Aparapi libraries can now be loaded as JVMTI agents. Updated Jan 15, 2013 by frost.g...@gmail.com*
+
+##What are all these check-ins referring to JVMTI agents?
+
+If you have been tracking Aparapi SVN check-ins you will have noticed a bunch of changes to JNI code. I just finished arranging for the Aparapi libraries (.dll or .so) to be loadable as a JVMTI agent. Now (assuming the library is in ${APARAPI_DIR}) we can either launch using the traditional...
+
+    java -Djava.library.path=${APARAPI_DIR} -classpath ${APARAPI_DIR}/aparapi.jar;my.jar mypackage.MyClass
+
+or ...
+
+    java -agentpath:${APARAPI_DIR}/aparapi_x86_64.dll -classpath ${APARAPI_DIR}/aparapi.jar;my.jar mypackage.MyClass
+
+So the dll/so is now both ‘just a library’ and a JVMTI agent.
+
+##When would I need an agent?
+
+Previously Aparapi loaded classes that it needed to convert to OpenCL using java.lang.Class.getResourceAsStream(). This only works if we have a jar, or if the classes are on the filesystem somewhere. This approach will not work for 'synthetically generated classes'.
+
+There are applications/frameworks which create synthetic classes (at runtime) which would not normally be useable by Aparapi.
+
+Specifically (and significantly) Java 8 uses synthetic classes to capture args (closure captures) so they can be passed to the final lambda implementation. We needed a way to allow Aparapi to access bytecode of any class, not just those in jars or on the disk.
+
+A JVMTI agent can register an interest in classes as they are loaded by the classloader. So when we use the Aparapi library in 'agent mode' it caches the bytes for all loaded classes (yes, we could filter by name) and puts this information in a common data structure (it should be a map but is a linked list at present).
+
+By adding a new OpenCLJNI.getBytes(String) JNI method, Aparapi can now retrieve the bytes for any loaded class out of this cache.
+
+So this, combined with our ability to parse classes which don’t have line number information, should really enable Aparapi to be used with Scala/JRuby/Groovy or other dynamic scripting languages which create classes on the fly.
diff --git a/doc/MultipleEntryPointSupportProposal.md b/doc/MultipleEntryPointSupportProposal.md
new file mode 100644
index 0000000000000000000000000000000000000000..bf2d70563fcc52aea2db5e6d8008db376168e22e
--- /dev/null
+++ b/doc/MultipleEntryPointSupportProposal.md
@@ -0,0 +1,377 @@
+#MultipleEntryPointSupportProposal
+*How to extend Aparapi to allow multiple entrypoints for kernels Updated Jul 30, 2012 by frost.g...@gmail.com*
+
+##The Current Single Entrypoint World
+
+At present Aparapi allows us to dispatch execution to a single entry point in a Kernel. Essentially, for each Kernel only the overridden Kernel.run() method can be used to initiate execution on the GPU.
+
+Our canonical example is the 'Squarer' Kernel, which writes the square of each element of an input array into an output array.
+
+    Kernel squarer = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId(0);
+          out[id] = in[id] * in[id];
+       }
+    };
+
+If we wanted a vector addition Kernel we would have to create a whole new Kernel.
+
+    Kernel adder = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId(0);
+          out[id] = inA[id] + inB[id];
+       }
+    };
+
+For us to square and then add a constant we would have to invoke two kernels, or of course create a single SquarerAdder kernel.
+
+See this page EmulatingMultipleEntrypointsUsingCurrentAPI for ideas on how to emulate having multiple methods, by passing data to a single run() method.
+
+##Why can't Aparapi just allow 'arbitrary' methods?
+
+Ideally we would just expose a more natural API, one which allows us to provide specific methods for each arithmetic operation.
+
+Essentially
+
+    class VectorKernel extends Kernel{
+       public void add();
+       public void sub();
+       public void sqr();
+       public void sqrt();
+    }
+
+Unfortunately this is hard to implement using Aparapi. There are two distinct problems, both at runtime.
+
+1. How will Aparapi know which of the available methods we want to execute when we call Kernel.execute(range)?
+1. On first execution how does Aparapi determine which methods might be entrypoints and therefore need to be converted to OpenCL?
+
+The first problem can be solved by extending Kernel.execute() to accept a method name
+
+    kernel.execute(SIZE, "add");
+
+This is the obvious solution, but it really causes maintenance issues in that it trades compile-time reporting for runtime errors. If a developer mistypes the name of the method:-
+
+    kernel.execute(SIZE, "sadd"); // there is no such method
+
+The code will compile perfectly; only at runtime will we detect that there is no such method.
+##An aside
+
+Maybe the new Java 8 method reference feature might help here. In the paper below Brian Goetz talks about a double-colon syntax (Class::Method) for directly referencing a method, which is presumably checked at compile time.
+
+So presumably
+
+    kernel.execute(SIZE, VectorKernel::add);
+
+Would compile just fine, whereas
+
+    kernel.execute(SIZE, VectorKernel::sadd);
+
+Would yield a compile time error.
+
+See Brian Goetz's excellent Lambda documentation
+##Back from the aside
+
+The second problem (knowing which methods need to be converted to OpenCL) can probably be solved using an Annotation.
+
+    class VectorKernel extends Kernel{
+       @EntryPoint public void add();
+       @EntryPoint public void sub();
+       @EntryPoint public void sqr();
+       @EntryPoint public void sqrt();
+       public void nonOpenCLMethod();
+    }
+
+Here the @EntryPoint annotation allows the Aparapi runtime to determine which methods need to be exposed.
+#My Extension Proposal
+
+Here is my proposal. Not only does it allow us to reference multiple entrypoints, but I think it actually improves the single entrypoint API, albeit at the cost of being more verbose.
+##The developer must provide an API interface
+
+First I propose that we should ask the developer to provide an interface for all methods that we wish to execute on the GPU (or convert to OpenCL).
+
+    interface VectorAPI extends AparapiAPI {
+       public void add(Range range);
+       public void sub(Range range);
+       public void sqrt(Range range);
+       public void sqr(Range range);
+    }
+
+Note that each API method takes a Range; this will make more sense in a moment.
+##The developer provides a bound implementation
+
+Aparapi should provide a mechanism for mapping the proposed API to its implementation.
+
+Note the weasel words here; this is not a conventional implementation of an interface. We will use an annotation (@Implements(Class class)) to provide the binding.
+
+    @Implements(VectorAPI.class) class Vector extends Kernel {
+       public void add(RangeId rangeId){/*implementation here */}
+       public void sub(RangeId rangeId){/*implementation here */}
+       public void sqrt(RangeId rangeId){/*implementation here */}
+       public void sqr(RangeId rangeId){/*implementation here */}
+       public void nonOpenCLMethod(){/* not converted to OpenCL */}
+    }
+
+##Why can't the implementation just implement the interface?
+
+This would be ideal. Sadly we need to intercept a call to, say, VectorAPI.add(Range) and dispatch to the corresponding Vector.add(RangeId) implementation. If you look at the signatures, the interface accepts a Range as its arg (the range over which we intend to execute) whereas the implementation (whether called by JTP threads or by the GPU OpenCL dispatch) receives a RangeId (containing the unique globalId, localId, etc. fields). At the very end of this page I show a strawman sequential-loop implementation.
+##So how do we get an implementation of VectorAPI
+
+We instantiate our Kernel by creating an instance using new. We then ask this instance to create an API instance. Presumably some java.lang.reflect.Proxy trickery will create an implementation of the interface, backed by the Java implementation.
+
+So execution would look something like.
+
+    Vector kernel = new Vector();
+    VectorAPI kernelApi = kernel.api();
+    Range range = Range.create(SIZE);
+    kernelApi.add(range);
+
+So the Vector instance is a pure Java implementation. The extracted API is the bridge to the GPU.
+
+Of course then we can also execute using an inline call through api()
+
+    Vector kernel = new Vector();
+    Range range = Range.create(SIZE);
+    kernel.api().add(range);
+    kernel.api().sqrt(range);
+
+or even expose api as public final fields
+
+    Vector kernel = new Vector();
+    Range range = Range.create(SIZE);
+    kernel.api.add(range);
+    kernel.api.sqrt(range);
+
+##How would our canonical Squarer example look
+
+    interface SquarerAPI extends AparapiAPI{
+       public void square(Range range);
+    }
+
+    @Implements(SquarerAPI.class) class Squarer extends Kernel{
+       int in[];
+       int square[];
+       public void square(RangeId rangeId){
+          square[rangeId.gid] = in[rangeId.gid]*in[rangeId.gid];
+       }
+    }
+
+Then we execute using
+
+    Squarer squarer = new Squarer();
+    // fill squarer.in[SIZE]
+    // create squarer.values[SIZE];
+
+    squarer.api().square(Range.create(SIZE));
+
+#Extending this proposal to allow argument passing
+
+Note that we have effectively replaced the use of the 'abstract' squarer.execute(range) with the more concrete squarer.api().square(range).
+
+Now I would like to propose that we take one more step by allowing us to pass arguments to our methods.
+
+Normally Aparapi captures buffer and field accesses to create the args that it passes to the generated OpenCL code. In our canonical squarer example the in[] and square[] buffers are captured from the bytecode and passed (behind the scenes) to the OpenCL.
+
+* **TODO: Add generated OpenCL code to show what this looks like.**
+
+However, by exposing the actual method we want to execute, we could also allow the API to accept parameters.
+
+So our squarer example would go from
+
+    interface SquarerAPI extends AparapiAPI{
+       public void square(Range range);
+    }
+
+    @Implements(SquarerAPI.class) class Squarer extends Kernel{
+       int in[];
+       int square[];
+       public void square(RangeId rangeId){
+          square[rangeId.gid] = in[rangeId.gid]*in[rangeId.gid];
+       }
+    }
+
+
+    Squarer squarer = new Squarer();
+    // fill squarer.in[SIZE]
+    // create squarer.values[SIZE];
+
+    squarer.api().square(Range.create(SIZE));
+
+to
+
+    interface SquarerAPI extends AparapiAPI{
+       public void square(Range range, int[] in, int[] square);
+    }
+
+    @Implements(SquarerAPI.class) class Squarer extends Kernel{
+       public void square(RangeId rangeId, int[] in, int[] square){
+          square[rangeId.gid] = in[rangeId.gid]*in[rangeId.gid];
+       }
+    }
+
+
+    Squarer squarer = new Squarer();
+    int[] in =     // create and fill the input array
+    int[] square = // create the output array
+
+    squarer.api().square(Range.create(SIZE), in, square);
+
+I think that this makes Aparapi look more conventional. It also allows us to support overloading for the first time.
+
+    interface SquarerAPI extends AparapiAPI{
+       public void square(Range range, int[] in, int[] square);
+       public void square(Range range, float[] in, float[] square);
+    }
+
+    @Implements(SquarerAPI.class) class Squarer extends Kernel{
+       public void square(RangeId rangeId, int[] in, int[] square){
+          square[rangeId.gid] = in[rangeId.gid]*in[rangeId.gid];
+       }
+       public void square(RangeId rangeId, float[] in, float[] square){
+          square[rangeId.gid] = in[rangeId.gid]*in[rangeId.gid];
+       }
+    }
+
+
+    Squarer squarer = new Squarer();
+    int[] in =      // create and fill the int input array
+    int[] square =  // create the int output array
+
+    squarer.api().square(Range.create(SIZE), in, square);
+    float[] inf =     // create and fill the float input array
+    float[] squaref = // create the float output array
+
+    squarer.api().square(Range.create(SIZE), inf, squaref);
+
+---
+
+test harness
+
+    import java.lang.reflect.InvocationHandler;
+    import java.lang.reflect.Method;
+    import java.lang.reflect.Proxy;
+
+
+    public class Ideal{
+
+       public static class OpenCLInvocationHandler<T> implements InvocationHandler {
+           Object instance;
+           OpenCLInvocationHandler(Object _instance){
+              instance = _instance;
+           }
+          @Override public Object invoke(Object interfaceThis, Method interfaceMethod, Object[] interfaceArgs) throws Throwable {
+             Class clazz = instance.getClass();
+
+             Class[] argTypes =  interfaceMethod.getParameterTypes();
+             argTypes[0]=RangeId.class;
+             Method method = clazz.getDeclaredMethod(interfaceMethod.getName(), argTypes);
+
+
+             if (method == null){
+                System.out.println("can't find method");
+             }else{
+                RangeId rangeId = new RangeId((Range)interfaceArgs[0]);
+                interfaceArgs[0]=rangeId;
+                for (rangeId.wgid = 0; rangeId.wgid <rangeId.r.width; rangeId.wgid++){
+                    method.invoke(instance, interfaceArgs);
+                }
+             }
+
+             return null;
+          }
+       }
+
+       static class Range{
+          int width;
+          Range(int _width) {
+             width = _width;
+          }
+       }
+
+       static class Range2D extends Range{
+          int height;
+
+          Range2D(int _width, int _height) {
+             super(_width);
+             height = _height;
+          }
+       }
+
+       static class Range1DId<T extends Range>{
+          Range1DId(T _r){
+             r = _r;
+          }
+          T r;
+
+          int wgid, wlid, wgsize, wlsize, wgroup;
+       }
+
+       static class RangeId  extends Range1DId<Range>{
+          RangeId(Range r){
+             super(r);
+          }
+       }
+
+       static class Range2DId extends Range1DId<Range2D>{
+          Range2DId(Range2D r){
+             super(r);
+          }
+
+          int hgid, hlid, hgsize, hlsize, hgroup;
+       }
+
+
+
+
+
+       static <T> T create(Object _instance, Class<T> _interface) {
+          OpenCLInvocationHandler<T> invocationHandler = new OpenCLInvocationHandler<T>(_instance);
+          T instance = (T) Proxy.newProxyInstance(Ideal.class.getClassLoader(), new Class[] {
+                _interface,
+
+          }, invocationHandler);
+          return (instance);
+
+       }
+
+
+
+       public static class Squarer{
+          interface API {
+             public API foo(Range range, int[] in, int[] out);
+             public Squarer dispatch();
+
+          }
+
+          public API foo(RangeId rangeId, int[] in, int[] out) {
+             out[rangeId.wgid] = in[rangeId.wgid]*in[rangeId.wgid];
+             return(null);
+          }
+       }
+
+       /**
+        * @param args
+        */
+       public static void main(String[] args) {
+
+          Squarer.API squarer = create(new Squarer(), Squarer.API.class);
+          int[] in = new int[] {
+                1,
+                2,
+                3,
+                4,
+                5,
+                6
+          };
+          int[] out = new int[in.length];
+          Range range = new Range(in.length);
+
+          squarer.foo(range, in, out);
+
+          for (int s:out){
+             System.out.println(s);
+          }
+
+       }
+
+    }
+
diff --git a/doc/NewFeatures.md b/doc/NewFeatures.md
new file mode 100644
index 0000000000000000000000000000000000000000..4bcb8f5983bc4063edee4e84160f39c9f8b6be75
--- /dev/null
+++ b/doc/NewFeatures.md
@@ -0,0 +1,227 @@
+#NewFeatures
+*New Features added to this open source release of Aparapi. Updated Sep 14, 2011 by frost.g...@gmail.com*
+##New Features
+Aparapi has two new, especially useful features:
+
+* Explicit Buffer Management for minimizing buffer transfers
+* Kernel access to objects held in arrays
+###Minimizing Buffer Transfers
+####Explicit Buffer Management
+Aparapi is designed to shield the Java developer from dealing with the underlying movement of data between the OpenCL host and device. Aparapi can analyze a kernel's run() method and run-reachable methods to determine which primitive arrays to transfer to the GPU prior to execution, and which arrays to transfer back when the GPU execution is complete.
+
+Generally this strategy is both clean and performant. Aparapi will attempt to just do the right thing.
+
+However, occasionally the following code pattern is seen.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    done[0]=0;
+    while (done[0] == 0){
+       kernel.execute(HUGE);
+    }
+This is a common pattern in reduce stages of map-reduce type problems. Essentially the developer wants to keep executing a kernel until some condition is met. For example, this may be seen in bitonic sort implementations and various financial applications.
+
+From the code it can be seen that the kernel reads and writes hugeArray[] array and uses the single item done[] array to indicate some form of convergence or completion.
+
+Unfortunately, by default Aparapi will transfer done[] and hugeArray[] to and from the GPU device each time Kernel.execute(HUGE) is executed.
+
+To demonstrate which buffers are being transferred, these copies are shown as comments in the following version of the code.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    done[0]=0;
+    while (done[0] == 0){
+       // Send done[] to GPU
+       // Send hugeArray[] to GPU
+       kernel.execute(HUGE);
+       // Fetch done[] from GPU
+       // Fetch hugeArray[] from GPU
+    }
+Further analysis of the code reveals that hugeArray[] is not accessed by the loop containing the kernel execution, so Aparapi is performing 999 unnecessary transfers to the device and 999 unnecessary transfers back. Only two transfers of hugeArray[] are needed; one to move the initial data to the GPU and one to move it back after the loop terminates.
+
+The done[] array is accessed during each iteration (although never written to within the loop), so it does need to be transferred back for each return from Kernel.execute(); however, it only needs to be sent once.
+
+Clearly it is better to avoid unnecessary transfers, especially of large buffers like hugeArray[].
+
+A new Aparapi feature allows the developer to control these situations and explicitly manage transfers.
+
+To use this feature, first set the mode to explicit using the kernel.setExplicit(true) method, and then request transfers using either kernel.put() or kernel.get(). Kernel.put() forces a transfer to the GPU device and Kernel.get() transfers data back.
+
+The following code illustrates the use of these new explicit buffer management APIs.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    kernel.setExplicit(true);
+    done[0]=0;
+    kernel.put(done);
+    kernel.put(hugeArray);
+    while (done[0] == 0){
+       kernel.execute(HUGE);
+       kernel.get(done);
+    }
+    kernel.get(hugeArray);
+Note that marking a kernel as explicit and failing to request the appropriate transfer is a programmer error.
+
+We deliberately made Kernel.put(…), Kernel.get(…) and Kernel.execute(range) return an instance of the executing kernel to allow these calls to be chained. Some may find this fluent-style API more expressive.
+
+    final int[] hugeArray = new int[HUGE];
+    final int[] done = new int[]{0};
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray and writes to done[0] when complete
+    };
+    kernel.setExplicit(true);
+    done[0]=0;
+    kernel.put(done).put(hugeArray);    // chained puts
+    while (done[0] == 0){
+       kernel.execute(HUGE).get(done);  // chained execute and get
+    }
+    kernel.get(hugeArray);
+####An alternate approach for loops containing a single kernel.execute(range) call.
+One variant of code which would normally suggest the use of Explicit Buffer Management can be handled differently. For cases where Kernel.execute(range) is the sole statement inside a loop and where the iteration count is known prior to the first iteration we offer an alternate (hopefully more elegant) way of minimizing buffer transfers.
+
+So for cases like:-
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+       ... // reads/writes hugeArray
+    };
+
+    for (int pass=0; pass<1000; pass++){
+       kernel.execute(HUGE);
+    }
+The developer can request that Aparapi perform the outer loop rather than coding the loop. This is achieved explicitly by passing the iteration count as the second argument to Kernel.execute(range, iterations).
+
+Now any form of code that looks like :-
+
+    int range=1024;
+    int loopCount=64;
+    for (int passId=0; passId<loopCount; passId++){
+       kernel.execute(range);
+    }
+Can be replaced with
+
+    int range=1024;
+    int loopCount=64;
+
+    kernel.execute(range, loopCount);
+Not only does this make the code more compact and avoid the use of explicit buffer management APIs, it also gives Aparapi visibility of the complete loop so that it can minimize the number of transfers. Aparapi will transfer buffers to the GPU only once and transfer them back only once, resulting in improved performance.
+
+Sometimes kernel code using this loop pattern needs to track the current iteration number as the code passes through the outer loop. Previously we would have been forced to use explicit buffer management to allow the kernel to do this.
+
+The code for this would have looked something like
+
+    int range=1024;
+    int loopCount=64;
+    final int[] hugeArray = new int[HUGE];
+    final int[] passId = new int[]{0};
+    Kernel kernel= new Kernel(){
+       @Override public void run(){
+          int id=getGlobalId();
+          if (passId[0] == 0){
+              // perform some initialization!
+          }
+          ... // reads/writes hugeArray
+       }
+    };
+    kernel.setExplicit(true);
+    kernel.put(hugeArray);
+    for (passId[0]=0; passId[0]<loopCount; passId[0]++){
+
+       kernel.put(passId).execute(range);
+    }
+In the current version of Aparapi we added Kernel.getPassId() to allow a Kernel to determine the current ‘pass’ through the outer loop without having to use explicit buffer management.
+
+So the previous code can now be written without any explicit buffer management APIs:-
+
+    final int[] hugeArray = new int[HUGE];
+    Kernel kernel= new Kernel(){
+       @Override public void run(){
+          int id=getGlobalId();
+          int pass = getPassId();
+          if (pass == 0){
+              // perform some initialization!
+          }
+          ... // reads/writes hugeArray
+       }
+    };
+
+    kernel.execute(HUGE, 1000);
+One common use for Kernel.getPassId() is to avoid flipping buffers in the outer loop.
+
+It is common for kernels to process data from one buffer to another, and in the next invocation process the data back the other way. Now these kernels can use the passId (odd or even) to determine the direction of data transfer.
+
+    final int[] arr1 = new int[HUGE];
+    final int[] arr2 = new int[HUGE];
+    Kernel kernel= new Kernel(){
+       int f(int v){ … }
+
+       @Override public void run(){
+          int id=getGlobalId();
+          int pass = getPassId();
+          if (pass%2==0){
+              arr1[id] = f(arr2[id]);
+          }else{
+              arr2[id] = f(arr1[id]);
+
+          }
+       }
+    };
+
+    kernel.execute(HUGE, 1000);
+
+###Allow kernels to access simple arrays of objects
+Aparapi needs to create OpenCL from the bytecode that it sees. Generally OpenCL constrains us to using parallel primitive arrays (OpenCL allows structs, but Java and OpenCL do not have comparable memory layouts for these structures). Therefore, you will generally need to refactor your code from a classic object-oriented form to use primitive arrays.
+
+This incompatibility between data-parallel and object-oriented code patterns might discourage use of Aparapi, so Aparapi includes limited support for arrays of simple Objects. Future versions may well extend this functionality and address performance loss.
+
+Consider the NBody example.
+
+Typically, a Java developer writing NBody would probably not separate the x, y and z ordinates into parallel arrays of floats, as was required in the previous (alpha) version of Aparapi. Instead, a Java developer would probably create a Body class to hold the state of each body and possibly a Universe class (a container of Body instances) responsible for positioning and possibly displaying the bodies.
+
+    class Body{
+      float x,y,z;
+      float getX(){return x;}
+      void setX(float _x){ x = _x;}
+      float getY(){return y;}
+      void setY(float _y){ y = _y;}
+      float getZ(){return z;}
+      void setZ(float _z){ z = _z;}
+
+
+      // other data related to Body unused by positioning calculations
+    }
+
+    class Universe{
+         final Body[] bodies;
+         public Universe(final Body[] _bodies){
+            bodies = _bodies;
+         }
+         void adjustPositions(){
+             for (Body outer:bodies){
+                for (Body inner:bodies){
+                   // adjust outer position to reflect the effect of inner
+                   // using inner and outer getters and setters for x, y and z
+                }
+             }
+         }
+         void display(){
+            for (Body body:bodies){
+               // draw body based on x, y and z using Body getters
+            }
+         }
+    }
+From the above code we see that the Universe.adjustPositions() method is compute intensive and an ideal candidate for refactoring to use Aparapi. The current version of Aparapi is able to deal with simple arrays of objects like this.
+
+Now when Aparapi encounters an array of objects, and the accesses to these objects are constrained to simple getters and setters, Aparapi will automatically extract the values of the accessed fields into a data-parallel form, execute the kernel, and then replace the results back in the original objects in the array. This happens on each call to Kernel.execute() and is fairly costly (from a performance point of view); however, for embarrassingly parallel code (such as NBody) we can still show considerable performance gains over a standard Java Thread Pool.
+
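+As a minimal sketch (illustrative only, simplified from the NBody discussion above), a kernel constrained to simple getter/setter access on the Body[] array might look like:
+
+    final Body[] bodies = ...; // the same array held by Universe
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int id = getGlobalId();
+          // only simple getter/setter access on the array elements is recognised
+          bodies[id].setX(bodies[id].getX() + 0.1f);
+       }
+    };
+    kernel.execute(bodies.length);
+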
+Attribution
\ No newline at end of file
diff --git a/doc/NewOpenCLBinding.md b/doc/NewOpenCLBinding.md
new file mode 100644
index 0000000000000000000000000000000000000000..32e5f4347b94d3e6b300543873850158350bdfde
--- /dev/null
+++ b/doc/NewOpenCLBinding.md
@@ -0,0 +1,51 @@
+#NewOpenCLBinding  
+*How to use new OpenCL binding mechanism. Updated Mar 6, 2012 by frost.g...@gmail.com*
+As a step towards the extension mechanism I needed a way to easily bind OpenCL to an interface.
+
+Here is what I have come up with. We will use the 'Square' example.
+
+You first define an interface with OpenCL annotations..
+
+    interface Squarer extends OpenCL<Squarer>{
+       @Kernel("{\n"//
+             + "  const size_t id = get_global_id(0);\n"//
+             + "  out[id] = in[id]*in[id];\n"//
+             + "}\n")//
+       public Squarer square(//
+             Range _range,//
+             @GlobalReadOnly("in") float[] in,//
+             @GlobalWriteOnly("out") float[] out);
+    }
+
+This describes the API we wish to bind to a set of kernel entrypoints (here we only have one, but we could have many). Then you 'realize' the interface by asking a device to create an implementation of the interface. Device is a new Aparapi class which represents a GPU or CPU OpenCL device. So here we are asking for the first (default) GPU device to realize the interface.
+
+    Squarer squarer = Device.firstGPU(Squarer.class);
+Now you can call the implementation directly with a Range.
+
+    squarer.square(Range.create(in.length), in, out);
+I think that we will have the easiest OpenCL binding out there...
+
+Following some conversations/suggestions online (http://a-hackers-craic.blogspot.com/2012/03/aparapi.html) we could also offer the ability to provide the OpenCL source from a file/URL source using interface-level annotations.
+
+So we could allow.
+
+    @OpenCL.Resource("squarer.cl")
+    interface Squarer extends OpenCL<Squarer>{
+       public Squarer square(//
+             Range _range,//
+             @GlobalReadOnly("in") float[] in,//
+             @GlobalWriteOnly("out") float[] out);
+    }
+Or if the text is on-hand at compile time in a single constant string
+
+    @OpenCL.Source("... opencl text here")
+    interface Squarer extends OpenCL<Squarer>{
+       public Squarer square(//
+             Range _range,//
+             @GlobalReadOnly("in") float[] in,//
+             @GlobalWriteOnly("out") float[] out);
+    }
+Finally, to allow for the creation of dynamic OpenCL (good for FFTs of various radii).
+
+    String openclSource = ...;
+    Squarer squarer = Device.firstGPU(Squarer.class, openclSource);
diff --git a/doc/PossibleAparapiLambdaSyntaxOptions.md b/doc/PossibleAparapiLambdaSyntaxOptions.md
new file mode 100644
index 0000000000000000000000000000000000000000..8bfcf5f9cba7849ec5f4946bfc2a634c6e5089b5
--- /dev/null
+++ b/doc/PossibleAparapiLambdaSyntaxOptions.md
@@ -0,0 +1,96 @@
+#PossibleAparapiLambdaSyntaxOptions
+*syntax suggestions for HSA enabled Aparapi*
+
+#Introduction
+Now that Java 8 is nearly upon us and the HSA enabled Aparapi 'lambda' branch is usable (though in no way complete), I figured we could use this page to discuss the 'programming model' we might prefer for Aparapi, and contrast it with the new Java 8 lambda based Stream API.
+
+##Converting between Aparapi HSA + Java 8 enabled Aparapi
+Our **hello world** app has always been the 'vector add'. In classic Aparapi we could transform
+
+    final float inA[] = .... // get a float array from somewhere
+    final float inB[] = .... // get a float array from somewhere
+                         // assume (inA.length==inB.length)
+    final float result[] = new float[inA.length];
+
+    for (int i=0; i<result.length; i++){
+        result[i]=inA[i]+inB[i];
+    }
+to
+
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int i= getGlobalId();
+          result[i]=inA[i]+inB[i];
+       }
+    };
+    Range range = Range.create(result.length);
+    kernel.execute(range);
+For the lambda aparapi branch we can currently use
+
+    Device.hsa().forEach(result.length, i-> result[i]=inA[i]+inB[i]);
+Note that the closest Java 8 construct is
+
+    IntStream.range(0, result.length).parallel().forEach(i-> result[i]=inA[i]+inB[i]);
+Aparapi and Java 8 stream API's both use IntConsumer as the lambda type. So you can reuse the lambda.
+
+    IntConsumer lambda = i-> result[i]=inA[i]+inB[i];
+
+    IntStream.range(0, result.length).parallel().forEach(lambda);
+    Device.hsa().forEach(result.length, lambda);
+Exposing the 'deviceness' of this was a conscious decision. We may also hide it completely.
+
+    IntConsumer lambda = i-> result[i]=inA[i]+inB[i];
+
+    IntStream.range(0, result.length).parallel().forEach(lambda);
+    Aparapi.forEach(result.length, lambda);
+I am toying with providing an API which maps more closely to the Stream API from Java 8.
+
+Maybe
+
+    IntStream.range(0, result.length).parallel().forEach(lambda);
+    Aparapi.range(0, result.length).parallel().forEach(lambda);
+This way users can more readily swap between the two.
+
+For collections/arrays in Aparapi we can also offer
+
+    T[] arr = // get an array of T from somewhere
+    ArrayList<T> list = // get an array backed list of T from somewhere
+
+    Aparapi.range(arr).forEach(t -> /* do something with each T */);
+We can create special cases. Say for mutating images
+
+    BufferedImage in, out;
+    Aparapi.forEachPixel(in, out, (int[] rgb) -> rgb[0] = 0 );
+We may also need select operations for associative operations
+
+    class Person{
+        int age;
+        String first;
+        String last;
+    };
+
+    Aparapi.selectOne(people, (p1,p2)-> p1.age>p2.age?p1:p2 );
+##A case for map reduce
+A mapper maps from one type to another, possibly by extracting state. Here is a mapper which maps each String in an array of Strings to its length.
+
+As if the mapper was
+
+    interface mapToInt<T>{ int map(T v); }
+Here it is in action.
+
+    Aparapi.range(strings).map(s -> s.length())...
+Now the result is a stream of int's which can be 'reduced' by a reduction lambda.
+
+In this case the reduction reduces two int's to one, by choosing the max of k and v. All reductions must be commutative style operations (max, min, add) where the order of execution is not important.
+
+    int lengthOfLongestString = Aparapi.range(strings).map(s -> s.length()).reduce((k,v)-> k>v?k:v);
+Here is a sum reduction.
+
+    int sumOfLengths = Aparapi.range(strings).map(s -> s.length()).reduce((k,v)-> k+v);
+Some of these may be common enough that we offer direct functionality.
+
+    int sumOfLengths = Aparapi.range(strings).map(s -> s.length()).sum();
+    int maxOfLengths = Aparapi.range(strings).map(s -> s.length()).max();
+    int minOfLengths = Aparapi.range(strings).map(s -> s.length()).min();
+    String string = Aparapi.range(strings).map(s -> s.length()).select((k,v)-> k>v);
+This last one needs some explaining. We map String to int then select the String whose length is the greatest.
\ No newline at end of file
diff --git a/doc/PrivateMemorySpace.md b/doc/PrivateMemorySpace.md
new file mode 100644
index 0000000000000000000000000000000000000000..51fee39e43f50ec6c2c93cbde20315c3345a041a
--- /dev/null
+++ b/doc/PrivateMemorySpace.md
@@ -0,0 +1,34 @@
+PrivateMemorySpace
+==================
+
+*Using `__private` memory space in Aparapi kernels. Phase-Implementation Updated Sep 14, 2014 by barneydp...@gmail.com*
+
+## Introduction
+The private memory space identifier (just "private" is also recognised) can be applied to struct fields in order to indicate that the data is not shared with/accessible to other kernel instances. Whilst this is the default for non-array data, it must be explicitly applied to array fields in order to make them private. Aparapi now supports arrays in the private memory space.
+
+The private memory space is generally only suitable for smallish arrays, but is required for certain algorithms, e.g. for those which must mutate (for example, sort or partially sort) an exclusive copy of an array/subarray.
+
+##Details
+In Aparapi there are two mechanisms available to mark a Kernel class member as belonging to the private memory space when mapped to OpenCL code (matching the equivalent functionality for marking items as belonging to the local memory space). Either the field can be named with a suffix plus buffer size, for example
+
+    protected short[] myBuffer_$private$32 = new short[32];
+or using the Annotation Kernel.PrivateMemorySpace, for example
+
+    protected @PrivateMemorySpace(32) short[] myBuffer = new short[32];
+The latter should be used in preference to the former.
+
+Note that OpenCL requires that the size of a private array be fixed at compile time for any kernel. Thus it is not possible for a single Kernel subclass to support private buffers of varying size. Unfortunately this may entail creating multiple subclasses with varying buffer sizes in order to most efficiently support varying private buffer sizes.
+
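+As an illustrative sketch (not taken from the samples), a kernel needing different window sizes might be specialised by subclassing, with each subclass fixing its own private buffer size:
+
+    class MedianKernel16 extends Kernel{
+       @PrivateMemorySpace(16) protected short[] window = new short[16];
+       @Override public void run(){ /* ... use window[] ... */ }
+    }
+
+    class MedianKernel64 extends Kernel{
+       @PrivateMemorySpace(64) protected short[] window = new short[64];
+       @Override public void run(){ /* ... use window[] ... */ }
+    }
+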
+Of course, a single Kernel class can be created which has a private buffer large enough for all use cases, though this may be suboptimal if only a small fraction of the maximum buffer size is commonly required.
+
+Because private buffers are unshared, they require much more of a GPU's memory than a local or global buffer of the same size, and should therefore be used sparingly and kept as small as possible, as overuse of large private arrays might cause GPU execution to fail on lower-end graphics cards.
+
+However, private memory space is the fastest of all OpenCL's memory spaces, so in some limited cases it may be used to increase execution speed even when the kernel does not need to modify the array and a shared (local or global) array would suffice - for example to provide a smallish lookup table to replace an expensive function call.
+
+Without modification, an Aparapi kernel which uses private buffers may fail to work when invoked in Java Threadpool (JTP) mode, because the buffer will be shared across multiple threads. However a simple mechanism exists which allows such buffers to be used safely in JTP execution mode.
+
+The Kernel.NoCL annotation exists to allow specialised code to be executed when running in Java (or JTP) mode which is not invoked when running on the GPU. A NoCL method can be inserted at the beginning of a Kernel's run() method which sets the private array to a value obtained from a static ThreadLocal<foo[]>, where foo is the primitive type of the array in question. This will have no effect upon OpenCL execution, but will allow threadsafe execution when running in Java.
+
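+A rough sketch of that pattern (illustrative only; the median sample referenced below shows a complete working version):
+
+    class MyKernel extends Kernel{
+       private static final ThreadLocal<short[]> threadLocalWindow = new ThreadLocal<short[]>(){
+          @Override protected short[] initialValue(){ return new short[32]; }
+       };
+
+       @PrivateMemorySpace(32) protected short[] window = new short[32];
+
+       @NoCL private void usePerThreadBuffer(){
+          window = threadLocalWindow.get(); // only executed in Java/JTP mode
+       }
+
+       @Override public void run(){
+          usePerThreadBuffer();
+          // ... use window[] destructively; in JTP mode each thread now has its own copy ...
+       }
+    }
+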
+In the project samples, there is a package com.amd.aparapi.sample.median which gives an example of a median image filter which uses a private array of pixel data to apply a destructive median algorithm to a "window" of local pixels. This sample also demonstrates how to use the ThreadLocal trick to allow correct behaviour when running in JTP execution mode.
+
+[http://code.google.com/p/aparapi/source/browse/trunk/samples/median/src/com/amd/aparapi/sample/median/MedianDemo.java](http://code.google.com/p/aparapi/source/browse/trunk/samples/median/src/com/amd/aparapi/sample/median/MedianDemo.java)
\ No newline at end of file
diff --git a/doc/ProfilingKernelExecution.md b/doc/ProfilingKernelExecution.md
new file mode 100644
index 0000000000000000000000000000000000000000..58cc58844f5d51f2cbf4ce2e38a78d16c05b543d
--- /dev/null
+++ b/doc/ProfilingKernelExecution.md
@@ -0,0 +1,53 @@
+#ProfilingKernelExecution
+*Using Aparapi's built in profiling APIs Updated May 7, 2013 by frost.g...@gmail.com*
+
+If you want to extract OpenCL performance info from a kernel at runtime you need to set the property :-
+
+    -Dcom.amd.aparapi.enableProfiling=true
+
+Your application can then call kernel.getProfileInfo() after a successful call to kernel.execute(range) to extract a List<ProfileInfo>.
+
+Each ProfileInfo holds timing information for buffer writes, kernel executions and buffer reads.
+
+The following code will print a simple table of profile information
+
+    List<ProfileInfo> profileInfo = k.getProfileInfo();
+    for (final ProfileInfo p : profileInfo) {
+       System.out.print(" " + p.getType() + " " + p.getLabel() + " " + (p.getStart() / 1000) + " .. "
+           + (p.getEnd() / 1000) + " " + ((p.getEnd() - p.getStart()) / 1000) + "us");
+       System.out.println();
+    }
+
+Here is an example implementation
+
+            final float result[] = new float[2048*2048];
+            Kernel k = new Kernel(){
+               public void run(){
+                  final int gid=getGlobalId();
+                  result[gid] =0f;
+               }
+            };
+            k.execute(result.length);
+            List<ProfileInfo> profileInfo = k.getProfileInfo();
+
+            for (final ProfileInfo p : profileInfo) {
+               System.out.print(" " + p.getType() + " " + p.getLabel() + " " + (p.getStart() / 1000) + " .. "
+                  + (p.getEnd() / 1000) + " " + ((p.getEnd() - p.getStart()) / 1000) + "us");
+               System.out.println();
+            }
+            k.dispose();
+And here is the tabular output from
+
+        java
+           -Djava.library.path=${APARAPI_HOME}
+           -Dcom.amd.aparapi.enableProfiling=true
+           -cp ${APARAPI_HOME}:.
+           MyClass
+
+      W val$result 69500 .. 72694 3194us
+      X exec()     72694 .. 72835  141us
+      R val$result 75327 .. 78225 2898us
+
+The table shows that the transfer of the 'result' buffer to the device ('W') took 3194 us (microseconds), the execution ('X') of the kernel took 141 us and the read ('R') of the resulting buffer took 2898 us.
\ No newline at end of file
diff --git a/doc/ProfilingKernelsFromEclipse.md b/doc/ProfilingKernelsFromEclipse.md
new file mode 100644
index 0000000000000000000000000000000000000000..c1edfc9ebf2cd2a85f43b088dc237bc71ce1dba7
--- /dev/null
+++ b/doc/ProfilingKernelsFromEclipse.md
@@ -0,0 +1,97 @@
+#ProfilingKernelsFromEclipse
+*Profiling Kernels with AMD profiler in Eclipse (Indigo) Updated May 14, 2012 by frost.g...@gmail.com*
+
+##Profiling Kernels with AMD profiler in Eclipse (Indigo)
+
+Wayne Johnson
+
+12 May 2012
+Disclaimer: This has been tested with Eclipse (Indigo SR1) only on W7SR1.
+
+Assume your Eclipse project follows a typical Maven layout:
+
+    Project
+       src/main/java/...
+         AlgorithmImplementation.java
+       src/test/java/...
+         BenchmarkRunner.java
+         BenchmarkTest.java
+       lib/aparapi-2012-02-15/
+         aparapi jar file
+         native libraries for W7, Linux, and OSX
+         …
+       profiles/
+         [this is where the profiles and logs will be generated]
+
+1. Download and install the current AMD APP SDK
+2. Download and install Aparapi (see Wiki), making sure that the native libraries are on your build path.
+3. Create your algorithm implementation(s).
+
+        example: AlgorithmImplementations.java
+
+4. Create your performance benchmark test as a JUnit test case to exercise your implementations.
+
+        example: BenchmarkTest.java
+
+5. Test your JUnit test case inside Eclipse using BenchmarkRunner to make sure it works. The runner will be the main application for the runnable jar file you create in the next step.
+
+        This step will also automatically create the launch configuration that the export command will ask you for. Select BenchmarkRunner.java
+
+        Right-click > Run as > Java application
+
+6. Export your project as a runnable jar file.
+
+    Right-click > Export...
+      [wizard] Java > Runnable Jar File. Next.
+        Launch configuration: BenchmarkRunner [1] - Project
+        Export destination: Project\runner.jar
+        Library handling: [use default]    Finish.
+      Ok on “...repacks referenced libraries”
+      Yes on “Confirm replace” [You won’t see this dialog on the first export but will on subsequent exports]
+      Ok [ignore warning dialog]
+
+    After refreshing Project, you should see a runner.jar file at the top level.
+
+7. Create an external tool configuration to generate the performance counter profile
+
+    Run > External Tools > External Tool Configurations...
+      Name: AMD counters - Project
+      Location: C:\Program Files (x86)\AMD APP\tools\AMD APP Profiler 2.4\x64\sprofile.exe
+      Arguments:
+       -o "${project_loc}\profiles\counters.csv"
+       -w "${project_loc}"
+       "C:\Program Files\Java\jdk1.6.0_30\bin\java.exe"
+       -Djava.library.path="lib\aparapi-2012-02-15"
+       -jar "${project_loc}\runner.jar"
+
+
+    Note: The `java.library.path` indicates the relative location of the folder containing the native libraries used by Aparapi. If this is not set correctly, steps 9 and 10 below will run in JTP execution mode and the only error message you will see on the Eclipse console is that the profile was not generated. This is because nothing executed on the GPU.
+
+8. Create an external tool configuration to generate the cltrace and summary profiles.
+
+    1. Run > External Tools > External Tool Configurations...
+    2. Name: AMD cltrace - Project
+    3. Location: C:\Program Files (x86)\AMD APP\tools\AMD APP Profiler 2.4\x64\sprofile.exe
+    4. Arguments:
+
+        `-o "${project_loc}\profiles\cltrace.txt" -k all -r -O -t -T`
+
+        `-w "${project_loc}"`
+
+        `"C:\Program Files\Java\jdk1.6.0_30\bin\java.exe"`
+
+        `-Djava.library.path="lib\aparapi-2012-02-15"`
+
+        `-jar "${project_loc}\runner.jar"`
+
+
+9. Run the AMD profiler counter configuration to generate the counter profile.
+
+    Run > External Tools > AMD counters - Project
+
+
+10. Run the AMD profiler cltrace configuration to generate the cltrace and summary profiles.
+
+    Run > External Tools > AMD cltrace - Project
+    A project file for testing the above instructions can be found at http://code.google.com/p/aparapi/source/browse/trunk/wiki-collateral/ProfilingKernelsFormEclipseProject.zip
+
diff --git a/doc/README.md b/doc/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5bbba270ad043b42e9b3f6f532e5be7b200d0e0d
--- /dev/null
+++ b/doc/README.md
@@ -0,0 +1,46 @@
+APARAPI Documentation
+======================
+
+| | |
+|----------------|------|
+| [PrivateMemorySpace](PrivateMemorySpace.md)| Using `__private` memory space in Aparapi kernels. |
+| [SettingUpLinuxHSAMachineForAparapi](SettingUpLinuxHSAMachineForAparapi.md) | How to setup a Linux HSA machine for testing HSA enabled Aparapi |
+| [PossibleAparapiLambdaSyntaxOptions](PossibleAparapiLambdaSyntaxOptions.md) | Syntax suggestions for HSA enabled Aparapi |
+| [HSAEnablementOfLambdaBranchSidebar](HSAEnablementOfLambdaBranchSidebar.md)| Sidebar for HSAEnablementOfLambdaBranchAparapi|
+| [HSAEnablementOfLambdaBranch](HSAEnablementOfLambdaBranch.md)	| Adding HSA Support to Aparapi lambda branch	|
+| [UsingAparapiLambdaBranchWithHSASimulator](UsingAparapiLambdaBranchWithHSASimulator.md) | How to setup a HSA simulator on a Linux Platform |
+| [SettingUpLinuxHSAMachineForAparapiSidebar](SettingUpLinuxHSAMachineForAparapiSidebar.md) | Sidebar for SettingUpLinuxHSAMachineForAparapi |
+| HSASidebar | |
+| [AddingLambdasToAparapi](AddingLambdasToAparapi.md) | Adding Java 8 Lambda Support to Aparapi |
+| [ProfilingKernelExecution](ProfilingKernelExecution.md) | Using Aparapi's built in profiling APIs |
+| [HowToAddUML](HowToAddUML.md) | How to add plantuml docs to wiki pages |
+| [LIbraryAgentDuality](LIbraryAgentDuality.md) | Aparapi libraries can now be loaded as JVMTI agents. |
+| [FrequentlyAskedQuestions](FrequentlyAskedQuestions.md) | Frequently Asked Questions|
+| HomePageSuggestions ||
+| [ChoosingSpecificDevicesForExecution](ChoosingSpecificDevicesForExecution.md) | Using the new Device API's to choose Kernel execution on a specific device.	|
+| Gadgets | Gadgetorium|
+| [ConvertingBytecodeToOpenCL](ConvertingBytecodeToOpenCL.md) | How Aparapi converts bytecode to OpenCL |
+| [DevelopersGuideLinux](DevelopersGuideLinux.md) | Developer guide for Linux. |
+| [DevelopersGuideWindows](DevelopersGuideWindows.md) | Developers guide for Windows. |
+| [EmulatingMultipleEntrypointsUsingCurrentAPI](EmulatingMultipleEntrypointsUsingCurrentAPI.md)	| How to emulate multiple entrypoints using existing Aparapi APIs	|
+| [MultipleEntryPointSupportProposal](MultipleEntryPointSupportProposal.md) | How to extend Aparapi to allow multiple entrypoints for kernels	|
+| [ExplicitBufferHandling](ExplicitBufferHandling.md) | How to minimize buffer transfers |
+| [AparapiPatterns](AparapiPatterns.md) | Examples and code fragments to demonstrate Aparapi features. |
+| [ProfilingKernelsFromEclipse](ProfilingKernelsFromEclipse.md) | Profiling Kernels with AMD profiler in Eclipse (Indigo) |
+| [DeviceProposal](DeviceProposal.md) | How we might use the extension mechanism devices for general Kernel execution.|
+| [NewOpenCLBinding](NewOpenCLBinding.md) | How to use new OpenCL binding mechanism. |
+| [AparapiExtensionProposal](AparapiExtensionProposal.md) | A proposed aparapi extension mechanism. |
+| [UsingConstantMemory](UsingConstantMemory.md) | How to make use of constant memory in a Kernel |
+| [UsingLocalMemory](UsingLocalMemory.md) | How to make use of local memory in a Kernel |
+| [UsingMultiDimExecutionRanges](UsingMultiDimExecutionRanges.md) | How to use the new Range class (for multi-dim range access) |
+| [AccessingMultiDimNDRangeProposal](AccessingMultiDimNDRangeProposal.md) | A proposal for accessing multi-dim ND range execution |
+| LocalMemoryAndBarrierProposal | A proposal for handling local memory and barriers |
+| [AddressSpacesUsingBuffers](AddressSpacesUsingBuffers.md) | Proposal for OpenCL address space support using Java Buffers instead of arrays.	|
+| [BuildingNBody](BuildingNBody.md) | How to build the NBody example.|
+| [UnitTestGuide](UnitTestGuide.md) | Unit test guide: find out how to run JUnit tests and how to add new tests. |
+| [NewFeatures](NewFeatures.md) | New Features added to this open source release of Aparapi. |
+| [UsersGuide](UsersGuide.md) | Aparapi User's Guide. |
+| [DevelopersGuide](DevelopersGuide.md) | Aparapi developers guide. |
+| [ContributionGuide](ContributionGuide.md) | How to contribute (bug fix or features). |
+| [JavaKernelGuidelines](JavaKernelGuidelines.md) | What code can and can't be converted to OpenCL by Aparapi. |
+| [Attribution](Attribution.md) | Attribution |
diff --git a/doc/SettingUpLinuxHSAMachineForAparapi.md b/doc/SettingUpLinuxHSAMachineForAparapi.md
new file mode 100644
index 0000000000000000000000000000000000000000..edf564be4e2766edb8db14ff4a4c36538af987b1
--- /dev/null
+++ b/doc/SettingUpLinuxHSAMachineForAparapi.md
@@ -0,0 +1,209 @@
+#SettingUpLinuxHSAMachineForAparapi
+*How to setup a Linux HSA machine for testing HSA enabled Aparapi Updated May 22, 2014 by frost.g...@gmail.com*
+
+* HSA Videos
+    * [http://www.youtube.com/watch?v=5ntILiXTuhE](http://www.youtube.com/watch?v=5ntILiXTuhE)
+    * [http://www.youtube.com/watch?v=caEPq4KvTTA](http://www.youtube.com/watch?v=caEPq4KvTTA)
+* HSA Articles
+    * [http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-computing/](http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-computing/)
+* HSA Foundation
+    * [https://github.com/HSAFoundation](https://github.com/HSAFoundation)
+
+##Introduction
+Now that HSA hardware is generally available I figured it was time to describe how to set up an HSA enabled Linux platform so that it can run Aparapi.
+
+Here is a nice introduction to HSA [http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/](http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/)
+
+But for Aparapi users the main advantage is that we are no longer limited to the GPU memory for running GPU tasks. Also, because the CPU and the GPU can both see the same memory (the Java heap), Aparapi code can now access Java objects directly. This removes a number of Aparapi constraints, so more of your code can now run on the GPU.
+
+##Hardware Required
+These instructions were based on my experience setting up a platform using the following hardware.
+
+|Component	|       Suggested           |
+|---------------|---------------------------|
+|APU            | AMD A10-7850K APU [http://www.amd.com/us/products/desktop/processors/a-series/Pages/a-series-apu.aspx](http://www.amd.com/us/products/desktop/processors/a-series/Pages/a-series-apu.aspx) |
+|Motherboard	| ASUS A88X-PRO or A88XM-A [http://www.asus.com/Motherboards/A88XPRO](http://www.asus.com/Motherboards/A88XPRO) [http://www.asus.com/Motherboards/A88XMA](http://www.asus.com/Motherboards/A88XMA)|
+| Memory        | G.SKILL Ripjaws X Series 16GB (2 x 8GB) 240-Pin DDR3 SDRAM DDR3 2133|
+
+##Software Required
+We also have some software dependencies.
+
+|Component	| Suggested |
+|---------------|-----------|
+| Java 8 JDK	| [http://www.oracle.com/technetwork/java/javase/downloads/ea-jsp-142245.html](http://www.oracle.com/technetwork/java/javase/downloads/ea-jsp-142245.html) |
+| Ubuntu 13.10 64-bit edition | [http://www.ubuntu.com/download](http://www.ubuntu.com/download) |
+| Ubuntu 13.10 64-bit edition HSA enabled kernel image	| [https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD](https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD) |
+| OKRA HSA enabled runtime | [https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device](https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device) |
+
+The hope is that the list of supported hardware and software will widen, but for early adopters this is the set we have been testing with.
+
+#Setting up your System
+##Configure your BIOS to support IOMMU
+Once you have built your AMD A10-7850K APU based system you should make sure that your system is configured to use IOMMU.
+
+Remember HSA allows the GPU and CPU cores to share the same memory. IOMMU needs to be enabled for this.
+
+##For the A88X-PRO board
+For the recommended ASUS board above you will need to make sure that your BIOS is updated to version 0802. Here is a direct link to the 0802 version of the BIOS from ASUS's site as of 2/28/2014.
+
+[http://dlcdnet.asus.com/pub/ASUS/mb/SocketFM2/A88X-PRO/A88X-PRO-ASUS-0802.zip](http://dlcdnet.asus.com/pub/ASUS/mb/SocketFM2/A88X-PRO/A88X-PRO-ASUS-0802.zip)
+
+Once you have the latest BIOS you will need to enable IOMMU in the system BIOS. This is done using the "CPU Configuration" screen under "Advanced Mode" and then enabling IOMMU.
+
+##For the A88XM-A
+You will need the 1102 (or later) version of the BIOS.
+
+[http://dlcdnet.asus.com/pub/ASUS/mb/SocketFM2/A88XM-A/A88XM-A-ASUS-1102.zip](http://dlcdnet.asus.com/pub/ASUS/mb/SocketFM2/A88XM-A/A88XM-A-ASUS-1102.zip)
+
+Once you have the latest BIOS you will need to enable IOMMU in the system BIOS. This is done using the "CPU Configuration" screen under "Advanced Mode" and then enabling IOMMU.
+
+##Installing Ubuntu 13.10
+Once you have your BIOS setup you need to install Ubuntu [http://www.ubuntu.com/download](http://www.ubuntu.com/download)
+
+##Installing the HSA enabled kernel and driver
+Until all of the HSA drivers and features are available in stock Linux and have been pulled down into the Ubuntu distro we will need a special HSA enabled kernel image.
+
+An Ubuntu compatible kernel can be pulled from GitHub:
+
+    $ cd ~ # I put all of this in my home dir
+    $ sudo apt-get install git
+    $ git clone https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD.git
+Or you can pull the zip and unzip using curl if you don't have git
+
+    $ cd ~ # I put all of this in my home dir
+    $ curl -L https://github.com/HSAFoundation/Linux-HSA-Drivers-And-Images-AMD/archive/master.zip > drivers.zip
+    $ unzip drivers.zip
+This will create the following subdir on your machine
+
+    Linux-HSA-Drivers-And-Images-AMD/
+       LICENSE
+       README.md
+       ubuntu13.10-based-alpha1/
+           xorg.conf
+           linux-image-3.13.0-kfd+_3.13.0-kfd+-2_amd64.deb
+
+
+From here we can install our new image, set up the HSA KFD (the driver for HSA) and reboot to the new kernel.
+
+    $ cd ~/Linux-HSA-Drivers-And-Images-AMD
+    $ echo  "KERNEL==\"kfd\", MODE=\"0666\"" | sudo tee /etc/udev/rules.d/kfd.rules
+    $ sudo dpkg -i ubuntu13.10-based-alpha1/linux-image-3.13.0-kfd+_3.13.0-kfd+-2_amd64.deb
+    $ sudo cp ~/Linux-HSA-Drivers-And-Images-AMD/ubuntu13.10-based-alpha1/xorg.conf /etc/X11
+    $ sudo reboot
+##Installing OKRA RT
+Now we need a runtime for executing HSAIL code. We share common infrastructure with our sister OpenJDK project called Sumatra. Both Aparapi and Sumatra use OKRA to execute HSAIL code on an HSA enabled platform.
+
+We can get the latest version of OKRA (Offloadable Kernel Runtime API) from another HSA Foundation repository.
+
+    $ cd ~ # I put all of this in my home dir
+    $ git clone https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device.git
+or if you prefer curl/unzip
+
+    $ cd ~ # I put all of this in my home dir
+    $ curl -L https://github.com/HSAFoundation/Okra-Interface-to-HSA-Device/archive/master.zip > okra.zip
+    $ unzip okra.zip
+This will create the following dir structure.
+
+    Okra-Interface-to-HSA-Device/
+       README.md
+       okra/
+          README
+          dist/
+             okra.jar
+             bin/
+                libamdhsacl64.so
+                libnewhsacore64.so
+                libokra_x86_64.so
+             include/
+                common.h
+                okraContext.h
+
+          samples/
+             dist/
+               Squares
+               Squares.hsail
+             runSquares.sh
+
+OKRA offers a C API (for those that are so inclined ;) ) as well as a Java jar file which contains JNI wrappers.
+
+##Sanity check your HSA and OKRA install
+To sanity check your install you can run a small sample app (binary):
+
+    $ cd ~/Okra-Interface-to-HSA-Device/okra/samples/
+    $ sh runSquares.sh
+If everything is OK this should run the C Squares test app.
+
+Congratulations, you have executed your first HSA enabled app.
+
+##Getting OpenCL headers and libraries
+We need OpenCL headers and libraries to build Aparapi (remember we still support OpenCL).
+
+My recommendation is to download AMD-APP-SDK-v2.9-lnx64.tgz from [http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads](http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads) and extract the libraries and headers.
+
+Note that this archive contains nested archives.
+
+    $ cd ~
+    $ gunzip ~/Downloads/AMD-APP-SDK-v2.9-lnx64.tgz
+    $ tar xvf ~/Downloads/AMD-APP-SDK-v2.9-lnx64.tar
+    $ rm ~/default-install_lnx_64.pl ~/icd-registration.tgz ~/Install-AMD-APP.sh ~/ReadMe.txt
+    $ gunzip ~/AMD-APP-SDK-v2.9-RC-lnx64.tgz
+    $ tar xvf ~/AMD-APP-SDK-v2.9-RC-lnx64.tar
+    $ rm ~/AMD-APP-SDK-v2.9-RC-lnx64.tar
+    $ rm -rf AMD-APP-SDK-v2.9-RC-lnx64/samples
+Note where AMD-APP-SDK-v2.9-RC-lnx64 is located; you will need this in the following step.
+
+##You will need Java 8
+Download the Java 8 JDK from [https://jdk8.java.net/download.html](https://jdk8.java.net/download.html). I chose to download the zipped tar rather than install with RPM so I could control the location of the install.
+
+    $ cd ~
+    $ gunzip /home/gfrost/Downloads/jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar.gz
+    $ tar xvf ~/Downloads/jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar
+I now have ~/jdk1.8.0 as my java 8 install dir.
+
+Alternatively the following will pull from Oracle's site using curl:
+
+    $ cd ~
+    $ curl http://download.java.net/jdk8/archive/b132/binaries/jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar.gz?q=download/jdk8/archive/b132/binaries/jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar.gz > jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar.gz
+    $ gunzip jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar.gz
+    $ tar xvf jdk-8-fcs-bin-b132-linux-x64-04_mar_2014.tar
+I now have ~/jdk1.8.0 as my java 8 install dir.
+
+##You will need ant
+    $ sudo apt-get install ant
+This takes a long time because it also installs a Java 7 JDK.
+
+##You will need g++
+We use g++ to build the JNI side of Aparapi.
+
+    $ sudo apt-get install g++
+##Pulling the HSA enabled Aparapi branch and building
+Now we can pull the Aparapi lambda/HSA branch from SVN
+
+    $ sudo apt-get install subversion
+    $ svn checkout https://aparapi.googlecode.com/svn/branches/lambda aparapi-lambda
+If you are familiar with the Aparapi source structure then this tree should not be much of a surprise, but there are a few subtle changes.
+
+Specifically, the build system has been changed to support OKRA, the Aparapi JNI code is provided as a Java agent, and the execution scripts all refer to ${APARAPI_HOME}/env.sh to set up a reasonable execution environment.
+
+You will need to edit env.sh and make sure that APARAPI_HOME, OKRA_HOME, OCL_HOME and JAVA_HOME are set correctly.
+
+Here is how I set my vars.
+
+|environment variable	|value|
+|-----------------------|-----|
+|JAVA_HOME	|/home/${LOGNAME}/jdk1.8.0|
+|OCL_HOME	|/home/${LOGNAME}/AMD-APP-SDK-v2.9-RC-lnx64|
+|APARAPI_HOME	|/home/${LOGNAME}/aparapi-lambda|
+|OKRA_HOME	|/home/${LOGNAME}/Okra-Interface-to-HSA-Device/okra/|
+
+It is recommended (thanks notzed ;) ) that you test your env.sh using `sh env.sh` until it stops reporting errors. Once you have finished I recommend sourcing it into your current shell before building with ant.
+
+    $ cd ~/aparapi-lambda
+    $ . env.sh
+    $ ant
+If you get any problems check the env.sh vars first.
+
+If all is well you should be able to run some samples.
+
+    $ cd ~/aparapi-lambda/samples/mandel
+    $ sh hsailmandel.sh
\ No newline at end of file
diff --git a/doc/SettingUpLinuxHSAMachineForAparapiSidebar.md b/doc/SettingUpLinuxHSAMachineForAparapiSidebar.md
new file mode 100644
index 0000000000000000000000000000000000000000..9d165df7e1fa5bd66761b5b95482d60225adb6ec
--- /dev/null
+++ b/doc/SettingUpLinuxHSAMachineForAparapiSidebar.md
@@ -0,0 +1,10 @@
+#SettingUpLinuxHSAMachineForAparapiSidebar
+*Sidebar for SettingUpLinuxHSAMachineForAparapi*
+
+* HSA Videos
+    * [http://www.youtube.com/watch?v=5ntILiXTuhE](http://www.youtube.com/watch?v=5ntILiXTuhE)
+    * [http://www.youtube.com/watch?v=caEPq4KvTTA](http://www.youtube.com/watch?v=caEPq4KvTTA)
+* HSA Articles
+    * [http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-computing/](http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-computing/)
+* HSA Foundation
+    * [https://github.com/HSAFoundation](https://github.com/HSAFoundation)
\ No newline at end of file
diff --git a/doc/UnitTestGuide.md b/doc/UnitTestGuide.md
new file mode 100644
index 0000000000000000000000000000000000000000..d812e14e11ffc7b2edfc08d83f831a33ff14091f
--- /dev/null
+++ b/doc/UnitTestGuide.md
@@ -0,0 +1,174 @@
+#UnitTestGuide
+*Find out how to run JUnit tests and how to add new tests. Updated Sep 14, 2011 by frost.g...@gmail.com*
+
+#Unit Test Guide
+
+The Unit Test Guide explains the test infrastructure associated with Aparapi, including instructions for executing existing tests and adding new test cases.
+
+##OpenCL™ code generation tests
+
+The initial open source tree includes the codegen subdirectory (test/codegen), which is used to validate the Aparapi bytecode to OpenCL™ conversion.
+
+    aparapi/trunk/
+       com.amd.aparapi/
+          src/java/com.amd.aparapi/
+          build.xml
+       test/
+          codegen/
+             src/java/
+                com.amd.aparapi/
+                com.amd.aparapi.test/
+             build.xml
+       build.xml
+
+The code generation tests do not require OpenCL™, the AMD APP SDK or a GPU device to be configured; these tests only validate the creation of valid OpenCL™ code by comparing against predefined expected output.
+
+##Running the OpenCL™ code generation JUnit tests
+
+Before executing the code generation tests, build the com.amd.aparapi sub-project and ensure that you have JUnit 4 installed.
+
+Edit the junit.jar property in test/codegen/build.xml to point to your install directory.
+
+    <property name="junit.jar" value="C:\JUnit4.9\junit-4.9.jar"/>
+
+Initiate the code generation tests using ant.
+
+    C:\> cd test/codegen
+    C:\> ant
+    <failures will be reported here>
+    C:\>
+
+View the HTML version of the JUnit report at junit/html/index.html. On Microsoft Windows® platforms use
+
+    C:\> start junit\html\index.html
+
+On Linux® platforms just invoke your browser (Firefox in this case).
+
+    C:\> firefox junit\html\index.html
+
+##Adding a new OpenCL™ code generation test
+
+The test cases for OpenCL™ code generation are not strictly JUnit tests. Instead the codegen Java tree contains a tool (CreateJUnitTests) to create JUnit test cases from specially formatted test source files.
+
+The package `com.amd.aparapi.test` (codegen/src/java/com/amd/aparapi/test) contains all of the existing code generation tests.
+
+Here is an example that tests the code generation resulting from a call to Kernel.getPassId(); it is taken from com.amd.aparapi.test.CallGetPassId.
+
+    package com.amd.aparapi.test;
+
+    import com.amd.aparapi.Kernel;
+
+    public class CallGetPassId extends Kernel{
+       public void run() {
+          int thePassId = getPassId();
+       }
+
+    }
+    /**{OpenCL{
+
+    typedef struct This_s{
+       int passid;
+    }This;
+    int get_pass_id(This *this){
+       return this->passid;
+    }
+    __kernel void run(
+       int passid
+    ){
+       This thisStruct;
+       This* this=&thisStruct;
+       this->passid = passid;
+       {
+          int thePassId = get_pass_id(this);
+          return;
+       }
+    }
+
+    }OpenCL}**/
+
+The test source takes the form of a simple class that extends Kernel plus a block of OpenCL code between the `/**{OpenCL{` and `}OpenCL}**/` markers. The code between these markers is the OpenCL that we expect Aparapi to produce as a result of converting the run() method to OpenCL.
+
+The code-generating ant build.xml file performs the following steps to generate its report:
+
+* compiles the src/java tree. This compiles all the test cases as well as a few ‘utility’ classes.
+* executes the com.amd.aparapi.test.CreateJUnitTests program. This iterates through all of the test source files and converts them to JUnit form. The generated source is written to the src/genjava tree.
+* compiles the src/genjava tree to create the required JUnit classes
+* initiates the JUnit test phase (result data in junit/data)
+* creates the JUnit report (in junit/html/junit from junit/data)
+
+To create a new test case, just add your test case to the `codegen/src/java/com/amd/aparapi/test` package (including the expected OpenCL).
+
+Sometimes different javac implementations (such as Oracle and Eclipse) will generate different bytecode for the same source. When Aparapi converts this bytecode it may yield different (but equally acceptable) OpenCL forms. One example of this is the BooleanToggle test:
+
+    public class BooleanToggle{
+       public void run() {
+          boolean pass = false;
+
+          pass = !pass;
+
+       }
+    }
+
+The BooleanToggle test source yields two slightly different versions of OpenCL™ (sadly, differing by one line) depending on the javac compiler used.
+
+This example shows the 'toggle' OpenCL™ created from the bytecode generated by Oracle's javac:
+
+    pass = pass==1?0:1;
+
+This example shows the OpenCL™ created from the bytecode generated by the Eclipse compiler:
+
+    pass = pass==0?1:0;
+
+Logically either of the above is correct. However, to accommodate the alternate acceptable forms we need to add two complete `/**{OpenCL{` and `}OpenCL}**/` sections to the file. If either matches, the test will pass.
+
+Here is the complete BooleanToggle code.
+
+    package com.amd.aparapi.test;
+
+    public class BooleanToggle{
+       public void run() {
+          boolean pass = false;
+
+          pass = !pass;
+
+       }
+    }
+    /**{OpenCL{
+    typedef struct This_s{
+       int passid;
+    }This;
+    int get_pass_id(This *this){
+       return this->passid;
+    }
+    __kernel void run(
+       int passid
+    ){
+       This thisStruct;
+       This* this=&thisStruct;
+       this->passid = passid;
+       {
+          char pass = 0;
+          pass = (pass==0)?1:0;
+          return;
+       }
+    }
+    }OpenCL}**/
+    /**{OpenCL{
+    typedef struct This_s{
+       int passid;
+    }This;
+    int get_pass_id(This *this){
+       return this->passid;
+    }
+    __kernel void run(
+       int passid
+    ){
+       This thisStruct;
+       This* this=&thisStruct;
+       this->passid = passid;
+       {
+          char pass = 0;
+          pass = (pass!=0)?0:1;
+          return;
+       }
+    }
+    }OpenCL}**/
\ No newline at end of file
diff --git a/doc/UsersGuide.md b/doc/UsersGuide.md
new file mode 100644
index 0000000000000000000000000000000000000000..9a2ae7966805ef20cd2c1a415255b62c9a0cbed1
--- /dev/null
+++ b/doc/UsersGuide.md
@@ -0,0 +1,126 @@
+#UsersGuide
+*Aparapi User's Guide. Updated Sep 14, 2011 by frost.g...@gmail.com*
+##User’s Guide
+Aparapi is an API for expressing data parallel workloads in Java, and a runtime capable of running compatible workloads on a compatible GPU.
+
+Where your workload runs depends on:
+
+* Whether you have a compatible GPU and an OpenCL capable device driver
+* Whether your Java parallel code can be converted to OpenCL by Aparapi
+
+For information about restrictions on the code that Aparapi can convert to OpenCL, see JavaKernelGuidelines.
+
+Aparapi depends on AMD’s OpenCL™ driver to execute on the GPU and therefore shares the same device, driver, and platform compatibility requirements as AMD APP SDK V2.5®.
+
+* 32-bit Microsoft® Windows® 7
+* 32-bit Microsoft® Windows Vista® SP2
+* 64-bit Microsoft® Windows® 7
+* 64-bit Microsoft® Windows Vista® SP2
+* 32-bit Linux® OpenSUSE™ 11.2,   Ubuntu® 10.04/9.10, or Red Hat® Enterprise Linux® 5.5/5.4
+* 64-bit Linux® OpenSUSE™ 11.2,   Ubuntu® 10.04/9.10, or Red Hat® Enterprise Linux® 5.5/5.4
+* An OpenCL GPU and suitable OpenCL enabled device driver
+* An installed AMD APP SDK v2.5 or later
+
+If you prefer to test Aparapi in JTP mode (Java Thread Pool) then you will only need Aparapi.jar and an Oracle Java 6 (or later) JRE or JDK.
+
+The following fragment of Java code takes an input float array and populates an output array with the square of each element.
+
+    final float in[] = new float[8192]; // initialization of in[0..8191] omitted
+    final float out[] = new float[in.length];
+
+    for(int i=0; i<in.length; i++){
+       out[i]=in[i]*in[i];
+    }
+This code segment illustrates an ideal data parallel candidate: each pass through the loop is independent of the others. Traversing the loop in any order should provide the same result.
+
+To convert the above code to Aparapi we use an anonymous inner-class (a common Java idiom) to express the data parallel nature of the above sequential loop.
+
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int i = getGlobalId();
+          out[i]=in[i]*in[i];
+       }
+    };
+    kernel.execute(in.length);
+Java developers should recognize the general pattern as similar to that used to launch a new Thread.
+
+    Thread thread = new Thread(new Runnable(){
+       @Override public void run(){
+           System.out.println("In another thread!");
+       }
+    });
+    thread.start();
+    thread.join();
+The Aparapi developer extends com.amd.aparapi.Kernel and overrides the public void Kernel.run() method. It is this Kernel.run() method that is executed in parallel.
+
+The base class also exposes the Kernel.execute(range) method which is used to initiate the execution of Kernel.run() over the range 0...n.
+
+Kernel.execute(range) will block until execution has completed. Any code within the overridden ‘void run()’ method of Kernel (and indeed any method or methods reachable from that method) is assumed to be data-parallel and it is the developer’s responsibility to ensure that it is. Aparapi can neither detect nor enforce this.
+
+Within the executing kernel (on the GPU device or from the thread pool) the Kernel.getGlobalId() method is used to identify which element (of the range 0..n) a particular execution represents.
+
+## Compiling an Aparapi application
+Aparapi has only two compilation requirements:
+
+* Aparapi.jar must be in the class path at compile time.
+* The generated class files must contain debug information (javac -g).
+
+A typical compilation might be:
+
+    $ javac -g -cp ${APARAPI_DIR}/aparapi.jar Squares.java
+Aparapi requires this classfile debug information so that it can extract the name and scope of local variables for the generated OpenCL.
+
+## Running an Aparapi application
+At runtime an Aparapi-enabled application requires aparapi.jar to be in the class path to be able to execute in a Java Thread Pool (no GPU offload).
+
+    $ java -cp ${APARAPI_DIR}/aparapi.jar;. Squares
+To take advantage of the GPU, the directory containing the platform-dependent Aparapi shared library is passed via the java.library.path property.
+
+    $ java -Djava.library.path=${APARAPI_DIR} -cp ${APARAPI_DIR}/aparapi.jar;. Squares
+
+Aparapi detects whether the JNI shared library is available. If the library cannot be located your code will be executed using a Java Thread Pool.
+
+An application can detect whether a kernel was executed on the GPU or by a Java Thread Pool (JTP) by querying the execution mode after Kernel.execute(range) has returned. This is achieved using the Kernel.getExecutionMode() method.
+
+    Kernel kernel = new Kernel(){
+       @Override public void run(){
+          int i = getGlobalId();
+          out[i]=in[i]*in[i];
+       }
+    };
+    kernel.execute(in.length);
+    if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
+       System.out.println("Kernel did not execute on the GPU!");
+    }
+
+To obtain a runtime report of the execution mode of all kernel executions, set the com.amd.aparapi.enableExecutionModeReporting property to true when the JVM is launched.
+
+    $ java -Djava.library.path=${APARAPI_DIR} -Dcom.amd.aparapi.enableExecutionModeReporting=true -cp ${APARAPI_DIR}/aparapi.jar;. Squares
+
+##Running the sample applications
+Aparapi includes two sample applications in the /samples subdirectory of the binary distribution zip file.
+
+| Sample | Description |
+|----------------|------|
+| samples/squares | simple example that computes an array of squares of integers |
+| samples/mandel | computes and displays the Mandelbrot set |
+
+The jar file for each sample is included (so you can run a sample without having to build it) as well as both Linux® and Microsoft Windows® script files for launching the samples.
+
+You will need an appropriate GPU card, an OpenCL® enabled Catalyst® driver and a compatible Oracle Java 6 JRE for your platform. To execute a sample:
+
+1. Set the environment variable JAVA_HOME to point to the root of your JRE or JDK.
+2. Change to the appropriate samples directory (samples/squares or samples/mandel).
+3. Run either the .bat or .sh script. On Linux®, you might have to initially chmod +x script.sh to add execute permissions.
+
+The sample scripts pass the first arg (%1 or $1) to -Dcom.amd.aparapi.executionMode when the JVM is launched. This allows the sample to be tested in either GPU or JTP execution modes by passing the requested mode.
+
+    $ cd samples/mandel
+    $ bash ./mandel.sh GPU
+    <executes in GPU mode here>
+    $ bash ./mandel.sh JTP
+    <executes in JTP mode here>
+
+## Building the sample applications
+To build a sample, install Oracle® JDK 6 and Apache Ant (at least 1.7.1).
+
+1. Set the environment variable ANT_HOME to point to the root of your ant install.
+2. Ensure that %ANT_HOME%/bin or ${ANT_HOME}/bin is in your path.
+3. Set the environment variable JAVA_HOME to point to the root of your JDK.
+4. Change to the appropriate samples directory (samples/squares or samples/mandel).
+5. Initiate a build using ant.
+
+    $ cd samples/mandel
+    $ ant
+    $ bash ./mandel.sh GPU
+
+[Attribution](Attribution.md)
\ No newline at end of file
diff --git a/doc/UsingAparapiLambdaBranchWithHSASimulator.md b/doc/UsingAparapiLambdaBranchWithHSASimulator.md
new file mode 100644
index 0000000000000000000000000000000000000000..4e35c0ebce318734ccd4247d81bce557398b6ab0
--- /dev/null
+++ b/doc/UsingAparapiLambdaBranchWithHSASimulator.md
@@ -0,0 +1,46 @@
+#UsingAparapiLambdaBranchWithHSASimulator
+*Using the HSA simulator to test the Aparapi lambda branch. Updated Feb 28, 2014 by frost.g...@gmail.com*
+
+##Introduction
+Although HSA compatible devices are available, we understand that Aparapi developers may not have access to these devices.
+
+The HSA foundation has open sourced an LLVM based HSAIL emulator which we can use to test HSAIL generated code.
+
+The project is based here ([https://github.com/HSAFoundation/Okra-Interface-to-HSAIL-Simulator](https://github.com/HSAFoundation/Okra-Interface-to-HSAIL-Simulator)) but we have extracted detailed download and build instructions for Ubuntu below.
+
+Aparapi users/developers can use this simulator to test correctness.
+
+##Building the HSA Simulator on Ubuntu
+We assume you have ant, svn and g++ available, since you need them to build the other Aparapi artifacts.
+
+You will also need git, libelf-dev, libdwarf-dev, flex and cmake
+
+    $ sudo apt-get install git libelf-dev libdwarf-dev flex cmake
+
+login...
+
+    $ git clone https://github.com/HSAFoundation/Okra-Interface-to-HSAIL-Simulator.git okra
+    $ cd okra
+    $ ant -f build-okra-sim.xml
+
+The build should take approximately 15 mins.
+
+##How to set up and test an initial lambda/HSA enabled Aparapi build
+Assuming you have built okra in /home/gfrost/okra
+
+Assuming your Java8 JDK is in /home/gfrost/jdk1.8.0
+
+Assuming your aparapi svn trunk is /home/gfrost/aparapi
+
+    $ export JAVA_HOME=/home/gfrost/jdk1.8.0
+    $ export OKRA=/home/gfrost/okra
+    $ export PATH=${PATH}:${JAVA_HOME}/bin:${OKRA}/dist/bin
+    $ java -version
+    java version "1.8.0-ea"
+    Java(TM) SE Runtime Environment (build 1.8.0-ea-b94)
+    Java HotSpot(TM) 64-Bit Server VM (build 25.0-b36, mixed mode)
+    $ cd /home/gfrost/aparapi/branches/lambda
+    $ ant
+    $ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${OKRA}/dist/bin
+    $ java -agentpath:com.amd.aparapi.jni/dist/libaparapi_x86_64.so -cp com.amd.aparapi/dist/aparapi.jar:${OKRA}/dist/okra.jar hsailtest.Squares
+    $
\ No newline at end of file
diff --git a/doc/UsingConstantMemory.md b/doc/UsingConstantMemory.md
new file mode 100644
index 0000000000000000000000000000000000000000..9bd53e6ac6602de0ae34f91eb5fd8f9e70c547a6
--- /dev/null
+++ b/doc/UsingConstantMemory.md
@@ -0,0 +1,50 @@
+#UsingConstantMemory
+*How to make use of constant memory in a Kernel Updated Feb 28, 2012 by frost.g...@gmail.com*
+##How to make use of new constant memory feature
+By default all primitive arrays accessed by an Aparapi Kernel are considered global. If we look at the generated code using `-Dcom.amd.aparapi.enableShowGeneratedOpenCL=true` we will see that primitive arrays (such as `int buf[]`) are mapped to `__global` pointers (such as `__global int *buf`) in OpenCL.
+
+Although this makes Aparapi easy to use (especially for Java developers who are unfamiliar with tiered memory hierarchies), it does limit the ability of the 'power developer' wanting to extract more performance from Aparapi on the GPU.
+
+This [page](http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx?cmpid=cp_article_2_2010) from AMD's website shows the different types of memory that OpenCL programmers can exploit.
+
+Global memory buffers in Aparapi (primitive Java arrays) are stored in host memory and are copied to Global memory (the RAM of the GPU card).
+
+Local memory is 'closer' to the compute devices and is not copied from the host memory; it is just allocated for use on the device. The use of local memory in OpenCL can lead to much more performant code as the cost of fetching from local memory is much lower.
+
+Local memory is shared by all work items (kernel instances) executing in the same group. This is why the use of local memory was deferred until we had a satisfactory mechanism for specifying a required group size.
+
+We recently also added support for constant memory for data that needs to be written once to the GPU but will not change.
+
+Aparapi only supports constant arrays, not scalars.
+
+##How to define a primitive array as "constant"
+We have two ways to define a constant buffer. Either we can decorate the variable name with a _$constant$ suffix (yes, it is a valid identifier in Java):
+
+    final int[] buffer = new int[1024]; // this is global, accessible to all work items
+    final int[] buffer_$constant$ = new int[]{1,2,3,4,5,6,7,8,9}; // this is a constant buffer
+
+    Kernel k = new Kernel(){
+        public void run(){
+             // access buffer
+             // access buffer_$constant$
+             // ....
+        }
+    };
+
+Alternatively (if defining inside the derived Kernel class - cannot be used via anonymous inner class pattern above!) we can use the @Constant annotation.
+
+    final int[] buffer = new int[1024]; // this is global, accessible to all work items
+
+    Kernel k = new Kernel(){
+        @Constant int[] constantBuffer = new int[]{1,2,3,4,5,6,7,8,9}; // this is a constant buffer
+        public void run(){
+             // access buffer
+             // access constantBuffer
+             // ....
+        }
+    };
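+
+As a usage sketch (not part of the original page; the buffer names and values here are made up), a constant buffer works well as a small read-only lookup table consulted by every work item:
+
+    final int[] input = new int[1024];            // values assumed to be in 0..8; initialization omitted
+    final int[] output = new int[input.length];
+
+    Kernel kernel = new Kernel(){
+        @Constant int[] table = new int[]{10,20,30,40,50,60,70,80,90}; // written once, never changed by the kernel
+        public void run(){
+             int i = getGlobalId();
+             output[i] = table[input[i]];          // every work item reads the same constant table
+        }
+    };
+    kernel.execute(input.length);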
+
+##Can I see some code?
+I updated the Mandelbrot example so that the palette of RGB values is represented using constant memory; the source can be found here (look at line #95). BTW, for me this resulted in a 5-7% performance improvement.
+
+[http://code.google.com/p/aparapi/source/browse/trunk/samples/mandel/src/com/amd/aparapi/sample/mandel/Main.java](http://code.google.com/p/aparapi/source/browse/trunk/samples/mandel/src/com/amd/aparapi/sample/mandel/Main.java)
\ No newline at end of file
diff --git a/doc/UsingLocalMemory.md b/doc/UsingLocalMemory.md
new file mode 100644
index 0000000000000000000000000000000000000000..e74376b343649d71f80eadb1ed4527f9b0bc2e03
--- /dev/null
+++ b/doc/UsingLocalMemory.md
@@ -0,0 +1,180 @@
+#UsingLocalMemory
+*How to make use of local memory in a Kernel Updated Feb 28, 2012 by frost.g...@gmail.com*
+##How to make use of new local memory feature
+By default all primitive arrays accessed by an Aparapi Kernel are considered global. If we look at the generated code using -Dcom.amd.aparapi.enableShowGeneratedOpenCL=true we will see that primitive arrays (such as int buf[]) are mapped to __global pointers (such as __global int *buf) in OpenCL.
+
+Although this makes Aparapi easy to use (especially for Java developers who are unfamiliar with tiered memory hierarchies), it does limit the ability of the 'power developer' wanting to extract more performance from Aparapi on the GPU.
+
+This [page](http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx?cmpid=cp_article_2_2010) from AMD's website shows the different types of memory that OpenCL programmers can exploit.
+
+Global memory buffers in Aparapi (primitive Java arrays) are stored in host memory and are copied to Global memory (the RAM of the GPU card).
+
+Local memory is 'closer' to the compute devices and is not copied from the host memory; it is just allocated for use on the device. The use of local memory in OpenCL can lead to much more performant code as the cost of fetching from local memory is much lower.
+
+Local memory is shared by all work items (kernel instances) executing in the same group. This is why the use of local memory was deferred until we had a satisfactory mechanism for specifying a required group size.
+
+Aparapi only supports local arrays, not scalars.
+
+##How to define a primitive array as "local"
+We have two ways to define a local buffer. Either we can decorate the variable name with a _$local$ suffix (yes, it is a valid identifier in Java):
+
+    final int[] buffer = new int[1024]; // this is global, accessible to all work items
+    final int[] buffer_$local$ = new int[1024]; // this is a local buffer of 1024 ints shared across all work items in a group
+
+    Kernel k = new Kernel(){
+        public void run(){
+             // access buffer
+             // access buffer_$local$
+             localBarrier(); // allows all writes to buffer_$local$ to be synchronized across all work items in this group
+             // ....
+        }
+    };
+Alternatively (if defining inside the derived Kernel class - cannot be used via anonymous inner class pattern above!) we can use the @Local annotation.
+
+    final int[] buffer = new int[1024]; // this is global, accessible to all work items
+
+    Kernel k = new Kernel(){
+        @Local int[] localBuffer = new int[1024]; // this is a local buffer of 1024 ints shared across all work items in a group
+        public void run(){
+             // access buffer
+             // access localBuffer
+             localBarrier(); // allows all writes to localBuffer to be synchronized across all work items in this group
+             // ....
+        }
+    };
+##How do I know how big to make my local buffer?
+This is where the new Range class helps.
+
+If we create a Range using:
+
+    Range rangeWithUndefinedGroupSize = Range.create(1024);
+Then Aparapi will pick a suitable group size. Generally this will be the highest factor of the global size <= 256. So for a global size which is a power of two (and greater than or equal to 256 ;) ) the group size will be 256.
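+
+For example, to see which group size the runtime actually picked (a small sketch, not from the original page):
+
+    Range range = Range.create(1024);
+    System.out.println("group size = " + range.getLocalSize(0)); // probably 256 for this power-of-two global size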
+
+Normally the size of a local buffer will be some ratio of the group size.
+
+So if we needed 4 ints per work item in the group we might use a sequence such as:
+
+    final int[] buffer = new int[8192]; // this is global, accessible to all work items
+    final Range range = Range.create(buffer.length); // let the runtime pick the group size
+
+    Kernel k = new Kernel(){
+        @Local int[] localBuffer = new int[range.getLocalSize(0)*4]; // this is a local buffer containing 4 ints per work item in the group
+        public void run(){
+             // access buffer
+             // access localBuffer
+             localBarrier(); // allows all writes to localBuffer to be synchronized across all work items in this group
+             // ....
+        }
+    };
+Alternatively you can of course specify your own group size when you create the Range.
+
+    final int[] buffer = new int[8192]; // this is global, accessible to all work items
+    final Range range = Range.create(buffer.length,16); // we requested a group size of 16
+
+    Kernel k = new Kernel(){
+        @Local int[] localBuffer = new int[range.getLocalSize(0)*4]; // this is a local buffer containing 4 ints per work item in the group = 64 ints
+        public void run(){
+             // access buffer
+             // access localBuffer
+             localBarrier(); // allows all writes to localBuffer to be synchronized across all work items in this group
+             // ....
+        }
+    };
+##Using barriers
+As we mentioned above, local memory buffers are shared by all work items/kernels executing in the same group. However, to read a value written by another work item we need to insert a local barrier.
+
+A common pattern involves having each work item copy a value from global memory into local memory.
+
+    Kernel k = new Kernel(){
+        @Local int[] localBuffer = new int[range.getLocalSize(0)];
+        public void run(){
+
+             localBuffer[getLocalId(0)] = globalBuffer[getGlobalId(0)];
+             localBarrier(); // after this all kernels can see the data copied by other work items in this group
+             // use localBuffer[0..getLocalSize(0)]
+        }
+    };
+Without the barrier above, there is no guarantee that other work items will see mutations to localBuffer from other work items.
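+
+To make the copy-then-share pattern concrete, here is a minimal sketch (not from the original page; the buffer names, sizes and the use of getGroupId(0) are assumptions) of a per-group sum reduction that combines local memory with barriers. Note that every work item executes the same number of localBarrier() calls, which is exactly the rule discussed below:
+
+    final int[] globalBuffer = new int[8192];                  // initialization omitted
+    final int[] partialSums = new int[8192 / 256];             // one partial sum per group
+    final Range range = Range.create(globalBuffer.length, 256);
+
+    Kernel kernel = new Kernel(){
+        @Local int[] localSums = new int[range.getLocalSize(0)];
+        public void run(){
+             int lid = getLocalId(0);
+             localSums[lid] = globalBuffer[getGlobalId(0)];     // each work item copies one value into local memory
+             localBarrier();                                    // everyone can now see everyone else's copy
+             for (int step = getLocalSize(0) / 2; step > 0; step /= 2){
+                 if (lid < step){
+                     localSums[lid] += localSums[lid + step];   // tree reduction within the group
+                 }
+                 localBarrier();                                // outside the 'if', so every work item reaches it
+             }
+             if (lid == 0){
+                 partialSums[getGroupId(0)] = localSums[0];     // one result per group written back to global memory
+             }
+        }
+    };
+    kernel.execute(range);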
+
+##Caution regarding barriers
+Barriers can be dangerous. It is up to the developer to ensure that all kernels execute the same # of calls to localBarrier(). Be very careful with conditional code (or code containing loops!), to ensure that each kernel executes the same number of calls to localBarrier().
+
+The following kernel will deadlock!
+
+    Kernel kernel = new Kernel(){
+        public void run(){
+             if (getGlobalId(0)>10){
+                // ...
+                localBarrier();
+                // ...
+             }
+        }
+    };
+We need to make sure that all kernels in a group execute the localBarrier(). So the following will work.
+
+    Kernel kernel = new Kernel(){
+        public void run(){
+             if (getGlobalId(0)>10){
+                // ...
+                localBarrier();
+                // ...
+             }else{
+                localBarrier();
+             }
+
+        }
+    };
+Of course if we have multiple calls to localBarrier() in the 'if' side of the if..then, then we must match the number of calls in the 'else'.
+
+    Kernel kernel = new Kernel(){
+        public void run(){
+             if (getGlobalId(0)>10){
+                // ...
+                localBarrier();
+                // ...
+                localBarrier();
+                // ...
+             }else{
+                localBarrier();
+                localBarrier();
+             }
+
+        }
+    };
+With loops we must make sure that each kernel processes any loop the same # of times.
+
+So the following is fine.
+
+    Kernel kernel = new Kernel(){
+        public void run(){
+             for (int i=0; i< 10; i++){
+                // ...
+                localBarrier();
+                // ...
+             }
+        }
+    };
+However, the following will deadlock.
+
+    Kernel kernel = new Kernel(){
+        public void run(){
+             for (int i=0; i< getLocalId(0); i++){
+                // ...
+                localBarrier();
+                // ...
+             }
+        }
+    };
+As a testament to how well we emulate OpenCL in JTP mode, this will also deadlock your kernel in JTP mode ;) so be careful.
+
+##Performance impact in JTP mode
+Of course Java itself does not support local memory in any form. So any time code using local memory falls back to JTP mode we must expect a considerable performance degradation (try the NBody local example in JTP mode).
+
+We do honor localBarrier() using Java's barrier from the new concurrency utils. However, Java's memory model does not require the use of a barrier to observe array changes across threads. So these barriers are basically just an expense.
+
+I would recommend using local memory and barriers only if you are 90% sure the code will run on the GPU.
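+
+One way to act on that advice is to check, after the fact, where the kernel actually ran. A minimal sketch (the kernel class and range names here are hypothetical):
+
+    Kernel kernel = new MyLocalMemoryKernel();   // hypothetical Kernel subclass that uses @Local buffers
+    kernel.execute(range);
+    if (!kernel.getExecutionMode().equals(Kernel.EXECUTION_MODE.GPU)){
+        // we fell back to JTP mode, so the localBarrier() calls are pure overhead;
+        // consider switching to a simpler kernel without local memory for subsequent passes
+        System.out.println("Kernel fell back to " + kernel.getExecutionMode());
+    }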
+
+##Can I see some code?
+I added a version of NBody example which uses local memory, the source can be found here.
+
+[http://code.google.com/p/aparapi/source/browse/trunk/examples/nbody/src/com/amd/aparapi/examples/nbody/Local.java](http://code.google.com/p/aparapi/source/browse/trunk/examples/nbody/src/com/amd/aparapi/examples/nbody/Local.java)
\ No newline at end of file
diff --git a/doc/UsingMultiDimExecutionRanges.md b/doc/UsingMultiDimExecutionRanges.md
new file mode 100644
index 0000000000000000000000000000000000000000..adaf8190f42f3f54fcf93796c34b2749a7ebf616
--- /dev/null
+++ b/doc/UsingMultiDimExecutionRanges.md
@@ -0,0 +1,60 @@
+#UsingMultiDimExecutionRanges
+*How to use the new Range class (for multi-dim range access) Updated Feb 13, 2012 by frost.g...@gmail.com*
+
+Aparapi now allows developers to execute over one, two or three dimensional ranges. OpenCL natively allows the user to execute over 1, 2 or 3 dimension grids via the clEnqueueNDRangeKernel() method.
+
+Initially we chose not to expose 2D or 3D ranges (Aparapi's Kernel.execute(range) allowed only 1D ranges), but following a specific request we added the notion of a Range via the new com.amd.aparapi.Range class.
+
+A range is created using various static factory methods. For example, to create a simple range {0..1024} we would use:
+
+    Range range = Range.create(1024);
+
+In this case the range will cover 1024 work items and a 'default' group size will be decided behind the scenes (probably 256 in this case).
+
+If the user wishes to select a specific group size (say 32) for a one dimensional Range (0..1024) then they can use:
+
+    Range range = Range.create(1024, 32);
+
+The group size must always be a 'factor' of the global range, so globalRange % groupSize == 0.
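+
+If the natural problem size is not a multiple of the group size you want, a common workaround (a sketch, not from the original page; the variable names are hypothetical) is to round the global size up and guard the extra work items inside the kernel:
+
+    final int n = 1000;                                      // actual problem size
+    final int groupSize = 32;
+    final int globalSize = ((n + groupSize - 1) / groupSize) * groupSize; // rounded up to 1024
+    Range range = Range.create(globalSize, groupSize);       // 1024 % 32 == 0, so this is a legal range
+
+    Kernel kernel = new Kernel(){
+       public void run(){
+          int i = getGlobalId();
+          if (i < n){
+             // only the first n work items do real work; the padding items fall through
+          }
+       }
+    };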
+
+For a 2D range we use the Range.create2D(...) factory methods.
+
+    Range range = Range.create2D(32, 32);
+
+The above represents a 2D grid of execution, 32 rows by 32 columns. In this case a default group size will be determined by the runtime.
+
+If we wish to specify the group size (say 4x4) then we can use:
+
+    Range range = Range.create2D(32, 32, 4, 4);
+
+This example uses a 2D range to apply a blurring convolution effect to a pixel buffer.
+
+    final static int WIDTH=128;
+    final static int HEIGHT=64;
+    final int in[] = new int[WIDTH*HEIGHT];
+    final int out[] = new int[WIDTH*HEIGHT];
+    Kernel kernel = new Kernel(){
+       public void run(){
+          int x = getGlobalId(0);
+          int y = getGlobalId(1);
+          if (x>0 && x<(getGlobalSize(0)-1) && y>0 && y<(getGlobalSize(1)-1)){
+             int sum = 0;
+             for (int dx =-1; dx<2; dx++){
+               for (int dy =-1; dy<2; dy++){
+                 sum+=in[(y+dy)*getGlobalSize(0)+(x+dx)];
+               }
+             }
+             out[y*getGlobalSize(0)+x] = sum/9;
+          }
+       }
+
+    };
+    Range range = Range.create2D(WIDTH, HEIGHT);
+    kernel.execute(range);
+
+##Handling this from JTP mode
+Mapping to OpenCL for this is all fairly straightforward.
+
+In Java JTP mode we have to emulate the execution over the 1D, 2D and 3D ranges using threads. Note that the number of threads we launch is essentially the size of the group. So be careful creating large groups.
+
+If we ask for a 3D range using:
+
+    Range range = Range.create3D(1024, 1024, 1024, 8, 8, 8);
+
+We are asking for a group size of 8x8x8 == 512. So we are asking for 512 threads!
\ No newline at end of file
diff --git a/doc/uml.png b/doc/uml.png
new file mode 100644
index 0000000000000000000000000000000000000000..5dfc8690ae8500506d8337472ce5efae832de47c
Binary files /dev/null and b/doc/uml.png differ