Sunday, 25 April 2021

OpenCL max_work_item_sizes

I am having trouble understanding what the work-item size constraints mean. I am using pyopencl, and when I look at max_work_item_sizes it gives what I assumed was the maximum number of global work items for each dimension.

import pyopencl as cl
import numpy as np

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

queue.device.max_work_item_sizes # [1024, 1024, 64]
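
For context, two related device limits can be queried the same way (these pyopencl properties correspond to CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS; the example values are assumptions for this device):

queue.device.max_work_group_size       # e.g. 1024, total work items allowed in one work-group
queue.device.max_work_item_dimensions  # e.g. 3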

I can simulate the np.arange function with the following:

prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
  // one work item per element: each writes its own global id
  int gid = get_global_id(0);
  res_g[gid] = gid;
}
""").build()

res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)  # room for 4096 int32 values
prg.arange(queue, [4096], None, res_g)  # global size 4096, local size chosen by the runtime

# transfer back to the CPU
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)

assert (res_np == np.arange(4096)).all()  # this is true

How is it possible to specify more than 1024 work items for the first dimension? What does max_work_item_sizes actually mean?
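
One way to probe this (a sketch, assuming the limit applies to the work-group, i.e. local, size per dimension rather than to the global size):

# same arange kernel and buffer as above, now with explicit local sizes
# (note: even 1024 may be rejected on some devices if the kernel's own
# work-group size limit is smaller than the device limit)
prg.arange(queue, [4096], [1024], res_g)  # local size at the reported limit

try:
    prg.arange(queue, [4096], [2048], res_g)  # exceeds max_work_item_sizes[0]
except cl.LogicError as err:
    print("local size 2048 rejected:", err)

If that reading is right, the global size of 4096 in the first example is fine because the runtime splits it into work-groups that each respect the per-dimension limit.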

A related question: is it beneficial to use as many work dimensions as possible? As I understand it, at most 3 dimensions can be used. One way of simulating np.arange with 2 work-item dimensions would be the following:

prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
  // intended as a unique flattened index, but the product of the two
  // global ids is not unique per work item (see below)
  int gid = get_global_id(0) * get_global_id(1);
  barrier(CLK_GLOBAL_MEM_FENCE);
  res_g[gid] = gid;
}
""").build()

res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange(queue, [64, 64], [1, 1], res_g)  # 64 * 64 = 4096 work items in 1x1 work-groups

# transfer back to the CPU
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)

assert (res_np == np.arange(4096)).all()

This assertion is not reliably true, and the reason is the index computation: get_global_id(0) * get_global_id(1) is not unique per work item (every work item with a zero id computes gid = 0, and products such as 2 * 6 and 3 * 4 collide), so large parts of the buffer are never written and keep whatever uninitialized data they held.
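
For comparison, a 2d version that does produce a unique index for every work item (a sketch; the kernel name arange2d is my own, and the barrier is dropped because each work item only touches its own element):

prg = cl.Program(ctx, """
__kernel void arange2d(__global int *res_g)
{
  // row-major flattening: unique over the 64 x 64 global range
  int gid = get_global_id(0) * get_global_size(1) + get_global_id(1);
  res_g[gid] = gid;
}
""").build()

res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange2d(queue, [64, 64], None, res_g)

res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all()  # this should now hold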

But my question is: when processing a large array, is it better to make use of all 3 work-item dimensions, or is it better to treat the array as a 1d contiguous array and use only get_global_id(0)?
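
To make the comparison concrete, here is the 3d variant of the same flattening (a sketch; the kernel name arange3d and the 16 x 16 x 16 split are my own choices, and whichever layout is faster presumably depends on the device and on how the indexing matches the memory access pattern):

prg = cl.Program(ctx, """
__kernel void arange3d(__global int *res_g)
{
  // row-major flattening of the three global ids into one index
  int gid = (get_global_id(0) * get_global_size(1) + get_global_id(1))
            * get_global_size(2) + get_global_id(2);
  res_g[gid] = gid;
}
""").build()

res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange3d(queue, [16, 16, 16], None, res_g)  # 16 * 16 * 16 = 4096 work items

res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all()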



