Page 1 of 1

OpenCL

Posted: Sun Feb 03, 2013 4:21 pm
by Fred
Here is a working version (at least on Windows x86 and x64) of the OpenCL trial made by Guimauve in this thread: http://www.purebasic.fr/english/viewtop ... lit=opencl

To have it working, you need to install the OpenCL SDK (CUDA SDK for NVidia card, AMD APP SDK for AMD card). Then you need to locate the OpenCL.lib file and change the absolute path in OpenCL.pbi. It's quick and dirty but should be a good learn to learn about OpenCL. It should also work on OS X / Linux, but I don't have tested it.

OpenCL.pbi:

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : OpenCL Constants
; File Name : OpenCL Constants.pb
; File version: 1.1.0
; Programming : OK
; Programmed by : Guimauve
; Date : 12-10-2012
; Last Update : 12-10-2012
; PureBasic code : 4.70
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; From cl.h header file
;
;  * Copyright (c) 2011 The Khronos Group Inc.
;  *
;  * Permission is hereby granted, free of charge, To any person obtaining a
;  * copy of this software And/Or associated documentation files (the
;  * "Materials"), To deal in the Materials without restriction, including
;  * without limitation the rights To use, copy, modify, merge, publish,
;  * distribute, sublicense, And/Or sell copies of the Materials, And To
;  * permit persons To whom the Materials are furnished To do so, subject To
;  * the following conditions:
;  *
;  * The above copyright notice And this permission notice shall be included
;  * in all copies Or substantial portions of the Materials.
;  *
;  * THE MATERIALS ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
;  * EXPRESS OR IMPLIED, INCLUDING BUT Not LIMITED TO THE WARRANTIES OF
;  * MERCHANTABILITY, FITNESS For A PARTICULAR PURPOSE And NONINFRINGEMENT.
;  * IN NO EVENT SHALL THE AUTHORS Or COPYRIGHT HOLDERS BE LIABLE FOR ANY
;  * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
;  * TORT OR OTHERWISE, ARISING FROM, OUT OF Or IN CONNECTION With THE
;  * MATERIALS OR THE USE OR OTHER DEALINGS IN THE MATERIALS.
;
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Error Codes <<<<<

#CL_SUCCESS = 0
#CL_DEVICE_NOT_FOUND = -1
#CL_DEVICE_NOT_AVAILABLE = -2
#CL_COMPILER_NOT_AVAILABLE = -3
#CL_MEM_OBJECT_ALLOCATION_FAILURE = -4
#CL_OUT_OF_RESOURCES = -5
#CL_OUT_OF_HOST_MEMORY = -6
#CL_PROFILING_INFO_NOT_AVAILABLE = -7
#CL_MEM_COPY_OVERLAP = -8
#CL_IMAGE_FORMAT_MISMATCH = -9
#CL_IMAGE_FORMAT_NOT_SUPPORTED = -10
#CL_BUILD_PROGRAM_FAILURE = -11
#CL_MAP_FAILURE = -12
#CL_MISALIGNED_SUB_BUFFER_OFFSET = -13
#CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST = -14
#CL_COMPILE_PROGRAM_FAILURE = -15
#CL_LINKER_NOT_AVAILABLE = -16
#CL_LINK_PROGRAM_FAILURE = -17
#CL_DEVICE_PARTITION_FAILED = -18
#CL_KERNEL_ARG_INFO_NOT_AVAILABLE = -19
#CL_INVALID_VALUE = -30
#CL_INVALID_DEVICE_TYPE = -31
#CL_INVALID_PLATFORM = -32
#CL_INVALID_DEVICE = -33
#CL_INVALID_CONTEXT = -34
#CL_INVALID_QUEUE_PROPERTIES = -35
#CL_INVALID_COMMAND_QUEUE = -36
#CL_INVALID_HOST_PTR = -37
#CL_INVALID_MEM_OBJECT = -38
#CL_INVALID_IMAGE_FORMAT_DESCRIPTOR = -39
#CL_INVALID_IMAGE_SIZE = -40
#CL_INVALID_SAMPLER = -41
#CL_INVALID_BINARY = -42
#CL_INVALID_BUILD_OPTIONS = -43
#CL_INVALID_PROGRAM = -44
#CL_INVALID_PROGRAM_EXECUTABLE = -45
#CL_INVALID_KERNEL_NAME = -46
#CL_INVALID_KERNEL_DEFINITION = -47
#CL_INVALID_KERNEL = -48
#CL_INVALID_ARG_INDEX = -49
#CL_INVALID_ARG_VALUE = -50
#CL_INVALID_ARG_SIZE = -51
#CL_INVALID_KERNEL_ARGS = -52
#CL_INVALID_WORK_DIMENSION = -53
#CL_INVALID_WORK_GROUP_SIZE = -54
#CL_INVALID_WORK_ITEM_SIZE = -55
#CL_INVALID_GLOBAL_OFFSET = -56
#CL_INVALID_EVENT_WAIT_LIST = -57
#CL_INVALID_EVENT = -58
#CL_INVALID_OPERATION = -59
#CL_INVALID_GL_OBJECT = -60
#CL_INVALID_BUFFER_SIZE = -61
#CL_INVALID_MIP_LEVEL = -62
#CL_INVALID_GLOBAL_WORK_SIZE = -63
#CL_INVALID_PROPERTY = -64
#CL_INVALID_IMAGE_DESCRIPTOR = -65
#CL_INVALID_COMPILER_OPTIONS = -66
#CL_INVALID_LINKER_OPTIONS = -67
#CL_INVALID_DEVICE_PARTITION_COUNT = -68

; <<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< OpenCL Version <<<<<

#CL_VERSION_1_0 = 1
#CL_VERSION_1_1 = 1
#CL_VERSION_1_2 = 1

; <<<<<<<<<<<<<<<<<<<
; <<<<< cl_bool <<<<<

#CL_FALSE = 0
#CL_TRUE = 1
#CL_BLOCKING = #CL_TRUE
#CL_NON_BLOCKING = #CL_FALSE

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_platform_info <<<<<

#CL_PLATFORM_PROFILE = $0900
#CL_PLATFORM_VERSION = $0901
#CL_PLATFORM_NAME = $0902
#CL_PLATFORM_VENDOR = $0903
#CL_PLATFORM_EXTENSIONS = $0904

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_type - bitfield <<<<<

#CL_DEVICE_TYPE_DEFAULT = (1 << 0)
#CL_DEVICE_TYPE_CPU = (1 << 1)
#CL_DEVICE_TYPE_GPU = (1 << 2)
#CL_DEVICE_TYPE_ACCELERATOR = (1 << 3)
#CL_DEVICE_TYPE_CUSTOM = (1 << 4)
#CL_DEVICE_TYPE_ALL = $FFFFFFFF

; <<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_info <<<<<

#CL_DEVICE_TYPE = $1000
#CL_DEVICE_VENDOR_ID = $1001
#CL_DEVICE_MAX_COMPUTE_UNITS = $1002
#CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = $1003
#CL_DEVICE_MAX_WORK_GROUP_SIZE = $1004
#CL_DEVICE_MAX_WORK_ITEM_SIZES = $1005
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR = $1006
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT = $1007
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT = $1008
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG = $1009
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = $100A
#CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE = $100B
#CL_DEVICE_MAX_CLOCK_FREQUENCY = $100C
#CL_DEVICE_ADDRESS_BITS = $100D
#CL_DEVICE_MAX_READ_IMAGE_ARGS = $100E
#CL_DEVICE_MAX_WRITE_IMAGE_ARGS = $100F
#CL_DEVICE_MAX_MEM_ALLOC_SIZE = $1010
#CL_DEVICE_IMAGE2D_MAX_WIDTH = $1011
#CL_DEVICE_IMAGE2D_MAX_HEIGHT = $1012
#CL_DEVICE_IMAGE3D_MAX_WIDTH = $1013
#CL_DEVICE_IMAGE3D_MAX_HEIGHT = $1014
#CL_DEVICE_IMAGE3D_MAX_DEPTH = $1015
#CL_DEVICE_IMAGE_SUPPORT = $1016
#CL_DEVICE_MAX_PARAMETER_SIZE = $1017
#CL_DEVICE_MAX_SAMPLERS = $1018
#CL_DEVICE_MEM_BASE_ADDR_ALIGN = $1019
#CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE = $101A
#CL_DEVICE_SINGLE_FP_CONFIG = $101B
#CL_DEVICE_GLOBAL_MEM_CACHE_TYPE = $101C
#CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE = $101D
#CL_DEVICE_GLOBAL_MEM_CACHE_SIZE = $101E
#CL_DEVICE_GLOBAL_MEM_SIZE = $101F
#CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE = $1020
#CL_DEVICE_MAX_CONSTANT_ARGS = $1021
#CL_DEVICE_LOCAL_MEM_TYPE = $1022
#CL_DEVICE_LOCAL_MEM_SIZE = $1023
#CL_DEVICE_ERROR_CORRECTION_SUPPORT = $1024
#CL_DEVICE_PROFILING_TIMER_RESOLUTION = $1025
#CL_DEVICE_ENDIAN_LITTLE = $1026
#CL_DEVICE_AVAILABLE = $1027
#CL_DEVICE_COMPILER_AVAILABLE = $1028
#CL_DEVICE_EXECUTION_CAPABILITIES = $1029
#CL_DEVICE_QUEUE_PROPERTIES = $102A
#CL_DEVICE_NAME = $102B
#CL_DEVICE_VENDOR = $102C
#CL_DRIVER_VERSION = $102D
#CL_DEVICE_PROFILE = $102E
#CL_DEVICE_VERSION = $102F
#CL_DEVICE_EXTENSIONS = $1030
#CL_DEVICE_PLATFORM = $1031
#CL_DEVICE_DOUBLE_FP_CONFIG = $1032

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< 0x1033 reserved for CL_DEVICE_HALF_FP_CONFIG <<<<<

#CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF = $1034
#CL_DEVICE_HOST_UNIFIED_MEMORY = $1035
#CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR = $1036
#CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT = $1037
#CL_DEVICE_NATIVE_VECTOR_WIDTH_INT = $1038
#CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG = $1039
#CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT = $103A
#CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE = $103B
#CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF = $103C
#CL_DEVICE_OPENCL_C_VERSION = $103D
#CL_DEVICE_LINKER_AVAILABLE = $103E
#CL_DEVICE_BUILT_IN_KERNELS = $103F
#CL_DEVICE_IMAGE_MAX_BUFFER_SIZE = $1040
#CL_DEVICE_IMAGE_MAX_ARRAY_SIZE = $1041
#CL_DEVICE_PARENT_DEVICE = $1042
#CL_DEVICE_PARTITION_MAX_SUB_DEVICES = $1043
#CL_DEVICE_PARTITION_PROPERTIES = $1044
#CL_DEVICE_PARTITION_AFFINITY_DOMAIN = $1045
#CL_DEVICE_PARTITION_TYPE = $1046
#CL_DEVICE_REFERENCE_COUNT = $1047
#CL_DEVICE_PREFERRED_INTEROP_USER_SYNC = $1048
#CL_DEVICE_PRINTF_BUFFER_SIZE = $1049

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_fp_config - bitfield <<<<<

#CL_FP_DENORM = (1 << 0)
#CL_FP_INF_NAN = (1 << 1)
#CL_FP_ROUND_TO_NEAREST = (1 << 2)
#CL_FP_ROUND_TO_ZERO = (1 << 3)
#CL_FP_ROUND_TO_INF = (1 << 4)
#CL_FP_FMA = (1 << 5)
#CL_FP_SOFT_FLOAT = (1 << 6)
#CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT = (1 << 7)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_mem_cache_type <<<<<

#CL_NONE = $0
#CL_READ_ONLY_CACHE = $1
#CL_READ_WRITE_CACHE = $2

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_local_mem_type <<<<<

#CL_LOCAL = $1
#CL_GLOBAL = $2

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_exec_capabilities - bitfield <<<<<

#CL_EXEC_KERNEL = (1 << 0)
#CL_EXEC_NATIVE_KERNEL = (1 << 1)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_command_queue_properties - bitfield <<<<<

#CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE = (1 << 0)
#CL_QUEUE_PROFILING_ENABLE = (1 << 1)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_context_info <<<<<

#CL_CONTEXT_REFERENCE_COUNT = $1080
#CL_CONTEXT_DEVICES = $1081
#CL_CONTEXT_PROPERTIES = $1082
#CL_CONTEXT_NUM_DEVICES = $1083

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_context_properties <<<<<

#CL_CONTEXT_PLATFORM = $1084
#CL_CONTEXT_INTEROP_USER_SYNC = $1085

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_partition_property <<<<<

#CL_DEVICE_PARTITION_EQUALLY = $1086
#CL_DEVICE_PARTITION_BY_COUNTS = $1087
#CL_DEVICE_PARTITION_BY_COUNTS_LIST_END = $0
#CL_DEVICE_PARTITION_BY_AFFINITY_DOMAIN = $1088

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_device_affinity_domain <<<<<

#CL_DEVICE_AFFINITY_DOMAIN_NUMA = (1 << 0)
#CL_DEVICE_AFFINITY_DOMAIN_L4_CACHE = (1 << 1)
#CL_DEVICE_AFFINITY_DOMAIN_L3_CACHE = (1 << 2)
#CL_DEVICE_AFFINITY_DOMAIN_L2_CACHE = (1 << 3)
#CL_DEVICE_AFFINITY_DOMAIN_L1_CACHE = (1 << 4)
#CL_DEVICE_AFFINITY_DOMAIN_NEXT_PARTITIONABLE = (1 << 5)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_command_queue_info <<<<<

#CL_QUEUE_CONTEXT = $1090
#CL_QUEUE_DEVICE = $1091
#CL_QUEUE_REFERENCE_COUNT = $1092
#CL_QUEUE_PROPERTIES = $1093

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_mem_flags - bitfield <<<<<

#CL_MEM_READ_WRITE = (1 << 0)
#CL_MEM_WRITE_ONLY = (1 << 1)
#CL_MEM_READ_ONLY = (1 << 2)
#CL_MEM_USE_HOST_PTR = (1 << 3)
#CL_MEM_ALLOC_HOST_PTR = (1 << 4)
#CL_MEM_COPY_HOST_PTR = (1 << 5)
#CL_MEM_HOST_WRITE_ONLY = (1 << 7)
#CL_MEM_HOST_READ_ONLY = (1 << 8)
#CL_MEM_HOST_NO_ACCESS = (1 << 9)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_mem_migration_flags - bitfield <<<<<

#CL_MIGRATE_MEM_OBJECT_HOST = (1 << 0)
#CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED = (1 << 1)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_channel_order <<<<<

#CL_R = $10B0
#CL_A = $10B1
#CL_RG = $10B2
#CL_RA = $10B3
#CL_RGB = $10B4
#CL_RGBA = $10B5
#CL_BGRA = $10B6
#CL_ARGB = $10B7
#CL_INTENSITY = $10B8
#CL_LUMINANCE = $10B9
#CL_Rx = $10BA
#CL_RGx = $10BB
#CL_RGBx = $10BC

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_channel_type <<<<<

#CL_SNORM_INT8 = $10D0
#CL_SNORM_INT16 = $10D1
#CL_UNORM_INT8 = $10D2
#CL_UNORM_INT16 = $10D3
#CL_UNORM_SHORT_565 = $10D4
#CL_UNORM_SHORT_555 = $10D5
#CL_UNORM_INT_101010 = $10D6
#CL_SIGNED_INT8 = $10D7
#CL_SIGNED_INT16 = $10D8
#CL_SIGNED_INT32 = $10D9
#CL_UNSIGNED_INT8 = $10DA
#CL_UNSIGNED_INT16 = $10DB
#CL_UNSIGNED_INT32 = $10DC
#CL_HALF_FLOAT = $10DD
#CL_FLOAT = $10DE

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_mem_object_type <<<<<

#CL_MEM_OBJECT_BUFFER = $10F0
#CL_MEM_OBJECT_IMAGE2D = $10F1
#CL_MEM_OBJECT_IMAGE3D = $10F2
#CL_MEM_OBJECT_IMAGE2D_ARRAY = $10F3
#CL_MEM_OBJECT_IMAGE1D = $10F4
#CL_MEM_OBJECT_IMAGE1D_ARRAY = $10F5
#CL_MEM_OBJECT_IMAGE1D_BUFFER = $10F6

; <<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_mem_info <<<<<

#CL_MEM_TYPE = $1100
#CL_MEM_FLAGS = $1101
#CL_MEM_SIZE = $1102
#CL_MEM_HOST_PTR = $1103
#CL_MEM_MAP_COUNT = $1104
#CL_MEM_REFERENCE_COUNT = $1105
#CL_MEM_CONTEXT = $1106
#CL_MEM_ASSOCIATED_MEMOBJECT = $1107
#CL_MEM_OFFSET = $1108

; <<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_image_info <<<<<

#CL_IMAGE_FORMAT = $1110
#CL_IMAGE_ELEMENT_SIZE = $1111
#CL_IMAGE_ROW_PITCH = $1112
#CL_IMAGE_SLICE_PITCH = $1113
#CL_IMAGE_WIDTH = $1114
#CL_IMAGE_HEIGHT = $1115
#CL_IMAGE_DEPTH = $1116
#CL_IMAGE_ARRAY_SIZE = $1117
#CL_IMAGE_BUFFER = $1118
#CL_IMAGE_NUM_MIP_LEVELS = $1119
#CL_IMAGE_NUM_SAMPLES = $111A

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_addressing_mode <<<<<

#CL_ADDRESS_NONE = $1130
#CL_ADDRESS_CLAMP_TO_EDGE = $1131
#CL_ADDRESS_CLAMP = $1132
#CL_ADDRESS_REPEAT = $1133
#CL_ADDRESS_MIRRORED_REPEAT = $1134

; <<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_filter_mode <<<<<

#CL_FILTER_NEAREST = $1140
#CL_FILTER_LINEAR = $1141

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_sampler_info <<<<<

#CL_SAMPLER_REFERENCE_COUNT = $1150
#CL_SAMPLER_CONTEXT = $1151
#CL_SAMPLER_NORMALIZED_COORDS = $1152
#CL_SAMPLER_ADDRESSING_MODE = $1153
#CL_SAMPLER_FILTER_MODE = $1154

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_map_flags - bitfield <<<<<

#CL_MAP_READ = (1 << 0)
#CL_MAP_WRITE = (1 << 1)
#CL_MAP_WRITE_INVALIDATE_REGION = (1 << 2)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_program_info <<<<<

#CL_PROGRAM_REFERENCE_COUNT = $1160
#CL_PROGRAM_CONTEXT = $1161
#CL_PROGRAM_NUM_DEVICES = $1162
#CL_PROGRAM_DEVICES = $1163
#CL_PROGRAM_SOURCE = $1164
#CL_PROGRAM_BINARY_SIZES = $1165
#CL_PROGRAM_BINARIES = $1166
#CL_PROGRAM_NUM_KERNELS = $1167
#CL_PROGRAM_KERNEL_NAMES = $1168

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_program_build_info <<<<<

#CL_PROGRAM_BUILD_STATUS = $1181
#CL_PROGRAM_BUILD_OPTIONS = $1182
#CL_PROGRAM_BUILD_LOG = $1183
#CL_PROGRAM_BINARY_TYPE = $1184

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_program_binary_type <<<<<

#CL_PROGRAM_BINARY_TYPE_NONE = $0
#CL_PROGRAM_BINARY_TYPE_COMPILED_OBJECT = $1
#CL_PROGRAM_BINARY_TYPE_LIBRARY = $2
#CL_PROGRAM_BINARY_TYPE_EXECUTABLE = $4

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_build_status <<<<<

#CL_BUILD_SUCCESS = 0
#CL_BUILD_NONE = -1
#CL_BUILD_ERROR = -2
#CL_BUILD_IN_PROGRESS = -3

; <<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_info <<<<<

#CL_KERNEL_FUNCTION_NAME = $1190
#CL_KERNEL_NUM_ARGS = $1191
#CL_KERNEL_REFERENCE_COUNT = $1192
#CL_KERNEL_CONTEXT = $1193
#CL_KERNEL_PROGRAM = $1194
#CL_KERNEL_ATTRIBUTES = $1195

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_arg_info <<<<<

#CL_KERNEL_ARG_ADDRESS_QUALIFIER = $1196
#CL_KERNEL_ARG_ACCESS_QUALIFIER = $1197
#CL_KERNEL_ARG_TYPE_NAME = $1198
#CL_KERNEL_ARG_TYPE_QUALIFIER = $1199
#CL_KERNEL_ARG_NAME = $119A

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_arg_address_qualifier <<<<<

#CL_KERNEL_ARG_ADDRESS_GLOBAL = $119B
#CL_KERNEL_ARG_ADDRESS_LOCAL = $119C
#CL_KERNEL_ARG_ADDRESS_CONSTANT = $119D
#CL_KERNEL_ARG_ADDRESS_PRIVATE = $119E

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_arg_access_qualifier <<<<<

#CL_KERNEL_ARG_ACCESS_READ_ONLY = $11A0
#CL_KERNEL_ARG_ACCESS_WRITE_ONLY = $11A1
#CL_KERNEL_ARG_ACCESS_READ_WRITE = $11A2
#CL_KERNEL_ARG_ACCESS_NONE = $11A3

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_arg_type_qualifer <<<<<

#CL_KERNEL_ARG_TYPE_NONE = 0
#CL_KERNEL_ARG_TYPE_CONST = (1 << 0)
#CL_KERNEL_ARG_TYPE_RESTRICT = (1 << 1)
#CL_KERNEL_ARG_TYPE_VOLATILE = (1 << 2)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_kernel_work_group_info <<<<<

#CL_KERNEL_WORK_GROUP_SIZE = $11B0
#CL_KERNEL_COMPILE_WORK_GROUP_SIZE = $11B1
#CL_KERNEL_LOCAL_MEM_SIZE = $11B2
#CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = $11B3
#CL_KERNEL_PRIVATE_MEM_SIZE = $11B4
#CL_KERNEL_GLOBAL_WORK_SIZE = $11B5

; <<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_event_info <<<<<

#CL_EVENT_COMMAND_QUEUE = $11D0
#CL_EVENT_COMMAND_TYPE = $11D1
#CL_EVENT_REFERENCE_COUNT = $11D2
#CL_EVENT_COMMAND_EXECUTION_STATUS = $11D3
#CL_EVENT_CONTEXT = $11D4

; <<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_command_type <<<<<

#CL_COMMAND_NDRANGE_KERNEL = $11F0
#CL_COMMAND_TASK = $11F1
#CL_COMMAND_NATIVE_KERNEL = $11F2
#CL_COMMAND_READ_BUFFER = $11F3
#CL_COMMAND_WRITE_BUFFER = $11F4
#CL_COMMAND_COPY_BUFFER = $11F5
#CL_COMMAND_READ_IMAGE = $11F6
#CL_COMMAND_WRITE_IMAGE = $11F7
#CL_COMMAND_COPY_IMAGE = $11F8
#CL_COMMAND_COPY_IMAGE_TO_BUFFER = $11F9
#CL_COMMAND_COPY_BUFFER_TO_IMAGE = $11FA
#CL_COMMAND_MAP_BUFFER = $11FB
#CL_COMMAND_MAP_IMAGE = $11FC
#CL_COMMAND_UNMAP_MEM_OBJECT = $11FD
#CL_COMMAND_MARKER = $11FE
#CL_COMMAND_ACQUIRE_GL_OBJECTS = $11FF
#CL_COMMAND_RELEASE_GL_OBJECTS = $1200
#CL_COMMAND_READ_BUFFER_RECT = $1201
#CL_COMMAND_WRITE_BUFFER_RECT = $1202
#CL_COMMAND_COPY_BUFFER_RECT = $1203
#CL_COMMAND_USER = $1204
#CL_COMMAND_BARRIER = $1205
#CL_COMMAND_MIGRATE_MEM_OBJECTS = $1206
#CL_COMMAND_FILL_BUFFER = $1207
#CL_COMMAND_FILL_IMAGE = $1208

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< command execution status <<<<<

#CL_COMPLETE = $0
#CL_RUNNING = $1
#CL_SUBMITTED = $2
#CL_QUEUED = $3

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_buffer_create_type <<<<<

#CL_BUFFER_CREATE_TYPE_REGION = $1220

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< cl_profiling_info <<<<<

#CL_PROFILING_COMMAND_QUEUED = $1280
#CL_PROFILING_COMMAND_SUBMIT = $1281
#CL_PROFILING_COMMAND_START = $1282
#CL_PROFILING_COMMAND_END = $1283


Procedure.s clErrorMessage(ErrorID)
 
  Protected OutputMessage.s
 
  Select ErrorID
     
    Case #CL_DEVICE_NOT_FOUND
      OutputMessage = "CL DEVICE NOT FOUND"
     
    Case #CL_DEVICE_NOT_AVAILABLE
      OutputMessage = "CL DEVICE NOT AVAILABLE"
     
    Case #CL_COMPILER_NOT_AVAILABLE
      OutputMessage = "CL COMPILER NOT AVAILABLE"
     
    Case #CL_MEM_OBJECT_ALLOCATION_FAILURE
      OutputMessage = "CL MEM OBJECT ALLOCATION FAILURE"
     
    Case #CL_OUT_OF_RESOURCES
      OutputMessage = "CL OUT OF RESOURCES"
     
    Case #CL_OUT_OF_HOST_MEMORY
      OutputMessage = "CL OUT OF HOST MEMORY"
     
    Case #CL_PROFILING_INFO_NOT_AVAILABLE
      OutputMessage = "CL PROFILING INFO NOT AVAILABLE"
     
    Case #CL_MEM_COPY_OVERLAP
      OutputMessage = "CL MEM COPY OVERLAP"
     
    Case #CL_IMAGE_FORMAT_MISMATCH
      OutputMessage = "CL IMAGE FORMAT MISMATCH"
     
    Case #CL_IMAGE_FORMAT_NOT_SUPPORTED
      OutputMessage = "CL IMAGE FORMAT NOT SUPPORTED"
     
    Case #CL_BUILD_PROGRAM_FAILURE
      OutputMessage = "CL BUILD PROGRAM FAILURE"
     
    Case #CL_MAP_FAILURE
      OutputMessage = "CL MAP FAILURE"
     
    Case #CL_MISALIGNED_SUB_BUFFER_OFFSET
      OutputMessage = "CL MISALIGNED SUB BUFFER OFFSET"
     
    Case #CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST
      OutputMessage = "CL EXEC STATUS ERROR FOR EVENTS IN WAIT LIST"
     
    Case #CL_COMPILE_PROGRAM_FAILURE
      OutputMessage = "CL COMPILE PROGRAM FAILURE"
     
    Case #CL_LINKER_NOT_AVAILABLE
      OutputMessage = "CL LINKER NOT AVAILABLE"
     
    Case #CL_LINK_PROGRAM_FAILURE
      OutputMessage = "CL LINK PROGRAM FAILURE"
     
    Case #CL_DEVICE_PARTITION_FAILED
      OutputMessage = "CL DEVICE PARTITION FAILED"
     
    Case #CL_KERNEL_ARG_INFO_NOT_AVAILABLE
      OutputMessage = "CL KERNEL ARG INFO NOT AVAILABLE"
     
    Case #CL_INVALID_VALUE
      OutputMessage = "CL INVALID VALUE"
     
    Case #CL_INVALID_DEVICE_TYPE
      OutputMessage = "CL INVALID DEVICE TYPE"
     
    Case #CL_INVALID_PLATFORM
      OutputMessage = "CL INVALID PLATFORM"
     
    Case #CL_INVALID_DEVICE
      OutputMessage = "CL INVALID DEVICE"
     
    Case #CL_INVALID_CONTEXT
      OutputMessage = "CL INVALID CONTEXT"
     
    Case #CL_INVALID_QUEUE_PROPERTIES
      OutputMessage = "CL INVALID QUEUE PROPERTIES"
     
    Case #CL_INVALID_COMMAND_QUEUE
      OutputMessage = "CL INVALID COMMAND QUEUE"
     
    Case #CL_INVALID_HOST_PTR
      OutputMessage = "CL INVALID HOST PTR"
     
    Case #CL_INVALID_MEM_OBJECT
      OutputMessage = "CL INVALID MEM OBJECT"
     
    Case #CL_INVALID_IMAGE_FORMAT_DESCRIPTOR
      OutputMessage = "CL INVALID IMAGE FORMAT DESCRIPTOR"
     
    Case #CL_INVALID_IMAGE_SIZE
      OutputMessage = "CL INVALID IMAGE SIZE"
     
    Case #CL_INVALID_SAMPLER
      OutputMessage = "CL INVALID SAMPLER"
     
    Case #CL_INVALID_BINARY
      OutputMessage = "CL INVALID BINARY"
     
    Case #CL_INVALID_BUILD_OPTIONS
      OutputMessage = "CL INVALID BUILD OPTIONS"
     
    Case #CL_INVALID_PROGRAM
      OutputMessage = "CL INVALID PROGRAM"
     
    Case #CL_INVALID_PROGRAM_EXECUTABLE
      OutputMessage = "CL INVALID PROGRAM EXECUTABLE"
     
    Case #CL_INVALID_KERNEL_NAME
      OutputMessage = "CL INVALID KERNEL NAME"
     
    Case #CL_INVALID_KERNEL_DEFINITION
      OutputMessage = "CL INVALID KERNEL DEFINITION"
     
    Case #CL_INVALID_KERNEL
      OutputMessage = "CL INVALID KERNEL"
     
    Case #CL_INVALID_ARG_INDEX
      OutputMessage = "CL INVALID ARG INDEX"
     
    Case #CL_INVALID_ARG_VALUE
      OutputMessage = "CL INVALID ARG VALUE"
     
    Case #CL_INVALID_ARG_SIZE
      OutputMessage = "CL INVALID ARG SIZE"
     
    Case #CL_INVALID_KERNEL_ARGS
      OutputMessage = "CL INVALID KERNEL ARGS"
     
    Case #CL_INVALID_WORK_DIMENSION
      OutputMessage = "CL INVALID WORK DIMENSION"
     
    Case #CL_INVALID_WORK_GROUP_SIZE
      OutputMessage = "CL INVALID WORK GROUP SIZE"
     
    Case #CL_INVALID_WORK_ITEM_SIZE
      OutputMessage = "CL INVALID WORK ITEM SIZE"
     
    Case #CL_INVALID_GLOBAL_OFFSET
      OutputMessage = "CL INVALID GLOBAL OFFSET"
     
    Case #CL_INVALID_EVENT_WAIT_LIST
      OutputMessage = "CL INVALID EVENT WAIT LIST"
     
    Case #CL_INVALID_EVENT
      OutputMessage = "CL INVALID EVENT"
     
    Case #CL_INVALID_OPERATION
      OutputMessage = "CL INVALID OPERATION"
     
    Case #CL_INVALID_GL_OBJECT
      OutputMessage = "CL INVALID GL OBJECT"
     
    Case #CL_INVALID_BUFFER_SIZE
      OutputMessage = "CL INVALID BUFFER SIZE"
     
    Case #CL_INVALID_MIP_LEVEL
      OutputMessage = "CL INVALID MIP LEVEL"
     
    Case #CL_INVALID_GLOBAL_WORK_SIZE
      OutputMessage = "CL INVALID GLOBAL WORK SIZE"
     
    Case #CL_INVALID_PROPERTY
      OutputMessage = "CL INVALID PROPERTY"
     
    Case #CL_INVALID_IMAGE_DESCRIPTOR
      OutputMessage = "CL INVALID IMAGE DESCRIPTOR"
     
    Case #CL_INVALID_COMPILER_OPTIONS
      OutputMessage = "CL INVALID COMPILER OPTIONS"
     
    Case #CL_INVALID_LINKER_OPTIONS
      OutputMessage = "CL INVALID LINKER OPTIONS"
     
    Case #CL_INVALID_DEVICE_PARTITION_COUNT
      OutputMessage = "CL INVALID DEVICE PARTITION COUNT"
     
    Default
      OutputMessage = "UNKNOWN (" + Str(ErrorID) + ")"
     
  EndSelect
 
  ProcedureReturn OutputMessage
EndProcedure

; Function imports
;
CompilerSelect #PB_Compiler_OS
   
  CompilerCase #PB_OS_Windows
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      #OpenCL_Lib_File_Name = "C:\Program Files (x86)\AMD APP\lib\x86/OpenCL.lib"
    CompilerElse
      #OpenCL_Lib_File_Name = "C:\Program Files (x86)\AMD APP\lib\x86_64/OpenCL.lib"
    CompilerEndIf

  CompilerCase #PB_OS_Linux
    #OpenCL_Lib_File_Name = "/usr/lib/libOpenCL.so"

  CompilerCase #PB_OS_MacOS
    #OpenCL_Lib_File_Name = "/System/Library/Frameworks/OpenCL.framework/OpenCL"

CompilerEndSelect


Import #OpenCL_Lib_File_Name
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Platform API <<<<<
 
  clGetPlatformIDs(num_entries, *platforms, *num_platforms) ; CL_API_SUFFIX__VERSION_1_0;
  clGetPlatformInfo(platform, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Device APIs <<<<<
 
  clGetDeviceIDs(platform, device_type.q, num_entries, *devices, *num_devices) ; CL_API_SUFFIX__VERSION_1_0;
  clGetDeviceInfo(device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateSubDevices(in_device, *properties, num_devices, *out_devices, *num_devices_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainDevice(device) ; CL_API_SUFFIX__VERSION_1_2;
  clReleaseDevice(device) ; CL_API_SUFFIX__VERSION_1_2;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Context APIs <<<<<
 
  clCreateContext(*properties, num_devices, *devices, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0
  clCreateContextFromType(*properties, device_type.q, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0
  clRetainContext(context) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseContext(context) ; CL_API_SUFFIX__VERSION_1_0;
  clGetContextInfo(context, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Command Queue APIs <<<<<
 
  clCreateCommandQueue(context, device, properties.q, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainCommandQueue(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseCommandQueue(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clGetCommandQueueInfo(command_queue, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Memory Object APIs <<<<<
 
  clCreateBuffer(context, flags.q, size, *host_ptr, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateSubBuffer(buffer, flags.q, buffer_create_type, *buffer_create_info, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_1;
  clCreateImage(context, flags.q, *image_format, *image_desc, *host_ptr, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainMemObject(memobj) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseMemObject(memobj) ; CL_API_SUFFIX__VERSION_1_0;
  clGetSupportedImageFormats(context, flags.q, image_type, num_entries, *image_formats, *num_image_formats) ; CL_API_SUFFIX__VERSION_1_0;
  clGetMemObjectInfo(memobj, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetImageInfo(image, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clSetMemObjectDestructorCallback(memobj, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_1
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Sampler APIs <<<<<
 
  clCreateSampler(context, normalized_coords, addressing_mode, filter_mode, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainSampler(sampler) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseSampler(sampler) ; CL_API_SUFFIX__VERSION_1_0;
  clGetSamplerInfo(sampler, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Program Object APIs <<<<<
 
  clCreateProgramWithSource(context, count, strings, *lengths, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateProgramWithBinary(context, num_devices, *device_list, *lengths, binaries, *binary_status, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateProgramWithBuiltInKernels(context, num_devices, *device_list, *kernel_names, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainProgram(program) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseProgram(program) ; CL_API_SUFFIX__VERSION_1_0;
  clBuildProgram(program, num_devices, *device_list, *options, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_0
  clCompileProgram(program, num_devices, *device_list, *options, num_input_headers, *input_headers, header_include_names, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_0
  clLinkProgram(context, num_devices, *device_list, *options, num_input_programs, *input_programs, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2
  clUnloadPlatformCompiler(platform) ; CL_API_SUFFIX__VERSION_1_2;
  clGetProgramInfo(program, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetProgramBuildInfo(program, device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Kernel Object APIs <<<<<
 
  clCreateKernel(program, *kernel_name, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateKernelsInProgram(program, num_kernels, *kernels, *num_kernels_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainKernel(kernel) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseKernel(kernel) ; CL_API_SUFFIX__VERSION_1_0;
  clSetKernelArg(kernel, arg_index, arg_size, *arg_value) ; CL_API_SUFFIX__VERSION_1_0;
  clGetKernelInfo(kernel, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetKernelArgInfo(kernel, arg_indx, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clGetKernelWorkGroupInfo(kernel, device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Event Object APIs <<<<<
 
  clWaitForEvents(num_events, *event_list) ; CL_API_SUFFIX__VERSION_1_0;
  clGetEventInfo(event, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateUserEvent(context, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_1;
  clRetainEvent(event) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseEvent(event) ; CL_API_SUFFIX__VERSION_1_0;
  clSetUserEventStatus(event, execution_status) ; CL_API_SUFFIX__VERSION_1_1;
  clSetEventCallback(event, command_exec_callback_type, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_1;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Profiling APIs <<<<<
 
  clGetEventProfilingInfo(event, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Flush and Finish APIs <<<<<
 
  clFlush(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clFinish(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Enqueued Commands APIs <<<<<
 
  clEnqueueReadBuffer(command_queue, buffer, blocking_read, offset, size, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueReadBufferRect(command_queue, buffer, blocking_read, *buffer_offset, *host_offset, *region, buffer_row_pitch, buffer_slice_pitch, host_row_pitch, host_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueWriteBuffer(command_queue, buffer, blocking_write, offset, size, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueWriteBufferRect(command_queue, buffer, blocking_write, *buffer_offset, *host_offset, *region, buffer_row_pitch, buffer_slice_pitch, host_row_pitch, host_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueFillBuffer(command_queue, buffer, *pattern, pattern_size, offset, size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueCopyBuffer(command_queue, src_buffer, dst_buffer, src_offset, dst_offset, size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyBufferRect(command_queue, src_buffer, dst_buffer, *src_origin, *dst_origin, *region, src_row_pitch, src_slice_pitch, dst_row_pitch, dst_slice_pitch, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueReadImage(command_queue, image, blocking_read, *origin, *region, row_pitch, slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueWriteImage(command_queue, image, blocking_write, *origin, *region, input_row_pitch, input_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueFillImage(command_queue, image, *fill_color, *origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueCopyImage(command_queue, src_image, dst_image, *src_origin, *dst_origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyImageToBuffer(command_queue, src_image, dst_buffer, *src_origin, *region, dst_offset, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyBufferToImage(command_queue, src_buffer, dst_image, src_offset, *dst_origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMapBuffer(command_queue, buffer, blocking_map, map_flags.q, offset, size, num_events_in_wait_list, *event_wait_list, *event, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMapImage(command_queue, image, blocking_map, map_flags.q, *origin, *region, *image_row_pitch, *image_slice_pitch, num_events_in_wait_list, *event_wait_list, *event, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueUnmapMemObject(command_queue, memobj, *mapped_ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMigrateMemObjects(command_queue, num_mem_objects, *mem_objects, flags.q, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueNDRangeKernel(command_queue, kernel, work_dim, *global_work_offset, *global_work_size, *local_work_size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueTask(command_queue, kernel, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueNativeKernel(command_queue,*user_func, *args, cb_args, num_mem_objects,*mem_list, *args_mem_loc, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMarkerWithWaitList(command_queue, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueBarrierWithWaitList(command_queue, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clSetPrintfCallback(context, *pfn_notify, *user_data); CL_API_SUFFIX__VERSION_1_2
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Deprecated OpenCL 1.1 APIs <<<<<
 
  clCreateImage2D(context, flags.q, *image_format, image_width, image_height, image_row_pitch, *host_ptr, *errcode_ret) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clCreateImage3D(context, flags.q, *image_format, image_width, image_height, image_depth, image_row_pitch, image_slice_pitch, *host_ptr, *errcode_ret) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueMarker(command_queue, *event) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueWaitForEvents(command_queue, num_events, *event_list) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueBarrier(command_queue) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clUnloadCompiler() ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
 
EndImport

Re: OpenCL

Posted: Sun Feb 03, 2013 4:23 pm
by Fred
Then the test file:

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; Project name : OpenCL Examples
; File Name : OpenCL Examples - 00.pb
; File version: 1.0.0
; Programming : OK
; Programmed by : Guimauve
; Date : 14-10-2012
; Last Update : 15-10-2012
; PureBasic code : 5.00 B5
; Platform : Windows, Linux, MacOS X
; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

XIncludeFile "OpenCL.pbi"


#DATA_SIZE = 128
#INPUT_SIZE = #DATA_SIZE * SizeOf(Float)
UseGPU = #True

Dim InputValues.f(#DATA_SIZE - 1)
Dim Result.f(#DATA_SIZE - 1)

For Index = 0 To #DATA_SIZE - 1
  InputValues(Index) = Random(10) + 1
Next

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Get the PlaformID <<<<<

clGetPlatformIDs(1, @PlatformID, #Null)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Connect to a Compute Device <<<<<

If UseGPU = #True
  err = clGetDeviceIDs(PlatformID, #CL_DEVICE_TYPE_GPU, 1, @ComputeDeviceID, #Null)
Else
  err = clGetDeviceIDs(PlatformID, #CL_DEVICE_TYPE_CPU, 1, @ComputeDeviceID, #Null)
EndIf

If err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to create a device group ! " + clErrorMessage(err))
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create a compute context <<<<<

ComputeContextID = clCreateContext(#Null, 1, @ComputeDeviceID, #Null, #Null, @err)

If ComputeContextID = #Null
  MessageRequester("Fatal Error", "Failed to create a compute context ! ")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create a command commands <<<<<

Commands = clCreateCommandQueue(ComputeContextID, ComputeDeviceID, #Null, @err)

If Commands = #Null
  MessageRequester("Fatal Error", "Failed to create a command commands!")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Load the KernelSource buffer <<<<<

If ReadFile(0, "square.cl")
 
  KernelSourceLen.q = Lof(0)
  KernelSource = AllocateMemory(KernelSourceLen)
 
  If KernelSource <> #Null
    ReadData(0, KernelSource, KernelSourceLen)
  Else
    MessageRequester("Fatal Error", "Failed to allocate memory to load the compute program source code!")
    End
  EndIf
 
  CloseFile(0)
Else
  MessageRequester("Fatal Error", "Failed read square.cl !")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create the compute program from the KernelSource buffer <<<<<

ProgramID = clCreateProgramWithSource(ComputeContextID, 1, @KernelSource, @KernelSourceLen, @err);

If ProgramID = #Null
  MessageRequester("Fatal Error", "Failed to create compute program!")
  End
EndIf


; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Build the program executable <<<<<

err = clBuildProgram(ProgramID, 0, #Null, #Null, #Null, #Null);

If err <> #CL_SUCCESS
 
  BuildLogBuffer = AllocateMemory(2048)
  MessageRequester("Fatal Error", "Failed to build program executable!")
  clGetProgramBuildInfo(ProgramID, ComputeDeviceID, #CL_PROGRAM_BUILD_LOG, MemorySize(BuildLogBuffer), @BuildLogBuffer, @len)
  MessageRequester("Build Log", PeekS(BuildLogBuffer, #PB_Ascii))
  End
 
Else
  BuildLogBuffer = AllocateMemory(2048)
  
  ;clGetProgramBuildInfo(ProgramID, ComputeDeviceID, #CL_PROGRAM_BUILD_LOG, MemorySize(BuildLogBuffer), @BuildLogBuffer, @len)

  ;If CreateFile(0, "Build Log")
  ;  WriteString(0, PeekS(BuildLogBuffer, #PB_Ascii))
  ;  CloseFile(0)
  ;EndIf
 
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create the compute kernel in the program we wish To run <<<<<

KernelID = clCreateKernel(ProgramID, @"square2", @err)

If KernelID = #Null Or err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to create compute kernel!")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create the input And output arrays in device memory For our calculation <<<<<

InputBuffer  = clCreateBuffer(ComputeContextID, #CL_MEM_READ_ONLY , #INPUT_SIZE, #Null, #Null)
OutputBuffer = clCreateBuffer(ComputeContextID, #CL_MEM_WRITE_ONLY, #INPUT_SIZE, #Null, #Null)

If InputBuffer = #Null Or OutputBuffer = #Null
  MessageRequester("Fatal Error", "Failed to allocate device memory!")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Write our Data set into the input Array in device memory <<<<<

err = clEnqueueWriteBuffer(Commands, InputBuffer, #CL_TRUE, 0, #INPUT_SIZE, @InputValues(), 0, #Null, #Null)

If err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to write to source array!")
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Set the arguments To our compute kernel <<<<<

err = clSetKernelArg(KernelID, 0, SizeOf(Integer), @InputBuffer);
err = err | clSetKernelArg(KernelID, 1, SizeOf(Integer), @OutputBuffer);

Count = #DATA_SIZE
err = err | clSetKernelArg(KernelID, 2, SizeOf(Long), @Count);

If err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to set kernel arguments ! " + clErrorMessage(err))
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Get the maximum work group size For executing the kernel on the device <<<<<

Local.i = 0

err = clGetKernelWorkGroupInfo(KernelID, ComputeDeviceID, #CL_KERNEL_WORK_GROUP_SIZE, SizeOf(Local), @Local, #Null);

If err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to retrieve kernel work group info ! " + clErrorMessage(err))
  End
EndIf

Debug "Nb cores: " + Local

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Execute the kernel over the entire range of our 1d input Data set <<<<<
; <<<<< using the maximum number of work group items For this device      <<<<<

GlobalCount = #INPUT_SIZE

err = clEnqueueNDRangeKernel(Commands, KernelID, 1, #Null, @GlobalCount, @Local, 0, #Null, #Null);

If err
  MessageRequester("Fatal Error", "Failed to execute kernel ! " + clErrorMessage(err))
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Wait For the command commands To get serviced before reading back results <<<<<

clFinish(Commands)

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Read back the results from the device To verify the output <<<<<

err = clEnqueueReadBuffer(Commands, OutputBuffer, #CL_TRUE, 0, SizeOf(Float) * #DATA_SIZE, @Result(), 0, #Null, #Null); 

If err <> #CL_SUCCESS
  MessageRequester("Fatal Error", "Failed to read output array! " + Str(err))
  End
EndIf

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Validate our results <<<<<

Correct = 0

For Index = 0 To #DATA_SIZE - 1
 
  If Result(Index) = InputValues(Index) * InputValues(Index)
    Correct + 1
  EndIf
 
  Debug StrF(Result(Index), 3) + " --> " + StrF(InputValues(Index) * InputValues(Index), 3)
 
Next

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Print a brief summary detailing the results <<<<<
 
MessageRequester("Summary", "Computed " + Str(Correct) + "/" + Str(#DATA_SIZE) + " correct values!")

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Shutdown And cleanup <<<<<

clReleaseMemObject(InputBuffer)
clReleaseMemObject(OutputBuffer)
clReleaseProgram(ProgramID)
clReleaseKernel(KernelID)
clReleaseCommandQueue(Commands)
clReleaseContext(ComputeContextID)
And finally the OpenCL program file: (named square.cl)

Code: Select all

__kernel void square2(__global float* input, __global float* output, const unsigned int count)
{
   int i = get_global_id(0);

   if (i < count)
      output[i] = input[i] * input[i];
}
Here it does compute the 128 squares operations in parallel, as my GPU has 256 units.

Re: OpenCL

Posted: Sun Feb 03, 2013 4:54 pm
by Guimauve
Hello,

I confirm the Fred's code work on LinuxMint 14.1 x64.
I still studying what I have done wrong in my example but so far it's not so different.

Thanks.
Guimauve

Re: OpenCL

Posted: Sun Feb 03, 2013 5:02 pm
by Fred
The imports were wrong, you needed some '.q' (cl_ulong) here and here. Some errors as well in the test sample, but so far it didn't take long to have it working.

Re: OpenCL

Posted: Sun Dec 27, 2015 10:38 pm
by Psychophanta
Failed creating executable here.
Win 8.1-x64, AMD APP SDK 3.0, machine: AMD A8, VGA AMD R5

Re: OpenCL

Posted: Fri Aug 05, 2016 5:10 pm
by wilbert
Last few days I've been playing a bit with OpenCL ( http://www.purebasic.fr/english/viewtop ... 13&t=66315 ).

While the code above was helpful, it doesn't work properly in it's current form.
- Windows needs Import for the x86 version, OSX and Linux need ImportC .
- The code doesn't work with unicode which is a problem for PB 5.50+ .

Things I did to make it work cross platform ...

For the import part:
- changed the paths for OSX and Linux
- added a macro so Import would be used on Windows and ImportC on OSX and Linux
- use of p-ascii prototype for clCreateProgramWithBuiltInKernels, clBuildProgram, clCompileProgram, clLinkProgram and clCreateKernel imports.

Code: Select all

; Function imports
;
CompilerSelect #PB_Compiler_OS
   
  CompilerCase #PB_OS_Windows
    CompilerIf #PB_Compiler_Processor = #PB_Processor_x86
      #OpenCL_Lib_File_Name = "C:\Program Files (x86)\AMD APP\lib\x86/OpenCL.lib"
    CompilerElse
      #OpenCL_Lib_File_Name = "C:\Program Files (x86)\AMD APP\lib\x86_64/OpenCL.lib"
    CompilerEndIf

  CompilerCase #PB_OS_Linux
    #OpenCL_Lib_File_Name = "-l OpenCL"

  CompilerCase #PB_OS_MacOS
    #OpenCL_Lib_File_Name = "-framework OpenCL"

CompilerEndSelect


CompilerIf #PB_Compiler_OS = #PB_OS_Windows
  Macro ImportCL:Import:EndMacro
  Macro EndImportCL:EndImport:EndMacro
CompilerElse
  Macro ImportCL:ImportC:EndMacro
  Macro EndImportCL:EndImport:EndMacro
CompilerEndIf


ImportCL #OpenCL_Lib_File_Name
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Platform API <<<<<
 
  clGetPlatformIDs(num_entries, *platforms, *num_platforms) ; CL_API_SUFFIX__VERSION_1_0;
  clGetPlatformInfo(platform, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Device APIs <<<<<
 
  clGetDeviceIDs(platform, device_type.q, num_entries, *devices, *num_devices) ; CL_API_SUFFIX__VERSION_1_0;
  clGetDeviceInfo(device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateSubDevices(in_device, *properties, num_devices, *out_devices, *num_devices_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainDevice(device) ; CL_API_SUFFIX__VERSION_1_2;
  clReleaseDevice(device) ; CL_API_SUFFIX__VERSION_1_2;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Context APIs <<<<<
 
  clCreateContext(*properties, num_devices, *devices, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0
  clCreateContextFromType(*properties, device_type.q, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0
  clRetainContext(context) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseContext(context) ; CL_API_SUFFIX__VERSION_1_0;
  clGetContextInfo(context, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Command Queue APIs <<<<<
 
  clCreateCommandQueue(context, device, properties.q, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainCommandQueue(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseCommandQueue(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clGetCommandQueueInfo(command_queue, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Memory Object APIs <<<<<
 
  clCreateBuffer(context, flags.q, size, *host_ptr, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateSubBuffer(buffer, flags.q, buffer_create_type, *buffer_create_info, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_1;
  clCreateImage(context, flags.q, *image_format, *image_desc, *host_ptr, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainMemObject(memobj) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseMemObject(memobj) ; CL_API_SUFFIX__VERSION_1_0;
  clGetSupportedImageFormats(context, flags.q, image_type, num_entries, *image_formats, *num_image_formats) ; CL_API_SUFFIX__VERSION_1_0;
  clGetMemObjectInfo(memobj, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetImageInfo(image, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clSetMemObjectDestructorCallback(memobj, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_1
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Sampler APIs <<<<<
 
  clCreateSampler(context, normalized_coords, addressing_mode, filter_mode, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainSampler(sampler) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseSampler(sampler) ; CL_API_SUFFIX__VERSION_1_0;
  clGetSamplerInfo(sampler, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Program Object APIs <<<<<
 
  clCreateProgramWithSource(context, count, strings, *lengths, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateProgramWithBinary(context, num_devices, *device_list, *lengths, binaries, *binary_status, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateProgramWithBuiltInKernels(context, num_devices, *device_list, kernel_names.p-ascii, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clRetainProgram(program) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseProgram(program) ; CL_API_SUFFIX__VERSION_1_0;
  clBuildProgram(program, num_devices, *device_list, options.p-ascii, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_0
  clCompileProgram(program, num_devices, *device_list, options.p-ascii, num_input_headers, *input_headers, header_include_names, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_0
  clLinkProgram(context, num_devices, *device_list, options.p-ascii, num_input_programs, *input_programs, *pfn_notify, *user_data, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_2
  clUnloadPlatformCompiler(platform) ; CL_API_SUFFIX__VERSION_1_2;
  clGetProgramInfo(program, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetProgramBuildInfo(program, device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Kernel Object APIs <<<<<
 
  clCreateKernel(program, kernel_name.p-ascii, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateKernelsInProgram(program, num_kernels, *kernels, *num_kernels_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clRetainKernel(kernel) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseKernel(kernel) ; CL_API_SUFFIX__VERSION_1_0;
  clSetKernelArg(kernel, arg_index, arg_size, *arg_value) ; CL_API_SUFFIX__VERSION_1_0;
  clGetKernelInfo(kernel, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clGetKernelArgInfo(kernel, arg_indx, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_2;
  clGetKernelWorkGroupInfo(kernel, device, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Event Object APIs <<<<<
 
  clWaitForEvents(num_events, *event_list) ; CL_API_SUFFIX__VERSION_1_0;
  clGetEventInfo(event, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clCreateUserEvent(context, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_1;
  clRetainEvent(event) ; CL_API_SUFFIX__VERSION_1_0;
  clReleaseEvent(event) ; CL_API_SUFFIX__VERSION_1_0;
  clSetUserEventStatus(event, execution_status) ; CL_API_SUFFIX__VERSION_1_1;
  clSetEventCallback(event, command_exec_callback_type, *pfn_notify, *user_data) ; CL_API_SUFFIX__VERSION_1_1;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Profiling APIs <<<<<
 
  clGetEventProfilingInfo(event, param_name, param_value_size, *param_value, *param_value_size_ret) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Flush and Finish APIs <<<<<
 
  clFlush(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
  clFinish(command_queue) ; CL_API_SUFFIX__VERSION_1_0;
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Enqueued Commands APIs <<<<<
 
  clEnqueueReadBuffer(command_queue, buffer, blocking_read, offset, size, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueReadBufferRect(command_queue, buffer, blocking_read, *buffer_offset, *host_offset, *region, buffer_row_pitch, buffer_slice_pitch, host_row_pitch, host_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueWriteBuffer(command_queue, buffer, blocking_write, offset, size, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueWriteBufferRect(command_queue, buffer, blocking_write, *buffer_offset, *host_offset, *region, buffer_row_pitch, buffer_slice_pitch, host_row_pitch, host_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueFillBuffer(command_queue, buffer, *pattern, pattern_size, offset, size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueCopyBuffer(command_queue, src_buffer, dst_buffer, src_offset, dst_offset, size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyBufferRect(command_queue, src_buffer, dst_buffer, *src_origin, *dst_origin, *region, src_row_pitch, src_slice_pitch, dst_row_pitch, dst_slice_pitch, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_1;
  clEnqueueReadImage(command_queue, image, blocking_read, *origin, *region, row_pitch, slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueWriteImage(command_queue, image, blocking_write, *origin, *region, input_row_pitch, input_slice_pitch, *ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueFillImage(command_queue, image, *fill_color, *origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueCopyImage(command_queue, src_image, dst_image, *src_origin, *dst_origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyImageToBuffer(command_queue, src_image, dst_buffer, *src_origin, *region, dst_offset, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueCopyBufferToImage(command_queue, src_buffer, dst_image, src_offset, *dst_origin, *region, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMapBuffer(command_queue, buffer, blocking_map, map_flags.q, offset, size, num_events_in_wait_list, *event_wait_list, *event, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMapImage(command_queue, image, blocking_map, map_flags.q, *origin, *region, *image_row_pitch, *image_slice_pitch, num_events_in_wait_list, *event_wait_list, *event, *errcode_ret) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueUnmapMemObject(command_queue, memobj, *mapped_ptr, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMigrateMemObjects(command_queue, num_mem_objects, *mem_objects, flags.q, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueNDRangeKernel(command_queue, kernel, work_dim, *global_work_offset, *global_work_size, *local_work_size, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueTask(command_queue, kernel, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueNativeKernel(command_queue,*user_func, *args, cb_args, num_mem_objects,*mem_list, *args_mem_loc, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_0;
  clEnqueueMarkerWithWaitList(command_queue, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clEnqueueBarrierWithWaitList(command_queue, num_events_in_wait_list, *event_wait_list, *event) ; CL_API_SUFFIX__VERSION_1_2;
  clSetPrintfCallback(context, *pfn_notify, *user_data); CL_API_SUFFIX__VERSION_1_2
 
  ; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  ; <<<<< Deprecated OpenCL 1.1 APIs <<<<<
 
  clCreateImage2D(context, flags.q, *image_format, image_width, image_height, image_row_pitch, *host_ptr, *errcode_ret) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clCreateImage3D(context, flags.q, *image_format, image_width, image_height, image_depth, image_row_pitch, image_slice_pitch, *host_ptr, *errcode_ret) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueMarker(command_queue, *event) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueWaitForEvents(command_queue, num_events, *event_list) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clEnqueueBarrier(command_queue) ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
  clUnloadCompiler() ; CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
 
EndImportCL
To the test code:
- change the lines with clBuildProgram and clCreateKernel to match the updated declaration.
- GlobalCount has to be a multiple of Local !

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Build the program executable <<<<<

err = clBuildProgram(ProgramID, 0, #Null, "", #Null, #Null);

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Create the compute kernel in the program we wish To run <<<<<

KernelID = clCreateKernel(ProgramID, "square2", @err)

Code: Select all

; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
; <<<<< Execute the kernel over the entire range of our 1d input Data set <<<<<
; <<<<< using the maximum number of work group items For this device      <<<<<

GlobalCount = Local * ((#INPUT_SIZE + Local - 1) / Local)
You should also make sure that the square.cl program file is written in Ascii format !

Re: OpenCL

Posted: Mon May 02, 2022 4:37 pm
by pdwyer
It took me 6 years to realize that Wilbert has updated the fixes in OpenCL so that it would work :mrgreen: I should have checked back years ago

Actually, when it stopped working due to Unicode and the 64bit int in the err code that meant I couldn't see the error being thrown back to me, I just tossed it in the too hard basket and didn't really investigate. Recently though I had a building need to do some GPU coding so I was looking around and tried Etayson's Cuda code but it was too difficult for me and I learned enough about cuda to realize that it's not what I want (pre compiled modules to be added, nvidia only).

So I also made the minor tweaks to fix Guimauve code (although with uglier conversions, not your elegant .p-ascii method) and got it working (removing the ascii compileIf too). Basically Guimauve's version works fine

I ran some tests and was getting about 5-6x the speed of my CPU which isn't that impressive but I had done no optimization and I was copying several gb to the device and back to the host which was slowing it down a little. Doing some reading it seems you need to understand the core counts, blocks and dimensions better to format your data before you copy it to the device. Actually I don't know how to do this yet but I realised that I don't know anything about my card specs or how to query them so I can dynamically set up the data... optimization alone can apparently bring another order of magnitude more speed so I want to try this.

So, my code below is just some simple code to display details about the Platform and Device. I removed most of the error checking since the way it's done makes it a bit hard to read the code and this was also a learning exercise. I'm hoping to learn more about OpenCL to make it more accessible for when I need it. There are some other card params that can be queried but I think these are the main ones. see: https://www.khronos.org/registry/OpenCL ... eInfo.html

For those looking for an opencl.lib file there is one in nvidia's toolkit which is downloadable but likely if you search your drive you'll find it

Anyway, here is my attempt a simple opencl info code and an example of the output.
I'd be interested to know what is returned by other card types, especially non-nvidia ones if you feel like sharing.

Next step is to do some study and then see if I can get a simple example of some performance benefits and when I get there I'll post. Hopefully I won't get distracted and it take me 6 years again

Code: Select all

EnableExplicit

XIncludeFile "OpenCL.pbi"

#ParamChar = 1
#ParamInt = 2

Structure ParamInfo
    ParamName.s
    ParamConst.i
    ParamType.i
EndStructure

Define PlatformParamCount.i = 4
Define DeviceParamCount.i = 19
Define err.l
Define PlatformID.i
Define ComputeDeviceID.i
Define ProgramID.i
Define DevErr.l
Define PlatErr.l
Define BufSize.l = 1024 

Define RetChar.s = Space(BufSize)
Define RetCharSize.i = BufSize
Define RetInt.i = 0
Define RetSizeInt.i = SizeOf(Integer)
Define RetSizeActual.i = 0
Define ParamLoop.i

Dim PlatformParams.ParamInfo(PlatformParamCount)
Dim DeviceParams.ParamInfo(DeviceParamCount)

;Set Platform Params
PlatformParams(0)\ParamName = "#CL_PLATFORM_NAME" : PlatformParams(0)\ParamConst = #CL_PLATFORM_NAME : PlatformParams(0)\ParamType = #ParamChar
PlatformParams(1)\ParamName = "#CL_PLATFORM_PROFILE" : PlatformParams(1)\ParamConst = #CL_PLATFORM_PROFILE : PlatformParams(1)\ParamType = #ParamChar
PlatformParams(2)\ParamName = "#CL_PLATFORM_VENDOR" : PlatformParams(2)\ParamConst = #CL_PLATFORM_VENDOR : PlatformParams(2)\ParamType = #ParamChar
PlatformParams(3)\ParamName = "#CL_PLATFORM_VERSION" : PlatformParams(3)\ParamConst = #CL_PLATFORM_VERSION : PlatformParams(3)\ParamType = #ParamChar

;Set Device Params
DeviceParams(0)\ParamName = "#CL_DEVICE_VENDOR" : DeviceParams(0)\ParamConst = #CL_DEVICE_VENDOR : DeviceParams(0)\ParamType = #ParamChar
DeviceParams(1)\ParamName = "#CL_DEVICE_NAME" : DeviceParams(1)\ParamConst = #CL_DEVICE_NAME : DeviceParams(1)\ParamType = #ParamChar
DeviceParams(2)\ParamName = "#CL_DRIVER_VERSION" : DeviceParams(2)\ParamConst = #CL_DRIVER_VERSION : DeviceParams(2)\ParamType = #ParamChar
DeviceParams(3)\ParamName = "#CL_DEVICE_VERSION" : DeviceParams(3)\ParamConst = #CL_DEVICE_VERSION : DeviceParams(3)\ParamType = #ParamChar
DeviceParams(4)\ParamName = "#CL_DEVICE_GLOBAL_MEM_SIZE" : DeviceParams(4)\ParamConst = #CL_DEVICE_GLOBAL_MEM_SIZE : DeviceParams(4)\ParamType = #ParamInt
DeviceParams(5)\ParamName = "#CL_DEVICE_LOCAL_MEM_SIZE" : DeviceParams(5)\ParamConst = #CL_DEVICE_LOCAL_MEM_SIZE : DeviceParams(5)\ParamType = #ParamInt
DeviceParams(6)\ParamName = "#CL_DEVICE_ADDRESS_BITS" : DeviceParams(6)\ParamConst = #CL_DEVICE_ADDRESS_BITS : DeviceParams(6)\ParamType = #ParamInt
DeviceParams(7)\ParamName = "#CL_DEVICE_IMAGE2D_MAX_WIDTH" : DeviceParams(7)\ParamConst = #CL_DEVICE_IMAGE2D_MAX_WIDTH : DeviceParams(7)\ParamType = #ParamInt
DeviceParams(8)\ParamName = "#CL_DEVICE_IMAGE2D_MAX_HEIGHT" : DeviceParams(8)\ParamConst = #CL_DEVICE_IMAGE2D_MAX_HEIGHT : DeviceParams(8)\ParamType = #ParamInt
DeviceParams(9)\ParamName = "#CL_DEVICE_MAX_COMPUTE_UNITS" : DeviceParams(9)\ParamConst = #CL_DEVICE_MAX_COMPUTE_UNITS : DeviceParams(9)\ParamType = #ParamInt
DeviceParams(10)\ParamName = "#CL_DEVICE_MAX_CLOCK_FREQUENCY" : DeviceParams(10)\ParamConst = #CL_DEVICE_MAX_CLOCK_FREQUENCY : DeviceParams(10)\ParamType = #ParamInt
DeviceParams(11)\ParamName = "#CL_DEVICE_MAX_MEM_ALLOC_SIZE" : DeviceParams(11)\ParamConst = #CL_DEVICE_MAX_MEM_ALLOC_SIZE : DeviceParams(11)\ParamType = #ParamInt
DeviceParams(12)\ParamName = "#CL_DEVICE_MAX_SAMPLERS" : DeviceParams(12)\ParamConst = #CL_DEVICE_MAX_SAMPLERS : DeviceParams(12)\ParamType = #ParamInt
DeviceParams(13)\ParamName = "#CL_DEVICE_MAX_WORK_GROUP_SIZE" : DeviceParams(13)\ParamConst = #CL_DEVICE_MAX_WORK_GROUP_SIZE : DeviceParams(13)\ParamType = #ParamInt
DeviceParams(14)\ParamName = "#CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS" : DeviceParams(14)\ParamConst = #CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : DeviceParams(14)\ParamType = #ParamInt
DeviceParams(15)\ParamName = "#CL_DEVICE_MAX_WORK_ITEM_SIZES" : DeviceParams(15)\ParamConst = #CL_DEVICE_MAX_WORK_ITEM_SIZES : DeviceParams(15)\ParamType = #ParamInt
DeviceParams(16)\ParamName = "#CL_DEVICE_GLOBAL_MEM_CACHE_SIZE" : DeviceParams(16)\ParamConst = #CL_DEVICE_GLOBAL_MEM_CACHE_SIZE : DeviceParams(16)\ParamType = #ParamInt
DeviceParams(17)\ParamName = "#CL_DEVICE_PROFILING_TIMER_RESOLUTION" : DeviceParams(17)\ParamConst = #CL_DEVICE_PROFILING_TIMER_RESOLUTION : DeviceParams(17)\ParamType = #ParamInt
DeviceParams(18)\ParamName = "#CL_DEVICE_EXTENSIONS" : DeviceParams(18)\ParamConst = #CL_DEVICE_EXTENSIONS : DeviceParams(18)\ParamType = #ParamChar


;Connect to platform and device
PlatErr = clGetPlatformIDs(1, @PlatformID, #Null)
DevErr = clGetDeviceIDs(PlatformID, #CL_DEVICE_TYPE_GPU, 1, @ComputeDeviceID, #Null) 

OpenConsole()
    If PlatformID > 0 And ComputeDeviceID > 0
        PrintN("GPU Connected!")
        PrintN("")
    Else
        PrintN("Something wrong with connection to GPU")
        PrintN("PlatErr (" + Str(PlatErr) + ")" + clErrorMessage(PlatErr))
        PrintN("DevErr  (" + Str(DevErr) + ")" + clErrorMessage(DevErr))
        Input()
        CloseConsole()
        End
    EndIf
    
    ;Platform Info
    PrintN("Platform Information")
    PrintN("")
    For ParamLoop = 0 To PlatformParamCount -1
        clGetPlatformInfo(PlatformID,PlatformParams(ParamLoop)\ParamConst,RetCharSize,@RetChar,@RetSizeActual)
        PrintN(PlatformParams(ParamLoop)\ParamName + Space(50 - Len(PlatformParams(ParamLoop)\ParamName)) + PeekS(@RetChar,BufSize,#PB_Ascii)) 
        RetChar.s = Space(BufSize)
    Next
    
    PrintN("")
    
    ;Device Info
    PrintN("Device Information")
    PrintN("")
    For ParamLoop = 0 To DeviceParamCount -1
        If DeviceParams(ParamLoop)\ParamType = #ParamChar
            clGetDeviceInfo(ComputeDeviceID,DeviceParams(ParamLoop)\ParamConst,RetCharSize,@RetChar,@RetSizeActual) 
            PrintN(DeviceParams(ParamLoop)\ParamName + Space(50 - Len(DeviceParams(ParamLoop)\ParamName)) + PeekS(@RetChar,BufSize,#PB_Ascii)) 
            RetChar.s = Space(BufSize)
        ElseIf DeviceParams(ParamLoop)\ParamType = #ParamInt
            clGetDeviceInfo(ComputeDeviceID, DeviceParams(ParamLoop)\ParamConst, RetSizeInt, @RetInt, @RetSizeActual)
            PrintN(DeviceParams(ParamLoop)\ParamName + Space(50 - Len(DeviceParams(ParamLoop)\ParamName)) + Str(RetInt))
        EndIf
    
    Next    
    
    ;end
    clReleaseProgram(ProgramID)
    Input()

CloseConsole()

And the output:

Code: Select all

GPU Connected!

Platform Information

#CL_PLATFORM_NAME                                 NVIDIA CUDA
#CL_PLATFORM_PROFILE                              FULL_PROFILE
#CL_PLATFORM_VENDOR                               NVIDIA Corporation
#CL_PLATFORM_VERSION                              OpenCL 3.0 CUDA 11.6.99

Device Information

#CL_DEVICE_VENDOR                                 NVIDIA Corporation
#CL_DEVICE_NAME                                   NVIDIA GeForce GTX 1650
#CL_DRIVER_VERSION                                511.65
#CL_DEVICE_VERSION                                OpenCL 3.0 CUDA
#CL_DEVICE_GLOBAL_MEM_SIZE                        4294508544
#CL_DEVICE_LOCAL_MEM_SIZE                         49152
#CL_DEVICE_ADDRESS_BITS                           64
#CL_DEVICE_IMAGE2D_MAX_WIDTH                      32768
#CL_DEVICE_IMAGE2D_MAX_HEIGHT                     32768
#CL_DEVICE_MAX_COMPUTE_UNITS                      14
#CL_DEVICE_MAX_CLOCK_FREQUENCY                    1740
#CL_DEVICE_MAX_MEM_ALLOC_SIZE                     1073627136
#CL_DEVICE_MAX_SAMPLERS                           32
#CL_DEVICE_MAX_WORK_GROUP_SIZE                    1024
#CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS               3
#CL_DEVICE_MAX_WORK_ITEM_SIZES                    3
#CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                  458752
#CL_DEVICE_PROFILING_TIMER_RESOLUTION             1000
#CL_DEVICE_EXTENSIONS                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_device_uuid cl_khr_pci_bus_info cl_khr_external_semaphore cl_khr_external_memory cl_khr_external_semaphore_win32 cl_khr_external_memory_win32


Re: OpenCL

Posted: Tue May 03, 2022 1:49 pm
by pdwyer
I made some progress on doing some real computations and comparing with normal CPU calculations.
While I still haven't done anything with passing structures or multi-dimensional arrays I have two cases below

The first one is single array of floats (millions of them) and do a sigmoid function on them which is reasonably heavy if in a tight loop.
The second one is 2 input arrays of floats and get the distance between them in the the sense of Pythagoras's theorem to calc hypotenuse. Similar to what you might do in 2d/3d vertex distance calculations.

Some learnings along the way:
- There is an overhead with copying data to and from the GPU. It's fast but it's still an overhead. Best performance is when there is heavier processing to do on a smaller data set. On my PC, Hypotenuse calc, being fairly light ends up only 2x faster on GPU but the sigmoid is about 30-40 times faster (see output). Likewise the card and PC are better at different things it seems
- [this is wrong]The number of items to put in the queue needs to be divisible by 64 (might be different on different cards) or it fails
- You can overflow the card with large data sets and need to break them up, it's hard to tell what memory threshold is being exceeded though
- if you don't check for an error on the main processing call (clEnqueueNDRangeKernel) then it will proceed as usual and copy data back but it's all empty
- You don't need to know much C unless you are doing something really complex (which is lucky) but knowing a little will help you debug.
- floating point calculations, being on different hardware chips show slight differences in results as you can see in my output. you need to keep this in mind if you are planning to compare for absolute likeness etc. EDIT doubles seem to have less of a problem here that floats. doubles seem faster than floats on my PC but seem about the same speed on the gpu

Outputs from my PC are below too so you can compare times.

I'd be interested to know if this works on other PCs (sorry for the WinAPI call for timing to the cross platform people)

EDIT
The second 'learning' above turned out to be incorrect.
The amount of data you can pass though in one call to "clEnqueueNDRangeKernel()"needs to fit into GPU Memory. So if I have two double arrays then that is 2x8bytes per item (if both arrays have the same item count. This needs to fit into #CL_DEVICE_GLOBAL_MEM_SIZE (minus memory in use which I haven't worked out how to query, in task manager it's about 0.5gb of a 4gb card in use and I can call about 3.5gb before I get an #CL_MEM_OBJECT_ALLOCATION_FAILURE error returned on this function. In the code below this variable is the #INPUT_SIZE * number of arrays (2) or #DATA_SIZE * SizeOf(double) * arrays passed (2)
The "Local" param is the local group size and I thought this needed to be divisible by 64. Turns out that it needs to be less than #CL_DEVICE_MAX_WORK_GROUP_SIZE (1024 on my PC) and can be anything. A call like this will set it automatically:

Code: Select all

err = clGetKernelWorkGroupInfo(KernelID, ComputeDeviceID, #CL_KERNEL_WORK_GROUP_SIZE, SizeOf(Local), @Local, #Null)
and on my PC it seems to always set it to 256. The key point is that it needs to divide into #DATA_SIZE exactly. So, you don't need to have a #DATA_SIZE count of a multiple of 64 or 256 but a multiple of whatever you set "local" to. if #DATA_SIZE is prime then you'll need to set it to "1" or #DATA_SIZE (if #DATA_SIZE is less than #CL_DEVICE_MAX_WORK_GROUP_SIZE ).
The param #CL_DEVICE_MAX_MEM_ALLOC_SIZE (on my pc is 1gb) implies that this is the max alloc size but it seems not to be. I was sending 2x 1gb arrays + 1 array of the same size as output so that's 3gb.

In short...
When you package up your data size, you set it to an amount that fits easily into GPU memory (give yourself some headroom) and choose a number of elements and a "local" where local is a factor and less than the max.
** when calculating your arrays space, remember to include you Output array(s) mem too! That also needs to sit in GPU memory.


What happens if you send arrays of different sizes to the GPU? -> No idea :?



Code: Select all

    EnableExplicit
    
    XIncludeFile "OpenCL.pbi"
    
    Define err.l
    Define Index.i
    Define PlatformID.i
    Define ComputeDeviceID.i
    Define ComputeContextID.i
    Define Commands.i
    Define ProgramID.i
    Define BuildLogBuffer.i
    Define KernelID.i
    Define InputBuffer.i
    Define OutputBuffer.i
    Define Count.i
    Define GlobalCount.i
    Define local.i
    Define correct.i
    Define testloop.i
    Define TestCount.i = 10
    Define TempRandNo.i
    
    Define CLCode.s
    Define CLCodeLen.i
    Define CLCodeFunct.s 
    Define CLCodeBuf.i
    Global NewMap TimeHist.f()
    
    Define i.i
    
    ;====================================
    Procedure Output(Comment.s)
        
        Static timeIncr.q
        Static timeLast.q
            
        If timeIncr = 0 ;not init'd
            QueryPerformanceCounter_(@timeIncr)
        EndIf
        
        QueryPerformanceCounter_(@timeLast)
        PrintN(Comment + Space(40 - Len(Comment)) + StrF((timeLast-timeIncr)/10000,3) + " ms") 
        TimeHist(Comment) = (timeLast-timeIncr)/10000 
        QueryPerformanceCounter_(@timeIncr)
    
    EndProcedure
    ;====================================
    
    ; Start +++++++++++++++++++
    
    OpenConsole()
    Output("Start:")

    Define magnitude.i = 1;50
    #DATA_SIZE = 32*1000000  ;896;00;000
    #INPUT_SIZE = #DATA_SIZE * SizeOf(Float)
    
    Dim InputValues.f(#DATA_SIZE - 1)
    Dim Result.f(#DATA_SIZE - 1)
    Dim TestResult.f(#DATA_SIZE - 1)
    
    For Index = 0 To #DATA_SIZE - 1
        InputValues(Index) = (Random(100) + 1) / 100
    Next
    
    Output("Prep Time:")
    
    clGetPlatformIDs(1, @PlatformID, #Null)
    clGetDeviceIDs(PlatformID, #CL_DEVICE_TYPE_GPU, 1, @ComputeDeviceID, #Null)
    ComputeContextID = clCreateContext(#Null, 1, @ComputeDeviceID, #Null, #Null, @err)
    Commands = clCreateCommandQueue(ComputeContextID, ComputeDeviceID, #Null, @err)

    CLCode = "    __kernel void square2(__global float* input, __global float* output, const unsigned int count)" + #CRLF$
    CLCode = CLCode + "    {" + #CRLF$
    CLCode = CLCode + "       int i = get_global_id(0);" + #CRLF$
    CLCode = CLCode + "" + #CRLF$
    CLCode = CLCode + "       if (i < count)" + #CRLF$
  ; CLCode = CLCode + "          output[i] = sqrt((input[i] * input[i]) + (input[i] * input[i])) ;" + #CRLF$
    CLCode = CLCode + "          output[i] = 1 / (1+ pow(2,(-1 * input[i])))  ;" + #CRLF$   ;Pow(2.72,(-1 * InputValues(Index))))
    CLCode = CLCode + "    }"
    
    CLCodeBuf.i = Ascii(CLCode)
    CLCodeLen = Len(CLCode)
    CLCodeFunct = PeekS(Ascii("square2"))
    
    ProgramID = clCreateProgramWithSource(ComputeContextID, 1, @CLCodeBuf, @CLCodeLen, @err);
    clBuildProgram(ProgramID, 0, #Null, #Null, #Null, #Null);

    KernelID = clCreateKernel(ProgramID,  @CLCodeFunct, @err)
    
    Output("GPU Setup Time:")

    InputBuffer  = clCreateBuffer(ComputeContextID, #CL_MEM_READ_ONLY , #INPUT_SIZE, #Null, @err)
    OutputBuffer = clCreateBuffer(ComputeContextID, #CL_MEM_WRITE_ONLY, #INPUT_SIZE, #Null, @err)
    
    err = clEnqueueWriteBuffer(Commands, InputBuffer, #CL_TRUE, 0, #INPUT_SIZE, @InputValues(), 0, #Null, #Null)
    err = clSetKernelArg(KernelID, 0, SizeOf(Integer), @InputBuffer);
    err = clSetKernelArg(KernelID, 1, SizeOf(Integer), @OutputBuffer);

    Count = #DATA_SIZE
    err = clSetKernelArg(KernelID, 2, SizeOf(Long), @Count);

    Output("Device Copy Time:")
    
    err = clGetKernelWorkGroupInfo(KernelID, ComputeDeviceID, #CL_KERNEL_WORK_GROUP_SIZE, SizeOf(Local), @Local, #Null);
    
    GlobalCount = #INPUT_SIZE

    err = clEnqueueNDRangeKernel(Commands, KernelID,1, #Null, @GlobalCount, @Local, 0, #Null, #Null)
    If err
        PrintN("Execution Init error:" + clErrorMessage(err))  
        ;CL MEM OBJECT ALLOCATION FAILURE   for processing mem size to big (needs to be split up)
        ;CL INVALID WORK GROUP SIZE         for non multiple of 64 (64 is just my card?)
        ;CL INVALID KERNEL                  for code compile issues
    EndIf
    
    Output("Process time:")
    
    clFinish(Commands)
    clEnqueueReadBuffer(Commands, OutputBuffer, #CL_TRUE, 0, #INPUT_SIZE, @Result(), 0, #Null, #Null);
    
    Output("Host Copy time:")
   
    ;Testing on CPU
    For Index = 0 To #DATA_SIZE - 1
        ;TestResult(Index) = Sqr((InputValues(Index) * InputValues(Index)) + (InputValues(Index)*InputValues(Index)))
        TestResult(Index) = 1 / (1 + Pow(2,(-1 * InputValues(Index))))
    Next

    Output("CPU Comparison time:")
    
    ;Results
    PrintN("")    
    PrintN("#DATA_SIZE: " + Str(#DATA_SIZE))
    PrintN(StrF(TimeHist("CPU Comparison time:") /(TimeHist("Device Copy Time:") + TimeHist("Process time:") + TimeHist("Host copy time:")),1) + "x Speedup (Including Host/Dev & Dev/Host mem copy)")    
    PrintN(StrF(TimeHist("CPU Comparison time:") / TimeHist("Process time:"),1) + "x Speedup (Compute only)")
    PrintN("")
    PrintN("Random sampling output comparisons")
    PrintN("")
    
    For testloop = 1 To TestCount
        TempRandNo = Random(#DATA_SIZE - 1,1)
        PrintN("CPU Result:   " +  StrF(TestResult(TempRandNo)) + #TAB$ + " GPU Result:  " + StrF(Result(TempRandNo)) + #TAB$ + "Diff: " + StrF(TestResult(TempRandNo) - Result(TempRandNo)) )
    Next
    
    ; cleanup

    clReleaseMemObject(InputBuffer)
    clReleaseMemObject(OutputBuffer)
    clReleaseProgram(ProgramID)
    clReleaseKernel(KernelID)
    clReleaseCommandQueue(Commands)
    clReleaseContext(ComputeContextID)

    Input()
    CloseConsole()

Code: Select all

    EnableExplicit
    
    XIncludeFile "OpenCL.pbi"
    
    Define err.l
    Define Index.i
    Define PlatformID.i
    Define ComputeDeviceID.i
    Define ComputeContextID.i
    Define Commands.i
    Define ProgramID.i
    Define BuildLogBuffer.i
    Define KernelID.i
    Define InputBuffer_a.i
    Define InputBuffer_b.i
    Define OutputBuffer_c.i
    Define Count.i
    Define GlobalCount.i
    Define local.i
    Define correct.i
    Define testloop.i
    Define TestCount.i = 10
    Define TempRandNo.i
    
    Define CLCode.s
    Define CLCodeLen.i
    Define CLCodeFunct.s 
    Define CLCodeBuf.i
    Global NewMap TimeHist.f()
    
    Define i.i
    
    ;====================================
    Procedure Output(Comment.s)
        
        Static timeIncr.q
        Static timeLast.q
            
        If timeIncr = 0 ;not init'd
            QueryPerformanceCounter_(@timeIncr)
        EndIf
        
        QueryPerformanceCounter_(@timeLast)
        PrintN(Comment + Space(40 - Len(Comment)) + StrF((timeLast-timeIncr)/10000,3) + " ms") 
        TimeHist(Comment) = (timeLast-timeIncr)/10000 
        QueryPerformanceCounter_(@timeIncr)
    
    EndProcedure
    ;====================================
    
    ; Start +++++++++++++++++++
    
    OpenConsole()
    Output("Start:")

    #DATA_SIZE = 32*1000000  ; reduce this if getting "CL MEM OBJECT ALLOCATION FAILURE"
    #INPUT_SIZE = #DATA_SIZE * SizeOf(Float)
    
    Dim Input_a.f(#DATA_SIZE - 1)
    Dim Input_b.f(#DATA_SIZE - 1)
    Dim Result_c.f(#DATA_SIZE - 1)
    Dim TestResult.f(#DATA_SIZE - 1)
    
    For Index = 0 To #DATA_SIZE - 1
        Input_a(Index) = (Random(100) + 1) / 100
        Input_b(Index) = (Random(100) + 1) / 100
    Next
    
    Output("Prep Time:")
    
    clGetPlatformIDs(1, @PlatformID, #Null)
    clGetDeviceIDs(PlatformID, #CL_DEVICE_TYPE_GPU, 1, @ComputeDeviceID, #Null)
    ComputeContextID = clCreateContext(#Null, 1, @ComputeDeviceID, #Null, #Null, @err)
    Commands = clCreateCommandQueue(ComputeContextID, ComputeDeviceID, #Null, @err)

    
    CLCode =          " __kernel void" + #CRLF$
    CLCode = CLCode + "  vectorAdd(__global float* input_a," + #CRLF$
    CLCode = CLCode + "            __global float* input_b," + #CRLF$
    CLCode = CLCode + "            __global float* output_c," + #CRLF$
    CLCode = CLCode + "            const unsigned int count)" + #CRLF$
    CLCode = CLCode + "  {" + #CRLF$
    CLCode = CLCode + "      int i = get_global_id(0);" + #CRLF$
    CLCode = CLCode + "       if (i < count)" + #CRLF$
    CLCode = CLCode + "      output_c[i] = sqrt((input_a[i] * input_a[i]) + (input_b[i] * input_b[i]));" + #CRLF$
    CLCode = CLCode + "  }" + #CRLF$

    
    CLCodeBuf.i = Ascii(CLCode)
    CLCodeLen = Len(CLCode)
    CLCodeFunct = PeekS(Ascii("vectorAdd"))
    
    ProgramID = clCreateProgramWithSource(ComputeContextID, 1, @CLCodeBuf, @CLCodeLen, @err);
    clBuildProgram(ProgramID, 0, #Null, #Null, #Null, #Null);

    KernelID = clCreateKernel(ProgramID,  @CLCodeFunct, @err)
    
    Output("GPU Setup Time:")

    InputBuffer_a  = clCreateBuffer(ComputeContextID, #CL_MEM_READ_ONLY , #INPUT_SIZE, #Null, @err)
    InputBuffer_b  = clCreateBuffer(ComputeContextID, #CL_MEM_READ_ONLY , #INPUT_SIZE, #Null, @err)
    OutputBuffer_c = clCreateBuffer(ComputeContextID, #CL_MEM_WRITE_ONLY, #INPUT_SIZE, #Null, @err)
    
    err = clEnqueueWriteBuffer(Commands, InputBuffer_a, #CL_TRUE, 0, #INPUT_SIZE, @Input_a(), 0, #Null, #Null)
    err = clEnqueueWriteBuffer(Commands, InputBuffer_b, #CL_TRUE, 0, #INPUT_SIZE, @Input_b(), 0, #Null, #Null)
    
    err = clSetKernelArg(KernelID, 0, SizeOf(Integer), @InputBuffer_a);
    err = clSetKernelArg(KernelID, 1, SizeOf(Integer), @InputBuffer_b);
    err = clSetKernelArg(KernelID, 2, SizeOf(Integer), @OutputBuffer_c);
    
    Count = #DATA_SIZE
    err = clSetKernelArg(KernelID, 3, SizeOf(Long), @Count);

    Output("Device Copy Time:")
    
    err = clGetKernelWorkGroupInfo(KernelID, ComputeDeviceID, #CL_KERNEL_WORK_GROUP_SIZE, SizeOf(Local), @Local, #Null);
    
    GlobalCount = #INPUT_SIZE

    err = clEnqueueNDRangeKernel(Commands, KernelID,1, #Null, @GlobalCount, @Local, 0, #Null, #Null)
    If err
        PrintN("Execution Init error:" + clErrorMessage(err))  
        ;CL MEM OBJECT ALLOCATION FAILURE   for processing mem size to big (needs to be split up)
        ;CL INVALID WORK GROUP SIZE         for non multiple of 64 (64 is just my card?)
        ;CL INVALID KERNEL                  for code compile issues
    EndIf
    
    Output("Process time:")
    
    clFinish(Commands)
    clEnqueueReadBuffer(Commands, OutputBuffer_c, #CL_TRUE, 0, #INPUT_SIZE, @Result_c(), 0, #Null, #Null);
    
    Output("Host Copy time:")
   
    ;Testing on CPU
    For Index = 0 To #DATA_SIZE - 1
        ;TestResult(Index) = Sqr((InputValues(Index) * InputValues(Index)) + (InputValues(Index)*InputValues(Index)))
       ; TestResult(Index) = 1 / (1 + Pow(2,(-1 * InputValues(Index))))
        TestResult(Index) = Sqr( (Input_a(Index)*Input_a(Index)) + (Input_b(Index)*Input_b(Index)) )
    Next

    Output("CPU Comparison time:")
    
    ;Results
    PrintN("")    
    PrintN("#DATA_SIZE: " + Str(#DATA_SIZE))
    PrintN(StrF(TimeHist("CPU Comparison time:") /(TimeHist("Device Copy Time:") + TimeHist("Process time:") + TimeHist("Host copy time:")),1) + "x Speedup (Including Host/Dev & Dev/Host mem copy)")    
    PrintN(StrF(TimeHist("CPU Comparison time:") / TimeHist("Process time:"),1) + "x Speedup (Compute only)")
    PrintN("")
    PrintN("Random sampling output comparisons")
    PrintN("")
    
    For testloop = 1 To TestCount
        TempRandNo = Random(#DATA_SIZE - 1,1)
        PrintN("CPU Result:   " +  StrF(TestResult(TempRandNo)) + #TAB$ + " GPU Result:  " + StrF(Result_c(TempRandNo)) + #TAB$ + "Diff: " + StrF(TestResult(TempRandNo) - Result_c(TempRandNo)) )
    Next
    
    ; cleanup
    
    clReleaseMemObject(InputBuffer_a)
    clReleaseMemObject(InputBuffer_b)
    clReleaseMemObject(OutputBuffer_c)
    clReleaseProgram(ProgramID)
    clReleaseKernel(KernelID)
    clReleaseCommandQueue(Commands)
    clReleaseContext(ComputeContextID)

    Input()
    CloseConsole()

Code: Select all

Start:                                  0.000 ms
Prep Time:                              195.560 ms
GPU Setup Time:                         194.143 ms
Device Copy Time:                       21.566 ms
Process time:                           0.458 ms
Host Copy time:                         26.155 ms
CPU Comparison time:                    945.209 ms

#DATA_SIZE: 32000000
42.9x Speedup (Including Host/Dev & Dev/Host mem copy)
2062.4x Speedup (Compute only)

Random sampling output comparisons

CPU Result:   0.6383894682       GPU Result:  0.6383894086      Diff: 0.0000000596
CPU Result:   0.5991742015       GPU Result:  0.5991741419      Diff: 0.0000000596
CPU Result:   0.6431758404       GPU Result:  0.6431758404      Diff: 0
CPU Result:   0.5051984191       GPU Result:  0.5051984191      Diff: 0
CPU Result:   0.5807320476       GPU Result:  0.5807320476      Diff: 0
CPU Result:   0.5173217654       GPU Result:  0.5173217058      Diff: 0.0000000596
CPU Result:   0.6635789275       GPU Result:  0.6635789275      Diff: 0
CPU Result:   0.5941699743       GPU Result:  0.5941699743      Diff: 0
CPU Result:   0.5908228159       GPU Result:  0.5908228755      Diff: -0.0000000596
CPU Result:   0.5138593912       GPU Result:  0.5138594508      Diff: -0.0000000596

Code: Select all

Start:                                  0.000 ms
Prep Time:                              327.895 ms
GPU Setup Time:                         131.462 ms
Device Copy Time:                       44.059 ms
Process time:                           0.456 ms
Host Copy time:                         26.270 ms
CPU Comparison time:                    95.343 ms

#DATA_SIZE: 32000000
2.1x Speedup (Including Host/Dev & Dev/Host mem copy)
209.3x Speedup (Compute only)

Random sampling output comparisons

CPU Result:   0.8514692783       GPU Result:  0.8514692783      Diff: 0
CPU Result:   0.6551335454       GPU Result:  0.655133605       Diff: -0.0000000596
CPU Result:   1.0480934381       GPU Result:  1.0480934381      Diff: 0
CPU Result:   1.3862178326       GPU Result:  1.3862178326      Diff: 0
CPU Result:   0.841486752        GPU Result:  0.841486752       Diff: 0
CPU Result:   0.9207062721       GPU Result:  0.9207062125      Diff: 0.0000000596
CPU Result:   1.1014535427       GPU Result:  1.1014535427      Diff: 0
CPU Result:   0.6987131238       GPU Result:  0.6987131238      Diff: 0
CPU Result:   1.3865785599       GPU Result:  1.3865784407      Diff: 0.0000001192
CPU Result:   1.0200489759       GPU Result:  1.0200489759      Diff: 0



Re: OpenCL

Posted: Fri May 06, 2022 3:39 pm
by jack
hello pdwyer :)
this is an interesting subject, thank you for bringing attention to it, as soon as I get out of hibernation I plan on giving it a good try

Re: OpenCL

Posted: Fri May 06, 2022 7:19 pm
by Caronte3D
OpenCL in PB :shock: Nice! :wink:

Re: OpenCL

Posted: Sat May 07, 2022 10:09 am
by pdwyer
Credit to the people above who got that include file working!!

I'm trying to work out if investment in time in understanding this is worthwhile or not. OpenCL 3.0 is quite new so I figured it was still alive but when I tried to use it on my AMD Rizen CPU it didn't detect it. Looking around for a driver it seems AMD doesn't support OpenCL anymore so I can only see my nvidia GPU card. (which is enough I guess). But it makes me wonder if this is a dying framework, which would be a shame.

The world seems to be moving toward something called "Vulkan" which I did a little reading on and while mainly used for 3d graphics like DX12 or OpenGL it can be used for GPGPU compute as well. It looks about 10x more complicated though, I can't find a library for it and I don't think it will accept code passed as a string or text file like openCL but wants compiled or bytecode like Cuda.

Lots of academic types seemed pretty pissed online about AMD's decision as it seems OpenCL is used a lot in that area. If you know anything about good directions to go for GPU compute in PB please share. I might take a peek at the Python community to see what they plan to do since GPU compute is a big part of their machine learning libs.

EDIT
Actually there are lots of "vulkan-1.dll" on my PC.. :oops: . and python has some header translation. I'll take a look and see if it's going in the too hard basket or not