Memory errors/corruptions in cl-api
Track and report memory errors in:
- cl/get-platform-info
- cl/get-device-info
- cl/get-context-info
- cl/get-command-queue-info
- cl/get-mem-object-info
- cl/get-image-info
- cl/get-program-info
- cl/get-program-build-info
- cl/get-kernel-info
- cl/get-kernel-arg-info
- cl/get-kernel-work-group-info
- cl/get-event-info
Some platforms write to param_value even if param_value_size is smaller than the size of the return type. For instance, I noticed (by means of Valgrind) that the proprietary AMDGPU OpenCL implementation performs an invalid write of exactly 1 byte in the clGetDeviceInfo() call.
In that case, the test execution gets corrupted. After that, several things may happen:
- The program finishes successfully, silently hiding the error
- The program crashes immediately (I never actually experienced that), at least letting the user know which platform is responsible
- The program crashes at the end, even after reporting a passed test, due to 'double free/corruption'. The test is finally reported as crashed in the results
- The program crashes randomly during the execution of another platform, which is very confusing
So I suggest this classic solution: a canary that tracks when memory is wrongly written, without causing actual memory errors. Hopefully the same idea can be applied elsewhere, in other tests.
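The canary idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this change: query_fn, good_query, buggy_query, query_respects_size, CANARY_SIZE and CANARY_BYTE are all hypothetical names, and the two query functions merely stand in for a clGet*Info() call. The buffer is over-allocated by a guard region filled with a known pattern, the query is told only about the first param_value_size bytes, and the guard is inspected afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CANARY_SIZE 16   /* hypothetical guard length, in bytes */
#define CANARY_BYTE 0xA5 /* hypothetical guard pattern */

/* Hypothetical stand-in for a clGet*Info() style query. */
typedef int (*query_fn)(size_t param_value_size, void *param_value);

/* Well-behaved query: writes nothing when the buffer is too small. */
static int good_query(size_t size, void *value)
{
    const int result = 42;

    if (size < sizeof(result))
        return -1;
    memcpy(value, &result, sizeof(result));
    return 0;
}

/* Misbehaving query: writes 1 byte more than it was allowed to. */
static int buggy_query(size_t size, void *value)
{
    memset(value, 0, size + 1);
    return 0;
}

/*
 * Run `query` on a buffer of `size` usable bytes followed by a canary.
 * Returns true if the canary is intact afterwards. For brevity an
 * allocation failure is also reported as false.
 */
static bool query_respects_size(query_fn query, size_t size)
{
    unsigned char *buf;
    bool intact = true;
    size_t i;

    assert(query != NULL);

    buf = malloc(size + CANARY_SIZE);
    if (buf == NULL)
        return false;

    /* Fill everything, including the guard region, with the pattern. */
    memset(buf, CANARY_BYTE, size + CANARY_SIZE);

    /* The query is only told about the first `size` bytes. */
    (void)query(size, buf);

    /* Any changed guard byte means the query wrote out of bounds. */
    for (i = 0; i < CANARY_SIZE; i++) {
        if (buf[size + i] != CANARY_BYTE)
            intact = false;
    }

    free(buf);
    return intact;
}
```

Because the stray write lands inside memory the test itself allocated, the heap is not damaged and the test can report a failure instead of crashing later. The sketch only catches writes that fall within CANARY_SIZE bytes past the requested size, and it cannot see a stray write whose value happens to equal CANARY_BYTE.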
Changes in the results:
After running piglit with the following 2 platforms:
- Portable Computing Language (OpenCL 1.2 pocl 1.5-pre/tags/v1.4-RC2-0-g81763647, Release, LLVM 8.0.1, RELOC, SPIR, SLEEF, POCL_DEBUG)
- AMD Accelerated Parallel Processing (OpenCL 2.1 AMD-APP (2766.4))
The following results are obtained:
cl/get-platform-info: "Pass" -> "Fail"
A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary it would silently corrupt the program.
cl/get-device-info: "Fail" -> "Fail"
In both cases AMD Accelerated Parallel Processing failed due to this bug. However, with the canary the test also reports memory errors.
cl/get-context-info: "Pass" -> "Pass"
cl/get-command-queue-info: "Fail" -> "Fail"
In both cases AMD Accelerated Parallel Processing reported an unexpected CL_INVALID_COMMAND_QUEUE error while querying CL_QUEUE_SIZE. No memory errors were detected.
cl/get-mem-object-info: "Pass" -> "Pass"
cl/get-image-info: "Pass" -> "Pass"
cl/get-program-info: "Pass" -> "Pass"
cl/get-program-build-info: "Pass" -> "Pass"
cl/get-kernel-arg-info: "Crash" -> "Fail"
A different memory corruption, due to a bug, was causing the crashes. With that problem fixed, the test reports CL_SUCCESS where CL_KERNEL_ARG_INFO_NOT_AVAILABLE was expected, but the canary has not detected any memory corruption.
cl/get-kernel-info: "Pass" -> "Fail"
A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary it would silently corrupt the program.
cl/get-kernel-work-group-info: "Pass" -> "Fail"
A memory error is now detected for the AMD Accelerated Parallel Processing platform; without the canary it would silently corrupt the program.
cl/get-event-info: "Pass" -> "Pass"