Fix signed integer conversion (#36)

This fixes the last remaining bug preventing an end-to-end GeekBench run, so the GeekBench results in the README are updated as well
This commit is contained in:
Andrzej Janik 2021-01-26 21:05:09 +01:00 committed by GitHub
parent 3e2e73ac33
commit 972f612562
15 changed files with 522 additions and 27 deletions

File diff suppressed because one or more lines are too long (image preview: 101 KiB before, 259 KiB after)


@ -11,23 +11,22 @@ Performance below is normalized to OpenCL performance. 110% means that ZLUDA-imp
![Performance graph](GeekBench_5_2_3.svg)
[ZLUDA detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918048)
[ZLUDA - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2305009)
[OpenCL detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918080)
[OpenCL - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2304997)
Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
Overall, ZLUDA is slower in GeekBench by roughly 2%.
### Explanation of the results
* Why is ZLUDA faster in Stereo Matching, Gaussian Blur and Depth of Field?\
* Why is ZLUDA faster in some benchmarks?\
This has not been precisely pinpointed to one thing or another but it's likely a combination of things:
* ZLUDA uses Level 0, which in general is a more level, higher performance API
* ZLUDA uses [Level 0](https://spec.oneapi.com/level-zero/latest/index.html), which in general is a more low level, high performance API than OpenCL
* Related to the previous point: ZLUDA does not currently support asynchronous execution. This gives us an unfair advantage in a benchmark like GeekBench, which exclusively uses synchronous CUDA APIs
* There is a set of GPU instructions which are available on both NVIDIA hardware and Intel hardware, but are not exposed through OpenCL. We are comparing NVIDIA GPU optimized code with the more general OpenCL code. It's a lucky coincidence (and a credit to the underlying Intel Graphics Compiler) that this code also works well on an Intel GPU
* Why is OpenCL faster in Canny and Horizon Detection?\
Authors of the CUDA benchmarks used the CUDA functions `atomicInc` and `atomicDec`, which have direct hardware support on NVIDIA cards but no hardware support on Intel cards. They have to be emulated in software, which limits performance (see the sketch after this list)
* Why are some benchmarks failing?\
ZLUDA itself supports all the operations used in the failing benchmarks. From the limited debugging done so far, the problem is most likely somewhere else. The Intel GPU compiler stack is very capable when it comes to compiling OpenCL, C for Metal and DPC++. It is not yet very good at compiling ZLUDA: ZLUDA emits code patterns the Intel GPU compiler stack has never seen before and hits some rarely used (or previously unused) code paths in the compiler.\
Current status of failing GeekBench tests is tracked [here](https://github.com/vosen/ZLUDA/pull/12)
* Why is ZLUDA slower in the remaining benchmarks?\
The reason is unknown. Most likely, in some tests we compile from suboptimal NVIDIA GPU code and in other tests ZLUDA itself is emitting suboptimal Intel GPU code. For example, SFFT used to be even slower before PR [#22](https://github.com/vosen/ZLUDA/pull/22)
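As mentioned in the `atomicInc`/`atomicDec` point above, an atomic operation with no hardware instruction has to be emulated with a retry loop. A minimal, illustrative Rust sketch of such emulation (the function name is hypothetical; CUDA's `atomicInc(p, limit)` returns the old value and wraps the stored value back to 0 once it reaches `limit`):

```
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical sketch: emulating atomicInc without a native instruction
// requires a compare-exchange retry loop, which is the overhead referred
// to above.
fn emulated_atomic_inc(p: &AtomicU32, limit: u32) -> u32 {
    loop {
        let old = p.load(Ordering::Relaxed);
        let new = if old >= limit { 0 } else { old + 1 };
        if p.compare_exchange(old, new, Ordering::Relaxed, Ordering::Relaxed).is_ok() {
            return old;
        }
    }
}

fn main() {
    let counter = AtomicU32::new(9);
    assert_eq!(emulated_atomic_inc(&counter, 10), 9);  // increments to 10
    assert_eq!(emulated_atomic_inc(&counter, 10), 10); // wraps back to 0
    assert_eq!(counter.load(Ordering::Relaxed), 0);
}
```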
## Details
@ -35,7 +34,7 @@ Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
* Is ZLUDA a drop-in replacement for CUDA?\
Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA
* What is the status of the project?\
This project is a Proof of Concept. About the only thing that works currently is Geekbench (and not even completely). It's amazingly buggy and incomplete. You should not rely on it for anything serious
This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious
* Is it an Intel project? Is it an NVIDIA project?\
No, it's a private project
* What is the performance?\
@ -56,8 +55,8 @@ You should have the most recent Intel GPU drivers installed.\
Copy `nvcuda.dll` to the application directory (the directory where .exe file is) and launch it normally
### Linux
A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of the writing 20.45.18403 is the losest recommended version.
Unpack the archive somewhere and run your application like this:
A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of the writing 20.45.18403 is the oldest recommended version.
Run your application like this:
```
LD_LIBRARY_PATH=<PATH_TO_THE_DIRECTORY_WITH_ZLUDA_PROVIDED_LIBCUDA> <YOUR_APPLICATION>
```
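The first of the new test kernels below, `cvt_rzi`, exercises `cvt.rzi.f32.f32` (round toward zero); the SPIR-V that follows it lowers the instruction to the OpenCL.std `trunc` extended instruction. A plain-Rust sketch of the expected behaviour, using the same values as the new `test_ptx!(cvt_rzi, ...)` entry further down:

```
fn main() {
    // cvt.rzi rounds toward zero, i.e. drops the fractional part.
    let inputs = [-13.8f32, 12.9f32];
    let expected = [-13.0f32, 13.0f32];
    for (x, e) in inputs.iter().zip(expected.iter()) {
        assert_eq!(x.trunc(), *e);
    }
}
```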


@ -0,0 +1,25 @@
.version 6.5
.target sm_30
.address_size 64
.visible .entry cvt_rzi(
.param .u64 input,
.param .u64 output
)
{
.reg .u64 in_addr;
.reg .u64 out_addr;
.reg .f32 temp1;
.reg .f32 temp2;
ld.param.u64 in_addr, [input];
ld.param.u64 out_addr, [output];
ld.f32 temp1, [in_addr];
ld.f32 temp2, [in_addr+4];
cvt.rzi.f32.f32 temp1, temp1;
cvt.rzi.f32.f32 temp2, temp2;
st.f32 [out_addr], temp1;
st.f32 [out_addr+4], temp2;
ret;
}


@ -0,0 +1,63 @@
OpCapability GenericPointer
OpCapability Linkage
OpCapability Addresses
OpCapability Kernel
OpCapability Int8
OpCapability Int16
OpCapability Int64
OpCapability Float16
OpCapability Float64
%34 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %1 "cvt_rzi"
%void = OpTypeVoid
%ulong = OpTypeInt 64 0
%37 = OpTypeFunction %void %ulong %ulong
%_ptr_Function_ulong = OpTypePointer Function %ulong
%float = OpTypeFloat 32
%_ptr_Function_float = OpTypePointer Function %float
%_ptr_Generic_float = OpTypePointer Generic %float
%ulong_4 = OpConstant %ulong 4
%ulong_4_0 = OpConstant %ulong 4
%1 = OpFunction %void None %37
%8 = OpFunctionParameter %ulong
%9 = OpFunctionParameter %ulong
%32 = OpLabel
%2 = OpVariable %_ptr_Function_ulong Function
%3 = OpVariable %_ptr_Function_ulong Function
%4 = OpVariable %_ptr_Function_ulong Function
%5 = OpVariable %_ptr_Function_ulong Function
%6 = OpVariable %_ptr_Function_float Function
%7 = OpVariable %_ptr_Function_float Function
OpStore %2 %8
OpStore %3 %9
%10 = OpLoad %ulong %2 Aligned 8
OpStore %4 %10
%11 = OpLoad %ulong %3 Aligned 8
OpStore %5 %11
%13 = OpLoad %ulong %4
%28 = OpConvertUToPtr %_ptr_Generic_float %13
%12 = OpLoad %float %28 Aligned 4
OpStore %6 %12
%15 = OpLoad %ulong %4
%25 = OpIAdd %ulong %15 %ulong_4
%29 = OpConvertUToPtr %_ptr_Generic_float %25
%14 = OpLoad %float %29 Aligned 4
OpStore %7 %14
%17 = OpLoad %float %6
%16 = OpExtInst %float %34 trunc %17
OpStore %6 %16
%19 = OpLoad %float %7
%18 = OpExtInst %float %34 trunc %19
OpStore %7 %18
%20 = OpLoad %ulong %5
%21 = OpLoad %float %6
%30 = OpConvertUToPtr %_ptr_Generic_float %20
OpStore %30 %21 Aligned 4
%22 = OpLoad %ulong %5
%23 = OpLoad %float %7
%27 = OpIAdd %ulong %22 %ulong_4_0
%31 = OpConvertUToPtr %_ptr_Generic_float %27
OpStore %31 %23 Aligned 4
OpReturn
OpFunctionEnd
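The next kernel, `cvt_s32_f32`, exercises `cvt.rpi.ftz.s32.f32`: round toward positive infinity, then convert to a signed 32-bit integer. In the SPIR-V below this becomes `OpConvertFToS` decorated with `FPRoundingMode RTP`. A plain-Rust sketch of the semantics, using the values from the `test_ptx!(cvt_s32_f32, ...)` entry:

```
fn main() {
    // Round toward +infinity, then convert to a signed 32-bit integer.
    let inputs = [-13.8f32, 12.9f32];
    let expected = [-13i32, 13i32];
    for (x, e) in inputs.iter().zip(expected.iter()) {
        assert_eq!(x.ceil() as i32, *e);
    }
}
```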


@ -0,0 +1,25 @@
.version 6.5
.target sm_30
.address_size 64
.visible .entry cvt_s32_f32(
.param .u64 input,
.param .u64 output
)
{
.reg .u64 in_addr;
.reg .u64 out_addr;
.reg .b32 temp1;
.reg .b32 temp2;
ld.param.u64 in_addr, [input];
ld.param.u64 out_addr, [output];
ld.f32 temp1, [in_addr];
ld.f32 temp2, [in_addr+4];
cvt.rpi.ftz.s32.f32 temp1, temp1;
cvt.rpi.ftz.s32.f32 temp2, temp2;
st.global.s32 [out_addr], temp1;
st.global.s32 [out_addr+4], temp2;
ret;
}


@ -0,0 +1,75 @@
OpCapability GenericPointer
OpCapability Linkage
OpCapability Addresses
OpCapability Kernel
OpCapability Int8
OpCapability Int16
OpCapability Int64
OpCapability Float16
OpCapability Float64
%42 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %1 "cvt_s32_f32"
OpDecorate %32 FPRoundingMode RTP
OpDecorate %34 FPRoundingMode RTP
%void = OpTypeVoid
%ulong = OpTypeInt 64 0
%45 = OpTypeFunction %void %ulong %ulong
%_ptr_Function_ulong = OpTypePointer Function %ulong
%uint = OpTypeInt 32 0
%_ptr_Function_uint = OpTypePointer Function %uint
%float = OpTypeFloat 32
%_ptr_Generic_float = OpTypePointer Generic %float
%ulong_4 = OpConstant %ulong 4
%_ptr_CrossWorkgroup_uint = OpTypePointer CrossWorkgroup %uint
%ulong_4_0 = OpConstant %ulong 4
%1 = OpFunction %void None %45
%8 = OpFunctionParameter %ulong
%9 = OpFunctionParameter %ulong
%40 = OpLabel
%2 = OpVariable %_ptr_Function_ulong Function
%3 = OpVariable %_ptr_Function_ulong Function
%4 = OpVariable %_ptr_Function_ulong Function
%5 = OpVariable %_ptr_Function_ulong Function
%6 = OpVariable %_ptr_Function_uint Function
%7 = OpVariable %_ptr_Function_uint Function
OpStore %2 %8
OpStore %3 %9
%10 = OpLoad %ulong %2 Aligned 8
OpStore %4 %10
%11 = OpLoad %ulong %3 Aligned 8
OpStore %5 %11
%13 = OpLoad %ulong %4
%29 = OpConvertUToPtr %_ptr_Generic_float %13
%28 = OpLoad %float %29 Aligned 4
%12 = OpBitcast %uint %28
OpStore %6 %12
%15 = OpLoad %ulong %4
%25 = OpIAdd %ulong %15 %ulong_4
%31 = OpConvertUToPtr %_ptr_Generic_float %25
%30 = OpLoad %float %31 Aligned 4
%14 = OpBitcast %uint %30
OpStore %7 %14
%17 = OpLoad %uint %6
%33 = OpBitcast %float %17
%32 = OpConvertFToS %uint %33
%16 = OpCopyObject %uint %32
OpStore %6 %16
%19 = OpLoad %uint %7
%35 = OpBitcast %float %19
%34 = OpConvertFToS %uint %35
%18 = OpCopyObject %uint %34
OpStore %7 %18
%20 = OpLoad %ulong %5
%21 = OpLoad %uint %6
%36 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %20
%37 = OpCopyObject %uint %21
OpStore %36 %37 Aligned 4
%22 = OpLoad %ulong %5
%23 = OpLoad %uint %7
%27 = OpIAdd %ulong %22 %ulong_4_0
%38 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %27
%39 = OpCopyObject %uint %23
OpStore %38 %39 Aligned 4
OpReturn
OpFunctionEnd
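The `cvt_s64_s32` kernel below is the regression test for the signed integer conversion this commit fixes: widening a signed 32-bit value to 64 bits must sign-extend, not zero-extend. In plain Rust, with the value used by the new `test_ptx!(cvt_s64_s32, ...)` entry:

```
fn main() {
    let x: i32 = -1; // bit pattern 0xFFFFFFFF
    // Sign extension (what cvt.s64.s32 requires):
    assert_eq!(x as i64, -1i64);
    // Zero extension (what an unsigned widening would produce instead):
    assert_eq!((x as u32) as u64 as i64, 4294967295i64);
}
```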


@ -0,0 +1,22 @@
.version 6.5
.target sm_30
.address_size 64
.visible .entry cvt_s64_s32(
.param .u64 input,
.param .u64 output
)
{
.reg .u64 in_addr;
.reg .u64 out_addr;
.reg .s32 r_32;
.reg .s64 r_64;
ld.param.u64 in_addr, [input];
ld.param.u64 out_addr, [output];
ld.b32 r_32, [in_addr];
cvt.s64.s32 r_64, r_32;
st.b64 [out_addr], r_64;
ret;
}


@ -0,0 +1,53 @@
OpCapability GenericPointer
OpCapability Linkage
OpCapability Addresses
OpCapability Kernel
OpCapability Int8
OpCapability Int16
OpCapability Int64
OpCapability Float16
OpCapability Float64
%24 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %1 "cvt_s64_s32"
%void = OpTypeVoid
%ulong = OpTypeInt 64 0
%27 = OpTypeFunction %void %ulong %ulong
%_ptr_Function_ulong = OpTypePointer Function %ulong
%uint = OpTypeInt 32 0
%_ptr_Function_uint = OpTypePointer Function %uint
%_ptr_Generic_uint = OpTypePointer Generic %uint
%_ptr_Generic_ulong = OpTypePointer Generic %ulong
%1 = OpFunction %void None %27
%8 = OpFunctionParameter %ulong
%9 = OpFunctionParameter %ulong
%22 = OpLabel
%2 = OpVariable %_ptr_Function_ulong Function
%3 = OpVariable %_ptr_Function_ulong Function
%4 = OpVariable %_ptr_Function_ulong Function
%5 = OpVariable %_ptr_Function_ulong Function
%6 = OpVariable %_ptr_Function_uint Function
%7 = OpVariable %_ptr_Function_ulong Function
OpStore %2 %8
OpStore %3 %9
%10 = OpLoad %ulong %2 Aligned 8
OpStore %4 %10
%11 = OpLoad %ulong %3 Aligned 8
OpStore %5 %11
%13 = OpLoad %ulong %4
%19 = OpConvertUToPtr %_ptr_Generic_uint %13
%18 = OpLoad %uint %19 Aligned 4
%12 = OpCopyObject %uint %18
OpStore %6 %12
%15 = OpLoad %uint %6
%32 = OpBitcast %uint %15
%33 = OpSConvert %ulong %32
%14 = OpCopyObject %ulong %33
OpStore %7 %14
%16 = OpLoad %ulong %5
%17 = OpLoad %ulong %7
%20 = OpConvertUToPtr %_ptr_Generic_ulong %16
%21 = OpCopyObject %ulong %17
OpStore %20 %21 Aligned 8
OpReturn
OpFunctionEnd
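The test-harness diff that follows splits the single generic parameter `T` of `test_ptx_assert` and `run_spirv` into separate `Input` and `Output` parameters; with a single type, a test such as `cvt_s64_s32`, whose input is `i32` and whose expected output is `i64`, could not be expressed. A simplified, hypothetical sketch of the shape of that change (not the project's actual API):

```
use std::fmt::Debug;

// Hypothetical stand-in for the test helper: separate Input and Output
// type parameters allow the input and expected-output slices to differ.
fn check<Input, Output>(input: &[Input], expected: &[Output], run: impl Fn(&[Input]) -> Vec<Output>)
where
    Input: Copy + Debug,
    Output: Copy + Debug + PartialEq,
{
    let actual = run(input);
    assert_eq!(actual.as_slice(), expected, "mismatch for input {:?}", input);
}

fn main() {
    // Mirrors the cvt_s64_s32 test: 32-bit signed input, 64-bit signed output.
    check(&[-1i32], &[-1i64], |xs| xs.iter().map(|&x| x as i64).collect());
}
```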


@ -49,6 +49,8 @@ test_ptx!(mul_lo, [1u64], [2u64]);
test_ptx!(mul_hi, [u64::max_value()], [1u64]);
test_ptx!(add, [1u64], [2u64]);
test_ptx!(setp, [10u64, 11u64], [1u64, 0u64]);
test_ptx!(setp_gt, [f32::NAN, 1f32], [1f32]);
test_ptx!(setp_leu, [1f32, f32::NAN], [1f32]);
test_ptx!(bra, [10u64], [11u64]);
test_ptx!(not, [0u64], [u64::max_value()]);
test_ptx!(shl, [11u64], [44u64]);
@ -114,6 +116,8 @@ test_ptx!(cos, [std::f32::consts::PI], [-1f32]);
test_ptx!(lg2, [512f32], [9f32]);
test_ptx!(ex2, [10f32], [1024f32]);
test_ptx!(cvt_rni, [9.5f32, 10.5f32], [10f32, 10f32]);
test_ptx!(cvt_rzi, [-13.8f32, 12.9f32], [-13f32, 13f32]);
test_ptx!(cvt_s32_f32, [-13.8f32, 12.9f32], [-13i32, 13i32]);
test_ptx!(clz, [0b00000101_00101101_00010011_10101011u32], [5u32]);
test_ptx!(popc, [0b10111100_10010010_01001001_10001010u32], [14u32]);
test_ptx!(
@ -140,8 +144,9 @@ test_ptx!(stateful_ld_st_ntid, [123u64], [123u64]);
test_ptx!(stateful_ld_st_ntid_chain, [12651u64], [12651u64]);
test_ptx!(stateful_ld_st_ntid_sub, [96311u64], [96311u64]);
test_ptx!(shared_ptr_take_address, [97815231u64], [97815231u64]);
// For now, we just that it builds and links
// For now, we just make sure that it builds and links
test_ptx!(assertfail, [716523871u64], [716523872u64]);
test_ptx!(cvt_s64_s32, [-1i32], [-1i64]);
struct DisplayError<T: Debug> {
err: T,
@ -161,11 +166,15 @@ impl<T: Debug> Debug for DisplayError<T> {
impl<T: Debug> error::Error for DisplayError<T> {}
fn test_ptx_assert<'a, T: From<u8> + ze::SafeRepr + Debug + Copy + PartialEq>(
fn test_ptx_assert<
'a,
Input: From<u8> + ze::SafeRepr + Debug + Copy + PartialEq,
Output: From<u8> + ze::SafeRepr + Debug + Copy + PartialEq,
>(
name: &str,
ptx_text: &'a str,
input: &[T],
output: &mut [T],
input: &[Input],
output: &mut [Output],
) -> Result<(), Box<dyn error::Error + 'a>> {
let mut errors = Vec::new();
let ast = ptx::ModuleParser::new().parse(&mut errors, ptx_text)?;
@ -178,12 +187,15 @@ fn test_ptx_assert<'a, T: From<u8> + ze::SafeRepr + Debug + Copy + PartialEq>(
Ok(())
}
fn run_spirv<T: From<u8> + ze::SafeRepr + Copy + Debug>(
fn run_spirv<
Input: From<u8> + ze::SafeRepr + Copy + Debug,
Output: From<u8> + ze::SafeRepr + Copy + Debug,
>(
name: &CStr,
module: translate::Module,
input: &[T],
output: &mut [T],
) -> ze::Result<Vec<T>> {
input: &[Input],
output: &mut [Output],
) -> ze::Result<Vec<Output>> {
ze::init()?;
let spirv = module.spirv.assemble();
let byte_il = unsafe {
@ -237,15 +249,15 @@ fn run_spirv<T: From<u8> + ze::SafeRepr + Copy + Debug>(
kernel.set_indirect_access(
ze::sys::ze_kernel_indirect_access_flags_t::ZE_KERNEL_INDIRECT_ACCESS_FLAG_DEVICE,
)?;
let mut inp_b = ze::DeviceBuffer::<T>::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
let mut out_b = ze::DeviceBuffer::<T>::new(&mut ctx, &dev, cmp::max(output.len(), 1))?;
let inp_b_ptr_mut: ze::BufferPtrMut<T> = (&mut inp_b).into();
let mut inp_b = ze::DeviceBuffer::<Input>::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
let mut out_b = ze::DeviceBuffer::<Output>::new(&mut ctx, &dev, cmp::max(output.len(), 1))?;
let inp_b_ptr_mut: ze::BufferPtrMut<Input> = (&mut inp_b).into();
let event_pool = ze::EventPool::new(&mut ctx, 3, Some(&[&dev]))?;
let ev0 = ze::Event::new(&event_pool, 0)?;
let ev1 = ze::Event::new(&event_pool, 1)?;
let mut ev2 = ze::Event::new(&event_pool, 2)?;
let mut cmd_list = ze::CommandList::new(&mut ctx, &dev)?;
let out_b_ptr_mut: ze::BufferPtrMut<T> = (&mut out_b).into();
let out_b_ptr_mut: ze::BufferPtrMut<Output> = (&mut out_b).into();
let mut init_evs = [ev0, ev1];
cmd_list.append_memory_copy(inp_b_ptr_mut, input, Some(&mut init_evs[0]), &mut [])?;
cmd_list.append_memory_fill(out_b_ptr_mut, 0, Some(&mut init_evs[1]), &mut [])?;
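The two new `setp` kernels below, together with the `test_ptx!(setp_gt, ...)` and `test_ptx!(setp_leu, ...)` entries above, check NaN handling in ordered versus unordered float comparisons; in the SPIR-V they become `OpFOrdGreaterThan` and `OpFUnordLessThanEqual`. A plain-Rust sketch of the semantics being tested:

```
fn main() {
    let nan = f32::NAN;
    // setp.gt is an ordered comparison: anything involving NaN is false,
    // so setp_gt with inputs (NaN, 1.0) falls through to the second operand.
    assert!(!(nan > 1.0));
    // setp.leu is unordered less-or-equal: true whenever either operand is
    // NaN, so setp_leu with inputs (1.0, NaN) keeps the first operand.
    let unordered_leu = |a: f32, b: f32| a <= b || a.is_nan() || b.is_nan();
    assert!(unordered_leu(1.0, nan));
}
```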


@ -0,0 +1,27 @@
.version 6.5
.target sm_30
.address_size 64
.visible .entry setp_gt(
.param .u64 input,
.param .u64 output
)
{
.reg .u64 in_addr;
.reg .u64 out_addr;
.reg .f32 r1;
.reg .f32 r2;
.reg .f32 r3;
.reg .pred pred;
ld.param.u64 in_addr, [input];
ld.param.u64 out_addr, [output];
ld.f32 r1, [in_addr];
ld.f32 r2, [in_addr + 4];
setp.gt.ftz.f32 pred, r1, r2;
@pred mov.f32 r3, r1;
@!pred mov.f32 r3, r2;
st.f32 [out_addr], r3;
ret;
}


@ -0,0 +1,75 @@
OpCapability GenericPointer
OpCapability Linkage
OpCapability Addresses
OpCapability Kernel
OpCapability Int8
OpCapability Int16
OpCapability Int64
OpCapability Float16
OpCapability Float64
%40 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %1 "setp_gt"
%void = OpTypeVoid
%ulong = OpTypeInt 64 0
%43 = OpTypeFunction %void %ulong %ulong
%_ptr_Function_ulong = OpTypePointer Function %ulong
%float = OpTypeFloat 32
%_ptr_Function_float = OpTypePointer Function %float
%bool = OpTypeBool
%_ptr_Function_bool = OpTypePointer Function %bool
%_ptr_Generic_float = OpTypePointer Generic %float
%ulong_4 = OpConstant %ulong 4
%1 = OpFunction %void None %43
%14 = OpFunctionParameter %ulong
%15 = OpFunctionParameter %ulong
%38 = OpLabel
%2 = OpVariable %_ptr_Function_ulong Function
%3 = OpVariable %_ptr_Function_ulong Function
%4 = OpVariable %_ptr_Function_ulong Function
%5 = OpVariable %_ptr_Function_ulong Function
%6 = OpVariable %_ptr_Function_float Function
%7 = OpVariable %_ptr_Function_float Function
%8 = OpVariable %_ptr_Function_float Function
%9 = OpVariable %_ptr_Function_bool Function
OpStore %2 %14
OpStore %3 %15
%16 = OpLoad %ulong %2 Aligned 8
OpStore %4 %16
%17 = OpLoad %ulong %3 Aligned 8
OpStore %5 %17
%19 = OpLoad %ulong %4
%35 = OpConvertUToPtr %_ptr_Generic_float %19
%18 = OpLoad %float %35 Aligned 4
OpStore %6 %18
%21 = OpLoad %ulong %4
%34 = OpIAdd %ulong %21 %ulong_4
%36 = OpConvertUToPtr %_ptr_Generic_float %34
%20 = OpLoad %float %36 Aligned 4
OpStore %7 %20
%23 = OpLoad %float %6
%24 = OpLoad %float %7
%22 = OpFOrdGreaterThan %bool %23 %24
OpStore %9 %22
%25 = OpLoad %bool %9
OpBranchConditional %25 %10 %11
%10 = OpLabel
%27 = OpLoad %float %6
%26 = OpCopyObject %float %27
OpStore %8 %26
OpBranch %11
%11 = OpLabel
%28 = OpLoad %bool %9
OpBranchConditional %28 %13 %12
%12 = OpLabel
%30 = OpLoad %float %7
%29 = OpCopyObject %float %30
OpStore %8 %29
OpBranch %13
%13 = OpLabel
%31 = OpLoad %ulong %5
%32 = OpLoad %float %8
%37 = OpConvertUToPtr %_ptr_Generic_float %31
OpStore %37 %32 Aligned 4
OpReturn
OpFunctionEnd


@ -0,0 +1,27 @@
.version 6.5
.target sm_30
.address_size 64
.visible .entry setp_leu(
.param .u64 input,
.param .u64 output
)
{
.reg .u64 in_addr;
.reg .u64 out_addr;
.reg .f32 r1;
.reg .f32 r2;
.reg .f32 r3;
.reg .pred pred;
ld.param.u64 in_addr, [input];
ld.param.u64 out_addr, [output];
ld.f32 r1, [in_addr];
ld.f32 r2, [in_addr + 4];
setp.leu.ftz.f32 pred, r1, r2;
@pred mov.f32 r3, r1;
@!pred mov.f32 r3, r2;
st.f32 [out_addr], r3;
ret;
}


@ -0,0 +1,75 @@
OpCapability GenericPointer
OpCapability Linkage
OpCapability Addresses
OpCapability Kernel
OpCapability Int8
OpCapability Int16
OpCapability Int64
OpCapability Float16
OpCapability Float64
%40 = OpExtInstImport "OpenCL.std"
OpMemoryModel Physical64 OpenCL
OpEntryPoint Kernel %1 "setp_leu"
%void = OpTypeVoid
%ulong = OpTypeInt 64 0
%43 = OpTypeFunction %void %ulong %ulong
%_ptr_Function_ulong = OpTypePointer Function %ulong
%float = OpTypeFloat 32
%_ptr_Function_float = OpTypePointer Function %float
%bool = OpTypeBool
%_ptr_Function_bool = OpTypePointer Function %bool
%_ptr_Generic_float = OpTypePointer Generic %float
%ulong_4 = OpConstant %ulong 4
%1 = OpFunction %void None %43
%14 = OpFunctionParameter %ulong
%15 = OpFunctionParameter %ulong
%38 = OpLabel
%2 = OpVariable %_ptr_Function_ulong Function
%3 = OpVariable %_ptr_Function_ulong Function
%4 = OpVariable %_ptr_Function_ulong Function
%5 = OpVariable %_ptr_Function_ulong Function
%6 = OpVariable %_ptr_Function_float Function
%7 = OpVariable %_ptr_Function_float Function
%8 = OpVariable %_ptr_Function_float Function
%9 = OpVariable %_ptr_Function_bool Function
OpStore %2 %14
OpStore %3 %15
%16 = OpLoad %ulong %2 Aligned 8
OpStore %4 %16
%17 = OpLoad %ulong %3 Aligned 8
OpStore %5 %17
%19 = OpLoad %ulong %4
%35 = OpConvertUToPtr %_ptr_Generic_float %19
%18 = OpLoad %float %35 Aligned 4
OpStore %6 %18
%21 = OpLoad %ulong %4
%34 = OpIAdd %ulong %21 %ulong_4
%36 = OpConvertUToPtr %_ptr_Generic_float %34
%20 = OpLoad %float %36 Aligned 4
OpStore %7 %20
%23 = OpLoad %float %6
%24 = OpLoad %float %7
%22 = OpFUnordLessThanEqual %bool %23 %24
OpStore %9 %22
%25 = OpLoad %bool %9
OpBranchConditional %25 %10 %11
%10 = OpLabel
%27 = OpLoad %float %6
%26 = OpCopyObject %float %27
OpStore %8 %26
OpBranch %11
%11 = OpLabel
%28 = OpLoad %bool %9
OpBranchConditional %28 %13 %12
%12 = OpLabel
%30 = OpLoad %float %7
%29 = OpCopyObject %float %30
OpStore %8 %29
OpBranch %13
%13 = OpLabel
%31 = OpLoad %ulong %5
%32 = OpLoad %float %8
%37 = OpConvertUToPtr %_ptr_Generic_float %31
OpStore %37 %32 Aligned 4
OpReturn
OpFunctionEnd
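The hunk below contains the translator fix itself: when both the source and destination scalar kinds are signed, the widening step has to use a signed conversion (SPIR-V `OpSConvert`, visible in the `cvt_s64_s32` output above) instead of the unconditional unsigned conversion used previously. A simplified, self-contained model of that selection (the type and function names here are illustrative, not ZLUDA's):

```
// Illustrative model only; ScalarKind and SpirvConversion are stand-ins.
#[derive(PartialEq)]
enum ScalarKind { Signed, Unsigned }

#[derive(Debug, PartialEq)]
enum SpirvConversion { SConvert, UConvert }

fn widening_conversion(from: ScalarKind, to: ScalarKind) -> SpirvConversion {
    // Sign-extend only when both sides are signed integers; everything else
    // keeps the zero-extending OpUConvert, as before this commit.
    if from == ScalarKind::Signed && to == ScalarKind::Signed {
        SpirvConversion::SConvert
    } else {
        SpirvConversion::UConvert
    }
}

fn main() {
    assert_eq!(
        widening_conversion(ScalarKind::Signed, ScalarKind::Signed),
        SpirvConversion::SConvert
    );
    assert_eq!(
        widening_conversion(ScalarKind::Unsigned, ScalarKind::Unsigned),
        SpirvConversion::UConvert
    );
}
```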


@ -4087,8 +4087,15 @@ fn emit_implicit_conversion(
{
builder.u_convert(wide_bit_type_spirv, Some(cv.dst), same_width_bit_value)?;
} else {
let conversion_fn = if from_parts.scalar_kind == ScalarKind::Signed
&& to_parts.scalar_kind == ScalarKind::Signed
{
dr::Builder::s_convert
} else {
dr::Builder::u_convert
};
let wide_bit_value =
builder.u_convert(wide_bit_type_spirv, None, same_width_bit_value)?;
conversion_fn(builder, wide_bit_type_spirv, None, same_width_bit_value)?;
emit_implicit_conversion(
builder,
map,


@ -50,6 +50,14 @@ def parse_arguments(dump_path, prefix):
arg_files = os.listdir(dir)
return [load_arguments(path.join(dir, f)) for f in sorted(arg_files)]
def append_debug_buffer(args):
args = list(args)
debug_buff = np.zeros(1024 * 1024, np.single)
args.append((drv.InOut(debug_buff), debug_buff))
return args
def verify_single_dump(input_path, max_block_threads):
print(input_path)
kernel_name = path.basename(input_path).split("_", 1)[1]
@ -58,11 +66,12 @@ def verify_single_dump(input_path, max_block_threads):
block = tuple(launch_lines[3:6])
launch_block_size = block[0] * block[1] * block[2]
if launch_block_size > max_block_threads:
print(f" Skipping, launch block size ({launch_block_size}) bigger than maximum block size ({max_block_threads})")
print(
f" Skipping, launch block size ({launch_block_size}) bigger than maximum block size ({max_block_threads})")
return
module = drv.module_from_file(path.join(input_path, "module.ptx"))
kernel = module.get_function(kernel_name)
pre_args = parse_arguments(input_path, "pre")
pre_args = append_debug_buffer(parse_arguments(input_path, "pre"))
kernel_pre_args, host_pre_args = zip(*pre_args)
kernel(*list(kernel_pre_args), grid=tuple(launch_lines[:3]), block=block, shared=launch_lines[6])
post_args = parse_arguments(input_path, "post")
@ -75,6 +84,7 @@ def verify_single_dump(input_path, max_block_threads):
except Exception as e:
print(f"{idx}: {e}")
def main(argv):
device = drv.Device(0)
max_threads = device.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)