diff --git a/GeekBench_5_2_3.svg b/GeekBench_5_2_3.svg
index 1bc47be..900e2b1 100644
--- a/GeekBench_5_2_3.svg
+++ b/GeekBench_5_2_3.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/README.md b/README.md
index 8d9b05a..a6ea202 100644
--- a/README.md
+++ b/README.md
@@ -11,23 +11,22 @@ Performance below is normalized to OpenCL performance. 110% means that ZLUDA-imp
 ![Performance graph](GeekBench_5_2_3.svg)
-[ZLUDA detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918048)
+[ZLUDA - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2305009)
-[OpenCL detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918080)
+[OpenCL - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2304997)
-Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
+Overall, ZLUDA is slower in GeekBench by roughly 2%.
 ### Explanation of the results
- * Why is ZLUDA faster in Stereo Matching, Gaussian Blur and Depth of Field?\
+ * Why is ZLUDA faster in some benchmarks?\
    This has not been precisely pinpointed to one thing or another but it's likely a combination of things:
-   * ZLUDA uses Level 0, which in general is a more level, higher performance API
+   * ZLUDA uses [Level 0](https://spec.oneapi.com/level-zero/latest/index.html), which in general is a more low level, high performance API than OpenCL
    * Tying to the previous point, currently ZLUDA does not support asynchronous execution. This gives us an unfair advantage in a benchmark like GeekBench. GeekBench exclusively uses CUDA synchronous APIs
    * There is a set of GPU instructions which are available on both NVIDIA hardware and Intel hardware, but are not exposed through OpenCL. We are comparing NVIDIA GPU optimized code with the more general OpenCL code. It's a lucky coincidence (and a credit to the underlying Intel Graphics Compiler) that this code also works well on an Intel GPU
  * Why is OpenCL faster in Canny and Horizon Detection?\
    Authors of CUDA benchmarks used CUDA functions `atomicInc` and `atomicDec` which have direct hardware support on NVIDIA cards, but no hardware support on Intel cards. They have to be emulated in software, which limits performance
- * Why are some benchmarks failing?\
-   ZLUDA itself supports all the operations used in the failing benchmarks. From the limited debugging that has been done so far, the problem is most likely somewhere else. Intel GPU compiler stack is very capable when it comes to compiling OpenCL, C for Metal and DPC++. It's not yet very good at compiling ZLUDA. ZLUDA emits code patterns never seen before by the Intel GPU compiler stack and hits some rarely used (or not used before) code paths in the compiler.\
-   Current status of failing GeekBench tests is tracked [here](https://github.com/vosen/ZLUDA/pull/12)
+ * Why is ZLUDA slower in the remaining benchmarks?\
+   The reason is unknown. Most likely, in some tests we compile from suboptimal NVIDIA GPU code and in other tests ZLUDA itself is emitting suboptimal Intel GPU code. For example, SFFT used to be even slower before PR [#22](https://github.com/vosen/ZLUDA/pull/22)
 ## Details
@@ -35,7 +34,7 @@ Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
  * Is ZLUDA a drop-in replacement for CUDA?\
    Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA
  * What is the status of the project?\
-   This project is a Proof of Concept. About the only thing that works currently is Geekbench (and not even completely). It's amazingly buggy and incomplete. You should not rely on it for anything serious
+   This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious
  * Is it an Intel project? Is it an NVIDIA project?\
    No, it's a private project
  * What is the performance?\
@@ -56,8 +55,8 @@ You should have the most recent Intel GPU drivers installed.\
 Copy `nvcuda.dll` to the application directory (the directory where .exe file is) and launch it normally
 ### Linux
-A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of the writing 20.45.18403 is the losest recommended version.
-Unpack the archive somewhere and run your application like this:
+A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of the writing 20.45.18403 is the oldest recommended version.
+Run your application like this:
 ```
 LD_LIBRARY_PATH= 
 ```
diff --git a/ptx/src/test/spirv_run/cvt_rzi.ptx b/ptx/src/test/spirv_run/cvt_rzi.ptx
new file mode 100644
index 0000000..ba5cc0e
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_rzi.ptx
@@ -0,0 +1,25 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_rzi(
+    .param .u64 input,
+    .param .u64 output
+)
+{
+    .reg .u64 in_addr;
+    .reg .u64 out_addr;
+    .reg .f32 temp1;
+    .reg .f32 temp2;
+
+    ld.param.u64 in_addr, [input];
+    ld.param.u64 out_addr, [output];
+
+    ld.f32 temp1, [in_addr];
+    ld.f32 temp2, [in_addr+4];
+    cvt.rzi.f32.f32 temp1, temp1;
+    cvt.rzi.f32.f32 temp2, temp2;
+    st.f32 [out_addr], temp1;
+    st.f32 [out_addr+4], temp2;
+    ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_rzi.spvtxt b/ptx/src/test/spirv_run/cvt_rzi.spvtxt
new file mode 100644
index 0000000..68c12c6
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_rzi.spvtxt
@@ -0,0 +1,63 @@
+               OpCapability GenericPointer
+               OpCapability Linkage
+               OpCapability Addresses
+               OpCapability Kernel
+               OpCapability Int8
+               OpCapability Int16
+               OpCapability Int64
+               OpCapability Float16
+               OpCapability Float64
+         %34 = OpExtInstImport "OpenCL.std"
+               OpMemoryModel Physical64 OpenCL
+               OpEntryPoint Kernel %1 "cvt_rzi"
+       %void = OpTypeVoid
+      %ulong = OpTypeInt 64 0
+         %37 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+      %float = OpTypeFloat 32
+%_ptr_Function_float = OpTypePointer Function %float
+%_ptr_Generic_float = OpTypePointer Generic %float
+    %ulong_4 = OpConstant %ulong 4
+  %ulong_4_0 = OpConstant %ulong 4
+          %1 = OpFunction %void None %37
+          %8 = OpFunctionParameter %ulong
+          %9 = OpFunctionParameter %ulong
+         %32 = OpLabel
+          %2 = OpVariable %_ptr_Function_ulong Function
+          %3 = OpVariable %_ptr_Function_ulong Function
+          %4 = OpVariable %_ptr_Function_ulong Function
+          %5 = OpVariable %_ptr_Function_ulong Function
+          %6 = OpVariable %_ptr_Function_float Function
+          %7 = OpVariable %_ptr_Function_float Function
+               OpStore %2 %8
+               OpStore %3 %9
+         %10 = OpLoad %ulong %2 Aligned 8
+               OpStore %4 %10
+         %11 = OpLoad %ulong %3 Aligned 8
+               OpStore %5 %11
+         %13 = OpLoad %ulong %4
+         %28 = OpConvertUToPtr %_ptr_Generic_float %13
+         %12 = OpLoad %float %28 Aligned 4
+               OpStore %6 %12
+         %15 = OpLoad %ulong %4
+         %25 = OpIAdd %ulong %15 %ulong_4
+         %29 = OpConvertUToPtr %_ptr_Generic_float %25
+         %14 = OpLoad %float %29 Aligned 4
+               OpStore %7 %14
+         %17 = OpLoad %float %6
+         %16 = OpExtInst %float %34 trunc %17
+               OpStore %6 %16
+         %19 = OpLoad %float %7
+         %18 = OpExtInst %float %34 trunc %19
+               OpStore %7 %18
+         %20 = OpLoad %ulong %5
+         %21 = OpLoad %float %6
+         %30 = OpConvertUToPtr %_ptr_Generic_float %20
+               OpStore %30 %21 Aligned 4
+         %22 = OpLoad %ulong %5
+         %23 = OpLoad %float %7
+         %27 = OpIAdd %ulong %22 %ulong_4_0
+         %31 = OpConvertUToPtr %_ptr_Generic_float %27
+               OpStore %31 %23 Aligned 4
+               OpReturn
+               OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/cvt_s32_f32.ptx b/ptx/src/test/spirv_run/cvt_s32_f32.ptx
new file mode 100644
index 0000000..d432a91
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s32_f32.ptx
@@ -0,0 +1,25 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_s32_f32(
+    .param .u64 input,
+    .param .u64 output
+)
+{
+    .reg .u64 in_addr;
+    .reg .u64 out_addr;
+    .reg .b32 temp1;
+    .reg .b32 temp2;
+
+    ld.param.u64 in_addr, [input];
+    ld.param.u64 out_addr, [output];
+
+    ld.f32 temp1, [in_addr];
+    ld.f32 temp2, [in_addr+4];
+    cvt.rpi.ftz.s32.f32 temp1, temp1;
+    cvt.rpi.ftz.s32.f32 temp2, temp2;
+    st.global.s32 [out_addr], temp1;
+    st.global.s32 [out_addr+4], temp2;
+    ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt b/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt
new file mode 100644
index 0000000..d9ae053
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt
@@ -0,0 +1,75 @@
+               OpCapability GenericPointer
+               OpCapability Linkage
+               OpCapability Addresses
+               OpCapability Kernel
+               OpCapability Int8
+               OpCapability Int16
+               OpCapability Int64
+               OpCapability Float16
+               OpCapability Float64
+         %42 = OpExtInstImport "OpenCL.std"
+               OpMemoryModel Physical64 OpenCL
+               OpEntryPoint Kernel %1 "cvt_s32_f32"
+               OpDecorate %32 FPRoundingMode RTP
+               OpDecorate %34 FPRoundingMode RTP
+       %void = OpTypeVoid
+      %ulong = OpTypeInt 64 0
+         %45 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+       %uint = OpTypeInt 32 0
+%_ptr_Function_uint = OpTypePointer Function %uint
+      %float = OpTypeFloat 32
+%_ptr_Generic_float = OpTypePointer Generic %float
+    %ulong_4 = OpConstant %ulong 4
+%_ptr_CrossWorkgroup_uint = OpTypePointer CrossWorkgroup %uint
+  %ulong_4_0 = OpConstant %ulong 4
+          %1 = OpFunction %void None %45
+          %8 = OpFunctionParameter %ulong
+          %9 = OpFunctionParameter %ulong
+         %40 = OpLabel
+          %2 = OpVariable %_ptr_Function_ulong Function
+          %3 = OpVariable %_ptr_Function_ulong Function
+          %4 = OpVariable %_ptr_Function_ulong Function
+          %5 = OpVariable %_ptr_Function_ulong Function
+          %6 = OpVariable %_ptr_Function_uint Function
+          %7 = OpVariable %_ptr_Function_uint Function
+               OpStore %2 %8
+               OpStore %3 %9
+         %10 = OpLoad %ulong %2 Aligned 8
+               OpStore %4 %10
+         %11 = OpLoad %ulong %3 Aligned 8
+               OpStore %5 %11
+         %13 = OpLoad %ulong %4
+         %29 = OpConvertUToPtr %_ptr_Generic_float %13
+         %28 = OpLoad %float %29 Aligned 4
+         %12 = OpBitcast %uint %28
+               OpStore %6 %12
+         %15 = OpLoad %ulong %4
+         %25 = OpIAdd %ulong %15 %ulong_4
+         %31 = OpConvertUToPtr %_ptr_Generic_float %25
+         %30 = OpLoad %float %31 Aligned 4
+         %14 = OpBitcast %uint %30
+               OpStore %7 %14
+         %17 = OpLoad %uint %6
+         %33 = OpBitcast %float %17
+         %32 = OpConvertFToS %uint %33
+         %16 = OpCopyObject %uint %32
+               OpStore %6 %16
+         %19 = OpLoad %uint %7
+         %35 = OpBitcast %float %19
+         %34 = OpConvertFToS %uint %35
+         %18 = OpCopyObject %uint %34
+               OpStore %7 %18
+         %20 = OpLoad %ulong %5
+         %21 = OpLoad %uint %6
+         %36 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %20
+         %37 = OpCopyObject %uint %21
+               OpStore %36 %37 Aligned 4
+         %22 = OpLoad %ulong %5
+         %23 = OpLoad %uint %7
+         %27 = OpIAdd %ulong %22 %ulong_4_0
+         %38 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %27
+         %39 = OpCopyObject %uint %23
+               OpStore %38 %39 Aligned 4
+               OpReturn
+               OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/cvt_s64_s32.ptx b/ptx/src/test/spirv_run/cvt_s64_s32.ptx
new file mode 100644
index 0000000..5242864
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s64_s32.ptx
@@ -0,0 +1,22 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_s64_s32(
+    .param .u64 input,
+    .param .u64 output
+)
+{
+    .reg .u64 in_addr;
+    .reg .u64 out_addr;
+    .reg .s32 r_32;
+    .reg .s64 r_64;
+
+    ld.param.u64 in_addr, [input];
+    ld.param.u64 out_addr, [output];
+
+    ld.b32 r_32, [in_addr];
+    cvt.s64.s32 r_64, r_32;
+    st.b64 [out_addr], r_64;
+    ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt b/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt
new file mode 100644
index 0000000..3f46103
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt
@@ -0,0 +1,53 @@
+               OpCapability GenericPointer
+               OpCapability Linkage
+               OpCapability Addresses
+               OpCapability Kernel
+               OpCapability Int8
+               OpCapability Int16
+               OpCapability Int64
+               OpCapability Float16
+               OpCapability Float64
+         %24 = OpExtInstImport "OpenCL.std"
+               OpMemoryModel Physical64 OpenCL
+               OpEntryPoint Kernel %1 "cvt_s64_s32"
+       %void = OpTypeVoid
+      %ulong = OpTypeInt 64 0
+         %27 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+       %uint = OpTypeInt 32 0
+%_ptr_Function_uint = OpTypePointer Function %uint
+%_ptr_Generic_uint = OpTypePointer Generic %uint
+%_ptr_Generic_ulong = OpTypePointer Generic %ulong
+          %1 = OpFunction %void None %27
+          %8 = OpFunctionParameter %ulong
+          %9 = OpFunctionParameter %ulong
+         %22 = OpLabel
+          %2 = OpVariable %_ptr_Function_ulong Function
+          %3 = OpVariable %_ptr_Function_ulong Function
+          %4 = OpVariable %_ptr_Function_ulong Function
+          %5 = OpVariable %_ptr_Function_ulong Function
+          %6 = OpVariable %_ptr_Function_uint Function
+          %7 = OpVariable %_ptr_Function_ulong Function
+               OpStore %2 %8
+               OpStore %3 %9
+         %10 = OpLoad %ulong %2 Aligned 8
+               OpStore %4 %10
+         %11 = OpLoad %ulong %3 Aligned 8
+               OpStore %5 %11
+         %13 = OpLoad %ulong %4
+         %19 = OpConvertUToPtr %_ptr_Generic_uint %13
+         %18 = OpLoad %uint %19 Aligned 4
+         %12 = OpCopyObject %uint %18
+               OpStore %6 %12
+         %15 = OpLoad %uint %6
+         %32 = OpBitcast %uint %15
+         %33 = OpSConvert %ulong %32
+         %14 = OpCopyObject %ulong %33
+               OpStore %7 %14
+         %16 = OpLoad %ulong %5
+         %17 = OpLoad %ulong %7
+         %20 = OpConvertUToPtr %_ptr_Generic_ulong %16
+         %21 = OpCopyObject %ulong %17
+               OpStore %20 %21 Aligned 8
+               OpReturn
+               OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/mod.rs b/ptx/src/test/spirv_run/mod.rs
index 86f9c16..3976c76 100644
--- a/ptx/src/test/spirv_run/mod.rs
+++ b/ptx/src/test/spirv_run/mod.rs
@@ -49,6 +49,8 @@ test_ptx!(mul_lo, [1u64], [2u64]);
 test_ptx!(mul_hi, [u64::max_value()], [1u64]);
 test_ptx!(add, [1u64], [2u64]);
 test_ptx!(setp, [10u64, 11u64], [1u64, 0u64]);
+test_ptx!(setp_gt, [f32::NAN, 1f32], [1f32]);
+test_ptx!(setp_leu, [1f32, f32::NAN], [1f32]);
 test_ptx!(bra, [10u64], [11u64]);
 test_ptx!(not, [0u64], [u64::max_value()]);
 test_ptx!(shl, [11u64], [44u64]);
@@ -114,6 +116,8 @@ test_ptx!(cos, [std::f32::consts::PI], [-1f32]);
 test_ptx!(lg2, [512f32], [9f32]);
 test_ptx!(ex2, [10f32], [1024f32]);
 test_ptx!(cvt_rni, [9.5f32, 10.5f32], [10f32, 10f32]);
+test_ptx!(cvt_rzi, [-13.8f32, 12.9f32], [-13f32, 13f32]);
+test_ptx!(cvt_s32_f32, [-13.8f32, 12.9f32], [-13i32, 13i32]);
 test_ptx!(clz, [0b00000101_00101101_00010011_10101011u32], [5u32]);
 test_ptx!(popc, [0b10111100_10010010_01001001_10001010u32], [14u32]);
 test_ptx!(
@@ -140,8 +144,9 @@ test_ptx!(stateful_ld_st_ntid, [123u64], [123u64]);
 test_ptx!(stateful_ld_st_ntid_chain, [12651u64], [12651u64]);
 test_ptx!(stateful_ld_st_ntid_sub, [96311u64], [96311u64]);
 test_ptx!(shared_ptr_take_address, [97815231u64], [97815231u64]);
-// For now, we just that it builds and links
+// For now, we just make sure that it builds and links
 test_ptx!(assertfail, [716523871u64], [716523872u64]);
+test_ptx!(cvt_s64_s32, [-1i32], [-1i64]);
 struct DisplayError {
     err: T,
@@ -161,11 +166,15 @@ impl Debug for DisplayError {
 impl error::Error for DisplayError {}
-fn test_ptx_assert<'a, T: From + ze::SafeRepr + Debug + Copy + PartialEq>(
+fn test_ptx_assert<
+    'a,
+    Input: From + ze::SafeRepr + Debug + Copy + PartialEq,
+    Output: From + ze::SafeRepr + Debug + Copy + PartialEq,
+>(
     name: &str,
     ptx_text: &'a str,
-    input: &[T],
-    output: &mut [T],
+    input: &[Input],
+    output: &mut [Output],
 ) -> Result<(), Box> {
     let mut errors = Vec::new();
     let ast = ptx::ModuleParser::new().parse(&mut errors, ptx_text)?;
@@ -178,12 +187,15 @@ fn test_ptx_assert<'a, T: From + ze::SafeRepr + Debug + Copy + PartialEq>(
     Ok(())
 }
-fn run_spirv + ze::SafeRepr + Copy + Debug>(
+fn run_spirv<
+    Input: From + ze::SafeRepr + Copy + Debug,
+    Output: From + ze::SafeRepr + Copy + Debug,
+>(
     name: &CStr,
     module: translate::Module,
-    input: &[T],
-    output: &mut [T],
-) -> ze::Result> {
+    input: &[Input],
+    output: &mut [Output],
+) -> ze::Result> {
     ze::init()?;
     let spirv = module.spirv.assemble();
     let byte_il = unsafe {
@@ -237,15 +249,15 @@ fn run_spirv + ze::SafeRepr + Copy + Debug>(
     kernel.set_indirect_access(
         ze::sys::ze_kernel_indirect_access_flags_t::ZE_KERNEL_INDIRECT_ACCESS_FLAG_DEVICE,
     )?;
-    let mut inp_b = ze::DeviceBuffer::::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
-    let mut out_b = ze::DeviceBuffer::::new(&mut ctx, &dev, cmp::max(output.len(), 1))?;
-    let inp_b_ptr_mut: ze::BufferPtrMut = (&mut inp_b).into();
+    let mut inp_b = ze::DeviceBuffer::::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
+    let mut out_b = ze::DeviceBuffer::::new(&mut ctx, &dev, cmp::max(output.len(), 1))?;
+    let inp_b_ptr_mut: ze::BufferPtrMut = (&mut inp_b).into();
     let event_pool = ze::EventPool::new(&mut ctx, 3, Some(&[&dev]))?;
     let ev0 = ze::Event::new(&event_pool, 0)?;
     let ev1 = ze::Event::new(&event_pool, 1)?;
     let mut ev2 = ze::Event::new(&event_pool, 2)?;
     let mut cmd_list = ze::CommandList::new(&mut ctx, &dev)?;
-    let out_b_ptr_mut: ze::BufferPtrMut = (&mut out_b).into();
+    let out_b_ptr_mut: ze::BufferPtrMut = (&mut out_b).into();
     let mut init_evs = [ev0, ev1];
     cmd_list.append_memory_copy(inp_b_ptr_mut, input, Some(&mut init_evs[0]), &mut [])?;
     cmd_list.append_memory_fill(out_b_ptr_mut, 0, Some(&mut init_evs[1]), &mut [])?;
diff --git a/ptx/src/test/spirv_run/setp_gt.ptx b/ptx/src/test/spirv_run/setp_gt.ptx
new file mode 100644
index 0000000..5f45300
--- /dev/null
+++ b/ptx/src/test/spirv_run/setp_gt.ptx
@@ -0,0 +1,27 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry setp_gt(
+    .param .u64 input,
+    .param .u64 output
+)
+{
+    .reg .u64 in_addr;
+    .reg .u64 out_addr;
+    .reg .f32 r1;
+    .reg .f32 r2;
+    .reg .f32 r3;
+    .reg .pred pred;
+
+    ld.param.u64 in_addr, [input];
+    ld.param.u64 out_addr, [output];
+
+    ld.f32 r1, [in_addr];
+    ld.f32 r2, [in_addr + 4];
+    setp.gt.ftz.f32 pred, r1, r2;
+    @pred mov.f32 r3, r1;
+    @!pred mov.f32 r3, r2;
+    st.f32 [out_addr], r3;
+    ret;
+}
diff --git a/ptx/src/test/spirv_run/setp_gt.spvtxt b/ptx/src/test/spirv_run/setp_gt.spvtxt
new file mode 100644
index 0000000..77f6546
--- /dev/null
+++ b/ptx/src/test/spirv_run/setp_gt.spvtxt
@@ -0,0 +1,75 @@
+               OpCapability GenericPointer
+               OpCapability Linkage
+               OpCapability Addresses
+               OpCapability Kernel
+               OpCapability Int8
+               OpCapability Int16
+               OpCapability Int64
+               OpCapability Float16
+               OpCapability Float64
+         %40 = OpExtInstImport "OpenCL.std"
+               OpMemoryModel Physical64 OpenCL
+               OpEntryPoint Kernel %1 "setp_gt"
+       %void = OpTypeVoid
+      %ulong = OpTypeInt 64 0
+         %43 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+      %float = OpTypeFloat 32
+%_ptr_Function_float = OpTypePointer Function %float
+       %bool = OpTypeBool
+%_ptr_Function_bool = OpTypePointer Function %bool
+%_ptr_Generic_float = OpTypePointer Generic %float
+    %ulong_4 = OpConstant %ulong 4
+          %1 = OpFunction %void None %43
+         %14 = OpFunctionParameter %ulong
+         %15 = OpFunctionParameter %ulong
+         %38 = OpLabel
+          %2 = OpVariable %_ptr_Function_ulong Function
+          %3 = OpVariable %_ptr_Function_ulong Function
+          %4 = OpVariable %_ptr_Function_ulong Function
+          %5 = OpVariable %_ptr_Function_ulong Function
+          %6 = OpVariable %_ptr_Function_float Function
+          %7 = OpVariable %_ptr_Function_float Function
+          %8 = OpVariable %_ptr_Function_float Function
+          %9 = OpVariable %_ptr_Function_bool Function
+               OpStore %2 %14
+               OpStore %3 %15
+         %16 = OpLoad %ulong %2 Aligned 8
+               OpStore %4 %16
+         %17 = OpLoad %ulong %3 Aligned 8
+               OpStore %5 %17
+         %19 = OpLoad %ulong %4
+         %35 = OpConvertUToPtr %_ptr_Generic_float %19
+         %18 = OpLoad %float %35 Aligned 4
+               OpStore %6 %18
+         %21 = OpLoad %ulong %4
+         %34 = OpIAdd %ulong %21 %ulong_4
+         %36 = OpConvertUToPtr %_ptr_Generic_float %34
+         %20 = OpLoad %float %36 Aligned 4
+               OpStore %7 %20
+         %23 = OpLoad %float %6
+         %24 = OpLoad %float %7
+         %22 = OpFOrdGreaterThan %bool %23 %24
+               OpStore %9 %22
+         %25 = OpLoad %bool %9
+               OpBranchConditional %25 %10 %11
+         %10 = OpLabel
+         %27 = OpLoad %float %6
+         %26 = OpCopyObject %float %27
+               OpStore %8 %26
+               OpBranch %11
+         %11 = OpLabel
+         %28 = OpLoad %bool %9
+               OpBranchConditional %28 %13 %12
+         %12 = OpLabel
+         %30 = OpLoad %float %7
+         %29 = OpCopyObject %float %30
+               OpStore %8 %29
+               OpBranch %13
+         %13 = OpLabel
+         %31 = OpLoad %ulong %5
+         %32 = OpLoad %float %8
+         %37 = OpConvertUToPtr %_ptr_Generic_float %31
+               OpStore %37 %32 Aligned 4
+               OpReturn
+               OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/setp_leu.ptx b/ptx/src/test/spirv_run/setp_leu.ptx
new file mode 100644
index 0000000..be7538a
--- /dev/null
+++ b/ptx/src/test/spirv_run/setp_leu.ptx
@@ -0,0 +1,27 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry setp_leu(
+    .param .u64 input,
+    .param .u64 output
+)
+{
+    .reg .u64 in_addr;
+    .reg .u64 out_addr;
+    .reg .f32 r1;
+    .reg .f32 r2;
+    .reg .f32 r3;
+    .reg .pred pred;
+
+    ld.param.u64 in_addr, [input];
+    ld.param.u64 out_addr, [output];
+
+    ld.f32 r1, [in_addr];
+    ld.f32 r2, [in_addr + 4];
+    setp.leu.ftz.f32 pred, r1, r2;
+    @pred mov.f32 r3, r1;
+    @!pred mov.f32 r3, r2;
+    st.f32 [out_addr], r3;
+    ret;
+}
diff --git a/ptx/src/test/spirv_run/setp_leu.spvtxt b/ptx/src/test/spirv_run/setp_leu.spvtxt
new file mode 100644
index 0000000..f80880a
--- /dev/null
+++ b/ptx/src/test/spirv_run/setp_leu.spvtxt
@@ -0,0 +1,75 @@
+               OpCapability GenericPointer
+               OpCapability Linkage
+               OpCapability Addresses
+               OpCapability Kernel
+               OpCapability Int8
+               OpCapability Int16
+               OpCapability Int64
+               OpCapability Float16
+               OpCapability Float64
+         %40 = OpExtInstImport "OpenCL.std"
+               OpMemoryModel Physical64 OpenCL
+               OpEntryPoint Kernel %1 "setp_leu"
+       %void = OpTypeVoid
+      %ulong = OpTypeInt 64 0
+         %43 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+      %float = OpTypeFloat 32
+%_ptr_Function_float = OpTypePointer Function %float
+       %bool = OpTypeBool
+%_ptr_Function_bool = OpTypePointer Function %bool
+%_ptr_Generic_float = OpTypePointer Generic %float
+    %ulong_4 = OpConstant %ulong 4
+          %1 = OpFunction %void None %43
+         %14 = OpFunctionParameter %ulong
+         %15 = OpFunctionParameter %ulong
+         %38 = OpLabel
+          %2 = OpVariable %_ptr_Function_ulong Function
+          %3 = OpVariable %_ptr_Function_ulong Function
+          %4 = OpVariable %_ptr_Function_ulong Function
+          %5 = OpVariable %_ptr_Function_ulong Function
+          %6 = OpVariable %_ptr_Function_float Function
+          %7 = OpVariable %_ptr_Function_float Function
+          %8 = OpVariable %_ptr_Function_float Function
+          %9 = OpVariable %_ptr_Function_bool Function
+               OpStore %2 %14
+               OpStore %3 %15
+         %16 = OpLoad %ulong %2 Aligned 8
+               OpStore %4 %16
+         %17 = OpLoad %ulong %3 Aligned 8
+               OpStore %5 %17
+         %19 = OpLoad %ulong %4
+         %35 = OpConvertUToPtr %_ptr_Generic_float %19
+         %18 = OpLoad %float %35 Aligned 4
+               OpStore %6 %18
+         %21 = OpLoad %ulong %4
+         %34 = OpIAdd %ulong %21 %ulong_4
+         %36 = OpConvertUToPtr %_ptr_Generic_float %34
+         %20 = OpLoad %float %36 Aligned 4
+               OpStore %7 %20
+         %23 = OpLoad %float %6
+         %24 = OpLoad %float %7
+         %22 = OpFUnordLessThanEqual %bool %23 %24
+               OpStore %9 %22
+         %25 = OpLoad %bool %9
+               OpBranchConditional %25 %10 %11
+         %10 = OpLabel
+         %27 = OpLoad %float %6
+         %26 = OpCopyObject %float %27
+               OpStore %8 %26
+               OpBranch %11
+         %11 = OpLabel
+         %28 = OpLoad %bool %9
+               OpBranchConditional %28 %13 %12
+         %12 = OpLabel
+         %30 = OpLoad %float %7
+         %29 = OpCopyObject %float %30
+               OpStore %8 %29
+               OpBranch %13
+         %13 = OpLabel
+         %31 = OpLoad %ulong %5
+         %32 = OpLoad %float %8
+         %37 = OpConvertUToPtr %_ptr_Generic_float %31
+               OpStore %37 %32 Aligned 4
+               OpReturn
+               OpFunctionEnd
diff --git a/ptx/src/translate.rs b/ptx/src/translate.rs
index 471a2d7..18d750f 100644
--- a/ptx/src/translate.rs
+++ b/ptx/src/translate.rs
@@ -4087,8 +4087,15 @@ fn emit_implicit_conversion(
             {
                 builder.u_convert(wide_bit_type_spirv, Some(cv.dst), same_width_bit_value)?;
             } else {
+                let conversion_fn = if from_parts.scalar_kind == ScalarKind::Signed
+                    && to_parts.scalar_kind == ScalarKind::Signed
+                {
+                    dr::Builder::s_convert
+                } else {
+                    dr::Builder::u_convert
+                };
                 let wide_bit_value =
-                    builder.u_convert(wide_bit_type_spirv, None, same_width_bit_value)?;
+                    conversion_fn(builder, wide_bit_type_spirv, None, same_width_bit_value)?;
                 emit_implicit_conversion(
                     builder,
                     map,
diff --git a/zluda_dump/src/replay.py b/zluda_dump/src/replay.py
index 07c1780..9c78754 100644
--- a/zluda_dump/src/replay.py
+++ b/zluda_dump/src/replay.py
@@ -50,6 +50,14 @@ def parse_arguments(dump_path, prefix):
     arg_files = os.listdir(dir)
     return [load_arguments(path.join(dir, f)) for f in sorted(arg_files)]
+
+def append_debug_buffer(args):
+    args = list(args)
+    debug_buff = np.zeros(1024 * 1024, np.single)
+    args.append((drv.InOut(debug_buff), debug_buff))
+    return args
+
+
 def verify_single_dump(input_path, max_block_threads):
     print(input_path)
     kernel_name = path.basename(input_path).split("_", 1)[1]
@@ -58,11 +66,12 @@ def verify_single_dump(input_path, max_block_threads):
     block = tuple(launch_lines[3:6])
     launch_block_size = block[0] * block[1] * block[2]
     if launch_block_size > max_block_threads:
-        print(f" Skipping, launch block size ({launch_block_size}) bigger than maximum block size ({max_block_threads})")
+        print(
+            f" Skipping, launch block size ({launch_block_size}) bigger than maximum block size ({max_block_threads})")
         return
     module = drv.module_from_file(path.join(input_path, "module.ptx"))
     kernel = module.get_function(kernel_name)
-    pre_args = parse_arguments(input_path, "pre")
+    pre_args = append_debug_buffer(parse_arguments(input_path, "pre"))
     kernel_pre_args, host_pre_args = zip(*pre_args)
     kernel(*list(kernel_pre_args), grid=tuple(launch_lines[:3]), block=block, shared=launch_lines[6])
     post_args = parse_arguments(input_path, "post")
@@ -75,6 +84,7 @@ def verify_single_dump(input_path, max_block_threads):
         except Exception as e:
             print(f"{idx}: {e}")
+
 def main(argv):
     device = drv.Device(0)
     max_threads = device.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)