diff --git a/GeekBench_5_2_3.svg b/GeekBench_5_2_3.svg
index 1bc47be..900e2b1 100644
--- a/GeekBench_5_2_3.svg
+++ b/GeekBench_5_2_3.svg
@@ -1 +1 @@
-
\ No newline at end of file
+
\ No newline at end of file
diff --git a/README.md b/README.md
index 8d9b05a..a6ea202 100644
--- a/README.md
+++ b/README.md
@@ -11,23 +11,22 @@ Performance below is normalized to OpenCL performance. 110% means that ZLUDA-imp

-[ZLUDA detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918048)
+[ZLUDA - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2305009)
-[OpenCL detailed log on Geekbench.com](https://browser.geekbench.com/v5/compute/1918080)
+[OpenCL - detailed results on Geekbench.com](https://browser.geekbench.com/v5/compute/2304997)
-Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
+Overall, ZLUDA is slower in GeekBench by roughly 2%.
### Explanation of the results
- * Why is ZLUDA faster in Stereo Matching, Gaussian Blur and Depth of Field?\
+ * Why is ZLUDA faster in some benchmarks?\
This has not been precisely pinpointed to one thing or another but it's likely a combination of things:
- * ZLUDA uses Level 0, which in general is a more level, higher performance API
+ * ZLUDA uses [Level 0](https://spec.oneapi.com/level-zero/latest/index.html), which in general is a lower-level, higher-performance API than OpenCL
* Tying to the previous point, currently ZLUDA does not support asynchronous execution. This gives us an unfair advantage in a benchmark like GeekBench. GeekBench exclusively uses CUDA synchronous APIs
* There is a set of GPU instructions which are available on both NVIDIA hardware and Intel hardware, but are not exposed through OpenCL. We are comparing NVIDIA GPU optimized code with the more general OpenCL code. It's a lucky coincidence (and a credit to the underlying Intel Graphics Compiler) that this code also works well on an Intel GPU
* Why is OpenCL faster in Canny and Horizon Detection?\
Authors of CUDA benchmarks used CUDA functions `atomicInc` and `atomicDec` which have direct hardware support on NVIDIA cards, but no hardware support on Intel cards. They have to be emulated in software, which limits performance
- * Why are some benchmarks failing?\
- ZLUDA itself supports all the operations used in the failing benchmarks. From the limited debugging that has been done so far, the problem is most likely somewhere else. Intel GPU compiler stack is very capable when it comes to compiling OpenCL, C for Metal and DPC++. It's not yet very good at compiling ZLUDA. ZLUDA emits code patterns never seen before by the Intel GPU compiler stack and hits some rarely used (or not used before) code paths in the compiler.\
- Current status of failing GeekBench tests is tracked [here](https://github.com/vosen/ZLUDA/pull/12)
+ * Why is ZLUDA slower in the remaining benchmarks?\
+ The reason is unknown. Most likely, in some tests we compile from suboptimal NVIDIA GPU code and in other tests ZLUDA itself is emitting suboptimal Intel GPU code. For example, SFFT used to be even slower before PR [#22](https://github.com/vosen/ZLUDA/pull/22)
## Details
@@ -35,7 +34,7 @@ Overall in this suite of benchmarks faster by approximately 4% on ZLUDA.
* Is ZLUDA a drop-in replacement for CUDA?\
Yes, but certain applications use CUDA in ways which make it incompatible with ZLUDA
* What is the status of the project?\
- This project is a Proof of Concept. About the only thing that works currently is Geekbench (and not even completely). It's amazingly buggy and incomplete. You should not rely on it for anything serious
+ This project is a Proof of Concept. About the only thing that works currently is Geekbench. It's amazingly buggy and incomplete. You should not rely on it for anything serious
* Is it an Intel project? Is it an NVIDIA project?\
No, it's a private project
* What is the performance?\
@@ -56,8 +55,8 @@ You should have the most recent Intel GPU drivers installed.\
Copy `nvcuda.dll` to the application directory (the directory where .exe file is) and launch it normally
### Linux
-A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of the writing 20.45.18403 is the losest recommended version.
-Unpack the archive somewhere and run your application like this:
+A very recent version of [compute-runtime](https://github.com/intel/compute-runtime) and [Level Zero loader](https://github.com/oneapi-src/level-zero/releases) is required. At the time of writing, 20.45.18403 is the oldest recommended version.
+Run your application like this:
```
LD_LIBRARY_PATH=
```
diff --git a/ptx/src/test/spirv_run/cvt_rzi.ptx b/ptx/src/test/spirv_run/cvt_rzi.ptx
new file mode 100644
index 0000000..ba5cc0e
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_rzi.ptx
@@ -0,0 +1,25 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_rzi(
+ .param .u64 input,
+ .param .u64 output
+)
+{
+ .reg .u64 in_addr;
+ .reg .u64 out_addr;
+ .reg .f32 temp1;
+ .reg .f32 temp2;
+
+ ld.param.u64 in_addr, [input];
+ ld.param.u64 out_addr, [output];
+
+ ld.f32 temp1, [in_addr];
+ ld.f32 temp2, [in_addr+4];
+ cvt.rzi.f32.f32 temp1, temp1;
+ cvt.rzi.f32.f32 temp2, temp2;
+ st.f32 [out_addr], temp1;
+ st.f32 [out_addr+4], temp2;
+ ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_rzi.spvtxt b/ptx/src/test/spirv_run/cvt_rzi.spvtxt
new file mode 100644
index 0000000..68c12c6
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_rzi.spvtxt
@@ -0,0 +1,63 @@
+ OpCapability GenericPointer
+ OpCapability Linkage
+ OpCapability Addresses
+ OpCapability Kernel
+ OpCapability Int8
+ OpCapability Int16
+ OpCapability Int64
+ OpCapability Float16
+ OpCapability Float64
+ %34 = OpExtInstImport "OpenCL.std"
+ OpMemoryModel Physical64 OpenCL
+ OpEntryPoint Kernel %1 "cvt_rzi"
+ %void = OpTypeVoid
+ %ulong = OpTypeInt 64 0
+ %37 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+ %float = OpTypeFloat 32
+%_ptr_Function_float = OpTypePointer Function %float
+%_ptr_Generic_float = OpTypePointer Generic %float
+ %ulong_4 = OpConstant %ulong 4
+ %ulong_4_0 = OpConstant %ulong 4
+ %1 = OpFunction %void None %37
+ %8 = OpFunctionParameter %ulong
+ %9 = OpFunctionParameter %ulong
+ %32 = OpLabel
+ %2 = OpVariable %_ptr_Function_ulong Function
+ %3 = OpVariable %_ptr_Function_ulong Function
+ %4 = OpVariable %_ptr_Function_ulong Function
+ %5 = OpVariable %_ptr_Function_ulong Function
+ %6 = OpVariable %_ptr_Function_float Function
+ %7 = OpVariable %_ptr_Function_float Function
+ OpStore %2 %8
+ OpStore %3 %9
+ %10 = OpLoad %ulong %2 Aligned 8
+ OpStore %4 %10
+ %11 = OpLoad %ulong %3 Aligned 8
+ OpStore %5 %11
+ %13 = OpLoad %ulong %4
+ %28 = OpConvertUToPtr %_ptr_Generic_float %13
+ %12 = OpLoad %float %28 Aligned 4
+ OpStore %6 %12
+ %15 = OpLoad %ulong %4
+ %25 = OpIAdd %ulong %15 %ulong_4
+ %29 = OpConvertUToPtr %_ptr_Generic_float %25
+ %14 = OpLoad %float %29 Aligned 4
+ OpStore %7 %14
+ %17 = OpLoad %float %6
+ %16 = OpExtInst %float %34 trunc %17
+ OpStore %6 %16
+ %19 = OpLoad %float %7
+ %18 = OpExtInst %float %34 trunc %19
+ OpStore %7 %18
+ %20 = OpLoad %ulong %5
+ %21 = OpLoad %float %6
+ %30 = OpConvertUToPtr %_ptr_Generic_float %20
+ OpStore %30 %21 Aligned 4
+ %22 = OpLoad %ulong %5
+ %23 = OpLoad %float %7
+ %27 = OpIAdd %ulong %22 %ulong_4_0
+ %31 = OpConvertUToPtr %_ptr_Generic_float %27
+ OpStore %31 %23 Aligned 4
+ OpReturn
+ OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/cvt_s32_f32.ptx b/ptx/src/test/spirv_run/cvt_s32_f32.ptx
new file mode 100644
index 0000000..d432a91
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s32_f32.ptx
@@ -0,0 +1,25 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_s32_f32(
+ .param .u64 input,
+ .param .u64 output
+)
+{
+ .reg .u64 in_addr;
+ .reg .u64 out_addr;
+ .reg .b32 temp1;
+ .reg .b32 temp2;
+
+ ld.param.u64 in_addr, [input];
+ ld.param.u64 out_addr, [output];
+
+ ld.f32 temp1, [in_addr];
+ ld.f32 temp2, [in_addr+4];
+ cvt.rpi.ftz.s32.f32 temp1, temp1;
+ cvt.rpi.ftz.s32.f32 temp2, temp2;
+ st.global.s32 [out_addr], temp1;
+ st.global.s32 [out_addr+4], temp2;
+ ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt b/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt
new file mode 100644
index 0000000..d9ae053
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s32_f32.spvtxt
@@ -0,0 +1,75 @@
+ OpCapability GenericPointer
+ OpCapability Linkage
+ OpCapability Addresses
+ OpCapability Kernel
+ OpCapability Int8
+ OpCapability Int16
+ OpCapability Int64
+ OpCapability Float16
+ OpCapability Float64
+ %42 = OpExtInstImport "OpenCL.std"
+ OpMemoryModel Physical64 OpenCL
+ OpEntryPoint Kernel %1 "cvt_s32_f32"
+ OpDecorate %32 FPRoundingMode RTP
+ OpDecorate %34 FPRoundingMode RTP
+ %void = OpTypeVoid
+ %ulong = OpTypeInt 64 0
+ %45 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+ %uint = OpTypeInt 32 0
+%_ptr_Function_uint = OpTypePointer Function %uint
+ %float = OpTypeFloat 32
+%_ptr_Generic_float = OpTypePointer Generic %float
+ %ulong_4 = OpConstant %ulong 4
+%_ptr_CrossWorkgroup_uint = OpTypePointer CrossWorkgroup %uint
+ %ulong_4_0 = OpConstant %ulong 4
+ %1 = OpFunction %void None %45
+ %8 = OpFunctionParameter %ulong
+ %9 = OpFunctionParameter %ulong
+ %40 = OpLabel
+ %2 = OpVariable %_ptr_Function_ulong Function
+ %3 = OpVariable %_ptr_Function_ulong Function
+ %4 = OpVariable %_ptr_Function_ulong Function
+ %5 = OpVariable %_ptr_Function_ulong Function
+ %6 = OpVariable %_ptr_Function_uint Function
+ %7 = OpVariable %_ptr_Function_uint Function
+ OpStore %2 %8
+ OpStore %3 %9
+ %10 = OpLoad %ulong %2 Aligned 8
+ OpStore %4 %10
+ %11 = OpLoad %ulong %3 Aligned 8
+ OpStore %5 %11
+ %13 = OpLoad %ulong %4
+ %29 = OpConvertUToPtr %_ptr_Generic_float %13
+ %28 = OpLoad %float %29 Aligned 4
+ %12 = OpBitcast %uint %28
+ OpStore %6 %12
+ %15 = OpLoad %ulong %4
+ %25 = OpIAdd %ulong %15 %ulong_4
+ %31 = OpConvertUToPtr %_ptr_Generic_float %25
+ %30 = OpLoad %float %31 Aligned 4
+ %14 = OpBitcast %uint %30
+ OpStore %7 %14
+ %17 = OpLoad %uint %6
+ %33 = OpBitcast %float %17
+ %32 = OpConvertFToS %uint %33
+ %16 = OpCopyObject %uint %32
+ OpStore %6 %16
+ %19 = OpLoad %uint %7
+ %35 = OpBitcast %float %19
+ %34 = OpConvertFToS %uint %35
+ %18 = OpCopyObject %uint %34
+ OpStore %7 %18
+ %20 = OpLoad %ulong %5
+ %21 = OpLoad %uint %6
+ %36 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %20
+ %37 = OpCopyObject %uint %21
+ OpStore %36 %37 Aligned 4
+ %22 = OpLoad %ulong %5
+ %23 = OpLoad %uint %7
+ %27 = OpIAdd %ulong %22 %ulong_4_0
+ %38 = OpConvertUToPtr %_ptr_CrossWorkgroup_uint %27
+ %39 = OpCopyObject %uint %23
+ OpStore %38 %39 Aligned 4
+ OpReturn
+ OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/cvt_s64_s32.ptx b/ptx/src/test/spirv_run/cvt_s64_s32.ptx
new file mode 100644
index 0000000..5242864
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s64_s32.ptx
@@ -0,0 +1,22 @@
+.version 6.5
+.target sm_30
+.address_size 64
+
+.visible .entry cvt_s64_s32(
+ .param .u64 input,
+ .param .u64 output
+)
+{
+ .reg .u64 in_addr;
+ .reg .u64 out_addr;
+ .reg .s32 r_32;
+ .reg .s64 r_64;
+
+ ld.param.u64 in_addr, [input];
+ ld.param.u64 out_addr, [output];
+
+ ld.b32 r_32, [in_addr];
+ cvt.s64.s32 r_64, r_32;
+ st.b64 [out_addr], r_64;
+ ret;
+}
diff --git a/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt b/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt
new file mode 100644
index 0000000..3f46103
--- /dev/null
+++ b/ptx/src/test/spirv_run/cvt_s64_s32.spvtxt
@@ -0,0 +1,53 @@
+ OpCapability GenericPointer
+ OpCapability Linkage
+ OpCapability Addresses
+ OpCapability Kernel
+ OpCapability Int8
+ OpCapability Int16
+ OpCapability Int64
+ OpCapability Float16
+ OpCapability Float64
+ %24 = OpExtInstImport "OpenCL.std"
+ OpMemoryModel Physical64 OpenCL
+ OpEntryPoint Kernel %1 "cvt_s64_s32"
+ %void = OpTypeVoid
+ %ulong = OpTypeInt 64 0
+ %27 = OpTypeFunction %void %ulong %ulong
+%_ptr_Function_ulong = OpTypePointer Function %ulong
+ %uint = OpTypeInt 32 0
+%_ptr_Function_uint = OpTypePointer Function %uint
+%_ptr_Generic_uint = OpTypePointer Generic %uint
+%_ptr_Generic_ulong = OpTypePointer Generic %ulong
+ %1 = OpFunction %void None %27
+ %8 = OpFunctionParameter %ulong
+ %9 = OpFunctionParameter %ulong
+ %22 = OpLabel
+ %2 = OpVariable %_ptr_Function_ulong Function
+ %3 = OpVariable %_ptr_Function_ulong Function
+ %4 = OpVariable %_ptr_Function_ulong Function
+ %5 = OpVariable %_ptr_Function_ulong Function
+ %6 = OpVariable %_ptr_Function_uint Function
+ %7 = OpVariable %_ptr_Function_ulong Function
+ OpStore %2 %8
+ OpStore %3 %9
+ %10 = OpLoad %ulong %2 Aligned 8
+ OpStore %4 %10
+ %11 = OpLoad %ulong %3 Aligned 8
+ OpStore %5 %11
+ %13 = OpLoad %ulong %4
+ %19 = OpConvertUToPtr %_ptr_Generic_uint %13
+ %18 = OpLoad %uint %19 Aligned 4
+ %12 = OpCopyObject %uint %18
+ OpStore %6 %12
+ %15 = OpLoad %uint %6
+ %32 = OpBitcast %uint %15
+ %33 = OpSConvert %ulong %32
+ %14 = OpCopyObject %ulong %33
+ OpStore %7 %14
+ %16 = OpLoad %ulong %5
+ %17 = OpLoad %ulong %7
+ %20 = OpConvertUToPtr %_ptr_Generic_ulong %16
+ %21 = OpCopyObject %ulong %17
+ OpStore %20 %21 Aligned 8
+ OpReturn
+ OpFunctionEnd
diff --git a/ptx/src/test/spirv_run/mod.rs b/ptx/src/test/spirv_run/mod.rs
index 86f9c16..3976c76 100644
--- a/ptx/src/test/spirv_run/mod.rs
+++ b/ptx/src/test/spirv_run/mod.rs
@@ -49,6 +49,8 @@ test_ptx!(mul_lo, [1u64], [2u64]);
test_ptx!(mul_hi, [u64::max_value()], [1u64]);
test_ptx!(add, [1u64], [2u64]);
test_ptx!(setp, [10u64, 11u64], [1u64, 0u64]);
+test_ptx!(setp_gt, [f32::NAN, 1f32], [1f32]);
+test_ptx!(setp_leu, [1f32, f32::NAN], [1f32]);
test_ptx!(bra, [10u64], [11u64]);
test_ptx!(not, [0u64], [u64::max_value()]);
test_ptx!(shl, [11u64], [44u64]);
@@ -114,6 +116,8 @@ test_ptx!(cos, [std::f32::consts::PI], [-1f32]);
test_ptx!(lg2, [512f32], [9f32]);
test_ptx!(ex2, [10f32], [1024f32]);
test_ptx!(cvt_rni, [9.5f32, 10.5f32], [10f32, 10f32]);
+test_ptx!(cvt_rzi, [-13.8f32, 12.9f32], [-13f32, 12f32]);
+test_ptx!(cvt_s32_f32, [-13.8f32, 12.9f32], [-13i32, 13i32]);
test_ptx!(clz, [0b00000101_00101101_00010011_10101011u32], [5u32]);
test_ptx!(popc, [0b10111100_10010010_01001001_10001010u32], [14u32]);
test_ptx!(
@@ -140,8 +144,9 @@ test_ptx!(stateful_ld_st_ntid, [123u64], [123u64]);
test_ptx!(stateful_ld_st_ntid_chain, [12651u64], [12651u64]);
test_ptx!(stateful_ld_st_ntid_sub, [96311u64], [96311u64]);
test_ptx!(shared_ptr_take_address, [97815231u64], [97815231u64]);
-// For now, we just that it builds and links
+// For now, we just make sure that it builds and links
test_ptx!(assertfail, [716523871u64], [716523872u64]);
+test_ptx!(cvt_s64_s32, [-1i32], [-1i64]);
struct DisplayError {
err: T,
@@ -161,11 +166,15 @@ impl Debug for DisplayError {
impl error::Error for DisplayError {}
-fn test_ptx_assert<'a, T: From + ze::SafeRepr + Debug + Copy + PartialEq>(
+fn test_ptx_assert<
+ 'a,
+ Input: From + ze::SafeRepr + Debug + Copy + PartialEq,
+ Output: From + ze::SafeRepr + Debug + Copy + PartialEq,
+>(
name: &str,
ptx_text: &'a str,
- input: &[T],
- output: &mut [T],
+ input: &[Input],
+ output: &mut [Output],
) -> Result<(), Box> {
let mut errors = Vec::new();
let ast = ptx::ModuleParser::new().parse(&mut errors, ptx_text)?;
@@ -178,12 +187,15 @@ fn test_ptx_assert<'a, T: From + ze::SafeRepr + Debug + Copy + PartialEq>(
Ok(())
}
-fn run_spirv + ze::SafeRepr + Copy + Debug>(
+fn run_spirv<
+ Input: From + ze::SafeRepr + Copy + Debug,
+ Output: From + ze::SafeRepr + Copy + Debug,
+>(
name: &CStr,
module: translate::Module,
- input: &[T],
- output: &mut [T],
-) -> ze::Result> {
+ input: &[Input],
+ output: &mut [Output],
+) -> ze::Result> {
ze::init()?;
let spirv = module.spirv.assemble();
let byte_il = unsafe {
@@ -237,15 +249,15 @@ fn run_spirv + ze::SafeRepr + Copy + Debug>(
kernel.set_indirect_access(
ze::sys::ze_kernel_indirect_access_flags_t::ZE_KERNEL_INDIRECT_ACCESS_FLAG_DEVICE,
)?;
- let mut inp_b = ze::DeviceBuffer::<T>::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
- let mut out_b = ze::DeviceBuffer::<T>::new(&mut ctx, &dev, cmp::max(output.len(), 1))?;
- let inp_b_ptr_mut: ze::BufferPtrMut<T> = (&mut inp_b).into();
+ let mut inp_b = ze::DeviceBuffer::<Input>::new(&mut ctx, &dev, cmp::max(input.len(), 1))?;
+ let mut out_b = ze::DeviceBuffer::