From a94ec0952ad8d902a5c9fbe491b664710ca4525f Mon Sep 17 00:00:00 2001 From: GitHub Actions Date: Thu, 3 Apr 2025 17:25:55 +0000 Subject: [PATCH] Deploy vosen/ZLUDA to vosen/ZLUDA:gh-pages --- 404.html | 3 + blog/index.html | 110 ++++++ blog/zluda-update-q1-2025/25q1-1.svg | 1 + blog/zluda-update-q1-2025/25q1-2.svg | 1 + blog/zluda-update-q1-2025/25q1-3.svg | 1 + blog/zluda-update-q1-2025/25q1-4.svg | 1 + blog/zluda-update-q1-2025/index.html | 339 ++++++++++++++++++ blog/zluda-update-q4-2024/geekbench.svg | 1 + .../zluda-update-q4-2024/geekbench_detail.svg | 1 + blog/zluda-update-q4-2024/index.html | 186 ++++++++++ blog/zludas-third-life/index.html | 156 ++++++++ index.html | 106 ++++++ robots.txt | 4 + sitemap.xml | 21 ++ 14 files changed, 931 insertions(+) create mode 100644 404.html create mode 100644 blog/index.html create mode 100644 blog/zluda-update-q1-2025/25q1-1.svg create mode 100644 blog/zluda-update-q1-2025/25q1-2.svg create mode 100644 blog/zluda-update-q1-2025/25q1-3.svg create mode 100644 blog/zluda-update-q1-2025/25q1-4.svg create mode 100644 blog/zluda-update-q1-2025/index.html create mode 100644 blog/zluda-update-q4-2024/geekbench.svg create mode 100644 blog/zluda-update-q4-2024/geekbench_detail.svg create mode 100644 blog/zluda-update-q4-2024/index.html create mode 100644 blog/zludas-third-life/index.html create mode 100644 index.html create mode 100644 robots.txt create mode 100644 sitemap.xml diff --git a/404.html b/404.html new file mode 100644 index 0000000..f8414f0 --- /dev/null +++ b/404.html @@ -0,0 +1,3 @@ + +404 Not Found +

404 Not Found

diff --git a/blog/index.html b/blog/index.html new file mode 100644 index 0000000..b54f80a --- /dev/null +++ b/blog/index.html @@ -0,0 +1,110 @@ + + + + + + + + + ZLUDA + + + +
+
+

+
+ ZLUDA +
+ +
+
+ +

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

+
+

+ +

+ List of blog posts +

+ + +
+
+ + + \ No newline at end of file diff --git a/blog/zluda-update-q1-2025/25q1-1.svg b/blog/zluda-update-q1-2025/25q1-1.svg new file mode 100644 index 0000000..d278e62 --- /dev/null +++ b/blog/zluda-update-q1-2025/25q1-1.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q1-2025/25q1-2.svg b/blog/zluda-update-q1-2025/25q1-2.svg new file mode 100644 index 0000000..052f7c2 --- /dev/null +++ b/blog/zluda-update-q1-2025/25q1-2.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q1-2025/25q1-3.svg b/blog/zluda-update-q1-2025/25q1-3.svg new file mode 100644 index 0000000..076bf3d --- /dev/null +++ b/blog/zluda-update-q1-2025/25q1-3.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q1-2025/25q1-4.svg b/blog/zluda-update-q1-2025/25q1-4.svg new file mode 100644 index 0000000..4a692ac --- /dev/null +++ b/blog/zluda-update-q1-2025/25q1-4.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q1-2025/index.html b/blog/zluda-update-q1-2025/index.html new file mode 100644 index 0000000..4f7d0f9 --- /dev/null +++ b/blog/zluda-update-q1-2025/index.html @@ -0,0 +1,339 @@ + + + + + + + + + ZLUDA - ZLUDA update Q1 2025 - roadmap update, LLVM tests, denormals + + + +
+
+

+
+ ZLUDA +
+ +
+
+ +

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

+
+

+ +

+ ZLUDA update Q1 2025 - roadmap update, LLVM tests, denormals +
2025-04-03
+

+

Welcome to the new ZLUDA update. Read about our plans for the near future (including PyTorch and PhysX) in Roadmap update, and about the progress made this quarter in LLVM bitcode unit tests and Correct rounding and denormal modes on AMD GPUs.

+

Roadmap update

+

PyTorch

+

PyTorch remains my top priority and I still aim to have PyTorch running on ZLUDA in Q3/Q4 this year. Before PyTorch is up and running, I am aiming for an intermediate goal: llm.c. You can see the progress towards getting llm.c up and running here.

+

PhysX

+

As you might have read here, here and on multiple other sites, NVIDIA dropped support for 32-bit PhysX in their latest generation of GPUs, leaving a number of older games stranded.

+

This reignited the debate about ZLUDA’s PhysX support. After reading through it several times, it’s clear to me that there is a path in ZLUDA to rescuing those games and getting them to run on both AMD and NVIDIA GPUs.

+

I broke down the implementation into tasks here. If you can program Rust and want to make a lot of people happy, I encourage you to contribute. I won't be able to work on it myself because I'll be busy with PyTorch support, but I'll help in any way I can.

+

LLVM bitcode unit tests

+

The ZLUDA compiler is the cornerstone of the project. It processes PTX modules by applying a series of transformations, ultimately generating LLVM bitcode. This LLVM bitcode is subsequently fed into the installed ROCm/HIP driver, which compiles it into a binary suitable for the currently installed GPU.

+

The compiler codebase includes multiple unit tests. Each test asserts that for:

+
    +
  • given PTX source code
  • +
  • given input data
  • +
  • given output data
  • +
+

the PTX compiles successfully, the compiled binary executes with the given input data, and it produces the given output data.
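For illustration, such a test looks roughly like the sketch below. compile_ptx and run_on_gpu are hypothetical helper names used for the sketch, not the actual ptx crate test harness.

// Hedged sketch of an end-to-end ptx test; helper names are illustrative only.
#[test]
fn add_hip() {
    let ptx = include_str!("add.ptx");         // given PTX source code
    let input = [1u64, 2u64];                  // given input data
    let expected = [3u64];                     // given output data
    let module = compile_ptx(ptx).expect("PTX must compile successfully");
    let output = run_on_gpu(&module, &input);  // execute the compiled binary
    assert_eq!(output, expected);              // ...and check it produces the output data
}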

+

While this covers the entire end-to-end flow, there is a valuable sub-flow hiding here that could be tested too: the compilation from PTX to the LLVM bitcode. For each PTX source module, we could commit the compiled LLVM bitcode in a textual format and implement tests to ensure it remains unchanged. This approach is particularly useful for newly written complex compiler transformations that modify the emitted LLVM across the board. By using LLVM bitcode tests, you can observe how your modifications impact LLVM generation across various use cases, even those you might assume are unrelated.
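A minimal sketch of such a snapshot test is below. compile_to_llvm_text is a hypothetical helper name, not the harness actually merged in #324.

// Hedged sketch of an LLVM bitcode snapshot test: compile the PTX, render the
// bitcode as text and compare it against a committed .ll file.
#[test]
fn add_llvm_snapshot() {
    let ptx = include_str!("add.ptx");
    let llvm_ir = compile_to_llvm_text(ptx).expect("PTX must compile successfully");
    let expected = include_str!("add.ll");     // textual LLVM committed to the repository
    assert_eq!(llvm_ir, expected);             // fails whenever codegen for this module changes
}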

+

This feature sat on the "help wanted" list for quite some time and I’m happy to see the first external contributor address this issue. JoelleJS merged it in #324. Just in time for a significant feature that will use these tests.

+

Correct rounding and denormal modes on AMD GPUs

+

This is an important feature that I have wanted to do for years. It is not present even in the old (pre-rollback) ZLUDA. The priority was always given to enabling new workloads, instead of making everything perfectly correct. Now we are out of proof-of-concept mode and can spend some time on correctness. As you will read below, it is a complex feature that is quite often invisible to the end user. It was acceptable for the old ZLUDA to do things incorrectly.

+
+

Warning
+The remainder of this article assumes you know what PTX, floating-point numbers, control flow graphs, and basic blocks are. You don't need to be an expert, but a lack of familiarity with these concepts will make everything below incomprehensible.

+
+

If you know what floating-point denormals and rounding modes are you can skip to the next section (Previously on "ZLUDA" ...).

+

First, some definitions. What exactly is denormal mode, and what are denormal numbers? Denormals (subnormals) represent a category of very small floating-point values. For the most common floating-point size (32-bit), these are the non-zero values whose magnitude is smaller than roughly 1.18×10⁻³⁸ (the smallest normal number). Due to the encoding of floating-point numbers, this category necessitates additional processing and has historically been either unsupported or supported with reduced performance. When we say "unsupported," it means that denormal values are treated as zeros. In the context of PTX, denormal mode refers to a flag (.ftz) on floating-point instructions that determines whether they process denormal values or treat them as zeros, "flushing to zero." In general, modern, mainstream hardware architectures can handle basic operations - add, multiply, fused multiply add, etc. - with denormal values at full speed.
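To make that concrete, here is a small, self-contained Rust illustration (nothing ZLUDA-specific, just standard f32 behaviour):

fn main() {
    let smallest_normal = f32::MIN_POSITIVE;   // ~1.1754944e-38, the smallest normal f32
    let denormal = smallest_normal / 2.0;      // non-zero, but below the normal range: a denormal
    assert!(denormal > 0.0 && denormal < smallest_normal);
    println!("{smallest_normal:e} {denormal:e}");
    // With flush-to-zero enabled, `denormal` would be treated as 0.0 instead.
}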

+

Now rounding mode. Most of the "simple" floating-point operations are formally defined as "perform the operation with infinite precision and then round the infinitely precise result to a finite value using the chosen mode". The usual rounding modes are "round to nearest even", "round to zero", "round to positive infinity" and "round to negative infinity". Rounding mode effectively controls the least-significant bit of the mantissa of the floating-point result. Although a single least-significant bit may seem insignificant, it can have a noticeable impact. For instance, consider two values that differ only by the least significant bit: 1.0000000 and 1.0000001. In certain contexts, the difference of 0.0000001 can be substantial.
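Here is a quick runnable illustration of the same point (plain Rust, which only exposes round-to-nearest-even):

fn main() {
    let half_ulp = 2f32.powi(-24);            // ulp(1.0f32) is 2^-23, so this is half an ulp
    assert_eq!(1.0f32 + half_ulp, 1.0);       // round-to-nearest-even: the tie rounds back to 1.0
    let next_up = f32::from_bits(1.0f32.to_bits() + 1);
    println!("{next_up:.7}");                 // 1.0000001, what round-toward-positive-infinity would produce
}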

+

Now that we understand the denormal and rounding part, let's focus on the mode part. Typically, CPUs will do some mix of integer calculations and floating-point calculations, with the specific proportions varying based on the workload. In contrast, GPUs—regardless of whether they are tailored for gaming, high-performance computing (HPC), or machine learning—primarily dedicate their processing cycles to floating-point operations. This focus prompts GPU architects to prioritise floating-point support in their hardware designs.

+

One notable feature found in NVIDIA hardware, and consequently in PTX, is the per-instruction control for denormal and rounding operations. In a CPU, a common approach to managing this issue is to implement a global control (as seen in x86 and ARM architectures) or to forgo denormal control altogether (as in RISC-V). While this design choice is beneficial for programmers, it presents unique challenges for ZLUDA when translating to an AMD GPU which uses global control (like a CPU).

+

Previously on "ZLUDA" ...

+

Pre-rollback ZLUDA used the simplest possible approach that almost works:

+
    +
  • For denormal mode (which is either "flush-to-zero" or "preserve denormals") hold a "vote" for each function. Count the number of instructions using each mode and then just use the more prolific mode across the function
  • +
  • For rounding mode, ignore it completely and always use "round to nearest even"
  • +
+

A PTX module compiled from CUDA C++ sources will usually use the same denormal mode across the whole module, with the particular mode depending on compiler flags. Non-default rounding modes are somewhat uncommon.
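For illustration, the denormal "vote" from the first bullet above could look roughly like this (illustrative Rust, not the actual pre-rollback code):

enum DenormalMode { FlushToZero, Preserve }

// One bool per mode-using instruction in the function: true if it carries .ftz.
fn vote_denormal_mode(is_ftz_per_instruction: &[bool]) -> DenormalMode {
    let ftz = is_ftz_per_instruction.iter().filter(|&&ftz| ftz).count();
    let preserve = is_ftz_per_instruction.len() - ftz;
    if ftz >= preserve { DenormalMode::FlushToZero } else { DenormalMode::Preserve }
}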

+

Sure, this approach is not correct, but it worked somewhat okayish and it led to only a single major bug (that I’ve noticed). Still, ZLUDA is now out of proof-of-concept mode and we are now doing things correctly.

+

Dead end #1: LLVM & HIP/ROCm

+

When implementing a new compiler feature in ZLUDA, the first step is to check if it's implemented by the baseline LLVM. The perfect LLVM support would allow ZLUDA to do a trivial per-instruction transformation like this:

+

from (PTX pseudocode):

+
z = add.ftz x, y
+a = add.ftz b, c
+
+

to (LLVM pseudocode):

+
old_fpstate1 = llvm.get_fpstate()
+llvm.set_ftz(true)
+z = add x, y
+llvm.set_fpstate(old_fpstate1)
+old_fpstate2 = llvm.get_fpstate()
+llvm.set_ftz(true)
+a = add b, c
+llvm.set_fpstate(old_fpstate2)
+
+

and have LLVM optimize that to (AMD GPU assembler pseudocode):

+
S_DENORM_MODE flush, flush
+V_ADD_F32 z, x, y
+V_ADD_F32 a, b, c
+
+

The initial research on LLVM floating point builtins appeared promising, as this collection of intrinsics seemed to address our specific use case:

+
    +
  • llvm.get.fpenv/llvm.set.fpenv
  • +
  • llvm.get.fpmode/llvm.set.fpmode
  • +
  • llvm.experimental.* family
  • +
+

Sadly, they are all deficient in some way. They either compile down to poor, unoptimized AMD GPU code or do not work at all. Granted, llvm.experimental.* support is being worked on by AMD and should appear in future ROCm versions, but this does not help us today.

+

This raises the question: in CUDA C++ you have a bunch of builtins to do operations with the specified rounding mode, e.g. __fadd_rz for floating point addition with "round-to-zero" mode. What happens on ROCm?

+

Further exploration revealed (source):

+
+

Only the nearest-even rounding mode is supported by default on AMD GPUs. The _rz, _ru, and _rd suffixed intrinsic functions exist in the HIP AMD backend if the OCML_BASIC_ROUNDED_OPERATIONS macro is defined.

+
+

Ok. You can use those functions, but they are hidden behind a define. That’s weird. Time to try it!

+

The HIP/ROCm source code:

+
#include <hip/hip_runtime.h>
+
+__global__ void foobar(int* array, int n) {
+    int tid = blockDim.x * blockIdx.x + threadIdx.x;
+    array[tid] = __fadd_rz(array[tid], array[tid]);
+    array[tid+1] = __fadd_rz(array[tid+1], array[tid+1]);
+}
+
+

With ROCm 6.3, this compiles down to the following (some output omitted for clarity):

+
0000000000001900 <__ocml_add_rtz_f32>:
+        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+        s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 2), 3
+        v_add_f32_e32 v0, v0, v1
+        s_setreg_imm32_b32 hwreg(HW_REG_MODE, 0, 2), 0
+        s_setpc_b64 s[30:31]
+
+...
+
+0000000000001a00 <_Z6foobarPii>:
+        ...
+        s_getpc_b64 s[0:1]
+        s_add_u32 s0, s0, 0xfffffeac
+        s_addc_u32 s1, s1, -1
+        ...
+        s_swappc_b64 s[30:31], s[0:1]
+        v_cvt_f32_i32_e32 v1, v5
+        v_cvt_i32_f32_e32 v4, v0
+        v_mov_b32_e32 v0, v1
+        s_swappc_b64 s[30:31], s[0:1]
+        s_delay_alu instid0(VALU_DEP_1)
+        v_cvt_i32_f32_e32 v5, v0
+        global_store_b64 v[2:3], v[4:5], off
+        s_endpgm
+
+

Ok, mystery solved. I, too, would like to hide this compiler output.

+

For those of us who are not proficient in AMD GPU assembly: every use of __fadd_rz requires a function call (s_swappc_b64, expensive) and two instructions to set the rounding mode (s_setreg_imm32_b32, also expensive). This is simply too much overhead to be acceptable.

+

We are going to build our own support. Our goal, for the code above, is a single instruction to set the rounding mode (or even zero instructions as we will see later).

+

Building support in ZLUDA

+

Our new goal is to write a complete transformation (compiler pass) in ZLUDA that will insert instructions that set the global modes (rounding and denormal). We want to insert as few instructions as possible for the best possible performance - there’s no LLVM pass that is going to optimize the insertions for us.

+

Let’s take half of a step back. We know that the trivial (and slow) approach is to simply set the global mode before every instruction that makes use of a mode. It can be improved by omitting the mode-setting instructions if we know that the previous instruction uses the same mode. We can always track this in straight-line code, but what happens if there are branches? What happens if there are multiple branches from different sources, but into the same target? It seems to be sufficient to figure out which branches require a mode change and which do not.

+

This leads us to a new reformulation. We can express this problem as a control flow graph augmented with a little bit of extra information: for each mode (denormal, rounding) each node (basic block) will have an "entry" state and an "exit" state. The entry state for a basic block is the mode of the first mode-using instruction in the basic block. Similarly, the exit mode is the mode of the last mode-using instruction. This simplifies the problem quite a bit. We must now compute which edges (jumps) in the control flow graph require the insertion of a mode change.
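In code, the augmented graph could be sketched roughly like this (illustrative Rust, not ZLUDA's actual types; for simplicity the mode is a single bool, matching the two-valued mode used in the illustrations below):

// Entry/exit mode of one basic block; None means "no mode-using instruction".
struct BasicBlockModes {
    entry: Option<bool>, // mode of the first mode-using instruction, if any
    exit: Option<bool>,  // mode of the last mode-using instruction, if any
}

struct ModeGraph {
    nodes: Vec<BasicBlockModes>,
    predecessors: Vec<Vec<usize>>, // for each node, the nodes that jump into it
}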

+

For illustrative purposes we will only consider a mode that takes two values: true (green) and false (red). The picture below shows node "A", which has a "true" entry mode and a "false" exit mode, jumping to node "B", which has a "true" entry mode and a "false" exit mode:

+

+drawing +

+

Dead end #2: mode forward propagation

+

Something I did not mention explicitly, but is important: some nodes lack both entry and exit modes. Consider the following example:

+

+drawing +

+

There’s no need to insert a mode-setting instruction - node B will propagate the "false" value. But in this example:

+

+drawing +

+

we need to insert mode change from "false" to "true" somewhere between nodes A and C.

+

My first instinct was to propagate modes forward: for each node propagate its exit mode to all its successor nodes. While it is intuitively correct and solves the two examples above, there are two problems:

+
    +
  • +

    It’s relatively awkward to implement. Remember, a node can have more than one predecessor node. What happens if there is a node with an empty incoming edge and a "true" incoming edge? Should we do post-processing?

    +
  • +
  • +

    More concretely, this does not really handle codependence patterns like this:

    +

    +In this example node A can’t propagate its mode to B or C outright because each of them has more than one incoming edge. B and C can’t propagate their mode either because they have no mode - they depend on A. +
  • +
+

Better approach: backward propagation

+

Dependency problems from the previous solution hint at a better approach: backward propagation. Instead of propagating the exit mode we can compute the set of incoming modes. This set is the set of all possible values a given mode can have on the first instruction of the basic block. Sounds complex, but it can be computed easily if you have our augmented control flow graph. Take all incoming nodes; if an incoming node’s exit mode is non-empty, add that value to the set; if it is empty, recursively check that node’s incoming nodes.
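Reusing the illustrative ModeGraph sketch from earlier, that computation is only a few lines (again a hedged sketch, not the actual ZLUDA pass):

use std::collections::HashSet;

fn incoming_modes(graph: &ModeGraph, node: usize,
                  visited: &mut HashSet<usize>, out: &mut HashSet<bool>) {
    for &pred in &graph.predecessors[node] {
        if !visited.insert(pred) {
            continue; // already handled (also guards against cycles)
        }
        match graph.nodes[pred].exit {
            Some(mode) => { out.insert(mode); }                // known exit mode: record it
            None => incoming_modes(graph, pred, visited, out), // empty: look further back
        }
    }
}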

+

We now have the core of our algorithm, but it’s not a complete solution yet: the realities of AMD GPU hardware make it far more complex.

+

Hardware quirks

+

When targeting AMD GPUs, there are several hardware properties that we should take into account:

+
    +
  • +

    A kernel, on startup, has a certain initial state that is controlled by the programmer (or, in our case, the compiler). Part of that initial state is the initial value of the denormal and rounding registers (global modes). We get this initial mode for free, no extra instructions needed

    +
  • +
  • +

    Each mode (denormal and rounding) is actually split into two registers (global modes). One for f32 and one for joint f16 and f64. In total there are four registers: denormal f32, denormal f16+f64, rounding f32, rounding f16+f64

    +
  • +
+

Registers (global modes) of the same kind (denormal, rounding) but with different widths (f32, f16+f64) are, for our purposes, twin registers. One quirk of AMD GPUs is that there are three instructions for setting global modes: S_SETREG, which sets any hardware (non-general-purpose) register, and S_ROUND_MODE and S_DENORM_MODE, which set just the rounding or denormal mode. S_ROUND_MODE and S_DENORM_MODE are much cheaper than S_SETREG. Their annoying limitation is that they can only set f32 and f16+f64 together. For this reason we will only do mode insertions that set both f32 and f16+f64 at once.

+

Final algorithm

+

If you made it this far, congratulations, you made it through the introduction. Now we can start implementing our algorithm.

+
Create control flow graph
+

Our first step is to compute the control flow graph. Every basic block contains entry and exit mode. For efficiency each node actually contains four entry modes and four exit modes. One for each AMD GPU mode: denormal f32, denormal f16+f64, rounding f32, rounding f16+f64.

+

We handle function calls by including them in the graph. A call from function "foo" to function "bar" is expressed as an edge from the calling basic block of "foo" to the first basic block of "bar". We don’t support virtual calls in the current ZLUDA, because they are extremely rare. They can easily be added later.

+

During this step we compute both entry and exit modes for each basic block. Additionally, each kernel starts with an artificial starting node. This node gets a special "entry" and "exit" value: the numeric identifier of the kernel. This numeric identifier is used across the whole ZLUDA compiler. It is already present (generated by previous compiler passes) and unique for a kernel. For example: while the denormals register can take one of two values (true or false), in our CFG the values that represent denormals can be true, false, or an arbitrary numeric id of a kernel. While going from a bounded to an unbounded set does not intuitively sound like a good decision, it’s temporary. We will optimize it back to the bounded set soon.
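A minimal sketch of that extended value, with illustrative names only (not ZLUDA's actual types):

#[derive(Clone, Copy, PartialEq, Eq)]
enum ExtendedMode {
    Value(bool), // a concrete denormal/rounding setting
    Kernel(u32), // "whatever the initial mode of this kernel turns out to be"
}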

+
Compute minimal insertions
+

Our next goal is, for each of the four modes, to compute the minimal set of insertions. In other words: figure out which basic blocks can be reached with a different mode than the one expected by their first instruction. We do this computation for each of the four modes separately.

+

We start by computing two sets: required insertions and potential insertions. We choose nodes which have an entry mode (we skip the nodes with empty entry mode and kernel nodes with numeric ids). Then, for each node, we compute the set of incoming modes:

+
    +
  • If the set contains a value that is different from the node’s entry mode then we add the node to required insertions
  • +
  • If the set of incoming modes is purely a set of kernel numeric ids (with no conflicting specific mode values) then we add the node id along with its mode and kernel ids to the potential insertions
  • +
+

Required insertions are set in stone: if we jump from another node with a different mode then we must insert a mode-setting instruction. Potential insertions on the other hand can be omitted: for a given node, if all the related kernels have the same initial value as the node then we can skip the mode-setting instruction.
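Building on the illustrative ExtendedMode above, the classification could be sketched like this (a hedged sketch that follows the two rules above, not the actual pass):

fn classify(node: usize, entry: bool, incoming: &[ExtendedMode],
            required: &mut Vec<usize>,
            potential: &mut Vec<(usize, bool, Vec<u32>)>) {
    let conflicts = incoming.iter()
        .any(|m| matches!(m, ExtendedMode::Value(v) if *v != entry));
    if conflicts {
        required.push(node); // reached with a different concrete mode: must insert
    } else if incoming.iter().all(|m| matches!(m, ExtendedMode::Kernel(_))) {
        let kernels = incoming.iter()
            .filter_map(|m| match m { ExtendedMode::Kernel(k) => Some(*k), _ => None })
            .collect();
        potential.push((node, entry, kernels)); // may be avoided via kernel initial modes
    }
}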

+

E.g. if we have kernels "foo" and "bar" that both call function "asdf" and "asdf" entry mode is "true", then we should set initial mode for "foo" and "bar" to "true" and avoid inserting additional mode-setting instructions.

+

The problem is easy to solve in the example above, but the general case is not trivial. I could not come up with a non-brute-force algorithm and opted to encode the problem as an integer linear programming problem and use an external solver. This excellent post helped me encode my constraints. As for the solver, I went with microlp, mainly because it’s a relatively small dependency. I wanted to avoid dragging something big like SCIP or even Z3 into the project. Our problem sizes are not going to be big: PTX modules tend to have a handful of kernels and simple control flow.

+
Compute full insertions
+

Now we have:

+
    +
  • Provisional control flow graph (with some nodes empty and kernel starting nodes containing numeric ids instead of specific values)
  • +
  • List of nodes that require a mode change on entry (if the incoming mode is different - there might be multiple nodes incoming, each with their own mode)
  • +
  • For kernels that were subject to optimization in the previous step: their initial state
  • +
+

We are almost ready to start inserting S_ROUND_MODE and S_DENORM_MODE. We have all the necessary information, we just need to do some more preprocessing. Specifically we need to know two things:

+
    +
  • +

    What is the effective entry mode for each block
    +Note that even though mode instructions are inserted along edges in the CFG (jumps in code), we don’t explicitly store edges. That’s because when inserting mode-setting instructions in a basic block we implicitly calculate its exit mode anyway. And since we know which identifier we jump to, as long as we know the modes of our jump target we know whether they differ and, consequently, whether the jump requires a mode change

    +
  • +
  • +

    What is the exit mode for a function
    +This is necessary because function calls are mechanically different from normal jumps. Function calls terminate a basic block and we need to know if the new basic block starting from the first post-call instruction requires a mode change. Since a function can be called from many places, it is the responsibility of the caller to do post-call mode adjustments (if necessary)

    +
  • +
+

Computing both of those is relatively straightforward. First, we take our incomplete control flow graph and resolve all empty nodes and special kernel nodes. For empty nodes we compute the incoming set - if the set contains more than a single value, we use a special value "conflict". For special starting kernel nodes we have a list of kernels with their initial values from the previous optimization pass.

+

Lastly, we join four separate logical CFGs (each for one AMD GPU mode) into two lookup tables. One lookup table contains all the necessary information to support mode changes for branches, the other lookup table contains all the necessary information to support mode changes for function calls.

+
Apply mode control
+

In this stage we walk through every function (kernel and non-kernel) and modify it accordingly:

+
    +
  • If necessary, insert mode change "prelude" basic block before each basic block
  • +
  • If necessary, redirect branch to go into mode change "prelude"
  • +
  • Insert all mode changes inside a basic block. We fold twin registers together. For example pseudocode like this:
    add.ftz.f32 a, b, c;
    +add.no_ftz.f16 x, y, z;
    +
    +gets converted into this pseudocode:
    set_denormal.f32.f16 ftz, no_ftz;
    +add.f32 a, b, c;
    +add.f16 x, y, z;
    +
    +
  • +
+

After all this hard work we now get a new module with a small number of freshly inserted mode change instructions. It’s not optimal in the absolute sense, but it’s much better than the alternatives. The AMD GPU code is now as correct as we can make it. Unfortunately, despite all this hard work, some code can still be miscomputed. Read below for more.

+

LLVM sadness

+

Sadly, there are still some issues outside of our control.

+

Firstly, a minor issue. As mentioned previously, for each AMD GPU kernel we can set the initial denormal mode and the initial rounding mode. This is true in the general sense, but for some reason the LLVM AMDGPU backend exposes control over the initial denormal mode, but not over the initial rounding mode. Right now, we set the initial rounding mode by inserting the instruction for it at the start of the kernel. We could skip this single instruction with better LLVM AMD GPU support.

+

Secondly, a bigger issue. Hardware-agnostic LLVM passes don’t understand AMD GPU instructions that set global state. So this pseudocode:

+
set_denormal.f32.f16 ftz, ftz;
+add.f32 x, b, c;
+set_denormal.f32.f16 no_ftz, no_ftz;
+add.f32 y, b, c;
+
+

after LLVM optimizations ends up as:

+
set_denormal.f32.f16 ftz, ftz;
+add.f32 x, b, c;
+mov.f32 y, x;
+
+

Which gives an incorrect result for y. While it’s rare to see the same input being computed twice with different modes, it’s concerning.

+

Fixing this would require deeper changes in LLVM (making the mode part of the instruction, like in llvm.experimental.constrained.*) and probably porting this pass to LLVM. We might eventually do it, but that’s enough effort for now.

+
+

If you made it this far, let me know in the comments what you think. See you next time.

+ + + + +
+
+ + + \ No newline at end of file diff --git a/blog/zluda-update-q4-2024/geekbench.svg b/blog/zluda-update-q4-2024/geekbench.svg new file mode 100644 index 0000000..8c4166e --- /dev/null +++ b/blog/zluda-update-q4-2024/geekbench.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q4-2024/geekbench_detail.svg b/blog/zluda-update-q4-2024/geekbench_detail.svg new file mode 100644 index 0000000..2277501 --- /dev/null +++ b/blog/zluda-update-q4-2024/geekbench_detail.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/blog/zluda-update-q4-2024/index.html b/blog/zluda-update-q4-2024/index.html new file mode 100644 index 0000000..6c16eef --- /dev/null +++ b/blog/zluda-update-q4-2024/index.html @@ -0,0 +1,186 @@ + + + + + + + + + ZLUDA - ZLUDA update Q4 2024 + + + +
+
+

+
+ ZLUDA +
+ +
+
+ +

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

+
+

+ +

+ ZLUDA update Q4 2024 +
2024-12-31
+

+

Hello everyone, it's the first of many ZLUDA updates. I've been working hard and I'm happy to announce that we reached the first milestone: we have a new version of ZLUDA with an actual working application. ZLUDA can run Geekbench 5.

+

This update also includes a few words on how to contribute (Contributing to ZLUDA) and changes in the internals of the "new" ZLUDA (New parser, Atomics modulo).

+

Geekbench 5

+

While Geekbench is far from being the most requested application, it's important for ZLUDA's development:

+
    +
  • It uses a relatively small CUDA API surface, which makes it easy for ZLUDA to support (at least easy when compared to Blender or PyTorch).
  • +
  • It's closed-source, so it's not possible to port it to HIP (via HIPIFY or other means).
  • +
  • It has both a generic OpenCL backend and an NVIDIA-specific CUDA backend, so we can measure the performance gain when using ZLUDA.
  • +
+

The "old" ZLUDA was about 1% faster than the native OpenCL. I was worried that the fresh new code would be slow, but the "new" ZLUDA turned out to be even better than the "old" one and is approximately 10% faster than the native OpenCL. Note that this performance improvement is Geekbench specific and not generalizable. Still, I'm happy with how things turned out. If you are interested in the technical details read the Atomics modulo section down below.

+

(The graphs below show slightly inconsistent results because the top graph uses previously collected numbers for OpenCL and ZLUDA 3, the bottom graph uses freshly collected numbers for OpenCL)

+

Next on the roadmap is llm.c.

+

+

+

Contributing to ZLUDA

+

I regularly get questions about how to contribute to ZLUDA, here's how (this information is now also in the project's README):

+

ZLUDA project has a commercial backing and does not accept donations. +ZLUDA project accepts pull requests and other non-monetary contributions.

+

If you want to contribute a code fix or documentation update feel free to open a Pull Request.

+

There's no architecture document (yet). The two most important crates in ZLUDA are ptx (the PTX compiler) and zluda (the AMD GPU runtime). A good starting point for tinkering with the project is to run one of the ptx unit tests under a debugger and understand what it is doing. cargo test -p ptx -- ::add_hip is a simple test that adds two numbers.

+

GitHub issues tagged with "help wanted" are tasks that are self-contained. Their level of difficulty varies and they are not always good beginner tasks, but they are defined unambiguously.

+

If you have questions feel free to ask on the #devtalk channel on Discord.

+

New parser

+

This is the first time I've written an extensive write-up about an issue like this and I'm curious to know what you think. Is this too detailed? Not detailed enough? Should all issues be broken down like this? Leave a comment.

+

Commit 193eb29 finally brought a major feature that solves one of the least visible and hardest to fix problems in ZLUDA.

+

First, you need to understand what PTX is. PTX is the NVIDIA GPU intermediate language. Intermediate languages work like this:

+
    +
  • Programmer writes source code
  • +
  • Programmer compiles their source code into an intermediate language X and sends it to the user
  • +
  • User runs the application. At some point, the intermediate code X is compiled (finalized) into a binary for their particular hardware
  • +
+

Intermediate languages are a fairly common solution: Java has JVM bytecode, .NET has CIL, gaming GPUs have SPIR-V, LLVM has LLVM IR. They all solve slightly different problems, but in the GPU context they are used to avoid the forward-compatibility problem. That's why GPU code written ten years ago works just fine on modern GPUs even though your GPU vendor has made major changes to its GPU architecture.

+

What if your software stack does not have an intermediate language? Then either:

+
    +
  • You declare your hardware to be strictly forward-compatible. All changes are strictly additive: code compiled for older hardware will work on the newer hardware, but will not be able to take advantage of the hardware features. This is what the x86 CPU family does
  • +
  • You simply ignore the forward compatibility and compile from scratch for each new hardware target. This is the AMD GPU way
  • +
+

The CUDA driver ships with a compiler that compiles (finalizes) from PTX to the particular NVIDIA GPU architecture and of course ZLUDA does the same, but for AMD GPUs.

+

The compilation itself is divided into several steps and the first step is parsing: converting from textual representation (PTX is a text format) to in-memory representation.

+

PTX, being a language, follows certain grammatical rules. For example, this line:

+
ld.global.cs.b32  r1, [addr1];
+
+

means "load (ld) from global address space (.global) with streaming cache behavior (cs) 32-bit integer (.b32) into variable r1 from address stored in variable addr1". You don't need to understand what all this means, just that there is an order to words in an instruction: operand, operands, registers. If the same instruction were written this way, it would violate grammar rules and result in an error:

+
ld r1, [addr1] .global.cs.b32;
+
+

Writing a PTX parser is not hard. As long as you are familiar with a parser generator you can get a high quality parser working relatively quickly and painlessly. ZLUDA used lalrpop for this task.

+

It turns out that there is an important undocumented "feature" of the PTX language. Although the documentation lays out a certain language grammar and the NVIDIA PTX-generating compiler follows it, the NVIDIA PTX-consuming (finalizing) compiler is more permissive. It allows some (but not all) words in an instruction to be passed out of order, so both ld.global.cs.b32 r1, [addr1]; and ld.cs.global.b32 r1, [addr1]; are accepted. For 99.99% of the code out there, it's not a problem: the compiler will correctly generate all the instructions in the documented form. The problem is "inline assembly". The CUDA programming language (a dialect of C++) allows programmers to write PTX instructions directly. And programmers get the PTX grammar wrong all the time. NVIDIA's PTX parser is tolerant of the mistakes, but ZLUDA's old parser was strict and had to be special-cased for every new project that got its PTX instructions out of order.

+

ZLUDA's parser is strict because we want to have a strongly-typed representation of instructions as soon as possible and carry the same representation through all stages of compilation. Strongly-typed means that invalid combinations of operands are not only rejected by the parser but impossible to even express in the code.

+

I can only speculate about NVIDIA's PTX parser, but its tolerance for out-of-order operands is probably an artifact of a more weakly typed internal representation or a two-stage parsing strategy (first do a simple parse to a weakly-typed representation and then validate and convert weakly-typed to strongly-typed).
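A hand-rolled sketch of that two-stage idea for ld is below: collect modifiers in any order into a weakly-typed bag, then validate. This is not ZLUDA's actual winnow-based parser; all names and the modifier subset are illustrative.

#[derive(Default, Debug)]
struct LdModifiers {
    space: Option<String>, // .global / .shared / ...
    cache: Option<String>, // .cs / .ca / ...
    typ: Option<String>,   // .b32 / .b64 / ...
}

fn parse_ld_modifiers(tokens: &[&str]) -> Result<LdModifiers, String> {
    let mut m = LdModifiers::default();
    for &tok in tokens {
        let slot = match tok {
            ".global" | ".shared" | ".local" => &mut m.space,
            ".cs" | ".ca" | ".cg" => &mut m.cache,
            ".b32" | ".b64" | ".f32" => &mut m.typ,
            other => return Err(format!("unknown modifier {other}")),
        };
        if slot.replace(tok.to_string()).is_some() {
            return Err(format!("duplicate modifier {tok}"));
        }
    }
    if m.typ.is_none() {
        return Err("missing type".to_string());
    }
    // Both [".global", ".cs", ".b32"] and [".cs", ".global", ".b32"] end up here identically.
    Ok(m)
}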

+

Back to ZLUDA's parser: it's easy enough to support the previous example: just have one rule for ld.<address_space>.<cache_hint>.<type> and one for ld.<cache_hint>.<address_space>.<type>. The problem is that the ld operation can be very long. Its full form is:

+
ld{.weak}{.ss}{.cop}{.level::cache_hint}{.level::prefetch_size}{.vec}.type
+
+

With 5 reorderable modifiers (ld is always at the start, .vec and .type are always at the end), there are up to 120 (5!) possible orderings, each needing a separate rule. And this does not even take into account optionality (every segment in { } brackets is optional).

+

"Out-of-orderness" is difficult to express well in a lalrpop-style parser (very few grammars want this "feature"). I replaced our old parser with the one based on winnow. Since ZLUDA tries to be strongly-typed this had a knock-on changes across all the compiler passes. But we now support all the broken PTX in the wild (which funnily enough comes mostly from NVIDIA's own libraries).

+

Atomics modulo

+

NVIDIA hardware supports a weird little atomic modulo increment/decrement instruction (atom.inc/atom.dec) with semantics like this:

+
unsigned atomic_inc(unsigned volatile* p, unsigned modulo) {
+  unsigned result;
+  atomic {
+    result = *p;
+    *p = (result >= modulo) ? 0 : result+1;
+  }
+  return result;
+}
+
+

For the longest time, I simply did not realize that AMD hardware natively supports this instruction and ZLUDA emulated it with a cmpxchg loop. Now that it is natively supported in ZLUDA, code using it is much faster. Unfortunately, other than Geekbench, there really aren't that many users of this instruction, so it won't have much performance impact overall.
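For reference, the emulation strategy has the shape below. This is a CPU-side Rust illustration of a compare-exchange loop with the same wrap-around semantics, not the GPU code ZLUDA actually emits.

use std::sync::atomic::{AtomicU32, Ordering};

fn atomic_inc_modulo(p: &AtomicU32, modulo: u32) -> u32 {
    let mut old = p.load(Ordering::Relaxed);
    loop {
        let new = if old >= modulo { 0 } else { old + 1 };
        match p.compare_exchange_weak(old, new, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(prev) => return prev,     // we won the race: return the old value
            Err(actual) => old = actual, // someone else updated *p in the meantime: retry
        }
    }
}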

+

To my knowledge, this instruction is not commonly available on CPUs. Do you know of any algorithms or data structures that benefit from this instruction? If so, let us know in the comments, I've been wondering about this for a few years now.

+

Bonus content: interview

+

I was interviewed about ZLUDA for the YouTube channel "Tech over Tea". Watch it here.

+ + + + +
+
+ + + \ No newline at end of file diff --git a/blog/zludas-third-life/index.html b/blog/zludas-third-life/index.html new file mode 100644 index 0000000..83f68d0 --- /dev/null +++ b/blog/zludas-third-life/index.html @@ -0,0 +1,156 @@ + + + + + + + + + ZLUDA - ZLUDA's third life + + + +
+
+

+
+ ZLUDA +
+ +
+
+ +

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

+
+

+ +

+ ZLUDA's third life +
2024-10-04
+

+

ZLUDA is back. For the last few months, I've been trying to find a commercial organization that would guarantee the continued development of the project. I am happy to announce that I have found one that is not only willing to fund further development, but also has an excellent vision for the future of ZLUDA. +I share their long-term vision and I can't wait to talk more about it. We don’t want to disclose everything just yet, but for now, we know that we want to make ZLUDA better. If you think ZLUDA is a cool project, we have even cooler projects in the works. Development has begun, and as soon as we have something to share, we will. +What I can talk about now is the current state and the direction of ZLUDA itself.

+

Where we are now:

+

The code has been rolled back to the pre-AMD state and I've been working furiously on improving the codebase. I’ve been writing the improved PTX parser I always wanted and laid the groundwork for the rebuild. Currently, some very simple synthetic GPU test programs can be run on an AMD GPU, but we are not yet at the point where ZLUDA can support a full application.

+

Where we are going:

+
    +
  • The year of rebuild
    +

    The ultimate goal is to bring "new" ZLUDA to a similar state as before the rollback in one year (Q3 2025). "Similar state" is very subjective here. I don't have precise criteria, but an application of similar complexity should work just as well. Not every pre-rollback application will be supported again due to new priorities (more below).

    +
  • +
  • Focus on machine learning
    +

    In the past, ZLUDA focused mainly on professional creator workloads. This meant focusing on applications like Arnold Render, Blender, 3DF Zephyr, etc. We even had a working prototype of GameWorks. While all of these workloads are important and extremely satisfying to have running, machine learning workloads are in much higher demand. We are targeting llm.c, llama.cpp, PyTorch, TensorFlow and others.
    +Additionally, HIP support for anything image-related is disappointing. The time saved by skipping layers of workarounds can be spent more productively writing more tests and enabling more applications.

    +
  • +
  • Raytracing is gone
    +

    This is related to the previous point. Not many people realized it, but ZLUDA had an OptiX implementation. While ZLUDA-OptiX supported just a handful of OptiX demos and simple Arnold scenes, it required a lot of code and broke all the time. Considering how underpowered it was and how much maintenance it required, it is a feature that is unlikely to ever come back.

    +
  • +
  • GPU support
    +

    The new ZLUDA will be built to support multiple GPU architectures. The mainline development will happen on AMD GPUs as that's what most of our users have. Still, I do realize there is a lot of interest in other GPUs (e.g. Intel) and hopefully this will lead to more code contributions and new backends.
    +Pre-rollback ZLUDA stayed on ROCm 5 mainly because I did not want to re-test all the version-specific workarounds. Since we are starting with a clean slate, AMD backend will target ROCm 6.1+.

    +
  • +
  • More modest set of supported AMD GPUs
    +

    We will only support RDNA1 and newer non-server AMD GPUs. Supporting pre-RDNA1 and server GPU architectures was an additional support burden and never worked as well as RDNA1+ GPUs due to the wavefront 64 configuration they use.
    +Note that this applies to the current architectures. AMD recently announced the merging of RDNA and CDNA into a single architecture (UDNA). I have high hopes for this new architecture and expect it to simplify porting CUDA → HIP and to bring ZLUDA to server GPUs.

    +
  • +
  • Downgraded Windows support
    +

    Windows will still work and be supported, it will just be less user-friendly. zluda.exe will be gone. Windows developers have invented several imaginative ways to load CUDA into a process. zluda.exe has tried to support all of them, and even succeeded. Most of the time.
    +As a user, you have to fashion some other way to load ZLUDA into the target process. Usually copying the ZLUDA binaries to the application is sufficient. We will provide ready-to-download Windows binaries.

    +
  • +
  • Code improvements
    +

    The current ZLUDA code is not the worst, but there is clearly room for improvement. During its second life, ZLUDA was written as a proof-of-concept solution for closed-source graphics applications (Arnold, 3DF Zephyr, etc.). This had two important consequences. First, since we were only concerned with one-time proof, it was enough to enable an application once and move on to the next without worrying about regressions. Second, some floating-point operations were handled with too little (or too much) precision - if you are rendering a scene, you can probably live with some pixels being imperceptibly different shades of red.
    +Now that the concept has been thoroughly proven, ZLUDA will maintain application-level testing and more rigorously test for floating-point correctness (and document differences where strict compatibility is not possible).

    +
  • +
  • Let’s talk
    +

    If you think there's not enough ZLUDA in your life, there's now a ZLUDA Discord channel here. Feel free to drop by and say hello. +I will also try to post development updates from time to time. I hope that learning about all the creative ways developers are abusing CUDA APIs will be as exciting as it is for me to implement them.

    +
  • +
+

Of course, ZLUDA will remain open source. This means that any features that are not part of the plan are fair game if someone steps up and submits a pull request. Personally, I think there is no better way to express your undying love for your Radeon VII than to add support for it in ZLUDA.

+ + + + +
+
+ + + \ No newline at end of file diff --git a/index.html b/index.html new file mode 100644 index 0000000..8072acc --- /dev/null +++ b/index.html @@ -0,0 +1,106 @@ + + + + + + + + + ZLUDA + + + +
+
+

+
+ ZLUDA +
+ +
+
+ +

ZLUDA lets you run unmodified CUDA applications on non-NVIDIA GPUs

+
+

+ + + +
+
+ + + \ No newline at end of file diff --git a/robots.txt b/robots.txt new file mode 100644 index 0000000..b9b91d6 --- /dev/null +++ b/robots.txt @@ -0,0 +1,4 @@ +User-agent: * +Disallow: +Allow: / +Sitemap: https://vosen.github.io/ZLUDA/sitemap.xml diff --git a/sitemap.xml b/sitemap.xml new file mode 100644 index 0000000..d1792ad --- /dev/null +++ b/sitemap.xml @@ -0,0 +1,21 @@ + + + + https://vosen.github.io/ZLUDA/ + + + https://vosen.github.io/ZLUDA/blog/ + + + https://vosen.github.io/ZLUDA/blog/zluda-update-q1-2025/ + 2025-04-03 + + + https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2024/ + 2024-12-31 + + + https://vosen.github.io/ZLUDA/blog/zludas-third-life/ + 2024-10-04 + +