
Commit 48d90cf

Merge pull request #5 from codeplaysoftware/add-maxime-blog
Fixed broken images.
2 parents: 75ec152 + aef588f

2 files changed: +9 −9 lines


_collections/_authors/maxime-france-pillois.markdown

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ user_id: 72195828122
 disabled: 0
 title: "Maxime France-Pillois"
 position: "Research Development Software Engineer"
-avatar: /assets/images/portal/article-images/2025-08-25-intel-gpu/maxime.jpeg
+avatar: /assets/images/portal/article-images/2025-09-02-intel-gpu/maxime.jpeg
 social_media:
   - https://www.linkedin.com/in/mfrancepillois
 ---
```
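The second file's changes below swap hard-coded absolute image paths for Jekyll's `relative_url` Liquid filter, which prefixes a site-relative path with the configured `baseurl` so images resolve when the site is hosted under a sub-path. A minimal Python model of that behavior (a sketch for illustration, not Jekyll's actual implementation; the `/portal` baseurl is a made-up example):

```python
# Simplified model of Jekyll's `relative_url` Liquid filter:
# it prepends the site's configured baseurl to a site-relative path.
def relative_url(path: str, baseurl: str = "") -> str:
    """Join baseurl and path with exactly one slash between them."""
    return baseurl.rstrip("/") + "/" + path.lstrip("/")

src = "/assets/images/portal/article-images/2025-09-02-intel-gpu/maxime.jpeg"

# With an empty baseurl the path is unchanged; under a project-site
# baseurl (e.g. "/portal"), the hard-coded "/assets/..." form would 404
# while the filtered form resolves correctly.
print(relative_url(src))                     # -> /assets/.../maxime.jpeg
print(relative_url(src, baseurl="/portal"))  # -> /portal/assets/.../maxime.jpeg
```

This is why the fix is applied uniformly to every image reference in the post rather than to a single broken link.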

_collections/_portal_posts/2025-09-02-gpu-tensor-core-and-data-feeding.md

Lines changed: 8 additions & 8 deletions

```diff
@@ -61,10 +61,10 @@ Intel GPUs) using registers.
 Registers are a kind of small and fast memory bank (called Register File) located just beside the compute engine, as
 this can be seen on the following diagrams showing selected parts of an Intel GPU architecture.
 
-![Xe2 GPU Vector engine Illustration](/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg)<br>
+![Xe2 GPU Vector engine Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/ComputeUnit.jpg' | relative_url }})<br>
 *Illustration of an Intel Xe2 GPU Vector engine architecture (simplified)*
 
-![XeCore GPU Illustration](/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg)<br>
+![XeCore GPU Illustration]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/XeCore.jpg' | relative_url }})<br>
 *Illustration of an Intel XeCore architecture (simplified)*
 
 Basically, the tensor core reads operands A and B from a the *register file* and then writes the accumulated output C
@@ -158,7 +158,7 @@ from Global Memory to the L1 Cache, then the second step is carried out by the `
 Registers, hopefully from the L1 cache if the data is still available in cache (cache hit).
 The diagram below illustrates this process:
 
-![Intel Backend Memory Semantic](/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg)<br>
+![Intel Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/IntelMemory.jpg' | relative_url }})<br>
 *Intel Backend Memory Semantic (synchronous)*
 
 Nvidia has chosen to leverage the Share Local Memory (SMEM) instead of the cache. SMEM is indeed a scratch pad memory
@@ -168,7 +168,7 @@ a memory buffer in SMEM, but also `TritonGPU::LocalLoadOp` and `TritonGPU::Local
 between SMEM and Registers.
 Consequently, the triton process for loading and storing data (synchronously) in the Nvidia architecture is as follow:
 
-![Nvidia Backend Memory Semantic](/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg)<br>
+![Nvidia Backend Memory Semantic]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/NvidiaMemory.jpg' | relative_url }})<br>
 *Nvidia Backend Memory Semantic (synchronous)*
 
 
@@ -195,7 +195,7 @@ So, in our example, if A needs $NumReg_A$ registers to be stored, this means tha
 for A across the loop, and thus the compiler needs to fit the variables used between line 1 and 7 in $N - NumReg_A$
 registers, with $N$ being the total number of registers available.
 
-![variable liveness simple example](/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg)<br>
+![variable liveness simple example]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness_example_annotated.jpg' | relative_url }})<br>
 *Variable liveness simple example*
 
 It is therefore easy to understand that in such a kernel, if the variable A is large and the kernel processing between
@@ -387,7 +387,7 @@ an [optimization pass](https://github.com/intel/intel-xpu-backend-for-triton/blo
 which aims to reduce variable liveness where possible.
 To this ends, the pass attempts to bring load operations closer to the actual uses of the loaded data.
 
-![Reduce Variable Liveness pass diagram](/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg)<br>
+![Reduce Variable Liveness pass diagram]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/liveness-pass-diagram.jpg' | relative_url }})<br>
 *Reduce Variable Liveness pass diagram*
 
 The diagram above shows how the compiler pass works to reduce the liveness of `DotOp` operands.
@@ -436,10 +436,10 @@ We have evaluated the performance of Triton FlashAttention v2 on Intel PVC GPU.
 The following plots show the normalised performance of the FlashAttention kernel with the *reduce-liveness-pass* enabled
 for different input configurations.
 
-![Normalized performance PVC1100](/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.png)<br>
+![Normalized performance PVC1100]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1100_new.png' | relative_url }})<br>
 *FlashAttention v2 Normalized performance PVC1100*
 
-![Normalized performance PVC1550](/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.png)<br>
+![Normalized performance PVC1550]({{ '/assets/images/portal/article-images/2025-09-02-intel-gpu/pvc1550_new.png' | relative_url }})<br>
 *FlashAttention v2 Normalized performance PVC1550*
 
 We can see that the pass has improved the performance for several configurations on all the targets evaluated by more
```
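As a side note on the liveness discussion quoted in the hunks above: the post's register budget, $N - NumReg_A$, is plain ceiling arithmetic over tile size and register width. A sketch with made-up numbers (the tile shape, element size, register width, and register count below are illustrative assumptions, not actual PVC hardware values):

```python
# Registers needed to hold a tile, rounding up to whole registers.
def regs_needed(elems: int, bytes_per_elem: int, reg_bytes: int) -> int:
    """Ceiling division of the tile's byte size by the register width."""
    return -(-(elems * bytes_per_elem) // reg_bytes)

N = 128                                     # assumed total registers per thread
num_reg_a = regs_needed(32 * 64, 2, 64)     # assumed 32x64 fp16 tile, 64-byte regs
remaining = N - num_reg_a                   # budget left for everything else

print(num_reg_a, remaining)  # 64 64
```

The longer A stays live, the longer those `num_reg_a` registers are unavailable to the rest of the loop body, which is exactly the pressure the reduce-liveness pass relieves by moving the load closer to its use.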

0 commit comments