Commit 34496ee

Add slides and post for Support Thrust in Clad project
1 parent 628d56f commit 34496ee

File tree

4 files changed: 191 additions, 1 deletion

_data/crconlist2025.yml

Lines changed: 1 addition & 1 deletion

@@ -194,4 +194,4 @@
   generic functor handling for transformations, demonstration applications, and
   comprehensive unit tests.

-# slides: /assets/presentations/...
+slides: /assets/presentations/Abdelrhman_final_presentation_support_usage_of_Thrust_API_in_clad.pdf

_data/standing_meetings.yml

Lines changed: 4 additions & 0 deletions

@@ -3,6 +3,10 @@
   time_cest: "17:00"
   connect: "[Link to zoom](https://princeton.zoom.us/j/97915651167?pwd=MXJ1T2lhc3Z5QWlYbUFnMTZYQlNRdz09)"
   agenda:
+  - title: "Wrap-Up: Support usage of Thrust API in Clad"
+    date: 2025-10-30 15:00:00 +0200
+    speaker: "Abdelrhman Elrawy"
+    link: "[Slides](/assets/presentations/Abdelrhman_final_presentation_support_usage_of_Thrust_API_in_clad.pdf)"
   - title: "Supporting Automatic Differentiation in CMS Combine profile likelihood scans"
     date: 2025-09-25 15:00:00 +0200
     speaker: "Galin Bistrev"
Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
---
title: "Supporting Thrust API in Clad - Final Report"
layout: post
excerpt: "A comprehensive wrap-up of my Google Summer of Code 2025 project on enabling automatic differentiation of GPU-accelerated code through Thrust API support in Clad."
sitemap: false
author: Abdelrhman Elrawy
permalink: blogs/gsoc25_elrawy_wrapup_blog/
banner_image: /images/blog/gsoc-banner.png
date: 2025-11-04
tags: gsoc automatic-differentiation clad gpu cuda thrust
---
## Project Summary

This summer, I completed my Google Summer of Code 2025 project on **Supporting Thrust API in Clad**, bringing GPU-accelerated automatic differentiation to high-performance computing. Working under the mentorship of Vassil Vassilev and Alexander Penev, I merged **16 pull requests** that extended Clad's capabilities to differentiate through NVIDIA's Thrust parallel algorithms library.

[Clad](https://github.com/vgvassilev/clad) is a source-transformation automatic differentiation (AD) tool built as a Clang plugin. Before this project, Clad could not handle GPU-parallel primitives from Thrust, limiting its applicability in scientific computing and machine learning applications that rely on GPU acceleration. This project bridges that gap, enabling researchers and engineers to compute gradients for CUDA-accelerated code automatically, with minimal code changes.
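
For context, Clad's basic API looks like this (a minimal scalar sketch following the upstream documentation; nothing Thrust-specific yet):

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <cstdio>

double square(double x) { return x * x; }

int main() {
  // Clad transforms the source of `square` at compile time to produce
  // its derivative; no numerical approximation is involved.
  auto dsquare = clad::differentiate(square, "x");
  std::printf("%f\n", dsquare.execute(3.0)); // prints 6.000000
  return 0;
}
```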
## What Was Accomplished

### Core Algorithm Support (8 PRs)

The foundation of this project was implementing differentiation support for Thrust's most fundamental parallel primitives; a sketch of one of the underlying gradient rules follows the list:
1. **`thrust::reduce`** - Parallel reductions with multiple binary operators (sum, max, min, product)
   - Implemented special handling for mathematical edge cases, particularly multiplication with zeros
   - Added support for multiple binary operator variants
   - Developed both forward- and reverse-mode AD implementations

2. **`thrust::inner_product`** - Dot products and inner products
   - Implemented both the 4-argument and 6-argument versions
   - Critical for linear algebra operations on the GPU
   - Enables efficient gradient computation for vector operations

3. **`thrust::transform`** - Element-wise transformations
   - Generic functor support for arbitrary user-defined transformations
   - Maintains efficient GPU parallelization in the derivative code
   - Foundation for many higher-level operations

4. **`thrust::transform_reduce`** - Fused transformation and reduction
   - Combines transform and reduce to minimize memory traffic
   - Essential for ML operations such as computing norms and loss functions

5. **`thrust::copy`** - Memory operations with gradient tracking
   - Ensures proper gradient flow during data movement
   - Handles device-to-device and host-device transfers

6. **`thrust::adjacent_difference`** - Computing differences between adjacent elements
   - Useful for finite-difference schemes and time-series analysis
   - Proper derivative handling for sequential dependencies
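
To make one of these rules concrete: for the dot product `y = a · b`, reverse mode scales each input's gradient by the incoming seed and the *other* vector, so `grad_a[i] = dy * b[i]` and `grad_b[i] = dy * a[i]`. A minimal hand-written Thrust sketch of that rule (illustrative only, not the code Clad generates):

```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
  thrust::device_vector<double> a(3, 2.0), b(3, 5.0);
  thrust::device_vector<double> grad_a(3), grad_b(3);

  // Forward pass: y = a . b
  double y = thrust::inner_product(a.begin(), a.end(), b.begin(), 0.0);

  // Reverse pass for seed dy: grad_a = dy * b, grad_b = dy * a
  double dy = 1.0;
  using namespace thrust::placeholders;
  thrust::transform(b.begin(), b.end(), grad_a.begin(), dy * _1);
  thrust::transform(a.begin(), a.end(), grad_b.begin(), dy * _1);
  return 0;
}
```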
### Advanced Operations (4 PRs)

Building on the core algorithms, I implemented support for more sophisticated parallel primitives (a scan-gradient sketch follows the list):
1. **Scan Operations** - Inclusive and exclusive prefix sums
   - Fundamental building blocks for many parallel algorithms
   - Applications in cumulative distributions, parallel scheduling, and dynamic programming
   - Efficient parallel backward pass for gradient accumulation
   - Technical challenge: the output at position *i* depends on all inputs up to *i*

2. **`thrust::sort_by_key`** - Sorting key-value pairs with gradient preservation
   - The forward pass records the index permutation
   - The backward pass applies the inverse permutation to the gradients
   - Critical for algorithms requiring sorted data structures

3. **`thrust::reduce_by_key`** - Segmented reductions for grouped data
   - SQL-like GROUP BY operations on the GPU
   - Essential for batch processing in neural networks
   - Complex gradient routing through irregular partition boundaries

4. **Segmented Scans** - Partitioned prefix-sum operations
   - Prefix sums within each segment
   - Handles complex gradient flow through segmented data structures
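
Why scans are tricky: for an inclusive scan `y_i = x_0 + ... + x_i`, input `x_j` feeds every output with `i >= j`, so the reverse pass is itself a scan of the incoming gradient run in the opposite direction: `grad_x[j] = sum over i >= j of grad_y[i]`. A minimal illustration of that identity with plain Thrust (again, not Clad's generated code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/iterator/reverse_iterator.h>

int main() {
  // Incoming gradient w.r.t. the scan output, dL/dy.
  thrust::device_vector<double> grad_y(4, 1.0);
  thrust::device_vector<double> grad_x(4);

  // Reverse pass of y = inclusive_scan(x): a suffix sum of grad_y,
  // computed as an inclusive scan over reversed ranges.
  thrust::inclusive_scan(thrust::make_reverse_iterator(grad_y.end()),
                         thrust::make_reverse_iterator(grad_y.begin()),
                         thrust::make_reverse_iterator(grad_x.end()));
  // grad_x = {4, 3, 2, 1}: element j receives the sum of grad_y[j..n-1].
  return 0;
}
```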
### Infrastructure Improvements (2 PRs)

1. **`thrust::device_vector` Support**
   - Differentiation through the constructors of Thrust's containers
   - Interoperability with existing Thrust code

2. **Generic Functor Support for Transform**
   - Users can define custom functors, as sketched below
   - Greatly extends the flexibility of `thrust::transform`
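
For example, generic functor support covers user-defined callables like the hypothetical `Square` below, rather than only Thrust's built-in operators:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// A hypothetical user-defined functor; generic functor support means
// transformations like this can appear in differentiated code.
struct Square {
  __host__ __device__ double operator()(double x) const { return x * x; }
};

int main() {
  thrust::device_vector<double> in(4, 3.0), out(4);
  thrust::transform(in.begin(), in.end(), out.begin(), Square{});
  return 0;
}
```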
### Demonstration Applications (2 PRs)

To showcase the practical utility of this work, I developed several real-world demonstrations (a training-loop sketch follows the list):
1. **Multiple Thrust-Based Demo Applications**
   - Linear regression with GPU-accelerated gradient computation
   - Particle simulation with automatic derivative tracking
   - Demonstrates end-to-end workflows from problem setup to gradient computation

2. **Bag-of-Words Logistic Regression**
   - Complete machine learning pipeline using Thrust and Clad
   - GPU-accelerated logistic regression with gradient descent
   - Cross-entropy loss function with automatic differentiation
   - Showcases how Thrust operations (`reduce`, `transform`, `inner_product`) combine for ML workflows
   - All computations remain in GPU device memory for maximum performance
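
The training loops in these demos follow the standard pattern sketched below (schematic only; the `loss` function and its signature are illustrative, not the demos' actual code):

```cpp
#include "clad/Differentiator/Differentiator.h"

// Illustrative loss: mean squared error of a 1-parameter model y = w * x.
double loss(double w, const double* x, const double* y, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) {
    double e = w * x[i] - y[i];
    s += e * e;
  }
  return s / n;
}

int main() {
  double x[] = {1.0, 2.0, 3.0}, y[] = {2.0, 4.0, 6.0};
  double w = 0.0, lr = 0.05;

  // Clad generates dloss/dw once; the loop only executes it.
  auto dloss = clad::gradient(loss, "w");
  for (int step = 0; step < 100; ++step) {
    double dw = 0.0;
    dloss.execute(w, x, y, 3, &dw);
    w -= lr * dw; // gradient-descent update; w converges toward 2
  }
  return 0;
}
```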
### Example: Differentiating `thrust::reduce`

Consider a simple reduction operation:
```cpp
double sum = thrust::reduce(vec.begin(), vec.end(), 0.0, thrust::plus<double>());
```
The derivative with respect to the input vector is mathematically straightforward (each element contributes 1 to the sum), but the implementation must:

1. Recognize the Thrust API call during Clad's AST traversal
2. Generate GPU-compatible derivative code
3. Properly allocate gradient storage on the device
4. Handle edge cases (empty ranges, custom operators)

My implementation handles all of these automatically, generating efficient CUDA code that preserves the parallel performance characteristics of the original operation.
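
End to end, usage could look roughly like the following (a hedged sketch: the exact function signatures and derivative-argument conventions accepted by the Thrust-aware overloads are assumptions here, not the project's verbatim API):

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// A function whose body Clad transforms, Thrust call included.
double sum_all(thrust::device_vector<double>& vec) {
  return thrust::reduce(vec.begin(), vec.end(), 0.0, thrust::plus<double>());
}

int main() {
  thrust::device_vector<double> vec(5, 2.0);
  thrust::device_vector<double> dvec(5, 0.0); // gradient storage on the device

  auto grad = clad::gradient(sum_all);
  grad.execute(vec, &dvec); // expected: dvec == {1, 1, 1, 1, 1}
  return 0;
}
```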
## Challenges and Solutions

### 1. GPU Memory Errors

**Problem**: Tracing memory access violations within the CUDA/Thrust environment proved complex. Pointer-dereferencing errors on the GPU manifest differently than on the CPU, making debugging challenging.

**Solution**: Leveraged NVIDIA's `compute-sanitizer` tool for precise memory-error detection, and implemented careful GPU pointer management with explicit lifetime tracking.
### 2. Mathematical Edge Cases

**Problem**: Derivatives can be undefined or require special handling for certain operations. For example, when a product reduction contains a zero factor, the product itself is zero, but the partial derivative with respect to the zero-valued element is the product of all the other factors; a naive `P / x_j` implementation loses that gradient information.

**Solution**: Implemented logic to count and track zero-valued inputs, with special-case handling for single and multiple zeros (sketched below). Added extensive unit tests covering edge cases, including:
- Multiplication chains with zeros
- Division by small numbers
- Overflow/underflow scenarios
- Empty sequences
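
A minimal sketch of the zero-counting idea for a product reduction (illustrative, not Clad's generated code): with no zeros, `grad_j = P / x_j`; with exactly one zero, only the zero element gets a gradient (the product of the non-zero factors); with two or more zeros, every gradient vanishes.

```cpp
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/transform.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// Maps zeros to 1 so the product of the remaining factors survives.
struct NonZeroOrOne {
  __host__ __device__ double operator()(double x) const { return x == 0.0 ? 1.0 : x; }
};

// Per-element gradient of P = prod(x), given the zero count and the
// product of the non-zero factors.
struct ProductGrad {
  int zeros;
  double prod_nonzero;
  __host__ __device__ double operator()(double x) const {
    if (zeros == 0) return prod_nonzero / x;               // usual P / x_j
    if (zeros == 1) return x == 0.0 ? prod_nonzero : 0.0;  // only the zero element
    return 0.0;                                            // >= 2 zeros: all vanish
  }
};

int main() {
  double h[] = {3.0, 0.0, 5.0};
  thrust::device_vector<double> x(h, h + 3), grad(3);

  int zeros = static_cast<int>(thrust::count(x.begin(), x.end(), 0.0));
  double pnz = thrust::transform_reduce(x.begin(), x.end(), NonZeroOrOne{},
                                        1.0, thrust::multiplies<double>());
  thrust::transform(x.begin(), x.end(), grad.begin(), ProductGrad{zeros, pnz});
  // grad == {0, 15, 0}: the gradient survives only at the zero element.
  return 0;
}
```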
### 3. Correctness Validation

**Problem**: Verifying that GPU-accelerated derivatives are mathematically correct is non-trivial, and standard debugging tools do not work well with GPU code.

**Solution**: A multi-pronged validation approach (a finite-difference check is sketched below):
- **Finite-difference comparison**: Compare AD results against numerical derivatives
- **Comprehensive unit tests**: Test each primitive in isolation with known inputs/outputs
- **Integration tests**: Verify derivatives in real-world demo applications
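
The finite-difference check compares the AD gradient against a central difference, `(f(x + h e_j) - f(x - h e_j)) / (2h)`, a standard sanity test. A minimal sketch (the helper names are mine, not the test suite's):

```cpp
#include <cstdio>

// Central-difference derivative of f at x[j]; h trades truncation
// against round-off error (~1e-6 is a common choice for doubles).
template <typename F>
double central_diff(F f, double* x, int n, int j, double h = 1e-6) {
  double old = x[j];
  x[j] = old + h; double fp = f(x, n);
  x[j] = old - h; double fm = f(x, n);
  x[j] = old;
  return (fp - fm) / (2.0 * h);
}

double sum_of_squares(double* x, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += x[i] * x[i];
  return s;
}

int main() {
  double x[] = {1.0, 2.0, 3.0};
  // The AD gradient of sum_of_squares is 2*x; check component 1 numerically.
  double numeric = central_diff(sum_of_squares, x, 3, 1);
  std::printf("numeric=%.8f analytic=%.8f\n", numeric, 2.0 * x[1]);
  return 0;
}
```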
## Impact and Applications

This work significantly expands Clad's applicability in several domains:

### High-Energy Physics
- Gradient-based optimization for detector simulations
- Parameter estimation in complex physical models
- Enables GPU acceleration of gradient computations for large-scale simulations

### Machine Learning
- GPU-accelerated training for custom models
- Efficient gradient computation for loss functions
- Enables researchers to prototype GPU-native ML algorithms with automatic differentiation
## Future Work

While the core objectives were achieved, several exciting directions remain for future development:

### Additional Thrust Primitives

- **`thrust::gather` and `thrust::scatter`**: Memory access patterns with gradients
- **`thrust::partition`**: Partitioning operations with gradient preservation
- **`thrust::unique`**: Handling duplicate elimination in derivative code
- **Additional sorting operations**: `thrust::stable_sort`, `thrust::sort_by_key` variants

### Real-World Applications

- **Neural network training**: Full GPU-native neural network training with Clad
- **Physics simulations**: Large-scale physics simulations with gradient-based parameter optimization
## Conclusion

This Google Summer of Code project successfully brought GPU-accelerated automatic differentiation to Clad through comprehensive Thrust API support. The 16 merged pull requests cover core algorithms, advanced operations, infrastructure improvements, and practical demonstrations. This work opens new possibilities for researchers and engineers who need efficient gradient computations in GPU-accelerated applications.

## Related Links

- [Clad GitHub Repository](https://github.com/vgvassilev/clad)
- [Project Proposal](https://hepsoftwarefoundation.org/gsoc/2025/proposal_Clad-ThrustAPI.html)
- [My GitHub Profile](https://github.com/a-elrawy)
assets/presentations/Abdelrhman_final_presentation_support_usage_of_Thrust_API_in_clad.pdf: Binary file not shown.
