
Conversation

@nnethercote (Collaborator) commented Dec 11, 2025

CUDA uses 32-bit unsigned integers for indices and dimensions, and rust-cuda currently follows suit. This is awkward in Rust, which uses `usize` for indices and dimensions, and it forces lots of `as u32`/`as usize` casts.

This PR changes the `u32` indices/dimensions to `usize`, which makes rust-cuda nicer to use. E.g. within examples/ the number of `as usize` casts drops from 14 to 0, and the number of `as u32` casts drops from 10 to 2.

Specifically:
- `thread_idx*`
- `block_idx*`
- `block_dim*`
- `grid_dim*`
- `index*`

This removes lots of `as u32`/`as usize` casts.
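
For illustration, here is a hedged sketch of what a kernel body looks like after the change, modeled loosely on the repository's `add` example (not an exact excerpt; names and details may differ):

```rust
use cuda_std::prelude::*;

#[kernel]
#[allow(improper_ctypes_definitions)]
pub unsafe fn add(a: &[f32], b: &[f32], c: *mut f32) {
    // Previously: let i = thread::index_1d() as usize;
    let i = thread::index_1d();
    if i < a.len() {
        // Elementwise add, bounds-checked against `a`.
        *c.add(i) = a[i] + b[i];
    }
}
```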
nnethercote requested a review from LegNeato December 11, 2025 06:33
@nnethercote (Collaborator Author)

cc @FractalFir

@FractalFir (Collaborator) left a comment

I have some issues with the use of `as` casts (I fear they will truncate things unexpectedly), plus some general things I noticed whilst going over the PR.
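
For example, a made-up host-side illustration of the failure mode (values chosen for demonstration, not taken from the PR):

```rust
fn main() {
    // A length that fits in usize (on 64-bit hosts) but not in u32.
    let n: usize = 5_000_000_000;
    // `as` silently wraps modulo 2^32 instead of reporting an error.
    let truncated = n as u32;
    assert_eq!(truncated, 705_032_704);
    println!("{n} silently became {truncated}");
}
```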

```diff
 }
-impl From<u32> for GridSize {
-    fn from(x: u32) -> GridSize {
+impl From<usize> for GridSize {
```
@FractalFir (Collaborator)

Maybe consider keeping the old impl, if that does not cause any issues (to allow casting from u32 too).

IDK if this would be of any worth.
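
A sketch of what keeping both conversions might look like, routing the old impl through the new one so they stay consistent (not taken from the actual patch):

```rust
// Hypothetical: retain the old u32 conversion alongside the new usize one.
impl From<u32> for GridSize {
    fn from(x: u32) -> GridSize {
        // Widening u32 -> usize is lossless, so delegate to the usize impl.
        GridSize::from(x as usize)
    }
}
```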

@nnethercote (Collaborator Author)

I thought about it but none of the existing example code requires it so it doesn't seem useful. We can add it back if necessary.

```diff
 pub fn thread_idx_x() -> usize {
     // The range is derived from the `block_idx_x` range.
-    in_range!(core::arch::nvptx::_thread_idx_x, 0..1024)
+    in_range!(core::arch::nvptx::_thread_idx_x, 0..1024) as usize
```
@FractalFir (Collaborator)

Unrelated to the current change, but I find the range here suspicious:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy

> There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same streaming multiprocessor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
> However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

The "current GPUs" suggests this is not an API promise, but mearly the current highest value of this. Can this raise in the future? What happens then?

@nnethercote (Collaborator Author)

See table 30 in this section for a more specific description of the limits here. You are right that it's not a guarantee, but the relevant numbers haven't changed from Compute Capability 5.0 all the way to 12.x.
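
If the limit ever did rise, it could also be checked from the host at runtime. A rough sketch, assuming cust exposes the driver API's device attributes roughly as below (exact names and signatures may differ):

```rust
use cust::device::{Device, DeviceAttribute};
use cust::error::CudaResult;

fn max_threads_per_block() -> CudaResult<i32> {
    cust::init(cust::CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    // Reports 1024 on every Compute Capability from 5.0 through 12.x.
    device.get_attribute(DeviceAttribute::MaxThreadsPerBlock)
}
```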

@nnethercote (Collaborator Author)

Looking at this I realize that some of these comments have an error in them. I'll fix that.

```diff
-let idx = thread::index_1d() as usize;
+let idx = thread::index_1d();

 if idx == 0 {
```
@FractalFir (Collaborator)

Don't we have something like `thread::first` for this "is thread idx 0" check? Unrelated to the core changes, but it would be nice to use that instead of checking only one of the indices.

@nnethercote (Collaborator Author)

Yes, we do.

@nnethercote (Collaborator Author)

Outside the scope of this PR, however.
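
For reference, the guard would then read something like this (assuming `thread::first()` returns true only in the thread whose index is 0 on every axis):

```rust
// Instead of checking a single axis:
//     if thread::index_1d() == 0 { ... }
// check all of them at once:
if thread::first() {
    // Runs in exactly one thread of the launch.
}
```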

```diff
 #[allow(improper_ctypes_definitions)]
 pub unsafe fn add(a: &[T], b: &[T], c: *mut T) {
-    let i = thread::index_1d() as usize;
+    let i = thread::index_1d();
```
@FractalFir (Collaborator)

Once again, unrelated to the core changes, but something I noticed while reviewing the code: should we encourage people to use `index_1d` like this?

What happens if we have launch dimensions with values different from 1 on the Y/Z axis? (I believe this would be a data race then.)

@nnethercote (Collaborator Author)

You are right, and the comment on index talks about this here. But this is also outside the scope of this PR.
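
For completeness, a hedged sketch of a fully flattened index built from the accessors this PR converts to `usize` (cuda_std may already provide an equivalent helper; the arithmetic below is the standard linearization, not code from the repository):

```rust
use cuda_std::thread;

/// Unique linear index across the whole launch, valid even when the
/// Y/Z block and grid dimensions are greater than 1. Device-side only.
pub fn flat_index() -> usize {
    let (dx, dy, dz) = (thread::block_dim_x(), thread::block_dim_y(), thread::block_dim_z());
    let (gx, gy) = (thread::grid_dim_x(), thread::grid_dim_y());

    // Linearize the thread within its block...
    let thread_in_block = thread::thread_idx_x()
        + thread::thread_idx_y() * dx
        + thread::thread_idx_z() * dx * dy;
    // ...and the block within the grid.
    let block_in_grid = thread::block_idx_x()
        + thread::block_idx_y() * gx
        + thread::block_idx_z() * gx * gy;

    block_in_grid * (dx * dy * dz) + thread_in_block
}
```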

Instead of `u32`, because using `usize` for dimensions and indices is
more natural in Rust and avoids lots of casts.
These types are very similar to the same-named types from `cust`, which
were changed in the previous commit. Currently these types are in a
module that is commented out and marked as "WIP", but it makes sense to
change them like the `cust` types in case they become used in the
future. (Note: I temporarily uncommented the code to make sure the
changes compile.)
The indices are derived from the dimensions.
@nnethercote (Collaborator Author)

I added checking to the truncations, and added a new commit that fixes the incorrect comments.
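
A rough sketch of the kind of check meant here, on the host side where `usize` dimensions have to be handed back to CUDA as `u32` (the actual helper in the PR may look different):

```rust
/// Convert a usize dimension/index to the u32 CUDA expects,
/// panicking instead of silently wrapping if it does not fit.
fn dim_to_u32(x: usize) -> u32 {
    u32::try_from(x).expect("dimension/index does not fit in u32")
}
```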
