Releases: mitsuba-renderer/drjit
Release (v1.2.0)
New Features
- Event API: Added an event API for fine-grained timing and synchronization of GPU kernels. This enables more detailed performance profiling and better control over asynchronous operations. (Dr.Jit PR #441, Dr.Jit-Core PR #174).
- OpenGL Interoperability: Improved CUDA-OpenGL interoperability with simplified APIs. This enables efficient sharing of data between CUDA kernels and OpenGL rendering. (Dr.Jit PR #429, Dr.Jit-Core PR #164, contributed by Merlin Nimier-David).
- Enhanced Int8/UInt8 Support: Improved support for 8-bit integer types with better casting and bitcast operations. (Dr.Jit PR #428, Dr.Jit-Core PR #163, contributed by Merlin Nimier-David).
Performance Improvements
- Register Spilling to Shared Memory: CUDA backend now supports spilling registers to shared memory, improving performance for kernels with high register pressure. (Dr.Jit-Core commit `fdc7cae7`).
- Memory View Support: Arrays can now be converted to Python `memoryview` objects for efficient zero-copy data access. (commit `b7039184`).
- DLPack GIL Release: The `dr.ArrayBase.dlpack()` method now releases the GIL while waiting, improving multi-threaded performance. (commit `0adf9b4a`).
- Thread Synchronization: `dr.sync_thread()` now releases the GIL while waiting, preventing unnecessary blocking in multi-threaded applications. (commit `956d2f57`).
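Why releasing the GIL during a blocking wait matters can be shown with the standard library alone. The sketch below does not call Dr.Jit: `time.sleep()` stands in for a blocking wait such as a device synchronization, and the helper name `count_while_waiting` is made up for this illustration.

```python
import threading
import time

def count_while_waiting(wait_seconds: float) -> int:
    """Count how often a second thread runs while the caller is blocked."""
    ticks = 0
    stop = threading.Event()

    def worker() -> None:
        nonlocal ticks
        while not stop.is_set():
            ticks += 1  # pure-Python work that needs the GIL

    thread = threading.Thread(target=worker)
    thread.start()
    # Stand-in for a blocking wait. CPython releases the GIL inside
    # time.sleep(), so the worker keeps running in the meantime.
    time.sleep(wait_seconds)
    stop.set()
    thread.join()
    return ticks

ticks = count_while_waiting(0.05)
print(f"worker iterations while the main thread was blocked: {ticks}")
```

If the blocking call held the GIL instead, the worker could not make progress until the wait finished, which is the situation the two changes above avoid.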
API Improvements
- Spherical Direction Utilities: Added Python implementation of spherical direction utilities (`dr.sphdir`). (PR #432, contributed by Sébastien Speierer).
- Matrix Conversions: Added support for converting between 3D and 4D matrices: `Matrix4f` can be constructed from a 3D matrix and `Matrix3f` from a 4D matrix. (commit `7f8ea890`).
- Quaternion API: Improved the quaternion Python API for better usability and consistency. (commit `282da88a`).
- Type casts: Allow casting between Dr.Jit types to properly allow AD<->non-AD conversions when required. (commit `72f1e6b2`).
Bug Fixes
- Fixed deadlock issues in `@dr.freeze` decorator. (commit `e8fc555e`).
- Fixed gradient tracking in `Texture.tensor()` to ensure gradients are never dropped inadvertently. (PR #444).
- Fixed AD support for C++ `repeat` and `tile` operations with proper gradient propagation. (commits `fd693056`, `282da88a`).
- Fixed Python object traversal to check that `__dict__` exists before accessing it, preventing crashes with certain object types. (commit `433adaf0`).
- Fixed symbolic loop size calculation to properly account for side-effects. (Dr.Jit-Core commit `31bf911`).
- Fixed read-after-free issue in OptiX SBT data loading. (Dr.Jit-Core commit `009adef`, contributed by Merlin Nimier-David).
Other Improvements
- Updated to nanobind v2.9.2.
- Improved error messages by adding function names to vectorized call errors. (Dr.Jit-Core PR #165, contributed by Sébastien Speierer).
- Added missing checks for JIT leak warnings. (Dr.Jit-Core PR #166, contributed by Sébastien Speierer).
- Added warning for LLVM API initialization failures. (Dr.Jit-Core PR #168, contributed by Sébastien Speierer).
- Fixed pytest warnings and improved test infrastructure. (PR #436).
Release (v1.1.0)
The v1.1.0 release of Dr.Jit includes several major new features:
Major Features
- Cooperative Vectors: Dr.Jit now provides an API to efficiently evaluate matrix-vector products in parallel programs. The API targets small matrices (e.g., 128×128, 64×64, or smaller) and inlines all computation into the program. Threads work cooperatively to perform these operations efficiently. On NVIDIA GPUs (Turing or newer), this leverages the OptiX cooperative vector API with tensor core acceleration. On the LLVM backend, operations compile to sequences of packet instructions (e.g., AVX512). See the cooperative vector documentation for more details. Example:

  ```python
  import drjit as dr
  import drjit.nn as nn
  from drjit.cuda.ad import Float16, TensorXf16

  # Create a random number generator
  rng = dr.rng(seed=0)

  # Create a matrix and bias representing an affine transformation
  A = rng.normal(TensorXf16, shape=(3, 16))  # 3×16 matrix
  b = TensorXf16([1, 2, 3])                  # Bias vector

  # Pack into optimized memory layout
  buffer, A_view, b_view = nn.pack(A, b)

  # Create a cooperative vector from 16 inputs
  vec_in = nn.CoopVec(Float16(1), Float16(2), ...)

  # Perform matrix-vector multiplication: A @ vec_in + b
  vec_out = nn.matvec(A_view, vec_in, b_view)

  # Unpack result back to regular arrays
  x, y, z = vec_out
  ```
- Neural Network Library: Building on the cooperative vector functionality, the new `drjit.nn` module provides modular abstractions for constructing, evaluating, and optimizing neural networks, similar to PyTorch's `nn.Module`. This enables fully fused evaluation of small multilayer perceptrons (MLPs) within larger programs. See the neural network module documentation for more details. Example:

  ```python
  import drjit as dr
  import drjit.nn as nn
  from drjit.cuda.ad import TensorXf16, Float16

  # Define a small MLP for function approximation
  net = nn.Sequential(
      nn.SinEncode(16),               # Sinusoidal encoding
      nn.Linear(-1, -1, bias=False),  # Hidden layer
      nn.ReLU(),
      nn.Linear(-1, -1, bias=False),  # Hidden layer
      nn.ReLU(),
      nn.Linear(-1, 3, bias=False),   # Output layer (3 outputs)
      nn.Tanh()
  )

  # Instantiate and optimize for 16-bit tensor cores
  rng = dr.rng(seed=0)
  net = net.alloc(dtype=TensorXf16, size=2, rng=rng)
  weights, net = nn.pack(net, layout='training')

  # Evaluate the network
  inputs = nn.CoopVec(Float16(0.5), Float16(0.7))
  outputs = net(inputs)
  x, y, z = outputs  # Three output values
  ```

  (PR #384).
- Hash Grid Encoding: Added neural network hash grid encoding inspired by Instant NGP, providing multi-resolution spatial encodings. This includes both traditional hash grids and permutohedral encodings for efficient high-dimensional inputs. (PR #390, contributed by Christian Döring and Merlin Nimier-David).
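To make the idea of a multi-resolution hash grid concrete, here is a plain-Python sketch of the lookup index computation in the style of the Instant NGP paper: an integer grid cell is hashed by XOR-ing its coordinates multiplied by fixed primes, modulo the table size. This is not Dr.Jit's implementation. The function names and the default values for `base_res` and `growth` are assumptions chosen for illustration.

```python
# Per-dimension multipliers from the Instant NGP spatial hash
PRIMES = (1, 2654435761, 805459861)

def level_resolution(level: int, base_res: int = 16, growth: float = 2.0) -> int:
    """Grid resolution at a given level: floor(base_res * growth**level)."""
    return int(base_res * growth ** level)

def grid_cell(position, resolution: int):
    """Integer cell that contains a position in [0, 1)^d."""
    return tuple(int(x * resolution) for x in position)

def spatial_hash(cell, table_size: int) -> int:
    """XOR the coordinates scaled by the primes (32-bit), then wrap."""
    h = 0
    for coord, prime in zip(cell, PRIMES):
        h ^= (coord * prime) & 0xFFFFFFFF
    return h % table_size

# One table slot per resolution level for a single 3D query point
table_size = 2 ** 14
position = (0.5, 0.25, 0.0)
slots = [
    spatial_hash(grid_cell(position, level_resolution(level)), table_size)
    for level in range(4)
]
print(slots)
```

A real encoding would look up the feature vectors stored at the corners of the cell on every level and interpolate them; the sketch only covers the indexing step.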
- Function Freezing: Added the `@dr.freeze` decorator to eliminate repeated tracing overhead by caching and replaying JIT-compiled kernels. Dr.Jit normally traces operations to build computation graphs for compilation, which can become a bottleneck when the same complex computation is performed repeatedly (e.g., in optimization loops). The decorator records kernel launches on the first call and replays them directly on subsequent calls, avoiding re-tracing.

  This can dramatically accelerate programs and makes Dr.Jit usable for realtime rendering and other applications with strict timing requirements. See the function freezing documentation for more details. Example:

  ```python
  import drjit as dr
  from drjit.cuda import Float, UInt32

  # Without freezing - traces every time
  def func(x):
      y = seriously_complicated_code(x)
      dr.eval(y)  # ..intermediate evaluations..
      return huge_function(y, x)

  # With freezing - traces only once
  @dr.freeze
  def frozen(x):
      ...  # same code as above -- no changes needed
  ```

  (Dr.Jit PR #336, Dr.Jit-Core PR #107, contributed by Christian Döring).
- Shader Execution Reordering (SER): Added the function `dr.reorder_threads()` to shuffle threads across the GPU to reduce warp-level divergence. When threads in a warp take different branches (e.g., in `dr.switch()` statements or vectorized virtual function calls) performance can degrade significantly. SER can group threads with similar execution paths into coherent warps to avoid this. This feature is a no-op in LLVM mode. Example:

  ```python
  import drjit as dr
  from drjit.cuda import Array3f, UInt32

  arg = Array3f(...)              # Prepare data and callable index
  callable_idx = UInt32(...) % 4  # 4 different callables

  # Reorder threads before dr.switch() to reduce divergence
  # The key uses 2 bits (for 4 callables)
  arg = dr.reorder_threads(key=callable_idx, num_bits=2, value=arg)

  # Now, threads with the same callable_idx are grouped together
  callables = [func0, func1, func2, func3]
  out = dr.switch(callable_idx, callables, arg)
  ```

  (Dr.Jit PR #395, Dr.Jit-Core PR #145).

  Related to this, the OptiX backend now requires the OptiX 8.0 ABI (specifically, ABI version 87). This is a requirement for SER. (Dr.Jit-Core PR #117).
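The effect of grouping threads by key can be illustrated with a small plain-Python model. It assumes a warp size of 32 and models reordering as a stable sort by key; actual hardware reordering is a hint-driven mechanism and need not be a full sort. The helpers `divergent_warps` and `reorder_by_key` are hypothetical names for this sketch.

```python
WARP_SIZE = 32  # assumed warp width for this model

def divergent_warps(keys, warp_size: int = WARP_SIZE) -> int:
    """Count warps whose threads do not all share the same key."""
    count = 0
    for start in range(0, len(keys), warp_size):
        if len(set(keys[start:start + warp_size])) > 1:
            count += 1
    return count

def reorder_by_key(keys, values):
    """Stable sort of the threads by key; returns reordered keys and values."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in order], [values[i] for i in order]

# 128 threads that cycle through 4 callables: every warp sees all 4 branches
keys = [i % 4 for i in range(128)]
values = list(range(128))

sorted_keys, sorted_values = reorder_by_key(keys, values)
print(divergent_warps(keys), divergent_warps(sorted_keys))
```

In this model every warp is divergent before the reordering and none is afterwards, because each group of 32 consecutive threads then executes a single callable.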
- Random Number Generation API: Introduced a new random number generation API around an abstract `Generator` object analogous to NumPy. Under the hood, this API uses the `Philox4x32` counter-based PRNG from [Salmon et al. (2011)](https://www.thesalmons.org/john/random123/papers/random123sc11.pdf), which provides high-quality random variates that are statistically independent within and across parallel streams. Users create generators with `dr.rng()` and call methods like `.random()` and `.normal()`. Example:

  ```python
  import drjit as dr
  from drjit.cuda import Float, TensorXf

  # Create a random number generator
  rng = dr.rng(seed=42)

  # Generate various random distributions
  uniform = rng.random(Float, 1000)        # Uniform [0, 1)
  normal = rng.normal(Float, 1000)         # Standard normal
  tensor = rng.random(TensorXf, (32, 32))  # Random tensor
  ```

  (PR #417).
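For readers unfamiliar with counter-based generators, the following plain-Python sketch follows the Philox4x32 construction from Salmon et al. with 10 rounds. The output depends only on a 128-bit counter and a 64-bit key, so parallel threads can draw independent samples without shared state. How Dr.Jit maps seeds and array indices to keys and counters is not covered here, and the helper `to_unit_float` is an assumption added for illustration.

```python
M0, M1 = 0xD2511F53, 0xCD9E8D57  # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # per-round key increments
MASK = 0xFFFFFFFF

def philox4x32(counter, key, rounds: int = 10):
    """Map four 32-bit counter words and two 32-bit key words to four words."""
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(rounds):
        p0 = M0 * c0  # 64-bit products, split into high and low halves
        p1 = M1 * c2
        hi0, lo0 = p0 >> 32, p0 & MASK
        hi1, lo1 = p1 >> 32, p1 & MASK
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
        k0 = (k0 + W0) & MASK  # bump the key for the next round
        k1 = (k1 + W1) & MASK
    return (c0, c1, c2, c3)

def to_unit_float(word: int) -> float:
    """Use the top 24 bits of a 32-bit word as a value in [0, 1)."""
    return (word >> 8) * (1.0 / (1 << 24))

# Same key, consecutive counters: independent draws without any state
sample = philox4x32((0, 0, 0, 0), (42, 0))
next_sample = philox4x32((1, 0, 0, 0), (42, 0))
print([to_unit_float(w) for w in sample])
```

Because there is no evolving state, thread `i` can simply use `i` as part of its counter, which is what makes this family of generators attractive for vectorized code.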
- Array Resampling and Convolution: Added `dr.resample()` for changing the resolution of arrays/tensors along specified axes, and `dr.convolve()` for convolution with continuous kernels. Both operations are fully differentiable and support various reconstruction filters (box, linear, cubic, lanczos, gaussian). Example:

  ```python
  import drjit as dr

  # Resample a 2D signal to different resolution
  data = dr.cuda.TensorXf(original_data)  # Shape: (128, 128)
  upsampled = dr.resample(
      data,
      shape=(256, 256),  # Target resolution
      filter='lanczos'   # High-quality filter
  )

  # Apply Gaussian blur via convolution
  blurred = dr.convolve(
      data,
      filter='gaussian',
      radius=2.0
  )
  ```
- Gradient-Based Optimizers: Added an optimization framework that includes various standard optimizers inspired by PyTorch. It includes `dr.opt.SGD` with optional momentum and Nesterov acceleration, `dr.opt.Adam` with adaptive learning rates, and `dr.opt.RMSProp`. The optimizers own the parameters and automatically handle mixed-precision training. An optional helper class `dr.opt.GradScalar` implements adaptive gradient scaling for low-precision training.

  ```python
  import drjit as dr
  from drjit.opt import Adam
  from drjit.cuda import Float

  # Create optimizer and register parameters
  opt = Adam(lr=1e-3)
  rng = dr.rng(seed=0)
  opt['params'] = Float(rng.normal(Float, 100))

  # Optimization loop for unknown function f(x)
  for i in range(1000):
      # Fetch current parameters
      params = opt['params']

      # Compute loss and gradients
      loss = f(params)  # Some function to optimize
      dr.backward(loss)

      # Update parameters
      opt.step()
  ```
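As background on what an Adam-style `step()` computes, here is the textbook update rule (Kingma and Ba) written in plain Python for a list of scalar parameters. It is a reference sketch only: it does not reproduce Dr.Jit's optimizer, its mixed-precision handling, or its parameter ownership, and the function name `adam_step` and the toy objective are assumptions.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; `t` is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g      # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g  # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)           # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy objective: f(x) = sum((x_i - 3)^2), with gradient 2 * (x_i - 3)
params = [0.0, 10.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 2001):
    grads = [2.0 * (p - 3.0) for p in params]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.05)
print(params)
```

Note how the very first step has magnitude close to `lr` regardless of the gradient scale, which is the "adaptive learning rate" behavior mentioned above.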
- TensorFlow Interoperability: Added TensorFlow interop via `@dr.wrap`, supporting forward and backward automatic differentiation with comprehensive support for variables and tensors. (PR #301, contributed by Jakob Hoydis).
...
Release (v1.0.5)
Release (v1.0.4)
- Workaround for OptiX linking issue in driver version R570+ (commit `0c9c54e`).
Release (v1.0.3)
- Fixes to `drjit.wrap` (commit `166be21`).
Release (v1.0.2)
Release (v1.0.1)
- Fixes to various edge cases of `drjit.dda.dda()` (commit `4ce97d`).
Release (v1.0.0)
The 1.0 release of Dr.Jit marks a major new phase of this project. We addressed long-standing limitations and thoroughly documented every part of Dr.Jit. Due to the magnitude of the changes, some incompatibilities are unavoidable: bullet points with an exclamation mark highlight changes with an impact on source-level compatibility.
Here is what's new:
- Python bindings: Dr.Jit comes with an all-new set of Python bindings created using the nanobind library. This has several consequences:
  - Tracing Dr.Jit code written in Python is now significantly faster (we've observed speedups by a factor of ~10-20×). This should help in situations where performance is limited by tracing rather than kernel evaluation.
  - Thorough type annotations improve static type checking and code completion in editors like VS Code.
  - Dr.Jit can now target Python 3.12's stable ABI. This means that binary wheels will work on future versions of Python without recompilation.
- Natural syntax: vectorized loops and conditionals can now be expressed using natural Python syntax. To see what this means, consider the following function that computes an integer power of a floating point array:

  ```python
  import drjit as dr
  from drjit.cuda import Int, Float

  @dr.syntax  # <-- new!
  def ipow(x: Float, n: Int):
      result = Float(1)

      while n != 0:       # <-- vectorized loop ('n' is an array)
          if n & 1 != 0:  # <-- vectorized conditional
              result *= x
          x *= x
          n >>= 1

      return result
  ```

  Given that this function processes arrays, we expect that the condition of the `if` statement may disagree among elements. Also, each element may need a different number of loop iterations. However, such component-wise conditionals and loops aren't supported by normal Python. Previously, Dr.Jit provided ways of expressing such code using masking and a special `dr.cuda.Loop` object, but this was rather tedious.

  The new `@drjit.syntax` decorator greatly simplifies the development of programs with complex control flow. It performs an automatic source code transformation that replaces conditionals and loops with array-compatible variants (`drjit.while_loop`, `drjit.if_stmt`). The transformation leaves everything else as-is, including line number information that is relevant for debugging.
- Differentiable control flow: symbolic control flow constructs (loops) previously failed with an error message when they detected differentiable variables. In the new version of Dr.Jit, symbolic operations (loops, function calls, and conditionals) are now differentiable in both forward and reverse modes, with one exception: the reverse-mode derivative of loops is still incomplete and will be added in the next version of Dr.Jit.
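The element-wise semantics of the `ipow` example above (each entry has its own branch outcome and its own iteration count) can be emulated in plain Python with an explicit activity mask. This is a conceptual model of what a vectorized loop has to compute, not a description of how Dr.Jit compiles it; the function name `ipow_masked` is made up for this sketch.

```python
def ipow_masked(x, n):
    """Element-wise integer power using one shared loop and per-lane masks."""
    x = list(x)
    n = list(n)
    result = [1.0] * len(x)
    active = [n_i != 0 for n_i in n]  # loop condition, evaluated per lane

    while any(active):                # run until every lane has finished
        for i in range(len(x)):
            if not active[i]:
                continue              # masked-out lane: its state stays frozen
            if n[i] & 1:              # per-lane conditional
                result[i] *= x[i]
            x[i] *= x[i]
            n[i] >>= 1
            active[i] = n[i] != 0
    return result

print(ipow_masked([2.0, 3.0, 5.0], [10, 0, 3]))
```

The lane with exponent 0 never enters the loop body, while the lane with exponent 10 needs four iterations; the shared loop runs as long as any lane is still active.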
- Documentation: every Dr.Jit function now comes with extensive reference documentation that clearly specifies its behavior and accepted inputs. The behavior with respect to tensors and arbitrary object graphs (referred to as "PyTrees") was made consistent.
- Half-precision arithmetic: Dr.Jit now provides `float16`-valued arrays and tensors on both the LLVM and CUDA backends (e.g., `drjit.cuda.ad.TensorXf16` or `drjit.llvm.Float16`).
- Mixed-precision optimization: Dr.Jit now maintains one global AD graph for all variables, enabling differentiation of computation combining single-, double-, and half-precision variables. Previously, there was a separate graph per type, and gradients did not propagate through casts between them.
- Multi-framework computations: The `@drjit.wrap` decorator provides a differentiable bridge to other AD frameworks. In this new release of Dr.Jit, its capabilities were significantly revamped. Besides PyTorch, it now also supports JAX, and it consistently handles both forward and backward derivatives. The new interface admits functions with arbitrary fixed/variable-length positional and keyword arguments containing arbitrary PyTrees of differentiable and non-differentiable arrays, tensors, etc.
- Debug mode: A new debug validation mode (`drjit.JitFlag.Debug`) inserts a number of additional checks to identify sources of undefined behavior. Enable it to catch out-of-bounds reads, writes, and calls to undefined callables. Such operations will trigger a warning that includes the responsible source code location.

  The following built-in assertion checks are also active in debug mode. They support both regular and symbolic inputs in a consistent fashion.
  - `drjit.assert_true`
  - `drjit.assert_false`
  - `drjit.assert_equal`
- Symbolic print statement: A new high-level symbolic print operation `drjit.print` enables deferred printing from any symbolic context (i.e., within symbolic loops, conditionals, and function calls). It is compatible with Jupyter notebooks and displays arbitrary PyTrees in a structured manner. This operation replaces the function `drjit.print_async()` provided in previous releases.
- Swizzling: swizzle access and assignment operators are now provided. You can use them to arbitrarily reorder, grow, or shrink the input array.

  ```python
  a = Array4f(...)
  b = Array2f(...)
  a.xyw = a.xzy + b.xyx
  ```
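The swizzle semantics in the snippet above can be mimicked with a toy class built on `__getattr__` and `__setattr__`. The `Vec` class below is a hypothetical stand-in written for this illustration, not a Dr.Jit type, and it works on plain Python numbers.

```python
class Vec:
    """Toy vector with x/y/z/w swizzle access and assignment."""
    _AXES = "xyzw"

    def __init__(self, *values):
        # Bypass the swizzle-aware __setattr__ for the storage attribute
        object.__setattr__(self, "values", list(values))

    def _indices(self, name):
        """Translate a swizzle string such as 'xzy' into component indices."""
        if not name or any(ch not in self._AXES for ch in name):
            raise AttributeError(name)
        indices = [self._AXES.index(ch) for ch in name]
        if max(indices) >= len(self.values):
            raise AttributeError(name)
        return indices

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, e.g. 'a.xzy'
        return Vec(*(self.values[i] for i in self._indices(name)))

    def __setattr__(self, name, other):
        indices = self._indices(name)
        if len(other.values) != len(indices):
            raise ValueError("swizzle assignment size mismatch")
        for i, value in zip(indices, other.values):
            self.values[i] = value

    def __add__(self, other):
        if len(other.values) != len(self.values):
            raise ValueError("operand size mismatch")
        return Vec(*(p + q for p, q in zip(self.values, other.values)))

a = Vec(1, 2, 3, 4)
b = Vec(10, 20)
a.xyw = a.xzy + b.xyx  # reorder 'a', grow 'b' to 3 components, write x/y/w
print(a.values)
```

Here `a.xzy` reorders components, `b.xyx` grows a 2-vector to three components, and the assignment writes only the `x`, `y`, and `w` entries of `a`, leaving `z` untouched.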
- Scatter-reductions: the performance of atomic scatter-reductions (`drjit.scatter_reduce`, `drjit.scatter_add`) has been significantly improved. Both functions now provide a `mode=` parameter to select between different implementation strategies. The new strategy `drjit.ReduceMode.Expand` offers a speedup of over 10× on the LLVM backend compared to the previously used local reduction strategy. Furthermore, improved code generation for `drjit.ReduceMode.Local` brings a roughly 20-40% speedup on the CUDA backend. See the documentation section on atomic reductions for details and benchmarks with plots.
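One way to avoid contention in a scatter-addition is to give every worker a private copy of the target and merge the copies afterwards. The plain-Python sketch below contrasts that idea with direct accumulation. It assumes that an expansion-based strategy works roughly along these lines, which is an interpretation of the name and not a statement about Dr.Jit's implementation; both function names are hypothetical.

```python
def scatter_add_direct(target_size, indices, values):
    """Reference: every update goes straight into the shared target."""
    target = [0.0] * target_size
    for i, v in zip(indices, values):
        target[i] += v  # with parallel workers, this update must be atomic
    return target

def scatter_add_expand(target_size, indices, values, num_workers=4):
    """Each worker accumulates into a private copy; copies are summed last."""
    copies = [[0.0] * target_size for _ in range(num_workers)]
    for k, (i, v) in enumerate(zip(indices, values)):
        copies[k % num_workers][i] += v  # private copy: no write conflicts
    return [sum(column) for column in zip(*copies)]

indices = [0, 2, 2, 1, 0, 2, 3, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(scatter_add_direct(4, indices, values))
print(scatter_add_expand(4, indices, values))
```

The trade-off is memory: the private copies cost `num_workers` times the size of the target, so the approach pays off mainly when many updates collide on few target entries.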
- Packet memory operations: programs often gather or scatter several memory locations that are directly next to each other in memory. In principle, it should be possible to do such reads or writes more efficiently.

  Dr.Jit now features improved code generation to realize this optimization for calls to `dr.gather()` and `dr.scatter()` that access a power-of-two-sized chunk of contiguous array elements. On the CUDA backend, this operation leverages native packet memory instructions, which can produce small speedups on the order of ~5-30%. On the LLVM backend, packet loads/stores now compile to aligned packet loads/stores with a transpose operation that brings data into the right shape. Speedups here are dramatic (up to >20× for scatters, 1.5 to 2× for gathers). See the `drjit.JitFlag.PacketOps` flag for details. On the LLVM backend, packet scatter-additions furthermore compose with the `drjit.ReduceMode.Expand` optimization explained in the last point, which combines the benefits of both steps. This is particularly useful when computing the reverse-mode derivative of packet reads.
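The "contiguous load followed by a transpose" idea can be sketched in plain Python. Both helpers below are hypothetical and operate on lists; they only show that reading one contiguous chunk per index and transposing the result yields the same per-component arrays as several independent strided gathers.

```python
def gather_strided(source, index, width):
    """Baseline: `width` independent gathers, one per component."""
    return [[source[i * width + k] for i in index] for k in range(width)]

def gather_packet(source, index, width):
    """Packet variant: one contiguous read per index, then a transpose."""
    packets = [source[i * width:(i + 1) * width] for i in index]
    return [list(column) for column in zip(*packets)]

source = list(range(16))  # four packets of four consecutive values
index = [2, 0, 3]
print(gather_packet(source, index, width=4))
```

Each output list holds one component for all requested indices, which is the layout that vectorized code consumes after the transpose.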
- Reductions: reduction operations previously existed as regular (e.g., `drjit.all`) and nested (e.g., `drjit.all_nested`) variants. Both are now subsumed by an optional `axis` argument similar to how this works in other array programming frameworks like NumPy. Reductions can now also process any number of axes on both regular Dr.Jit arrays and tensors.

  The reduction functions (`drjit.all`, `drjit.any`, `drjit.sum`, `drjit.prod`, `drjit.min`, `drjit.max`) have different default axis values depending on the input type. For tensors, `axis=None` by default and the reduction is performed along the entire underlying array recursively, analogous to the previous nested reduction. For all other types, the reduction is performed over the outermost axis (`axis=0`) by default.

  Aliases for the `_nested` function variants still exist to help porting but are deprecated and will be removed in a future release.
- Prefix reductions: the functions `drjit.cumsum` and `drjit.prefix_sum` compute inclusive or exclusive prefix sums along arbitrary axes of a tensor or array. They wrap the more general `drjit.prefix_reduce`, which also supports other arithmetic operations (e.g., minimum/maximum/product/and/or reductions), reverse reductions, etc.
- Block reductions: the new functions `drjit.block_reduce` and `drjit.block_prefix_reduce` compute reductions within contiguous blocks of an array.
- Local memory: kernels can now allocate temporary thread-local memory and perform arbitrary indexed reads and writes. This is useful to implement a stack or other types of scratch space that might be needed by a calculation. See the separate documentation section about local memory for details.
- DDA: a newly added digital differential analyzer (`drjit.dda.dda`) can be used to traverse the intersection of a ray segment and an n-dimensional grid. The function `drjit.dda.integrate()` builds on this functionality to compute analytic differentiable line integrals of bi- and trilinear interpolants.
- Loop compression: the implementation of evaluated loops (previously referred to as wavefront mode) visits all entries of the loop state variables at every iteration, even when most of them have already finished executing the loop. Dr.Jit now provides an optional `compress=True` parameter in `drjit.while_loop` to prune away inactive entries and accelerate later loop iterations.
- The new release has a strong focus on error resilience and leak avoidance. Exceptions raised in custom operations, function dispatch, symbolic loops, etc., should not cause failures or leaks. Both Dr.Jit and nanobind are very noisy if they detect that objects are still alive when the Python interpreter shuts down.
- Terminology cleanup: Dr.Jit has two main...