55 commits
ebd152d
modular optimization paths - init
ben-schwen Oct 28, 2025
71b21ab
make linter happy
ben-schwen Oct 29, 2025
8a9e727
move tests
ben-schwen Oct 30, 2025
04e5782
add lapply(list(col1, col2, ...), fun) pattern
ben-schwen Oct 30, 2025
a8dde19
turn on optimization
ben-schwen Oct 31, 2025
67f2874
add type conversion support to GForce
ben-schwen Nov 1, 2025
2876ebe
remove stale branch
ben-schwen Nov 1, 2025
c445c38
add tests
ben-schwen Nov 2, 2025
5410e31
update man
ben-schwen Nov 2, 2025
dece1c6
merge tests
ben-schwen Nov 2, 2025
5e1789d
polish test fun
ben-schwen Nov 2, 2025
62f1c48
add arithmetic
ben-schwen Nov 2, 2025
c47ec27
add AST walker and update tests
ben-schwen Nov 2, 2025
1d324d6
add tests
ben-schwen Nov 2, 2025
6b54c1e
Merge branch 'master' into modular_gforce
ben-schwen Nov 2, 2025
22cf35e
add NEWS
ben-schwen Nov 2, 2025
25a7e2e
make function name in massageSD more expressive
ben-schwen Nov 3, 2025
eb8056c
rename levels argument to optimization
ben-schwen Nov 3, 2025
4544398
update docs
ben-schwen Nov 3, 2025
d40edb8
restore test nums
ben-schwen Nov 3, 2025
5e7efb7
remove double tests
ben-schwen Nov 3, 2025
3826927
simplify tests
ben-schwen Nov 3, 2025
982343f
phrasing
ben-schwen Nov 4, 2025
996b28c
Merge remote-tracking branch 'refs/remotes/origin/modular_gforce' int…
ben-schwen Nov 4, 2025
1e6ad03
use mget for all vector params
ben-schwen Nov 4, 2025
9e1297e
rename optimization parameter
ben-schwen Nov 4, 2025
f6981d6
rename optimization parameter also in test
ben-schwen Nov 4, 2025
9fc4734
add optimize param checks
ben-schwen Nov 4, 2025
6aaea51
Merge branch 'master' into modular_gforce
ben-schwen Nov 4, 2025
c07999a
remove trailing ws
ben-schwen Nov 4, 2025
6914818
Merge branch 'master' into modular_gforce
ben-schwen Dec 15, 2025
6c7e368
Update man/test.Rd
ben-schwen Dec 15, 2025
08c9524
Merge branch 'master' into modular_gforce
ben-schwen Jan 5, 2026
047f6be
readd context
ben-schwen Jan 5, 2026
5a7a9a3
Update NEWS.md
MichaelChirico Jan 7, 2026
b495503
revert spurious diff
MichaelChirico Jan 7, 2026
03bcdd8
?
MichaelChirico Jan 7, 2026
fe525bf
add space
ben-schwen Jan 7, 2026
71b9838
reference deletion of tests
ben-schwen Jan 7, 2026
494cfe2
reference deletion of tests2
ben-schwen Jan 7, 2026
6f42ff5
add comment about removed tests
ben-schwen Jan 7, 2026
ac306eb
add comment about optimization level comparison
ben-schwen Jan 7, 2026
431dfc2
add comment about removed test
ben-schwen Jan 7, 2026
158136b
fix typo
ben-schwen Jan 7, 2026
0c2f61f
remove doubled test
ben-schwen Jan 7, 2026
2c7ebaf
add comment
ben-schwen Jan 7, 2026
371e246
update subsuming comments
ben-schwen Jan 7, 2026
e2694e1
add subsuming comments
ben-schwen Jan 7, 2026
da771d4
finish double checking of moving tests
ben-schwen Jan 7, 2026
af15282
make optimize more robust
ben-schwen Jan 7, 2026
b61f280
add comment about removing tests in benchmark.Rraw
ben-schwen Jan 7, 2026
d8e34d3
be clearer in NEWS
ben-schwen Jan 7, 2026
c5fb65a
add nocovs for errors
ben-schwen Jan 7, 2026
9f0e5cf
add unwrapper for conversions
ben-schwen Jan 7, 2026
8129198
add more tests
ben-schwen Jan 7, 2026
8 changes: 8 additions & 0 deletions NEWS.md
@@ -16,6 +16,14 @@

1. `nafill()`, `setnafill()` extended to work on logical vectors (part of [#3992](https://github.com/Rdatatable/data.table/issues/3992)). Thanks @jangorecki for the request and @MichaelChirico for the PR.

2. GForce and lapply optimization detection has been refactored to use modular optimization paths and an AST (Abstract Syntax Tree) walker for improved maintainability and extensibility. The new architecture separates optimization detection into distinct, composable phases. This makes future optimization enhancements a lot easier. Thanks to @grantmcdermott, @jangorecki, @MichaelChirico, and @HughParsonage for the suggestions and @ben-schwen for the implementation.

This rewrite also introduces several new optimizations:
- Enables Map in addition to lapply optimizations (e.g., `Map(fun, .SD)` -> `list(fun(col1), fun(col2), ...)`) [#5336](https://github.com/Rdatatable/data.table/issues/5336)
- lapply optimization works without .SD (e.g., `lapply(list(col1, col2), fun)` -> `list(fun(col1), fun(col2))`) [#5032](https://github.com/Rdatatable/data.table/issues/5032)
- Type conversion support in GForce expressions (e.g., `sum(as.numeric(x))` will use GForce, saving the need to coerce `x` in a setup step) [#2934](https://github.com/Rdatatable/data.table/issues/2934)
- Arithmetic operation support in GForce (e.g., `max(x) - min(x)` will use GForce on both `max(x)` and `min(x)`, saving the need to do the subtraction in a follow-up step) [#3815](https://github.com/Rdatatable/data.table/issues/3815)
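
The four patterns listed above can be tried on a toy table. This is only a sketch: the table and its column names (`g`, `x`, `y`) are invented for illustration, and whether each query is actually routed through the optimized path depends on running a data.table build that includes this change. The results themselves should be the same either way.

```r
library(data.table)

# Toy data; the column names are assumptions made for this example.
DT = data.table(g = c("a", "a", "b", "b"),
                x = c(1L, 4L, 2L, 8L),
                y = c(10, 20, 30, 40))

# Map over .SD, the counterpart of lapply(.SD, fun)
r_map = DT[, Map(sum, .SD), by = g, .SDcols = c("x", "y")]

# lapply over an explicit list of columns, with no .SD involved
r_list = DT[, lapply(list(x, y), sum), by = g]

# a type conversion wrapped inside an aggregate
r_conv = DT[, .(s = sum(as.numeric(x))), by = g]

# arithmetic combining two aggregates
r_rng = DT[, .(rng = max(x) - min(x)), by = g]
```

Adding `verbose=TRUE` to any of these calls prints data.table's optimization messages, which is one way to check which path a given version takes.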

### Notes

1. {data.table} now depends on R 3.5.0 (2018).
707 changes: 441 additions & 266 deletions R/data.table.R

Large diffs are not rendered by default.

33 changes: 31 additions & 2 deletions R/test.data.table.R
@@ -369,11 +369,40 @@ gc_mem = function() {
m
# nocov end
}

test = function(num, x, y=TRUE,
error=NULL, warning=NULL, message=NULL, output=NULL, notOutput=NULL, ignore.warning=NULL,
options=NULL, env=NULL,
-context=NULL) {
+context=NULL, optimize=NULL) {
# if optimization is provided, test across multiple optimization levels
if (!is.null(optimize)) {
if (!is.numeric(optimize) || length(optimize) < 1L || anyNA(optimize) || any(optimize < 0L))
stopf("optimize must be numeric, length >= 1, non-NA, and >= 0; got: %s", optimize) # nocov
cl = match.call()
if ("datatable.optimize" %in% names(cl$options))
stopf("Trying to set optimization level through both options= and optimize=") # nocov
cl$optimize = NULL # Remove optimization levels from the recursive call

# Check if y was explicitly provided (not just the default)
y_provided = !missing(y)
vector_params = mget(c("error", "warning", "message", "output", "notOutput", "ignore.warning"), environment())
compare = !y_provided && length(optimize)>1L && !any(lengths(vector_params))
Comment on lines +387 to +388
Review comment from @MichaelChirico (Member), Jan 7, 2026:

I think we can just drop missing ones up front right? and then simplify below

Suggested change
-vector_params = mget(c("error", "warning", "message", "output", "notOutput", "ignore.warning"), environment())
-compare = !y_provided && length(optimize)>1L && !any(lengths(vector_params))
+vector_params = mget(c("error", "warning", "message", "output", "notOutput", "ignore.warning"), environment())
+vector_params = vector_params[lengths(vector_params) > 0L]
+compare = !y_provided && length(optimize)>1L && !length(vector_params)
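
To make the suggested simplification concrete, here is a minimal base-R sketch. `should_compare()` is a hypothetical stand-in for the argument handling inside `test()`, trimmed to three of the six parameters; it is not the PR's code. Arguments left at `NULL` are dropped up front, so the comparison mode only has to check whether anything remains.

```r
# Hypothetical helper mirroring the suggested logic; not part of data.table.
should_compare = function(y = TRUE, error = NULL, warning = NULL, output = NULL,
                          optimize = NULL) {
  y_provided = !missing(y)
  # collect the expectation arguments by name from the function's own frame
  params = mget(c("error", "warning", "output"), envir = environment())
  # drop the ones left at their NULL default
  params = params[lengths(params) > 0L]
  # compare results across levels only when nothing else was specified
  !y_provided && length(optimize) > 1L && !length(params)
}

should_compare(optimize = c(0L, 2L))                 # TRUE
should_compare(y = 1, optimize = c(0L, 2L))          # FALSE: y given explicitly
should_compare(warning = "w", optimize = c(0L, 2L))  # FALSE: an expectation is set
should_compare(optimize = 2L)                        # FALSE: only one level
```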


for (i in seq_along(optimize)) {
cl$num = num + (i - 1L) * 1e-6
opt_level = list(datatable.optimize = optimize[i])
cl$options = if (!is.null(options)) c(as.list(options), opt_level) else opt_level
for (param in names(vector_params)) {
val = vector_params[[param]]
if (length(val) > 0L) {
cl[[param]] = val[((i - 1L) %% length(val)) + 1L] # cycle through values if fewer than optimization levels
Review comment (Member):

Is the idea to allow any type of recycling? Or should we enforce that everything either has length==1 or length==length(optimize)?

I guess the main concern is for warning=, which already naturally supports warning=c("msg1", "msg2") to capture multiple emitted warnings
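
The two behaviours raised in this comment can be compared in a small base-R sketch. `cycle_pick()` reproduces the modular index used in the diff; `strict_ok()` is a hypothetical check for the stricter rule (length 1 or `length(optimize)`) and is not part of the PR.

```r
# Same index arithmetic as in the loop: wrap around when val is shorter.
cycle_pick = function(val, i) val[((i - 1L) %% length(val)) + 1L]

optimize = c(0L, 1L, 2L)
warn = c("w1", "w2")

# Free recycling: level 3 silently falls back to the first value.
picked = vapply(seq_along(optimize), function(i) cycle_pick(warn, i), character(1L))
picked  # "w1" "w2" "w1"

# Hypothetical stricter rule: allow only length 1 or one value per level.
strict_ok = function(val, n) length(val) %in% c(1L, n)
strict_ok(warn, length(optimize))            # FALSE
strict_ok("w1", length(optimize))            # TRUE
strict_ok(c("a", "b", "c"), length(optimize)) # TRUE
```

Neither variant resolves the ambiguity noted for `warning=`, where a length-2 vector may mean two warnings from a single call rather than one warning per level.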

}
Comment on lines +396 to +398
Review comment (Member):

Suggested change
-if (length(val) > 0L) {
-cl[[param]] = val[((i - 1L) %% length(val)) + 1L] # cycle through values if fewer than optimization levels
-}
+cl[[param]] = val[((i - 1L) %% length(val)) + 1L] # cycle through values if fewer than optimization levels

}

if (compare && i == 1L) cl$y = eval(cl$x, parent.frame())
eval(cl, parent.frame()) # actual test call
}
return(invisible())
}
if (!is.null(env)) {
old = Sys.getenv(names(env), names=TRUE, unset=NA)
to_unset = !lengths(env)
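
The loop in this hunk derives one sub-test per optimization level. The base-R sketch below reproduces only the numbering and the option merging from the diff, outside of `test()`; the `datatable.verbose` entry is an invented example of a caller-supplied option.

```r
num = 637.1
optimize = c(0L, 2L)
options_arg = list(datatable.verbose = FALSE)  # assumed caller-supplied options

# Each level gets its own test number, offset by 1e-6 per level.
nums = num + (seq_along(optimize) - 1L) * 1e-6
labels = sprintf("%.6f", nums)
labels  # "637.100000" "637.100001"

# Each level's options are the caller's options plus datatable.optimize.
per_level = lapply(optimize, function(lvl) {
  opt_level = list(datatable.optimize = lvl)
  if (!is.null(options_arg)) c(as.list(options_arg), opt_level) else opt_level
})
names(per_level[[2L]])  # "datatable.verbose" "datatable.optimize"
```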
26 changes: 8 additions & 18 deletions inst/tests/benchmark.Rraw
@@ -190,24 +190,14 @@ DT = data.table(A=1:10,B=rnorm(10),C=paste("a",1:100010,sep=""))
test(301.1, nrow(DT[,sum(B),by=C])==100010)

# Test := by key, and that := to the key by key unsets the key. Make it non-trivial in size too.
-local({
-old = options(datatable.optimize=0L); on.exit(options(old))
-set.seed(1)
-DT = data.table(a=sample(1:100, 1e6, replace=TRUE), b=sample(1:1000, 1e6, replace=TRUE), key="a")
-test(637.1, DT[, m:=sum(b), by=a][1:3], data.table(a=1L, b=c(156L, 808L, 848L), m=DT[J(1), sum(b)], key="a"))
-test(637.2, key(DT[J(43L), a:=99L]), NULL)
-setkey(DT, a)
-test(637.3, key(DT[, a:=99L, by=a]), NULL)
-})
-local({
-options(datatable.optimize=2L); on.exit(options(old))
-set.seed(1)
-DT = data.table(a=sample(1:100, 1e6, replace=TRUE), b=sample(1:1000, 1e6, replace=TRUE), key="a")
-test(638.1, DT[, m:=sum(b), by=a][1:3], data.table(a=1L, b=c(156L, 808L, 848L), m=DT[J(1), sum(b)], key="a"))
-test(638.2, key(DT[J(43L), a:=99L]), NULL)
-setkey(DT,a)
-test(638.3, key(DT[, a:=99L, by=a]), NULL)
-})
+set.seed(1)
+DT = data.table(a=sample(1:100, 1e6, replace=TRUE), b=sample(1:1000, 1e6, replace=TRUE), key="a")
+opt = c(0L,2L)
+test(637.1, optimize=opt, copy(DT)[, m:=sum(b), by=a][1:3], data.table(a=1L, b=c(156L, 808L, 848L), m=DT[J(1), sum(b)], key="a"))
+test(637.2, optimize=opt, key(copy(DT)[J(43L), a:=99L]), NULL)
+setkey(DT, a)
+test(637.3, optimize=opt, key(copy(DT)[, a:=99L, by=a]), NULL)
+# test 637 subsumes 637 and 638 for different optimization levels
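
The rewritten tests wrap `DT` in `copy()`. The diff does not state the reason, but a plausible one is that `optimize=` evaluates the same expression once per level, and `:=` modifies its table by reference. The sketch below, on invented toy data, shows the difference.

```r
library(data.table)

# Without copy(): := adds column m to DT itself, so a second evaluation
# of the same expression would start from already-modified data.
DT = data.table(a = c(1L, 1L, 2L), b = 1:3)
invisible(DT[, m := sum(b), by = a])
mutated = "m" %in% names(DT)      # TRUE

# With copy(): the original table is left untouched between evaluations.
DT2 = data.table(a = c(1L, 1L, 2L), b = 1:3)
res = copy(DT2)[, m := sum(b), by = a]
untouched = !("m" %in% names(DT2))  # TRUE
```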

# Test X[Y] slowdown, #2216
# Many minutes in 1.8.2! Now well under 1s, but 10s for very wide tolerance for CRAN. We'd like CRAN to tell us if any changes