The LLM prompts (and their responses) are shown in the comments below, although the prompts are still changing.

System Prompt (excerpt):

You are an expert GPU kernel autotuner for Helion/Triton kernels. Use the provided Configuration Space and Default Configuration as the source of truth for: …
General heuristics: …
Output contract: …

User Prompt (excerpt):

Kernel Source Code:

```python
def attention(
    q_in: torch.Tensor,
    k_in: torch.Tensor,
    v_in: torch.Tensor,
) -> torch.Tensor:
    """
    Computes scaled dot-product attention.

    Implements the attention mechanism: Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

    Args:
        q_in: Query tensor of shape [..., seq_len_q, head_dim]
        k_in: Key tensor of shape [..., seq_len_k, head_dim]
        v_in: Value tensor of shape [..., seq_len_k, head_dim]

    Returns:
        Output tensor of shape [..., seq_len_q, head_dim]
    """
    m_dim = q_in.size(-2)
    n_dim = k_in.size(-2)
    assert n_dim == v_in.size(-2)
    head_dim = hl.specialize(q_in.size(-1))
    assert head_dim == k_in.size(-1) == v_in.size(-1)
    q_view = q_in.reshape([-1, m_dim, head_dim])
    v_view = v_in.reshape([-1, n_dim, head_dim])
    k_view = k_in.reshape([-1, n_dim, head_dim]).transpose(1, 2)
    out = torch.empty_like(q_view)
    sm_scale = 1.0 / math.sqrt(head_dim)
    qk_scale = sm_scale * 1.44269504  # 1/log(2)
    for tile_b, tile_m in hl.tile([q_view.size(0), m_dim]):
        m_i = hl.full([tile_b, tile_m], float("-inf"), dtype=torch.float32)
        l_i = torch.full_like(m_i, 1.0)
        acc = hl.zeros([tile_b, tile_m, head_dim], dtype=torch.float32)
        q = q_view[tile_b, tile_m, :]
        for tile_n in hl.tile(v_view.size(1)):
            k = k_view[tile_b, :, tile_n]
            qk = torch.bmm(q, k)
            m_ij = torch.maximum(m_i, torch.amax(qk, -1) * qk_scale)
            qk = qk * qk_scale - m_ij[:, :, None]
            p = torch.exp2(qk)
            l_ij = torch.sum(p, -1)
            alpha = torch.exp2(m_i - m_ij)
            l_i = l_i * alpha + l_ij
            acc = acc * alpha[:, :, None]
            v = v_view[tile_b, tile_n, :]
            p = p.to(v.dtype)
            acc = torch.baddbmm(acc, p, v)
            m_i = m_ij
        m_i += torch.log2(l_i)
        acc = acc / l_i[:, :, None]
        out[tile_b, tile_m, :] = acc.to(out.dtype)
    return out.view(q_in.size())
```

Input Tensors: …
GPU Hardware: …
Configuration Space: … power_of_2(min=16, max=1024, default=16)] …
Default Configuration: {"block_sizes":[1,16,16],"indexing":["pointer","pointer","pointer","pointer"],"l2_groupings": …
Search Strategy: … reuse, accumulation, memory-vs-compute, and scheduling traits from the code itself and target hardware, …
Task: Suggest the first batch of configs. Include both near-default and exploratory candidates. Return minified …
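The kernel above keeps a running row maximum (`m_i`), normalizer (`l_i`), and un-normalized accumulator (`acc`) so softmax can be computed in one streaming pass over K/V blocks. As a sanity check, the same online-softmax recurrence can be sketched in plain PyTorch, without the Helion tiling machinery (the function name `attention_reference` and the `block_n` parameter are hypothetical, not part of the PR):

```python
import math

import torch


def attention_reference(q, k, v, block_n=16):
    """Plain-PyTorch sketch of the kernel's streaming softmax: process K/V
    in blocks of block_n, rescaling the running sums whenever the row max
    grows, then normalize once at the end."""
    b, m, d = q.shape
    n = k.shape[1]
    sm_scale = 1.0 / math.sqrt(d)
    qk_scale = sm_scale * 1.44269504  # fold in log2(e) so exp2 can be used
    m_i = torch.full((b, m), float("-inf"))
    l_i = torch.ones(b, m)  # the spurious initial 1 is zeroed by alpha below
    acc = torch.zeros(b, m, d)
    for start in range(0, n, block_n):
        kb = k[:, start:start + block_n, :]
        vb = v[:, start:start + block_n, :]
        qk = torch.bmm(q, kb.transpose(1, 2))
        m_ij = torch.maximum(m_i, qk.amax(-1) * qk_scale)  # new running max
        p = torch.exp2(qk * qk_scale - m_ij[:, :, None])
        alpha = torch.exp2(m_i - m_ij)  # rescale old sums to the new max
        l_i = l_i * alpha + p.sum(-1)
        acc = acc * alpha[:, :, None] + torch.bmm(p, vb)
        m_i = m_ij
    return acc / l_i[:, :, None]
```

Because both `p` and `l_i` carry the same `m_ij` shift, the final division cancels it, so the result matches a naive `softmax(QK^T / sqrt(d)) @ V`.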
The first LLM response:

Round 0 Response { …
The next iteration of the LLM prompt:

Round 1 Refinement Prompt

Search State:
- Anchor Configs: … "persistent_blocked"}
- Results (best first): … "tensor_descriptor"], "num_warps": 8}
- Top Config Patterns: …
- Failed Config Patterns: … ["", "last", "last"], "loop_orders": [[1, 0]], "num_stages": 8, "num_warps": 16, "range_flattens": [null, false], …
- Next Step: … instead of rewriting every field.

Task: Suggest up to 15 NEW UNIQUE configs around the anchors above. Avoid the failed patterns above and favor targeted edits with …
stack-info: PR: #2003, branch: choijon5/stack/3
Test failures are unrelated.
Stacked PRs:
[Autotuner] Adding LLM-guided search
Prompting an LLM (GPT-5.2) to get the configs:

Rn indicates the number of rounds of LLM prompting.
Rn_speedup is the speedup of the nth LLM round's best config vs. LFBO full autotuning.
Rn_time/LFBO compares the wall-clock time of the nth LLM round vs. LFBO full autotuning.
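To make the two metrics concrete, here is a tiny worked example with made-up timings (the numbers are illustrative only, not results from this PR):

```python
# Hypothetical timings: best kernel latency and total tuning wall-clock time.
lfbo_best_ms, lfbo_wall_s = 0.80, 600.0  # LFBO full autotuning
r4_best_ms, r4_wall_s = 0.82, 150.0      # after 4 rounds of LLM prompting

r4_speedup = lfbo_best_ms / r4_best_ms   # R4_speedup: >1 means the LLM config runs faster
r4_time_ratio = r4_wall_s / lfbo_wall_s  # R4_time/LFBO: fraction of LFBO wall-clock time
print(f"R4_speedup={r4_speedup:.2f}, R4_time/LFBO={r4_time_ratio:.0%}")
```

With these numbers the LLM-found config is 2% slower (R4_speedup = 0.98) but was found in 25% of the LFBO tuning time.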
For simple kernels (matmul, rms_norm), the LLM is able to one-shot a config with performance on par with LFBO full autotuning at a fraction (10s, 10-20%) of the LFBO full autotuning time.
By the 4th round, most kernels get close to LFBO performance; 4 LLM rounds take 20-30% of the LFBO full autotuning time.
There are cases where performance does not improve after more LLM rounds.
Opus 4.6 results - rms_norm 1.57x speedup @ 32% tuning time.

High-level flow:
- Serialize the kernel source, hardware info, configuration space, and default config so the first LLM call sees both the workload description and the available tuning knobs.
- Benchmark the default config plus a few random seed configs while that request is in flight.
- Benchmark the LLM-suggested configs and fold those results into the running set of top configs.
- Periodically rebenchmark the top configs so each later LLM round sees the latest stabilized timings instead of only one-shot measurements.
- Each subsequent round: summarize the search state, query the LLM, benchmark new configs, then rebenchmark the strongest configs.
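The round structure above can be sketched as a toy loop. Here `query_llm` and `benchmark` are stand-ins for the real prompt/benchmark machinery in the PR, and the config space is reduced to a single hypothetical `block_size` knob:

```python
import random


def query_llm(search_state):
    # Stand-in for the real LLM call: a real implementation would serialize
    # the search state into a refinement prompt and parse the returned configs.
    return [{"block_size": random.choice([16, 32, 64])} for _ in range(3)]


def benchmark(config):
    # Stand-in timing model; the real autotuner times compiled kernels.
    return 1.0 / config["block_size"]


def llm_guided_search(default_config, rounds=4, top_k=3):
    results = {}  # config (as a sorted item tuple) -> best observed time

    # Round 0: default config plus the first LLM batch as seeds.
    for cfg in [default_config] + query_llm({}):
        results[tuple(sorted(cfg.items()))] = benchmark(cfg)

    for _ in range(rounds):
        # Summarize the search state (top configs, best first) for the LLM.
        state = sorted(results.items(), key=lambda kv: kv[1])[:top_k]
        # Benchmark the newly suggested configs.
        for cfg in query_llm(state):
            key = tuple(sorted(cfg.items()))
            results[key] = min(results.get(key, float("inf")), benchmark(cfg))
        # Rebenchmark the strongest configs so later rounds see stabilized
        # timings instead of one-shot measurements.
        for key, _ in sorted(results.items(), key=lambda kv: kv[1])[:top_k]:
            results[key] = min(results[key], benchmark(dict(key)))

    best_key, best_time = min(results.items(), key=lambda kv: kv[1])
    return dict(best_key), best_time
```

The key design point mirrored here is that timings are folded into a running `results` set, so each refinement prompt is built from the best stabilized measurements rather than from a single noisy sample.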
How to run it:

Specify the autotuner, model, and API keys:

```shell
export HELION_AUTOTUNER=LLMGuidedSearch   # or LLMSeededLFBOTreeSearch (next PR)
export HELION_LLM_MODEL=gpt-5-2           # or claude-...
export OPENAI_API_KEY=...                 # or export ANTHROPIC_API_KEY=...
```

If not using the API keys directly (e.g. on devservers), you also need:

```shell
export HELION_LLM_API_BASE=...
export HELION_LLM_CA_BUNDLE=...
export HELION_LLM_CLIENT_CERT=...
export HELION_LLM_CLIENT_KEY=...
```