JustAResearcher/Latency-Based-GPU-Algorithm

GPUx — ASIC-resistant PoW for GPUs

  • Status: v0.1.5 community testing
  • Target: Replacement for Cuckaroo29 (C29) in Tari (XTM)
  • Goal: GPU-native, ASIC-resistant proof-of-work; low power; cheap verifier.


What this is

GPUx is a candidate proof-of-work algorithm designed to make GPU mining durable against ASIC takeover. It combines random per-epoch programs, a 2 GiB random-access DAG, and a per-thread scratchpad to force any would-be ASIC into looking like a GPU — at which point the ASIC has no cost advantage.

Three artifacts in this repo:

  1. Algorithm spec (ALGORITHM_SPEC.md) — formal definition.
  2. Reference C implementation (spec/) — the authoritative semantics.
  3. CUDA implementation + bench harness (cuda/, bench/) — what community testers run on their GPUs.

If you are a community tester, jump to COMMUNITY_TESTING.md.

If you are reviewing the algorithm, start with ALGORITHM_SPEC.md and then docs/DESIGN_RATIONALE.md.


Quick numbers (RTX 5090, unoptimized v0.1 kernel)

| Metric | Value |
|---|---|
| Hashrate | ~1.25 MH/s |
| DAG generation | 2 GiB in ~30 ms (~65 GB/s) |
| Per-share verify | ~0.5 ms (warm DAG) |
| GPU vs reference | bit-identical (5/5 KAT nonces) |

These are baseline numbers from a reference port. Optimized kernels (warp-cooperative DAG access, shared-memory scratchpad, instruction reordering) are expected to multiply throughput 2–5× without changing consensus.
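
As a rough worked example of what the baseline hashrate implies, the sketch below multiplies out the figures quoted in this README (256 ops × 64 iterations per nonce, ~1.25 MH/s, and the projected 2–5× speed-up). The function names are hypothetical helpers for illustration and are not part of the reference implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Ops executed per nonce: program length times loop iterations. */
uint64_t ops_per_nonce(uint64_t ops_per_program, uint64_t iterations) {
    /* Guard against 64-bit overflow in the product. */
    assert(ops_per_program == 0 || iterations <= UINT64_MAX / ops_per_program);
    return ops_per_program * iterations;
}

/* Aggregate program ops per second at a given hashrate (hashes per second). */
uint64_t ops_per_second(uint64_t hashes_per_second, uint64_t ops_nonce) {
    assert(ops_nonce == 0 || hashes_per_second <= UINT64_MAX / ops_nonce);
    return hashes_per_second * ops_nonce;
}

/* Hashrate after an integer speed-up factor (e.g. 2 or 5). */
uint64_t scaled_hashrate(uint64_t hashes_per_second, uint64_t factor) {
    assert(factor == 0 || hashes_per_second <= UINT64_MAX / factor);
    return hashes_per_second * factor;
}
```

With the README's figures, 256 × 64 gives 16 384 ops per nonce, so ~1.25 MH/s corresponds to roughly 2.05 × 10^10 program ops per second, and a 2–5× speed-up would land at about 2.5–6.25 MH/s.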


Why GPUx is hard for ASICs (one-screen summary)

ASICs win when the algorithm is small, homogeneous, and predictable. GPUx attacks each premise:

| ASIC premise | GPUx mechanism |
|---|---|
| Predictable kernel | Random program regenerated every 1024 blocks |
| Small kernel | 256 ops × 64 iters = 16 384 ops/nonce, 12 distinct opcodes, 32 64-bit lanes |
| Cheap memory | 2 GiB DAG with random dependent access (forces GDDR/HBM) |
| No cache | 16 KiB per-thread scratchpad with R-M-W (forces L1-equivalent) |
| One datapath | Mix of 64-bit int ALU, MULHI, AES round, IEEE-754 FP32 FMA |
| Throughput parallel | Latency-bound dependent chains limit pipelining |

Long-form analysis with comparisons to Ethash, ProgPoW, RandomX, Cuckaroo, and X16R is in docs/DESIGN_RATIONALE.md.
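
To make "random dependent access" concrete, here is a minimal toy sketch in C. It is not the GPUx mixing function: `mix64` is the SplitMix64 finalizer used as a stand-in, the DAG is a tiny array rather than 2 GiB, and every `toy_*` name is hypothetical. The point it illustrates is that each DAG index is derived from the previous load, so the next address cannot be computed until the current read has returned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* SplitMix64 finalizer: a stand-in mixer, NOT the GPUx round function. */
uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    x ^= x >> 31;
    return x;
}

/* Fill a toy DAG with pseudo-random words derived from a seed. */
void toy_dag_fill(uint64_t *dag, size_t n_words, uint64_t seed) {
    assert(dag != NULL && n_words > 0);
    for (size_t i = 0; i < n_words; i++)
        dag[i] = mix64(seed + 0x9E3779B97F4A7C15ULL * (uint64_t)(i + 1));
}

/* Walk the DAG with a serial dependency: the index of load i+1
 * depends on the value returned by load i, so loads cannot overlap. */
uint64_t toy_dependent_walk(const uint64_t *dag, size_t n_words,
                            uint64_t nonce, unsigned iters) {
    assert(dag != NULL && n_words > 0);
    uint64_t state = mix64(nonce);
    for (unsigned i = 0; i < iters; i++) {
        size_t idx = (size_t)(state % n_words);  /* address from current state */
        state = mix64(state ^ dag[idx]);         /* state from loaded value   */
    }
    return state;
}
```

Because every load feeds the next address, a chain like this is bounded by memory latency rather than bandwidth, which is the property the table's last row refers to.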


Repo layout

gpux/
├── ALGORITHM_SPEC.md          formal algorithm spec
├── COMMUNITY_TESTING.md       how to run tests and submit results
├── README.md                  this file
├── Makefile                   builds reference + tests (Linux/WSL/macOS)
├── spec/                      reference C implementation
│   ├── gpux.h / gpux.c        algorithm reference (the source of truth)
│   ├── blake2b.c+h            embedded BLAKE2b reference
│   ├── chacha20.c+h           embedded ChaCha20 reference
│   ├── aes_round.c+h          embedded AES single-round reference
│   └── test_vectors.h         frozen KAT (regenerate with `make gen-kat`)
├── tests/
│   ├── smoke.c                primitive correctness (BLAKE2b, ChaCha20, AES, KAT generators)
│   ├── kat.c                  full hash KAT (allocates 2 GiB)
│   └── gen_kat.c              regenerate test_vectors.h
├── cuda/                      CUDA implementation
│   ├── gpux_kernel.cu         the mining kernel
│   ├── gpux_device.cuh        device-side BLAKE2b/ChaCha20/AES
│   ├── gpux_miner.cu          host driver: verify, bench, info
│   ├── Makefile               Linux/WSL build
│   └── build.bat              Windows build (vcvars + nvcc)
├── bench/                     community testing
│   ├── run_bench.ps1          Windows harness
│   ├── run_bench.sh           Linux harness
│   └── results/               per-GPU JSON results (created on first run)
└── docs/
    └── DESIGN_RATIONALE.md    why each design choice; ASIC-resistance argument

Building

Linux / WSL / macOS (reference + tests)

make smoke   # primitive tests, no DAG
make kat     # full KAT (allocates 2 GiB)

Linux / WSL / macOS (CUDA)

cd cuda && make
./gpux_miner verify
./gpux_miner bench 30

Windows (CUDA)

Requires Visual Studio 2022 Build Tools and CUDA 13.x.

cd cuda
.\build.bat
.\gpux_miner.exe verify
.\gpux_miner.exe bench 30

Or, from the repository root, use the testing wrapper:

.\bench\run_bench.ps1 -Seconds 60

Tari integration (proposed)

Tari's existing block header is hashed with BLAKE2b-256 to produce a 32-byte digest. To use GPUx as a PoW algorithm:

header_digest = BLAKE2b-256(serialized_block_header_excluding_nonce)
block_hash    = GPUx(header_digest, nonce)

Difficulty target and Tari's multi-algo selection layer integrate at the consensus boundary. See ALGORITHM_SPEC.md §11.
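
The flow above can be sketched as a nonce search against a difficulty target. This is an illustration under stated assumptions, not the integration defined in ALGORITHM_SPEC.md: the 32-byte hash is assumed to be compared as a big-endian integer against the target, `hash_fn` is a hypothetical callback standing in for GPUx, and `toy_pow` is a trivial placeholder hash that exists only so the sketch can run.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature for block_hash = GPUx(header_digest, nonce). */
typedef void (*hash_fn)(const uint8_t digest[32], uint64_t nonce, uint8_t out[32]);

/* Assumption: a hash meets the target when hash <= target, with both
 * 32-byte values read as big-endian integers. */
int meets_target(const uint8_t hash[32], const uint8_t target[32]) {
    return memcmp(hash, target, 32) <= 0;
}

/* Try nonces start, start+1, ... for at most max_tries attempts.
 * Returns 1 and stores the winning nonce in *found, or 0 if none is found. */
int find_nonce(hash_fn hash_block, const uint8_t digest[32],
               const uint8_t target[32], uint64_t start,
               uint64_t max_tries, uint64_t *found) {
    assert(hash_block != NULL && found != NULL);
    uint8_t hash[32];
    for (uint64_t i = 0; i < max_tries; i++) {
        uint64_t nonce = start + i;
        hash_block(digest, nonce, hash);
        if (meets_target(hash, target)) {
            *found = nonce;
            return 1;
        }
    }
    return 0;
}

/* Placeholder hash (NOT GPUx): XORs the nonce bytes into the digest. */
void toy_pow(const uint8_t digest[32], uint64_t nonce, uint8_t out[32]) {
    for (int i = 0; i < 32; i++)
        out[i] = (uint8_t)(digest[i] ^ (uint8_t)(nonce >> (8 * (i % 8))));
}
```

In a real integration, `toy_pow` would be replaced by the GPUx hash, and the byte order and comparison rule would follow whatever ALGORITHM_SPEC.md §11 specifies.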


v0.1 status

  • Spec frozen for testing
  • Reference C impl, deterministic
  • KAT (1 epoch_seed, 5 nonces) with bit-exact reference output
  • CUDA impl matches reference
  • Baseline RTX 5090 hashrate (1.25 MH/s)
  • Cross-vendor FP32 determinism audit (NVIDIA Ada/Hopper/Blackwell vs AMD RDNA3/RDNA4 vs Intel)
  • Light-verifier Merkle DAG witness
  • Tari multi-algo selection integration
  • Optimized CUDA kernel (warp-coop DAG, shmem scratchpad)
  • OpenCL implementation for AMD/Intel

License

MIT — see LICENSE. Bundled reference primitives (BLAKE2b, ChaCha20, AES round, Argon2id) are public-domain or CC0/Apache-2.0 and remain so under MIT. The intent is full open-source auditability — fork it, break it, propose changes via PR, run your own bench results and submit them as JSON files in bench/results/.
