during test with large limbs found output data is wrong (see simple test bed at https://github.com/zkbitcoin/nasm-adx)
running example will create following output (asm(MUL is from projects asm_macros.hpp) case of overflow most likely see limbs_a[0] 9293073166814171452ULL
input limbs:
uint64_t limbs_r[4] = {};
uint64_t limbs_a[4] = {9293073166814171452ULL,4158907695144192454,2644031866505052884,3024693275553353487};
uint64_t limbs_b[4] = {2812702673390851119,5479905877917956870,1104182671213310543,818574998703379345};
generates:
asm(MUL
limbs_r[0] is 12178871726809496723 limbs_r[1] is 13840435079915171493 limbs_r[2] is 16771051252808782701 limbs_r[3] is 3578015002697288320
calculation by hand of Montgomery multiplier should generate (this is correct output)
limbs_r[0] is 7846254855529840460 limbs_r[1] is 2923310935437288472 limbs_r[2] is 3489859301534087952 limbs_r[3] is 91016735894317655