L-Mul replaces expensive floating-point multiplication with simple addition. Near-identical accuracy. 67% less silicon. 81× less energy. Hardware-verified.
LLMs like ChatGPT, Gemini, and Claude are everywhere—but they're computationally expensive. A single query uses 0.3–3 Wh of energy. As of July 2025, ChatGPT alone processes 2.5 billion prompts per day.
At the heart of these models is matrix multiplication—accounting for 65–85% of all compute. Traditional floating-point multipliers are power-hungry and silicon-intensive.
L-Mul offers an alternative: approximate multiplication using addition in log-space. Neural networks don't need exact arithmetic—they need fast, efficient inference. L-Mul delivers both.
Tested across MLP, CNN, LSTM, and Transformer architectures with hardware validation.
| Model | Dataset | FP32 | L-Mul | Δ |
|---|---|---|---|---|
| MLP | MNIST | 97.02% | 97.01% | −0.01% |
| CNN | MNIST | 98.30% | 98.07% | −0.23% |
| LSTM | FashionMNIST | 83.4% | 82.8% | −0.6% |
| LSTM | KMNIST | 84.8% | 84.2% | −0.6% |
| LSTM | SeqMNIST | 95.2% | 94.6% | −0.6% |
| Implementation | Correct | Accuracy |
|---|---|---|
| Standard BF16 Mult | 165/1000 | 16.5% |
| L-Mul Approximate | 177/1000 | 17.7% |
Note: Low absolute accuracy due to BF16 quantization of FP32-trained weights. L-Mul still outperforms standard BF16.
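The BF16 quantization behind this accuracy drop can be illustrated with a pure-Python round-trip (a sketch only: `to_bf16` is a hypothetical helper that truncates rather than rounds to nearest, which slightly overstates the error):

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip an FP32 value through BF16 by dropping the low 16 bits.

    Sketch only: real BF16 conversion rounds to nearest; truncation is the
    simplest approximation and slightly overstates the error.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    # BF16 keeps the sign, 8 exponent bits, and only 7 mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]
```

With 7 mantissa bits, each weight loses up to roughly 0.8% of relative precision, which compounds across layers when FP32-trained weights are evaluated in BF16.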
Relative figures compare against the IEEE CPU baseline (absolute numbers shown):

| Size | Time: LMUL Accel | Time: IEEE CPU (s) | Time: LMUL CPU | Energy: LMUL Accel | Energy: IEEE CPU (J) | Energy: LMUL CPU | Instr.: LMUL Accel | Instr.: IEEE CPU (M) | Instr.: LMUL CPU |
|---|---|---|---|---|---|---|---|---|---|
| n=4 | 1.20× faster | 0.000089 | 1.01× slower | 1.21× less | 0.000296 | 1.01× more | 1.23× less | 0.010 | 1.09× more |
| n=64 | 8.86× faster | 0.007192 | 3.69× slower | 9.52× less | 0.023336 | 3.68× more | 6.09× less | 0.877 | 9.87× more |
| n=256 | 37.36× faster | 0.446429 | 3.82× slower | 41.29× less | 1.473231 | 3.76× more | 21.11× less | 47.468 | 11.51× more |
| n=512 | 73.04× faster | 3.512644 | — | 80.97× less | 11.613140 | — | 39.73× less | 369.393 | — |
Multiple verification paths: ASIC synthesis, FPGA simulation, HLS, and full system simulation.
BF16 L-Mul synthesized with Nangate 45nm standard cells. 66.8% area reduction, 63% fewer cells. 500MHz timing verified.
Full CPU + accelerator simulation with DMA data path. Up to 73× speedup and 81× energy reduction at n=512.
Full MNIST classifier in RTL: nn.Linear(784→128) → ReLU → nn.Linear(128→10). BF16 datapath with FP32 accumulation.
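A host-side reference for this datapath might look like the following (hypothetical NumPy sketch; weight names and random initialization are illustrative, and it runs FP32 throughout rather than the RTL's BF16-multiply/FP32-accumulate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative random weights; the real design loads trained MNIST weights
W1, b1 = 0.01 * rng.standard_normal((128, 784)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((10, 128)), np.zeros(10)

def mlp_forward(x):
    """nn.Linear(784->128) -> ReLU -> nn.Linear(128->10), as plain NumPy."""
    h = np.maximum(W1 @ x + b1, 0.0)  # first linear layer + ReLU
    return W2 @ h + b2                # logits for the 10 digit classes
```

Each `@` here is the matrix-vector product the RTL implements with L-Mul units instead of exact multipliers.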
Tiled matrix multiplication for Qwen2.5 0.5B linear layer. On-chip memory reuse, parallel MACs. 40% DSP reduction.
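The tiling scheme can be sketched in NumPy (`TILE = 32` is an illustrative value; the HLS design's actual tile size and on-chip buffering details may differ):

```python
import numpy as np

TILE = 32  # illustrative tile size, not the value from the HLS design

def tiled_matmul(A, B):
    """Blocked matmul: each tile of A and B is loaded once and reused,
    mirroring on-chip memory reuse in the accelerator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                # Accumulate one tile-sized partial product into C
                C[i0:i0+TILE, j0:j0+TILE] += (
                    A[i0:i0+TILE, k0:k0+TILE] @ B[k0:k0+TILE, j0:j0+TILE]
                )
    return C
```

In hardware, the inner tile product maps onto the parallel MAC array, and only tile-sized buffers need to live in on-chip memory.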
VCD-based power estimation. 2.90× better energy efficiency (pJ/MAC). System-level speedup 1.77–2.31×.
Software implementation for accuracy validation. Vectorized BF16 ops with bias correction.
Replace O(n²) mantissa multiplication with O(n) addition in log-space.
Split the 16-bit input into sign (1 bit), exponent (8 bits), and mantissa (7 bits).
Treat exponent+mantissa as a single 15-bit field. Add with bias correction.
Check carry bits for overflow/underflow. Saturate to max/min values.
Add correction factor (result/32 + result/64) for accuracy.
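The steps above can be traced for a single pair of values in pure Python (a scalar sketch mirroring the vectorized implementation; bit layout follows the BF16 description above):

```python
import struct

def lmul_scalar(a: float, b: float) -> float:
    """Approximate a*b by adding BF16 exponent+mantissa fields (L-Mul)."""
    # Take the upper 16 bits of each FP32 value (its BF16 representation)
    ab = struct.unpack('<I', struct.pack('<f', a))[0] >> 16
    bb = struct.unpack('<I', struct.pack('<f', b))[0] >> 16
    sign = (ab ^ bb) & 0x8000                     # sign of the product
    s = (ab & 0x7FFF) + (bb & 0x7FFF) + 0x4080    # add fields with bias
    carry = (s >> 15) & 0x3
    # carry == 0: underflow -> 0; carry == 1: in range; carry >= 2: saturate
    field = (s & 0x7FFF) if carry == 1 else (0x7FFF if carry >= 2 else 0)
    r = struct.unpack('<f', struct.pack('<I', (sign | field) << 16))[0]
    return r + r / 32 + r / 64                    # correction factor
```

For example, `lmul_scalar(1.5, 1.5)` gives 2.09375 against an exact 2.25, within the few-percent error neural networks tolerate.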
```python
import torch

def lmul_bf16(a, b):
    # Extract BF16 from the upper 16 bits of each FP32 value
    a_bf16 = (a.view(torch.int32) >> 16) & 0xFFFF
    b_bf16 = (b.view(torch.int32) >> 16) & 0xFFFF
    # Extract sign and 15-bit exponent+mantissa field
    a_sign = (a_bf16 >> 15) & 0x1
    b_sign = (b_bf16 >> 15) & 0x1
    a_field = a_bf16 & 0x7FFF
    b_field = b_bf16 & 0x7FFF
    # L-Mul: add fields with bias correction
    sum_full = a_field + b_field + 0x4080
    carry = (sum_full >> 15) & 0x3
    # carry == 0: underflow -> 0; carry == 1: in range; carry >= 2: saturate
    result_field = torch.where(carry == 1, sum_full & 0x7FFF,
                               torch.zeros_like(sum_full))
    result_field = torch.where(carry >= 2, torch.full_like(sum_full, 0x7FFF),
                               result_field)
    # Pack sign and field, reinterpret the bits as FP32
    s = a_sign ^ b_sign
    result = (((s << 15) | result_field) << 16).view(torch.float32)
    # Correction factor for accuracy
    return result + result / 32 + result / 64
```
Which layers are most sensitive to L-Mul approximation?
gem5 / System Sim
Verilog / Vivado
Synthesis / Power
NanoGPT / Transformers
HLS / Qwen
Ablation / Website
Advised by Professor Rajesh Gupta
Halıcıoğlu Data Science Institute, UC San Diego