DSC 180B Capstone · UC San Diego · Winter 2026

Neural networks,
73× faster.

L-Mul replaces expensive floating-point multiplication with simple addition. Same accuracy. 67% less silicon. 81× less energy. Hardware-verified.

73×
Speedup
81×
Less Energy
67%
Area Reduction
<1%
Accuracy Loss

Why This Matters

LLMs like ChatGPT, Gemini, and Claude are everywhere—but they're computationally expensive. A single query uses 0.3–3 Wh of energy. As of July 2025, ChatGPT alone processes 2.5 billion prompts per day.

At the heart of these models is matrix multiplication—accounting for 65–85% of all compute. Traditional floating-point multipliers are power-hungry and silicon-intensive.

L-Mul offers an alternative: approximate multiplication using addition in log-space. Neural networks don't need exact arithmetic—they need fast, efficient inference. L-Mul delivers both.
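The log-space trick can be seen in miniature with plain FP32 bit patterns. The sketch below is an illustration of the general idea, not the project's kernel: for positive floats, the integer bit pattern is roughly a scaled log2 of the value, so adding two patterns (and subtracting one exponent bias) approximates multiplication.

```python
import struct

def approx_mul(a: float, b: float) -> float:
    """Mitchell-style approximate multiply for positive floats."""
    # An FP32 bit pattern is roughly (bias + log2(x)) * 2^23, so
    # pattern(a) + pattern(b) - bias*2^23 is roughly pattern(a*b).
    ia = struct.unpack("<I", struct.pack("<f", a))[0]
    ib = struct.unpack("<I", struct.pack("<f", b))[0]
    bias = 127 << 23  # FP32 exponent bias, shifted into bit-pattern space
    return struct.unpack("<f", struct.pack("<I", ia + ib - bias))[0]
```

The result is exact when the mantissas align (e.g. powers of two) and otherwise underestimates by a bounded amount; `approx_mul(3.0, 5.0)` gives 14.0 instead of 15.0, within the error neural networks tolerate.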

2.5B
ChatGPT prompts/day
🔋
0.3–3 Wh
Energy per LLM query
📊
65–85%
of ops are matmul

Results

Tested across MLP, CNN, LSTM, and Transformer architectures with hardware validation.

Classification Accuracy

FP32 vs L-Mul
Model | Dataset      | FP32   | L-Mul  | Δ
MLP   | MNIST        | 97.02% | 97.01% | −0.01%
CNN   | MNIST        | 98.30% | 98.07% | −0.23%
LSTM  | FashionMNIST | 83.4%  | 82.8%  | −0.6%
LSTM  | KMNIST       | 84.8%  | 84.2%  | −0.6%
LSTM  | SeqMNIST     | 95.2%  | 94.6%  | −0.6%

Accuracy Comparison

NanoGPT Perplexity

Shakespeare

FPGA Hardware Simulation

Vivado RTL
Implementation     | Correct  | Accuracy
Standard BF16 Mult | 165/1000 | 16.5%
L-Mul Approximate  | 177/1000 | 17.7%

Note: Low absolute accuracy due to BF16 quantization of FP32-trained weights. L-Mul still outperforms standard BF16.
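The quantization in question can be reproduced in a few lines. This helper is an illustration, not the project's conversion code: BF16 is simply the upper 16 bits of FP32, so truncation keeps the sign and 8-bit exponent but cuts the mantissa from 23 bits to 7.

```python
import numpy as np

def fp32_to_bf16_trunc(x):
    """Round FP32 values toward zero to the nearest BF16-representable value."""
    # BF16 = upper 16 bits of FP32: same sign and exponent,
    # mantissa shrinks from 23 to 7 bits.
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)
```

Values like 1.0 survive exactly, while an arbitrary weight moves to the nearest lower BF16 grid point, a relative perturbation of up to about 2^-8 per weight; applied to every weight of an FP32-trained network, these perturbations accumulate into the accuracy drop noted above.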

gem5 System Simulation

ARM CPU + LMUL Accelerator
LMUL Accel and LMUL CPU figures are relative to the IEEE CPU baseline, which is shown in absolute units.

Simulation Time (s)
Size  | LMUL Accel    | IEEE CPU | LMUL CPU
n=4   | 1.20× faster  | 0.000089 | 1.01× slower
n=64  | 8.86× faster  | 0.007192 | 3.69× slower
n=256 | 37.36× faster | 0.446429 | 3.82× slower
n=512 | 73.04× faster | 3.512644 | n/a

Estimated Energy (J)
Size  | LMUL Accel  | IEEE CPU  | LMUL CPU
n=4   | 1.21× less  | 0.000296  | 1.01× more
n=64  | 9.52× less  | 0.023336  | 3.68× more
n=256 | 41.29× less | 1.473231  | 3.76× more
n=512 | 80.97× less | 11.613140 | n/a

Instructions (M)
Size  | LMUL Accel  | IEEE CPU | LMUL CPU
n=4   | 1.23× less  | 0.010    | 1.09× more
n=64  | 6.09× less  | 0.877    | 9.87× more
n=256 | 21.11× less | 47.468   | 11.51× more
n=512 | 39.73× less | 369.393  | n/a

Speedup Scaling

LMUL Accel vs IEEE

Energy Reduction

LMUL Accel vs IEEE
63%
Fewer Cells
40%
Less DSP Usage
500MHz
Timing Verified

Synthesis Area Comparison

Nangate 45nm

HLS Resource Usage

Vitis HLS

Hardware Validation

Multiple verification paths: ASIC synthesis, FPGA simulation, HLS, and full system simulation.

Yosys Synthesis


BF16 L-Mul synthesized with Nangate 45nm standard cells. 66.8% area reduction, 63% fewer cells. 500MHz timing verified.

Yosys · OpenSTA · 45nm

gem5 System Sim


Full CPU + accelerator simulation with DMA data path. Up to 73× speedup and 81× energy reduction at n=512.

gem5 · ARM · DMA

Verilog MLP


Full MNIST classifier in RTL: nn.Linear(784→128) → ReLU → nn.Linear(128→10). BF16 datapath with FP32 accumulation.

Vivado · Verilog · MNIST
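For reference, the RTL datapath mirrors this two-layer forward pass. The NumPy sketch below is a software analogue of the classifier's computation, not the Verilog itself:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Reference forward pass for the RTL MLP: 784 -> 128 -> ReLU -> 10."""
    h = np.maximum(x @ W1 + b1, 0.0)  # nn.Linear(784, 128) + ReLU
    return h @ W2 + b2                # nn.Linear(128, 10) logits
```

In the hardware version, each `@` is a grid of L-Mul MACs: BF16 products accumulated in FP32, matching the listed datapath.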

Vitis HLS


Tiled matrix multiplication for Qwen2.5 0.5B linear layer. On-chip memory reuse, parallel MACs. 40% DSP reduction.

Vitis HLS · Qwen · Tiling
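The tiling scheme can be sketched in NumPy. This is an analogue of the HLS loop structure under the stated design (on-chip tile reuse), not the actual Vitis code: each tile of A is loaded once and reused across a whole row of output tiles, cutting external memory traffic.

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked matmul with T x T tiles (BRAM analogue of the HLS kernel)."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, T):
        for k in range(0, K, T):
            a_tile = A[i:i+T, k:k+T]      # fetched once, reused below
            for j in range(0, N, T):
                # inner tile product maps to the parallel MAC array
                C[i:i+T, j:j+T] += a_tile @ B[k:k+T, j:j+T]
    return C
```

Choosing T to fit two tiles in BRAM trades on-chip memory for a T-fold reduction in reads of A, which is what frees DSPs and bandwidth for the Qwen2.5 linear layer.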

Power Analysis


VCD-based power estimation. 2.90× better energy efficiency (pJ/MAC). System-level speedup 1.77–2.31×.

VCD · OpenSTA · Power

PyTorch Reference


Software implementation for accuracy validation. Vectorized BF16 ops with bias correction.

PyTorch · NumPy · Python

The L-Mul Algorithm

Replace O(n²) mantissa multiplication with O(n) addition in log-space.

1

Extract BF16 Fields

Split 16-bit input into sign (1), exponent (8), mantissa (7 bits).

2

Add Instead of Multiply

Treat exponent+mantissa as a single 15-bit field. Add with bias correction.

3

Handle Overflow

Check carry bits for overflow/underflow. Saturate to max/min values.

4

Apply Correction

Add correction factor (result/32 + result/64) for accuracy.

lmul_bf16.py
import torch

def lmul_bf16(a, b):
    # Extract the BF16 pattern from the upper 16 bits of the FP32 encoding
    a_bf16 = (a.view(torch.int32) >> 16) & 0xFFFF
    b_bf16 = (b.view(torch.int32) >> 16) & 0xFFFF

    # Split off the sign; the remaining 15 bits are exponent|mantissa
    a_sign = (a_bf16 >> 15) & 0x1
    b_sign = (b_bf16 >> 15) & 0x1
    a_field = a_bf16 & 0x7FFF
    b_field = b_bf16 & 0x7FFF

    # L-Mul core: add fields with bias. 0x4080 = 0x8000 - 0x3F80 removes
    # the doubled exponent bias and lands in-range results at carry == 1
    sum_full = a_field + b_field + 0x4080
    carry = (sum_full >> 15) & 0x3

    # carry == 0: underflow (flush to zero); carry >= 2: overflow (saturate)
    result_field = torch.where(carry == 1,
                               sum_full & 0x7FFF, 0)
    result_field = torch.where(carry >= 2,
                               0x7FFF, result_field)

    # Repack sign and field, shift back into the FP32 upper half
    s = a_sign ^ b_sign
    result = ((s << 15) | result_field) << 16
    result = result.view(torch.float32)
    # Correction factor ~= *(1 + 3/64) offsets the systematic underestimate
    return result + result/32 + result/64
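For unit-testing the bit-level datapath, the same steps can be restated as a scalar function on raw 16-bit BF16 patterns. This is our sketch rather than project code, and it omits the final correction term:

```python
def lmul_bf16_scalar(a_bits: int, b_bits: int) -> int:
    """L-Mul on raw BF16 patterns (1 sign, 8 exponent, 7 mantissa bits)."""
    s = ((a_bits >> 15) ^ (b_bits >> 15)) & 0x1
    # 0x4080 = 0x8000 - 0x3F80: removes the doubled exponent bias
    total = (a_bits & 0x7FFF) + (b_bits & 0x7FFF) + 0x4080
    carry = (total >> 15) & 0x3
    if carry == 1:
        field = total & 0x7FFF   # in range
    elif carry >= 2:
        field = 0x7FFF           # overflow: saturate to max
    else:
        field = 0                # underflow: flush to zero
    return (s << 15) | field
```

Power-of-two cases come out exact, e.g. `lmul_bf16_scalar(0x3FC0, 0x4000)` (1.5 × 2.0) returns `0x4040`, the BF16 pattern for 3.0, which makes such cases convenient RTL testbench vectors.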

Ablation Studies

Which layers are most sensitive to L-Mul approximation?

LSTM: Gates vs States

Perplexity
FP32 Baseline     | 1.58
L-Mul Gates Only  | 1.58
L-Mul States Only | 1.55
L-Mul Both        | 1.54 ✓
Finding: State updates benefit from L-Mul—possible regularization effect.

CNN: Conv vs FC

Accuracy
FP32 Baseline   | 98.30%
L-Mul Conv Only | 98.00%
L-Mul FC Only   | 98.20% ✓
L-Mul Both      | 98.07%
Finding: FC layers tolerate L-Mul better than conv layers.

NanoGPT: Layer Types

Perplexity
FP32 Baseline | 4.97
Full L-Mul    | 4.64
Attn + MLP    | 4.66
MLP + LM Head | 4.55 ✓
MLP Only      | 4.73

Key Insights

Dense > Spatial: FC/linear layers consistently handle L-Mul better.
Regularization: Some components actually improve with L-Mul.
Hardware Priority: Target dense matmuls first for best ROI.

Who Built This


Owen Shi

gem5 / System Sim


Edgar Guzman

Verilog / Vivado


Idhant Kumar

Synthesis / Power


Brendan Kuang

NanoGPT / Transformers


Brandon Chiou

HLS / Qwen


Pranav Kumarsubha

Ablation / Website

Advised by Professor Rajesh Gupta

Halıcıoğlu Data Science Institute, UC San Diego