L-Mul replaces expensive floating-point multiplication with simple addition. Near-identical accuracy. 67% less silicon. 81× less energy. Hardware-verified.
LLMs like ChatGPT, Gemini, and Claude are everywhere—but they're computationally expensive. A single query uses 0.3–3 Wh of energy. As of July 2025, ChatGPT alone processes 2.5 billion prompts per day.
At the heart of these models is matrix multiplication—accounting for 65–85% of all compute. Traditional floating-point multipliers are power-hungry and silicon-intensive.
L-Mul offers an alternative: approximate multiplication using addition in log-space. Neural networks don't need exact arithmetic—they need fast, efficient inference. L-Mul delivers both.
Tested across MLP, CNN, LSTM, and Transformer architectures with hardware validation.
| Model | Dataset | FP32 | L-Mul | Δ |
|---|---|---|---|---|
| MLP | MNIST | 97.02% | 97.01% | −0.01% |
| CNN | MNIST | 98.30% | 98.07% | −0.23% |
| LSTM | FashionMNIST | 83.4% | 82.8% | −0.6% |
| LSTM | KMNIST | 84.8% | 84.2% | −0.6% |
| LSTM | SeqMNIST | 95.2% | 94.6% | −0.6% |
| Implementation | Correct | Accuracy |
|---|---|---|
| Standard BF16 Mult | 165/1000 | 16.5% |
| L-Mul Approximate | 177/1000 | 17.7% |
Note: Low absolute accuracy due to BF16 quantization of FP32-trained weights. L-Mul still outperforms standard BF16.
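The BF16 quantization behind this accuracy drop can be illustrated with a pure-Python round-trip (a sketch only: `to_bf16` is a hypothetical helper that truncates rather than rounds to nearest, which slightly overstates the error):

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip an FP32 value through BF16 by dropping the low 16 bits.

    Sketch only: real BF16 conversion rounds to nearest; truncation is the
    simplest approximation and slightly overstates the error.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    # BF16 keeps the sign, 8 exponent bits, and only 7 mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]
```

With 7 mantissa bits, each weight loses up to roughly 0.8% of relative precision, which compounds across layers when FP32-trained weights are evaluated in BF16.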
Relative figures compare against the IEEE CPU baseline (absolute numbers shown):

| Size | Time: LMUL Accel | Time: IEEE CPU (s) | Time: LMUL CPU | Energy: LMUL Accel | Energy: IEEE CPU (J) | Energy: LMUL CPU | Instr.: LMUL Accel | Instr.: IEEE CPU (M) | Instr.: LMUL CPU |
|---|---|---|---|---|---|---|---|---|---|
| n=4 | 1.20× faster | 0.000089 | 1.01× slower | 1.21× less | 0.000296 | 1.01× more | 1.23× less | 0.010 | 1.09× more |
| n=64 | 8.86× faster | 0.007192 | 3.69× slower | 9.52× less | 0.023336 | 3.68× more | 6.09× less | 0.877 | 9.87× more |
| n=256 | 37.36× faster | 0.446429 | 3.82× slower | 41.29× less | 1.473231 | 3.76× more | 21.11× less | 47.468 | 11.51× more |
| n=512 | 73.04× faster | 3.512644 | — | 80.97× less | 11.613140 | — | 39.73× less | 369.393 | — |
Multiple verification paths: ASIC synthesis, FPGA simulation, HLS, and full system simulation.
BF16 L-Mul synthesized with Nangate 45nm standard cells. 66.8% area reduction, 63% fewer cells. 500MHz timing verified.
Full CPU + accelerator simulation with DMA data path. Up to 73× speedup and 81× energy reduction at n=512.
Full MNIST classifier in RTL: nn.Linear(784→128) → ReLU → nn.Linear(128→10). BF16 datapath with FP32 accumulation.
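A host-side reference for this datapath might look like the following (hypothetical NumPy sketch; weight names and random initialization are illustrative, and it runs FP32 throughout rather than the RTL's BF16-multiply/FP32-accumulate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative random weights; the real design loads trained MNIST weights
W1, b1 = 0.01 * rng.standard_normal((128, 784)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((10, 128)), np.zeros(10)

def mlp_forward(x):
    """nn.Linear(784->128) -> ReLU -> nn.Linear(128->10), as plain NumPy."""
    h = np.maximum(W1 @ x + b1, 0.0)  # first linear layer + ReLU
    return W2 @ h + b2                # logits for the 10 digit classes
```

Each `@` here is the matrix-vector product the RTL implements with L-Mul units instead of exact multipliers.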
Tiled matrix multiplication for Qwen2.5 0.5B linear layer. On-chip memory reuse, parallel MACs. 40% DSP reduction.
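The tiling scheme can be sketched in NumPy (`TILE = 32` is an illustrative value; the HLS design's actual tile size and on-chip buffering details may differ):

```python
import numpy as np

TILE = 32  # illustrative tile size, not the value from the HLS design

def tiled_matmul(A, B):
    """Blocked matmul: each tile of A and B is loaded once and reused,
    mirroring on-chip memory reuse in the accelerator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            for k0 in range(0, K, TILE):
                # Accumulate one tile-sized partial product into C
                C[i0:i0+TILE, j0:j0+TILE] += (
                    A[i0:i0+TILE, k0:k0+TILE] @ B[k0:k0+TILE, j0:j0+TILE]
                )
    return C
```

In hardware, the inner tile product maps onto the parallel MAC array, and only tile-sized buffers need to live in on-chip memory.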
VCD-based power estimation. 2.90× better energy efficiency (pJ/MAC). System-level speedup 1.77–2.31×.
Software implementation for accuracy validation. Vectorized BF16 ops with bias correction.
Replace O(n²) mantissa multiplication with O(n) addition in log-space.
Split the 16-bit input into sign (1 bit), exponent (8 bits), and mantissa (7 bits).
Treat exponent+mantissa as a single 15-bit field. Add with bias correction.
Check carry bits for overflow/underflow. Saturate to max/min values.
Add correction factor (result/32 + result/64) for accuracy.
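The steps above can be traced for a single pair of values in pure Python (a scalar sketch mirroring the vectorized implementation; bit layout follows the BF16 description above):

```python
import struct

def lmul_scalar(a: float, b: float) -> float:
    """Approximate a*b by adding BF16 exponent+mantissa fields (L-Mul)."""
    # Take the upper 16 bits of each FP32 value (its BF16 representation)
    ab = struct.unpack('<I', struct.pack('<f', a))[0] >> 16
    bb = struct.unpack('<I', struct.pack('<f', b))[0] >> 16
    sign = (ab ^ bb) & 0x8000                     # sign of the product
    s = (ab & 0x7FFF) + (bb & 0x7FFF) + 0x4080    # add fields with bias
    carry = (s >> 15) & 0x3
    # carry == 0: underflow -> 0; carry == 1: in range; carry >= 2: saturate
    field = (s & 0x7FFF) if carry == 1 else (0x7FFF if carry >= 2 else 0)
    r = struct.unpack('<f', struct.pack('<I', (sign | field) << 16))[0]
    return r + r / 32 + r / 64                    # correction factor
```

For example, `lmul_scalar(1.5, 1.5)` gives 2.09375 against an exact 2.25, within the few-percent error neural networks tolerate.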
```python
import torch

def lmul_bf16(a, b):
    # Extract BF16 from the upper 16 bits of each FP32 value
    a_bf16 = (a.view(torch.int32) >> 16) & 0xFFFF
    b_bf16 = (b.view(torch.int32) >> 16) & 0xFFFF
    # Extract sign and 15-bit exponent+mantissa field
    a_sign = (a_bf16 >> 15) & 0x1
    b_sign = (b_bf16 >> 15) & 0x1
    a_field = a_bf16 & 0x7FFF
    b_field = b_bf16 & 0x7FFF
    # L-Mul: add fields with bias correction
    sum_full = a_field + b_field + 0x4080
    carry = (sum_full >> 15) & 0x3
    # carry == 0: underflow -> 0; carry == 1: in range; carry >= 2: saturate
    result_field = torch.where(carry == 1, sum_full & 0x7FFF,
                               torch.zeros_like(sum_full))
    result_field = torch.where(carry >= 2, torch.full_like(sum_full, 0x7FFF),
                               result_field)
    # Pack sign and field, reinterpret the bits as FP32
    s = a_sign ^ b_sign
    result = (((s << 15) | result_field) << 16).view(torch.float32)
    # Correction factor for accuracy
    return result + result / 32 + result / 64
```
Which layers are most sensitive to L-Mul approximation?
gem5 / System Sim
Verilog / Vivado
Synthesis / Power
NanoGPT / Transformers
HLS / Qwen
Ablation / Website
Advised by Professor Rajesh Gupta
Halıcıoğlu Data Science Institute, UC San Diego