VBlog

Experiments. Thoughts. Experiences.

“Every microsecond of wasted compute mattered. We realized that if we optimized just 1%, we could cut millions in cost.”
– Luo Fuli, Chief Architect at DeepSeek

When ChatGPT launched in November 2022, it was supposed to be the disruption of the tech oligopoly that never happened. Why? Because rather than disrupting big tech, it became big tech: GPT-4 cost more than a hundred million dollars to train. DeepSeek-R1 is different for exactly that reason: it cost a mere six million dollars to train (and also for being more open than “open”AI). In this post, I will go into the details of how they pulled it off. This is going to be a fairly technical one.

How did DeepSeek achieve such a feat?

They pulled off a series of data, software, and hardware sleights of hand to significantly cut their costs. Rather than making incremental updates, they made some foundational changes that paid off handsomely. Let’s look at each of the three in detail.

Software/Model

IMO, this is the most important category. Let’s look at the different innovations:

  1. Their use of MoE (Mixture of Experts) helped them be much leaner with their GPU usage. I think it is also a much closer representation of the human brain. It gave them results similar to OpenAI’s o1 without going the agentic route.
  2. Another key modification is Sparse Attention. It reduces complexity by allowing each word to attend to only a subset of other words, as if skimming a book rather than reading it cover-to-cover. The computational cost improves from O(n²) to O(n log n) or even better.
  3. Dynamic Precision Scaling was the other key improvement. Precision here refers to the number of bits used to represent the numbers in the model. The idea is to choose the precision on the fly based on the needs of the task. E.g., during training, they might use FP16 for most operations but switch to FP32 for gradient updates. It’s like changing gears while driving based on the terrain. (Code sketches of all three ideas follow this list.)
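
To make these ideas concrete, here are three minimal Python/PyTorch sketches. They are illustrations under assumed sizes (d_model, number of experts, top_k, window width), not DeepSeek’s actual architecture or code.

First, MoE routing: a small gating network picks the top-k experts per token, so only a fraction of the parameters do any work.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoE(nn.Module):
        """Toy Mixture-of-Experts layer: a router sends each token to its
        top-k experts, so only a small slice of the network runs per token."""

        def __init__(self, d_model=512, n_experts=16, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)        # gating network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                      # x: (tokens, d_model)
            scores = self.router(x)                # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e       # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out

    moe = TinyMoE()
    print(moe(torch.randn(8, 512)).shape)   # torch.Size([8, 512]); only 2 of 16 experts ran per token

With top_k = 2 of 16 experts, only about an eighth of the expert parameters are active per token; the same lever is behind the “5% activation” figure mentioned further down.

Second, sparse attention: the simplest sparsity pattern is a sliding window, where each position attends only to a few neighbours instead of the whole sequence. Other patterns (strided attention, a handful of global tokens) are what get you toward the O(n log n) costs mentioned above.

    import torch

    def local_attention_mask(seq_len, window=2):
        """Each position may attend only to itself and `window` previous positions."""
        i = torch.arange(seq_len)[:, None]
        j = torch.arange(seq_len)[None, :]
        return (j <= i) & (j >= i - window)    # causal band instead of a full matrix

    print(local_attention_mask(8).int())
    # At most window+1 allowed keys per query row instead of all 8,
    # so attention cost grows like O(n*w) rather than O(n^2).

Third, precision switching: the generic version of this is mixed-precision training, where most of the math runs in FP16 while the weights and the optimizer step stay in FP32. This is a standard PyTorch autocast loop, not DeepSeek’s in-house scheme; it reuses TinyMoE from the sketch above.

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = TinyMoE().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    # GradScaler keeps FP16 gradients from underflowing; it is a no-op on CPU.
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    for step in range(10):
        batch = torch.randn(8, 512, device=device)
        # Bulk of the forward/backward math in FP16; the optimizer step stays FP32.
        with torch.autocast(device_type=device, dtype=torch.float16,
                            enabled=(device == "cuda")):
            loss = model(batch).pow(2).mean()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()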

Hardware

They were forced to innovate because of the export controls on high-end Nvidia GPUs imposed by the US government. They relied on the sub-par PCIe-based A100 GPUs instead of the SXM versions. Scarcity is indeed the mother of innovation!

  1. PCIe A100 GPUs:
    Using “sub-par” PCIe-based chips (instead of premium SXM versions), they optimized for cost. Think of it as building a racecar with economy parts—and still winning the race.
  2. 2-Layer Fat-Tree Network:
    A streamlined “data highway” that reduced networking costs by 40%. Traditional setups resemble congested city roads; DeepSeek’s design acts like an express lane.
  3. HFReduce Library:
    Replaced Nvidia’s NCCL with a custom tool tailored for their hardware. Result? 35% faster data transfer between GPUs. (A rough sketch of the all-reduce operation such libraries implement follows this list.)
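
HFReduce itself isn’t something I can show, but the operation it speeds up is an all-reduce: every GPU contributes its local gradients and gets back the global sum. Here is a minimal stand-in using PyTorch’s built-in collectives (the "gloo" backend so the sketch runs on CPU; on a real cluster, this call is where NCCL, or DeepSeek’s replacement for it, spends its time):

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # Each process stands in for one GPU in this sketch.
        os.environ["MASTER_ADDR"] = "127.0.0.1"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        local_grad = torch.full((4,), float(rank))           # pretend gradient
        dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)    # everyone gets the sum
        print(f"rank {rank} now holds {local_grad.tolist()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, args=(4,), nprocs=4)

Every rank ends up holding the sum 0+1+2+3 = 6 in each slot. Since this collective runs on every training step, making it 35% faster on their particular PCIe topology feeds straight into the cost savings.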

Data

DeepSeek’s data optimizations are the unsung heroes:

  1. Deduplication: By removing repetitive data (e.g., trimming 1,000 near-identical cat images to 100), they cut training time by 30%. Google’s 2022 study showed similar gains—but scaling this is no small feat. (A toy dedup pass is sketched after this list.)
  2. Adaptive Data Remixing: Like a chef adjusting ingredients, DeepSeek rebalances datasets mid-training to focus on underrepresented topics, boosting model accuracy.
  3. Custom Tokenizer: Their in-house tool converts text into AI-digestible chunks 20% more efficiently than off-the-shelf alternatives.
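
To show what a dedup pass even looks like, here is a toy exact-match version over text records. Real pipelines use fuzzier techniques (MinHash/LSH, embedding similarity) to catch near-duplicates like those 1,000 cat images; the normalization rule below is just an illustrative assumption.

    import hashlib

    def dedup(records):
        """Keep the first occurrence of each normalized record, drop exact repeats."""
        seen, kept = set(), []
        for text in records:
            normalized = " ".join(text.lower().split())            # cheap normalization
            digest = hashlib.sha1(normalized.encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(text)
        return kept

    corpus = [
        "The cat sat on the mat.",
        "the cat  sat on the mat.",   # differs only in case/spacing
        "A completely different sentence.",
    ]
    print(dedup(corpus))   # 2 of 3 records survive

Fewer duplicates means fewer redundant gradient updates, which is where the claimed 30% training-time saving comes from.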

All these techniques complement each other, converging into a fast and lean model. Their latest model (DeepSeek-V3) activates just 5% of the network at any time.


DeepSeek vs. The Giants: A Speculative Comparison

Metric                            GPT-4 (OpenAI)    DeepSeek-V3
Training Cost                     $120M             $6M
GPU Efficiency (% of peak TFLOPS) 45%               82%
Activation Rate                   100%              5% (MoE)

The Team: Underdogs With A Mission

A look at their team is what truly feels disruptive (reminiscent of the early Google or Apple upending their markets).

Liang Wenfeng (Founder, 39):

He founded the hedge fund High-Flyer in 2015. In 2021, he stockpiled 10,000 A100 chips—a $200M bet that now powers DeepSeek’s rise.

Luo Fuli (Chief Architect, 29):

She grew up poor in the Sichuan countryside and went to an average university in Beijing. Her journey mirrors Steve Jobs’ “think different” ethos.

So, what’s next?

The implications of DeepSeek’s $6M breakthrough are seismic. While their model has answered millions of questions for users worldwide, it’s also cracked open a Pandora’s box of questions about the future of AI:

  • Is this the end of Nvidia’s bull run?
  • What could be the impact of “cheap” AI on bad actors?
  • Will this trigger other countries’ entry into the AI race?
  • What about even smaller startups?

One thing is certain: The era of “bigger is better” AI is over. Companies building humongous data centers are chasing diminishing returns. This shift could also mean a much lower environmental impact for AI—a silver lining in an industry often criticized for its carbon footprint. I hope they are taking notes!

What do you think about the questions that DeepSeek raises? Do leave a comment with your thoughts.
