
Deploying DeepSeek-R1 on AWS: Our Journey Through Performance, Cost, and Reality

5 min read · Apr 14, 2025

🚀 TL;DR

  • At our startup, we’re passionate about AI and future-ready technologies, so we explored deploying open-source DeepSeek-R1 models in-house to evaluate their viability as alternatives to paid services like ChatGPT.
  • We tested multiple DeepSeek-R1 models (1.5B, 7B, 8B, and 16B) across AWS EC2 GPU instances (g5g.xlarge, g5g.2xlarge).
  • While the deployment was technically successful, the operational costs significantly exceeded expectations.
  • For startups and growing enterprises, self-hosting these LLMs isn’t yet cost-effective at smaller scales, but we’re optimistic about the near future.

🤖 Why Did We Experiment With Self-Hosted LLMs?

Our team thrives on exploring cutting-edge technology. Recently, we asked ourselves a big question:

“Can we use powerful, open-source language models like DeepSeek-R1 to replace paid AI coding assistants?”

We had clear motivations:

  • Control & Customization: Fine-tune the AI specifically for our use-cases.
  • Privacy & Security: Maintain complete control over sensitive code and data.
  • Long-Term Cost Savings: Potentially reduce ongoing AI tool expenses.

DeepSeek-R1 was particularly appealing because:

  • It offers multiple model sizes (1.5B, 7B, 8B, 16B), giving us flexibility.
  • It is fully open-source, which makes it ideal for innovation and rapid experimentation.
  • It made headlines by outperforming ChatGPT on certain coding benchmarks.

🛠️ Choosing the Optimal AWS Infrastructure

Deploying these models required careful GPU instance selection.
We tested:

  • g5g.xlarge (4 vCPUs, 8 GB RAM, 16 GB GPU memory)
  • g5g.2xlarge (8 vCPUs, 16 GB RAM, 16 GB GPU memory)

Why AWS EC2 (G5g Series)?

  • A strong balance of GPU performance and price.
  • Compatibility with CUDA and machine learning frameworks.

⚡ Our Deployment Stack & Setup

We kept things lean, practical, and fast to iterate. Here’s what we used to get DeepSeek models up and running:

  • Docker & NVIDIA Container Toolkit: The backbone of our deployment. It ensured consistent environments with GPU access, saving us from the usual “it works on my machine” chaos.
  • Ollama + OpenWeb UI: Ollama made local model loading dead simple, and OpenWeb UI gave us a clean, ChatGPT-like frontend — perfect for internal testing with engineers.
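
To give a feel for how this stack is exercised, here is a minimal Python sketch that calls Ollama's local HTTP API directly. It assumes Ollama is listening on its default port (11434) and that a DeepSeek-R1 tag such as deepseek-r1:7b has already been pulled; adjust the model name and host for your own setup.

```python
import json
import urllib.request

# Minimal smoke test against a locally running Ollama instance
# (assumes the default port 11434 and a previously pulled deepseek-r1:7b tag).
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "deepseek-r1:7b",
    "prompt": "Write a Python function that reverses a linked list.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])
```

OpenWeb UI talks to the same Ollama endpoint, so a script like this is a quick way to confirm the backend and GPU passthrough are healthy before pointing the frontend at it.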

📈 Benchmarking DeepSeek-R1 Performance

We benchmarked four DeepSeek-R1 model sizes: 1.5B, 7B, 8B, and 16B. Our benchmarking revealed crucial insights:

Key Takeaways:

  • Smaller models (1.5B) were highly efficient in terms of tokens/sec and cost, but their response quality fell short for production-grade use.
  • Mid-tier models (7B, 8B) delivered noticeably better responses, but still didn’t reach the quality bar needed to fully replace commercial LLMs.
  • The 16B model, while more capable, introduced major trade-offs: slower throughput, high operational costs, and low concurrency, so we didn't even consider trying larger models.

📌 Note: In our testing, anything below 30–35 tokens/sec began to feel sluggish to the human eye. This threshold became especially noticeable in interactive settings — responses started to drag, and the user experience suffered.
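
To put a number like that on your own setup, throughput can be derived from the metadata Ollama returns with each non-streaming response (eval_count and eval_duration). The sketch below assumes a local Ollama instance on the default port; the model tags and prompt are illustrative.

```python
import json
import urllib.request

def tokens_per_second(model: str, prompt: str) -> float:
    """Request a completion from a local Ollama instance and derive
    tokens/sec from the eval_count / eval_duration fields it reports."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # eval_duration is reported in nanoseconds
    return body["eval_count"] / (body["eval_duration"] / 1e9)

for tag in ["deepseek-r1:1.5b", "deepseek-r1:7b", "deepseek-r1:8b"]:
    tps = tokens_per_second(tag, "Explain the two-pointer technique with an example.")
    flag = "ok" if tps >= 30 else "feels sluggish"
    print(f"{tag}: {tps:.1f} tokens/sec ({flag})")
```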

💰 Cost Reality Check: AWS vs. SaaS (ChatGPT)

When we compared our numbers against existing SaaS solutions like ChatGPT, the picture was clear:

  • Keeping just one g5g.2xlarge instance running 24/7 clocks in at around $414/month.
  • Equivalent SaaS services (e.g., ChatGPT Plus at $20/user/month) appeared drastically cheaper at our scale.
  • Hidden operational costs (setup, DevOps overhead, maintenance) further widened the gap.

Bottom Line: For startups operating at a small-to-medium scale, today’s SaaS LLM offerings deliver far better value for money. Until infra costs drop or usage scales dramatically, self-hosting simply doesn’t add up.
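
The back-of-the-envelope math behind that conclusion is simple enough to sketch. The hourly rate below is an assumption backed out of our ~$414/month figure; substitute the current on-demand price for your region before relying on it.

```python
# Rough break-even sketch: one self-hosted g5g.2xlarge vs. ChatGPT Plus seats.
HOURS_PER_MONTH = 24 * 31
G5G_2XLARGE_HOURLY = 414 / HOURS_PER_MONTH   # ~= $0.56/hour, derived from ~$414/month
CHATGPT_PLUS_SEAT = 20                        # $/user/month

monthly_self_hosted = G5G_2XLARGE_HOURLY * HOURS_PER_MONTH
break_even_seats = monthly_self_hosted / CHATGPT_PLUS_SEAT

print(f"Self-hosted (1x g5g.2xlarge, 24/7): ~${monthly_self_hosted:,.0f}/month")
print(f"Break-even vs. ChatGPT Plus: ~{break_even_seats:.0f} seats")
# => one always-on instance costs roughly as much as ~21 Plus seats,
#    before counting DevOps time, storage, and data transfer.
```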

🚧 Technical Challenges & Key Lessons

Our journey wasn’t without obstacles:

  • GPU memory limitations: The 16B model routinely crashed under load until we applied aggressive quantization techniques and heavily reduced batch sizes. Even then, inference stability wasn’t guaranteed, especially during peak throughput.
  • Performance degradation with longer context: As the prompt context increased — especially with 8B and 16B models — we observed token generation speeds dropping below 30 tokens/sec. This led to noticeably slower response times and, in some cases, complete crashes when the context window neared maximum limits.
  • Performance tuning complexity: Although the models were pretrained, getting optimal throughput still required significant tuning. Small tweaks to batch_size, sequence length, or context window sometimes introduced unpredictable latency or runtime crashes.

Critical Lessons:

  • Larger models require meticulous GPU memory management and rigorous testing.
  • Optimize batch inference from the outset to handle concurrent requests efficiently.
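
Many of those knobs map onto per-request options that Ollama accepts alongside the prompt. The sketch below shows the kind of settings we mean (context window, response length, batch size); the specific values are illustrative examples, not a tuned recommendation.

```python
import json
import urllib.request

# Illustrative request options for keeping a larger model inside GPU memory.
payload = {
    "model": "deepseek-r1:8b",
    "prompt": "Summarize the trade-offs of quantizing an 8B model to 4 bits.",
    "stream": False,
    "options": {
        "num_ctx": 4096,      # cap the context window to limit KV-cache memory
        "num_predict": 512,   # bound the response length
        "num_batch": 128,     # smaller batch size to reduce peak memory usage
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```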

🤯 Reflecting on the Scale of OpenAI & DeepSeek

Running DeepSeek-R1 models locally gave us a new appreciation for what it takes to operate at the scale of OpenAI or even DeepSeek’s own hosted infrastructure. We were just trying to support fewer than 100 internal users — and yet, the 8B and 16B models already pushed the limits of what we could reliably run on GPU-backed EC2 instances.

These models pale in comparison to the massive 400B+ parameter models that serve millions globally. Delivering responsive, stable, and cost-effective LLM experiences at that scale is nothing short of engineering wizardry.

This reflection wasn’t just humbling — it underscored that while self-hosting is technically possible, it comes with major infrastructure and optimization demands that grow exponentially with model size.

✨ So Are Self-Hosted LLMs Worth It?

For startups eager to explore AI from the inside out, self-hosting models like DeepSeek-R1 is an exciting challenge — but one that comes with steep costs. We found ourselves dealing with slow generation speeds, instability at high context lengths, and memory bottlenecks even for smaller models.

What surprised us most? We weren’t deploying for global scale — just supporting under 100 internal users — not a massive workload by any means. Even so, infrastructure costs and engineering time quickly added up.

While self-hosting gives you unmatched control, privacy, and flexibility, the cost-benefit trade-off just doesn’t land in your favor — yet.

Still, the future looks promising: GPU access is improving, and small workstations like NVIDIA DIGITS are emerging that let developers prototype, fine-tune, and run inference on the latest generation of reasoning models from DeepSeek and others locally.

We’re excited to revisit this path soon — with more firepower and better context on where it makes sense.

