Crusoe Managed Inference

Breakthrough inference speed is here

Achieve up to 9.9x faster time-to-first-token*

Process up to 5x more tokens per second*

Optimal price-performance.
No limits.

Run model inference with fast time-to-first-token, low latency, limitless throughput, and resilient scaling.

Eliminate latency with Crusoe's MemoryAlloy technology.

Scale to more users while maintaining consistent low latency.

Reduce token spend and serve more users without hitting capacity limits.

* Benchmarked against vLLM for the Llama-3.3-70B model. Read our blog for details.

Crusoe's inference engine is powered by MemoryAlloy™ technology, a unique cluster-native memory fabric that enables persistent sessions and intelligent request routing.


Model catalog

Experiment with top open-source models, or work with our team to optimize performance for your own fine-tuned model.
Kimi-K2-Thinking
Input price: $0.60 / 1M tokens
Output price: $2.50 / 1M tokens
Cached token price: $0.30 / 1M tokens
Context length: 131,072

gpt-oss-120b
Input price: $0.15 / 1M tokens
Output price: $0.60 / 1M tokens
Cached token price: $0.08 / 1M tokens
Context length: 131,072

DeepSeek R1 0528
Input price: $1.35 / 1M tokens
Output price: $5.40 / 1M tokens
Cached token price: $0.68 / 1M tokens
Context length: 163,840

Qwen3 235B A22B Instruct 2507
Input price: $0.22 / 1M tokens
Output price: $0.80 / 1M tokens
Cached token price: $0.11 / 1M tokens
Context length: 262,144

Llama 3.3 70B Instruct
Input price: $0.25 / 1M tokens
Output price: $0.75 / 1M tokens
Cached token price: $0.13 / 1M tokens
Context length: 131,072

DeepSeek V3 0324
Input price: $0.50 / 1M tokens
Output price: $1.50 / 1M tokens
Cached token price: $0.25 / 1M tokens
Context length: 163,840

Gemma 3 12B
Input price: $0.08 / 1M tokens
Output price: $0.30 / 1M tokens
Cached token price: $0.04 / 1M tokens
Context length: 131,072

Decart MirageLSD
Input price: -
Output price: $0.25 / video sec
Context length: 131,072
Frame rate: 30 fps
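To make the per-1M-token pricing concrete, the arithmetic below estimates the cost of a single request against Llama 3.3 70B Instruct using the rates listed above; the request sizes are invented for the example:

```python
# Per-1M-token rates for Llama 3.3 70B Instruct, taken from the catalog above.
INPUT_RATE = 0.25    # USD per 1M input tokens
OUTPUT_RATE = 0.75   # USD per 1M output tokens
CACHED_RATE = 0.13   # USD per 1M cached input tokens

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request; cached prompt tokens bill at the cached rate."""
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# Example: 100k-token prompt, 80k of it served from cache, 2k-token completion.
print(f"${request_cost(100_000, 2_000, cached_tokens=80_000):.4f}")  # → $0.0169
```

Serving 80% of the prompt from cache cuts the input cost of this request roughly in half, which is where the "reduce token spend" claim above comes from.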

Bring your own
fine-tuned model

Built with cutting-edge technology to deliver unmatched performance

1. Breakthrough speed

Achieve up to 9.9x faster time-to-first-token* for real-world workloads with our inference engine featuring Crusoe's MemoryAlloy technology, a cluster-wide KV cache that eliminates duplicate prefills.
2. Superior throughput

Process up to 5x more tokens per second* while maintaining low latency for each user, using speculative decoding and dynamic batching.
3. Seamless scaling

Meet changing workload demands with scaling that is managed for you and remains reliable even when loading the largest models.
* Benchmarked against vLLM for the Llama-3.3-70B model. Read our blog for details.
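MemoryAlloy itself is proprietary, but the "eliminates duplicate prefills" idea behind point 1 can be sketched in miniature: if the state computed for a prompt prefix is cached and shared, a second request with the same prefix only pays for its suffix. Everything below — the string-based "KV state" and the dict cache — is a toy stand-in to show the accounting, not Crusoe's implementation:

```python
# Conceptual sketch (not Crusoe's implementation): a shared prefix cache that
# lets requests with a common prompt prefix skip duplicate prefill work.
prefill_cache = {}   # prompt prefix -> precomputed "KV state" (a string stand-in)
prefill_calls = 0    # counts per-token prefill work actually performed

def prefill(prompt: str) -> str:
    """Return state for `prompt`, computing only the portion not already cached."""
    global prefill_calls
    # Find the longest prompt prefix already in the cache.
    for cut in range(len(prompt), 0, -1):
        if prompt[:cut] in prefill_cache:
            state = prefill_cache[prompt[:cut]]
            break
    else:
        cut, state = 0, ""
    # "Compute" state for the uncached suffix only, caching as we go.
    for i in range(cut, len(prompt)):
        prefill_calls += 1
        state = state + prompt[i]
        prefill_cache[prompt[:i + 1]] = state
    return state

prefill("system prompt + doc")    # 19 units of prefill work (nothing cached yet)
prefill("system prompt + query")  # only 5 more: the shared prefix is reused
```

A naive engine would redo all 40 units of work; the cache brings it to 24, and the saving grows with the length of the shared system prompt.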

Crusoe inference engine vs. vLLM (Llama-3.3-70B model, 4-node deployment): 9.9x faster time-to-first-token (TTFT) and 5.0x higher throughput.
Optimizing for throughput and price is critical for our product experience. We're excited to explore the performance benefits that Crusoe's Inference Engine provides, and are looking forward to serving our models through the service.
Dhruv Batra
Co-founder & Chief Scientist
This is the kind of foundational technology that will enable our customers to build and deploy far more powerful and responsive AI agents with confidence.
Roey Lalazar
Co-founder & CTO
We need to process complex records instantly. Crusoe Managed Inference helps us meet that challenge. It provides a reliable path to production at a pace we haven’t seen on other platforms.
Grant Jensen
Co-Founder & CEO

Crusoe Intelligence Foundry,
designed for AI developers

Speed up app development with a unified hub that accelerates model discovery and experimentation, supports quick iteration, and removes the burden of managing infrastructure.

API keys for the fastest path to production

Experiment with top open-source models rapidly. Generate API keys, monitor performance metrics, and enable provisioned throughput for production-scale deployments.
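As a sketch of that flow, the snippet below assembles an authenticated chat-completion request. The endpoint URL, header names, and model identifier are illustrative placeholders assumed for the example, not Crusoe's documented API:

```python
import json

# Placeholders -- substitute the endpoint and API key generated in the
# Crusoe Intelligence Foundry console.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"

def build_request(model: str, prompt: str) -> tuple[dict, str]:
    """Assemble bearer-auth headers and a JSON chat-completion payload."""
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return headers, json.dumps(payload)

headers, body = build_request("llama-3.3-70b-instruct", "Summarize this record.")
# Send with any HTTP client, e.g. requests.post(API_URL, headers=headers, data=body)
```

Because the shape is a standard chat-completion request, existing client code can typically be pointed at a managed endpoint by swapping the base URL and key.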

Managed endpoints
for rapid deployment

Leverage fully managed endpoints powered by our inference engine, with Crusoe's MemoryAlloy technology, tuned specifically to each model for optimized performance.

Unified interface for
cross-team collaboration

Users working across teams can easily switch between the Crusoe Intelligence Foundry for inference tasks and the Crusoe Cloud Console for infrastructure-as-a-service (IaaS) resources within a single, integrated environment.

Frequently asked questions

Are you ready to build something amazing?