Solution Architecture

Prepare your data for the speed of thought

Don’t let your data hold back your ambition. Crusoe Cloud’s high-throughput, cost-effective system eliminates bottlenecks and hidden fees.

Data processing on Crusoe Cloud

Crusoe Cloud is purpose-built to meet the massive demands of AI, featuring a cost-efficient storage ecosystem with zero ingress and egress fees.

Optimize your data pipeline with Crusoe Cloud

Ingest & land

Streamline ingestion from external databases, APIs, or object stores (Amazon S3 or similar) using tools like Apache NiFi, Airbyte, or parallelized rclone. Raw data lands directly in a high-throughput namespace on Crusoe Shared Disks, bypassing slow staging layers.
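To make the parallel landing step concrete, here is a minimal sketch in plain Python. It is not a Crusoe API or an rclone wrapper: the function names, the worker count, and any paths you pass in are assumptions for illustration. It copies a source tree into a landing directory with a thread pool and returns a checksum manifest that later stages can verify against.

```python
import hashlib
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def land_file(source: Path, source_root: Path, landing_root: Path) -> tuple[str, str]:
    """Copy one file into the landing area, preserving its relative path."""
    relative = source.relative_to(source_root)
    target = landing_root / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return relative.as_posix(), sha256_of(target)


def land_tree(source_root: Path, landing_root: Path, workers: int = 16) -> dict[str, str]:
    """Copy every file under source_root in parallel; return {relative path: sha256}."""
    # The file list is materialized before copying starts (hypothetical helper, not a Crusoe tool).
    files = [p for p in source_root.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: land_file(p, source_root, landing_root), files)
        return dict(results)
```

In practice the landing root would be a directory on a mounted Shared Disk (for example a hypothetical /mnt/shared/landing), and a tool such as rclone or Airbyte would replace the copy loop. The manifest idea carries over either way.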

Profile & clean

Automate data quality checks and sanitization early in the pipeline. Run distributed jobs that profile and validate data with tools such as Great Expectations or ydata-profiling, then normalize encodings, strip markup, deduplicate records, and redact PII to ensure downstream compliance.
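The cleaning operations named above can be sketched with the Python standard library alone. This is an illustrative assumption, not the Great Expectations or ydata-profiling API: the function names are hypothetical, the tag stripping is a naive regular expression, and only email addresses are redacted, whereas real PII coverage is much broader.

```python
import hashlib
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive markup matcher, fine for a sketch
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def clean_record(raw: bytes) -> str:
    """Decode, normalize, strip markup, redact emails, and collapse whitespace."""
    text = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    text = unicodedata.normalize("NFC", text)     # one canonical form per character
    text = TAG_RE.sub(" ", text)                  # drop tags, keep their text content
    text = html.unescape(text)                    # "&amp;" -> "&"
    text = EMAIL_RE.sub("[EMAIL]", text)          # redact one class of PII only
    return WHITESPACE_RE.sub(" ", text).strip()


def dedupe(records: list[str]) -> list[str]:
    """Drop exact duplicates by content hash, keeping first occurrences in order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Running the cleaner before deduplication matters: records that differ only in markup or Unicode form collapse to the same hash once they are normalized.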

Transform & structure

Execute heavy transformations on Crusoe Managed Kubernetes (CMK) using frameworks like Apache Spark or KubeRay. The Crusoe CSI driver mounts Shared Disks directly into pods for high-speed I/O, allowing you to handle segmentation, MinHash deduplication, and conversion into training-ready formats (e.g., Parquet, WebDataset, TFRecords) at scale.
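For readers unfamiliar with MinHash deduplication, the following is a small single-machine sketch of the idea, written under stated assumptions: word-level shingles, 64 salted BLAKE2b hashes, and a greedy pairwise comparison. All function names and defaults are hypothetical. A Spark or Ray job at scale would typically add locality-sensitive hashing instead of comparing every pair.

```python
import hashlib


def shingles(text: str, size: int = 3) -> set[str]:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy O(n^2) filter: keep a document only if it is not too similar to any kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The signature length trades accuracy for cost: more hashes tighten the similarity estimate but increase compute and storage per document.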

Validate & publish

Finalize curated datasets for immediate use. After validation (via tools like Jupyter), versioned data is published back to Shared Disks. Because compute and storage are integrated, the data is instantly available for high-speed streaming into downstream model training jobs with zero copying required.
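One way to picture versioned publishing without copying is a rename within the same filesystem plus a small pointer file. The sketch below is an assumption about how such a step could look, not Crusoe's implementation: the MANIFEST.json and LATEST file names, the directory layout, and the function name are all hypothetical.

```python
import hashlib
import json
import os
from pathlib import Path


def publish_version(staging_dir: Path, publish_root: Path, version: str) -> Path:
    """Move a validated staging directory into place as an immutable version."""
    target = publish_root / version
    if target.exists():
        raise FileExistsError(f"version {version} is already published")

    # Record a checksum per file so training jobs can verify what they read.
    # read_bytes() loads each file fully; a real job would stream large files.
    manifest = {
        p.relative_to(staging_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(staging_dir.rglob("*"))
        if p.is_file()
    }
    if not manifest:
        raise ValueError("refusing to publish an empty dataset")
    (staging_dir / "MANIFEST.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))

    publish_root.mkdir(parents=True, exist_ok=True)
    # A rename moves no data when both paths are on the same filesystem.
    os.rename(staging_dir, target)

    # Repoint LATEST by writing a temp file and swapping it in with one replace call.
    pointer_tmp = publish_root / "LATEST.tmp"
    pointer_tmp.write_text(version + "\n")
    os.replace(pointer_tmp, publish_root / "LATEST")
    return target
```

Training jobs can then read the LATEST pointer (or pin an explicit version) and stream files from the published directory directly.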

Crusoe benefits for data preparation

Crusoe Cloud delivers the infrastructure and tools needed to build a standardized, scalable framework for ingesting, transforming, and governing enterprise data, giving you a clean, reliable pipeline for downstream analysis.

1. Eliminate I/O bottlenecks

Leverage petabyte-scale NFS backed by VAST Data. Our storage layer delivers hundreds of GB/s in aggregate bandwidth, so you can keep compute resources saturated during intensive data preprocessing and checkpointing.

2. Accelerate data engineering

Speed up compute-intensive transformations by deploying tools like Apache Spark, KubeRay, and Kafka directly on Crusoe Managed Kubernetes. Use GPU-accelerated instances for high-throughput tokenization and feature engineering that traditional CPU clusters can’t match.

3. Predictably scale without penalties

Our low-latency networking minimizes congestion through an AI-first network backbone, using a hybrid full-mesh iBGP design and MPLS-TE for real-time adaptation to traffic surges. Move data freely with zero ingress or egress fees, eliminating the hidden costs and budget volatility of scaling.

4. Deliver enterprise-grade compliance

Secure your pipeline within a centralized, auditable framework. Our platform is SOC 2 Type II, ISO 27001, and ISO 42001 certified, ensuring your data privacy and AI management systems meet the most rigorous global enterprise standards.

Crusoe data processing solutions at a glance

Feature: Using Crusoe data processing services
Who it’s for: Data engineers, data architects, ML engineers
Setup time: Seconds to minutes
Scaling: Distributed*
Storage options:
  • Persistent Disks (block storage)
    High availability and durability. Lifecycle independent of any VM.
  • Ephemeral Disks (block storage)
    Highest performance, no redundancy. Lifecycle tied to the VM; data is erased on stop, restart, or hardware failure. AES-XTS encrypted.
  • Object Storage
    S3-compatible API. Co-located with VMs for low latency. Supports versioning, object lock, multipart upload, and pre-signed URLs.
  • Shared Disks
    High scalability, shared access across many VMs via NFS. Data preserved until explicitly deleted.
Data protection: Your data stays yours and stays encrypted, with snapshots and backups available across all durable storage volumes.
Data transfer fees: Zero ingress or egress fees
Compliance: SOC 2 Type II, ISO 27001, ISO 42001 (see the Crusoe Trust Center)

* Distributed scaling applies to Object Storage and Shared Disks. Persistent Disks and Ephemeral Disks are provisioned on a per-VM basis.