Flash workers have access to two types of storage: a container disk for temporary data and network volumes for persistent, sharable data.
Container disk
A container disk provides temporary storage that exists only while a worker is running. Each worker gets its own isolated container disk, with a default size of 64GB for GPU endpoints.
You can read and write temporary files to the container disk using standard filesystem operations from within @Endpoint functions.
Any file that is not written to a network volume (at /runpod-volume/) is written to the container disk, and will be erased when the worker stops.
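Because the container disk behaves like an ordinary local filesystem, standard Python file operations work as-is. The sketch below uses tempfile.gettempdir() as a stand-in for a scratch path; inside a worker, any path outside /runpod-volume/ lands on the container disk and disappears when the worker stops.

```python
import os
import tempfile

# Write and read an intermediate file on local (temporary) storage.
# Inside a Flash worker this would live on the container disk.
scratch = os.path.join(tempfile.gettempdir(), "flash-scratch.bin")
with open(scratch, "wb") as f:
    f.write(b"intermediate results")
with open(scratch, "rb") as f:
    data = f.read()
os.remove(scratch)  # optional: the file is erased with the worker anyway
```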
Configuring container disk size (GPU-only)
Configure container disk size for GPU endpoints using the template parameter (default: 64GB).
from runpod_flash import Endpoint, GpuType, PodTemplate
@Endpoint(
name="large-temp-storage",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
template=PodTemplate(containerDiskInGb=100)
)
async def process(data: dict) -> dict:
# 100GB container disk available
...
CPU auto-sizing
CPU endpoints automatically adjust container disk size based on instance limits:
CPU3G and CPU3C instances: vCPU count × 10GB (e.g., 2 vCPU = 20GB)
CPU5C instances: vCPU count × 15GB (e.g., 4 vCPU = 60GB)
If you specify a custom size that exceeds the instance limit, deployment will fail with a validation error.
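The auto-sizing rules above can be expressed as a small helper. This is a hypothetical function for illustration, not part of the runpod_flash API: 10GB per vCPU for CPU3G/CPU3C instances, 15GB per vCPU for CPU5C.

```python
# Per-vCPU container disk allocation (GB), per the CPU auto-sizing rules.
PER_VCPU_GB = {"CPU3G": 10, "CPU3C": 10, "CPU5C": 15}

def auto_disk_size_gb(instance_type: str, vcpus: int) -> int:
    """Container disk size (GB) a CPU endpoint receives automatically."""
    return vcpus * PER_VCPU_GB[instance_type]
```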
Network volumes
Network volumes provide persistent storage that survives worker restarts. Each volume is tied to a specific datacenter. Use volumes to share data between endpoint functions or to persist data between runs.
Attaching network volumes
Attach a network volume using the volume parameter. Flash uses the volume name to find an existing volume or create a new one. Specify the datacenter parameter to control where the volume is created:
from runpod_flash import Endpoint, GpuType, DataCenter, NetworkVolume
vol = NetworkVolume(name="model-cache", size=100, datacenter=DataCenter.US_GA_2)
@Endpoint(
name="persistent-storage",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
datacenter=DataCenter.US_GA_2,
volume=vol
)
async def process(data: dict) -> dict:
# Access files at /runpod-volume/
...
You can also reference an existing volume by ID:
vol = NetworkVolume(id="vol_abc123")
Multi-datacenter volumes
For endpoints deployed across multiple datacenters, pass a list of volumes (one per datacenter):
from runpod_flash import Endpoint, GpuType, DataCenter, NetworkVolume
volumes = [
NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
]
@Endpoint(
name="global-inference",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1],
volume=volumes
)
async def process(data: dict) -> dict:
# Workers in each region access their local volume at /runpod-volume/
...
Only one network volume is allowed per datacenter. If you specify multiple volumes in the same datacenter, deployment will fail.
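A pre-flight check for this rule might look like the sketch below. This is a hypothetical helper, not a runpod_flash API, and the volumes are plain dicts for illustration; it mirrors the validation that deployment performs.

```python
from collections import Counter

def check_one_volume_per_datacenter(volumes: list[dict]) -> None:
    """Raise if any datacenter is assigned more than one volume."""
    counts = Counter(v["datacenter"] for v in volumes)
    duplicates = sorted(dc for dc, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"multiple volumes in the same datacenter: {duplicates}")
```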
Accessing network volume files
Network volumes mount at /runpod-volume/ and can be accessed like a regular filesystem:
from runpod_flash import Endpoint, GpuType, NetworkVolume
vol = NetworkVolume(name="model-storage")
@Endpoint(
name="model-server",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
volume=vol,
dependencies=["torch", "transformers"]
)
async def run_inference(prompt: str) -> dict:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model from network volume
# Persists across worker restarts and shared between workers
model_path = "/runpod-volume/models/llama-7b"
model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Run inference
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
text = tokenizer.decode(outputs[0])
return {"generated_text": text}
Load-balanced endpoints with storage
Load-balanced endpoints attach network volumes the same way, using the volume parameter:
from runpod_flash import Endpoint, GpuType, NetworkVolume
vol = NetworkVolume(name="model-storage")
api = Endpoint(
name="inference-api",
gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
volume=vol,
workers=(1, 5)
)
@api.post("/generate")
async def generate(prompt: str) -> dict:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/runpod-volume/models/gpt2")
# Generate text
return {"text": "generated"}
@api.get("/models")
async def list_models() -> dict:
import os
models = os.listdir("/runpod-volume/models")
return {"models": models}
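Because a network volume persists across restarts and is shared between workers, a common pattern is to download a model only when it is not already cached on the volume. The helper below is a hypothetical sketch (the names and the download callback are illustrative, not runpod_flash APIs); the mount path is a parameter so the same logic works with /runpod-volume/ inside a worker.

```python
import os

def ensure_model_cached(mount: str, model_name: str, download) -> str:
    """Download a model into the volume only if it is not already there.

    Later workers find the cached copy and skip the download entirely.
    """
    model_path = os.path.join(mount, "models", model_name)
    if not os.path.isdir(model_path):
        os.makedirs(model_path, exist_ok=True)
        download(model_path)  # e.g. fetch weights from a model hub
    return model_path
```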
Creating and managing network volumes
If you reference a volume by ID, it must already exist; when you pass a name, Flash finds or creates the volume for you. See Network volumes for detailed instructions on creating and managing volumes.