Flash workers have access to two types of storage: a container disk for temporary data and network volumes for persistent, sharable data.
Container disk
A container disk provides temporary storage that exists only while a worker is running. Each worker gets its own isolated container disk, with a default size of 64GB for GPU endpoints.
You can read and write temporary files to the container disk using standard filesystem operations from within @Endpoint functions.
Any file that is not written to a network volume (at /runpod-volume/) is written to the container disk, and will be erased when the worker stops.
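Because the container disk behaves like an ordinary local filesystem, standard Python file operations work as-is. The sketch below uses tempfile.gettempdir() as a stand-in for a scratch path; inside a worker, any path outside /runpod-volume/ lands on the container disk and disappears when the worker stops.

```python
import os
import tempfile

# Write and read an intermediate file on local (temporary) storage.
# Inside a Flash worker this would live on the container disk.
scratch = os.path.join(tempfile.gettempdir(), "flash-scratch.bin")
with open(scratch, "wb") as f:
    f.write(b"intermediate results")
with open(scratch, "rb") as f:
    data = f.read()
os.remove(scratch)  # optional: the file is erased with the worker anyway
```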
Configuring container disk size (GPU-only)
Configure container disk size for GPU endpoints using the template parameter (default: 64GB).
from runpod_flash import Endpoint, GpuType, PodTemplate
@Endpoint(
name="large-temp-storage",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
template=PodTemplate(containerDiskInGb=100)
)
async def process(data: dict) -> dict:
# 100GB container disk available
...
CPU auto-sizing
CPU endpoints automatically adjust container disk size based on instance limits:
CPU3G and CPU3C instances: vCPU count × 10GB (e.g., 2 vCPU = 20GB)
CPU5C instances: vCPU count × 15GB (e.g., 4 vCPU = 60GB)
If you specify a custom size that exceeds the instance limit, deployment will fail with a validation error.
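The auto-sizing rules above can be expressed as a small helper. This is a hypothetical function for illustration, not part of the runpod_flash API: 10GB per vCPU for CPU3G/CPU3C instances, 15GB per vCPU for CPU5C.

```python
# Per-vCPU container disk allocation (GB), per the CPU auto-sizing rules.
PER_VCPU_GB = {"CPU3G": 10, "CPU3C": 10, "CPU5C": 15}

def auto_disk_size_gb(instance_type: str, vcpus: int) -> int:
    """Container disk size (GB) a CPU endpoint receives automatically."""
    return vcpus * PER_VCPU_GB[instance_type]
```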
Network volumes
Network volumes provide persistent storage that survives worker restarts. Each volume is tied to a specific datacenter. Use volumes to share data between endpoint functions or to persist data between runs.
Attaching network volumes
Attach a network volume using the volume parameter. Flash uses the volume name to find an existing volume or create a new one. Specify the datacenter parameter to control where the volume is created:
from runpod_flash import Endpoint, GpuType, DataCenter, NetworkVolume
vol = NetworkVolume(name="model-cache", size=100, datacenter=DataCenter.US_GA_2)
@Endpoint(
name="persistent-storage",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
datacenter=DataCenter.US_GA_2,
volume=vol
)
async def process(data: dict) -> dict:
# Access files at /runpod-volume/
...
You can also reference an existing volume by ID:
vol = NetworkVolume(id="vol_abc123")
Multi-datacenter volumes
For endpoints deployed across multiple datacenters, pass a list of volumes (one per datacenter):
from runpod_flash import Endpoint, GpuType, DataCenter, NetworkVolume
volumes = [
NetworkVolume(name="models-us", size=100, datacenter=DataCenter.US_GA_2),
NetworkVolume(name="models-eu", size=100, datacenter=DataCenter.EU_RO_1),
]
@Endpoint(
name="global-inference",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
datacenter=[DataCenter.US_GA_2, DataCenter.EU_RO_1],
volume=volumes
)
async def process(data: dict) -> dict:
# Workers in each region access their local volume at /runpod-volume/
...
Only one network volume is allowed per datacenter. If you specify multiple volumes in the same datacenter, deployment will fail.
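A pre-flight check for this rule might look like the sketch below. This is a hypothetical helper, not a runpod_flash API, and the volumes are plain dicts for illustration; it mirrors the validation that deployment performs.

```python
from collections import Counter

def check_one_volume_per_datacenter(volumes: list[dict]) -> None:
    """Raise if any datacenter is assigned more than one volume."""
    counts = Counter(v["datacenter"] for v in volumes)
    duplicates = sorted(dc for dc, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"multiple volumes in the same datacenter: {duplicates}")
```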
Accessing network volume files
Network volumes mount at /runpod-volume/ and can be accessed like a regular filesystem:
from runpod_flash import Endpoint, GpuType, NetworkVolume
vol = NetworkVolume(name="model-storage")
@Endpoint(
name="model-server",
gpu=GpuType.NVIDIA_A100_80GB_PCIe,
volume=vol,
dependencies=["torch", "transformers"]
)
async def run_inference(prompt: str) -> dict:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model from network volume
# Persists across worker restarts and shared between workers
model_path = "/runpod-volume/models/llama-7b"
model = AutoModelForCausalLM.from_pretrained(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Run inference
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
text = tokenizer.decode(outputs[0])
return {"generated_text": text}
Load-balanced endpoints with storage
Load-balanced endpoints attach network volumes the same way, using the volume parameter:
from runpod_flash import Endpoint, GpuType, NetworkVolume
vol = NetworkVolume(name="model-storage")
api = Endpoint(
name="inference-api",
gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
volume=vol,
workers=(1, 5)
)
@api.post("/generate")
async def generate(prompt: str) -> dict:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/runpod-volume/models/gpt2")
# Generate text
return {"text": "generated"}
@api.get("/models")
async def list_models() -> dict:
import os
models = os.listdir("/runpod-volume/models")
return {"models": models}
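Because a network volume persists across restarts and is shared between workers, a common pattern is to download a model only when it is not already cached on the volume. The helper below is a hypothetical sketch (the names and the download callback are illustrative, not runpod_flash APIs); the mount path is a parameter so the same logic works with /runpod-volume/ inside a worker.

```python
import os

def ensure_model_cached(mount: str, model_name: str, download) -> str:
    """Download a model into the volume only if it is not already there.

    Later workers find the cached copy and skip the download entirely.
    """
    model_path = os.path.join(mount, "models", model_name)
    if not os.path.isdir(model_path):
        os.makedirs(model_path, exist_ok=True)
        download(model_path)  # e.g. fetch weights from a model hub
    return model_path
```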
Creating and managing network volumes
If you reference a volume by ID, it must already exist; when you pass a name, Flash finds or creates the volume for you. See Network volumes for detailed instructions on creating and managing volumes.