Google Cloud announced a preview of NVIDIA L4 GPU support for its Cloud Run serverless platform. The feature allows organizations to run serverless AI inference, with users paying only for the GPU resources they actually use.
The GPU-enabled Cloud Run instances can support various AI frameworks and models, including NVIDIA NIM, vLLM, PyTorch, and Ollama. Each instance can attach one NVIDIA L4 GPU, providing up to 24 GB of VRAM. For optimal performance, Google recommends models of up to 13 billion parameters.
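Deploying a GPU-backed service follows the standard Cloud Run workflow with additional GPU flags. The following is a minimal sketch assuming the gcloud beta CLI; the service name, image, region, and resource sizes are illustrative, and preview flag names may change:

```shell
# Deploy a Cloud Run service with one NVIDIA L4 GPU (preview).
# Region is illustrative; GPU support launched in select regions
# such as us-central1. GPU services require CPU always allocated.
gcloud beta run deploy ollama-gemma \
  --image=ollama/ollama \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --max-instances=1
```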
Google Cloud claims this integration enables real-time inference with lightweight open models, serving of custom fine-tuned GenAI models, and acceleration of compute-intensive services. The company reports cold start times of 11 to 35 seconds depending on the model, which it cites as evidence of the platform's responsiveness.
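Because instances scale to zero, the first request after an idle period incurs the cold start the company cites. One way to observe this is to time a request against the service URL; the sketch below assumes an Ollama-style API and model tag, both illustrative:

```shell
# Time a generation request; the first call after scale-to-zero
# includes instance cold start plus model load.
# SERVICE_URL is the HTTPS URL printed by `gcloud run deploy`.
curl -w "\ntotal: %{time_total}s\n" "$SERVICE_URL/api/generate" \
  -d '{"model": "gemma2:9b", "prompt": "Hello", "stream": false}'
```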