Cerebras Systems has introduced Cerebras Inference, a new service for running AI inference workloads. The service delivers 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, with pricing starting at 10 cents per million tokens.
Cerebras Inference maintains full 16-bit precision throughout inference, so its speed comes without an accuracy trade-off. The service is available in three tiers: Free, Developer, and Enterprise. The Developer tier provides an API endpoint for flexible, serverless deployment (a usage sketch follows below), while the Enterprise tier offers fine-tuned models, custom SLAs, and dedicated support.
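To illustrate what the Developer tier's serverless API looks like in practice, here is a minimal sketch that streams a chat completion through an OpenAI-compatible client. The base URL, model identifier, and environment variable name below are assumptions for illustration; consult Cerebras's API documentation for the actual values.

```python
# Minimal sketch: streaming a chat completion from an OpenAI-compatible
# endpoint. Endpoint URL, model name, and env var are assumed, not confirmed.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint URL
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model identifier
    messages=[{"role": "user", "content": "Summarize Cerebras Inference in one sentence."}],
    stream=True,  # stream tokens as they arrive to observe throughput
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Streaming is the natural fit here: at 1,800 tokens per second, responses arrive faster than most users can read, so latency-sensitive applications such as agents can act on partial output almost immediately.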
Cerebras claims its inference service is 20x faster than NVIDIA GPU-based solutions in hyperscale clouds and offers 100x higher price performance for AI workloads. According to the company, this speed enables developers to build next-generation AI applications that perform complex, multi-step tasks in real time, such as AI agents.
Analyst QuickTake: Cerebras Inference’s performance and lower costs pose a challenge to NVIDIA's solutions, potentially changing how AI inference is done. With a claimed 20x speed advantage and substantially lower costs, it may appeal to developers and businesses seeking more efficient and affordable AI deployment options.