Kyutai, a France-based non-profit AI research lab, has launched Moshi, a prototype real-time multimodal AI model that lets users communicate naturally and expressively with an AI.
The model can recognize and express up to 70 different emotions and speaking styles, speak with various accents, and manage two audio streams concurrently, allowing it to listen and speak at the same time. Its end-to-end latency is below 200 milliseconds, it runs on consumer-grade hardware (including MacBooks), and it supports CUDA, Metal, and CPU backends.
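The dual-stream design amounts to full-duplex audio: input and output are serviced in the same processing tick rather than in alternating turns. The sketch below is a rough, hypothetical illustration of that pattern using Python's sounddevice library; it is not Kyutai's code (which has not been released), and `generate_reply`, the sample rate, and the block size are all assumptions chosen for the example.

```python
# Hypothetical sketch of full-duplex audio I/O of the kind a model like
# Moshi needs: one callback services both the input (listening) and the
# output (speaking) streams concurrently. generate_reply stands in for
# the model and is not part of any released Kyutai API.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24_000   # assumption: the article does not state Moshi's rate
BLOCK_SIZE = 1_920     # 80 ms per block at 24 kHz, well under 200 ms latency

def generate_reply(mic_block: np.ndarray) -> np.ndarray:
    """Placeholder for the model: emit silence while 'listening'."""
    return np.zeros_like(mic_block)

def callback(indata, outdata, frames, time, status):
    # indata: the user's audio stream (listening).
    # outdata: the model's audio stream (speaking), filled in the same
    # tick, so the two streams run concurrently rather than turn-by-turn.
    if status:
        print(status)
    outdata[:] = generate_reply(indata)

# sd.Stream opens a duplex stream: capture and playback share one callback.
with sd.Stream(samplerate=SAMPLE_RATE, blocksize=BLOCK_SIZE,
               channels=1, dtype="float32", callback=callback):
    print("Full-duplex session running for 10 seconds...")
    sd.sleep(10_000)
```

Small block sizes like the 80 ms used here are one common way such systems keep end-to-end latency low, since each block must be captured, processed, and played back within the latency budget.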
The lab plans to release the full codebase to support open AI research and development.