What does VoxCPM actually do?
VoxCPM2 matters because it combines several capabilities that are often split across separate TTS tools. It can take plain text and produce speech, but it can also accept a natural-language voice description before the text, use a reference clip for cloning, or use prompt audio plus its transcript for closer continuation. The concrete numbers shape the adoption decision: 2B parameters, 30 supported languages, 48 kHz output, more than 2 million hours of training data, and roughly 8 GB of VRAM listed in the model details.
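The four input modes described above are easier to compare when written as a single request shape. The sketch below is only an illustration: `SpeechRequest`, `select_mode`, the field names, and the precedence order are all hypothetical and are not VoxCPM's actual API. It shows how an application wrapping the model might decide which mode a given set of inputs implies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeechRequest:
    """Hypothetical request shape; field names are illustrative, not VoxCPM's API."""
    text: str                                 # the text to be spoken
    voice_description: Optional[str] = None   # natural-language description of the voice
    reference_audio: Optional[str] = None     # path to a clip used for cloning
    prompt_audio: Optional[str] = None        # path to audio to continue from
    prompt_transcript: Optional[str] = None   # transcript of prompt_audio


def select_mode(req: SpeechRequest) -> str:
    """Pick a synthesis mode from the inputs supplied.

    The precedence (continuation, then cloning, then voice design) is an
    assumption made for this sketch, not documented model behavior.
    """
    if not req.text.strip():
        raise ValueError("text must not be empty")
    # Continuation needs both the prompt audio and its transcript.
    if (req.prompt_audio is None) != (req.prompt_transcript is None):
        raise ValueError("prompt audio and its transcript must be supplied together")
    if req.prompt_audio is not None:
        return "continuation"
    if req.reference_audio is not None:
        return "cloning"
    if req.voice_description is not None:
        return "voice_design"
    return "plain_tts"
```

For example, a request with only `text` set resolves to `"plain_tts"`, while adding `reference_audio` resolves to `"cloning"`. A wrapper like this keeps the mode decision in one place before any call into the real library.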
The product is strongest for builders. A creator can try the Hugging Face Space or listen to the demo page, but the durable paths are Python, the CLI, the local web demo, fine-tuning, and serving. The repo documents commands for direct synthesis, voice design, reference cloning, batch generation, and deployment through NanoVLLM or vLLM-Omni. That puts VoxCPM closer to an open voice infrastructure layer than to a finished editing suite with folders, exports, permissions, and account billing.