1. Introduction: Quantifying the Pain

Fast, local, and high-quality Text-to-Speech (TTS) is becoming a necessity for developers and AI agents alike. However, existing solutions often come with a heavy reality: massive Python dependencies, complex PyTorch environments, and multi-gigabyte disk footprints just to synthesize a few seconds of audio. Cold-start times alone make them unsuitable for quick scripts or responsive agents.

Enter Kokoro v1.0. At just 82M parameters, it produces high-quality audio while remaining small enough for edge inference. To make the most of it, I built kokoro-cli, a native, cross-platform command-line interface that brings the model directly to your terminal.

This post briefly covers how I built kokoro-cli: the architectural decisions (including a necessary pivot away from Go), how it was designed to be "agent-ready" from day one, and how you can use it today.

2. The Journey: The G2P Challenge that Forced the Pivot to Rust

My initial attempt at building this CLI was in Go. Go is fantastic for building complex, statically compiled tools, and it is also my personal preference, since I can read and write it passably well. But I hit a pretty big roadblock: achieving true 1:1 phonetic parity with the official Python Kokoro pipeline.

Turning text into speech sounds is a multi-stage pipeline. Grapheme-to-Phoneme (G2P) conversion involves expanding numbers ("123" to "one hundred twenty-three"), handling homographs ("I read" vs "I will read"), and applying specific prosody rules. Reimplementing Kokoro's custom lexicon lookups and espeak-ng fallbacks in Go proved to be a maintenance nightmare.
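To give a feel for just one of those steps, here is a minimal sketch of number expansion in Rust. It is purely illustrative: spell_number is a hypothetical helper that only covers 0 to 999, and it is not how Kokoro's G2P engine normalizes text, which also has to deal with years, currency, ordinals, and more.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
];

/// Spell out an integer in 0..=999 (hypothetical helper, illustration only).
fn spell_number(n: u16) -> Option<String> {
    if n > 999 {
        return None; // out of scope for this sketch
    }
    let mut parts: Vec<String> = Vec::new();
    if n >= 100 {
        parts.push(format!("{} hundred", ONES[usize::from(n / 100)]));
    }
    let rem = usize::from(n % 100);
    if rem >= 20 {
        let tens = TENS[rem / 10];
        if rem % 10 == 0 {
            parts.push(tens.to_string());
        } else {
            // Compound numbers such as 23 are hyphenated: "twenty-three".
            parts.push(format!("{}-{}", tens, ONES[rem % 10]));
        }
    } else if rem > 0 || n == 0 {
        // 1..=19 are irregular; a bare 0 is read as "zero".
        parts.push(ONES[rem].to_string());
    }
    Some(parts.join(" "))
}
```

Even this toy version needs special cases for the teens, hyphenation, and zero, which hints at why a full reimplementation gets expensive quickly.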

Lesson: Always check if the ML model's tokenizer/G2P has a native port before trying to reimplement it from scratch.

When I looked beyond my Go preference, I discovered misaki-rs, a native Rust port of the official phonetic engine used by Kokoro. Pivoting to Rust allowed me to achieve 100% phonetic parity with the Python generation scripts, including support for inline phoneme overrides (e.g., [Kokoro](/kˈOkəɹO/)), while delivering a fast, natively compiled binary.
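The override syntax is easy to picture as a small splitting pass over the input. The sketch below is an assumption about how such markup could be separated from plain text; Segment and split_overrides are hypothetical names, and misaki-rs may well handle this differently.

```rust
/// One piece of the input: plain text, or a word with explicit phonemes.
#[derive(Debug, PartialEq)]
enum Segment {
    Text(String),
    Override { word: String, phonemes: String },
}

/// Split "[word](/phonemes/)" overrides out of otherwise plain text.
/// Hypothetical helper for illustration, not the misaki-rs implementation.
fn split_overrides(input: &str) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut text = String::new(); // plain text collected so far
    let mut rest = input;
    while let Some(open) = rest.find('[') {
        text.push_str(&rest[..open]);
        let body = &rest[open + 1..];
        // Look for "](/" and then the closing "/)".
        let hit = body.find("](/").and_then(|mid| {
            let tail = &body[mid + 3..];
            tail.find("/)").map(|end| (mid, end))
        });
        match hit {
            Some((mid, end)) if !body[..mid].contains('[') => {
                if !text.is_empty() {
                    out.push(Segment::Text(std::mem::take(&mut text)));
                }
                out.push(Segment::Override {
                    word: body[..mid].to_string(),
                    phonemes: body[mid + 3..mid + 3 + end].to_string(),
                });
                rest = &body[mid + 3 + end + 2..];
            }
            _ => {
                // Not a complete override: keep the '[' as literal text.
                text.push('[');
                rest = body;
            }
        }
    }
    text.push_str(rest);
    if !text.is_empty() {
        out.push(Segment::Text(text));
    }
    out
}
```

Text segments would then go through the normal G2P path, while override segments would skip it and pass their phonemes straight to the model's tokenizer.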

3. Architectural Highlights

Combining Rust with the Kokoro model made some interesting optimizations possible under the hood.

[Figure: Kokoro CLI Architecture]

Direct ONNX Bindings

I used the ort crate (v2.0+), which provides Rust bindings for ONNX Runtime, to execute the model.onnx graph directly, with no Python interpreter or PyTorch wrapper in the loop.

Zero-Copy Voice Embedding Loading

CLIs need to be fast, and users expect instant execution. The voices.bin file containing the speaker embeddings is 27MB, so instead of loading and parsing all of it at startup, I implemented a zero-copy approach.

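The CLI computes the exact byte offset, (voice_id * 522_240) + (style_index * 256 * 4), and streams only the required 1024 bytes (256 floats) directly into the ONNX tensor. Each voice therefore occupies 522,240 bytes, which is 510 style vectors of 1024 bytes each, and the cost of loading the full file never shows up in startup latency.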
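The offset arithmetic can be sketched as follows. style_offset and read_style are hypothetical names, the constants come from the formula in the text, and little-endian f32 storage is an assumption. Note that this simplified version copies the bytes into a Vec<f32>, so it shows the seek-and-read idea rather than a strictly zero-copy path.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Bytes per voice block, taken from the formula in the text.
const VOICE_STRIDE: u64 = 522_240;
/// Floats per style vector.
const STYLE_DIM: usize = 256;
/// Bytes per f32 value.
const F32_BYTES: u64 = 4;

/// Byte offset of one style vector inside voices.bin.
fn style_offset(voice_id: u64, style_index: u64) -> u64 {
    voice_id * VOICE_STRIDE + style_index * STYLE_DIM as u64 * F32_BYTES
}

/// Seek to the vector and decode 256 f32 values (little-endian is assumed).
fn read_style<R: Read + Seek>(
    src: &mut R,
    voice_id: u64,
    style_index: u64,
) -> io::Result<Vec<f32>> {
    src.seek(SeekFrom::Start(style_offset(voice_id, style_index)))?;
    let mut buf = [0u8; STYLE_DIM * 4]; // exactly 1024 bytes
    src.read_exact(&mut buf)?;
    Ok(buf
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}
```

Because read_style is generic over Read + Seek, it accepts a std::fs::File for the real voices.bin or an in-memory std::io::Cursor in tests. A memory-mapped file would be one way to avoid the intermediate copy.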

Native Hardware Acceleration and Suppressing Noise

I do most of my development on a macOS laptop, so I added an optional mac-acceleration feature flag that taps into ort's CoreMLExecutionProvider. This lets CoreML run inference on the Apple Neural Engine (ANE) and GPU.

However, macOS notoriously dumps harmless but noisy CoreAnalytics messages to stderr during CoreML execution. To keep the clean output expected of a CLI, I used the shh crate to capture and suppress that output.

4. Designing for Agents: The Agent-Ready CLI

Modern CLIs serve a dual audience: human developers and automated coding agents or LLM-driven pipelines. Based on CLI Design Best Practices, I designed kokoro-cli to be fundamentally Agent-Ready.

  • Deterministic Output for Parsing: Every command that produces data supports a --json or --no-tui flag. Agents can reliably parse kokoro-cli voices --json into internal structures without screen-scraping text tables.
  • Environment Overrides: The CLI honors variables like NO_COLOR, so automated systems can easily turn off color and other visual decoration.
  • Structured Discoverability: I implemented distinct command groups and semantic help text. The CLI explicitly guides users (and agents exploring --help) through the configuration steps, minimizing cognitive load.
  • Standardized Composable Piping: True to the Unix philosophy, kokoro-cli accepts standard input. An agent can seamlessly pipe an LLM's text output directly into speech:
    echo "Action completed." | kokoro-cli speak -v af_bella > out.wav
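
For an agent written in Rust, the same pipe can be driven with the standard library alone. This is a sketch under assumptions: pipe_text is a hypothetical helper, it relies only on the speak -v usage shown above, and it suits short inputs (a very large input would call for writing stdin from a separate thread to avoid a pipe deadlock).

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Run `program args...`, write `text` to its stdin, and return its stdout bytes.
/// Hypothetical helper for illustration.
fn pipe_text(program: &str, args: &[&str], text: &str) -> io::Result<Vec<u8>> {
    let mut child = Command::new(program)
        .args(args)
        .env("NO_COLOR", "1") // ask for plain, uncolored output
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    // Write the text; the temporary handle is dropped at the end of the
    // statement, so the child sees end-of-input.
    child
        .stdin
        .take()
        .expect("stdin was configured as piped")
        .write_all(text.as_bytes())?;
    let output = child.wait_with_output()?;
    if !output.status.success() {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{} exited with {}", program, output.status),
        ));
    }
    Ok(output.stdout)
}
```

Calling pipe_text("kokoro-cli", &["speak", "-v", "af_bella"], "Action completed.") and writing the returned bytes to out.wav would mirror the shell one-liner.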

5. How to Get It & Use It

kokoro-cli is published on crates.io and is heavily optimized for speed and simplicity.

Installation:

cargo install kokoro-cli

# For macOS users wanting ANE/GPU acceleration:
cargo install kokoro-cli --features mac-acceleration

Initial Setup:
The CLI respects the XDG Base Directory specification. Downloading the model and voices takes one command:

kokoro-cli setup
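
For readers unfamiliar with XDG, the lookup rule for a data directory is roughly the following. Whether kokoro-cli stores its files under the data, cache, or config directory is not stated here, so the choice of XDG_DATA_HOME and the kokoro-cli subdirectory name are assumptions, and xdg_data_dir is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// Resolve a per-application data directory following the XDG rules:
/// use $XDG_DATA_HOME when it is set to an absolute path, otherwise
/// fall back to $HOME/.local/share. Hypothetical helper for illustration.
fn xdg_data_dir(xdg_data_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_data_home {
        // Empty or relative values are invalid under the spec and are ignored.
        Some(dir) if Path::new(dir).is_absolute() => PathBuf::from(dir),
        _ => {
            let home = home.filter(|h| !h.is_empty())?;
            Path::new(home).join(".local").join("share")
        }
    };
    // The subdirectory name is an assumption, not confirmed by the CLI.
    Some(base.join("kokoro-cli"))
}
```

In a real program the two arguments would come from std::env::var("XDG_DATA_HOME").ok() and std::env::var("HOME").ok(); taking them as parameters keeps the function easy to test.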

Discovering Voices:
Explore the 54 built-in voices across 8 languages.

kokoro-cli voices --language "English (US)"

Generating Speech:
Pass text directly, or pipe it in!

kokoro-cli speak "Hello, world! This is a test of the CLI." --voice af_bella --out test.wav

6. What's Next: Evaluation Pipelines

With the foundational engine complete, the next logical step is Evaluation. Currently, developers blindly prompt TTS models and hope the output sounds right.

My next goal (and the project's roadmap) involves building an automated Speech Evaluation Pipeline on top of this CLI. By integrating programmatic metrics (Pitch Variability, Speaking Rate) and subjective quality proxies (automated MOS scoring via NISQA, intelligibility via Moonshine), kokoro-cli will transition from a standalone generation tool into a CI/CD-ready evaluation suite. I anticipate running an LLM's output through kokoro-cli in a GitHub Action, and failing the build if the Intelligibility score drops below 0.8.
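
A CI gate of that kind could be as small as the sketch below. Both functions are hypothetical and none of this exists in kokoro-cli yet; the intelligibility score is assumed to come from a separate ASR-based step and to be a value between 0 and 1.

```rust
/// Words per minute for a transcript spoken over `seconds` of audio.
/// Returns None when the duration is not a positive number.
fn speaking_rate_wpm(transcript: &str, seconds: f64) -> Option<f64> {
    if !(seconds > 0.0) {
        return None; // also rejects NaN durations
    }
    let words = transcript.split_whitespace().count() as f64;
    Some(words * 60.0 / seconds)
}

/// Exit code for a CI step: 0 when the score meets the threshold, 1 otherwise.
fn gate_exit_code(intelligibility: f64, threshold: f64) -> i32 {
    // A NaN score compares false and is therefore treated as a failure.
    if intelligibility >= threshold {
        0
    } else {
        1
    }
}
```

A workflow step would pass the result of gate_exit_code(score, 0.8) to std::process::exit, so that any score below 0.8 fails the build.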

7. Conclusion

By combining the lightweight, high-quality Kokoro model with Rust's performance and a rigorous agent-first design philosophy, kokoro-cli provides a reliable, fast audio engine that integrates cleanly into existing software pipelines.

Check out the kokoro-cli repository, submit a PR, and start building agentic audio pipelines today!