Multimodal Ground Truth: Building AudioVoxBench

A dark-themed user interface displaying two audio track cards by user ghchinoy. The left card, titled 'Purr Echo' with an image of kittens, is tagged 'lyria-3-clip-preview'. The right card, titled 'Code & Coffee' with an image of a coding setup, is tagged 'lyria-3-pro-preview'. Both cards display durations and standard playback controls.
A screenshot of two tracks on Musicbox - the first is a track created from an image of our litter of foster kittens ❤️ Lyria 3 multimodal.

In building the MacOS companion app for Musicbox, a multimodal application that pairs Lyria 3 audio tracks with Gemini-generated visuals, I decided that I'd like to experiment on more than just text search for finding the "right" tracks for the right mood. The Gemini Embedding 2 natively multimodal model promises a single model that "maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages." I wanted to test how well this works - searching via text makes sense, but what if I gave it an image, or hummed a tune, would that provide more relevant results?

The initial promise of multimodal models was simple: "Just embed everything." Conventional wisdom suggests that more data is better. If you have audio and artwork, you should provide both to the model to avoid the complexity of 2024-style RAG pipelines.

My benchmarks show that this assumption is incomplete—and the crossover point happens sooner than expected.

I found that while text-augmented search is perfect for small sets, it hits a "saturation point" much earlier than I anticipated. I built AudioVoxBench to determine the exact trade-off between descriptive text and raw media entropy as a catalog scales.

Indexing Strategies

I tested five strategies to identify which data points provide the strongest search signal:

Strategy A (Baseline): Original generation prompt.
Strategy B (Augmented): Prompt and an AI-generated visual caption.
Strategy C (Semantic): Prompt, caption, and technical quality score (MOS).
Strategy D (Multimodal): Prompt and raw track image.
Strategy E (Full-Spectrum): Prompt, caption, raw image, and raw audio.

I evaluated these using Mean Reciprocal Rank (MRR). In a discovery application, the accuracy of the top result is the primary metric. MRR rewards the "bullseye": a score of 1.0 for rank #1, and 0.5 for rank #2. For Musicbox, MRR is the most effective way to measure how well the system identifies the exact track a user is looking for.

Real-World Stress Test: The Production Ingestor

Theory is one thing; production data is another. To verify our findings, we built a TrackIngestor that harvests real history from our Musicbox Firestore database. When we moved from our 3-track "Golden Set" to a real production database of 60 tracks, the results pivoted.

Strategy	MRR (Production DB)
A (Baseline)	0.3595
C (Semantic Text)	0.5162
D (Multimodal Image)	0.6429
E (Full-Spectrum)	0.4286

Searching for a "Stormy Mountain" vibe using an image probe:

Strategy C (Text) correctly matched the track "Symphony of Shred" but with a distance of 1.1308.
Strategy D (Multimodal) matched the same track with a much higher confidence distance of 0.8831.

This is a critical finding. While text-augmented search (Strategy C) is accurate enough to find the right ID in a small library, it lacks the "semantic depth" of Strategy D. The interleaved raw image provided the model with the missing context needed to identify the vibe with 22% higher mathematical confidence.

When we ran our Hold-out Probes (media assets the model had never seen) against a database of 60 real production tracks, the results revealed a hidden dimension: The Confidence Gap.

The Confidence Gap

The production results revealed a hidden dimension: The Confidence Gap.

Searching for a "Stormy Mountain" vibe using an image probe:

Strategy C (Text) correctly matched the track "Symphony of Shred" but with a distance of 1.1308.
Strategy D (Multimodal) matched the same track with a much higher confidence distance of 0.8831.

This is a critical finding. Strategy D matched the track with 22% higher mathematical confidence. While text is enough for a small library, it lacks the "semantic depth" provided by the raw image data.

The Scaling Threshold

Strategy C is the most efficient choice for a tiny catalog, but it hits a wall at just 60 tracks. At this Pivot Point, the Confidence Gap becomes the defining factor.

Conceptual Visualization: Strategy C (Text) provides high initial precision but degrades as the library's semantic density increases. Strategy D (Multimodal Image) crosses over at ~60 tracks, using visual entropy to distinguish between tracks that text alone can no longer separate.

For now, Musicbox is transitioning to Strategy D for its core index. It provides the necessary semantic depth to handle growing catalogs without the "acoustic jitter" that currently plagues full-spectrum audio indexing.

Implementation Details

To keep iteration cycles fast, I decoupled the benchmark from the main macOS app. I implemented a native Swift CLI suite that handles asset generation via Lyria 3 and vector indexing using sqlite-vec.

I chose sqlite-vec over dedicated vector databases for three reasons:

Local Execution: It runs inside SQLite. This eliminates network latency and server dependencies for the macOS app.
Relational Integration: By using sqlite-vec, vector search becomes a standard SQL operation. We can join a vector similarity search with a relational WHERE user_id = ? clause in a single ACID-compliant transaction. There is no external API synchronization or complex "RAG pipeline"—just a single multimodal model and a local database.
Swift Integration: It is a single C file that integrates directly with Swift's C-interop and the GRDB library.

The official Firebase AI Swift SDK currently lacks the :embedContent method required for multimodal embeddings. For this, going directly to the API works well - I implemented a manual REST client to interact with the Gemini API directly:

private func getDeveloperAPIEmbedding(for parts: [ContentPart], apiKey: String) async throws -> [Float] {
    let modelName = "gemini-embedding-2-preview"
    let urlString = "https://generativelanguage.googleapis.com/v1beta/models/\(modelName):embedContent?key=\(apiKey)"
    
    var request = URLRequest(url: URL(string: urlString)!)
    request.httpMethod = "POST"
    request.addValue("application/json", forHTTPHeaderField: "Content-Type")
    
    let body: [String: Any] = [
        "model": "models/\(modelName)", 
        "content": ["parts": parts.map { $0.dictionaryValue }]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)
    return try await executeEmbeddingRequest(request)
}

Here’s a snippet of the EmbeddingStrategy logic, which swaps indexing schemes at call-time:

enum EmbeddingStrategy: String, CaseIterable {
    case semantic = "C (Semantic)"
    case fullSpectrum = "E (Full-Spectrum)"
    
    func parts(track: TrackRecord) -> [ContentPart] {
        switch self {
        case .semantic:
            // Text-only augmentation
            return [.text("Prompt: \(track.prompt). Visual: \(track.caption). Quality: \(track.mosic)")]
        case .fullSpectrum:
            // Interleaved multimodal parts
            return [
                .text(track.prompt),
                .file(uri: track.imageUri, mimeType: "image/jpeg"),
                .file(uri: track.audioUri, mimeType: "audio/mpeg")
            ]
        }
    }
}

The "Noise" Problem

The benchmark results were conclusive: Text-Augmentation (Strategy C) outperformed the Full-Spectrum approach.

Even when the search query was a purely acoustic probe—such as a humming tune or a specific instrument melody—the strategy that indexed the tracks as descriptive text achieved a perfect 1.0 MRR. In contrast, Strategy E—which included high-fidelity raw media—dropped the score to 0.9583.

The raw audio in Strategy E introduced "jitter": acoustic noise that shifted the embedding vector away from the semantic center. In testing, a piano probe (probe_e1) ranked a "Jazz" track higher than the "Classical" target. The raw audio displaced the semantic concept, whereas the text-augmented index remained stable.

Key Takeaways (small catalog)

The Text Anchor is King: Gemini Embedding 2 has an incredible cross-modal alignment. Descriptive text remains the most stable "anchor" for semantic search, even for non-text queries.
Visual Relevance Improves Recall: Replacing generic placeholders with semantically accurate images (generated via Gemini 3.1 Flash) improved recall from 0.90 to 0.95.
Efficiency: Strategy C is inexpensive and computationally light. Strategy E costs approximately $0.43 per track to generate the media and process the vectors, yet it introduces noise that can degrade search quality.

AudioVoxBench is open source at ghchinoy/AudioVoxBench.