Running Gemma 4’s audio pipeline natively on Apple Silicon requires moving beyond Python. Today, the pipeline runs entirely in Python via the ml-explore/mlx-vlm repository, leaving a gap for multimodal evaluation in native Swift applications. While text generation, the "brain" of the model, was already supported in Swift via Apple's MLX framework, the complex audio processing pipeline, the "ears," was missing.
To close this gap and unlock native on-device capabilities like comparing synthetic audio engines or extracting semantic meaning from speech, I ported the Gemma 4 Audio pipeline from Python into pure, native Swift. This required implementing Conformer blocks, Mel-Filterbanks, and specific Digital Signal Processing (DSP) operations from scratch using MLX primitives.
The Architecture: Senses, Bridge, and Brain
To understand the architecture of a multimodal model like Gemma 4, it helps to map it to human senses. (Apologies for the anthropomorphism.)

- The Senses (Audio Encoders): These specialized encoders process raw audio waveforms into mathematical features. For Gemma 4, this means converting raw audio into Mel-spectrograms: a visual representation of the audio spectrum over time mapped to the Mel scale, which mimics how human hearing perceives frequencies. These spectrograms are then passed through a massive Conformer architecture. A Conformer block interleaves Convolutional layers (to capture local acoustic features) with Attention mechanisms (to capture global context).
- The Bridge (Projector): A projector module aligns these extracted sensory features into the same dimensional "language" that the text model understands.
- The Brain (LLM): The core Large Language Model processes the combined embeddings and autoregressively generates text.
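To make the "senses" stage concrete, the heart of a Mel-spectrogram is the Mel scale itself: a nonlinear mapping from frequency in Hz to perceptual pitch. Here is a minimal NumPy sketch using the standard HTK formula (the constants are the textbook ones, not anything Gemma-specific):

```python
import numpy as np

def hz_to_mel(f_hz):
    # Standard HTK mel formula: compresses high frequencies,
    # mimicking how human hearing perceives pitch intervals.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse mapping, used when placing filterbank edges.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The formula is calibrated so 1000 Hz lands near 1000 mels;
# above that, equal Hz steps shrink in mel terms.
print(hz_to_mel(1000.0))  # ≈ 1000.0
```

A Mel-filterbank is then just a set of triangular filters whose edges are spaced evenly in mel space (via mel_to_hz) rather than in Hz.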
While the MLX team had already provided the "brain" via the mlx-swift-lm libraries, the specific Senses and Bridge for Gemma 4 did not exist in Swift. I built them, creating the native Gemma4AudioModelTemplate.swift blueprint to wire the components together.

"Glass Box" Insights: Translating Python to Swift MLX
Porting a complex neural network from PyTorch or NumPy-style Python to Swift exposes the nuances of both languages. Here are three critical engineering challenges I addressed.
1. Bridging the DSP Gap in Swift
Raw audio must convert into Mel-spectrograms before entering the Conformer blocks. In the Python reference implementation, this relies heavily on NumPy.
Translating this required performing significant DSP math directly in Swift. NumPy uses as_strided to create overlapping windows of audio frames; I reproduced this exact behavior in Swift with MLX.asStrided. For the Fourier transform, the Python implementation runs stft = np.fft.rfft(...). I replaced this with MLXFFT.rfft, which executes the discrete Fourier transform natively, and extremely quickly, on the Apple Silicon GPU via Metal.
# Python (NumPy)
# stft = np.fft.rfft(frames, n=self.fft_length, axis=-1)
# magnitude_spec = np.abs(stft)
// Swift (MLX)
let stft = MLXFFT.rfft(frames, n: self.fftLength, axis: -1)
let magnitudeSpec = MLX.abs(stft)
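For readers unfamiliar with the as_strided trick mentioned above, here is a small NumPy sketch of the overlapping-window framing that precedes the FFT. The frame and hop lengths are illustrative, not Gemma's actual configuration:

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    # Build overlapping frames without copying the signal:
    # each row is a window starting hop_length samples after
    # the previous one, sharing memory with the original array.
    num_frames = 1 + (len(x) - frame_length) // hop_length
    stride = x.strides[0]
    return np.lib.stride_tricks.as_strided(
        x,
        shape=(num_frames, frame_length),
        strides=(hop_length * stride, stride),
    )

x = np.arange(10.0)
frames = frame_signal(x, frame_length=4, hop_length=2)
print(frames.shape)   # (4, 4); frames[1] is [2., 3., 4., 5.]

# Each frame then gets a real FFT and a magnitude, as in the
# snippet above (fft length 8 -> 5 frequency bins per frame).
spec = np.abs(np.fft.rfft(frames, n=8, axis=-1))
print(spec.shape)     # (4, 5)
```

MLX.asStrided follows the same shape/strides contract, which is why the port could match the reference frame-for-frame.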
2. The Ellipsis Slicing Trap
In Python, slicing an array up to t on the second dimension is cleanly written as context[:, :t]. Similarly, gathering indices on axis 1 is written as out[:, indices]. Initially, I ported this to Swift as context[MLXEllipsisIndex.ellipsis, 0 ..< t]. This was a rookie mistake. (Bard taught me Swift!)
In Swift MLX, combining .ellipsis with array indices pushes the specific index to the very last dimension. Instead of slicing the Time dimension (axis 1), it sliced the Head dimension (axis 4), silently mangling the tensor shape and causing downstream matrix multiplications to fail.
The solution requires being explicit: either full slicing (context[0..., 0 ..< t, 0..., 0...]) or the MLX.take function (MLX.take(out, indices, axis: 1)), which is unambiguous and safe.
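The equivalence that makes take the safe translation is easy to check on the Python side. A small NumPy sketch (shapes are illustrative) showing that an axis-explicit take matches the implicit axis-1 fancy indexing:

```python
import numpy as np

out = np.arange(24).reshape(2, 4, 3)   # e.g. [batch, time, feature]
indices = np.array([0, 2])

# out[:, indices] gathers on axis 1 implicitly; np.take names
# the axis explicitly, which is exactly what MLX.take does in
# Swift, with no ellipsis ambiguity.
gathered = np.take(out, indices, axis=1)

assert np.array_equal(gathered, out[:, indices])
print(gathered.shape)  # (2, 2, 3)
```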
3. Einsum vs. Matmul Stability
Einstein summation (einsum) is an elegant notation for describing complex tensor contractions and matrix multiplications. Swift's MLX.einsum behaves exactly like its Python counterpart, accepting string syntax like "bnuwc,bucnh->buwnh".
While elegant, I found that complex 5-dimensional tensor contractions using einsum could be brittle depending on the exact compiler and wrapper versions. For absolute stability in the AudioAttention module, I translated the einsum operation into an explicit combination of .transposed() and standard matrix multiplication (MLX.matmul()).
// Python (and initial Swift)
// context = MLX.einsum("bnuwc,bucnh->buwnh", probs, valueBlocks)
// Stable Swift
let valueBlocksT = valueBlocks.transposed(axes: [0, 3, 1, 2, 4]) // [b, n, u, c, h]
var context = MLX.matmul(probs, valueBlocksT) // [b, n, u, w, h]
context = context.transposed(axes: [0, 2, 3, 1, 4]) // [b, u, w, n, h]
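The two routes are mathematically identical, and that identity is worth verifying before trusting the rewrite. A NumPy sketch (with small illustrative dimensions) checking the transpose-and-matmul path against the einsum reference:

```python
import numpy as np

rng = np.random.default_rng(0)
b, n, u, w, c, h = 1, 2, 3, 4, 5, 6
probs = rng.standard_normal((b, n, u, w, c))
value_blocks = rng.standard_normal((b, u, c, n, h))

# Reference: einsum contracting over the shared 'c' axis.
ref = np.einsum("bnuwc,bucnh->buwnh", probs, value_blocks)

# Stable route: reorder value_blocks to [b, n, u, c, h],
# batch-matmul against probs [b, n, u, w, c] -> [b, n, u, w, h],
# then reorder the result to [b, u, w, n, h].
vt = value_blocks.transpose(0, 3, 1, 2, 4)
ctx = np.matmul(probs, vt).transpose(0, 2, 3, 1, 4)

assert np.allclose(ref, ctx)
print(ctx.shape)  # (1, 3, 4, 2, 6)
```

The same axis permutations carry over directly to the Swift .transposed(axes:) calls shown above.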
The Workflow and Results
Mapping out the undocumented behaviors of the Python mlx-vlm codebase and aligning them with the mlx-swift architecture required extensive cross-referencing. Gemini CLI quickly identified missing primitives (like depthwise 1D convolutions, which I was able to replicate using MLXNN.Conv1d with specifically configured groups) and helped ensure the Swift implementation remained mathematically identical to the Python reference.
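A depthwise convolution, what a grouped Conv1d computes when groups equals the channel count, reduces to each channel being filtered independently by its own kernel. A minimal NumPy sketch of that semantics (shapes illustrative; note that NN "convolution" layers actually compute cross-correlation, hence the kernel flip):

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    # x: [channels, time]; kernels: [channels, k].
    # groups == channels means each channel is convolved with
    # its own kernel, with no mixing across channels.
    # np.convolve flips its kernel (true convolution), so we
    # pre-flip to match the cross-correlation NN layers use.
    return np.stack([
        np.convolve(x[ch], kernels[ch][::-1], mode="valid")
        for ch in range(x.shape[0])
    ])

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16))       # 4 channels, 16 samples
kernels = rng.standard_normal((4, 3))  # one 3-tap kernel per channel

y = depthwise_conv1d(x, kernels)
print(y.shape)  # (4, 14)
```

Configuring MLXNN.Conv1d with groups equal to the number of channels yields this same per-channel behavior natively.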
The resulting performance and outputs on Apple Silicon are promising. In a sample run using this Swift implementation, we observed fast native extraction and encoding times:
- DSP Extraction Time: 28 ms
- Encoder Processing Time: 142 ms
- Encoded Features Matrix Shape: [1, 250, 1024]
To demonstrate this visually, I also built a sample interface (MLXAudioUI) running the native Swift layers:

The Deliverable
The result of this work is the Gemma4Audio library, currently submitted as a pull request to mlx-swift-examples. It is a pure Swift implementation of AudioRMSNorm, ConformerBlock, AudioAttention, and Gemma4AudioFeatureExtractor.
This includes a comprehensive XCTest suite that mathematically verifies the convolution padding, tensor downsampling, and FFT extractions. By bridging this gap, the Gemma 4 audio pipeline can now run directly on iOS and macOS, taking full advantage of the Metal backend for native performance.
What's Next: The Final Mile
While the "ears" are now built and verified, running the model end-to-end natively requires upstream support in the core mlx-swift and mlx-swift-lm repositories. Specifically, I'm waiting for the language model's generate() function to officially support multimodal audio inputs (passing the extracted audio features alongside text tokens into the LLM).
The final integration into mlx-swift-lm is mostly architectural plumbing. It requires conforming the new audio model to the LanguageModel protocol. The prepare() function needs to be updated so that it can ingest the raw audio, pass it through our Senses and Bridge, and splice that massive block of projected audio embeddings directly into the text token sequence before filling the KV Cache.
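The splicing step described above can be sketched in a few lines. This is NumPy pseudocode under my own assumptions, not the mlx-swift-lm API: the placeholder token id and shapes are hypothetical, and the real prepare() would operate on MLX arrays before filling the KV cache:

```python
import numpy as np

def splice_audio(text_emb, token_ids, audio_emb, audio_token_id):
    # Replace each placeholder-token position in the text
    # embedding sequence with the corresponding row of the
    # projected audio embeddings, in order.
    out = text_emb.copy()
    positions = np.flatnonzero(token_ids == audio_token_id)
    assert len(positions) == audio_emb.shape[0]
    out[positions] = audio_emb
    return out

dim = 8
AUDIO = 99  # hypothetical placeholder token id
token_ids = np.array([1, 2, AUDIO, AUDIO, AUDIO, 3])
text_emb = np.zeros((6, dim))   # stand-in for token embeddings
audio_emb = np.ones((3, dim))   # stand-in for projected audio frames

spliced = splice_audio(text_emb, token_ids, audio_emb, AUDIO)
print(spliced[2:5].sum())  # 24.0 — the audio rows landed in place
```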
Once this upstream integration is merged, developers will be able to pipe audio directly into the prompt natively. Keep an eye on the MLX repositories for these multimodal generation updates; that will be the final key to unlocking fully native, on-device audio analysis.