[ RIFFUSION ]
(noun): riff + diffusion
You've heard of Stable Diffusion, the open-source AI model that generates images from text?
Well, we fine-tuned the model to generate images of spectrograms, like this:
The magic is that this spectrogram can then be converted to audio:
🔥🔥🔥😱
Really? Yup.
This is the v1.5 stable diffusion model with no modifications, just fine-tuned on images of spectrograms. Audio processing happens downstream of the model.
It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques like img2img, inpainting, negative prompts, and interpolation work out of the box.
Spectrograms
An audio spectrogram is a visual way to represent the frequency content of a sound clip. The x-axis represents time, and the y-axis represents frequency. The color of each pixel gives the amplitude of the audio at the frequency and time given by its row and column.
The spectrogram can be computed from audio using the Short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves of varying amplitudes and phases.
The STFT is invertible, so the original audio can be reconstructed from a spectrogram. However, the spectrogram images from our model contain only the amplitudes of the sine waves and not the phases, because the phases are chaotic and hard to learn. Instead, we use the Griffin-Lim algorithm to approximate the phase when reconstructing the audio clip.
The frequency bins in our spectrogram use the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
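To make the phase-recovery step concrete, here is a minimal NumPy/SciPy sketch of the Griffin-Lim algorithm. This is for illustration only — it is not the torchaudio implementation the project actually uses, and the FFT size, hop length, and iteration count are arbitrary assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_fft=512, hop=128, n_iter=32, fs=44100):
    """Recover a waveform from an STFT magnitude by iteratively
    refining a random initial phase estimate (Griffin-Lim)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Invert the current magnitude/phase estimate to audio...
        _, audio = istft(magnitude * phase, fs=fs,
                         nperseg=n_fft, noverlap=n_fft - hop)
        # ...then re-analyze it and keep only the new phase.
        _, _, spec = stft(audio, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        phase = np.exp(1j * np.angle(spec[:, :magnitude.shape[1]]))
        if phase.shape[1] < magnitude.shape[1]:  # guard against off-by-one framing
            pad = magnitude.shape[1] - phase.shape[1]
            phase = np.pad(phase, ((0, 0), (0, pad)), constant_values=1.0)
    _, audio = istft(magnitude * phase, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return audio
```

Each iteration enforces the known magnitudes while letting the phases settle toward values consistent with a real waveform.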
Below is a hand-drawn image interpreted as a spectrogram and converted to audio. Play it back to get an intuitive sense of how they work. Note how you can hear the pitches of the two curves on the bottom half, and how the four vertical lines at the top make beats similar to a hi-hat sound.
We use Torchaudio, which has excellent modules for efficient audio processing on the GPU. Check out our audio processing code here.
Image-to-Image
With diffusion models, it is possible to condition their creations not only on a text prompt but also on other images. This is incredibly useful for modifying sounds while preserving the structure of an original clip you like. You can control how much to deviate from the original clip and toward a new prompt using the denoising strength parameter.
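As a rough sketch of what denoising strength does: in a diffusers-style image-to-image pipeline, the strength determines how far toward pure noise the initial image is pushed before denoising begins, which is equivalent to skipping the first portion of the denoising schedule. The helper below is an illustrative approximation of that mapping, not Riffusion's actual code:

```python
def img2img_steps(strength, num_inference_steps):
    """Map denoising strength in [0, 1] to the number of denoising
    steps actually run on the noised init image (diffusers-style)."""
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    # Steps t_start..num_inference_steps run; the earlier ones are skipped,
    # so low strength keeps the clip close to the original.
    return num_inference_steps - t_start
```

A strength near 0 barely alters the clip, while a strength near 1 mostly follows the new prompt.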
For example, here is a modification of that funky sax solo to crank up the piano:
And here's an example that adapts a rock and roll solo to an acoustic folk fiddle:
Looping and Interpolation
Generating short clips is a blast, but we really wanted infinite AI-generated jams.
Let's say we put in a prompt and generate 100 clips with varying seeds. We can't concatenate the resulting clips because they differ in key, tempo, and downbeat.
Our strategy is to pick one initial image and generate variations of it by running image-to-image generation with different seeds and prompts. This preserves the key properties of the clips. To make them loop-able, we also create initial images that are an exact number of measures.
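Concretely, "an exact number of measures" pins the clip duration — and hence the spectrogram width — to the tempo. A small sketch of the arithmetic; the sample rate and hop length here are placeholder values, not necessarily the project's settings:

```python
def clip_size(bpm, measures=4, beats_per_measure=4,
              sample_rate=44100, hop_length=512):
    """Duration and spectrogram width for a whole number of measures."""
    seconds = measures * beats_per_measure * 60.0 / bpm
    # One STFT frame per pixel column of the spectrogram image.
    width_px = round(seconds * sample_rate / hop_length)
    return seconds, width_px
```

For example, four 4/4 measures at 120 BPM last exactly 8 seconds, so a clip generated at that length loops cleanly back onto its own downbeat.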
However, even with this approach it's still too abrupt to transition between clips. Multiple interpretations of the same prompt with the same overall structure can still vary greatly in their vibe and melodic motifs.
To address this, we smoothly interpolate between prompts and seeds in the latent space of the model. In diffusion models, the latent space is a feature vector that embeds the entire possible space of what the model can generate. Items which resemble each other are close in the latent space, and every numerical value of the latent space decodes to a viable output.
The key is that we can continuously sample the latent space between two different seeds of the same prompt, or between two different prompts with the same seed. Here is an example with the visual model:
We can do the same thing with our model, which often results in buttery smooth transitions, even between starkly different prompts. This is vastly more interesting than interpolating the raw audio, because in the latent space all in-between points still sound like plausible clips.
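A common way to interpolate in latent space is spherical interpolation (slerp), which follows the hypersphere where Gaussian diffusion latents concentrate rather than cutting straight through it. A minimal NumPy sketch, with our own hypothetical naming:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-6):
    """Spherically interpolate between two flat latent vectors at fraction t."""
    v0_u = v0 / np.linalg.norm(v0)
    v1_u = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0_u, v1_u), -1.0, 1.0)
    theta = np.arccos(dot)  # angle between the two latents
    if theta < eps:  # nearly parallel: plain linear interpolation is fine
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)
```

Sweeping t from 0 to 1 over successive clips yields the gradual transition; interpolating the seed noise and the prompt embedding together is what carries one clip smoothly into the next.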
Here is one of our favorites, a beautiful 20-step interpolation from typing to jazz:
And another one from church bells to electronic beats:
Interpolation of Arabic gospel, this time with the same prompt between two seeds:
The Hugging Face diffusers library implements a wide range of pipelines, including image-to-image and prompt interpolation, but we did not find an implementation that could do prompt interpolation combined with image-to-image conditioning. We implemented this pipeline, along with support for masking to limit generation to only parts of an image. Code here.
Interactive Web App
To put it all together, we made an interactive web app to type in prompts and infinitely generate interpolated content in real time, while visualizing the spectrogram timeline in 3D.
As the user types in new prompts, the audio smoothly transitions to the new prompt. If there is no new prompt, the app will interpolate between different seeds of the same prompt.
The app is built using Next.js, React, TypeScript, three.js, and Tailwind, and deployed with Vercel. It communicates over an API with the inference server that does the GPU processing.
The web app code is at https://github.com/hmartiro/riffusion-app.
The inference server code is at https://github.com/hmartiro/riffusion-inference.
If you have a powerful GPU, you can run the experience locally.
Samples
Some of our favorite prompts and results.
Techno beat to Jamaican rap:
Fantasy ballad, female voice to teen boy pop star: