Riffusion
(verb): riff + diffusion
You’ve heard of Stable Diffusion, the open-source AI model that generates images from text?
Well, we fine-tuned the model to generate images of spectrograms, like this:
The magic is that this spectrogram can then be converted to audio:
🔥🔥🔥😱
Really? Yup.
This is the v1.5 Stable Diffusion model with no modifications, just fine-tuned on images of spectrograms. Audio processing happens downstream of the model.
It can generate infinite variations of a prompt by varying the seed. All the same web UIs and techniques, like img2img, inpainting, negative prompts, and interpolation, work out of the box.
Spectrograms
An audio spectrogram is a visual way to represent the frequency content of a sound clip. The x-axis represents time, and the y-axis represents frequency. The color of each pixel gives the amplitude of the audio at the frequency and time given by its row and column.
The spectrogram can be computed from audio using the Short-time Fourier transform (STFT), which approximates the audio as a combination of sine waves of varying amplitudes and phases.
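For intuition, here is a minimal NumPy sketch of that idea: slice the audio into overlapping windowed frames and take the FFT of each. The helper name and parameters are illustrative, not from the project (which uses Torchaudio for this):

```python
import numpy as np

def stft_magnitude(audio, n_fft=512, hop=128):
    """Magnitude spectrogram via windowed FFT frames (illustrative helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)  # complex: amplitudes and phases
    return np.abs(spectrum).T               # rows = frequency bins, cols = time

sr = 22050
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz sine (A4)
spec = stft_magnitude(audio)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 512          # convert bin index back to frequency
```

A pure 440 Hz tone shows up as a single bright horizontal row in the spectrogram, and `peak_hz` lands within one frequency bin (about 43 Hz at these settings) of 440.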
The STFT is invertible, so the original audio can be reconstructed from a spectrogram. However, the spectrogram images from our model only contain the amplitudes of the sine waves and not the phases, because the phases are chaotic and hard to learn. Instead, we use the Griffin-Lim algorithm to approximate the phase when reconstructing the audio clip.
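The core of Griffin-Lim fits in a few lines: start from the magnitude spectrogram with random phase, then repeatedly invert to audio, re-analyze, and keep only the re-estimated phase. The NumPy sketch below is ours for illustration; the project itself relies on Torchaudio’s implementation:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    w = np.hanning(n_fft)
    frames = [x[i : i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=1)

def istft(S, n_fft=512, hop=128):
    w = np.hanning(n_fft)
    out = np.zeros(hop * (S.shape[0] - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        out[i * hop : i * hop + n_fft] += frame * w   # overlap-add
        norm[i * hop : i * hop + n_fft] += w ** 2
    return out / np.maximum(norm, 1e-8)               # window normalization

def griffin_lim(mag, n_iter=32):
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        audio = istft(mag * phase)                   # invert with phase guess
        phase = np.exp(1j * np.angle(stft(audio)))   # keep only the new phase
    return istft(mag * phase)

# reconstruct a pure tone from its magnitude spectrogram alone
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
mag = np.abs(stft(tone))
rec = griffin_lim(mag)
```

On a simple tone the iteration converges quickly; on dense music it gives a serviceable (if slightly "phasey") reconstruction.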
The frequency bins in our spectrogram use the Mel scale, which is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
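One common form of the Mel mapping (the HTK-style formula, used here purely for illustration) is mel = 2595 · log10(1 + f / 700). A quick sketch shows how equally spaced Mel bins become ever wider in Hz as frequency rises:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel formula (one common variant)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # exact inverse of hz_to_mel
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# 11 bin edges equally spaced on the Mel scale between 0 Hz and 10 kHz
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(10000.0), 11))
```

The low bins span tens of Hz while the top bins span thousands, mirroring how human pitch perception compresses high frequencies.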
Below is a hand-drawn image interpreted as a spectrogram and converted to audio. Play it back to get an intuitive sense of how they work. Note how you can hear the pitches of the two curves on the bottom half, and how the four vertical lines at the top make beats similar to a hi-hat sound.
We use Torchaudio, which has excellent modules for efficient audio processing on the GPU. Check out our audio processing code here.
Image-to-Image
With diffusion models, it is possible to condition their creations not only on a text prompt but also on other images. This is incredibly useful for modifying sounds while preserving the structure of an original clip you like. A denoising strength parameter trades off between sounding similar to the original and adapting to the new prompt.
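A toy NumPy sketch of that tradeoff (this is not the actual diffusion process, only an illustration of the knob): img2img starts denoising from a partially noised copy of the init image, so higher strength leaves less of the original structure intact:

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((64, 64))  # stand-in for an original spectrogram image

def partially_noised(img, strength):
    # higher strength -> more noise -> more freedom to follow the new prompt
    return (1.0 - strength) * img + strength * rng.standard_normal(img.shape)

def similarity(a, b):
    # correlation between two images, flattened to vectors
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

low = similarity(partially_noised(original, 0.2), original)   # stays close
high = similarity(partially_noised(original, 0.8), original)  # drifts far
```

At low strength the result correlates strongly with the original clip's spectrogram; at high strength most of that structure is gone before denoising even begins.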
For example, here is a modification of that funky sax solo to crank up the piano:
And here’s an example that adapts a rock and roll solo to an acoustic folk fiddle: