I strongly agree about generating "editables" rather than finalized media. In fact, that's why text generators are more useful than current media generators: text is editable by default. Here's a tweetstorm about it: https://x.com/jsonriggs/status/1694490308220964999?s=20


Audio is definitely editable. While generative audio is new, I am hopeful that a host of interesting applications (audio2audio, etc.) will emerge within its ecosystem. Promising signal separation (audio to stems) and pitch detection tools already exist for raw audio signals. Rather than pushing Stability to focus on symbolic representations (such as severely lossy MIDI), I hope you can first try adapting to tools that work fundamentally with rich audio signals. Perhaps there will be room for symbolic music AI, and perhaps Stability will even develop additional models that generate schematic music, but please, please don't sacrifice audio generality for piano-roll thinking alone. LoRAs will undoubtedly be usable to generate more schematic audio via the Stable Audio model -- I imagine they could easily be repurposed to develop sample libraries compatible with DAW (digital audio workstation), sequencer, and tracker production workflows.
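To give a sense of what "works on the raw signal" already looks like, here is a minimal sketch of pitch detection with librosa (the file name is a placeholder, and parameters are just illustrative defaults):

  import librosa

  # Load a mono audio file at its native sample rate (path is hypothetical)
  y, sr = librosa.load("take.wav", sr=None, mono=True)

  # Probabilistic YIN (pYIN) pitch tracking directly on the raw signal
  f0, voiced_flag, voiced_prob = librosa.pyin(
      y,
      fmin=librosa.note_to_hz("C2"),
      fmax=librosa.note_to_hz("C7"),
      sr=sr,
  )

  # f0 is a per-frame fundamental-frequency estimate in Hz (NaN where unvoiced),
  # which you could then quantize to notes if you really want symbols.

Stem separation tools (e.g. Demucs-style source separators) operate on the same raw-audio level, which is the point: you don't need the model to emit MIDI to get editable material out.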


Audio is editable, but audio editing is a much rarer skill than text editing. Anyone who has completed primary school has basic proficiency in editing text, and anyone who has gone through college or held a job where email communication is common has many years of experience with it.


What is audio2audio? Can I beatbox into a mic and have professionally produced tracks come out the other end?


Train the model with MIDI notes as text in the prompt and the audio as the target. It will learn to interpret the notes.
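A minimal sketch of what such a (prompt, target) training pair could look like, assuming pretty_midi is available; the prompt format and the midi_to_prompt helper are hypothetical, and the real conditioning interface of any given model may differ:

  import pretty_midi

  def midi_to_prompt(midi_path):
      # Flatten a MIDI file into a plain-text note list that a text encoder can read
      pm = pretty_midi.PrettyMIDI(midi_path)
      notes = []
      for inst in pm.instruments:
          for n in inst.notes:
              notes.append((n.start, pretty_midi.note_number_to_name(n.pitch)))
      notes.sort()
      return "notes: " + " ".join(f"{name}@{start:.2f}s" for start, name in notes)

  # One training example: the text prompt carries the notes,
  # the paired audio file is what the model learns to generate.
  prompt = midi_to_prompt("song.mid")   # e.g. "notes: C4@0.00s E4@0.50s ..."
  target_audio = "song.wav"             # rendered or recorded performance of those notes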


Not all music is well represented with notes, nor are audio datasets with high-quality note representations readily available. But I guess if you work hard enough you can get close: https://www.youtube.com/watch?v=o5aeuhad3OM My example still sounds like the chiptune simulation that it is, however.


It's ok; the model would create music even from a vague prompt. It will learn even better from the notes, imperfect as they are, because it has the interpreted (audio) version as the training target.



