I strongly agree about generating "editables" rather than finalized media. In fact, that's why text generators are more useful than current media generators: text is editable by default. Here's a tweetstorm about it: https://x.com/jsonriggs/status/1694490308220964999?s=20


Audio is definitely editable. While generative audio is new, I am hopeful that a host of interesting applications (audio2audio, etc.) will emerge within its ecosystem. Promising signal separation (audio to stems) and pitch detection tools already exist for raw audio signals. Rather than pushing Stability to focus on symbolic representations (such as severely lossy MIDI), I hope you can first try adapting to tools that work fundamentally with rich audio signals. Perhaps there will be room for symbolic music AI, and perhaps Stability will even develop additional models that generate schematic music, but please, please don't sacrifice audio generality for piano-roll thinking alone. LoRAs will undoubtedly be usable to generate more schematic audio via the Stable Audio model -- I imagine they could easily be repurposed to develop sample libraries compatible with DAW (digital audio workstation), sequencer, and tracker production workflows.
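To give a sense of what "works on the raw signal" already looks like, here is a minimal sketch of pitch detection with librosa (the file name is a placeholder, and parameters are just illustrative defaults):

  import librosa

  # Load a mono audio file at its native sample rate (path is hypothetical)
  y, sr = librosa.load("take.wav", sr=None, mono=True)

  # Probabilistic YIN (pYIN) pitch tracking directly on the raw signal
  f0, voiced_flag, voiced_prob = librosa.pyin(
      y,
      fmin=librosa.note_to_hz("C2"),
      fmax=librosa.note_to_hz("C7"),
      sr=sr,
  )

  # f0 is a per-frame fundamental-frequency estimate in Hz (NaN where unvoiced),
  # which you could then quantize to notes if you really want symbols.

Stem separation tools (e.g. Demucs-style source separators) operate on the same raw-audio level, which is the point: you don't need the model to emit MIDI to get editable material out.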


Audio is editable, but audio editing is a much rarer skill than text editing. Anyone who has completed primary school has basic proficiency in editing text, and anyone who has gone through college or held a job where email communication is common has many years of experience with it.


What is audio2audio? Can I beatbox into a mic and have professionally produced tracks come out the other end?


Train the model with MIDI notes as text in the prompt and the audio as the target. It will learn to interpret the notes.
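A minimal sketch of what such a (prompt, target) training pair could look like, assuming pretty_midi is available; the prompt format and the midi_to_prompt helper are hypothetical, and the real conditioning interface of any given model may differ:

  import pretty_midi

  def midi_to_prompt(midi_path):
      # Flatten a MIDI file into a plain-text note list that a text encoder can read
      pm = pretty_midi.PrettyMIDI(midi_path)
      notes = []
      for inst in pm.instruments:
          for n in inst.notes:
              notes.append((n.start, pretty_midi.note_number_to_name(n.pitch)))
      notes.sort()
      return "notes: " + " ".join(f"{name}@{start:.2f}s" for start, name in notes)

  # One training example: the text prompt carries the notes,
  # the paired audio file is what the model learns to generate.
  prompt = midi_to_prompt("song.mid")   # e.g. "notes: C4@0.00s E4@0.50s ..."
  target_audio = "song.wav"             # rendered or recorded performance of those notes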


Not all music is well represented with notes, nor are audio datasets with high-quality note representations readily available. But I guess if you work hard enough you can get close: https://www.youtube.com/watch?v=o5aeuhad3OM My example still sounds like the chiptune simulation that it is, however.


It's ok; the model would create music even from a vague prompt. It will learn even better from the notes, imperfect as they are, because it has the interpreted (audio) version as the training target.



