Previous progress report: https://www.linkedin.com/pulse/programming-artificial-intelligence-based-daw-sonic-pi-mika-laaksonen
AaltoAivo in YouTube: https://www.youtube.com/@aaltoaivo
Abstract
Over a year ago I started a project to create a generative artificial intelligence system for Sonic Pi. During this time that simple first system has gradually evolved into a complex one. Excluding Sonic Pi and the JUKE framework, everything else in this project – the learning system, data analysis, AI models and generative algorithms – has been programmed and produced by myself, since there are no other contributors to this project. I have also produced the audio data necessary for the machine learning processes (sound and music samples) in my music studio, and I have used music notation data representing various music styles and theories.
Currently this system, AaltoAivo, holds over 40 TB of music data and generative AI models, plus multiple layers of custom-made software to organise, analyse, annotate and process music data into transformative and generative models. These models are used to create music according to user-provided instructions, although AaltoAivo is also capable of generating music without any instructions, according to its own “machine aesthetics”. However, AaltoAivo is still at the prototype phase and is not built as a user-friendly application. Many processes still need more development, testing and optimization, but the basic workflow has been operational since the beginning.
My design principle has been to produce an independent generative AI which emulates the workflow of a real music producer. Thus, it must be able not only to generate music as audio, but also to generate it in a form a music producer can use, exportable to commonly used digital audio workstation (DAW) software for further development and processing. As such, AaltoAivo can be used as a standalone generative AI, as a music producer's assistant, as a source of inspiration, or even as a member of an orchestra.
Video: Two instances of AaltoAivo creating cinematic music
Design
When I started this project, I had clear goals about how this AI system should work. I generally dislike the idea of black-box systems, so I wanted the mechanisms and generated results to be mostly understandable, reproducible and exportable. Basically, it should work like a music producer, putting out not only music as audio, but also notes and chords, effect processing instructions and processed sound samples.
I am interested in machine intelligence as a concept, so I wanted this system to be flexible in the amount of user input or interference, ranging from specific instructions all the way down to very minimal or zero instructions. This way it can be used as a tool to assist a composer's workflow, as an entertainment generator for a casual user, or as a tool to research whether a machine intelligence can create something original.
Being a music producer myself, in addition to being a programmer and software designer, I believe this kind of system to be more useful than one that would merely generate audio according to a text prompt written by the user. Of course, that is possible in AaltoAivo too, but it is not my primary goal. I see AaltoAivo as a tool to assist in creative work, but I am also interested in seeing how far it can go – can it create its own aesthetics and its own kind of music, can it express itself in some unique and unpredictable way, or is it just a complex automaton? I am more interested in the process and the internal mechanisms of the AI than in any kind of result as a commercial product.
Video: AaltoAivo creates Techno and Electro
Layered And Modular Structure
Given all the flexibility in the goals mentioned above, it is obvious that this software system has to be highly layered and modular. There are modules for data acquisition, storage and annotation, while other modules use that data in machine learning processes. There are also artificial intelligence modules programmed and trained with traditional music theory models, and those models are used as a counterbalance or supervisor for the other models that AaltoAivo can generate.
To clarify, AaltoAivo has several independent and competing internal generative model machines, each of them contributing to the results depending on context. The machines represent different contextual dimensions of the music language used in AaltoAivo, covering opposing extremes in structural, contextual, emotional and various other elements. Depending on the choices made by the AaltoAivo AI, it can make music according to the “traditional rules of music”, or totally rule-breaking and experimental music which may not appeal to a human listener but could make perfect sense to the AI, because, for example, it realizes some mathematical function in the form of sound.
Annotation in the data layer is important for the machine learning layer, where transformers and reinforcement learning gradually build associations between music patterns, sounds, and how they are usually understood or categorized as genres, emotions etc. In a sense, AaltoAivo characterizes music not only as sound and patterns, but also as a language which can describe concrete and abstract ideas.
The compositional layer uses those generative model machines and creates music according to a creative idea, which determines the parameters and boundaries of the generative process: the weighting of each machine, the use and creation of appropriate sounds and patterns, and so on. A composition is expressed and stored in the form of configuration and pattern data, which is exportable to, for example, common DAW software.
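To make this concrete, here is a purely illustrative Python sketch of what such exportable configuration and pattern data could look like; the field names are invented for this example and are not AaltoAivo's actual schema.

composition = {
    "bpm": 96,
    "key": "D minor",
    "dimensions": {"traditional_avantgarde": 0.35, "organic_synthetic": 0.80},
    "tracks": [
        {
            "name": "pad",
            "instrument": "analog_pad_03",
            "pattern": [["Dm7", 4.0], ["Bbmaj7", 4.0], ["Gm7", 4.0], ["A7", 4.0]],
            "effects": [{"type": "reverb", "mix": 0.4}],
        }
    ],
}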
At this moment AaltoAivo does not produce sound itself, although this may change. Since this is still a prototype project, I needed external software which is suitable for algorithmic music and very close to programming. I found Sonic Pi to be a perfect choice for this, because it can easily be extended, it supports MIDI and OSC, and it has Ableton Link support.
However, AaltoAivo does not generate Sonic Pi programs, but project data. To bridge this gap, I developed a Sonic Pi program which transforms Sonic Pi into a headless DAW (Sonic DAW), with internal tracks, effect racks, sampled instruments etc. An AaltoAivo composition is given to this Sonic DAW as input, and Sonic Pi then generates audio according to the instructions. AaltoAivo can communicate with Sonic DAW and change parameters and patterns when necessary. I have described this Sonic DAW in detail in my first progress report.
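As an illustration of that communication path, the following minimal Python sketch sends one parameter change to Sonic Pi over OSC using the python-osc library. The cue address and value are hypothetical; Sonic Pi listens for incoming OSC on port 4560 by default in recent versions.

from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 4560)              # Sonic Pi's default OSC input port
client.send_message("/aaltoaivo/track/1/cutoff", 0.65)   # hypothetical cue address and value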
Generative Dimensions and Parameters
I already mentioned that the data and generative models use contextual dimensions and parameters to characterize music patterns and samples. This is similar to how we characterize physical objects by, for example, their shape, colour, surface, material, regularity etc. For music I use several pairs of opposite descriptive extremes, as follows:
Traditional – Avantgarde
Outsider – Popular
Ethnic – Global
Descriptive – Abstract
Structured – Freeform
Organic – Synthetic
Sacral – Mundane
Minimalist – Maximalist
Every sample, music pattern, chord and full music piece is described within these eight characteristic dimensions, using normalized values. They are easy to understand, which is important when data is annotated, since all subsequent machine learning depends on that. You don't have to have an academic degree to describe the music you are listening to. During the development process I used other additional opposite-pair dimensions too, but eventually I settled on these eight.
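As a hypothetical illustration (the field names and values are invented for this example), an annotation record for one sample could look like this in Python, with each dimension stored as a normalized value between 0 (left extreme) and 1 (right extreme):

annotation = {
    "sample": "pads/warm_evolving_pad_07.wav",   # invented path
    "traditional_avantgarde": 0.30,
    "outsider_popular": 0.65,
    "ethnic_global": 0.50,
    "descriptive_abstract": 0.70,
    "structured_freeform": 0.40,
    "organic_synthetic": 0.85,
    "sacral_mundane": 0.55,
    "minimalist_maximalist": 0.25,
}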
Additionally, I use other descriptive and associative data such as genre, instrument, music production terminology, emotional context and many other minor details, which are stored as freeform text. This association data gives the music deeper and broader context, but also makes it much more complex. However, it is also the source of infinite variation within the generative process. Furthermore, it is interesting to study all the accumulated data using, for example, Fuzzy Clustering, Support Vector Machines and Self-Organizing Maps, which has provided insights on how to utilize other machine learning methods to build the generative models.
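As one example of this kind of exploratory analysis, the following Python sketch uses a Support Vector Machine from scikit-learn to test how well the eight dimension values separate annotated genre labels. The data here is randomly generated stand-in data, not the real annotation set.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 8)                                     # stand-in dimension vectors
y = np.random.choice(["ambient", "techno", "cinematic"], 200)  # stand-in genre labels
clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)   # how well does the 8D description predict genre?
print(scores.mean())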
Data and Resources
It is understandable that I needed a vast amount of audio material for machine learning. With musical notes and MIDI sequences alone it would, of course, be possible to create a system which generates only notes and MIDI sequences. But modern audio production uses not only traditional instruments, but a vast array of digital sample-based instruments, pure audio synthesis and various programmed and complex effects to create any imaginable sound. My goal was to teach the AI to associate sounds with notes, MIDI automation, effects, parameter changes etc. Most importantly, I wanted the AI to associate different sound patterns with specific music styles, genres, emotions and the other ways we use words to describe music.
Descriptive association is the only way to enable an ML-model-based AI to generate music according to a description provided by the user. The AI cannot invent such meanings out of thin air; it needs carefully annotated and described learning data to associate certain kinds of sounds and music patterns, so that it can differentiate between dubstep and ambient music, for example. Similarly, it needs examples to understand what is upbeat and what is melancholic, and so on.
Annotation of the learning data was the most tedious and time-consuming part of this project. As an experienced programmer I could automate it for the most part, but even then it took several thousand hours in total. If this were a project operated by Google, Apple, Spotify or Microsoft, I probably could have used their vast music libraries, but as a single developer I did not have such luxury, so I had to create the data on my own. There was one advantage – this way I could use sets of matching MIDI and audio, both as short music stems and as larger music pieces.
I have a moderate array of synthesizers, music workstations, DAWs, software synthesizers and sample-based instruments. With these I could create music patterns for anything from ethnic and traditional instruments, chamber and symphony orchestras and choirs to modern synthesizer sounds. For my purposes this was quite enough, since I am mostly interested in how far I can go with this project. Later, if I happen to have more resources, I can expand the scope to real acoustic instruments, for example.
Besides, this is just a prototype, a proof-of-concept project. I just want to see if it works and how it can be done. Since I have no funding for this and no venture capitalists backing it up, it is probable that AaltoAivo will remain as it is now. The potential is there, but I lack the resources and networks, and I don't speak business language. On the other hand, I am not losing anything.
Audio Analysis and Representation
Using notes, MIDI expression, characteristics and words in machine learning is routine nowadays. Some of them can easily be transformed into numeric data, while text expressions can be handled as categorical data or with NLP models. Audio sample data is trickier, because each sample has both a frequency domain and a time domain. At the same time, the sound data should be in a form which allows it to be cut, morphed and joined into new sounds and sound patterns, just like you can modify notes and MIDI data, for example.
In AaltoAivo the individual sound samples are analysed using several different techniques, for different purposes. If the beat of a sample is unknown, a Discrete Wavelet Transform is used to detect the BPM, if the sample has one. Many naïve algorithms rely on clear drum beats for detection, but there are cases where there are no drumbeats and the sound itself is modulated according to the BPM.
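The Python sketch below shows the general idea of wavelet-based tempo detection, assuming PyWavelets and NumPy; it is an illustration of the technique, not AaltoAivo's actual implementation. The detail bands of the wavelet decomposition are turned into envelopes, summed and autocorrelated, and the strongest lag inside a plausible beat-period window gives the BPM estimate.

import numpy as np
import pywt

def estimate_bpm(y, sr, bpm_range=(60, 180), levels=4):
    # Discrete wavelet decomposition: the detail bands carry most of the rhythmic energy.
    coeffs = pywt.wavedec(y, "db4", level=levels)
    target_len = len(y) // 2**levels
    env = np.zeros(target_len)
    for d in coeffs[1:]:                          # skip the coarse approximation band
        e = np.abs(d) - np.abs(d).mean()          # crude amplitude envelope per band
        env += np.interp(np.linspace(0, 1, target_len),
                         np.linspace(0, 1, len(e)), e)
    # Autocorrelate the summed envelope and search the plausible beat-period window.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    sub_sr = sr / 2**levels                       # effective sample rate of the envelope
    lags = np.arange(1, len(ac))
    bpms = 60.0 * sub_sr / lags
    valid = (bpms >= bpm_range[0]) & (bpms <= bpm_range[1])
    best_lag = lags[valid][np.argmax(ac[1:][valid])]
    return 60.0 * sub_sr / best_lag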
A much harder task is to detect notes, chords and their progressions. My solution is a process where sound samples, as a 2D representation of the frequency and time domains, are transformed into a suitable “music byte matrix”. You can think of it as a method similar to Granular Synthesis, where sound is decomposed into tiny sound atoms, each only a few milliseconds long. In this case I use a machine learning method called Non-negative Matrix Factorisation (NNMF) on a Short-Time Fourier Transform (STFT), which is based on the Fast Fourier Transform (FFT). The 2D representation of the sound is interpreted as a matrix, and NNMF produces a layered approximation of that matrix. Its elements can be used individually or mixed with other elements, if necessary.
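The following Python sketch shows the core of this decomposition using librosa for the STFT and scikit-learn for the factorisation. The file name and component count are placeholders, and this is a simplified illustration rather than the actual AaltoAivo code.

import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load("sample.wav", sr=None, mono=True)   # placeholder input file
S = librosa.stft(y, n_fft=2048, hop_length=512)          # complex STFT (freq x time)
V = np.abs(S)                                            # non-negative magnitude matrix

model = NMF(n_components=8, init="nndsvd", max_iter=500)
W = model.fit_transform(V)       # spectral templates, shape (freq_bins, 8)
H = model.components_            # activations of each template over time, shape (8, frames)
# Each layer, the outer product of W[:, k] and H[k, :], approximates one sound component,
# e.g. a note, a drum hit or a timbral texture, which can be annotated or recombined.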
When sound is represented as NNMF layers, it is possible to associate them, as well as notes, chords and timbres, with the descriptive annotation data. Granularization of sound opens possibilities for pitch shifting, spectral modification and various effects to produce totally new sounds, instead of just imitating existing physical instruments. There were some challenges with transients, envelopes and phases, but eventually I found solutions for them. Using the theory of nonlinear dynamical systems and differential equations, AaltoAivo also handles sound in a nonlinear phase state, meaning that it is represented in neither the frequency domain nor the time domain.
As such, AaltoAivo contains an advanced toolbox to analyse, associate, learn and synthesize any kind of sound. During the generation process it is possible to transform the individual layers of the matrix back into sound, as a sum of layers, using Wiener filtering.
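Continuing the sketch above, one layer can be resynthesized with a Wiener-style soft mask that redistributes the original complex STFT according to that layer's share of the estimated energy; again, this is illustrative rather than the production code.

k = 0                                        # choose one NNMF layer
V_k = np.outer(W[:, k], H[k, :])             # magnitude estimate of layer k
V_hat = W @ H + 1e-10                        # full approximation (epsilon avoids division by zero)
mask = V_k / V_hat                           # Wiener filter: this layer's share of the energy
S_k = mask * S                               # apply the soft mask to the complex STFT
y_k = librosa.istft(S_k, hop_length=512)     # back to the time domain as audio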
Generative Models and Machine Learning
AaltoAivo uses machine learning on many levels of its process flow, for various purposes. Annotation and data analysis use several machine learning and pattern recognition methods to pre-organize and check the acquired data. As described above, NNMF is used to transform raw audio data into a format appropriate for subsequent machine learning.
The generative model building process uses Transformers and Reinforcement Learning. To put it simply, the models learn to use sounds, notes, chords, and smaller and larger music patterns in the context of the associative data. These models are tuned further with Reinforcement Learning and Adversarial Learning. Similarly, AaltoAivo also learns to use various sound effects and filters, and generally to mix different tracks like a human sound engineer in a studio.
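As a rough illustration of the sequence-modelling part (not AaltoAivo's actual architecture), the following PyTorch sketch defines a small causal Transformer that predicts the next music-event token, where a token could stand for a note, a chord, a sound layer or a pattern reference; all sizes are arbitrary placeholders.

import torch
import torch.nn as nn

class MusicTokenModel(nn.Module):
    def __init__(self, vocab_size=4096, d_model=256, n_heads=4, n_layers=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len) of event IDs
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok(tokens) + self.pos(pos)
        # Causal mask so each position only attends to earlier events.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.encoder(x, mask=mask)
        return self.head(x)                       # logits for the next event token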
There are several goals in this phase. One is to guide the models to understand and use already established music theories and rules, while also giving them freedom to break those rules. Another important point is to prevent the models from producing music that infringes international copyright law. Currently AaltoAivo produces original music in several genres and subgenres.
Various Use Scenarios
As I stated at the beginning, flexibility was the grounding principle of the AaltoAivo project. I had no intention of developing “a one-trick pony”, but rather a versatile AI software tool for music production and a research tool for machine intelligence. Here are just some examples of possible use cases.
Background music production. In this case AaltoAivo generates music directly as audio, based on broad guidelines. These can be as simple as “play meditative and minimalistic piano music” or “play upbeat R&B”. AaltoAivo will generate such music for as long as necessary.
AI playing external instruments. Since AaltoAivo can output MIDI and OSC, it can play any external instrument or device which accepts MIDI or OSC as input. AaltoAivo can play software instruments in this manner too (a small MIDI output sketch follows after these examples).
AI band musician. AaltoAivo can use Ableton Link, and it takes both MIDI and audio as input data. This way it is able to play as a member of a band or orchestra, following tempo changes and other events, even improvising new patterns when necessary.
Music production AI assistant. A music producer can input their own material – MIDI, samples, effect use, automation – into AaltoAivo, and the AI may then generate variations or combine it with other material. This is a great way to find inspiration. Anything interesting can be exported back to a DAW for further processing.
Machine intelligence art. I already have plans to integrate Computer Vision and other machine senses which could provide new kinds of associative data for AaltoAivo. This could be the basis for purely machine-inspired generation of music, and for a machine-based aesthetics.
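As the small illustration promised above for the MIDI output path (using the mido library as an example; the port name is hypothetical and AaltoAivo's own output code may differ), the following Python sketch plays a short arpeggio on an external MIDI device:

import time
import mido

with mido.open_output("AaltoAivo Out") as port:        # hypothetical MIDI port name
    for note in (60, 63, 67, 72):                      # C4, Eb4, G4, C5
        port.send(mido.Message("note_on", note=note, velocity=90))
        time.sleep(0.25)
        port.send(mido.Message("note_off", note=note))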
Platform and Software
AaltoAivo is built mostly from scratch using C++, Python and Ruby. There are some C++ libraries which have proved useful, as well as the JUKE framework. My usual workflow is to use Python and Ruby for rapid prototype development and later, when necessary, reimplement CPU-heavy calculations in C++. Many machine learning processes are performed with GPU support.
Since there is a very large amount of data collected as sound and MIDI files, it is mandatory to use databases to organize it. For practical reasons I decided that there is no advantage in storing the original sound files in the database, because it would make the database too large for my purposes. Instead, the sound files are carefully organized in a hierarchical directory structure, and the database is used only as an index for it, while it also contains all the association data, model data etc.
After considering many alternatives, I found that TinySQL is sufficient as a database for this purpose. Although it lacks certain features of more advanced databases, it has several advantages over them. First, TinySQL is serverless: there is no server software to install, which makes it portable. TinySQL is also very robust and fast, proven in millions of use cases in real-time applications. It has all the necessary features of an SQL database, and it even supports NoSQL-like JSON data. If necessary, it is possible to transfer all accumulated data to another database system.
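The indexing idea can be sketched as follows in Python, here using the built-in sqlite3 module as a stand-in for the embedded database; the table layout and file paths are invented for this example and are not AaltoAivo's actual schema.

import sqlite3

con = sqlite3.connect("aaltoaivo_index.db")
con.execute("""
CREATE TABLE IF NOT EXISTS samples (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,              -- location in the hierarchical directory tree
    bpm REAL,
    traditional_avantgarde REAL,     -- the eight normalized dimensions (two shown here)
    organic_synthetic REAL,
    tags TEXT                        -- freeform associative data stored as JSON text
)""")
con.execute(
    "INSERT INTO samples (path, bpm, traditional_avantgarde, organic_synthetic, tags) "
    "VALUES (?, ?, ?, ?, ?)",
    ("strings/cello_phrase_01.wav", 120.0, 0.2, 0.85, '{"genre": "cinematic", "mood": "dark"}'),
)
con.commit()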
As mentioned before, this system uses the Sonic Pi software for output, MIDI connections and Ableton Link networking. It is possible that I may recreate the output, MIDI connections etc. as a JUKE-based application, but that is not my focus at this moment. Using Sonic Pi has some limitations, but its existence saves me a lot of development time, which I can use for AaltoAivo development.
At this moment AaltoAivo takes up 45 terabytes of disk space. The software itself is quite compact, but the raw learning data, annotation data and model data are large. It took thousands of hours to produce and organize the raw learning data, which I had to do myself, using lots of MIDI automation in Ableton Live and Steinberg Cubase, as well as MIDI automation in several hardware studio instruments and synthesizers.
Future Development
I hope I am able to continue this development. Understandably, this is a tremendous task for a single developer, and my resources are limited. I also have other projects and responsibilities which take considerable time. On the other hand, it does not matter whether it takes a long time or not, because I have not promised to release this in any form as a commercial or non-commercial product. This is just an intellectual puzzle that I want to solve.