Idea for Better Multicore CPU Support

Magnus · July 31, 2022, 9:43am

This is an idea for greatly improving the multicore CPU support of BitWig and other DAWs. It is such a pity to have 4 or more CPU cores on modern computers and only be able to use 1 of them for audio processing, because the DAW has trouble distributing the processing across multiple cores. This problem is only going to get worse in the future, because CPU clock speeds will stagnate while core counts increase.

I think I have a good solution for this problem. I am a computer scientist myself and I do sometimes work with parallel programming, but my field of expertise is quite different from how audio-processing is done, so it is possible I have misunderstood something.

The Problem

The BitWig user-guide doesn’t describe how multicore CPU allocation is done, and the GUI only shows the total CPU usage, not the allocation of tracks and plugins across the multiple CPU cores.

The Ableton manual has a short section describing how chained plugins (plugin A going into plugin B, going into plugin C, etc.) need to be placed on a single CPU core because the processing of one plugin depends on the output of the previous plugin.

I think BitWig does the same, because when I have e.g. a couple of CPU heavy plugins on FX Sends, and I send audio from all tracks to those FX Sends (e.g. a little reverb on all tracks), then BitWig shows nearly 100% CPU usage (and sometimes breaks up the audio when it exceeds 100% CPU usage), even though the Windows CPU monitor only shows e.g. 15-20% CPU usage. So I think BitWig has placed all processing on a single CPU core. I think the same happens when I have plugins on the master-bus.

I believe there is a clever way to distribute the computation to multiple CPU cores, even though the plugins depend on the output from previous plugins.

Example Scenario

Let us say we have several audio-tracks going into both the master bus and a single FX send which then goes into the master bus, and all of these channels have plugins. How can we make all the audio tracks and FX bus process in parallel on multiple CPU cores?

Solution 1: Double the block-size

A block refers to the audio-buffer that the sound-card driver requests that BitWig processes e.g. consisting of 480 samples, which corresponds to 10 msec at a 48 kHz sample-rate. This block-processing is done for efficiency, as the overhead would be too great if BitWig and all the plugins were only processing a single sample at a time.

So in this solution, the audio tracks are processing audio for block T, which gets queued for processing in the FX bus for block T+1. So in the next block, the audio tracks will process audio for block T+1, while the FX bus will process the queued audio from the audio tracks in the previous block. This makes the processing of the audio tracks and FX bus happen in parallel. The only problem is that the output of the FX bus is delayed by 1 block, so when sending the audio tracks to the master bus, we also need to delay their audio by 1 block, so the timing matches with the FX bus.

This should work and it would allow us to have parallel processing at a penalty of 1 extra block’s latency, which may be acceptable for some people.
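To make the data flow of Solution 1 concrete, here is a minimal Python sketch. The functions `track_plugin` and `fx_plugin` are hypothetical stand-ins for real plugins, and the loop runs the two stages one after the other; the point is only that within one step neither call depends on the other's result, so they could be placed on two cores.

```python
from typing import List

Block = List[float]

def track_plugin(block: Block) -> Block:
    # Hypothetical plugin on the audio track: a simple gain.
    return [0.5 * x for x in block]

def fx_plugin(block: Block) -> Block:
    # Hypothetical plugin on the FX send: another simple gain.
    return [0.25 * x for x in block]

def run_serial(inputs: List[Block]) -> List[Block]:
    """Reference: the FX send waits for the track within the same block."""
    master = []
    for block in inputs:
        dry = track_plugin(block)
        wet = fx_plugin(dry)
        master.append([d + w for d, w in zip(dry, wet)])
    return master

def run_pipelined(inputs: List[Block]) -> List[Block]:
    """Track stage works on block T while the FX stage works on block T-1."""
    queued = [0.0] * len(inputs[0])  # track output queued in the previous step
    master = []
    for block in inputs:
        dry_now = track_plugin(block)   # stage 1: block T
        wet_prev = fx_plugin(queued)    # stage 2: block T-1 (independent of stage 1)
        # Delay the dry signal by one block so it lines up with the wet signal.
        master.append([d + w for d, w in zip(queued, wet_prev)])
        queued = dry_now
    return master
```

Under these assumptions the pipelined output is the serial output shifted by exactly one block, with one block of silence at the start.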

However, the problem still exists if we have more “layers” of dependent processing, e.g. some FX buses sending their output to some other FX buses, or if there are plugins on the master-bus that we want to run on its own CPU core. We would have to add another block of latency for each “layer” of dependent processing like this. So we could end up with 3-4-5 times the original block-size’s latency, which is probably not acceptable if the original block-size is already 10 msec.

Although we would get much better CPU multicore utilization, the cost would be a big increase in latency.

Solution 2: Mini-blocks

Instead of processing the full block-size required by the sound-card, it can be split into smaller block-sizes for the internal processing. For example, if the original block-size is 480 samples, we could split this into 10 mini-blocks of only 48 samples each.

For a single “layer” of dependent processing, e.g. a single FX bus being processed on its own CPU core, this would only increase the overall latency by 10%. Even with 5 “layers” of dependent processing, e.g. FX buses going into other FX buses, etc., we would still only increase the overall block-size by 5 * 48 = 240 samples, so a 50% increase in overall latency. So we would get much better CPU multicore utilization, at a fairly small penalty in overall latency.
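The latency figures above can be checked with a few lines of Python. This is only the arithmetic from the text, assuming every "layer" of dependent processing adds exactly one (mini-)block of latency; the function names are made up for illustration.

```python
def added_latency_samples(layers: int, pipeline_block: int) -> int:
    # One extra block of the pipeline's block size per dependent layer.
    return layers * pipeline_block

def added_latency_percent(layers: int, pipeline_block: int, main_block: int) -> float:
    # Extra latency relative to the block size requested by the sound card.
    return 100.0 * added_latency_samples(layers, pipeline_block) / main_block

# Solution 2: 48-sample mini-blocks inside a 480-sample main block.
one_layer = added_latency_percent(1, 48, 480)    # 10.0
five_layers = added_latency_percent(5, 48, 480)  # 50.0

# Solution 1 for comparison: the pipeline block is the full 480-sample block.
five_layers_full = added_latency_percent(5, 480, 480)  # 500.0
```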

Because the block-sizes are much smaller, the processing overhead is going to be bigger as well, because the DAW has to call the plugins e.g. 10 times per block instead of just once. Exactly how much this penalty is would require experimentation, because computers are so advanced with CPU caches etc. that the result can be surprising. But I imagine that for CPU heavy plugins, the extra overhead of running smaller mini-block-sizes is going to be negligible, because most of the processing-time is spent inside the plugin and not in the “boiler-plate” code.

We may also worry about stability when decreasing the block-size to 10% of its original. This would again require experimentation to see if it really is a problem. My guess is that it is not a problem. I think that because the main block-size is unchanged, the system will be just as stable as it was when it was processing the full block-size required by the sound-card. The sound-card doesn’t know that the DAW is processing 10 mini-blocks to fill 1 block. I think the only drawback is the extra overhead that was discussed above.
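As a small illustration of why the sound card need not notice, the Python sketch below feeds one 480-sample block to a stateful plugin either in a single call or as 10 mini-blocks of 48 samples. `OnePoleLowpass` is a hypothetical plugin that keeps state between calls. The sketch assumes the plugin's output does not depend on how the stream is cut into blocks; that is true for this toy filter, but it is an assumption for real plugins.

```python
from typing import List

class OnePoleLowpass:
    """Hypothetical stateful plugin: y[n] = a * x[n] + (1 - a) * y[n-1]."""

    def __init__(self, a: float = 0.25) -> None:
        self.a = a
        self.prev = 0.0  # filter state, carried over between process() calls

    def process(self, block: List[float]) -> List[float]:
        out = []
        for x in block:
            self.prev = self.a * x + (1.0 - self.a) * self.prev
            out.append(self.prev)
        return out

def process_in_mini_blocks(plugin: OnePoleLowpass, block: List[float],
                           mini_size: int) -> List[float]:
    # Call the plugin once per mini-block, in order, and concatenate the results.
    out: List[float] = []
    for start in range(0, len(block), mini_size):
        out.extend(plugin.process(block[start:start + mini_size]))
    return out

block = [float(i % 7) for i in range(480)]
full = OnePoleLowpass().process(block)
mini = process_in_mini_blocks(OnePoleLowpass(), block, 48)
```

Because the mini-blocks are processed in order and the state carries over, `full` and `mini` contain the same 480 samples; only the number of calls differs.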

One Track - Multiple CPU Cores

Using the mini-block method, it is also possible to distribute the computation of a single track onto multiple CPU cores. This may sound strange at first, but there is really no difference between a single track and the scenario above with FX buses, when distributing the computation onto multiple CPU cores using mini-blocks.

This would allow us to have several CPU heavy plugins on a single track, e.g. where each plugin uses 90% of a single CPU-core. Because they are run in parallel on different CPU cores using mini-blocks, we can have several of these plugins chained together, and the penalty is only going to be a small increase in latency, depending on the length of the plugin-chain and the size of the mini-blocks.

Allocation Algorithm

Using this method with mini-blocks would probably make it possible to allocate the computation of all tracks, FX buses, and plugins across multiple CPU cores to nearly 100% utilization.

However, it is probably not going to be trivial to write a good algorithm for this. Not only is it going to make the audio-engine even more complicated, but it is essentially what is known as a “bin-packing” problem in computer science, which is known to be very difficult to find an optimal solution for - and this is even more difficult because the CPU load of each plugin may vary over time.

But it should be possible to make a pretty good allocation-algorithm that may not be able to squeeze 100% out of the multiple CPU cores, but if it could just get to 80-90% efficiency that would still be great, and certainly much better than only being able to use 1 CPU core!
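For a feel of what a "pretty good" allocation could look like, here is a sketch of a standard greedy heuristic (longest processing time first): sort the plugins by load and always give the next one to the least-loaded core. This is a textbook heuristic swapped in for illustration, not something from the paper. The plugin names and load figures are invented, loads are treated as a fixed snapshot in percent of one core, and dependencies between plugins are ignored.

```python
import heapq
from typing import Dict, List, Tuple

def allocate_lpt(loads: Dict[str, int], cores: int) -> List[Tuple[int, List[str]]]:
    """Greedy allocation; returns (total_load, plugin_names) for each core."""
    # Heap entries are (total_load, core_index, names); core_index breaks ties.
    heap: List[Tuple[int, int, List[str]]] = [(0, i, []) for i in range(cores)]
    heapq.heapify(heap)
    for name, load in sorted(loads.items(), key=lambda kv: kv[1], reverse=True):
        total, idx, names = heapq.heappop(heap)  # least-loaded core so far
        names.append(name)
        heapq.heappush(heap, (total + load, idx, names))
    return [(total, names) for total, _idx, names in sorted(heap, key=lambda e: e[1])]

# Hypothetical per-plugin loads in percent of a single core.
loads = {"reverb": 45, "synth": 30, "comp": 25, "eq": 20, "delay": 15}
plan = allocate_lpt(loads, cores=2)
```

With these made-up numbers the two cores end up at 65% and 70%. A real allocator would also have to re-balance as the loads change over time, which this sketch does not attempt.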

Questions?

I thought about writing a little paper with this idea, making flowcharts and diagrams that show how it would work. But I really don’t have the time. So I hope the idea is clear from the description above.

If people are skeptical about how this would work, then I would encourage you to try and draw on a piece of paper a simple 2-core CPU system with audio-blocks for different time-steps, to see how you can parallelize the computation of two sequential plugins, simply by queueing the output of the first plugin for processing by the second plugin during the next block. And then just do this with mini-blocks to lower the latency. And if you need to mix the dry/wet signals, then you also need a delay on the dry signal to compensate for the extra latency.

If the BitWig software engineers have already thought about this, but there is a good reason why you don’t do this already, then I would be very curious to hear the reason.

icaria36 · August 1, 2022, 5:03am

Thank you for sharing these thoughts. This goes well above my knowledge. Have you shared it with the Bitwig team?

mauiwayne · August 1, 2022, 11:12am

Hey I think this is an excellent idea…the reason is, I recently started using a Multicore CPU Support plugin called “Unify” - made by PlugInGuru.com. It basically functions like a highly cpu efficient multi-vst “host” within your DAW (or stand alone). With Unify, I can easily use 2 dozen+ more virtual instruments & effects…all playing simultaneously, without glitching or crashing. Before, I could only run a handful of plugins before things would start to bog or crash. So having multicore CPU support makes a big difference. It’s actually amazing!

lokanchung · August 4, 2022, 11:10am

Interesting ideas, but this is more from an engineer’s perspective than an actual user’s.
Most people don’t care about internal details, but it would still be useful to visualize multicore utilization for those who want to optimize their projects.

Some of my thoughts on the ideas:
Smaller blocks imply less memory locality, so processing time will never be halved with half-sized blocks. Multithreading also comes at a cost, which means two CPU cores don’t give half the processing time. Summing up, we won’t know how effective these ideas are until a PoC is made and benchmarked. It sounds worth trying though.

JakeX · January 13, 2023, 5:32pm

I read about this problem in Ableton some time ago too, so I ran a check in Bitwig and came to the conclusion that Bitwig does use multiple cores for a single track.

In Windows, somewhere in the Task Manager, you can see the load of the individual cores. If I use just a single track, the load is still evenly distributed across all cores.
From that we can conclude that Bitwig does not have Ableton’s problem but does in fact already use all CPU cores in a good way. (Right?)

I also noticed that Bitwig sometimes shows 100% usage when the Task Manager reports much less, and I don’t know why. But according to my tests it is NOT because of bad multi-core management.

Magnus · January 13, 2023, 7:53pm

People are apparently still reading this thread.

@JakeX I’m not entirely sure what you mean, but if BitWig is showing nearly 100% CPU usage and the Windows CPU meter is only showing e.g. 20% CPU usage (as is the case for me), then it means that BitWig is NOT utilizing all the CPU cores, otherwise these two numbers would both be close to 100%. If you exceed 100% CPU usage in BitWig’s meter, you will probably start hearing clicks and pops in the audio as it is struggling to process everything in time.

Let me give an update on this topic as many people seem to be interested in improving the multi-core CPU usage of DAW’s.

I recently met a guy who was mixing classical music in another DAW. He had bought a 12-core CPU - and I don’t think he was wealthy so it was a lot of money for him - just to become supremely disappointed that his DAW still only used a single CPU core.

When I first came up with this idea, I contacted BitWig support and they asked the dev-team who replied that it was not possible, because otherwise everyone else would already be doing it.

But it IS possible, so when I had the time, I made a small example to show how it works, which you can see here:

I do have a PhD in computer science, but it is a long time ago that I was a student. Parallel computation was not common back then, and I don’t recall having ever learned about how to make Parallel Pipelines specifically. But I recently talked with an Intel developer on GitHub, who said it is actually a typical Parallel Pipeline that I had re-invented here.

But the response of the BitWig dev-team, and the generally poor support for multi-core CPU’s in all the DAW’s, suggests that Parallel Pipelines are more or less unknown amongst DAW developers. So I also wrote the other big DAW developers (Ableton, Cubase, Native Instruments, etc.), because it is a pity that DAW’s cannot properly utilize multi-core CPU’s.

The only person who replied was the original developer of Reaper, who responded within one hour on a Sunday. He said that they had considered doing something similar in Reaper, but they didn’t do it for two reasons: (1) They already have a system that can improve parallel computation, and (2) he didn’t think it was necessary for most users.

I disagree with his second point, as I run into this problem constantly with BitWig, even with just a few plugins - e.g. the Waves plugins are really CPU hungry. It is not entirely clear when it occurs in BitWig, but it seems to be when I use heavy plugins on the FX tracks, and especially if there are heavy plugins on the master-bus. This is consistent with the description in Ableton’s manual of how they can only put connected processing chains on individual CPU cores, and they cannot split a processing chain across multiple CPU cores.

Furthermore, if DAW’s had better multi-core CPU support, it would allow plugin developers to make far more complicated plugins than they can make today, because they don’t have to worry so much about maxxing out a single CPU core. You would also be able to run higher over-sampling on more of your plugins in real-time.

Because BitWig is so flexible in its audio routing, it probably has a very complicated “computational graph”, which is basically a diagram of where the audio-data flows and which plugins must process them and in what order. It might be difficult to make BitWig’s complicated audio engine fully utilize Parallel Pipelines. But I still think that they can make it work for most relevant use-cases, e.g. when using heavy plugins on aux-tracks and the master-bus.

A lot of the big DAW’s were made a long time ago and they seem to have stagnated in their development. So if this is going to be made, it probably has to be BitWig, whose devs are still innovating at an impressive rate. I also e-mailed Behringer with the idea, so perhaps they will build it into their new DAW from the beginning. Let’s hope!

psycha0s · January 23, 2023, 2:34pm

Magnus wrote: “if BitWig is showing nearly 100% CPU usage and the Windows CPU meter is only showing e.g. 20% CPU usage (as is the case for me), then it means that BitWig is NOT utilizing all the CPU cores”

Bitwig doesn’t show the CPU load, what it shows is the ratio of the time spent by your CPU rendering the audio buffer to the length of that audio buffer as a percentage. So you’re right, it doesn’t utilize all the CPU cores, but it does it for a reason.
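The difference between the two meters can be sketched in Python. This is a rough model of the ratio described above, not Bitwig's actual code; the function names, the 8-core machine, and the 9 ms render time are assumptions for illustration.

```python
from typing import List

def dsp_load_percent(render_seconds: float, block_samples: int, sample_rate: int) -> float:
    # Time spent rendering one buffer, relative to the duration of that buffer.
    buffer_seconds = block_samples / sample_rate
    return 100.0 * render_seconds / buffer_seconds

def system_cpu_percent(per_core_busy_percent: List[float]) -> float:
    # Rough model of an OS-wide meter: the average over all cores.
    return sum(per_core_busy_percent) / len(per_core_busy_percent)

# Hypothetical: a 480-sample buffer at 48 kHz (10 ms) takes 9 ms to render,
# with all of the work landing on one core of an 8-core machine.
daw_meter = dsp_load_percent(0.009, 480, 48000)        # about 90
os_meter = system_cpu_percent([90.0] + [0.0] * 7)      # 11.25
```

In this made-up case the DAW meter is close to its limit while the OS meter reports a mostly idle machine, which matches the pattern reported earlier in the thread.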
Plugins and Bitwig devices are connected together and form a directed (I believe) acyclic graph. It’s impossible to render all graph nodes at once due to its topology, because some nodes depend on the other nodes. The best thing we can do is to perform the topological sort to get the correct rendering order, and then to render some nodes in parallel. But most likely there will always be bottlenecks, when all cores would wait for a single core to finish rendering its node. IIRC, I heard that Ableton, FL Studio and Bitwig use the same strategy - they allocate a CPU core per track. If it’s true, then there is definitely room to improve, but personally, I wouldn’t expect a significantly better performance.
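A small Python sketch of that strategy, using the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to group nodes into levels that could be rendered in parallel. The graph is a made-up example with two tracks, one FX send, and a master bus; the node names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

def parallel_levels(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into levels; all nodes in one level can render in parallel.

    `deps` maps each node to the set of nodes whose output it needs.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        levels.append(ready)
        sorter.done(*ready)
    return levels

# Two tracks feed an FX send; tracks and FX send all feed the master bus.
graph = {
    "fx_send": {"track_1", "track_2"},
    "master": {"track_1", "track_2", "fx_send"},
}
levels = parallel_levels(graph)
```

Here only the first level has more than one node. The FX send and the master bus each form a level of their own, which is the kind of bottleneck described above.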

Magnus · January 25, 2023, 4:46pm

I repeat: It IS possible to make a Parallel Pipeline of serially dependent processing of streaming data - such as audio data in a DAW. It is very counter-intuitive and that is probably why no DAW is currently doing it. As I said before, the guy who made Reaper understood the idea very well, but he had other reasons for not implementing it in Reaper.

BitWig is very flexible in its audio routing, so it is probably not a trivial task to make this work in BitWig, and there may have to be limitations on what is possible, e.g. when side-chaining is being used.

The paper explains how it works. Please read it before claiming that it’s not possible. I wrote it because so few people seem to know about this parallel method.

I probably won’t respond anymore to this thread unless someone says something new and interesting. Just read the paper and look at the example source-code if you want to know how it works, and then start lobbying the DAW developers to implement it, so we can finally use our multi-core CPU’s properly.

soerensen3 · January 27, 2023, 4:36pm

Hi,

I am sorry I couldn’t read all of your paper because I did not have the time, but I think I more or less understood the idea. A couple of things came to mind that I think you didn’t address. I have to admit I’m not an expert in concurrent programming either, but I have developed a plugin, so I know a bit about how plugins interact with the DAW. I assume that the developers of Bitwig (and other DAWs) understand very well what they are doing, and that the reason they implemented multithreading mostly on a per-track basis is that the situation is far more complex than you describe.

You covered the dependency of a side chain on other tracks. However, this is just one kind of dependency, not the only one. Each device in a chain depends on the device before it. Depending on the effect plugin, each time step may also depend on the step before it. Imagine for example a delay effect that needs to reference a buffer from a previous time step to mix in the delayed signal. So even if you split your buffer into two parts, the first part must be fully processed by the time you start processing the second part. Even if some effects do not need to know about previous time steps (like Distortion), there is no reliable way for a DAW to tell whether this is the case. So you have to assume the worst.
You have the same problem with probably all VST instruments that receive MIDI data. They most likely store a key state to know whether a key is pressed or released, so they can output a sound on key press or release a note. A MIDI message for a note usually consists of Note On and Note Off events. So the instrument needs to process all MIDI events in the right order for its key state (which key is currently pressed) to be correct. So you cannot just reverse the order or parallelize the calculation.

To understand it you can imagine a picture which you want to blur. If you slice it in 4 parts you process in parallel there would be artifacts on the edges since each pixel also has an influence on pixels in their neighborhood.

You do not address how you run third party code from a VST in parallel. You have to assume that the code is not thread safe. So if you have a state that is stored inside the plugin instance you cannot access it from two threads at the same time. I do not only mean plugin parameters but also buffers or the key state mentioned before. If parallel execution of the process function from the same instance of a plugin is not possible the other option would be to make one instance for each thread (with one thread per mini block) but this would result in having one state per thread, which will probably not work for the majority of plugins.

So because of these problems, the only safe option I see is to make one thread per track, because the instance of the plugin logically belongs to a track. This is what Bitwig and probably other DAWs already do. The situation might also be different for internal plugins.

I hope I have expressed it more or less understandably but otherwise feel free to ask.

Magnus · January 30, 2023, 6:32pm

(Sigh) I said I wouldn’t respond anymore, but it annoys me that people are trying to shoot down a beautiful solution with bad arguments.

You say you haven’t read the paper, but you believe that you understand the idea. But you obviously don’t understand the concept of a Parallel Pipeline. You are basically making the same argument as the others in this thread, and the same argument as e.g. the Ableton developers have made in their manual: That there is a serial dependency between connected plugins, so they must be computed in series.

But what you all fail to realize, is that precisely because the plugins are working on streaming data, you can buffer the output of one plugin, and use the previous buffer as the input to the second plugin, and thereby have two serially connected plugins execute in parallel on a multi-core CPU, and still have them working on the data in the correct order. The only penalty is that it introduces one extra buffer of latency, but that can be partially alleviated by using mini-buffers instead.
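Here is a minimal Python sketch of that buffering scheme, with two hypothetical plugins `plugin_f` and `plugin_g` running on their own threads and a one-slot queue between them. It only illustrates the data flow and ordering: CPython threads do not give a real speed-up for pure-Python code, and a real audio engine would be clocked by the sound-card callback rather than running freely like this.

```python
import queue
import threading
from typing import Callable, List

Block = List[float]

def plugin_f(block: Block) -> Block:
    # Hypothetical first plugin in the chain.
    return [x + 1.0 for x in block]

def plugin_g(block: Block) -> Block:
    # Hypothetical second plugin; consumes the output of plugin_f.
    return [x * 2.0 for x in block]

def stage(plugin: Callable[[Block], Block], inbox: queue.Queue, outbox: queue.Queue) -> None:
    # Each stage runs on its own thread and is the only caller of its plugin.
    while True:
        block = inbox.get()
        if block is None:        # sentinel: end of stream
            outbox.put(None)
            return
        outbox.put(plugin(block))

def run_pipeline(blocks: List[Block]) -> List[Block]:
    q_in: queue.Queue = queue.Queue()
    q_mid: queue.Queue = queue.Queue(maxsize=1)  # the one extra buffer of latency
    q_out: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(plugin_f, q_in, q_mid)),
        threading.Thread(target=stage, args=(plugin_g, q_mid, q_out)),
    ]
    for t in threads:
        t.start()
    for block in blocks:
        q_in.put(block)
    q_in.put(None)
    results = []
    while (out := q_out.get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results

result = run_pipeline([[1.0], [2.0], [3.0]])
```

While `plugin_g` is working on one block, `plugin_f` can already work on the next one, and the queues keep the blocks in their original order.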

Read Example 1 in the paper VERY carefully! That is the simplest example, which takes two serially connected plugins F and G and runs them in parallel, while preserving the correct order of the computation, so the output of plugin F goes into plugin G. If you don’t fully understand how that simple example works, then you don’t understand the method!

Regarding MIDI I am not exactly sure how that works. I think the DAW provides the plugin with a list of the MIDI and automation events for the buffer being processed, so it’s basically just a buffer like the one with audio-data, so the Parallel Pipeline will work just as well for the MIDI and event-buffer.

Regarding thread-safety it isn’t relevant, because we are only running each plugin once. That is, we are running the two plugins F and G in parallel, but we are not running F twice simultaneously.

I agree that this parallelization method is very counter-intuitive - and maybe even a bit ingenious if I were to flatter myself. And that is precisely why it is so annoying that people keep trying to shoot it down with the same objections that the method was specifically designed to solve.

psycha0s · February 1, 2023, 12:02pm

One buffer of latency per plugin in the chain. The longer the chain of serially connected plugins, the bigger the latency. You could argue it’s not a big deal, but this way we would render in advance and enqueue lots of buffers. What should the audio engine do if the user turns a knob or plays some notes? The only possible option is to discard everything we have already rendered and render it again using the new MIDI data. It’s obvious that this would introduce an audible delay in the playback. What if the user is recording from the line input and wants everything to sound in sync? It’s impossible to buffer anything in advance for the chain that processes the signal being recorded. Also, PDC (plugin delay compensation) makes everything way more complex.
I’m just sharing my doubts according to my experience and I don’t try to shoot down your idea at all.
TBH, I haven’t read all of your paper either, so I hope someone would read and analyze it and prove you’re right.

Magnus · February 3, 2023, 9:30am

(DEEP Sigh!) You also haven’t bothered to read the paper that you are trying to refute. And in the very part of my previous post that you are quoting, I am saying what the solution is to your objection: To use mini-buffers.

This is explained in more detail in Section 10 of the paper, and Section 11 discusses the “graph conversion algorithm” where it is obviously not necessary to have a Parallel Pipeline after every single plugin. For starters it could be used on grouped tracks, aux tracks, and the master bus, and then only when it is necessary because the CPU is overloaded. This would only introduce a few buffering layers for the Parallel Pipelines, and by using mini-buffers the extra latency would only be a few milli-seconds total (single digits), which is a tiny price to pay for much greater utilization of multi-core CPU’s.
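As a rough check of the "few milli-seconds" figure, here is the arithmetic in Python. It reuses the 48-sample mini-buffer and 48 kHz sample rate from the first post; the layer counts are assumptions, and each buffering layer is assumed to add exactly one mini-buffer of latency.

```python
def pipeline_latency_ms(layers: int, mini_buffer_samples: int, sample_rate: int) -> float:
    # One mini-buffer of extra latency per buffering layer, converted to milliseconds.
    return 1000.0 * layers * mini_buffer_samples / sample_rate

three_layers_ms = pipeline_latency_ms(3, 48, 48000)  # 3.0
five_layers_ms = pipeline_latency_ms(5, 48, 48000)   # 5.0
```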

All of the objections and assertions that people have made in the comments above, were already solved and described in the short paper, which should be very easy to read for people with proper training in computer science.

I can now fully understand why the BitWig developers will not interact directly with the community, and all of their communication has to go through their support staff. What we have seen on full display in this thread is known as the “Dunning-Kruger effect” in psychology, and it is incredibly annoying to be at the receiving end of that.

I am now going to ask the admin to lock this thread, as I have much more important work to do, than basically repeating “RTFM!” to random anonymous people.

If you manage to squeeze in another post before the admin locks this thread, then please be so good and write your full name with links to your work-related profiles (e.g. GitHub and LinkedIn) so I can see who I am talking to.

icaria36 · February 3, 2023, 9:55am

Ok, thank you everyone for your participation in this discussion. As a #brainstorm it is probably exhausted. If anyone is interested in improving multicore CPU support for Bitwig, the next logical step is to contact the Bitwig team directly, and you can point them to the discussion here. Surely someone in their team has the knowledge and the capacity to decide what could or could not be improved in their product.