Idea for Better Multicore CPU Support

This is an idea for greatly improving the multicore CPU support of BitWig and other DAWs. It is such a pity to have 4 CPU cores or more on modern computers and only be able to use 1 CPU core in audio processing, because the DAW is having trouble allocating the processing across multiple CPU cores. This problem is only going to get worse in the future, because CPU clock-speeds will stagnate while the number of CPU cores keeps increasing.

I think I have a good solution for this problem. I am a computer scientist myself and I do sometimes work with parallel programming, but my field of expertise is quite different from how audio-processing is done, so it is possible I have misunderstood something.

The Problem

The BitWig user-guide doesn’t describe how multicore CPU allocation is done, and the GUI only shows the total CPU usage, not the allocation of tracks and plugins across the multiple CPU cores.

The Ableton manual has a short section describing how chained plugins (plugin A going into plugin B, going into plugin C, etc.) need to be placed on a single CPU core because the processing of one plugin depends on the output of the previous plugin.

I think BitWig does the same, because when I have e.g. a couple of CPU heavy plugins on FX Sends, and I send audio from all tracks to those FX Sends (e.g. a little reverb on all tracks), then BitWig shows nearly 100% CPU usage (and sometimes breaks up the audio when it exceeds 100% CPU usage), even though the Windows CPU monitor only shows e.g. 15-20% CPU usage. So I think BitWig has placed all processing on a single CPU core. I think the same happens when I have plugins on the master-bus.

I believe there is a clever way to distribute the computation to multiple CPU cores, even though the plugins depend on the output from previous plugins.

Example Scenario

Let us say we have several audio-tracks going into both the master bus and a single FX send which then goes into the master bus, and all of these channels have plugins. How can we make all the audio tracks and FX bus process in parallel on multiple CPU cores?

Solution 1: Double the block-size

A block refers to the audio-buffer that the sound-card driver requests that BitWig processes e.g. consisting of 480 samples, which corresponds to 10 msec at a 48 kHz sample-rate. This block-processing is done for efficiency, as the overhead would be too great if BitWig and all the plugins were only processing a single sample at a time.
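As a quick check of the numbers above, the duration of one block is simply the block-size divided by the sample-rate. The function name below is made up for illustration.

```python
def block_latency_ms(block_size: int, sample_rate_hz: int) -> float:
    """Duration of one audio block in milliseconds."""
    return 1000.0 * block_size / sample_rate_hz

print(block_latency_ms(480, 48_000))  # 10.0
```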

So in this solution, the audio tracks are processing audio for block T, which gets queued for processing in the FX bus for block T+1. So in the next block, the audio tracks will process audio for block T+1, while the FX bus will process the queued audio from the audio tracks in the previous block. This makes the processing of the audio tracks and FX bus happen in parallel. The only problem is that the output of the FX bus is delayed by 1 block, so when sending the audio tracks to the master bus, we also need to delay their audio by 1 block, so the timing matches with the FX bus.
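The queueing described above can be sketched in a few lines of Python. This is only a toy model, not how BitWig or any real DAW is implemented: `fx`, `run`, and the two queues are hypothetical names, the "FX bus" just halves the signal, and the track blocks are assumed to be already processed by the track plugins.

```python
from collections import deque

def fx(block):
    """Stand-in for the FX-bus plugins (hypothetical: halves the signal)."""
    return [0.5 * s for s in block]

def run(track_blocks):
    """Master-bus output when the FX bus runs one block behind the tracks."""
    silence = [0.0] * len(track_blocks[0])
    fx_queue = deque([silence])    # FX input queued during the previous block
    dry_delay = deque([silence])   # dry signal delayed by the same one block
    master = []
    for block in track_blocks:
        # The FX work below only touches data from block T-1, while the
        # tracks produce block T, so a real engine could run both in parallel.
        wet = fx(fx_queue.popleft())   # "core 2": FX bus works on block T-1
        fx_queue.append(block)         # "core 1": track output for block T is queued
        dry_delay.append(block)
        dry = dry_delay.popleft()      # dry signal from block T-1
        master.append([d + w for d, w in zip(dry, wet)])
    return master

print(run([[1.0, 2.0], [3.0, 4.0]]))  # [[0.0, 0.0], [1.5, 3.0]]
```

The first output block is silence because both paths are one block late, which is exactly the 1-block latency penalty.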

This should work and it would allow us to have parallel processing at a penalty of 1 extra block’s latency, which may be acceptable for some people.

However, the problem still exists if we have more “layers” of dependent processing, e.g. some FX buses sending their output to some other FX buses, or if there are plugins on the master-bus that we want to run on their own CPU core. We would have to add another block of latency for each “layer” of dependent processing like this. So we could end up with 3 to 5 times the original block-size’s latency, which is probably not acceptable if the original block-size is already 10 msec.

Although we would get much better CPU multicore utilization, the cost would be a big increase in latency.

Solution 2: Mini-blocks

Instead of processing the full block-size required by the sound-card, it can be split into smaller block-sizes for the internal processing. For example, if the original block-size is 480 samples, we could split this into 10 mini-blocks of only 48 samples each.

For a single “layer” of dependent processing, e.g. a single FX bus being processed on its own CPU core, this would only increase the overall latency by 10%. Even with 5 “layers” of dependent processing, e.g. FX buses going into other FX buses, etc., we would still only increase the overall latency by 5 * 48 = 240 samples, which is a 50% increase. So we would get much better CPU multicore utilization, at a fairly small penalty in overall latency.
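The arithmetic for both solutions can be written down as a small helper. This assumes, as in the text, one chunk of extra delay per “layer” and a 48 kHz sample-rate; the function and parameter names are made up.

```python
def extra_latency(layers: int, chunk_size: int, block_size: int,
                  sample_rate_hz: int = 48_000):
    """Extra latency from pipelining `layers` dependent stages.

    Returns (samples, milliseconds, fraction of the original block).
    """
    samples = layers * chunk_size
    return samples, 1000.0 * samples / sample_rate_hz, samples / block_size

print(extra_latency(1, 48, 480))    # (48, 1.0, 0.1)     mini-blocks, 1 layer
print(extra_latency(5, 48, 480))    # (240, 5.0, 0.5)    mini-blocks, 5 layers
print(extra_latency(4, 480, 480))   # (1920, 40.0, 4.0)  full blocks, 4 layers
```

With full blocks, 4 layers already add 40 msec on top of the original 10 msec, whereas 5 layers of mini-blocks add only 5 msec.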

Because the block-sizes are much smaller, the processing overhead is going to be bigger as well, because the DAW has to call the plugins e.g. 10 times per block instead of just once. Exactly how much this penalty is would require experimentation, because computers are so advanced with CPU caches etc. that the result can be surprising. But I imagine that for CPU heavy plugins, the extra overhead of running smaller mini-block-sizes is going to be negligible, because most of the processing-time is spent inside the plugin and not in the “boiler-plate” code.

We may also worry about stability when decreasing the block-size to 10% of its original. This would again require experimentation to see if it really is a problem. My guess is that it is not a problem. I think that because the main block-size is unchanged, the system will be just as stable as it was when it was processing the full block-size required by the sound-card. The sound-card doesn’t know that the DAW is processing 10 mini-blocks to fill 1 block. I think the only drawback is the extra overhead that was discussed above.
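To illustrate why the sound-card need not notice the difference, here is a toy example where a stateful “plugin” produces exactly the same output whether it is called once with 480 samples or 10 times with 48 samples. The filter and helper names are hypothetical, and this only holds for plugins whose output does not depend on the block-size they are called with; real plugins would have to be tested.

```python
class OnePoleLowpass:
    """Toy stateful plugin: y[n] = y[n-1] + a * (x[n] - y[n-1])."""

    def __init__(self, a: float = 0.25):
        self.a = a
        self.y = 0.0   # filter state carried over between calls

    def process(self, block):
        out = []
        for x in block:
            self.y += self.a * (x - self.y)
            out.append(self.y)
        return out

def process_in_mini_blocks(plugin, block, mini_size):
    """Fill one driver-sized block by calling the plugin once per mini-block."""
    out = []
    for start in range(0, len(block), mini_size):
        out.extend(plugin.process(block[start:start + mini_size]))
    return out

block = [float(n % 7) for n in range(480)]
whole = OnePoleLowpass().process(block)
split = process_in_mini_blocks(OnePoleLowpass(), block, 48)
print(whole == split)  # True
```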

One Track - Multiple CPU Cores

Using the mini-block method, it is also possible to distribute the computation of a single track onto multiple CPU cores. This may sound strange at first, but there is really no difference between a single track and the scenario above with FX buses, when distributing the computation onto multiple CPU cores using mini-blocks.

This would allow us to have several CPU heavy plugins on a single track, e.g. where each plugin uses 90% of a single CPU-core. Because they are run in parallel on different CPU cores using mini-blocks, we can have several of these plugins chained together, and the penalty is only going to be a small increase in latency, depending on the length of the plugin-chain and the size of the mini-blocks.
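A minimal sketch of such a plugin chain, with one thread per plugin and a queue of mini-blocks between neighbours, could look like this. All names are hypothetical, and several things are simplified: Python threads do not truly run in parallel because of the interpreter lock, and a real-time audio engine would probably avoid blocking queues like these. The point is only to show the data flow.

```python
import queue
import threading

def stage_worker(plugin, inbox, outbox):
    """Run one plugin on its own thread, passing mini-blocks down the chain."""
    while True:
        mini = inbox.get()
        if mini is None:          # sentinel: shut down and tell the next stage
            outbox.put(None)
            return
        outbox.put(plugin(mini))

def run_chain(plugins, mini_blocks):
    """Push mini-blocks through a chain of plugins, one thread per plugin."""
    queues = [queue.Queue() for _ in range(len(plugins) + 1)]
    threads = [
        threading.Thread(target=stage_worker, args=(p, queues[i], queues[i + 1]))
        for i, p in enumerate(plugins)
    ]
    for t in threads:
        t.start()
    for mini in mini_blocks:
        queues[0].put(mini)
    queues[0].put(None)
    results = []
    while (mini := queues[-1].get()) is not None:
        results.append(mini)
    for t in threads:
        t.join()
    return results

def gain(block):
    return [2.0 * s for s in block]

def offset(block):
    return [s + 1.0 for s in block]

minis = [[float(i)] * 4 for i in range(10)]
pipelined = run_chain([gain, offset], minis)
print(pipelined == [offset(gain(m)) for m in minis])  # True
```

While the second plugin works on mini-block T, the first plugin can already work on mini-block T+1, so each extra plugin in the chain costs one mini-block of latency.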

Allocation Algorithm

Using this method with mini-blocks would probably make it possible to allocate the computation of all tracks, FX buses, and plugins across multiple CPU cores to nearly 100% utilization.

However, it is probably not going to be trivial to write a good algorithm for this. Not only is it going to make the audio-engine even more complicated, but it is essentially what is known as a “bin-packing” problem in computer science, for which finding an optimal solution is known to be NP-hard - and this is even more difficult because the CPU load of each plugin may vary over time.

But it should be possible to make a pretty good allocation-algorithm that may not be able to squeeze 100% out of the multiple CPU cores, but if it could just get to 80-90% efficiency that would still be great, and certainly much better than only being able to use 1 CPU core!
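One simple heuristic of this kind is “heaviest first”: sort the plugins by load and always give the next one to the core that is currently least loaded (known as LPT, longest processing time first). The sketch below is not a proposal for the actual engine. The plugin names and loads (in percent of one core) are made up, the loads are treated as fixed, and the dependencies between plugins are ignored.

```python
import heapq

def allocate(plugin_loads: dict, n_cores: int):
    """Greedy heuristic: give each plugin, heaviest first, to the least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]   # (current load, core index)
    assignment = {core: [] for core in range(n_cores)}
    for name, load in sorted(plugin_loads.items(), key=lambda kv: -kv[1]):
        core_load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (core_load + load, core))
    loads = {core: load for load, core in heap}
    return assignment, loads

plugins = {"reverb": 50, "synth": 40, "delay": 30, "comp": 20, "eq": 10}
assignment, loads = allocate(plugins, 2)
print(assignment)  # {0: ['reverb', 'comp', 'eq'], 1: ['synth', 'delay']}
print(loads)       # {0: 80, 1: 70}
```

Because the real loads change over time, a practical version would presumably have to re-measure and re-balance while the project is playing.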

Questions?

I thought about writing a little paper with this idea, making flowcharts and diagrams that show how it would work. But I really don’t have the time. So I hope the idea is clear from the description above.

If people are skeptical about how this would work, then I would encourage you to try and draw on a piece of paper a simple 2-core CPU system with audio-blocks for different time-steps, to see how you can parallelize the computation of two sequential plugins, simply by queueing the output of the first plugin for processing by the second plugin during the next block. And then just do this with mini-blocks to lower the latency. And if you need to mix the dry/wet signals, then you also need a delay on the dry signal to compensate for the extra latency.
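The pencil-and-paper exercise can also be printed as a small table. This is just a sketch with made-up names, assuming each core runs one plugin of the chain and lags one block (or mini-block) behind the core before it.

```python
def schedule(n_blocks: int, n_stages: int = 2):
    """Which block each core works on at every time-step of a simple pipeline."""
    rows = []
    for t in range(n_blocks + n_stages - 1):
        row = []
        for core in range(n_stages):
            b = t - core   # stage `core` lags `core` blocks behind stage 0
            row.append(f"block {b}" if 0 <= b < n_blocks else "idle")
        rows.append(row)
    return rows

for t, row in enumerate(schedule(3)):
    print(f"t={t}: " + " | ".join(f"core {c}: {w}" for c, w in enumerate(row)))
# t=0: core 0: block 0 | core 1: idle
# t=1: core 0: block 1 | core 1: block 0
# t=2: core 0: block 2 | core 1: block 1
# t=3: core 0: idle | core 1: block 2
```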

If the BitWig software engineers have already thought about this, but there is a good reason why you don’t do this already, then I would be very curious to hear the reason.

Thank you for sharing these thoughts. This goes well above my knowledge. Have you shared it with the Bitwig team?

Hey I think this is an excellent idea…the reason is, I recently started using a Multicore CPU Support plugin called “Unify” - made by PlugInGuru.com. It basically functions like a highly cpu efficient multi-vst “host” within your DAW (or stand alone). With Unify, I can easily use 2 dozen+ more virtual instruments & effects…all playing simultaneously, without glitching or crashing. Before, I could only run a handful of plugins before things would start to bog or crash. So having multicore CPU support makes a big difference. It’s actually amazing!

Interesting ideas, but this is more from an engineer’s perspective than an actual user’s.
Most people don’t care about internal details, but it would still be useful to visualize multicore utilization for those who want to optimize their projects.

Some of my thoughts on the ideas:
Smaller blocks imply less memory locality, so processing time will never be halved with half-sized blocks. Multithreading also comes with a cost, which means two CPU cores don’t mean half the processing time. Taken together, we won’t know how effective these ideas are until a proof of concept (PoC) is built and benchmarked. It sounds worth trying, though.
