Why do we need it, when to use it, how to design it (in C++)?
Reading time: 5+ minutes
Design patterns in most general-purpose OOP languages have grown into a vast body of techniques. In this article I will discuss one of the most important concepts in designing a multi-threaded application – the Thread Pool. IMHO the short explanation is this: the Thread Pool serves as both an efficiency booster (minimizing thread switches, maximizing CPU core usage, accelerating processing) and a work manager (helping, and forcing, us to define and classify the jobs and to think about their correct sequence).
——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——-
I come from the land of Embedded Development, where efficiency is usually critical, and years ago most processes and sequences had to be hard-defined in order to achieve maximum productivity and 100% predictability of system behavior. Hence my love for the details of achieving this efficiency in a high-level system. As a result, and in connection with an ongoing project, I recently researched some of the C++ implementations available on GitHub. An important point – I needed it in C++11! I tried to integrate a few of them and hit a few bugs. Finally, a friend and colleague of mine (Ferai Ali) and I sat down to check them, and we concluded that most are not good enough. The reasons: some had open, pending bugs; others had reports of low efficiency; and some of the tests I ran resulted in race conditions and hard-to-debug crashes. On top of that, one of the pools detached its threads at thread pool destruction (effectively giving up all further control). I admit I made mistakes, and I was working through them when Ferai said: “Man, those are not that good and we see all these issues – why don’t we write one of our own? It will be fun and easy :)”. So we sat down, and the result is now on GitHub: https://github.com/AtanasRusevPros/CPP11_ThreadPool/.
——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——- ——-
To start, let’s answer a few general questions.
Why do we need it?
Nowadays even smaller, simpler devices often use multi-core CPUs. In addition, most programming languages and platforms let us create as many threads as we want. So if we have an application with 10 threads that talks to the OS and communicates with, e.g., Bluetooth, the sound driver, the graphics driver, LAN and a few more – each of those drivers will have its own low-level library and most probably additional threads. Even when those drivers are designed in a really efficient manner, just think of a PC or a smartphone with 10-30 applications running simultaneously: if each of them starts 20 threads, we may have hundreds of threads (as actually happens). The OS will also start its own threads, and other processes or drivers may do the same.
The problem is that starting a thread for each new task is not efficient. Each thread has its own stack, memory space, register state and execution pointer. Switching a thread requires the OS to gather all this information, store it somewhere in RAM, load the next thread’s full set of the same data, and then resume it. This means copying from hundreds to tens of thousands of bytes, interrupting the OS each time this must be done, and talking to RAM (not to the CPU cache). The last point is critically important on a PC – RAM is hundreds of times slower than the CPU cache (if interested, search for articles on how to write cache-friendly code).
To conclude – switching threads (one of the main tasks of the OS) is an expensive operation!
One can easily see where the problem is, and a careful search on the internet will soon turn up many programming gurus saying: if you open new threads like hell, without thinking about the load on the OS, you simply should not be doing it – stop and educate yourself.
Where do we need it? When to use it?
We need a Thread Pool in any multi-threaded application, especially if it is meant to be at least somewhat portable. Again – most CPUs nowadays are multi-core.
Second – whenever we think our application might one day run on a multi-core platform, we should design it in a Thread-Pool-friendly manner from the start. This simply means: all tasks shall be well defined in separate functions, and whenever there is sequential work or processing, it shall be managed at least by a simple state machine. The latter ensures we have a means to start, stop, query the state and perform a few other operations via the state machine’s API.
Let’s say we have 8 cores and 50+ separate threads in our app. With no dedicated allocation and no defined job sequences, part of the time a core sits idle due to thread preemption. And if one thread is scheduled to do something that blocks another thread, and the first is preempted, we move to the second, then the third thread’s time slice comes – we accumulate latencies and waits.
Again – each thread switch requires switching the stack and the thread context. For 50 threads this means periodically interrupting the work and spending quite a lot of time on switches, as each blocked thread – scheduled only to check whether its wait condition is finally released – costs time. And if that thread has to check its wait condition 10 times, and for each of these checks we switch to it and then back to other threads, we are simply doing thread switching with no effective work performed!
The result – without predefining when the switches happen (i.e., something closer to a predefined sequence of operations), this behavior interrupts and disrupts the whole processing sequence, leading to a major efficiency decrease.
In my opinion the best (and most advanced) Thread Pool is one that pins its threads to specific CPU cores, with the additional ability to choose whether to use, e.g., Intel Hyper-Threading (so we can test whether it helps our application design or not). This, however, is not easily achieved – C++11 only reports the number of supported hardware threads; it offers no way to identify them, nor to talk to a CPU core and extract information from it. For such operations we need at least OS-specific code plus a C++ CPU library, and sometimes even assembler instructions. There are solutions on the internet; search the Web if you need this in your application.
How to design and test it – the main points of consideration and their impact on efficiency.
- Threads have been part of standard C++ since C++11. C++ itself is CPU-independent, and C++11 provides only std::thread::hardware_concurrency() to get the number of supported hardware threads (not cores), and this value may even be 0 if it is not computable.
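As a quick illustration: because hardware_concurrency() may legitimately return 0, a small fallback wrapper is a common precaution when sizing a pool (the function name here is my own, not part of the standard):

```cpp
#include <thread>

// Returns the number of concurrent hardware threads reported by the
// implementation, or `fallback` when the value is not computable
// (std::thread::hardware_concurrency() may return 0 in that case).
unsigned hardware_threads_or(unsigned fallback) {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```

A pool constructor can then default its thread count to, e.g., hardware_threads_or(4).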
- Normally the regular thread pool concept works with a number of threads exactly matching the number of CPU cores on the machine, or the number of supported hardware threads. If we have a CPU with 4 cores and Intel Hyper-Threading, we will have 8 hardware threads.
- The thread pool assigns jobs to the threads. Jobs are functions; in C++11 they can also be lambdas, class methods or functors – i.e., any callable object.
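A sketch of why “any callable object” is convenient: all of the forms below fit into a single std::function&lt;void()&gt; job type. The Job alias, queue and demo names are illustrative, not the repository’s actual API:

```cpp
#include <functional>
#include <queue>

// A job is any callable with signature void(); std::function erases the type.
using Job = std::function<void()>;

struct Accumulator {            // a functor
    int& total;
    void operator()() { total += 10; }
};

struct Worker {                 // a class whose method we want to run
    int& total;
    void addFive() { total += 5; }
};

// Run every queued job in FIFO order (what a single pool thread would do).
void runAllJobs(std::queue<Job>& jobs) {
    while (!jobs.empty()) {
        jobs.front()();         // invoke the callable
        jobs.pop();
    }
}

int demo() {
    int total = 0;
    std::queue<Job> jobs;
    jobs.push([&total] { total += 1; });          // lambda
    jobs.push(Accumulator{total});                // functor
    Worker w{total};
    jobs.push(std::bind(&Worker::addFive, &w));   // member function
    runAllJobs(jobs);
    return total;   // 1 + 10 + 5 = 16
}
```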
- In addition, the jobs are queued. The state-of-the-art implementation has not only queues but also so-called job stealing: when the jobs on one thread have finished, a pending job is started on whichever thread becomes free next. This ensures the best utilization of the CPU cores.
- Job Queues and Priorities – I would also recommend considering a Thread Pool that has several queues with priorities, e.g. three: for Normal, High and Critical jobs. Onto Normal, put tasks that are not time-critical; onto High, important current tasks; onto Critical, e.g. sending data packets over LAN that must happen periodically and synchronously. Doing this forces you to define the tasks and their priorities – so it forces you toward a good design. The Thread Pool Ferai and I wrote always takes all pending Critical jobs first, then all pending High jobs, and finally checks the Normal jobs. Each thread performs a task and then goes through the queues in this order to take the next one.
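A minimal sketch of that selection order, assuming three plain FIFO deques; the structure and function names are mine, not necessarily those used in the repository, and real pool code would guard these queues with a mutex:

```cpp
#include <deque>
#include <functional>
#include <string>

using Job = std::function<void()>;

// Three FIFO queues; a worker always drains Critical first, then High,
// then Normal.
struct PriorityQueues {
    std::deque<Job> critical, high, normal;

    // Pop the next job honoring priority order; returns false when empty.
    bool pop(Job& out) {
        std::deque<Job>* order[] = { &critical, &high, &normal };
        for (std::deque<Job>* q : order) {
            if (!q->empty()) {
                out = std::move(q->front());
                q->pop_front();
                return true;
            }
        }
        return false;
    }
};

std::string demoOrder() {
    std::string order;
    PriorityQueues pq;
    pq.normal.push_back([&order] { order += 'N'; });
    pq.critical.push_back([&order] { order += 'C'; });
    pq.high.push_back([&order] { order += 'H'; });
    Job j;
    while (pq.pop(j)) j();
    return order;   // jobs run as Critical, High, Normal: "CHN"
}
```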
- Another option is to write a Thread Pool with exactly one job queue per thread, without job stealing. In this case you explicitly put jobs onto each separate thread, taking full control of the execution sequence, thread work allocation and job synchronization between threads. While this is normally not necessary, there are real applications for the approach. However, consider several priorities and, say, 50 jobs taking 20 seconds in total: with the original approach they will be performed in 20 seconds and all cores will be used at 100%. If we have to schedule manually, execution will inevitably be slower, even if only by microseconds, because we will have waits between the jobs to ensure the correct sequence.
So dedicating one queue per thread makes sense only when sequence and specific synchronous tasks are more critical than efficiency. On the other hand, a normal FIFO queue already ensures the sequence of operations. It is up to you to assess whether your use case needs this specific custom approach.
- In an ideal theoretical system, where exactly one thread is allocated to each core, a thread switch and interruption never happens in our application: each thread takes a new job immediately after finishing the current one. Combined with the queuing, one can build a quite smart, controlled sequence of operations with maximum CPU utilization. Why “ideal theoretical”? Because Windows (and probably the other high-level OSes) can and will move a thread from core A to core F if that increases the overall efficiency of the system. And on a modern OS, at least the OS threads and some driver or other application threads work in parallel with ours. For more information, research how your OS does thread management.
- Measuring performance – write a short test application that, e.g., uses simple while loops and prints some information; the while loops keep the CPU loaded at 100%. Pausing the main thread for longer than the total job execution time, which stops job insertion into the queues, produces wait periods during which we can monitor, observe and measure 0% CPU load from all free threads once all jobs are finished. The latter is achieved by:
- Thread sleeping – critically important for performance. Why? When we have no work for a given thread, we must put it to sleep and later wake it up upon new job insertion. Otherwise it will keep its hardware thread occupied, effectively producing 100% CPU load on it.
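A minimal sketch of that sleep/wake mechanism using std::condition_variable, which is the standard C++11 way to block a worker at 0% CPU until work arrives; the class and member names are illustrative, not the repository’s:

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

using Job = std::function<void()>;

struct SleepingWorker {
    std::mutex m;
    std::condition_variable cv;
    std::deque<Job> jobs;
    bool stopping = false;

    // Worker thread body: sleeps (no CPU load) until a job arrives or
    // shutdown is requested; drains remaining jobs before exiting.
    void run() {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stopping || !jobs.empty(); });
                if (stopping && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop_front();
            }
            job();   // execute outside the lock
        }
    }

    void push(Job j) {
        { std::lock_guard<std::mutex> lock(m); jobs.push_back(std::move(j)); }
        cv.notify_one();   // wake one sleeping worker
    }

    void stop() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_all();
    }
};

int demo() {
    SleepingWorker w;
    std::thread t(&SleepingWorker::run, &w);
    int result = 0;
    w.push([&result] { result = 42; });
    w.stop();    // pending job still runs before the worker exits
    t.join();
    return result;   // 42
}
```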
- Terminating threads correctly – one of the most basic points in a multi-threaded application. If we do not first wait for all job queues to empty (i.e., finish all pending tasks) and then stop the threads correctly, we may miss a job and/or leak memory. True, terminating the program closes all threads, but writing correct code is the top priority ;). In addition, correct termination allows us to run two smaller thread pools.
When a program has completely separate parts, we can, e.g., allocate 4 threads for task group A in one thread pool and have a second pool for task group B. We can even start a thread pool for some time, stop it, then start it again for another part of the work. That is when writing the code with correct termination becomes crucial.
Our solution with Ferai waits for the threads and shuts them down in the destructor.
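A stripped-down sketch of that RAII shutdown idea: the destructor signals the workers and joins every thread instead of detaching, so two pools can safely live one after the other. The names are hypothetical and the job queue is elided for brevity:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal skeleton: the destructor first signals "no more work", then joins
// every thread, so no thread is ever detached or leaked.
// (A real pool would also drain its job queues before stopping.)
class JoiningPool {
public:
    explicit JoiningPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] {
                // Placeholder loop: a real worker would take and run jobs.
                while (!done) std::this_thread::yield();
            });
    }
    ~JoiningPool() {
        done = true;                        // ask workers to finish
        for (std::thread& t : workers)
            if (t.joinable()) t.join();     // wait, never detach
    }
private:
    std::atomic<bool> done{false};
    std::vector<std::thread> workers;
};

// Two independent pools with disjoint lifetimes, as described above.
bool demoTwoPools() {
    { JoiningPool a(2); }   // cleanly joined at end of scope
    { JoiningPool b(3); }   // a second pool started afterwards
    return true;
}
```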
- A parameterized number of threads at pool creation. A thread pool should be able to be initialized with, e.g., all available hardware threads (by default) or any other number. Some of the examples available on GitHub now (Aug. 2019) lack this. Once we can choose the number of threads, we can create any number of pools with any lifetime we want, and thus better structure complex applications with many sub-tasks. Our thread pool on GitHub has this ability. We tested it with 1, 4, the default (12) and 40 threads on a 6-core Intel i7 with Hyper-Threading, and we monitored the results and how it operates with long and short tasks, and during pauses.
- A future object to return results and get feedback asynchronously from objects in other threads. C++11 offers us the concept of a future. As explained on cppreference:
The class template std::future provides a mechanism to access the result of asynchronous operations:
- An asynchronous operation (created via std::async, std::packaged_task, or std::promise) can provide a std::future object to the creator of that asynchronous operation.
- The creator of the asynchronous operation can then use a variety of methods to query, wait for, or extract a value from the std::future. These methods may block if the asynchronous operation has not yet provided a value.
- When the asynchronous operation is ready to send a result to the creator, it can do so by modifying shared state (e.g. std::promise::set_value) that is linked to the creator’s std::future.
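The points above can be shown in a small self-contained example with std::packaged_task; here the task runs on a dedicated thread for brevity, whereas in a pool it would be pushed onto a job queue instead:

```cpp
#include <future>
#include <thread>

// A job wrapped in std::packaged_task hands its result back to the
// creator through a std::future.
int demoFuture() {
    std::packaged_task<int(int, int)> task([](int a, int b) { return a + b; });
    std::future<int> result = task.get_future();  // linked to the task's shared state
    std::thread t(std::move(task), 2, 40);        // run asynchronously
    int value = result.get();                     // blocks until the result is set
    t.join();
    return value;   // 42
}
```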
Ferai and I wrote code that satisfies all the points above, and it includes simple test code. It also contains basic references to C++11 pages, mostly from cppreference.com, and is fully commented (i.e., you will understand what each line of code does). For more information, download the source from https://github.com/AtanasRusevPros/CPP11_ThreadPool/ and test it yourself.
I hope this article was useful to you. It definitely does not cover every aspect of the subject, but I tried to cover the basics from my perspective and knowledge. Feel free to write comments below and to share a link to the article here and there. I would also highly appreciate it if you point out weak spots, or anything more that we can learn, add or should investigate.