I often come across Java code, written as recently as a year or two ago, that still uses the Java Thread class directly to build fairly complex multi-threaded systems, and in many of those cases I find myself questioning that choice. One such experience prompted me to write this post, which tries to explain why using the Java Thread class directly may not be a great idea, and why you should consider leveraging the Java ExecutorService for multi-threading use cases instead.
Let me start by providing an overview of the use case and the observations I had while reviewing the existing “One Thread Per Tenant” based design and implementation of this system I came across recently.
Case Study – Multi Tenant System
To give some background on the application in question and its current design: it is a multi-tenant, cloud-based Java application responsible for processing jobs for multiple tenants simultaneously. It was designed to create one thread per tenant at startup and to keep adding new threads as new tenants got added to the system. As soon as the thread for a tenant is created, it is started so that it can perform the required processing for the tenant it is running for. The processing (per tenant) is a typical batch pipeline: the thread invokes a job in an external system, waits for some time for the job to complete, and once the job has completed in the external system, waits a fixed amount of time before repeating the same steps for the next set of input.
These are the happy path steps. There are other, more complex cases, such as the external system being unavailable to process the request or a particular job taking longer than usual for a tenant, and these real-life scenarios need special care. The existing design and implementation handles them by periodically polling for the status of the job or of the external system, waiting for varying intervals of time to account for the different scenarios it could end up in.
One Thread Per Tenant model
So to summarize, in its existing design this multi-tenant system has one thread per tenant. Each thread runs forever in a while loop and keeps processing jobs for the same tenant at regular intervals, interacting with the external system with a varied set of inputs.
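To make the model concrete, here is a minimal sketch of what such a design tends to look like. It is my own reconstruction for illustration, not the actual code of the system: the class name `OneThreadPerTenant`, the methods `onboard`, `liveThreads` and `stopAll`, and the 100 ms pause are all hypothetical, and the interrupt handling exists only so that the demo can be stopped (the design described above has no such stop mechanism).

```java
import java.util.ArrayList;
import java.util.List;

public class OneThreadPerTenant {
    private final List<Thread> threads = new ArrayList<>();

    // Called once per tenant at startup, and again whenever a tenant is onboarded.
    void onboard(String tenant) {
        Thread t = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Placeholder for: submit a job to the external system
                    // and poll until it completes.
                    Thread.sleep(100); // fixed pause before the next batch
                } catch (InterruptedException e) {
                    // Restore the flag so the loop exits (demo-only stop mechanism).
                    Thread.currentThread().interrupt();
                }
            }
        }, "tenant-" + tenant);
        threads.add(t);
        t.start();
    }

    // The number of live threads grows one-for-one with the number of tenants.
    long liveThreads() {
        return threads.stream().filter(Thread::isAlive).count();
    }

    // Demo-only helper: interrupts every tenant thread and waits for it to exit.
    void stopAll() throws InterruptedException {
        for (Thread t : threads) {
            t.interrupt();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        OneThreadPerTenant app = new OneThreadPerTenant();
        for (int i = 1; i <= 5; i++) {
            app.onboard("t" + i);
        }
        System.out.println("live threads: " + app.liveThreads());
        app.stopAll();
    }
}
```

Note that nothing in `onboard` caps the thread count or ties a thread's lifetime to the tenant's status.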
Alright, so the good part is that multiple tenants are being processed in parallel and it is able to process new tenants without any restart or manual intervention. However the above approach does have some downsides, so let’s talk about those downsides and then try to look at a potentially better way to handle this and similar other use cases.
The Not so Good…
- Number of concurrent threads: The first and foremost problem with creating one thread per tenant, with each thread running continuously in a while loop, is that there is no control over how many threads the application spawns. If your business is doing well and you keep adding new tenants, the application keeps adding a new thread for every new tenant and could end up in the unwanted situation of having too many live threads. Even though Java Threads are easy to create from an API usage perspective, they consume system resources like memory and CPU cycles, and the JVM and the operating system have to do a lot under the hood to schedule them efficiently, so it is always advisable to limit the number of threads that an application creates. Creating too many threads could lead to excessive context switching and could overload the JVM with respect to memory and CPU cycles, defeating the purpose of creating threads in the first place, or even cause out of memory errors.
- Scalability with increased load from new tenants: Especially in this particular application, since the processing in each thread involves submitting a job to an external system, the concurrent job execution capacity of the external system plays a very important role in the overall concurrency design. If the external system is constrained and cannot process jobs from all tenants concurrently, the above system would fail to scale, because with the current design there is no way to control how many threads are live and concurrently submitting jobs to the external system. It could happen that all threads try to submit jobs to the external system within a span of a few seconds, leading to scalability problems.
- Tenant Deactivation/Disabling: Another major limitation of the above design concerns removing tenants or temporarily disabling them. Tenants could get offboarded, or we may want to disable a tenant for a certain period of time, but that is not possible with the current design, since the thread for a tenant is created on onboarding and runs forever until the application is shut down. So unless the application is restarted, it would keep processing unwanted tenants, which is obviously undesirable.
- Uninterrupted Reliable processing for every tenant: Likewise if for some reason the thread for a particular tenant dies or is terminated, there is no good/reasonable way for the application to restart that thread or to create another thread for that same tenant. Hence in such a case the only option left is to restart the application so that it initializes and creates the threads per tenant again for all the tenants at startup.
- Support, Maintainability and Reliability: And last but not least, the application logic becomes complex, unreliable and very hard to debug with all those wait and sleep calls on the threads. Such a design and code base is usually clumsy, error prone and very challenging to maintain.
So as you may observe the current design may work initially for the Happy Path cases, however it would have challenges scaling as well as in handling non-happy path or practical real life scenarios which are bound to happen at some point in time, maybe even during the initial phase.
Now that we understand the problems and limitations of the existing One Thread Per Tenant approach, let us see if an alternative approach can solve some of them. The approach is based on the idea of using Thread Pools instead of one thread per tenant.
Thread Pools model
Thread pools are implemented in Java using the Executor framework, introduced in Java 5. The Executor framework is an extensive framework with a lot of options to create and customize Thread pools and concurrency utilities.
For this discussion, considering the use case at hand, we can use the Executors.newFixedThreadPool() utility API, which provides a Thread pool with a fixed number of threads and an unbounded job queue. As you may guess, a fixed thread pool has a fixed number of threads to execute the tasks submitted to it. One can keep submitting tasks irrespective of the number of threads; the tasks get queued up and are picked up as and when threads become available to execute them.
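As a rough illustration of that behavior, the sketch below submits more tasks than there are threads and records how many run at the same time. The class name `FixedPoolDemo`, the helper `maxConcurrency` and the 20 ms sleep standing in for real work are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedPoolDemo {

    // Submits taskCount short tasks to a pool of poolSize threads and returns
    // the highest number of tasks observed running at the same time.
    static int maxConcurrency(int poolSize, int taskCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                futures.add(pool.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(20); // stand-in for real work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                    }
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for every task to finish
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        // 12 tasks, 3 threads: the extra tasks wait in the pool's queue.
        System.out.println("peak concurrency: " + maxConcurrency(3, 12));
    }
}
```

However many tasks are submitted, the observed peak should not exceed the pool size, because the remaining tasks simply wait in the queue.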
From an overall design perspective, we could have our main class or a Driver class run on a scheduled basis, maybe every few minutes, each time looping through all our “active” tenants, creating a Runnable or Callable task to perform the required job for every tenant, and submitting it to our thread pool that has a fixed number of worker threads. All the logic to check the status of the external system, trigger the job in the external system, track its status and so on can be encapsulated in the Runnable task itself. A key implementation point that differs from the “One Thread Per Tenant” approach is that a task should not hold on to its worker thread with long wait or sleep calls: since we are using a thread pool, threads are shared across tenants, and jobs from other tenants could be queued up waiting for their turn to execute.
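The driver described above could be sketched roughly as follows. Everything named here is hypothetical: `TenantDriver`, `runCycle`, `processTenant`, the tenant names, and the use of a `Supplier` to look up active tenants are assumptions for illustration, and the 200 ms schedule in `main` only keeps the demo short (a real driver would more likely run every few minutes).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class TenantDriver {
    private final ExecutorService workers;
    private final Supplier<List<String>> activeTenants; // hypothetical tenant lookup
    private final Map<String, Integer> runCounts = new ConcurrentHashMap<>();

    TenantDriver(int poolSize, Supplier<List<String>> activeTenants) {
        this.workers = Executors.newFixedThreadPool(poolSize);
        this.activeTenants = activeTenants;
    }

    // One scheduled pass: submit one task per currently active tenant.
    List<Future<?>> runCycle() {
        List<Future<?>> submitted = new ArrayList<>();
        for (String tenant : activeTenants.get()) {
            submitted.add(workers.submit(() -> processTenant(tenant)));
        }
        return submitted;
    }

    // Placeholder for the real work (check the external system, trigger a job,
    // track its status). Here it only counts how often each tenant was processed.
    private void processTenant(String tenant) {
        runCounts.merge(tenant, 1, Integer::sum);
    }

    Map<String, Integer> runCounts() {
        return runCounts;
    }

    void shutdown() {
        workers.shutdown();
    }

    public static void main(String[] args) throws InterruptedException {
        TenantDriver driver = new TenantDriver(4, () -> Arrays.asList("acme", "globex"));
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(driver::runCycle, 0, 200, TimeUnit.MILLISECONDS);
        Thread.sleep(1000);
        scheduler.shutdown();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
        driver.shutdown();
        System.out.println("runs per tenant: " + driver.runCounts());
    }
}
```

Because the list of active tenants is re-read on every cycle, a tenant that has been removed from it simply stops receiving new tasks on the next pass.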
So let us evaluate this Thread Pool based approach in the context of the constraints and limitations that were mentioned earlier while discussing the existing approach of One thread per tenant and see how they fare in a Thread Pool based implementation.
- Number of concurrent threads: As we noted above, with a fixed thread pool the number of concurrently running worker threads never exceeds the configured pool size. So this is controlled and managed, irrespective of the number of tenants that need to be processed.
- Scalability with increased load from new tenants: Similar to the point above, since we control how many threads constitute the Thread Pool, we can drive this setting based on the capacity of the external system. That way the pool never submits more concurrent jobs than the external system can handle.
- Tenant Deactivation/Disabling: In the Thread Pool based approach, our Main class or Driver program runs on a scheduled basis, each time looping over our active tenants and creating jobs for each of them. A tenant that has been disabled or deactivated is simply skipped on the next run, so this is supported inherently in this approach.
- Uninterrupted Reliable processing for every tenant: Again, since threads are managed by the Thread Pool and we are only responsible for submitting tenant jobs to it, we never run into the situation where a tenant-specific thread has died due to some unavoidable condition. There is no permanent tenant-specific thread. Every tenant's job waits its turn in the queue in the order it was submitted, and if a worker thread in the Thread Pool ever dies, the Thread Pool transparently replaces it with a new thread without you having to do anything about it, thus making the system much more reliable and fault tolerant.
- Support, Maintainability and Reliability: Since the management of threads is done by the Thread Pool, we are only responsible for creating and submitting our tenant jobs, which simplifies the code and its maintenance. Likewise, debugging the application would be much easier, since we would not have the wait and sleep calls that were inherently part of the One Thread Per Tenant model.
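To illustrate the reliability point, here is a small sketch in which one tenant's task fails and the next tenant's task still runs on the same single-threaded pool. The class name `FailureIsolationDemo`, the method name, the tenant names and the error message are made up for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailureIsolationDemo {

    // Runs a failing task followed by a healthy one on a pool with a single
    // worker thread, and returns the healthy task's result.
    static String runFailingThenHealthy() throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Callable<String> failingTask = () -> {
                throw new IllegalStateException("external system unavailable");
            };
            Callable<String> healthyTask = () -> "tenant-b done";

            Future<String> failed = pool.submit(failingTask);
            Future<String> healthy = pool.submit(healthyTask);

            try {
                failed.get();
            } catch (ExecutionException e) {
                // The failure is reported through the Future of the failed task.
                System.out.println("tenant-a failed: " + e.getCause().getMessage());
            }
            return healthy.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runFailingThenHealthy());
    }
}
```

With `submit`, the exception is captured in the returned `Future`, so the caller can decide per tenant whether to log, retry on the next cycle, or alert.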
In addition, the Executor framework provides a wide range of concurrency utilities that would make the system design and implementation much simpler, and the application much more robust and reliable.
As you may infer from the above discussion, the Thread Pool based approach scores much higher in terms of overall system design, scalability, reliability, fault tolerance, support and maintainability as compared to the One Thread Per Tenant model. It is much easier to implement and a lot simpler to debug and maintain.
I realize a single blog post is definitely not enough to comprehensively cover topics like concurrency, threads and thread pools. My intent for this post was to introduce the topic, present a real-life case study with its existing design, and compare it with an alternate and better approach to designing such systems. Hopefully I will follow this up with more posts sharing my experiences and thoughts on these topics.
As always, comments and feedback are most welcome.
Happy reading, happy coding!!