The blocking coroutine builder can kickstart your coroutine journey, but you need to know the risks—and the alternatives
When you write your first coroutine in Kotlin, you might do it with the runBlocking function, like this.
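The original snippet isn't shown in this extract, but a first runBlocking program, assuming the usual kotlinx-coroutines-core dependency, looks something like this sketch:

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    delay(1000)                    // a suspending call, legal inside runBlocking
    println("Hello, coroutines!")
}
```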
It’s a super easy way to start new coroutines. Calling a suspending function like delay from a plain old function would normally give you an angry red compilation error. But wrap it with runBlocking, and suddenly you’re free not only to call all the suspending functions you like, but also to use coroutine builders like launch and async.
Using runBlocking in the main function like this is no problem. But when was the last time you actually added code to the main entry point of a real production system? When you’re working with coroutines, you’re more likely to be deep within your app’s code, and that’s where things get more complicated.
The runBlocking function blocks the current thread while it waits for your coroutines to complete. Call runBlocking from the wrong thread, and you can end up stalling the progress of the very coroutines it’s supposed to be waiting for.
It doesn’t take many lines of code to reproduce the problem, so I can easily show you an example right here. This code is supposed to just print “Hello!” ten times and then exit. But if you try running it, it’ll actually hang forever without printing anything.
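The playground snippet itself isn't reproduced here, but a sketch consistent with the behaviour described, assuming kotlinx.coroutines, looks like this (helloBlocking is an illustrative name):

```kotlin
import kotlinx.coroutines.*

// This bridge runs its coroutine on Dispatchers.Default instead of a
// private event loop, so it competes for the very threads it blocks.
fun helloBlocking() = runBlocking(Dispatchers.Default) {
    delay(100)
    println("Hello!")
}

fun main() = runBlocking {
    withContext(Dispatchers.Default) {
        repeat(times = 10) {
            launch(Dispatchers.Default) {
                helloBlocking()   // blocks a dispatcher thread while it waits
            }
        }
    }
}
```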
The default dispatcher uses a shared pool of threads to execute coroutines. It’s built on the assumption that coroutines will suspend while waiting for outside events and occupy a thread only when they’re actually doing work. The number of threads in the default dispatcher is determined automatically from the number of CPU cores, so there are just enough threads for one coroutine to be actively using every core at all times.
Try adjusting the times = 10 value up and down in the example code to launch a different number of coroutines. In the online playground, two coroutines seem to be enough to cause the problem, but you’ll need more if you try running this code on your expensive gaming rig.
When the number of coroutines is lower than the number of threads available to the default dispatcher, the program completes normally. As soon as more coroutines are launched than there are available threads, the program hangs forever and produces no output.
Why does this happen? It’s because the call to runBlocking in this example is blocking a thread from the same dispatcher it uses to run its coroutines. The main function’s withContext block and the launch of the new coroutines all specify Dispatchers.Default as part of their context.
Calling runBlocking inside a coroutine means it blocks whatever thread that coroutine is currently executing on—and coroutine dispatcher threads aren’t designed to be blocked! Now the runBlocking call won’t release the thread back to the dispatcher until all the coroutines are complete, but the coroutines can’t run until the default dispatcher has an available thread. That sure sounds like a deadlock waiting to happen.
This might seem like an easy mistake to avoid. Just don’t make blocking calls from the default dispatcher! Better yet, follow the advice in the documentation and don’t call runBlocking from inside a coroutine at all.
The problem is that once you have calls to runBlocking in your code, they can be easy to lose track of. Unlike a call to a suspending function, runBlocking doesn’t advertise its presence with a gutter icon in the IDE, and it won’t fail your build when you use it in the wrong place. In a huge codebase with many modules and many contributors, you might have no idea that a function will end up calling runBlocking, or which dispatchers it might use to launch its coroutines.
The first time I used coroutines in production code, I was integrating them into an existing server-side application. The eventual goal was to reduce memory use by switching to a non-blocking HTTP client for requests to other services.
Converting the huge codebase to coroutines in one go would have been a monumental and risky task. Instead, we wanted an incremental approach that would let us make a series of small changes, continually checking our work by shipping working code out to our users.
Kotlin’s built-in runBlocking function seemed like the perfect tool for the job. Its documentation says it’s designed to “bridge regular blocking code to libraries that are written in suspending style.” We’d be able to introduce coroutines to the HTTP client right away—we used Ktor—and then invoke it with runBlocking until we were ready to gradually migrate the calling code to use coroutines.
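In practice, that bridge looked roughly like this (the function name and setup are illustrative; Ktor 2.x client API assumed):

```kotlin
import io.ktor.client.*
import io.ktor.client.engine.cio.*
import io.ktor.client.request.*
import io.ktor.client.statement.*
import kotlinx.coroutines.runBlocking

val client = HttpClient(CIO)   // non-blocking Ktor HTTP client

// A blocking façade so existing callers can stay untouched for now.
fun fetchBlocking(url: String): String = runBlocking {
    client.get(url).bodyAsText()
}
```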
I’ll call this the bottom-up approach to introducing coroutines. It turned out to be a big mistake.
The plan was straightforward. Say function a calls b, which in turn calls c. We want to make an asynchronous HTTP request in function c.
fun a() = b()
fun b() = c()
fun c() = runBlocking { /* coroutines! */ }
At first, we’d just have c use runBlocking to call the new HTTP client. Later, we’d convert c to a suspending function, moving the runBlocking call up to the place where b calls c. Then we’d convert b to a suspending function, moving the runBlocking call all the way up to a. We had to do it gradually like this because c might actually be in a shared module with dozens of callers across multiple backend services.
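Midway through, with c converted and the runBlocking call pulled up into b, the chain would look like this:

fun a() = b()
fun b() = runBlocking { c() }
suspend fun c() { /* coroutines! */ }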
fun a() = runBlocking { b() }
suspend fun b() = c()
suspend fun c() { /** coroutines! */ }
One problem with the plan was that it introduced many calls to runBlocking in many different places. Each time we made an outgoing HTTP request, we had another instance of runBlocking. And as we converted more of the calling code to use suspending functions, we started launching extra coroutines to speed up or simplify parallel tasks.
It was a ticking time bomb. As we migrated and refactored the code, some of those calls to runBlocking actually ended up being called from inside other coroutines. That was never something we did deliberately or consciously. But the complex codebase had a lot of branching code paths, and whole sections of code would sometimes get lifted and shifted from one part of the system to another. It was easy to introduce an innocent-looking call to an existing function without realising that it would eventually end up calling runBlocking.
It didn’t take long for a coincidence of code changes to lead to the exact circumstances from the example I showed earlier, only with many more levels of function calls in between, making the problem far harder to detect.
There weren’t any build failures, and nobody spotted any problems in the test environment. With low traffic, blocking a few extra threads was no problem.
But in production, it wasn’t long before we ended up with complete deadlock. It would happen when traffic increased, and the server was handling several requests at the same time. The default coroutine dispatcher would just stop, making the server dead in the water until the app was restarted. New coroutines would never run at all, and existing coroutines would stay suspended forever and never resume their execution.
It took hours of head-scratching and a whole lot of trial and error to figure out what was going on. Once we finally understood the problem, we enacted an almost total prohibition on the use of runBlocking anywhere in the system.
I learned a lot about coroutines from the experience—far too much to share here. One of the most significant realisations was that coroutine timeouts are powerless to fix this kind of problem. If a request is hanging due to any kind of coroutine deadlock, introducing a withTimeout block probably won’t do anything at all. That’s because the timeout does its job by triggering the coroutine to resume prematurely. A dispatcher that has no available threads can’t resume a coroutine at all, even for a timeout or cancellation.
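A sketch of a defence that doesn’t work, assuming the same starved default dispatcher (names are hypothetical):

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.withTimeout

suspend fun fetchFromService(): String {
    delay(50)      // resuming from this suspension needs a free thread
    return "Hello!"
}

// The timeout fires by resuming the coroutine with a
// TimeoutCancellationException, and that resumption also needs a thread
// from the starved dispatcher, so it never happens either.
suspend fun fetchWithTimeout(): String = withTimeout(100) {
    fetchFromService()
}
```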
In the online playground example, the whole program will still get killed for taking too long. But that will happen long after the 100-millisecond timeout that the coroutine is supposed to respect, and it won’t help us in a real app. The coroutine timeout itself does nothing, meaning some of our early attempts to defend against the problem were utterly ineffective.
So if it’s not safe to use runBlocking, how are you supposed to start new coroutines outside the main function or add suspending function calls to an existing app?
Custom coroutine scopes are one solution. To avoid resource leaks and handle errors, it’s essential for something to wait for the completion of coroutines. But that doesn’t have to be done by blocking a thread. If you have an existing component with a lifecycle, you can link your coroutines to that instead. The coroutine scopes provided for lifecycle-aware components in Android are a great example of this.
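As a sketch, a component with its own lifecycle can own a scope and cancel it on shutdown (the class and method names are illustrative):

```kotlin
import kotlinx.coroutines.*
import java.io.Closeable

class ReportService : Closeable {
    // SupervisorJob so one failed coroutine doesn't cancel its siblings.
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.Default)

    fun scheduleReport() = scope.launch {
        delay(100)                 // stand-in for real suspending work
        println("report sent")
    }

    // The component's lifecycle, not a blocked thread, cleans up the coroutines.
    override fun close() = scope.cancel()
}
```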
Another way to tackle an incremental migration is to skip the bottom-up approach and go top-down instead. Rather than adding asynchronous HTTP calls first and then updating the calling code, start by identifying the highest place in the call stack where you can put a call to runBlocking. From there, work your way down, using the IO dispatcher whenever you need to call blocking code that hasn’t yet been migrated. It’s not a perfect solution, but it’s safer than risking deadlock.
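The top-down shape might look like this sketch (all names are hypothetical): a single runBlocking at the top of the call stack, with not-yet-migrated blocking calls pushed onto the IO dispatcher.

```kotlin
import kotlinx.coroutines.*

// Legacy blocking code that hasn't been migrated yet.
fun legacyLookup(): String { Thread.sleep(50); return "legacy" }

// Already-migrated suspending code.
suspend fun fetchRemote(): String { delay(50); return "remote" }

// One runBlocking, as high up the call stack as possible, e.g. on a
// request-handler thread that is expected to block anyway.
fun handleRequest(): String = runBlocking {
    val remote = async { fetchRemote() }
    // Dispatchers.IO is sized for blocking calls, unlike Default.
    val legacy = withContext(Dispatchers.IO) { legacyLookup() }
    remote.await() + " " + legacy
}
```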
If you’re lucky, you might have the option to avoid the problem altogether. For example, Spring WebFlux now lets you write your REST controller methods directly as suspending functions, so you have access to coroutines right from the start.
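With WebFlux and the Kotlin coroutines extensions on the classpath, a controller method can simply be declared as suspending. A sketch, not a complete application:

```kotlin
import kotlinx.coroutines.delay
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.RestController

@RestController
class GreetingController {
    // WebFlux invokes this on its non-blocking runtime; no runBlocking needed.
    @GetMapping("/greeting")
    suspend fun greeting(): String {
        delay(10)   // suspending calls are fine right here
        return "Hello!"
    }
}
```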
Figuring out how to launch a coroutine or call a suspending function isn’t always easy, and runBlocking is a tempting shortcut that can end up backfiring. What solutions or strategies have you come up with in your own applications?
How I Fell in Kotlin’s RunBlocking Deadlock Trap, and How You Can Avoid It was originally published in Better Programming on Medium.