
Shared Responsibility

Dynamic Analysis for Race Conditions and Deadlocks in Java

Let’s face it: multi-threading has created some pains in the… well, as Forrest would say, butt-tox. You used to be able to write code assuming you had your own nice little sandbox, especially in a more protective language like Java, which allows you to be reasonably sure that some other schmo’s pointer won’t come weaseling its way onto your turf. But now you can have multiple threads that may want to get to the same pieces of data, so the threads have to learn how to play nice together, and that’s not always easy.

The issues all build on a few key concepts that at first blush seem so simple and straightforward. Then again, if they were, we wouldn’t be having this little chat, now, would we? So we’ll start at the start, or at least close. What’s the problem with multiple threads sharing data? Well, there’s the obvious first problem of only one entity having access to the data at a time, from a physical standpoint. An arbiter can generally handle this, granting memory access in turn to requesters. Yes, theoretically there’s some extremely tiny chance that two requests could happen at exactly the same attosecond causing some metastability-like confusion for an arbiter (even if you build in some kind of tie-breaking bias, you can always come up with some similar unlikely scenario), but – and this is key – that can generally be recovered from pretty easily. That’s not the big problem.

Race conditions and deadlocks

The big problem relates to a series of operations we want to have handled “atomically.” An atomic operation is akin to what database folks would call a transaction: either the whole series of steps takes place, or none of them does. And it’s important that they take place uninterrupted. For example, if you’re transferring money from one of your bank accounts to another, there are two steps: remove money from account 1, and add money to account 2. If this process gets interrupted, and only the first step occurs, you’re not gonna be happy. Or let’s say you want to take your entire balance from one account and move it to another. This human-plus-computer sequence goes as follows: you check the balance to see what it is, then you subtract that amount of money from the first account and deposit it into the second account. If, between checking the balance and removing the money, an outstanding check clears, the balance you checked is no longer valid, and you’re going to subtract more money than you have.
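That check-then-act hazard can be sketched in a few lines of Java. The Account class and its method names here are purely illustrative, not from any real banking API; the point is the unprotected gap between reading the balance and acting on it.

```java
class Account {
    // One lock shared by all accounts keeps the sketch simple; a real
    // system would likely lock per account (with care about lock order).
    private static final Object BANK_LOCK = new Object();
    private long balance;

    Account(long initial) { balance = initial; }

    long getBalance() {
        synchronized (BANK_LOCK) { return balance; }
    }

    // Unsafe: another thread (a clearing check, say) can change the
    // balance between the read and the subtraction.
    void unsafeMoveAll(Account to) {
        long amount = getBalance();  // step 1: check the balance
        // <-- a concurrent withdrawal here invalidates 'amount'
        balance -= amount;           // step 2: remove the money
        to.balance += amount;        // step 3: deposit it
    }

    // Safe: the whole check-then-act sequence runs atomically
    // under the shared lock.
    static void moveAll(Account from, Account to) {
        synchronized (BANK_LOCK) {
            long amount = from.balance;
            from.balance -= amount;
            to.balance += amount;
        }
    }
}
```

The synchronized version makes the three steps a single transaction: no other thread holding the same lock can observe or modify the balances mid-transfer.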

This situation is referred to as a “race condition” (or, when it’s specifically a matter of unsynchronized access to the same memory, a “data race”). The outcome of a race condition is, by definition, unpredictable, and it can leave the system in an undefined state. Not a good idea. And unlike the simultaneous-access problem above, this isn’t easy to recover from.

In order for your bank transaction to work reliably, you want all the steps to happen atomically: you get exclusive access to the resources – in this case, the bank accounts – and perform all the operations, and then let go so that someone else can get in there. There are two important concepts here: that of a series of steps that must be atomic, and that of protecting resources so that no one else gets them. These are independent (but related) items and both may have to be addressed.

The concept of an atomic series of steps gives rise to the so-called “critical section” of code. Only one thread is allowed to be executing a critical section of code at a time. How could more than one thread execute the same code? Consider a function that is called by two different threads. They could execute concurrently, messing up each other’s data, which is why they have to be restricted. Now, how could they really be executing concurrently? On a single-core processor, they wouldn’t be exactly concurrent, but the execution of one thread could be stopped to allow the other thread to execute, and that interruption could happen at any line of code. On a multicore processor, if the threads were assigned to different cores, then the execution could be truly concurrent.
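As a concrete sketch of a function shared by two threads, here’s a hypothetical counter whose increment is a critical section. The class is mine, not from any library; the interesting part is that `value++` is really three steps – read, add, write – that two threads can interleave unless the method is a critical section.

```java
class Counter {
    private int value = 0;

    // The method body is a critical section: only one thread may execute
    // it at a time, guarded by this Counter's intrinsic lock. Without
    // 'synchronized', two threads could interleave the read-add-write
    // steps of value++ and lose increments.
    synchronized void increment() { value++; }

    synchronized int get() { return value; }
}
```

Two threads each calling increment() 10,000 times will reliably leave the counter at 20,000; drop the synchronized keywords and the total can come up short.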

A critical region, by design, executes atomically – if you ignore other unrelated accesses to the shared resources (more on that in a minute). But as a result, once one thread enters the critical section, any other thread that arrives has to wait until the first thread is done before it can enter, and so it stalls. You therefore want your critical sections to be small and crisp to minimize the number of other threads hanging around doing nothing while they await their turn.

There still remains the problem that other unrelated code could access the shared data, and defining a critical region doesn’t help that. The general idea is that a piece of data can be locked, and that any code that wants to access the data has to acquire the lock first. If someone else has the lock, then the requesting code has to wait until the lock is released (and perhaps stand in line).

Even the process of getting a lock has to be done atomically. The steps are 1) checking to see if the lock is available, and 2) if so, grabbing the lock. But what happens if, after you see that the lock is available, your code is pre-empted by some other code that also sees that the lock is available and takes it? Now your code, which still thinks the lock is available, can’t grab the lock. This problem exists right down to the level of the machine instruction, which is why there are test-and-set instructions designed specifically to provide an atomic operation that can test a bit, and, if it’s not a 1, set it to a 1, without anyone being able to get in there and mess things up midstream.
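Java exposes the same test-and-set idea through java.util.concurrent.atomic. Here’s a minimal spinlock sketch built on AtomicBoolean.compareAndSet, which performs the check (“is the lock free?”) and the grab (“take it”) as one indivisible step – a toy illustration, not a replacement for the platform’s real locks.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void lock() {
        // compareAndSet atomically tests for 'false' (lock free) and, if
        // so, sets it to 'true' (lock taken) -- nothing can sneak in
        // between the test and the set.
        while (!held.compareAndSet(false, true)) {
            Thread.onSpinWait();  // hint to the runtime that we're busy-waiting
        }
    }

    void unlock() {
        held.set(false);
    }
}
```

If compareAndSet were instead written as a separate read followed by a write, it would suffer exactly the pre-emption problem described above.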

Of course, you can get into trouble with locks. Let’s say you and I both want to gain access to data A and data B. For whatever reason, I take out a lock for data A as you’re taking out a lock for data B. I then try to get a lock for data B, but I can’t because you have it; you try to get a lock for data A, and you can’t because I have it. And we’re both going to sit here waiting for the other to finish with the lock we want, unaware that we’re waiting for each other. And we’ll wait a long long time. It’s the classic deadlock, and while pretty simple in this scenario, it can actually involve strings of entities waiting for each other in a way that’s less obvious.
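The classic cure for this kind of deadlock is a global lock order: if every thread that wants both locks always takes A before B, the circular wait can never form. A hypothetical sketch using ReentrantLock (the class and method names are mine, not a standard API):

```java
import java.util.concurrent.locks.ReentrantLock;

class TwoResources {
    private final ReentrantLock lockA = new ReentrantLock();
    private final ReentrantLock lockB = new ReentrantLock();

    // Deadlock-free because every caller acquires lockA first, then
    // lockB. The deadlock scenario in the text arises only when two
    // threads take the same pair of locks in opposite orders.
    void withBoth(Runnable critical) {
        lockA.lock();
        try {
            lockB.lock();
            try {
                critical.run();
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();
        }
    }
}
```

Swap the acquisition order in one of two concurrent callers and you’ve recreated the you-and-me standoff described above.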

One of the insidious problems with race conditions and deadlocks is that they are not guaranteed to be a problem; they are only a potential problem, and they may exist benignly for millions of hours before striking. Apparently, the huge blackout of 2003 has been traced to a race condition that lurked through three million hours of operation before darkening the lives of fifty million people.

The Java approach

In Java, a “synchronized” block or method can be used with an intrinsic or explicit lock to set up critical regions and protect data. The semantics aren’t quite as transparent as the way things are described above, however. For instance, there is no required explicit association of locks with data or critical regions. There is a “@GuardedBy” annotation that allows such an association to be stated, but it’s not required and is apparently not often used. So, not being a well-schooled Java programmer, I find that this indirect approach to locking data and code can frankly leave my head spinning faster than a spinlock. Looking at various descriptions and solutions on the web, couched in more or less obtuse language, you end up feeling like this whole topic might be the realm of specialists. Which is a problem, since more and more programmers have to deal with thread-interaction issues.
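For the record, @GuardedBy comes from an external library (net.jcip.annotations, popularized by the book Java Concurrency in Practice) rather than from the JDK itself, and it documents rather than enforces the data/lock association. A sketch, with the annotation declared locally so the example stands alone:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-in for net.jcip.annotations.GuardedBy, declared here so the
// sketch is self-contained. The compiler does not enforce it; it only
// records the programmer's intent for tools and readers.
@Retention(RetentionPolicy.CLASS)
@interface GuardedBy { String value(); }

class SharedBuffer {
    private final Object lock = new Object();

    @GuardedBy("lock")  // states, but does not enforce, the association
    private int count = 0;

    void add() {
        synchronized (lock) {  // critical region on the explicit lock object
            count++;
        }
    }

    int size() {
        synchronized (lock) { return count; }
    }
}
```

Nothing stops a new method from touching count without taking the lock – which is exactly the gap the analysis tools discussed next try to close.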

And the evidence for that is the vast amount of work that has gone into static analysis programs designed to locate race conditions and deadlocks in code. Such programs can work in different ways, but all of them attempt to ensure that access to shared data occurs only through correct lock usage. Because these validation tools use static analysis, they comb through the code while the code is not executing. The accuracy of any such analysis for Java can’t be perfect, however, due to the fact that the data/lock associations aren’t explicit, so the tool has to guess to some degree. You could have the programmers provide that information as input, but that’s pretty impractical for huge programs, and even so, you would get only the ones the programmer remembered, and you wouldn’t test the ones he or she forgot – exactly the ones that are most likely to cause problems.

One of the companies addressing this space, Coverity, has just announced a dynamic analysis program to detect race conditions and deadlocks in Java programs. Unlike the static analysis tools, it checks for race conditions as the code is running, and it is apparently the first program to do so for Java. Dynamic testing is usually done by instrumenting source code (that is, temporarily adding extra code only for the purposes of analysis) to track and report on behavior as the program runs. A critical point that Coverity makes with respect to their coverage is that the analysis doesn’t wait for race conditions to actually occur, which might be never; it looks for potential race conditions so that they can be identified before (or whether or not) they ever strike.

Combining static and dynamic

There are pros and cons to static and dynamic analysis when run by themselves. Dynamic analysis can demonstrate actual behavior as it occurs, but only in portions of the code that actually execute. Static analysis can completely cover all the code, but depending on what it’s testing, its ability to infer behavior may be more limited. Coverity actually makes a static analysis tool as well, and they recommend the use of both, allowing them to interact.

One way to do this is to start with a static analysis run, which will infer some shared data/lock associations. The ability to infer shared data is also the ability to infer non-shared data, so the static analysis tool will end up with data it is confident is shared, and test for race conditions on that, and it will end up with data it is confident is not shared, and it can report on which data that is. There is some gray region in between where the inference may not be strong, which is where a user may want to do some tuning on the sensitivity to avoid too many false positives – the bane of this kind of testing.

The report on non-shared data can be sent to the dynamic analysis tool, which can then skip doing any instrumentation for that data. This can speed up the execution, since instrumentation is bound to slow things down some (and minimizing that impact is usually a key design goal – Coverity claims a 2X runtime overhead, as compared to 10X and as high as 100X for other dynamic tools). The dynamic analysis tool gets to observe the actual acquiring of locks in association with data accesses, providing yet more confidence on data/lock associations. Once this is done, those associations can be fed back to the static analysis tool for one more run, which, now armed with better information on data/lock pairs, can potentially provide a more precise result than the first time, eliminating more of that gray area where inference was uncertain.

By including such tools in their validation suites, programmers can get better visibility into where shared data needs better protection. It may feel like yet more work, but the payoff is that you can identify extremely elusive flaws ahead of time, improving the quality and the robustness of the code before it ships. It’s far less effort than trying to debug race conditions as they happen.

