How I Learned To Review My Own Code

I recently discovered a new technique for reviewing my own code. Previously, I had the experience of flowcharting algorithms for two different patent applications. While producing the diagrams, the process of converting the code to a different form caused me to find bugs in both cases. The problem with flowcharts, though, is the constant fiddling with the graphical representation and adjusting things to fit in the space or how best to split the diagram when it does not fit in a viewable area. Because it had been so successful in forcing me to review my code at a more thorough level, I have been on the look out for something similar that would allow me put the code through an essentially isomorphic transformation without the complexity and overhead of a graphical layout to get the same benefit.

I came across a paper by Maarten H. van Emden that uses a matrix to describe the state transitions that take place in the code where the matrix represents a dual-state machine, where both the control flow and the data state are represented. I modified his method somewhat to adapt for use in a spreadsheet. He added the states starting from upper right going down and to the left. For a spreadsheet, it works much better to go from upper left going down and to the right. Maarten showed in his paper how to generate working C or Java code from the matrix, but I went the other way by converting existing C code to a matrix. The effort to convert my code to another form forced me to think through the code to a degree where I found numerous errors in the code I had previously missed. These included errors that passed the unit tests, working correctly in a single thread, but entirely wrong from a concurrency standpoint.

The other advantage was that it also allowed me to review the code in terms of its complexity, namely the number of states and state transitions involved. I found that I needed to reduce the number of states (to a reasonable number that would fit on a single spreadsheet tab) required refactoring the functions, thus improving the level of abstraction and greatly simplifying the code. The results can be seen in my current github repository, and the spreadsheets, which are in Google Sheets, can be seen in sheet1, sheet2, and sheet3.

I must say that this technique has helped me produce some of the best code I have ever written, both aesthetically and correctly, in which I justifiably can have a lot of confidence.

To TDD, Or Not?

One of the things bouncing around the Internet around the time I began to consider starting a blog was the debate over test driven development (TDD) as a measure of professionalism. I find I am an empiricist when it comes to arguing about development methodology after so many years of doing devleopment, which means I consider much of what is debated as vain efforts in convincing people to have the “correct opinion.” Yet, as with most things, I do have to make a decision about what I personally am going to practice in my personal effort to produce high quality code.

In addition, because of the type of code I tend to write, I have issues with some practices such as code reviews since reviewing the multithreaded, concurrent programs and data structures I write requires a skill set and level of expertise that is not common among my peers or developers in general. It isn’t that code reviews don’t help, but I find that most of the value comes not from others input, comments, or finding errors, but, rather, from the effort I have to undertake to explain the code which causes me to examine it in depth in order to be able to explicate it to others. So I am constantly looking for techniques that help me to examine, analyze, and force me to thoroughly walk through my own code in order for it to be as correct as possible. With my current project, I find I am even more on my own since it hasn’t attracted much attention, and I find myself even more dependent on my own self review.

So as I previously alluded, I find the arguments over TDD largely evidence free, and, therefore, largely arguments over opninions and preferences. To make TDD a hallmark of professionalism, I find a ridiculous assertion that reduces the practice of programming to mere domatic religion. The only measure of professionalism I find that makes sense is the actual production of high quality, error free code, regardless of how it was produced. That does not mean don’t find value in unit testing and some value in writing tests before writing code, but there is very little actual evidence that TDD itself produces well-designed and correct code over and above any other practice. Frankly, the more I listen to some of Leslie Lamport’s recent presentations about specifying systems and thinking before coding, I find his views on what should be practiced much more compelling as a measure of professionalism, even though I question how well they would scale as formal methods haven’t proven economical in practice for most development efforts.

As it stands, my current practice is to mix writing unit tests both before and after coding. I find that writing some simple, initial unit tests to exercise the major functions on their “happy path” the fastest way to get something working, which gives that initial, positive feedback I think that programmers find so motivating. I also find, though, a lot of value in writing most of my unit tests after writing the majority of the code as it forces me to examine the various paths through the code in a thorough fashion. I find the activity of writing thorough unit tests that exercise especially the error paths a valuable exercise as a form of self-directed code review.

Code Aesthetics

As I have stated in a previous post, this blog and the associated code project are a kind of experiment or exploration to satisfy my curiousity about ways to actually build applications using multiple languages. Because the code is open source available on Github, I have altered my personal style that I have used for much of my career. Some of the changes are because they are just better options and my changing is just long overdue. Most though are adjustments to accomodate what is common or recommended practice in open source projects, and at least one change is an outright innovation on my part which may not survive.

The first major change I have undertaken relates to exiting from functions. I was educated in the 1980’s at a time when formal methods were under development and seen as the future direction for development. At that time, in order to facilitate formal analysis, we were trained to have a single entry and exit point for all functions. In spite of the fact that I personally never tried to perform a formal proof of my code, I continued to code in this style and reason about my code in that fashion. With the processors of that era, it also made sense in that you could actually reason about execution order and code a straight line path of likeliest path through the code in order to be more performant. Admittedly, that hasn’t been true probably at least a decade with out-of-order execution and pipelining, but it had become an ingrained habit. So lately, in my code I have moved more to the style seen in more modern code bases like the Linux kernel where multiple return statements are used to exit a function as soon as possible as conditions merit to minimize nesting levels.

A rather simpler change is to use the Kernighan technique for brace placement that places the opening brace at the end of a line. For years, I practiced aligning braces in the same column, but for this project I adopted the reasoning that it is more important for understanding to get more lines of code on screen. So I am cutting down on lines by placing braces at end of lines and reducing the number of empty lines, as well as, trying to keep functions at 40-50 lines so they fit on one screen.

The other formatting change that I made was to accomodate recommended practice to limit line length to 80 characters. In my daily work, I wouldn’t pay that much attention to it since screen width hasn’t been an issue for some time. I would never create a line that was a totally unreasonable 200 or 300 characters long, but if a long function declaration or an if statement with multiple tests went well over 80 columns I wouldn’t worry about it. I have to say that for the most part limiting myself to 80 columns has been easier than I would have thought using these practices, and I have to say for the most part I am pleased with the clean appearance of my code.

The one innovation I am experimenting with based on the 80 column limitation is putting arguments in function declarations on separate lines. After trying it for a while, I quite like the technique for being able to comment each argument in line, adding and removing arguments, and for the effect of making the arguments similar to local variable declarations. I can now see how the original function declarations in C came about, even though when I learned about them I thought them rather archaic. The one downside that will probably cause me to move away from this style and revert back to my normal practice is that when I set breakpoints in gdb the function name doesn’t show up when the breakpoint is reached and all I see is the closing paren/opening brace line. So much for “good ideas”.

The one thing that I did not change was the use of spaces rather than tabs. I entertained moving to tabs rather than spaces, but the rules for formatting things across multiple lines such that things would not be a mess no matter what the tab width was seemed like more mental overhead than I felt like taking on. Set space width is just simpler and guaranteed to produce a clean layout with no extra thought, so I am sticking with it.

Problem With Michael-Scott Lock-Free Queue

As I stated in the previous post, the actual queue is built based on the lock-free queue implementation described by Maged Michael and Michael Scott in this paper, and that it is probably the best known and most widely used lock-free queue algorithm. Doug Lea used it as the basis for concurrent linke queues in java.util.concurrent libraries for Java. In the paper, the authors address a major problem with lock-free programming known as the ABA problem, where between the read and update another process changes the value from A to B and back to A so that the queue is not in the same state though the pointers in first process have not changed. The well-known solution is to add a generation counter and use a double-word compare_and_swap instruction, which is what my implementation does. The Java implementation does not have the same issue because of garbage collection since the first thread would hold a reference to A it could never be reused. The shared queue uses the counter and double-word CAS instruction to prevent ABA issues since it is implemented in shared memory with no possibility of garbage collection.

While stress testing the shared memory queue, I found another problem with the use of the Michael-Scott algorithm in addition to the ABA problem. I stumbled across the problem while stress testing mostly because I use multiple queues using the same memory in close proximity starting at the same initial state for the generational counters, which dramatically increased the probability of occurrence. The problematic algorithm below is from the paper:

enqueue(Q: pointer to queue t, value: data type)
E1: node = new node()
E2: node–>value = value
E3: node–>next.ptr = NULL
E4: loop
E5: tail = Q–>Tail
E6: next = tail.ptr–>next
E7: if tail == Q–>Tail
E8: if next.ptr == NULL
E9: if CAS(&tail.ptr–>next, next, <node, next.count+1>)
E10: break
E11: endif
E12: else
E13: CAS(&Q–>Tail, tail, <next.ptr, tail.count+1>)
E14: endif
E15: endif
E16: endloop
E17: CAS(&Q–>Tail, tail, <node, tail.count+1>)

The problem begins after line E7 where an executing process is suspended before the CAS at line E9, either before or after line E8. The following diagram illustrates the steps that can occur with concurrently executing processes:

Step 1 for process 1 begins after line E7 where it gets suspended while attempting to add an item at the end of queue 1. At that point, another process can remove the items on the queue including item B which is the tail item for process 1. Then if item B is added to the end of queue 2, it looks to be in the same state as it was in the end of queue 1. If the generation counters for the two queues contain the same value, for instance because the items are bouncing between the queues, process 1 will succeed when it resumes and performs the CAS at line E9. Line E17 will fail because the tail has changed, so the item is incompletely on queue 2 instead of queue 1 and the queue 2 tail is not advanced correctly. I believe the Java implementation is not protected from this issue either based on the code I downloaded from Doug Lea’s homepage if the next field is set to null for any reason after being removed from queue.

I came up with two alternative solutions to the problem. The simpler one for my shared memory queue implementation was to intialize the generation counters to separate ranges of integer values that would not likely overlap since it is based on a 64-bit architecture where the integers will not wrap. Technically one range could end up overlapping the other, in which case the problem would occur again, but I make the ranges so far apart that it would be almost impossible. The other alternative is to use a unique value as the end of queue value in the next field instead of null, which would also be needed for Java. That could be the parent queue structure address, for instance, or some empty node value allocated for each queue, but the only requirement is that it be unique for each queue instance.

Shared Memory Interprocess Queue

In attempting to build my own implementation of an interprocess queue, I was faced with two basic design issues related to the requirements I placed on myself. The first requirement was to be able to dynamically grow the queue based on demand for memory, and the second requirement was to allow arbitrary sized messages. I wanted no preallocation or explicit limitations. So the design had to allow the shared memory to grow, and for all processes to be able to detect the increase and adjust as needed concurrently. Secondly, I needed to track arbitrary allocations and deallocations within the shared memory object.

Initially, a single page of 4096 bytes is allocated in shared memory, which for Linux resides in /dev/shm on the file system. The initial allocation looks like the diagram below:

For all processes to be able to use the queue when accessing at different virtual addresses, the memory is treated as an array of signed 64-bit integers so that the embedded data structures are all updated based on calculated offsets instead of pointers. There is a 512-byte header that contains the base of all the basic embedded data structures. Those data structures include three separate semaphores, lock-free lists for the queues, and a critbit tree for tracking freed data allocations. Both the critbit tree and the lists use the same size internal nodes, which have to be double-word aligned. All node allocations are performed from the left to guarantee alignment, whereas, the data allocations can be an arbitrary number of array slots proceed from the right when looking at the diagram. The array is always grown in multiples of 4096-byte page size and the allocation schemes for each type is maintained by advancing the allocation counters to the left and right of the newly allocated pages.

Unfortunately, I could not figure out a way to manage a lock-free concurrent expansion across multiple processes so I had to use a dedicated semaphore in the header as a mutex such that only one process at a time can expand the shared memory. That way any accessing process is able to safely expand using ftruncate() through the use of a shared semaphore. Every access of the shared memory by other processes is guarded by a range check on the index to make sure it is in bounds, otherwise, the enlarged shared memory is remapped before proceeding. In this manner, I am able to have any given process expand the shared memory queue and have all other processes detect and adjust.

As previously mentioned, I use a critbit tree to track freed data allocations based on the implementation described in this paper. I chose that particular data structure because it minimizes the number of internal nodes which minimizes the memory references. I believe that in typical usage that messages on the queue will grouped within a typical range so that the tree in most cases will remain quite small. The key used is the number of 64-bit integer slots in the array the memory allocation occupies, which means that endianness of the native CPU architecture affects the implementation. Since I have access to only Intel CPUs, the code is written for little-endian integer keys.

The actual queue is built based on the lock-free queue implementation described by Maged Michael and Michael Scott in this paper. It is probably the best known and most widely used lock-free queue algorithm. Doug Lea used it as the basis for concurrent queues in java.util.concurrent libraries for Java. The book The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit also describes the algorithms involved and the issues surrounding them in detail. There are two explicit queues embedded with one for messages and the other for events. Also, when internal nodes are freed, they are essentially queued on a free list.

The conceptual diagram below illustrates a queue with a couple of items enqueued:

In the diagram, I use the arrows as if they are pointers, where in the actual implementation they are actually integer index references rather than pointer addresses on heap as would be done in a typical single process implementation. The data for each item on queue is shown allocated to the right hand side, and queue nodes are allocated on the left. Only node allocations end up on the actual queue. Each node has a reference to the data location and the next node on queue. Lastly, there is both a head reference and a tail reference maintained in the header portion of the shared memory segment.

Finally, there are two more semaphores that are used to track the size of the queue and allow the calling processes to block either because the queue is empty or that the queue is full. The write semaphore is initialized to the maximum size and will allow writers to add items to queue only if it is not zero. Every add to the queue decrements the count and every remove increments the count. The read semaphore works just the opposite in that it is initialized to zero and reads from queue will block until something has been added. Every add increments the semaphore count and every remove decrements the count. By using two semaphores in such a manner, I am able to create a bounded queue that allows calling processes to block or not block as needed on both the adds and removes. An unbounded queue is approximated by setting the write semaphore to the maximum value for a semaphore, which on a 64-bit Linux system allows 2147483648 items to be added to a queue.

Development Plans

As I said in my previous post, I would like to explore alternative means of communicating between programs written in different languages, and then using those components to assemble working applications.

I see three major steps to this effort:

  • develop data structures in shared memory to simplify interprocess communication in C
  • use foreign function interfaces in various languages to use those data structures
  • develop scripting to assemble programs written in various languages to assemble working applications

Because there are several shared memory data structures I have in mind, I do not intend to complete each step before proceeding to the next. Rather, I intend to iteratively go through these steps for each data structure. For example, the first data structure I am building is a new, improved version of the interprocess queue that is simpler to use and much more functional than OS based implementations currently available. It would be quite useful to at least partially complete steps two and three before moving on to the next data structure. In fact, the majority of the benefit will come from completing those steps for the shared memory queue.

The three data structures I have in mind at this point, all to go in the same library, are as follows:

  • shared memory queue that can grow dynamically and has no limits on message sizes
  • shared memory allocator that allows dynamically allocating portions of memory that can be referenced through a passed token
  • shared memory map for storing and accessing key/value pairs using arbitrary data as keys and values

I am beginning with the shared memory queue because that allows me to pass messages between programs and simulate both Actor and communicating sequential processes (CSP) models of computation. The shared memory allocator is an idea that occurred to me while considering how I would practically build applications using the queue. The queue is designed to handle messages of arbitrary length, but the occasional large message would cause all queues it was placed on to grow to an unnecessary size for just that one message. My solution is allocate space in shared memory associated with a token that can be passed via the queue and to allow the safe access of that memory through the use of the token. Lastly, the associative shared memory map is on the agenda as a generally useful data structure for building multiple types of applications. The Lua language, for example, demonstrates the utility of the associative array, or table in its vernacular, for implementing multiple software development paradigms.

Why this blog?

I was first inspired to start this blog after reading several blog posts about steps to take regarding career development. The posts suggested multiple ways to begin such as learning a new programming language, starting an open source project on github, or begin writing a blog. Since I really have not done anything regarding career development for several years, I have decided to do all three at once.

My first problem with regard to this blog is “what to write about?” I am not particularly fond of writing as a form of recreation. Neither do I want to add to the plethora of abandoned blogs of programmers proffering their hard won experience, only to quickly run out of things to say. So, I need a project to write about since I have no desire to write a software development advice column.

I also want to explore ideas that had cropped up in my day-to-day work activities as a systems developer, only to be told that “this is not an R&D project.” Understandably, the company I work for doesn’t have the time and money for me to follow all my ideas to a logical conclusion, especially if they do not work out successfully. Besides, there are obviously already off-the-shelf software products that will do most of what is needed already to build software using traditional software development techniques, and there are folks that are not fond of my adding my own contributions to the ever increasing pile of infrastructure software we are currently maintaining. So, I have unfulfilled ambitions regarding alternatives for how software should interact and be built.

This blog, and its associated projects, are about a specific problem I am choosing to explore. I want to address how programs that are written in different languages communicate, and, in simplifying those mechanisms, consider how applications might be built differently by assembling them out of communicating parts. I realize there are lots of approaches to these issues already. I am not promising ground breaking research here; just a running dialogue while I “scratch an itch”, so to speak. Hopefully, in addition to the blog, you will see working code in my github repositories that will be useful in some context.