Of course, one of the things you do when chasing down problems like this is add more and more data to your logs, in the hope of capturing the critical piece that will explain everything. And over the weekend, the fellow who has been digging into this most recently said, "Hey, look at this."
So I did. And I replied, "I don't know where you're getting that data from, but the error message can't possibly be generated by the code that you are executing."
And after letting it roll around the back of my brain for a few hours, I realized that I had figured out the problem. There was a bug in the code that left old error messages attached to threads in the thread pool under certain circumstances; when one of those threads holding an old error message was dispatched, things started going wrong.
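To make that failure mode concrete, here is a minimal sketch in Python. It is not the actual code from this story: the per-thread slot `_state.last_error`, the `run_task` function, and the "disk full" message are all made up for illustration. It only shows how an error left attached to a pooled thread can surface in a later, unrelated task, and how clearing the slot before each task avoids it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-thread error slot; the real code's structure is not known.
_state = threading.local()


def run_task(should_fail, clear_first):
    if clear_first:
        # The fix: reset per-thread state before the thread does new work.
        _state.last_error = None
    if should_fail:
        _state.last_error = "disk full"
        return "failed: disk full"
    # Buggy path: trusts whatever error is still attached to this thread.
    stale = getattr(_state, "last_error", None)
    return f"failed: {stale}" if stale else "ok"


def demo(clear_first):
    # A single worker guarantees the second task reuses the first task's thread.
    with ThreadPoolExecutor(max_workers=1) as pool:
        first = pool.submit(run_task, True, clear_first).result()
        second = pool.submit(run_task, False, clear_first).result()
    return first, second


if __name__ == "__main__":
    print(demo(clear_first=False))  # second task reports an error it never hit
    print(demo(clear_first=True))   # second task reports "ok"
```

Without the reset, the second task reports an error that its own code path could never have produced, which is exactly the kind of log line that looks impossible until you remember the thread has a history.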
So, you see, once I had the critical bit of data, I was able to make the correct inference to fix the problem.
As it turned out, there was a bit more to the problem than that, but it all came down to similar behavior that was easily fixed once the data allowed me to come to the right conclusion.
It's all sort of fascinating. With incomplete data, you flail around and arrive at all sorts of incorrect conclusions, wasting time in the most amazing ways. When you finally have sufficient data -- that is to say, enough data to allow you to figure out what has happened -- you can get to the correct conclusion.
There is a lesson there somewhere.
When I have enough data, maybe I'll figure out exactly what the lesson is...