The problem is that our old balancing routine (which essentially loops over the generated sort to drive two values toward a match) does not play well with multithreading or the new version of the topological sort. I'd been planning to rewrite the balancing routine for a while now -- it appears that the time has arrived.
The good news is that it should be faster when I'm done.
The bad news is that getting it done is not faster...