On perfection

I'm now on my second of what will probably be three roles (rotations) during my first two years here at Intercontinental. Previously I had been doing performance analysis for Q, a piece of messaging middleware. The idea of middleware is to recognise that, in business, a lot of people spend a lot of time building processes and applications to do very similar things-- that is, reinventing the wheel. Middleware is a single, extremely powerful, well-built wheel, which means we can go up to people and say "Hey, don't build your own wheel! There are all kinds of pitfalls in that kind of thing, and besides, it's got nothing to do with your core business of selling theatre tickets or soil or whatever, and you don't want to waste time on it. Just buy our excellent wheels instead!" In this case the wheel was "reliably sending a message from point A to point B without losing it". The idea is that you write your own application to transform and route messages and deal with them appropriately, but you use our thing to actually send them from place to place.

The new thing I'm working on, Bus, is another piece of middleware which sits on top of that first piece of middleware and actually functions as that application. It does message routing and transformation and security and error handling. It has a nifty GUI in which you just slot together a "message flow" out of "nodes" and their various input and output "terminals". Where Q is known in the industry as a "messaging system", Bus is an "enterprise service bus". I used to do performance testing on Q; on Bus I do plain system testing.
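
To make the "nodes and terminals" idea a little more concrete, here is a rough sketch of what a message flow amounts to. This is purely illustrative Python-- the class names and the toy transform are made up, and it is not the product's actual API:

    # Illustrative only: a toy "message flow" built from nodes and terminals.
    # Not the real product's API, just a sketch of the concept.

    class Node:
        def __init__(self):
            self.terminals = {}        # output terminal name -> downstream node

        def wire(self, terminal, node):
            self.terminals[terminal] = node
            return node

        def propagate(self, terminal, message):
            if terminal in self.terminals:
                self.terminals[terminal].receive(message)

    class TransformNode(Node):
        def __init__(self, transform):
            Node.__init__(self)
            self.transform = transform

        def receive(self, message):
            try:
                self.propagate("out", self.transform(message))
            except Exception:
                self.propagate("failure", message)   # errors leave by their own terminal

    class OutputNode(Node):
        def __init__(self, label):
            Node.__init__(self)
            self.label = label

        def receive(self, message):
            print("%s: %s" % (self.label, message))

    # Slot the flow together: transform -> output, with a separate failure path.
    flow = TransformNode(str.upper)
    flow.wire("out", OutputNode("delivered"))
    flow.wire("failure", OutputNode("dead letter"))
    flow.receive("order #42")          # prints "delivered: ORDER #42"

The real product obviously does far more than this-- the routing, transformation, security and error handling mentioned above-- but the basic shape, a message entering a node, being worked on and leaving through one of several named terminals, is what the GUI lets you wire together.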

I ran performance tests on Linux and Windows. Now we have about sixty distinct combinations of hardware, operating system and installed software. There used to be eight dedicated machines; now there are sixty. I had thirty tests per platform; now there are thirty thousand tests in total. I knew the entire test infrastructure (in fact I rewrote almost all of it). Now, if a test goes awry, I invariably have to do a lot of investigation to figure out what the test is supposed to be doing and why exactly it isn't doing it. I used to have a primitive script to display all of my latest test results. Now I have to use an insanely complicated front-end to a monumental database of every test result ever, which amounts to literally hundreds of millions of results since development on the product began.

I used to work with an extremely buggy test interface which had been built piecemeal over the course of years by a succession of grads and industrial trainees, each one inexpertly bolting on more functionality and gradually making the thing that bit less usable.

That last bit hasn't changed.

I have a story which I tell at job interviews. I have done tedious admin work. I have done data entry. In these jobs, I have invariably been forced to sit down and learn how to use the specific bespoke piece of software which some other member of the organisation put together in order to allow me (or whichever temp sits in the chair) to do this job. Invariably, the piece of software is of poor quality, with a poor interface and functionality which only barely maps to the work that I actually need to perform, and for whatever reason it cannot be fixed or upgraded. I hate being in that position, at the receiving end of the flow of software-to-user. Not just because I can't do my job properly (or, if I can, because I could do it ten times faster if the software were fixed), but because, given the slightest opportunity, I could jump into that software development role myself and fix those maddening behaviours and add that conspicuously absent functionality. This, I say, is why I want to go into software development: because I want to be the person who fixes those problems, to help the people at the receiving end of the software-to-user flow. This is why I enjoyed my previous web application development role so much.

I hate being forced to use sub-standard software, and at this organisation, of all organisations, I thought I could finally escape from that cycle. No. Our database is so gargantuan that it takes four and a half minutes to perform very basic queries about which tests have been failing lately, and I dread to think how much of that time is spent dragging out charts and bars which are simply bells and whistles of no practical use to me, the tester. No, I cannot fix it. It has been like this for years and bugs raised are ignored in perpetuity. I just have to work with it. (Come May, I should be moving to my third and final rotation, which will hopefully be a development role, so I'll just hang in there until then. I'm itching to actually create something instead of critiquing what others have done.)

There's that, and there are the results themselves. It was a realisation to which I should have come much sooner. We are not NASA. Our software is not perfect. It has defects. Not in frightening quantities, but a 99.9% pass rate across thirty thousand tests is still thirty failures. Not frightening defects-- no defects which would even be found "in the wild"-- but defects nonetheless. We try to fix all of them-- this is my job, to make sure that every single one is seen, diagnosed, triaged, raised as a piece of work for someone to perform, and sent to the development team. But when GA (general availability) arrives, the count is never zero. We do not ship 100% perfect software, because 99.9% perfect is good enough and fixing that final 0.1% would require disproportionate effort which wouldn't make any financial sense. NASA's software development model isn't a business model; it's a keeping-people-alive model. We push onwards with new stuff because that, more than anything, is what the customers actually want, and it is what actually makes us money.

I am a mathematician. I aspire to perfection. But this is not some magical fairy tale wonderland, built by actinic, monstrous hyper-brains. This is not towering geniuses building the most incredible advanced supercomputers and infallible, perfect software. The whole thing is run by human beings.

Discussion (1)

2010-02-01 23:31:45 by pozorvlak:

Sounds like an even crazier version of the test rig we had at my last employers. Awe-inspiring in theory, a pain in the arse in practice, with very little ability to quickly run the tests you care about <em>right now</em>. There were plenty of obvious improvements that could have been made, but only one guy understood the code and he couldn't spare the time to either improve the test rig or to explain the architecture to anyone else.

If you ever need to set up a test rig again, Perl's test framework (implemented in the Test::* and TAP::* hierarchies, but interoperable with anything that can spit out a very simple textual format) is excellent. See http://testanything.org. Or is this what you used at your last position?
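
To give a flavour of just how simple that textual format is: the sketch below (in Python rather than Perl, purely for illustration, and with made-up test names) already emits valid TAP without using any framework at all.

    # Illustrative only: a hand-rolled TAP producer with made-up test names.
    results = [("broker accepts connection", True),
               ("malformed message is rejected", True),
               ("message routed to correct queue", False)]

    print("1..%d" % len(results))                      # the plan line
    for number, (name, passed) in enumerate(results, 1):
        print("%s %d - %s" % ("ok" if passed else "not ok", number, name))

    # Output:
    #   1..3
    #   ok 1 - broker accepts connection
    #   ok 2 - malformed message is rejected
    #   not ok 3 - message routed to correct queue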

The other thing to note is that mathematics, too, is done by human beings, with all the concomitant downsides. New maths can be very messy indeed - although actual <em>errors</em> are rare, it takes a long time and a lot of effort before the clean, spare elegance that you encounter as an undergraduate emerges from the fog. Sometimes you get glimpses, though.