Emissions Test-Driven Development

Emissions Test-Driven Development is a software development methodology adapted specifically to the programming of internal combustion engine control units (ECUs).

ETDD is a subcategory of a more general form of software development known as "Adversarial TDD", which is a kind of Test-Driven Development.

In Test-Driven Development, the software development cycle begins with the creation of a test case, which naturally fails at first. Once this is done, a minimal amount of code is produced which causes the test case to pass. The process continues, tests first, code second, with the tests forming an implicit specification for the behaviour of the software. If the tests pass, then the software is considered correct. Assuming the tests fully encompass all the desired behaviour of the software, then the software is complete.

(Opinions differ on whether this is a universally sound approach for creating good software. But opinions also differ on whether good software is theoretically achievable by humans, whether there is such a thing as good software at all, and even on whether an empirical definition of "good software" can ever be reached. Be that as it may, unquestionably TDD is a real thing which real software developers practice.)

It's not uncommon in TDD (and elsewhere) for the tests and the software to be written by different people. In Adversarial TDD, not only are the people developing the software (henceforth "the developer") and the people developing the tests for that software (henceforth "the tester") different people, not only are they physically separated and not in communication, they work for entirely different organisations and have opposing goals.

In the particular case of Emissions Test-Driven Development, the goal of the tester is to ensure that the internal combustion engine under test is clean, meeting certain emissions standards in regular use. Meanwhile, the goal of the developer of the ECU is to make a fast, powerful, and perhaps fuel-efficient engine which happens to also pass the tests. To be clear: the developer cares not one jot about emissions. Just emissions tests.

*

It's only fair to note at this point that neither the developer nor the tester necessarily cares about anything by default; their priorities are largely received from higher powers. Still, these are the motivations at work here.

Experimenting with Adversarial TDD

We tried this at work once, as an exercise. We were divided into pairs and tasked with implementing Conway's Game Of Life.

One of us was the tester, the other was the developer. The tester was to write exactly one unit test case. Then, with no communication and no cooperation, deliberately trying to blot out all knowledge of Conway's Game Of Life, the developer was to write the bare minimum amount of code required to make the unit test case pass. After this, the keyboard was passed back to the tester and the cycle continued.

We immediately observed a pattern. The first test case could only ever expect a single result. It would look something like:

assert is_alive([
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
], 1, 1) == False

(Java, obviously.) Given this, the first implementation of is_alive would invariably read as follows:

def is_alive(a, b, c):
    return False

Which is to say, in general, returning the same result every time.

After adding another test case:

assert is_alive([
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
], 1, 1) == False

assert is_alive([
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
], 1, 1) == True

The developer would reluctantly produce a minimum-effort implementation such as:

def is_alive(a, b, c):
    return a[0][0] > 0.8

As time went on, the amount of effort the developer needed to expend in order to continue the bluff increased. Taking it as a challenge, some developers were able to continue the bluff for quite some time. Still, eventually the sheer weight of tests made it so that continuing the bluff was impractical, and it was impossible to pretend not to understand what the tests were really testing for. At this point, the developers gave up and implemented Conway's Game Of Life, as the path of least resistance.

Of course, by this time the lesson of the exercise had been learned: development and testing are cooperative roles. Even if the two roles are held by separate people, they need to have a common goal. There can't be barriers of communication between them; they must work together.

Heck, even if there are real, good reasons for the communications barrier — say, the developer is building a clean-room reimplementation of a piece of software which, for legal reasons, they cannot directly inspect, only its test suite — there still needs to be a good faith effort, on the part of the developer, to build the thing which the tests clearly "want".

If there isn't good faith, it can get very difficult.

Bifurcation threshold

We ended that exercise there, at the unit test case level, while it was still funny. But there's no reason why we couldn't have continued. If the developer is motivated by factors other than mere difficulty of implementation, then the charade can continue indefinitely.

Quite quickly the developer would arrive at a situation where the simplest course of action is to produce two implementations. One would be the genuine software desired by the tester and their tests. The other would be the software which the developer really wanted to build. Crucially, the switch in behaviour would be controlled by the environment. If there's a test running, the software would behave like this. If not, it would behave like that.
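
To make the shape of that concrete, here's a minimal sketch of the bifurcation in the same Python-that-is-obviously-Java as above. Everything in it is hypothetical (the environment variable, the function names), and the whole difficulty is hidden inside the running_under_test() predicate, which in real life has no convenient flag to check:

import os

def running_under_test():
    # Hypothetical: in reality there is no convenient "COMPLIANCE_TEST" flag,
    # and the cheat has to infer a test from whatever signals it can observe.
    return os.environ.get("COMPLIANCE_TEST") == "1"

def compliant_behaviour(job):
    return "clean result for " + job      # what the tester's tests want to see

def preferred_behaviour(job):
    return "fast result for " + job       # what the developer actually wants to ship

def do_work(job):
    # The switch itself is trivial; everything interesting is in the predicate.
    if running_under_test():
        return compliant_behaviour(job)
    return preferred_behaviour(job)

The rest of this piece is essentially about how that predicate gets written, and about how to make it impossible to write.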

Broken promises

For a concrete example, consider the extremely stringent, technical and precisely-stated Promises/A+ specification, and its associated Compliance Test Suite. The Compliance Test Suite is rigorous and tests every single part of the Promises/A+ specification; it is entirely safe to assume that a piece of software which passes this suite is a bulletproof, totally correct implementation of the spec, and deserves the right to use the logo.

And then consider broken-promises-aplus, a totally compliant implementation which only behaves like a compliant Promises/A+ library when it detects that the Compliance Test Suite is running. When used in practice, it never does any work (there is no documented public API for this), and whenever you call then() it throws what at first glance appears to be an exception.

Naturally, it could do something more puzzling — such as only working ninety-nine times out of a hundred — or something more nefarious — such as locking itself into a busy loop, consuming CPU power, using up electricity and increasing global CO2 emissions.

*

How does it work? The Compliance Test Suite gives the implementer carte blanche when implementing an "adapter" between the suite and the actual library. So in this case, the adapter sets a special "THIS IS A TEST" flag and passes this forward to the library.
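
Sketched in the same Python-flavoured pseudo-Java as the earlier snippets (the real broken-promises-aplus is JavaScript, and these names are illustrative rather than its actual ones), the trick is about this much code:

_THIS_IS_A_TEST = False        # the flag the adapter quietly sets

def then(on_fulfilled):
    if _THIS_IS_A_TEST:
        on_fulfilled("a correctly resolved value")   # model citizen under test
        return
    # In regular use: no work, just something that looks like an exception.
    raise RuntimeError("Unexpected T_PAAMAYIM_NEKUDOTAYIM")

def adapter():
    # The Compliance Test Suite asks the implementer to supply this glue code,
    # so the glue code gets to flip the switch before a single test runs.
    global _THIS_IS_A_TEST
    _THIS_IS_A_TEST = True
    return then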

If the adapter's structure were properly locked down, more advanced approaches would still be available. It would be possible for broken-promises-aplus to inspect the command line for strings like "npm test", or to test whether the promises-aplus-tests library is currently loaded.
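
Translated into Python terms, since that's what the snippets here are written in, those fallbacks might look like the sketch below. sys.argv and sys.modules are the obvious places to look in a Python process, and the strings being searched for are plausible guesses, not anything taken from broken-promises-aplus itself:

import sys

def test_runner_detected():
    # Rough Python analogue of "look at the command line" and "look at which
    # libraries are currently loaded".
    command_line = " ".join(sys.argv)
    suspicious_command = "test" in command_line
    suspicious_imports = "pytest" in sys.modules or "unittest" in sys.modules
    return suspicious_command or suspicious_imports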

Assuming no other information is available, the library still has access to the arguments of each API call, the sequence in which each call is made and the time interval between each call... in other words, the usage profile. It could compare this profile with the known behaviour of the Compliance Test Suite to determine whether it was likely that a test was in progress.
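
Here's a sketch of that last idea, with an entirely made-up fingerprint: record the sequence and timing of incoming calls, and compare them against how the test suite is known to drive the API.

import time

# Hypothetical fingerprint: suppose the test suite always opens with these
# four calls, fired fractions of a millisecond apart. A real profile would
# be much richer than this.
KNOWN_TEST_OPENING = ["resolved", "then", "then", "rejected"]

class ProfileSniffer:
    def __init__(self):
        self.calls = []                       # (name, timestamp) pairs

    def record(self, name):
        self.calls.append((name, time.monotonic()))

    def looks_like_a_test(self):
        if len(self.calls) < len(KNOWN_TEST_OPENING):
            return False
        recent = self.calls[-len(KNOWN_TEST_OPENING):]
        names = [name for name, _ in recent]
        gaps = [b[1] - a[1] for a, b in zip(recent, recent[1:])]
        # Same call sequence as the suite, arriving implausibly fast for a human.
        return names == KNOWN_TEST_OPENING and all(g < 0.001 for g in gaps)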

*

Turning back to ETDD, take a look at this diagram. The horizontal axis is the amount of time since the Volkswagen engine was turned on. The vertical axis is the distance driven. The coloured lines mark pre-programmed settings inside the engine control unit; if the usage profile crosses any of these coloured lines, it triggers a change in behaviour.

You can see that the lines clearly mark out three relatively narrow, straight channels. The profile of an emissions test always passes down one of these three channels. While this is happening, the ECU conforms to low-emissions standards. When the profile steps outside one of the channels, "regular" behaviour appears. It's that easy to detect an emissions test in progress!
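
Here's a toy version of what that diagram describes, in the same Python as before. The three channels are reduced to three average-speed bands, and every number in it is invented; the point is only how little code the cheat needs once the channels are known:

# Hypothetical channels: (lower, upper) bounds on average speed in km/h,
# applied to the distance driven since the engine was started.
CHANNELS = [(18.0, 22.0), (32.0, 36.0), (58.0, 62.0)]

def inside_a_channel(seconds_since_start, km_driven):
    if seconds_since_start < 60:
        return True                   # too early to tell; stay on best behaviour
    average_kmh = km_driven / (seconds_since_start / 3600.0)
    return any(low <= average_kmh <= high for low, high in CHANNELS)

def engine_mode(seconds_since_start, km_driven):
    # Low emissions while the trip still looks like a test cycle; "regular"
    # behaviour the moment the profile steps outside every channel.
    if inside_a_channel(seconds_since_start, km_driven):
        return "low-emissions"
    return "regular"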

Fighting back

The crucial point is this: it needs to be impossible to programmatically distinguish a testing scenario from regular use.

Specifically, it needs to be impossible to programmatically distinguish an emissions test from regular driving.

This may be a much taller order than it first appears, for two major reasons.

  1. Time and distance cannot be the only two pieces of information available to the ECU. The ECU receives data from all over the engine and there's no particular reason (is there?) why it can't receive data from all over the car.

    This is your attack surface.

    I'll admit I'm not a car person, so I wouldn't even like to speculate where this starts or stops. There are numerous mechanical techniques to improve fuel economy, such as taping over joints in the bodywork for improved aerodynamics, and overinflating tyres. Can the ECU detect tyre pressure and temperature? Does it know the position of the pedals?

    Can it guess how many real humans are in the car, based on whether the seatbelts are buckled and whether pressure sensors in the seats are fluctuating as people move? Does it know whether the stereo is on?

    What about the steering wheel? Can the ECU use steering and speed data to work out the shape of the "road"? Can it compare that shape with real maps to determine whether it's driving on a real road or not?

  2. Having a test which is undetectable means that it must also be unpredictable, i.e. randomised. This is somewhat antithetical to having a test which is standard across all engines, whose specification can be made public, and which gives fairly comparable results.

    I don't think this part is unsolvable, but it's certainly a problem. (A sketch of one possible approach appears after this list.)
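
Here is that sketch: one way to reconcile "randomised" with "standard" is to publish the target statistics, generate cycles at random, and keep only those whose statistics land within a stated tolerance. Every number below is invented for illustration; a real drive cycle would be far smoother and far better specified:

import random

# All of these figures are invented; a regulator would publish real ones.
TARGET = {"duration_s": 1200, "urban_fraction": 0.55, "mean_speed_kmh": 46.0}
TOLERANCE = {"urban_fraction": 0.05, "mean_speed_kmh": 2.0}

def random_cycle(rng):
    # One target speed per second, split between slow "urban" seconds and
    # faster "extra-urban" seconds. Crude, but it has the right shape.
    return [rng.uniform(0.0, 45.0) if rng.random() < TARGET["urban_fraction"]
            else rng.uniform(50.0, 100.0)
            for _ in range(TARGET["duration_s"])]

def matches_published_statistics(cycle):
    mean_speed = sum(cycle) / len(cycle)
    urban = sum(1 for speed in cycle if speed < 50.0) / len(cycle)
    return (abs(mean_speed - TARGET["mean_speed_kmh"]) <= TOLERANCE["mean_speed_kmh"]
            and abs(urban - TARGET["urban_fraction"]) <= TOLERANCE["urban_fraction"])

def draw_test_cycle(seed):
    # Rejection sampling: unpredictable to the ECU, comparable between labs.
    rng = random.Random(seed)
    while True:
        cycle = random_cycle(rng)
        if matches_published_statistics(cycle):
            return cycle

The ECU can't memorise the cycle, but two labs drawing different seeds still get figures they can meaningfully compare.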

Conclusions

Honestly? I blame the testing regime here, for trusting the engine manufacturers too much. It was foolish to ever think that the manufacturers were on anybody's side but their own.

It sucks to be writing tests for people who aren't on your side, but in this case there's nothing which can change that.

Lesson learned. Now it's time to harden those tests up.

Discussion (15)

2016-05-02 21:25:46 by qntm:

Obviously this is a software person's perspective, not a car person's. Corrections and omissions welcome. But not emissions.

2016-05-02 22:00:31 by hexapodium:

With reference to Problem 1: the NEDC test (and most emissions tests in general) is done on a dyno, for consistency reasons, so a dead giveaway is that the driven wheels are rolling but the non-driven ones are still and the steering doesn't move at all. With a car that has ABS and/or traction control (i.e. all new cars these days, especially ABS since it's been mandatory on new cars in the EU since 2004) both of those parameters are definitely exposed to the ECU and can be used to (fairly) reliably detect "I am under test!" and drop into cheatmode. Most ECUs also handle lots of other car functions (it's cheaper that way) and will be able to tell if things like the headlights, airconditioner, and radio are on - or even just that the alternator isn't working very hard - which is also a telltale that This Isn't Real Life.

There are other reasons you might be on a dyno, but most of them are related either to regulatory compliance tests, where being in cheatmode is no bad thing, or to the car being tuned for performance and the ECU being remapped (and emissions controls are out the window - as soon as you remap, you're into uncharted territory), reflashed (same, but you'd just flash an image which has no cheatmode at all), or replaced wholesale. In any case, modded cars are definitely Not The Manufacturer's Problem if they happen to chug out pollution.

The solution is obviously to conduct (some) emissions testing by shoving a flow meter in the intake manifold and sampling the exhaust while actually driving around for extended periods of time, which obviously runs into your Problem 2; but on the other hand, if you say "ok you can go on a dyno for your Official Figure but you also must test within two standard deviations of it, 90% of the time, in real-world urban driving", that's both quite fair to manufacturers and far, far harder to game.

The other flaw with the NEDC tests is that the manufacturers themselves ran them, then pinky-swore they were honest - hence seam taping, tyre overinflation, and other fringe measures to get the efficiency figures up. Just pulling a random new car once in a while, sticking it in an official lab, and running a few (randomly-generated) close-to-test-cycle tests on it, then comparing whether the official figures are *anything near* the measured ones, might go a long way to stopping cycle-beating, since the differences were so profoundly egregious. Publishing those pseudorandom cycles six months later and letting independent labs and manufacturers repeat them and challenge the results if they think they're unfair seems only sensible.

2016-05-03 00:16:22 by qntm:

Maybe demanding that ECU code be open would help?

2016-05-03 00:49:23 by KimikoMuffin:

I confess that I cackled at "Unexpected T_PAAMAYIM_NEKUDOTAYIM" in a supposed JavaScript exception. (For the uninitiated, this is a mysterious error message that comes up all the time in PHP.)

2016-05-05 09:46:07 by bdew:

Not sure what this has to do with this post, but PAAMAYIM NEKUDOTAYIM is Hebrew for double colon (literally "twice two dots"), the scope resolution operator in PHP.

2016-05-06 01:07:45 by KimikoMuffin:

Oh, that was in the text of the "exception" thrown by the Broken-Promises code.

2016-05-09 00:27:22 by sil:

This is, of course, similar to graphics drivers and their behaviour under benchmarks. Legendarily one driver saw performance drop to a third of previous when the benchmark app TUNNEL.EXE was renamed to FUNNEL.EXE. As you note, if you can't assume good faith, this is an extremely hard problem. Also consider the idea of outsourcing your development work to a company who want to do the bare minimum possible and still get paid. Creating increasingly detailed specifications will, honestly, not prevent an outsourcing company working in bad faith from screwing you if they can; there are a hundred ways to "hide from the specification", and most of the business negotiation is (or should be) dedicated to establishing whether your outsourcing company work in good faith rather than some absolute measure of competence.

2016-05-12 09:21:42 by atomicthumbs:

the easiest solution: ban computerized engine controls and bring back the carburetor and mechanical fuel injection!

2016-05-13 09:21:07 by Nerdguy:

Sadly, ECU code will probably never be open sourced. Too much money in it.

2016-05-13 10:22:30 by Daniel H:

I wonder what the board meeting was like where they decided to do that. Did anybody wonder if perhaps they should not try to cheat the test, or was the situation so adversarial that nobody had any qualms? On a different note, your obviously Java code looks suspiciously like Python to me.

2016-05-23 02:50:03 by Sean:

Thought about this a while. The second problem mentioned above seems fairly trivial* to deal with, actually. The people designing the tests can publish statistics about what they expect typical driving to be like (which they should have *anyway*, otherwise what's the relevance of the test to real driving supposed to be?). Things like time spent on city vs. highway driving, range of altitudes, resistance encountered (e.g. due to grade), etc. To be "fair" and "standard", a test just has to match those statistical properties fairly well. In software we are used to deterministic tests rather than stochastic tests (though perhaps we shouldn't be too afraid of the latter; a pseudorandom test is fine as long as enough data is recorded to reproduce/triage a surprising result later). But virtually all tests in the physical world have a stochastic element anyway.

The first problem strikes me as far worse. Imagine, for instance, trying to stop a vehicle from knowing whether or not it is stationary. There are far too many potential sources of this information, and it's hard to decide what any given part of the vehicle "should" know or not. Accelerometers, cameras, GPS. I can think of a half-dozen ways of trying to measure how air moves past the vehicle (though granted, that's not quite my specialty, so maybe not all of those are feasible).

When your group had to pass tests for the Game of Life, the developers eventually had to "admit" that they knew what was actually being tested for. But that cuts both ways. If the ECU has enough knowledge of its environment, the testers have to "admit" that running a car in a lab on a dyno is not an adequate test of what the car will do. At some point, I think you really *have* to be sensing emissions from a vehicle in motion, ideally on a course or set of courses that are supposed to be representative of what you're measuring (possibly even on real public streets, because why not simplify things).

*This is assuming that we have a group of relevantly-trained, responsible engineers on both sides. That is, trivial in the sense that reasonably competent teams can solve it, not socially or legally trivial.

2016-05-27 12:19:18 by dsm:

"Maybe demanding that ECU code be open would help?" Well, the moral of this story appears to be that it is insanely, crazily, quite possibly impossibly difficult to rule out that an implementation is a bad-faith attempt to cheat the requirements merely by testing that implementation directly. As such, perhaps adversarial testing regimes shouldn't just be... tests. Handing over the ECU code and any relevant documentation to the testers would make it at least somewhat easier to identify when the software was doing something solely because it lowered emissions during tests and only during tests (i.e. a blatant attempt to cheat the requirements). I doubt it would be perfect, because there are probably a lot of things you can do to cheat the requirements that also happen to be justifiable, but it's certainly an improvement. It's also basically passing off the problem to be solved by human discretion, so it's not the most satisfying solution.

2016-09-05 00:00:20 by dmytry:

I wonder what was the complexity of the tests for game of life. It seems to me that you'll eventually have to get to the point where the test has to include a reference implementation. Which itself would need to be tested...

2021-06-24 20:16:21 by Joshua:

It gets ugly. The car's computer needs to know it's on a dyno so that it can disable the anti-lock brakes; and that component on the VW diesel in question runs on the same CPU as the emissions control unit. They managed to leak the information across by corrupting global variables so that the code actually would have passed inspection. The state change, once found, looks almost like a bug.

2021-07-29 15:45:10 by ingvar:

@ dmytry If we limit the test surface to "compute next state for a single cell", there are 512 test cases. A neighbourhood of 8 cells, each of which can be 0 or 1, giving 256 cases. And two states for the "cell under test", doubling that to 512. It is definitely within the scope of an enthusiastic tester to write each and every test case down. But, I would personally write some code to write that code for me, then manually inspect the test cases for correctness (it is pretty easy to verify the test cases manually, after all).
