
FlakyIO Framework

Michael W. Bigrigg, Copyright 2004-2007

Rather than designing a system and then implementing it, we built several models of exception injection in order to understand the important pieces of the design. We designed and built four initial versions: FlakyC, FlakyPalm, FlakyDisk, and FlakyNet. While there are many instances of FlakyIO, including newer ones such as FlakyPHP, FlakyJava, and FlakyCORBA, the usefulness of FlakyIO is as a model from which you can build your own I/O testing development environment.

The protocol of communication between caller and callee must provide a mechanism for reporting an exception. A good example of a function with no such mechanism is atoi() in C. If the string provided is not a valid number, atoi() has no way to say so: it returns 0, which is indistinguishable from a successful conversion of "0", and the behavior is undefined if the value is out of range. It is therefore not possible to evaluate the caller's ability to handle exceptional conditions, because the function does not express those conditions back to the caller.
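As an illustration of a call protocol that does carry an exception, the following minimal sketch wraps strtol(), which reports failure through its end pointer and errno. The helper name parse_int and its return convention are assumptions made for this example; they are not part of FlakyIO.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts text to an int and reports failure to the
 * caller. Returns 0 on success, -1 if text is not a complete decimal
 * integer or does not fit in an int. Leading whitespace is accepted,
 * as it is by strtol(). */
int parse_int(const char *text, int *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return -1;                  /* no digits, or trailing characters */
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return -1;                  /* value out of range for int */
    *out = (int)value;
    return 0;
}
```

With this interface a caller can tell "abc" (returns -1) apart from "0" (returns 0 and stores 0), which is the distinction that atoi() cannot express.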

The FlakyIO framework begins with a working system. This is very important so that we know that any exception that the application experiences comes from our system.

We then use a software interceptor technique to create a module, which we call the exception engine. A software interceptor is a form of filter that places itself between the caller and the callee. Our interceptor, the exception engine, injects exceptions into an otherwise correctly running application.
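A minimal sketch of the interceptor idea for a single call, fopen, might look like the following. The names flaky_fopen and flaky_set_fopen_failure, and the single-counter engine state, are hypothetical and chosen only for this example. EIO comes from POSIX rather than ISO C.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical engine state: fail the Nth call to flaky_fopen (0 = never). */
static unsigned long fopen_calls = 0;
static unsigned long fopen_fail_at = 0;

/* Arms the engine and resets the call counter. */
void flaky_set_fopen_failure(unsigned long nth_call)
{
    fopen_calls = 0;
    fopen_fail_at = nth_call;
}

/* Interceptor: sits between the caller and the real fopen(). */
FILE *flaky_fopen(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);

    fopen_calls++;
    if (fopen_fail_at != 0 && fopen_calls == fopen_fail_at) {
        errno = EIO;            /* injected exception; real fopen is not run */
        return NULL;
    }
    return fopen(path, mode);   /* otherwise pass the call through unchanged */
}
```

The application under test would call flaky_fopen wherever it previously called fopen. Because the wrapper returns exactly what a failed fopen would return, the caller cannot distinguish an injected exception from a real one.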

FIGURE 1. Interception

There are four primary design issues in the creation of a SWEI system. They are interrelated, as one design choice will affect the ability to accomplish the others.

Exception Boundary

We drew boundary marks based on function calls. This allowed us the cleanest separation between caller and callee. Exceptions, aside from the machine exceptions of illegal pointer access, divide-by-zero, and overflow, are raised at a function call boundary. In object-oriented systems with operator overloading, the operations are actually function calls rather than machine instructions. The operation A = B + C is actually implemented as A = add(B,C). What looks like a machine operation is actually implemented in software. Application features not available in hardware are implemented via a function call inserted by the compiler. This gives us additional confirmation that the function call boundary is a good choice.

The I/O calls have no side effects. They do not leave the system in a corrupt state. If an error does occur, the operation can be considered atomic. I/O operations do not half-execute anything. No matter under what conditions you wish to emulate the exception occurring, you can model it as if the function call was not ever executed.

We may have to relax our no-fault-model approach. A fault model provides a way to implement the side effects of the system. In I/O exception injection, it has not been necessary to construct side effects based on the faults that cause the exceptions.

Injection Engine Location

While we tried different locations for the engine, the location is basically on the application side or on the system side. The application side can either have the engine mechanism inserted via program transformation of the source code or a modification of the object code. When the engine is in the system, it is introduced manually into the system used by the application. The major difference between locations is in what can be known about the function call. In the system we can only distinguish between functions, but within applications we can distinguish between the different function call instances.

For instance, in FlakyC, which has the engine contained in the application, we were able to determine that fread was called from three different locations in the code, but from the system side all we would be able to tell is that fread was called several times. In both cases we can tell the overall number of times the function was called. Only on the application side can we tell from how many places.

The importance of this distinction depends on what you are trying to measure. If you are only looking for an absolute number to gauge the robustness of the application in the context of a particular system, then the system approach is suitable. If you need to know the failure condition in order to fix it, then source code information is necessary. As part of the source code modification, the program needs to pass along a specific instance number so that a failure can be traced back to the particular function call instance in the original program.
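The difference between the two views can be sketched as follows, assuming a source transformation that rewrites each textual occurrence of fread into a wrapper call carrying its own instance number. The wrapper name flaky_fread, the fixed-size counter table, and the two query functions are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_SITES 64

/* Hypothetical per-call-site counters, indexed by the instance number that
 * a source transformation assigns to each textual occurrence of fread. */
static unsigned long fread_site_calls[MAX_SITES];

size_t flaky_fread(int site, void *buf, size_t size, size_t count, FILE *fp)
{
    assert(site >= 0 && site < MAX_SITES);
    fread_site_calls[site]++;       /* record which instance was executed */
    return fread(buf, size, count, fp);
}

/* Application-side view: number of distinct call sites that were used. */
int flaky_fread_sites_used(void)
{
    int site, used = 0;
    for (site = 0; site < MAX_SITES; site++)
        if (fread_site_calls[site] != 0)
            used++;
    return used;
}

/* System-side view: only the total number of calls is available. */
unsigned long flaky_fread_total_calls(void)
{
    unsigned long total = 0;
    int site;
    for (site = 0; site < MAX_SITES; site++)
        total += fread_site_calls[site];
    return total;
}
```

When an injected exception is mishandled, the instance number identifies which occurrence in the source to inspect, whereas the total alone only says that some call to fread failed.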

Exception Pattern Specification

The pattern specification is the way a user expresses when an exception should be raised. We used a file, read at application start, to hold the pattern of exceptions to be raised. It listed each function call along with two numbers that expressed when to raise the exception. The first number identified the call on which the exception should start. The second was a flag identifying either a transient (one-time-only) exception or a permanent (continued) exception. For instance, given the following two lines:

fopen 3 0
fclose 5 1

The fopen call will raise a one-time-only exception the third time fopen is called. The fclose call will start generating exceptions the fifth time it is called and will continue to do so for the duration of the application.
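The semantics of these two lines can be sketched in C as follows. This is a reading of the format as described here, not the original FlakyIO parser. The structure flaky_rule and the functions flaky_add_rule and flaky_should_raise are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_RULES 32
#define NAME_LEN  32

/* One line of the pattern file: "<function> <start> <flag>". */
struct flaky_rule {
    char name[NAME_LEN];
    unsigned long start;    /* call number on which exceptions begin */
    int permanent;          /* 0 = transient (once), 1 = permanent */
    unsigned long calls;    /* calls to this function seen so far */
};

static struct flaky_rule rules[MAX_RULES];
static int rule_count = 0;

/* Parses one line; returns 0 on success, -1 if malformed or table full. */
int flaky_add_rule(const char *line)
{
    struct flaky_rule r;

    assert(line != NULL);
    if (rule_count >= MAX_RULES)
        return -1;
    if (sscanf(line, "%31s %lu %d", r.name, &r.start, &r.permanent) != 3)
        return -1;
    r.calls = 0;
    rules[rule_count++] = r;
    return 0;
}

/* Called on every intercepted call; returns 1 if an exception is due. */
int flaky_should_raise(const char *name)
{
    int i;

    assert(name != NULL);
    for (i = 0; i < rule_count; i++) {
        struct flaky_rule *r = &rules[i];
        if (strcmp(r->name, name) != 0)
            continue;
        r->calls++;
        if (r->permanent)
            return r->calls >= r->start;    /* from start onward */
        return r->calls == r->start;        /* exactly once */
    }
    return 0;                               /* no rule for this function */
}
```

An interceptor would call flaky_should_raise with its own function name before deciding whether to pass the call through or return an error.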

We had envisioned a much more complex pattern specification, but this simple model was already helping us uncover a large number of problems. Our application tests did not run more than a minute and had consistent execution paths. A user interactive application would probably need a more complex pattern generation specification. The exceptions are raised based on function call name, not based on argument information. In a larger application it would be helpful to distinguish between data streams.

This file is typically generated manually. It was meant for programmers testing their own code. If this were a benchmark system, the exception specification would be generated automatically from a full path analysis of the program in order to achieve full coverage testing.

We developed a graphical interface to the behavior control file. An example screen shot of this interface is shown in Figure 2.

FIGURE 2. Behavior Control Interface Application

Its purpose is not just to provide a more pleasing interface, but to allow for the grouping of exceptions. For instance, in the C standard I/O library there are many function calls for reading, such as fgets and fread, to name two. The interface application allows a developer not only to choose individual functions to raise an exception, but also to raise exceptions based on a family of routines, such as input. The advantage of this is that a developer can begin to understand the basic assumptions built into the program, such as the belief that printing always works.
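One simple way to represent such families is a lookup table from function name to family. The sketch below is illustrative only: the enum flaky_family, the table contents, and the function flaky_family_of are assumptions for this example, and the grouping shown is deliberately incomplete.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum flaky_family { FLAKY_INPUT, FLAKY_OUTPUT, FLAKY_OTHER };

struct flaky_member {
    const char *name;
    enum flaky_family family;
};

/* Illustrative, incomplete grouping of C standard I/O functions. */
static const struct flaky_member members[] = {
    { "fgets",   FLAKY_INPUT  },
    { "fread",   FLAKY_INPUT  },
    { "fgetc",   FLAKY_INPUT  },
    { "fscanf",  FLAKY_INPUT  },
    { "fputs",   FLAKY_OUTPUT },
    { "fwrite",  FLAKY_OUTPUT },
    { "fputc",   FLAKY_OUTPUT },
    { "fprintf", FLAKY_OUTPUT },
    { "printf",  FLAKY_OUTPUT }
};

/* Returns the family a function belongs to, or FLAKY_OTHER if unlisted. */
enum flaky_family flaky_family_of(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof members / sizeof members[0]; i++)
        if (strcmp(members[i].name, name) == 0)
            return members[i].family;
    return FLAKY_OTHER;
}
```

Selecting the output family would then fail every call whose name maps to FLAKY_OUTPUT, which is one way to test the assumption that printing always works.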

Execution Correctness

This is the manual part of the process. While there are automated mechanisms to support the other phases, this one is the most difficult. A catastrophic error is easy to identify if the application or machine crashes, but I/O failures do not typically cause a catastrophic failure. Incorrect exception handling behavior consists of either ignoring an exception that should not be ignored, or retrying without ever giving up and returning control to the application.

An exception identifies whether the result of the operation is successful. It cannot be assumed that the result is valid unless the exception is acknowledged and determined not to indicate an error. An exception must be examined.

Correct handling of an error may be to ignore the results of the operation. For instance, a failed background save of the document as I type this paper can be ignored for some number of attempts.

There may be other alternative solutions. The system can write to a local disk rather than a remote one, should writing to the remote disk raise an exception. If the problem was a transient error, retrying the operation may result in success. Care must be taken when using alternative solutions. The user may have no recourse but to kill the application if the system continues with never-ending attempts at alternative solutions.
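A bounded version of this strategy can be sketched as follows. The function names, the limit of three attempts, and the return codes are assumptions chosen for the example. The point is only that every path eventually returns control, with a failure code, to the caller.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ATTEMPTS 3

/* Tries once to write text to path; returns 0 on success, -1 on failure. */
static int write_once(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;    /* fclose can fail too */
}

/* Retries the primary path a bounded number of times, then tries the
 * fallback path once. Returns 0 if the primary write succeeded, 1 if the
 * fallback succeeded, and -1 if everything failed. */
int save_with_fallback(const char *primary, const char *fallback,
                       const char *text)
{
    int attempt;

    assert(primary != NULL && fallback != NULL && text != NULL);
    for (attempt = 0; attempt < MAX_ATTEMPTS; attempt++)
        if (write_once(primary, text) == 0)
            return 0;
    if (write_once(fallback, text) == 0)
        return 1;
    return -1;      /* give up and let the caller report the failure */
}
```

The distinct return value for the fallback case lets the caller tell the user where the data actually went.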

We have classified the response to exceptions as HC, HI, S, S2, L and C.

HC The application handles the exception correctly. This is primarily classified as behavior that will potentially try alternative mechanisms, but will ultimately return the error to the application, which in turn will report the failure back to the user.

HI When an exception is handled incorrectly, the most common behavior is reporting back to the user an incorrect assessment of the situation. This is less severe as it does acknowledge an error, but does not provide an overall correct solution.

S Silent failures are one of the worst forms of mishandling an exception. No acknowledgement of the error is given and processing continues as if the operation was successful.

S2 Many applications will use an error reporting mechanism, but do not check to make sure the error reporting mechanism does not raise an exceptional condition. We have partitioned this type of error into a sister category to the silent failures. An effective solution to an exception being raised during error reporting would be to exit the application with an appropriate return value sent to the operating system.
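The remedy suggested for S2 can be sketched in C as follows. The function names report_error and report_or_exit are hypothetical. The sketch checks both fprintf and fflush, since a buffered write may not fail until the stream is flushed.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Writes an error message to log. Returns 0 on success, -1 if the
 * reporting mechanism itself raised an exceptional condition. */
int report_error(FILE *log, const char *message)
{
    assert(log != NULL && message != NULL);
    if (fprintf(log, "error: %s\n", message) < 0)
        return -1;
    if (fflush(log) == EOF)     /* buffered output may fail only here */
        return -1;
    return 0;
}

/* Reports the error; if reporting fails, exits with a failure status so
 * that the operating system still learns that something went wrong. */
void report_or_exit(FILE *log, const char *message)
{
    if (report_error(log, message) != 0)
        exit(EXIT_FAILURE);
}
```

An application that only ever calls fprintf(stderr, ...) without inspecting the result would fall into category S2 when the report itself fails.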

L Some applications will continue to try alternative solutions indefinitely, effectively causing the system to sit in an infinite loop.

C This is a catastrophic response. In I/O, a buffer has already been allocated, so it would only be processing based on the garbage data left in the buffer that could potentially lead to a crash. In a system with C++ or Java runtime exceptions, an exception that propagates unhandled up the call stack would terminate the application.

Acknowledgements

The user interface software was developed at the University of Pittsburgh by Alexander Poulis.