9 Using Modeling and Simulation in Test Design and Evaluation
As part of system development, many industries, including the automobile industry, make substantial use of modeling and simulation to help understand system performance. Modeling and simulation (sometimes referred to here as simulation) are currently used for a number of applications in the Department of Defense, notably for training users of new systems and to help support arguments presented in the analysis of alternatives (formerly cost and operational effectiveness analyses) to justify going ahead with system development. Its success in similar applications in industry, and its cost, safety, and environmental advantages over operational testing, have raised interest in the use of modeling and simulation in operational testing and evaluation (where it enjoys the same advantages). The use of simulation to augment operational test design and evaluation (and other related purposes such as to provide insight into data collection) has been strongly advocated by DoD officials. Although a number of workshops have been devoted to the use of simulation for assisting in the operational test and evaluation of defense systems, the precise extent to which simulation can be of assistance for various purposes, such as aiding in operational test design, or supplementing information for operational test evaluation, remains unclear. Great care is needed to ensure that the information provided by modeling and simulation is useful since the uncritical use of modeling and simulation could result in the advancement of ineffective or unreliable systems to full-rate production, or, conversely, the delay or return to development of good systems.
As the uses of simulation advance from training to test design to test evaluation, the demands on the validation of the simulation increase. Unfortunately, it is difficult to comprehensively determine whether a validation is "sufficient," since (as discussed in Chapter 5) there are a large variety of defense systems, types of simulations, and purposes and levels of system aggregation for which simulations might be used.
The defense community recognizes three types of simulations: live, virtual, and constructive. A live simulation is simply an operational test, with sensors used to identify which systems have been damaged by simulated firings, using real forces and real equipment. It is the closest exercise to real use. A virtual simulation ("hardware-in-the-loop") might test a complete system prototype with stimuli either produced by computer or otherwise artificially generated. This sort of exercise is typical of a developmental test. A constructive simulation is a computer-only representation of a system or systems.
Thus, a simulation can range from operational testing itself to an entirely computer-generated representation (i.e., no system components involved) of how a system will react to various inputs. It can be used for various purposes—including test design and developmental and operational test evaluation—and at various levels of system aggregation, ranging from modeling a system's individual components (e.g., system software or a radar component, often ignoring the interactions of these components with the remainder of the system), to modeling an entire prototype, to modeling multiple system interactions.
The panel examined a very small number of simulations proposed for use in developmental or operational testing, and the associated presentations and documentation about the simulations and their related validation activities were necessarily brief. They included: RAPTOR, a constructive model that estimates the reliability of systems based on their reliability block diagram representations; a constructive simulation used to estimate the effectiveness of the sensor-fuzed weapon; and a hardware-in-the-loop simulation used to assess the effectiveness of Javelin, an anti-tank missile system.
The panel did not perform an in-depth analysis of the validation of any of these models or their success in augmenting operational experience. However, preliminary impressions were that RAPTOR would be useful for assessing the reliability of a system only if the reliability of each component had been previously well estimated on the basis of operational experience and only if the assumed independence of the system's components was reasonable; the simulation for the Javelin was able to successfully measure system effectiveness for some specific scenarios; and the simulation used to determine which subsystems in a tank would be damaged by the sensor-fuzed weapon was not likely to be informative. This is not to say that the simulation for the sensor-fuzed weapon was not reasonably predictive, given current technology; however, the physics needed to predict the path of (and damage caused by) a fast-moving piece of metal impacting various points on a tank is much more complicated than for the other two simulation applications.
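To make the independence caveat concrete, the following is a minimal sketch, in Python, of the kind of calculation a reliability-block-diagram model performs when components are assumed to fail independently; the block structure and component reliabilities are hypothetical and are not taken from RAPTOR itself.

    # Illustrative sketch only: system reliability from a reliability block diagram
    # under the assumption that components fail independently. The block structure
    # and component reliabilities below are hypothetical, not drawn from RAPTOR.

    def series(*reliabilities):
        """All components must work: multiply the component reliabilities."""
        result = 1.0
        for r in reliabilities:
            result *= r
        return result

    def parallel(*reliabilities):
        """The block works if any redundant component works."""
        prob_all_fail = 1.0
        for r in reliabilities:
            prob_all_fail *= (1.0 - r)
        return 1.0 - prob_all_fail

    # Hypothetical system: a sensor in series with a redundant pair of processors,
    # in series with a launcher subsystem.
    r_sensor, r_processor, r_launcher = 0.95, 0.90, 0.98
    r_system = series(r_sensor, parallel(r_processor, r_processor), r_launcher)
    print(f"System reliability under independence: {r_system:.4f}")  # 0.9217

If component failures are in fact correlated (for example, through a shared power supply or common operating stress), this product form no longer holds, which is precisely the caveat noted above.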
Since there are both effective and ineffective applications of modeling and simulation to (developmental and) operational testing, the key issues are how to identify the models and simulations that can be safely used to augment operational test experience; how to conduct model validation; and how the augmentation should be carried out. Given the breadth of possible applications, only a few of these issues can be addressed here.
A key theme of this chapter is that simulation is not a substitute for operational testing. The chapter defines a comprehensive model validation of a constructive simulation for use in operational testing and evaluation, discusses the use of constructive simulation to assist in operational test design and then to assist in operational test evaluation, and lastly discusses needed improvements to the current uses of modeling and simulation.
The relevant law (paragraphs 2399(a) and 2399(h)(1), Title 10, U.S. Code) states:
A major defense acquisition program may not proceed beyond low-rate initial production until initial operational test and evaluation of the program is completed; and the term 'operational test and evaluation' does not include an operational assessment based exclusively on computer modeling, simulation, or an analysis of information contained in program documents.
The panel strongly endorses this policy.
Models and simulations are typically constructed before substantial operational experience is available, either before or during developmental testing. They are therefore based on information collected from similar systems, from components used in previous systems, or from developmental testing of the current system. As a result (and almost by definition), a simulation often is unable to identify "unanticipated" failure modes—ones that are unique to operational experience. Consequently, it is almost axiomatic that for many systems a simulation can never fully replace an operational test. The challenge is not how to perform a perfectly realistic test, since that is nearly always impossible, but, instead, how to perform a test in which the lack of realism is not a detriment to the evaluation of the system.
There are undoubtedly some characteristics of systems for which modeling is clearly feasible: for example, estimating how quickly a cargo plane can be unloaded or estimating the availability of systems based on failure and repair rates that can be measured in operational conditions. These are situations for which the actions of the typical user are argued, a priori, not to critically affect system performance.
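As a minimal illustration of the second example, the steady-state availability implied by measured failure and repair rates can be computed directly; the numerical values in this Python sketch are hypothetical.

    # Minimal sketch: steady-state availability implied by measured failure and
    # repair rates. The numerical values are hypothetical, for illustration only.
    mtbf_hours = 400.0   # assumed mean time between failures
    mttr_hours = 8.0     # assumed mean time to repair
    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print(f"Estimated steady-state availability: {availability:.3f}")  # ~0.980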
Developmental tests and tests of related systems can obviously provide information about a new system's operational performance. Developmental testing can provide substantial information about failure modes of individual components as well as of the full prototype. Since many components may have been previously used, possibly in a modified form, in other systems, test results and field data might be available and analyzed in conjunction with the current developmental tests. However, in developmental or laboratory testing, the actions of a typical user are not taken into consideration, and therefore the prototype is not tested as a system-user whole. Given the current lack of operational realism in much of developmental testing, some system deficiencies will not exhibit themselves until the most realistic form of full-system testing is performed. Therefore, failure to conduct some operational testing can result in erroneous effectiveness assessments or in missed failure modes and (possibly) optimistic estimates of operational system effectiveness or suitability. As an example, Table 9-1 shows the mean time between operational mission failures for the command launch unit of the Javelin for several types of testing modes, ranging from pure laboratory or developmental testing (RQT I) to initial operational testing. From the reliability qualification test II (RQT II) to the initial operational test (i.e., as troop handling becomes more typical of field experience), the mean time between operational mission failures decreases from 482 hours to 32 hours. Fries (1994b) discusses a number of similar situations where important system limitations were not observed until field testing. 1
Of course, developmental tests can be structured to incorporate various aspects of operational use; elsewhere in the report we recommend that this be done more frequently, and doing so will support simulations that are based on more relevant experience. However, because simulations are built from developmental tests and from experience with related systems, they are by nature more limited than operational testing in the information that they can provide about the operational performance of new systems.
TABLE 9-1 Reliability Assessment of Command Launch Unit of Javelin in Several Testing Situations (Mean Time Between Operational Mission Failures)

RQT I, Reliability Qualification Test I
RQT II, Reliability Qualification Test II (482 hours)
RDGT, Reliability Development Growth Test
PPQT, Preproduction Qualification Test
DBT, Dirty Battlefield Test
FDTE, Force Development Test and Experimentation
IOT, Initial Operational Test (32 hours)

1. As mentioned previously, even field testing makes several compromises to full realism.

Some operational testing is very difficult. For example, testing multisystem engagements can be extremely costly and dangerous (or impossible, if not enough enemy systems are available). Cost arguments alone for simulations should not be made without some analysis of the trade-offs: in many situations, a small number of additional operational test runs may well provide more information than a large number of model runs and thus be worth the additional cost. When the interaction of individual systems is well understood, simulations could be useful in extrapolating to interactions of larger groups of systems. The use of a small number of operational exercises to help calibrate a simulation is also of potential value and is worth further investigation. The Army Operational Test and Evaluation Command is planning to use this approach for the operational evaluation of the ATACMS/BAT system (see Appendix B).
In order to consider using a simulation to augment test design or evaluation, one must understand the extent to which a simulation approximates the real system and in which respects it is more limited, in relation to the intended application. To develop this understanding, model validation is key. The literature on model validation describes several activities, which can be separated into two broad types: external validation and sensitivity analysis. External validation is the comparison of model output with (operationally relevant) observations from the system being modeled. It is helpful if external validation is accompanied by an uncertainty analysis of a model, which is an analysis of the variability of the model output that results from typical variability in model inputs. An uncertainty analysis provides a yardstick for judging which differences between model output and observations of the system can be explained by model uncertainty and which, therefore, are due to model inadequacy.
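The following is a minimal sketch of such an uncertainty analysis in Python; the placeholder response function and the assumed input distributions are purely illustrative and do not correspond to any fielded simulation.

    # Sketch of an uncertainty analysis for a constructive simulation. The
    # response function and the input distributions are placeholders, not any
    # fielded DoD simulation.
    import random

    def simulation_model(detection_range_km, gunner_delay_s):
        # Placeholder response: probability of hit improves with detection range
        # and declines with gunner delay (purely illustrative).
        p = 0.4 + 0.1 * detection_range_km - 0.05 * gunner_delay_s
        return max(0.0, min(1.0, p))

    random.seed(1)
    outputs = []
    for _ in range(10_000):
        rng_km = random.gauss(3.0, 0.5)    # assumed uncertainty in detection range
        delay_s = random.gauss(4.0, 1.0)   # assumed uncertainty in gunner delay
        outputs.append(simulation_model(rng_km, delay_s))

    mean = sum(outputs) / len(outputs)
    std = (sum((y - mean) ** 2 for y in outputs) / (len(outputs) - 1)) ** 0.5
    print(f"mean output {mean:.3f}, standard deviation {std:.3f}")
    # A field observation that differs from the simulation output by much more
    # than this spread is hard to attribute to input uncertainty alone and points
    # instead to model inadequacy.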
In stating that external validation is a comparison of model output with field observations, there is at least an implicit requirement for a performance criterion that indicates whether the simulation model conforms (adequately) to field experience. Agreement on this performance criterion is a crucial step in an external validation. Consider, for example, the sensor-fuzed weapon. A simulation was used to predict the damage resulting from use of the sensor-fuzed weapon against a tank. The criterion used by the Air Force Operational Test and Evaluation Center to validate the model was whether those subsystems the weapon actually damaged were predicted by the simulation to have a high probability of being damaged. The problem with this criterion is that there is no parallel check that subsystems that were not damaged were predicted by the simulation to have a low probability of damage. In other words, a simulation that simply predicted that all subsystems would be damaged with high probability would have scored well under this criterion. The more stringent test of the simulation model, which was not done, would be an assessment of its ability to discriminate, with high probability, between subsystems that would and would not be damaged. In this case, a laudable attempt to use field data to validate a simulation failed to achieve its intended result due to a poor understanding of appropriate validation criteria. 2
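A hedged Python sketch of what a two-sided check might look like follows; the subsystem names, predicted probabilities, and observed outcomes are invented for illustration and are not the actual sensor-fuzed weapon data.

    # Sketch of a two-sided validation check. The subsystem names, predicted
    # damage probabilities, and observed outcomes (1 = damaged, 0 = undamaged)
    # are invented for illustration.
    predicted = {"turret drive": 0.90, "optics": 0.80, "radio": 0.70,
                 "engine": 0.85, "tracks": 0.75}
    observed = {"turret drive": 1, "optics": 1, "radio": 0, "engine": 0, "tracks": 1}

    # One-sided criterion (the one actually used): damaged subsystems were
    # predicted to have a high probability of damage.
    damaged_ok = all(predicted[s] >= 0.5 for s, hit in observed.items() if hit == 1)

    # The missing parallel check: undamaged subsystems should have been predicted
    # to have a LOW probability of damage. A simulation that predicts "high"
    # everywhere passes the first check but fails this one.
    undamaged_ok = all(predicted[s] < 0.5 for s, hit in observed.items() if hit == 0)

    # A summary score that rewards discrimination: the Brier score, the mean
    # squared difference between predicted probability and outcome (lower is better).
    brier = sum((predicted[s] - observed[s]) ** 2 for s in predicted) / len(predicted)
    print(damaged_ok, undamaged_ok, round(brier, 3))  # True False 0.265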
Sensitivity analysis is an analysis of the response surface of a model, especially the sign and rough magnitude of the relationship between the inputs to and the outputs from a model. The simplest version is the analysis of changes in model output resulting from incremental changes in single inputs away from a central scenario, referred to as one-variable-at-a-time sensitivity analysis. A sensitivity analysis can be indicative of model failings if the direction or magnitude of the sensitivity of model outputs to various inputs does not agree with subject-matter expertise.
Since external validation of a model that predicts operational performance involves an operational test (or a developmental test with operational features), it can be expensive, and it is not surprising that, in the few examples we reviewed, the number of replications of external validation was fairly small. 3 Also, for the few examples reviewed, sensitivity analysis tended to be one-variable-at-a-time for a limited number of inputs. It is not unusual for constructive simulation models in defense testing applications to have hundreds or thousands of variable inputs and dozens of outputs of interest. While one-variable-at-a-time sensitivity analysis has the benefits of ease of interpretation and ease of estimation of partial derivatives at the central scenario, using it will often result in missing important interactions between inputs and missing strong curvatures in the relationship between inputs and outputs. Such an analysis is also limited to providing information on the behavior of the simulation in the neighborhood of the central scenario, and it is a very inefficient method for understanding the simulation model's response surface, in the same way that experimental designs that vary only one factor between test runs are inefficient. Especially for constructive simulation models that have relatively fast turnarounds, there is no reason to perform such a limited analysis.

2. This example also argues for greater statistical expertise in the test community, since this is a mistake that is well described in the statistical literature.

3. We note, however, that there were certainly more than in other applications of modeling in public policy (see, e.g., Citro and Hanushek, 1991).
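To illustrate why one-variable-at-a-time analysis can miss interactions, here is a minimal Python sketch comparing it with a small factorial design; the response function, with its night-by-rain interaction, is hypothetical.

    # Sketch contrasting one-variable-at-a-time (OAT) sensitivity analysis with a
    # small factorial design. The response function is hypothetical; it contains
    # an interaction (rain hurts performance far more at night) that OAT around a
    # central day/dry scenario cannot detect.
    from itertools import product

    def model(night, rain):  # factors coded 0/1
        return 0.8 - 0.1 * night - 0.1 * rain - 0.4 * night * rain

    center = model(0, 0)
    oat_effects = {"night": model(1, 0) - center,   # -0.1
                   "rain": model(0, 1) - center}    # -0.1

    # 2x2 factorial: all four combinations, so the interaction is estimable.
    runs = {(n, r): model(n, r) for n, r in product([0, 1], repeat=2)}
    interaction = runs[(1, 1)] - runs[(1, 0)] - runs[(0, 1)] + runs[(0, 0)]

    print(oat_effects)                                        # modest, additive effects
    print(f"night-by-rain interaction: {interaction:+.1f}")   # -0.4

For simulations with many inputs, the same point argues for designed experiments on the model (for example, fractional factorial or space-filling designs) rather than probing one input at a time.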
External validation and sensitivity analysis are often accompanied by the following activities: (1) model verification, checking to see that the computer code accurately reflects the modeler's specifications; (2) evaluation of model output either for particular inputs or for extreme inputs—where the resulting output can be evaluated using subject-matter expertise; and (3) comparisons of the model output with the output of other models or with simple calculations. Model verification, of course, is an extremely important activity. The quasivalidation activities (2) and (3) can lead to increased face validity—which we define here as the model's output agreeing with prior subject-matter understanding—which is often worthwhile.
However, it is important to stress that face validity is insufficient. If validation is limited to these latter activities, one can be misled by agreement with preconceived notions or with models that are based on a set of commonly held and unverified assumptions. In addition, there will be little or no support for the feedback loops needed to indicate areas of the model in need of improvement; there will be little indication of the quality of the model for predicting the various outputs of interest; and it will be impossible to construct hypothesis tests that indicate whether discrepancies with the results of other models or with field results are due to natural variation or to real differences in the model(s). Therefore, face validity must be augmented with more rigorous forms of model validation.
In the last 15-20 years a number of statistical advances have been made that are relevant to the practice of model validation; we have not seen evidence of their use in the validation of constructive simulation models used for defense testing. We note five of the advances that should be considered:
Some of the above ideas may not turn out to be directly applicable to defense models, but the broad collection of techniques being developed to analyze non-military simulations is likely to be relevant. Given the importance of operational testing, testing personnel should be familiar with this literature to determine its value in the validation of constructive simulations.
4. As noted above, a sensitivity analysis is the study of the impact on model outputs of changes in model inputs and assumptions. An uncertainty analysis is the attempt to measure the total variation in model outputs due to quantified uncertainty in model inputs and assumptions and to assess which inputs contribute more or less to the total uncertainty.
In addition to model validation, a careful analysis of the assumptions used in developing constructive simulation models is a necessary condition for determining the value of the simulation. Beyond the usual documentation, which for complicated models can be fairly extensive, an "executive summary" of key assumptions used in the simulation model should be provided to experts to help them determine the reasonableness of those assumptions (and therefore the utility of the simulation). A full history of model development, especially any modification of model parameters and their justification, should also be made available to those with the responsibility for accrediting a model for use in operational testing.

"Model-test-model" is the use of (constructive) simulation models in conjunction with operational test. In model-test-model, a model is developed, a number of operational test runs are carried out, and the model is modified by adjusting parameters so that it is more in agreement with the operational test results. Such external validation on the basis of operational use is extremely important in informing simulation models used to augment operational testing. However, there is an important difference (one we suspect is not always well understood by the test community) between comparing simulation outputs with test results and using test results to adjust a simulation. Many complex simulations involve a large number of "free" parameters—those that can be set to different values by the analyst running the simulation. In model-test-model some of these parameters can be adjusted to improve the correspondence of simulation outputs with the particular operational test results with which they are being compared. When the number of free parameters is large in relation to the amount of available operational test data, close correspondence between a "tuned" simulation and operational results does not necessarily imply that the simulation would be a good predictor in any scenarios differing from those used to tune it. A large literature is devoted to this problem, known as overfitting. 5
An alternative with real advantages would be "model-test-model-test," in which the final test step, using scenarios outside of the "fitted" ones, would provide validation of the version of the model produced after tuning and would therefore be a guard against overfitting. If there were interest in the model being finalized before any operational testing was performed, this would be an additional reason for developmental testing to incorporate various operationally realistic aspects.
5. Overfitting is said to occur for a model and data set combination when a simple version of the model (selected from a model hierarchy, formed by setting some parameters to fixed values) is superior in predictive performance to a more complicated version of the model formed by estimating these parameters from the data set. For some types of statistical models, there are commonly accepted measures of the degree of overfitting. An example is the Cp statistic for multiple regression models: a model with high Cp could be defined as being overfit.
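A minimal Python sketch of the "model-test-model-test" safeguard follows; the placeholder simulation, its two free parameters, and the field events are all hypothetical.

    # Sketch of the "model-test-model-test" safeguard against overfitting. The
    # placeholder "simulation," its two free parameters, and the field events are
    # all hypothetical.
    import random

    def simulation(severity, a, b):
        # Placeholder constructive simulation with free parameters a and b.
        return a - b * severity

    random.seed(7)
    # Field events: (scenario severity, observed measure of performance). The true
    # relationship is nonlinear, so a tuned linear simulation extrapolates poorly.
    events = [(s, 0.9 - 0.2 * s - 0.6 * s ** 2 + random.gauss(0, 0.02))
              for s in (0.1, 0.2, 0.3, 0.8, 0.9)]
    tuning, holdout = events[:3], events[3:]  # tune on the benign scenarios only

    # Crude tuning: grid search for (a, b) minimizing squared error on tuning events.
    grid = [(a / 20, b / 20) for a in range(21) for b in range(21)]
    best = min(grid, key=lambda p: sum((simulation(s, *p) - y) ** 2 for s, y in tuning))

    def rmse(data):
        return (sum((simulation(s, *best) - y) ** 2 for s, y in data) / len(data)) ** 0.5

    print(f"tuning RMSE {rmse(tuning):.3f}, holdout RMSE {rmse(holdout):.3f}")
    # Close agreement on the tuning events says little about the more stressful
    # held-out scenarios, which is why a final validation test step is needed.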
Recommendation 9.1: Parameters from modeling and simulation should not be used to fit a simulation to a small number of field events without subsequent validation of the resulting simulation.
The panel reviewed several documents that describe the process used to decide whether to use a simulation model to augment an operational test. There are differences across the services, but the general approach is referred to as verification, validation, and accreditation. Verification is "the process of determining that model implementation accurately represents the developer's conceptual description and specifications" (U.S. Department of Defense, 1994a). (For constructive simulations, verification means that the computer code is a proper representation of what the software developer intended; the related software testing issues are discussed in Chapter 8.) Validation is "the process of determining (a) the manner and degree to which a model is an accurate representation of the real-world from the perspective of the intended uses of the model, and (b) the confidence that should be placed on this assessment" (U.S. Department of Defense, 1994a). Accreditation is "the official certification that a model or simulation is acceptable for use for a specific purpose" (U.S. Department of Defense, 1994a). The panel supports the general goals of verification, validation, and accreditation: the emphasis on verification and validation, and the need for formal approval, that is, accreditation, of a simulation model for use in operational testing.
Given the crucial importance of model validation in deciding the utility of a simulation for use in operational test, it is surprising that the constituent parts of a comprehensive validation are not provided in the directives concerning verification, validation, and accreditation. A statistical perspective is almost entirely absent in these directives. For example, there is no discussion of what it means to demonstrate that the output from a simulation is "close" to results from an operational test. It is not clear what guidelines model developers or testers use to decide how to validate their simulations for this purpose and how accrediters decide that a validation is sufficiently complete and that the results support use of the simulation. Model validation cannot be algorithmically described, which may be one reason for the lack of specific instruction in the directives. A test manager would greatly benefit from examples, from advice on what has worked in the past and what pitfalls to avoid, and, most importantly, from specific requirements as to what constitutes a comprehensive validation.
This situation is similar to that described in Chapter 1 regarding the statistical training of those in charge of test planning and evaluation. Model validation has an extensive, often quite technical, literature in a variety of disciplines, including statistics and operations research, on how to demonstrate that a computer model is an acceptable representation of the system of interest for a specific application. Operational test managers need to become familiar with the general techniques represented in this literature and have access to experts as needed. 6

6. There are tutorials provided at conferences and other settings, and excellent reports in the DoD community (e.g., Wiesenhahn and Dighton, 1993), but they are not sufficient since they do not reflect recent statistical advances.
We suggest, then, a set of four activities that can jointly form a comprehensive process of validation: (1) justification of model form, (2) an external validation, (3) an uncertainty analysis including the contribution from model misspecification or alternative specifications, and (4) a thorough sensitivity analysis.
Recommendation 9.2: Validation for modeling and simulation, used to assist in the design and evaluation of operational testing of defense systems, should include: (1) a complete description of important assumptions, (2) external validation, (3) uncertainty analysis, and (4) sensitivity analysis. A description of any methods used to reduce the number of inputs under analysis should be included in each of the steps. Models and simulations used for operational testing and evaluation must be archived and fully documented, including the objective of the use of the simulation and the results of the validation.
The purpose of a simulation is a crucial factor in validation. For some purposes, the simulation only needs to be weakly predictive, such as being able to rank scenarios by their stress on a system, rather than to predict actual performance. For other purposes, a simulation needs to be strongly predictive. Experience should help indicate, over time, which purposes require what degree and what type of predictive accuracy.
Models and simulations are often written in a general form so that they will have wide applicability for a variety of related systems. An example is a missile fly-out model, which might be used for a variety of missile systems. A model that has been used previously is often referred to as a legacy model. In an effort to reduce the costs of simulation, legacy models are sometimes used to represent new systems, based on a complete validation for a similar system. Although it avoids the costly development of a de novo simulation, this use of a legacy model presents validation challenges. In particular, new systems by definition have new features. Thus, a legacy model should not be used for a new application unless a strong argument can be made about the similarity of the applications and an external validation with the new system is conducted. A working presumption should be that the simulation will not be useful for the new application unless proven otherwise.
Modeling and simulation may make their greatest contribution to operational testing through improving operational test design. Modeling and simulation were used, for example, to help plan the operational test for the Longbow Apache (see Appendix B). Constructive simulation models can play at least four key roles.
First, simulation models that properly incorporate both the estimated heterogeneity of system performance as a function of various characteristics of test scenarios and the size of the remaining unexplained component of the variability of system performance can be used to help determine the error probabilities of any significance tests used in assessing system effectiveness or suitability. To do this, simulated relationships (based on the various hypotheses of interest) between measures of performance and environmental and other scenario characteristics can be programmed, along with a description of the number and characteristics of the test scenarios, and the results tabulated as in an operational test. Such replications can be repeated, keeping track of the percentage of tests that the system passed. This approach, sketched below in simplified form, could be a valuable tool in computing error probabilities or operating characteristics for non-standard significance tests.
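The following is a deliberately simplified Python sketch of this idea, ignoring scenario covariates; the requirement, number of trials, and pass rule are hypothetical.

    # Simplified sketch of using simulation to estimate the operating
    # characteristics of a planned operational test (scenario covariates are
    # ignored here). The requirement, number of trials, and pass rule are
    # hypothetical.
    import random

    REQUIREMENT = 0.70   # required probability of mission success (assumed)
    N_TRIALS = 20        # planned number of operational test trials (assumed)
    PASS_RULE = 16       # pass if at least this many trials succeed (assumed)

    def pass_probability(true_p, n_reps=20_000):
        passes = 0
        for _ in range(n_reps):
            successes = sum(random.random() < true_p for _ in range(N_TRIALS))
            passes += successes >= PASS_RULE
        return passes / n_reps

    random.seed(3)
    for true_p in (0.60, 0.70, 0.80, 0.90):
        print(f"true success probability {true_p:.2f}: "
              f"chance of passing {pass_probability(true_p):.3f}")
    # Tabulating pass rates across hypothesized performance levels gives the
    # producer's and consumer's risks of the planned test.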
Second, simulation models can help select scenarios for testing. Simulation models can assist in understanding which factors need controlling and which can be safely ignored in deciding which scenarios to choose for testing, and they can help to identify appropriate levels of factors. They can also be used to choose scenarios that would maximally discriminate between a new system and a baseline system. This use requires a simulation model for the baseline system, which presumably would have been archived. For tests for which the objective is to determine system performance in the most stressful scenario(s), a simulation model can help select the most stressful scenario(s). In addition, assuming that information is to be collected from scenarios other than the most stressful ones, the simulation model's ranking of the scenarios with respect to performance can be compared with the ranking observed in the operational test, as sketched below, providing feedback into the model-building process that helps to validate the model and to discover areas in which it is deficient.
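A minimal Python sketch of that comparison, using a rank correlation, follows; the scenario names and scores are hypothetical.

    # Sketch of the feedback check described above: compare the simulation's
    # ranking of scenarios with the ranking observed in the operational test.
    # Scenario names and scores are hypothetical.
    sim_score = {"day/dry": 0.85, "day/rain": 0.75, "night/dry": 0.70, "night/rain": 0.40}
    test_score = {"day/dry": 0.80, "day/rain": 0.60, "night/dry": 0.72, "night/rain": 0.35}

    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {name: i for i, name in enumerate(ordered)}

    r_sim, r_test = ranks(sim_score), ranks(test_score)
    n = len(sim_score)
    # Spearman rank correlation (no ties in this toy example).
    d2 = sum((r_sim[s] - r_test[s]) ** 2 for s in sim_score)
    spearman = 1 - 6 * d2 / (n * (n ** 2 - 1))
    print(f"rank agreement (Spearman): {spearman:.2f}")  # 0.80
    # Poor agreement flags scenarios whose stress on the system the simulation mis-orders.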
Third, there may be an advantage in using simulation models as a living repository of information collected about a system's operational performance. This repository could be used for test planning and also to chart progress during development, since each important measure of performance or effectiveness would have a target value from the Operational Requirements Document, along with the values estimated at any given time, using either early operational assessments or, for requirements that do not have a strong operational aspect, the results from developmental testing. The Air Force Operational Test and Evaluation Center is in the process of testing this concept for the B-1B defensive system upgrade.
Fourth, every instance in which a simulation model is used to design an operational test, and the test is then carried out, presents an opportunity for model validation. The assumptions used in the simulation model can then be checked against test experience. Such an analysis will improve the simulation model under question, a necessary step if the simulation model is to be used in further operational tests or to assess the performance of the system as a baseline when the next innovation is introduced. Feedback of this type will also help provide general experience to model developers as to which approaches work and which do not. (Of course, this kind of feedback will not be possible without the data archive recommended in Chapter 3. As also mentioned in Chapters 3, 6, and 8, the inclusion of field use data in such an archive provides great opportunities for validation of the methods used in operational test design.)
Recommendation 9.3: Test agencies in the military services should increase their use of modeling and simulation to help plan operational tests. The results of such tests, in turn, should be used to calibrate and validate all relevant models and simulations.
Recommendation 9.4: Simulation should be used throughout system development as a repository of accumulated information about past and current performance of a system under development to track the degree of satisfaction of various requirements. The repository would include use of data from all relevant sources of information, including experience with similar systems, developmental testing, early operational assessments, operational testing, training exercises, and field use.
A final note is that validation for test design, although necessary, does not need to be as comprehensive as validation for simulation that is to be used for augmenting operational test evaluation. One can design an effective test for a system without understanding precisely how a system behaves. For example, simulation can be used to identify the most stressful environment without knowing what the precise impact of that environment will be on system performance.
The use of modeling and simulation to assist in the operational evaluation of defense systems is relatively contentious. On one side, modeling and simulation is used in this way in industrial (e.g., automobile) applications. Simulation can save money, is safer, does not have the environmental problems of operational testing, is not constrained in defense applications by the availability of enemy systems, and is always feasible in some form. On the other side, information obtained from modeling and simulation may at times be limited in comparison with that from operational testing. Its exclusive use may lead to unreliable or ineffective systems passing into full-rate production before major defects are discovered.
An important example of a system for which the estimated levels for measures of effectiveness changed due to the type of simulation used is the M1A2 tank. In a briefing for then-Secretary of Defense William Perry (see Wright, 1993), detailing work performed by the Army Operational Test and Evaluation Command, three simulation environments were compared: constructive simulation, virtual simulation, and live simulation (essentially, an operational test). The purpose was to "respond to Joint Staff request to explore the utility of the Virtual Simulation Environment in defining and understanding requirements." In the test, the constructive model indicated that the M1A2 was better than the M1A1. The virtual simulation indicated that the M1A2 was not better, which was confirmed by the field test. (The problems with the M1A2 had to do, in part, with immature software.) The specific limitations of the constructive simulation were that the various assumptions underlying the engagements resulted in the M1A2 detecting and killing more targets. Even though the overall results agreed with the field test, the virtual simulation was found to have problems as well. The primary problem was the lack of fidelity of the simulated terrain, which resulted in units not being able to use the terrain to mask movements or to emulate having dug-in defensive positions. In addition, insufficient uncertainty was represented in the scenarios. In this section we discuss some issues concerning how to use validated simulations to supplement operational test evaluation.
(The use of statistical models to assist in operational evaluation—possibly in conjunction with the use of simulation models—is touched on in Chapter 6. An area with great promise is the use of a small number of field events, modeling and simulation, and statistical modeling, to jointly evaluate a defense system under development. Unfortunately, the appropriate combination of the first two information sources with statistical modeling is extremely specific to the situation. It is, therefore, difficult to make a general statement about such an approach, except to note that it is clearly the direction of the future, and research should be conducted to help understand the techniques that work.)
Modeling and simulation have been suggested as ways of extrapolating or interpolating to untested situations or scenarios. There are two general types of interpolation or extrapolation that modeling and simulation might be used to support. First, in horizontal extrapolation, the operational performance of a defense system is first estimated at several scenarios—combinations of weather, day or night, tactic, terrain, etc. Simulation is then used to predict performance of the system in untested scenarios. The extent to which the untested scenarios are related to the tested scenarios typically determines the degree to which the simulation can predict performance. This extrapolation implies that the tested scenarios need to be selected (to the extent possible) so that the modeled scenarios of interest have characteristics in common with the tested scenarios (see discussion in Chapter 5 on Dubin's challenge). One way to ensure this commonality is to use factor levels to define the modeled scenarios that are less extreme than those used in the tested scenarios. In other words, extrapolation to an entirely different sort of environment would be risky, as would extrapolation to a similar environment, but at a more extreme level. For example, if a system was tested in a cold (32°F) and rainy (0.25″ per hour) environment and in a hot (85°F) and dry (0.00″ per hour) environment, there would be some reason to hope, possibly using statistical modeling, for a simulation to provide information about the performance of the system in a moderately hot (65°F) and somewhat rainy (.10″ per hour) environment. (The closer the untested environment is to the tested one, the closer one is to interpolation than extrapolation.) However, if no tested environments included any rain, it would be risky to use a simulation to extrapolate to rainy conditions based on the system performance in dry conditions. (Accelerated life testing, discussed in Chapter 7, is one way to extrapolate with respect to level.)
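A minimal Python sketch of a factor-range check that distinguishes interpolation from extrapolation follows; the tested scenario values mirror the example above, and the candidate scenarios are hypothetical. (Checking each factor's range separately is a simplification; a stricter check would use the joint region spanned by the tested scenarios.)

    # Sketch of a simple interpolation-versus-extrapolation check for proposed
    # untested scenarios, using the factor ranges spanned by the tested scenarios
    # in the example above. The candidate scenarios are hypothetical.
    tested = [
        {"temp_F": 32, "rain_in_per_hr": 0.25},   # cold and rainy
        {"temp_F": 85, "rain_in_per_hr": 0.00},   # hot and dry
    ]

    def within_tested_range(scenario):
        """True if every factor lies inside the range spanned by the tested scenarios."""
        return all(
            min(t[factor] for t in tested) <= value <= max(t[factor] for t in tested)
            for factor, value in scenario.items()
        )

    candidates = {
        "moderately hot, light rain": {"temp_F": 65, "rain_in_per_hr": 0.10},
        "very hot, dry": {"temp_F": 110, "rain_in_per_hr": 0.00},
    }
    for name, scenario in candidates.items():
        kind = "interpolation" if within_tested_range(scenario) else "extrapolation (riskier)"
        print(f"{name}: {kind}")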
Second, vertical extrapolation is either from the performance of a single system (against a single system) to the performance of multiple systems in a team, against a team of enemy systems; or from the performance of individual subsystem components to performance of the full system. The first type of vertical extrapolation involves an empirical question: whether the operational performance estimated for a single system can be used in a simulation to provide information about multiple system engagements. Experiments should be carried out in situations in which one can test the multiple system engagement to see whether this type of extrapolation is warranted. This kind of extrapolation should often be successful, and given the safety, cost, and environmental issues raised by multisystem engagements, it is often necessary. The second type of vertical extrapolation depends on whether information about the performance of components is sufficient for understanding performance of the full system. There are systems for which a good deal of operational understanding can be gained by testing portions of the system, for example, by using hardware-in-the-loop simulations. This is, again, an empirical question, and tests can be carried out to help identify when this type of extrapolation is warranted. This question is one for experts in the system under test rather than a statistical question.
The ATACMS/BAT calibration described in Appendix B represents both horizontal and vertical extrapolation. First, extrapolation is made to different types of weather, terrain, and tactics. Second, the extrapolation is made from several tanks to a larger number of tanks. The first extrapolation requires more justification than the second. In such situations, it might be helpful to keep the degree of true extrapolation to a minimum through choice of the test scenarios.
A third possible type of extrapolation is time extrapolation; the best example is reliability growth testing (see Chapter 7).
In conferences devoted to modeling and simulation for testing of military systems, most of the presentations have been concerned with potential and future uses of simulations for operational test evaluation. We have found few examples of constructive simulations that have been clearly useful for identifying operational deficiencies of a defense system. This lack may be due to the limitations of modeling and simulation for this purpose, to its lack of application, or to the lack of feedback to demonstrate the utility of such a simulation. To make a strong case for the increased use of modeling and simulation in operational testing and evaluation, examples of simulation models that have successfully identified operational deficiencies that were missed in developmental test need to be collected, and the simulations analyzed to understand the reasons for their success.
We are reluctant to make general pronouncements about which types of simulations would be effective or ineffective for operational assessment of a military system. Everything else being equal, the order of preference, from most preferred to least preferred, should be live, virtual, and then constructive simulation. For constructive simulations and the software aspects of virtual simulation, the more "physics-based" the better: the actions of the system (and any enemy systems) should be based on well-understood and well-validated physical representations of the process. For just this reason, the use of computer-aided design and computer-aided manufacturing (CAD/CAM) representations of a system is clearly worth further exploration for modeling and simulation in defense testing. In all cases, however, comprehensive model validation and feedback from real experience will indicate which approaches work and which do not in given situations.
While DoD directives are complete in their discussion of the need for validation and documentation, it is not clear that testers understand how to implement the directives. Not much seems to have changed since the following was noted over 10 years ago (U.S. General Accounting Office, 1987a:46, 61):
In general, the efforts to validate simulation results by direct comparison to data on weapon effectiveness derived by other means were weak, and it would require substantial work to increase their credibility. Credibility would also have been helped by . . . establishing that the simulation results were statistically representative. Probably the strongest contribution to credibility came from efforts to test the parameters of models and to run the models with alternative scenarios.
The office of the secretary of the Department of Defense has issued no formal guidance specifically for the management of simulations or how to conduct them and assess their credibility. Although several directives and at least one military standard have some bearing on simulations, we found no documented evidence that the secretary's office has sought to develop and implement appropriate quality controls that could be expected to directly improve the credibility of simulations.
A similar, more recent, observation was made by Giadrosich (1990):
The problem of how to keep track of the many assumptions, input data sources, multiple configurations, and so forth associated with digital simulation models, and how to defend the resulting validity and accreditation of the models has not been adequately addressed. These issues are further complicated by the fact that a widely accepted formal process for model validation and accreditation does not exist. The lack of transparency in most complex modeling analyses, the inability to get agreement on the basic assumptions, and the ability of these assumptions to drive the analysis results usually lead to poor model credibility.
And independent contractors have observed (Computer Sciences Corporation, 1994:ii,3,4): 7
The major disadvantages noted are: 1) potential inconsistencies in approach implementation; 2) potential biases of the analysts coupled with resource limitations that might cause incomplete data gathering and/or improper comparisons; and 3) a possible lack of uniformity and experience in establishing acceptance criteria for each specific application.

7. Some of these comments refer to more general use of modeling and simulation and not specifically to modeling and simulation for operational test augmentation.
Other interviewees pointed out that there is seldom enough time to carry out a formal model accreditation for a particular application since the study results are often required in the matter of a few weeks. . . . Occasionally, the lack of time forces analysts to use models that are not really suited to the study but are either the only ones available or the only ones with which the study analysts are familiar.
Several V&V [validation and verification] techniques were mentioned as being valuable in supporting accreditation. Although almost all interviewees recognized the desirability of performing extensive verification code checks and in depth comparisons between model results and real world data, funding and time constraints frequently preclude such extensive V&V. Instead, many model users turn to other less costly methods that are also less beneficial. Such methods typically include reviews of past usage and VV&A [validation, verification, and accreditation] results, face validation, and some comparisons between models.
Many organizations, faced with a lack of time and resources to perform in depth V&V, have relied on prior V&V, face validation, and benchmarking as the basis for informal model accreditation. Among those who use these methods are the Army Aviation and Troop Command (ATCOM), the Air Force Operational Test and Evaluation Center (AFOTEC), AMSAA [U.S. Army Materiel Systems Analysis Activity], OPTEC [U.S. Army Operational Test and Evaluation Command], ASC [Aeronautical Systems Command], and NAVAIR [Naval Air Systems Command]. A common view is that past usage, coupled with reviews by subject matter experts (SMEs) or comparisons between models, are sufficient to justify model selection and use.
All of these comments suggest an operational testing and evaluation community that needs guidance. A number of organizational improvements might remedy the cited problems. The keys to the following suggestions are independence, expertise, and resources: validation must be carried out by those without a vested interest in the outcome of the validation process, with relevant expertise, and with sufficient resources committed to ensure a comprehensive validation. Unless these criteria are met, simulations should not be used in operational test design and evaluation.
Validation needs to be carried out by individuals with expertise. This means in-depth knowledge of the system being modeled, as well as facility with the technical, computer, and statistical tools needed to assess the model. The validation also needs to be carried out by individuals who do not have a preference as to whether the model receives accreditation. System developers obviously have in-depth knowledge, and they usually have modeling (and possibly model validation) expertise, but they have a strong preference that the model be accredited. The operational tester has no preference with respect to accreditation, except insofar as it permits information to be gathered when field testing is too difficult, but the tester may have limited modeling or model validation skills. Competitors to the system developers—those that bid on the same contract—will have the expertise, but they may be biased against accreditation. Therefore, it may be difficult to satisfy the joint desires of expertise and independence, but the closer DoD can come to this, the better. The Navy Operational Test and Evaluation Force generally uses the Center for Naval Analysis to perform much of its model validation, which seems an excellent combination of independence and expertise. The other services might investigate whether other federally financed research and development centers, or possibly teams of academics, might be used for this purpose.
As for resources, a substantial fraction of the money awarded for production of a simulation model should be earmarked for validation. The precise amount should be related to such considerations as the expense of external validation and the complexity of the simulation model. These funds should be protected from "raiding" by the developers of the simulation model if shortfalls in the simulation development budget occur.
As a final check on the comprehensiveness of validation and accreditation, the Office of the Director, Operational Test and Evaluation (DOT&E) should have the final say as to whether a particular simulation model can be used in conjunction with an operational test. The service test agency should provide all documentation concerning validation activities to DOT&E, along with the plan of how the simulation is to be used, in advance of its use. DOT&E would then examine the materials to determine whether the service test agency should be allowed to go forward with its plan to use the simulation model in operational test design or evaluation. Alternatively, each service could have its own modeling and simulation validation center, with a major function of the overall center being to oversee and support the individual service centers.
Finally, there is a need to develop a center of expertise in modeling and simulation for test design and evaluation in DoD. Such a unit could provide clear descriptions of what constitutes a comprehensive validation; examples of successful and unsuccessful experience in using simulation models for various purposes; and expertise on the statistical issues that arise in the use of simulation for operational test design and evaluation. Either the recently established Defense Modeling and Simulation Office should add these responsibilities to its mission, or a new center for simulation; validation, verification, and accreditation; and testing should be established.
Recommendation 9.5: A center on use of modeling and simulation for testing and acquisition should be established, with a primary focus on validation, verification, and accreditation. This center could be included in the charter of the Defense Modeling and Simulation Office or established by another relevant organization in the defense community.
The notion of a casebook of examples is especially appealing. It could help educate test managers about which approaches to simulating the operational performance of a defense system work and which do not, based on field use data; how these simulations were designed; and what statistical issues arise in relating information from the simulation to field testing.
Recommendation 9.6: Modeling and simulation successes and failures for use in operational test design and evaluation should be collected in a casebook so that information on the methods, benefits, risks, and limitations of modeling and simulation for operational test can be developed over time.
The effectiveness of modeling and simulation for operational testing and evaluation is idiosyncratic. Some methods that work in one setting might not work in another. Indeed, it is unclear whether a simulation model could be declared to be working well when there is in fact limited information on operational performance prior to operational testing. Experiments need to be run for systems for which operational testing is relatively inexpensive, with simulation models (which in this case would be redundant) developed to see whether they agree with the field experience. Problems that are identified in this way can help analysts understand the use of models and simulations in situations in which they do not have an opportunity to collect full information on operational performance. Even in situations when operational testing is limited, it will be important to reserve some test resources to evaluate any extrapolations. For many types of systems, the state of the art for modeling and simulation is lacking, and field testing must stand on its own.