@comment FROM: Z-Cite-Z_Citations4CBL.bib via BibSplit on Wed Oct 13 11:32:41 EDT 2004 @techreport{ 1998-TR4CBL-04-Brglez, author = "F. Brglez", title = "{Design of Experiments to Evaluate CAD Algorithms: Which Improvements Are Due to Improved Heuristic and Which Are Merely Due to Chance?}", institution = "{CBL, CS Dept., NCSU, Box 7550, Raleigh, NC 27695}", number = "1998-TR4CBL-04-Brglez", month = "April", year = "1998", note = {Also available at http://www.cbl.ncsu.edu/\-publications/}, abstract = "{ More than a thousand mathematical problems arising in engineering and science have been shown to be NP-hard. Problems of practical size that are NP-hard can only be solved by devising polynomial-time heuristics, with no guarantee whatsoever on the quality of the solution. Extensive experimentation and comparative analysis is required before we can decide on a `better heuristic'. Past efforts to do either have been, for the most part, ad hoc. While there is no shortage of published claims of `incremental improvements' with a particular heuristic, they are not supported by a test of hypothesis such as `Is the improvement due to improved heuristic used by the algorithm or due merely to chance?' This paper introduces motivation, context, and an approach to begin addressing and resolving such hypotheses using the design of experiment techniques rooted in the scientific method. We accept the methods of experimental design: {\em randomization}, {\em replication}, and {\em organization to reduce error}, first formalized by Fisher in the early 1920s, and adopted widely in many fields of science since. In the limited space available, we ask questions and propose solutions to the following, numbered in the order of sections in which we organized this paper: (2) how to demonstrate the intrinsic variability in performance metrics that are optimized by heuristic algorithms solving NP-hard problems? Can we formulate and put to test the null hypothesis $H_0$ for two algorithms: {\em There is no difference between the algorithms -- i.e. any observed differences of `optimized performance measures' are merely due to random fluctuations in sampling from the same population}? (3) how to abstract the basic workflows in the design and execution of experiments? (4) how to develop a number of well-defined equivalence classes of data as inputs for test of hypothesis with a variety of algorithms, ranging from physical design to logic synthesis and test? (5) how to summarize, on a single page, results of a comprehensive experiment, involving 8 data set sizes, 6 equivalence classes for each data set, and 100 instances of a problem in each equivalence class -- a total of 8*6*100 = 4800 data files processed by each of the two algorithms under evaluation? (6) how to relate this tutorial to the recent and on-going team effort on mutant classes, comprehensive experiments, and the infrastructure to support collaborative and peer-reviewed design, execution, and archival of CAD algorithm experiments on the World Wide Web? }", }