American Daredevils – Hart am Limit | Jazz et Blues | Leave a Comment

Generating profile-based signatures for online intrusion and failure detection

Generating profile-based signatures for online intrusion and failure detection

Information and Software Technology 56 (2014) 238–251 Contents lists available at ScienceDirect Information and Software Technology journal homepage...

776KB Sizes 0 Downloads 17 Views

Information and Software Technology 56 (2014) 238–251

Contents lists available at ScienceDirect

Information and Software Technology journal homepage: www.elsevier.com/locate/infsof

Generating profile-based signatures for online intrusion and failure detection Wes Masri a,⇑, Rawad Abou Assi a, Marwa El-Ghali b a b

Electrical and Computer Engineering Department, American University of Beirut, Lebanon Computer Science Department, American University of Beirut, Lebanon

a r t i c l e

i n f o

Article history: Received 8 October 2012 Received in revised form 22 September 2013 Accepted 23 September 2013 Available online 1 October 2013 Keywords: Application-based intrusion detection Profile-based signatures Vulnerability-based signatures Online failure detection Genetic algorithm Combinations of program elements

a b s t r a c t Context: Program execution profiles have been extensively and successfully used in several dynamic analysis fields such as software testing and fault localization. Objective: This paper presents a pattern-matching approach implemented as an application-based intrusion (and failure) detection system that operates on signatures generated from execution profiles. Such signatures are not descriptions of exploits, i.e. they do not depend on the syntax or semantics of the exploits, but instead are descriptions of program events that correlate with the exploitation of program vulnerabilities. Method: A vulnerability exploit is generally correlated with the execution of a combination of program elements, such as statements, branches, and definition–use pairs. In this work we first analyze the execution profiles of a vulnerable application in order to identify such suspicious combinations, define signatures that describe them, and consequently deploy these signatures within an intrusion detection system that performs online signature matching. Results: To evaluate our approach, which is also applicable to online failure detection, we implemented it for the Java platform and applied it onto seven open-source applications containing 30 vulnerabilities/ defects for the purpose of the online detection of attacks/ failures. Our results showed that our approach worked very well for 26 vulnerabilities/defects (86.67%) and the overhead imposed by the system is somewhat acceptable as it varied from 46% to 102%. The exhibited average rates of false negatives and false positives were 0.43% and 1.03%, respectively. Conclusion: Using profile-based signatures for online intrusion and failure detection was shown to be effective. Ó 2013 Elsevier B.V. All rights reserved.

1. Introduction Intrusion detection systems, or IDSs, are categorized into two basic design approaches: anomaly-based and signature-based. The first operates by collecting information on normal or safe behavior and identifies attacks that vary from this expected set, while the second examines incoming executions for patterns of attacks that match its collection of attack signatures. While anomalybased intrusion detection mechanisms can detect previously unencountered attacks, they might suffer from a high rate of false positives. On the other hand, signature-based approaches, although potentially exhibiting less false positives, are unable to detect attacks not in their collection of attack signatures so can result in a higher rate of false negatives [17,59]. The focus of our

⇑ Corresponding author. Tel.: +961 1 350000x3490; fax: +961 1 744462. E-mail addresses: [email protected] (W. Masri), [email protected] (R. Abou Assi), [email protected] (M. El-Ghali). 0950-5849/$ - see front matter Ó 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.infsof.2013.09.004

work is to investigate a signature-based technique to applicationbased intrusion detection. Intrusions or attacks, particularly in the context of an application, make use of a vulnerability within the application in order to induce unsafe behavior; the input used to take advantage of this vulnerability is referred to as an exploit. Consequently, signatures can be partitioned into exploit-based and vulnerability-based signatures; the former are derived from the inputs inducing the intrusion, whereas the latter are constructed using the vulnerability itself. Relying solely on exploit-based signatures in an IDS allows it to become viable to polymorphic attacks – attacks that differ syntactically but semantically trigger the same vulnerability [6]. As such, we opt to devise an intrusion detection approach that leverages signatures that are in part vulnerability-based, specifically, these signatures do not characterize the vulnerabilities themselves, but instead they characterize program events that correlate with (and not necessarily cause) the exploitation of program vulnerabilities.

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

The research into vulnerability-based signatures has been recent and relying significantly on static program analysis and logical conditions [6,7]. In our work, we make no assumptions about requiring an application’s source code for analysis or identifying where the software vulnerabilities reside and how they might be exploited. Instead, we rely on program execution profiling information to define attack signatures, which we term profile-based signatures. In the context of our work, profiling is basically the process of collecting and recording runtime events, where the events include execution of program elements such as statements, branches and definition–use pairs. (Hereafter, program element execution and event will be used interchangeably.) Since the monitoring and collection of such events is easily automated, and they are considered as clear indicators of an application’s execution pattern, it is our goal to investigate the validity of drawing on these events to generate profile-based signatures. This is inspired by the idea that exploits induced by a certain vulnerability are likely to exhibit similar traits or patterns of execution, so by mining out the collection of events gathered during the occurrence of attacks, we would be able to characterize the events or program conditions that correlate with vulnerabilities, and hence devise corresponding signatures. As a result, these signatures can be deployed within an intrusion detection system to capture future exploits, by checking if any incoming executions induce the events they describe. Note that since the identified profile-based signatures correlate with a given exploit, there exists a potential of using them to understand, locate, and fix the associated vulnerability [20,34,1,51], but this is not the concern of this paper. Instead, our concern is to prevent a system from being attacked in cases when one is aware that attacks due to a certain vulnerability occurred, and the task of understanding, locating, and fixing that vulnerability is underway. It is not uncommon that such task might take a very long time or might be even abandoned altogether due to the risk of modifying the code (see Section 2.2). It is not often the case that an attack is induced by the execution of a single program element (e.g., the execution of a single statement or a single branch), but by the execution of a combination of such elements. Consider, for example, the following code fragment: s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12

x = 2; y = 2; if (. . .) x += 10; if (. . .) y = y ⁄ 2; y = y + 2⁄x; z = x ⁄ y; if (z >= 0)

239

and necessary condition for the occurrence of the intrusion, and thus any of them can be utilized as a signature in the detection system. The main contributions of this work are as follows: (1) A new intrusion detection approach that relies on matching profile-based suspicious signatures comprising combinations of program elements. This pattern matching approach does not assume any knowledge of the vulnerabilities (e.g. how they enable their respective exploits, their locations, or how to fix them), instead, it requires a training test set comprising both normal and unsafe tests. (2) A genetic algorithm that generates suspicious signatures. (3) Tool support for the Java platform. Therefore, our implementation demonstrates the effectiveness of the approach in detecting attacks not involving memory corruption. While noting that support for other platforms, such as binary or C, are expected to successfully demonstrate the approach’s applicability to this type of attacks. (4) Note that the proposed IDS approach/tool is also applicable as is to online failure detection. (5) An empirical study involving open source applications where some have a security focus and others do not. The remainder of this paper is organized as follows. Section 2 presents our proposed solution and assumptions. We begin explaining the phases of our implementation in Section 3 with execution profiling, followed by signature generation in Section 4, and signature matching in Section 5. The empirical evaluation and analysis of obtained results is given in Section 6. In Section 7 we comparatively discuss related work, and we conclude and comment on future work in Section 8. Finally, the online appendices [4] A through E, provide pseudocode for our algorithms that further describe our implementation and subject programs, and present the full data set of our results.

2. Proposed solution and assumptions 2.1. Solution description

// should be if (z > 0) access granted

else access denied

The intrusion occurs only when z is equal to 0 at s9, which is the site of the ‘‘vulnerability’’ in this case, allowing access when it should be denied. This action can only take place if the definition–use pairs DUP(s1,s7), DUP(s6,s7) and branches Branch(s3,s5), Branch(s5,s6) execute in a program run. Any single element of these on its own is not enough to induce the erroneous behavior, but a certain combination of them can allow for the exploitation of the vulnerability, and thereby an intrusion. For instance, detecting the execution of the branch Branch(s3,s5) and the def–use pair DUP(s6,s7), or alternatively the def–use pair DUP(s1,s7) and the branch Branch(s5,s6), should alert to an attack. Note that any one of these combinations is a sufficient

Our work tackles the problem of devising an IDS that leverages signatures characterizing the events that correlate with exploiting program vulnerabilities. These profile-based signatures are in the form of combinations of program elements comprising multiple types. The main phases of our proposed solution are elaborated on in the subsequent sections. In summary, the following tasks are applied on the vulnerable subject application: Task1. Creating a training set that adequately characterizes the usage pattern of the application and includes both normal tests and attack tests. The training set is formed by augmenting the existing application’s test suite by the observed attacks. Task2. Executing the training set and collecting the execution profiles. As described in Section 3, the collected execution profiles contain information about the execution of program elements of certain types. Task3. Generating the profile-based signatures that mostly correlate with the attacks. Since a brute-force enumeration of all possible combinations is not a feasible solution, this task uses a genetic algorithm (see Section 4) to identify the combinations of program elements that correlate with the attacks that occurred within the training set. Note that signature generation is carried out for each vulnerability individually; i.e., given an application with multiple vulnerabilities, multiple

240

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

independent signatures are generated, which are merged at the end to be passed to the signature matching system. The benefits of this design decision are described in Section 5. Task4. Incorporating the generated signature(s) into the intrusion detection system. The IDS operates as an online signature matching mechanism to detect future attacks or failures. The details of this task are provided in Section 5. 2.2. Assumptions Our proposed solution necessitates the following prime assumptions: Assumption 1. The period between when a vulnerability is discovered and the time a patch is applied might be considerable in length. In fact, the practical value of our approach hinges on this assumption, which is pertinent in the following cases:

a. A vulnerability was discovered, but the process of understanding, locating, and fixing it is taking too long due to its complexity [33,55]; e.g., if the vulnerability involves concurrency or sizable missing code, or requires major modifications of the code base. b. The application provider is simply unable to locate and/or fix the vulnerability [33,55]. c. The vulnerability was deemed too risky to fix as a precautionary measure against regression bugs [18], e.g., impact analysis [3] revealed that the required changes will affect code that is large in size or is too critical. d. The application was discontinued or seized to be supported or maintained, which is not an uncommon problem, especially in the context of open source and free software. Note that discontinued software is termed abandonware [46] in case the provider did not issue an official notice of discontinuance. e. The client is not willing to take the risk of applying a patch or upgrading to a newer version of the application [18]. f. The client is not willing to pay for a newer version of the application g. In addition, arguments for using signature-based antivirus software are generally in support of this assumption. Assumption 2. The training set adequately characterizes the usage pattern of the target application in terms of both normal and malicious behavior. As mentioned earlier, the training set is formed by combining the existing application’s test suite with the observed attacks. A test suite that adequately characterizes the normal usage pattern of an application is likely to be available in practice due to the fact that when designing test cases (i.e., building validation test suites), developers and testers first try to leverage use case scenarios that, by definition, are meant to capture normal program behavior. And in regard to characterizing the malicious usage pattern of the target application, only a few attacks are required by our proposed approach; noting that a single attack could also be used but with an increased risk of false alarms. Finally, requiring that one (or more) attack be known is not unreasonable given that such requirement is shared by most signaturebased approaches Assumption 3. There exists at least one combination of program elements that highly correlates with exploits of any given vulnerability. In other words, all (or most) exploits of a given vulnerability, will exclusively induce the same execution pattern (or subpattern). The implications of this assumption are discussed in Section 6.5

Assumption 4. Given that such combination exists, the (nondeterministic) genetic algorithm will ultimately identify it (also see Section 6.5). Assumption 5. The application being monitored is stable and will not be altered by adding, removing or modifying code. Since the used signatures are generated as elements from within the application itself, changing it would render the signatures ineffective. This is a reasonable assumption as the intrusion detection system is designed for online deployment, at which point the application is at a stable stage. Finally, from a program analysis perspective, a vulnerability parallels a defect and an attack parallels a failure.1 Therefore, the application of our approach should not be only restricted to the security context; especially that the cases listed under Assumption1 are very relevant to the general case. Also, Section 7.1 discusses the work of Lorenzoli et al. [30] which relates to our work but has no security focus. 3. Execution profiling Execution profiling is the process of recording information about the program elements that executed during a test run. In this work we are simply interested in recording whether a certain element executed or not within a given run, e.g., we do not leverage frequency or sequence information of events. Each test in our training set, whether it represents an attack or a normal behavior, will have a profile generated for it containing information about the occurrence of the following program elements: (1) Method calls (MC): For every method M that is executed in at least one test, an MC profile entry indicates whether M is called in the current test. (2) Method call pairs (MCP): For every combination of methods M1 and M2 such that M1 calls M2 in at least one test, an MCP profile entry indicates whether M1 calls M2 in the current test. (3) Basic blocks (BB): For every basic block B such that B is executed in at least one test, a BB profile entry indicates whether B is executed in the current test. (4) Basic-block edges or branches (BBE): For every pair of basic blocks B1 and B2 such that there is a branch from B1 to B2 in at least one test, a BBE profile entry indicates whether this branch is taken in the current test. (5) Def–use pairs (DUP): For every pair consisting of a variable definition D(x) and a use U(x) such that D(x) dynamically reaches U(x) in at least one test, a DUP profile entry indicates whether D(x) dynamically reaches U(x) in the current test. In most cases, a combination of these elements should be enough to characterize the behavior of the profiled application and should allow some form of distinction between safe and malicious program runs. It is this distinction that we aim to leverage. Finally, it should be noted that the profiling tool that we use in this work targets the Java platform and has been developed in prior work [35,40,42,45]. 4. Signature generation 4.1. Motivation and overview The goal of our work is to generate signatures that are representative of attack patterns, and comprised of combinations of pro1 A failure induces a behavior that does not satisfy the requirements, and similarly, an attack induces a (malicious) behavior that does not satisfy the requirements (to the benefit of the attacker). Also, both failures and attacks might or might not be observable.

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

gram elements of multiple types. Obviously, such combinations must be of relatively minimal-size, to allow for tolerable runtime overhead (during the matching process), and have executed in as many exploits and as few safe runs as possible, so as to ensure acceptable rates of false positives and false negatives. Generating theses signatures using the brute-force algorithm would entail considering all possible combinations of profiled elements, i.e. an exponential number of combinations with respect to the size of the profiles, which is not a viable solution. Therefore, an approximation algorithm is the rational alternative, and in our case, we use a genetic algorithm. Genetic algorithms are global search heuristics used to solve combinatorial optimization and learning problems [27]. In general, a genetic algorithm solves a given problem by operating on a population of candidate solutions, evaluating their quality, then applying a form of transformation over generations or iterations to improve the quality of these solutions, and ultimately evolving to a single solution – or set of solutions – that fits certain criteria. Basically, a genetic algorithm is characterized by the following phases: (1) Building an initial population of candidate solutions. (2) Iteratively: (a) Selecting two individuals or parents from the population. (b) Applying a transformation to generate an offspring, i.e., a child or children. (c) Inserting the offspring into the population by a replacement method. (d) Evaluating if the stopping criteria of the algorithm has been met; otherwise, keep iterating over the generations. (3) Reporting the obtained solution or solutions, which fit the specified criteria. To further illustrate how a genetic algorithm operates, we present the generic requirements of genetic algorithms before showing how these requirements map into our own implementation: (1) A representation of the problem solution, denoted as a chromosome, e.g., a bit string. (2) An evaluation mechanism for the quality of a chromosome, referred to as a fitness function; this is typically problemspecific and represents a numerical assessment of how good a solution the chromosome represents. (3) A population generation mechanism. (4) A transformation operator to improve population fitness across generations. (5) An acceptance criterion for solution selection. (6) A stopping criterion for the end of the algorithm and the required solution. 4.2. Genetic algorithm for signature generation Here we discuss the design of the genetic algorithm as used for the purpose of generating attack-inducing signatures. The general steps of this genetic algorithm are detailed in the form of pseudocode in Appendix A [4]. Recall that we generate one signature per vulnerability, even though one signature characterizing multiple vulnerabilities could also be generated using our algorithm. 4.2.1. Chromosome representation In our implementation, a chromosome has to represent a combination of multi-type elements, so a bit string notation would be suitable to indicate which profiled elements are included in each combination instance. The size of each bit string is equal to the total number of execution elements gathered during the profiling phase; a bit set to 1 implies that the corresponding element is included in this particular combination. Therefore, by varying the positions of 1s and 0s in a bit string, a new combination of program elements is created, and the number of 1s in the bit string corre-

241

sponds to the size of the combination. An example of chromosome representation in the context of our problem is given in Fig. 1. 4.2.2. Fitness function Once an individual chromosome is created, its fitness is evaluated to determine how good this combination solution is in identifying exploits. For this purpose, we employ a fitness function that relies on the number of safe and unsafe test cases in the training set which contain this combination. In particular, the fitness function is a numeric measure of the quality of the solution that indicates the execution frequency of a combination in malicious test cases relative to safe test cases as is shown in this equation:

fitnessðchromosomeÞ ¼ %unsafe runs  %safe runs where %unsafe runs is the number of malicious test cases containing the combination over the total number of malicious test cases, and likewise for %safe runs. Ultimately, the aim is to end up with a solution whose fitness is equal to 1, i.e. a combination (or set of combinations) that occurred in all exploits but not in any safe run. This would guarantee the elimination of false negatives and false positives from the training set, but not necessarily from all yet unseen tests, of course. 4.2.3. Population generation The population is a set of chromosomes signifying a collection of candidate solutions, which will evolve into the final solution. We begin with the initiating chromosome from which the entire population is spawned. The initiating chromosome is constructed as a function of the intersection of the execution profiles of all unsafe test runs in the training set (as described in Algorithm 1 in Appendix A [4]). The resulting program elements are the ones of interest since they occurred in all attacks associated with the given vulnerability. The population is then formed, one combination bit string at a time, by taking a random subset of the initiating chromosome. The subset selection is done in a probabilistically-randomized manner: for each element of the initiating chromosome, the chosen probability determines if it is to be included in the combination or not. The value of this probability governs the size of the resulting combination relative to the size of the initiating chromosome, so we tend to use rather small probabilities to satisfy our requirement for minimal-sized combinations (see PROB_SET in Appendix D [4]). To further illustrate the population generation mechanism, Fig. 2 presents a sample of three individual chromosomes derived from the initiating chromosome. 4.2.4. Transformation operator We employ fitness-based crossover [56], its basic functioning resembles that of genetic heredity, where a new chromosome is produced as a result of combining two parent chromosomes and passing down properties from each onto the new child, always favoring the parent with the higher fitness. The intended result is for the constructed child to have as good or better fitness than its parents. The adopted genetic algorithm is a steady-state one [56], implying that the transformation is applied across generations, in each generation creating a single new child which replaces another individual in the population. To diversify the child generation process, a child chromosome is either the result of a crossover or a completely random setting similar to that introduced in the population generation phase; however, crossover is applied at a much higher rate to maintain the property of inheritance (see PROB_CROSS in Appendix D [4]). To conduct the crossover, two parent combinations are randomly selected from the population then the child generated is a combination containing program elements from both parents, behaving similarly to the idea of inheriting traits from one’s parents. Nevertheless, to ensure improvement of the child’s fitness, more elements should be taken

242

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

Fig. 1. Chromosome representation: the circled 1-bit indicates that the corresponding method call element is included in the combination; the 0-bit implies that the corresponding DU-pair element is not included.

Fig. 2. Initial population generation: (a) Initiating chromosome representing the intersection of exploit profiles. (b) Population chromosomes; the 0-bit circled in red indicates that the element did not execute in all unsafe test case, placing a 0-bit in all individuals. The 1-bits in the initiating chromosome are probabilistically assigned as 1s or 0s in the population chromosomes. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

from the parent combination with higher fitness. On the bit string level, each bit in the child chromosome is set to be equal to the same-indexed-bit in one of its parents, always favoring the better-fitted one. This is accomplished by assigning the probability factor of adopting the bit value from the parent with better fitness to be higher than that of the lower-fitness parent. The detailed implementation pseudocode is listed in Appendix A [4] and Fig. 3 portrays a schematic of the bit inheritance from parents to child. In the end, the child resulting from a crossover is added to the population by replacing the parent with the worse fitness, whereas a randomly generated child replaces a randomly chosen individual in the population.

4.2.5. Acceptance criterion The fitness function is a measure of the quality of a certain solution or combination, but to actually determine when a solution is deemed fit enough to be considered, a certain threshold for the fitness value must be set. In our case, we evaluate the fitness of a chromosome to determine whether it is fit enough to include in the general population once it is created. However, we do not require such threshold to be high. In fact, small values are desirable as they ensure the formation of a population to start with as well as

achieving more diversity among the candidate solutions (see ACCEPT_POPULATION in Appendix D [4]). 4.2.6. Stopping criterion The last step to determine when the generational evolution should stop, indicating that an adequate solution to the problem has been attained. This happens when one of the following conditions is met: (1) a solution with fitness equal to 1.0 is encountered, which means that there is no room for improvement anymore; or (2) a maximum number of generations is reached, as determined by the parameter NUM_GENERATIONS (see Appendix D). It is worth mentioning that we keep track of the best encountered solution (resulting from a crossover or random generation) throughout the entire process. This guarantees that we end up having the best alternative in case no solution with fitness of 1.0 was found. 4.2.7. Selecting the signature-set Applying the genetic algorithm to a given vulnerability might result in many candidate signatures where each by itself correlates with the associated exploit, i.e., many signatures might exhibit the highest fitness. Since monitoring all of them imposes a higher performance cost, we opt to randomly pick one of them to be matched online. And given multiple vulnerabilities in an application, we end

Fig. 3. Child creation via crossover: The red bits in the child are inherited from parent 1 (with higher fitness). The blue bits are taken from parent 2, but such bits are less frequent than those from parent1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

up with a signature set comprising one signature per vulnerability. Finally, the selected signature set is packaged in an XML format before being passed to the online signature matching system. The used XML schema is described in Appendix B [4]. 5. Signature matching The back-end of our proposed IDS involves execution profiling followed by signature generation; signature matching constitutes its front-end. First, it should be pointed out that signature generation is carried out for each individual vulnerability, i.e., given an application with multiple vulnerabilities, multiple independent signatures are generated, and their resulting XML representations are merged into a single definition file to be passed to the signature matching subsystem. This incremental process of generating signatures is likely to be more efficient and effective than generating a single signature for all vulnerabilities as it was done in [13]. The improved efficiency stems from the fact that: (1) the genetic algorithm is expected to converge much faster when dealing with individual vulnerabilities as opposed to multiple unrelated ones, due to the smaller search space, and (2) vulnerabilities are typically discovered one after the other and not all at once, therefore, it is faster to generate signatures for the newly discovered ones only as opposed to all. The improved effectiveness stems from the fact that the genetic algorithm is expected to identify more accurate signatures (having a higher fitness) when dealing with individual vulnerabilities as opposed to multiple unrelated ones, also due to the smaller search space. Hereafter, a signature generated for an individual vulnerability will be referred to simply as signature, and the collection of signatures for all vulnerabilities in the application will be referred to as signature set. Given the signature set definition file, the signature matching subsystem parses it at startup and maintains a description of the program elements characterizing the signatures in order to: (1) dynamically instrument the target application, (2) perform signature matching, and (3) alert in case of intrusions. This subsystem checks if all the program elements (MC, MCP, BB, BBE, DUP) constituting a given signature in the signature set occur in a single execution of the application. To enable this type of surveillance, instrumentation is required. Instrumentation is the injection of instructions at the Java bytecode-level of an application. In this instance, instrumentation is meant to detect when the elements identified as part of a signature are in fact encountered. Therefore, two modules are designed to construct the signature matching subsystem, and will be discussed next: (1) the instrumentation module, and (2) the matching module. 5.1. Instrumentation module In order to achieve acceptable online performance we choose to apply selective and dynamic instrumentation to enable the required application monitoring. Selective instrumentation is desirable as only the elements specified by the signatures (and some related ones) need to be analyzed [48]. The advantage of dynamic instrumentation is that it does not require stopping and restarting the deployed application any time a new signature is to be registered within the matching system, which is not the case with static instrumentation. In implementing our module we use the java.lang.instrument package, which enables dynamic instrumentation and can be used in conjunction with the bytecode manipulation library BCEL, which provides functionality to inject bytecode instructions. This package allows instrumentation by way of Java agents, which are pluggable libraries that run embedded in the Java Virtual Machine (JVM) and intercept the class-loading process. The

243

agents are run in tandem with the target application and are programmed to carry out the instrumentation. The implemented agent (1) reads in the XML formatted signature set definition file, (2) parses it to determine what particular elements are involved in each of the signatures, then (3) instruments the classes – at load time – to insert the necessary bytecode instructions. The inserted instructions which consist of calls to methods within the matching module vary according to the type of elements in a given signature; therefore, calls are inserted at the following locations: (1) For MC elements: at the entry statement of the method specified by the signature. (2) For MCP elements: at the entry and exit of all methods in the application as this is needed in order to track the application’s dynamic call stack. Given a signature specified as MCP(f1, f2), it is not sufficient to simply check whether functions f1 and f2 were entered, but the matching module should check the top of the call stack at the entry of f2 to make sure that f1 was actually the caller. (3) For BB elements: at the entry statement of the basic block specified by the signature. (4) For BBE elements: at the entry statement of all BBs in the method containing the BBE specified by the signature. A BBE entails a source BB and a target BB both of which belonging to the same method. Instrumenting the source and target of the BBE is not sufficient as it is not always the case that the target will execute right after the source. Therefore, the matcher must instrument all BBs in the method in order to track the last executed BB. In this manner, for proper BBE matching, if the target of the BBE is matched, the matcher must check that the last executed BB was actually the source. (5) For DUP elements: at the definition and use statements specified by the signature, and also at all other statements that define the involved variable. DUPs form a relationship between a store and a load of a particular variable. For DUPs involving local variables, this relationship is intra-procedural, whereas it might be intra- or inter-procedural for static variables, instance fields, or array elements. DUP(s1, s2) signifies that the variable loaded at the use site s2 was defined at the given definition site s1. Any killing definition executing in between these two sites would nullify the definition statement s1. Therefore, simply examining the definition and use locations specified by the signature is insufficient, and the instrumentation must treat all other definition sites of the variable involved so as to detect the occurrence of a redefinition of the variable. Instrumenting all possible definition sites entails injecting instructions in all the methods where the variable is defined – except for the case of local variables where only a single method is concerned – naturally adding to the instrumentation overhead; nevertheless, this measure is unavoidable for correctness. For DUPs involving instance fields or array elements, the associated objects must be used in identifying these variables. For that purpose, given a signature that specifies DUP(s1, s2) involving instance field o.x (or array element a [5]), the matching module performs the following: (1) when the definition site s1 is reached, object o (or array a) is added to a list associated with DUP(s1, s2), (2) when a killing definition occurs, the corresponding object (or array) is removed from the list, if it is already there, and (3) when the use site s2 is reached, a match is considered to be successful only if o (or a) is still in the list. As it is apparent from the above, the efficiency of online signature matching is dictated not only by the number of elements con-

244

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

stituting the signatures in the signature set, but also by the type of these elements. Clearly, the cost of monitoring MCs and BBs is lower than of MCPs, BBEs and DUPs. For this reason, it is beneficial to first try to identify signatures containing only MCs and BBs, and in case none are found, then try to identify signatures containing the other types of elements. In fact, this scheme is adopted in our work, as described in Section 6.2. 5.2. Matching module The matching module maintains a flag for each of the signatures in the signature set, as well as a set of flags for each of the elements within the given signature. If all the elements constituting a particular signature have executed and had their flags set, then the flag pertaining to that signature is set as well, and a message alerts to the detection of an intrusion. Consider the following two scenarios where the element under inspection is a DUP(s1, s2) involving a local variable. In the first: (a) the definition site s1 is executed leading the matching module to set a flag indicating the execution of the definition, and (b) the use site s2 is executed while the definition flag is still set. In the second: (a) the definition site s1 is executed leading the module to set a flag indicating the execution of the definition, (b) the variable was redefined leading the matching module to reset the definition flag, and (c) the use site s2 is executed while the definition flag is not set as a result of (b). In the first scenario, the module will indicate that DUP(s1,s2) has matched, whereas in the second it will not.

6. Empirical evaluation Here we present our empirical evaluation. Section 6.1 describes our subject programs and test suites. Section 6.2 describes the experimental setup we used for assessing our approach. The results are reported in Section 6.3. Cost analysis and the threats to the validity of our approach are discussed in Section 6.4 and Section 6.5, respectively. 6.1. Subject programs and test suites Our study involves seven applications, three having a security focus, namely, Tomcat 3.0, Tomcat 3.2.1, and Jigsaw 2.0.5, and four that do not, i.e., print_tokens2, JTidy, schedule, and tot_info. The latter applications are used to demonstrate the applicability of the approach to online failure detection; noting that most existing work on online failure detection [5,9] is based on anomaly detection. For simplicity, we will refer to the defects and failures in print_tokens2, JTidy, schedule, and tot_info as vulnerabilities and attacks, respectively. The vulnerabilities investigated are listed in Appendix C [4]. Apache Tomcat 3.0 is an open-source Servlet/JSP container having four vulnerabilities: JSP source disclosure (vul-1), directory listing disclosure (vul-2), and JSP engine denial of service (vul-3 and vul-4). The test suite consists of 658 requests, 460 of which are safe, while the rest are attack-inducing. Note that in this program, as well as in the others, the number of malicious requests is far less than that of safe ones within the test suite, which is analogous to a real-life situation. The safe requests included 193 Servlet requests, 150 JSP requests, and 117 HTML/text requests. Whereas the unsafe requests included 150 requests exploiting vul-1, 38 requests exploiting vul-2, 6 requests for vul-3, and 4 requests for vul-4. Apache Tomcat 3.2.1 contains three vulnerabilities: vul-1 exhibits JSP source code disclosure, and vul-2 and vul-3 exhibit JSP engine denial of service. The test suite consists of 497 requests, with only 24 unsafe tests. The safe requests include 283 Servlet requests, 69 JSP requests, and 121 HTML/text requests. The unsafe

requests comprise 18 requests exploiting vul-1, 2 requests exploiting vul-2, and 4 requests for vul-3. Jigsaw 2.0.5 is also an open-source web server and Servlet container containing four vulnerabilities: denial of service (vul-1), path disclosure (vul-2), directory listing disclosure (vul-3) and illegal file access (vul-4). The test suite contains 490 normal requests and 40 exploits. The safe requests include 193 Servlet requests, 191 JSP requests, and 106 normal HTML/text requests. The unsafe requests include 1 request exploiting vul-1, 2 requests exploiting vul-2, 4 requests for vul-3, and 33 requests for vul-4. JTidy 3.0 is an HTML syntax checker and pretty printer. Its test suite comprises 1000 files (each containing 280 lines on average). Some of the tests were downloaded from the Google Groups (groups.google.com) using a web crawler and the others were part of the XML Conformance Test Suite. Of these, 192 were XML files and the rest were HTML files. JTidy failed on 180 of these test cases distributed as follows: (1) 83 exercised vul-1; (2) 2 exercised vul-2; (3) 95 exercised vul-3. Finally, it should be noted that a variant of the JTidy test suite was built and used in prior work [39,45]. print_tokens2 is a lexical analyzer developed as part of the Siemens benchmark [12]. We constructed a multiple-fault version using some of the original bugs. The test suite contains 1801 passing tests and 548 failing ones distributed among 7 vulnerabilities as follows: (1) 205 exercised vul-1; (2) 146 exercised vul-2; (3) 19 exercised vul-3; (4) 20 exercised vul-4; (5) 96 exercised vul-5; (6) 33 exercised vul-6; and (7) 29 exercised vul-7. schedule is a priority scheduler also from the Siemens suite. The multiple-fault version we used involved 981 passing tests and 1314 failing tests. Among the failing runs, 1070 exercised vul-1, 30 exercised vul-2, and 214 exercised vul-3. tot_info is another Siemens program, which computes information measures. We set up a multiple-fault version of it that includes 6 vulnerabilities. The test suite consists of 791 passing runs and 148 failing runs: (1) 20 exercised vul-1; (2) 19 exercised vul-2; (3) 37 exercised vul-3; (4) 1 exercised vul-4; (5) 3 exercised vul-5; and (6) 68 exercised vul-6. 6.2. Experimental setup We evaluate the effectiveness of the system in detecting attacks by measuring the rate of false positives and false negatives exhibited. We also consider the overhead imposed during its online deployment. For this purpose, we conduct experiments on our subject programs and compute the relevant metrics. Each subject program has an associated test suite T that, for the sake of our study, we consider to be an adequate representation of its input space. For each program we apply the following steps: (1) (2) (3) a.

Identify the safe and unsafe test cases in T. Generate the execution profiles for the tests in T. For each vulnerability perform the following: Construct a training set T0 that is a subset of T, which includes both safe and unsafe runs. We opt to vary the size of T0 in order to assess its effect on the accuracy of our approach, especially when only a fraction of the possible exploits is presented during the learning process. Specifically, we choose |T0 |/|T| to take on the values 5%, 10%, 20%, through 90%, respectively. Also, in our experiments we explore two modes for constructing T0 , clustering and random, described later in this section. b. Apply the genetic algorithm to generate a signature associated with the vulnerability comprising combinations of program elements. And as discussed in Section 5.1, we will first try to generate signatures containing only MCs and BBs, and in case none are found, then we will try to generate signatures containing the other types of elements. Specifically, signature

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

(4) (5) (6) (7)

generation will be orderly conducted using the following types of elements until a high fitness signature is found: {MCs, BBs}, {MCs, BBs, DUPs}, {MCs, BBs, DUPs, MCPs}, {MCs, BBs, DUPs, MCPs, BBEs}. Save the resulting signature set in an XML definition file. Activate the signature matching subsystem using the signature set and rerun the application using T. Determine the number of false positives and false negatives detected. Measure the slowdown as the time ratio of running the test suite with and without activating the matching system.

The random mode for constructing the training set is conducted by simply randomly selecting tests from T to form T0 , whereas the clustering mode proceeds as follows: test cases in T are automatically clustered based on the similarity of their execution profiles comprising MCs, MCPs, BBs, BBEs, and DUPs. Then T0 is built by randomly selecting, if possible, two tests from each cluster, one safe test and one unsafe test associated with the given vulnerability. This process is repeated over all clusters until the desired percentage of safe/unsafe tests is reached. The goal here is for T0 to cover, as much as possible, the behaviors induced by T. A similar approach was used by the main author for the purpose of test suite minimization in [45]. Note that the size of T0 is dictated by the chosen number of clusters and by our decision to ensure that at least two unsafe test case from the given vulnerability is included. Obviously, these two modes are not applicable in practice since T is unknown. We investigate them strictly to validate our assumption that the training set must adequately characterize the usage pattern of the application (Assumption2), as we hypothesize that T0 built using clustering should characterize T better than when built using random. And thus, we expect our results to turn out better using clustering. Finally, due to the non-determinism introduced by the selection of the training sets and the use of the genetic algorithm, the results presented in the Section 6.3 are reported by averaging the results from five iterations of our approach.

6.3. Experimental results We measure the number of false alarms produced when a certain signature set is deployed. False alarms include false negatives induced by undetected exploits and false positives resulting from incorrect labeling of safe executions. By monitoring the output of our IDS, we can determine which test cases are flagged as attacks when the IDS issues an alert. Then, this collection of ‘‘attacks’’ is compared with the actual set of exploits given initially: a missing request is tagged as a false negative, and any safe request in the ‘‘attacks’’ collection is counted as a false positive. The percentage of false alarms is computed as follows: % False Negatives ¼ ð# of falsely flagged exploitsÞ=ðTotal # of exploitsÞ % False Positives ¼ ð# of falsely flagged safe runsÞ=ðTotal # of safe runsÞ

The amount of overhead induced is measured by running T on the application prior to instrumentation and then again after the IDS is activated, and in both cases recording the execution times. The slowdown is thus calculated as:

% Slowdown ¼ ðTime with IDS  Time without IDSÞ=ðTime without IDSÞ

245

In addition to the above three entities, we compute the following: (1) Average Signature Length (ASL): average length (number of program elements) of the generated individual signatures in the signature set. (2) Elements Types (Types): types of the elements constituting the signatures in the signature set (involved in all five iterations). (3) Best Possible Fitness using Training Set (BPFT0 ): fitness of the intersecting elements amongst the failing runs in T0 . If BPFT0 < 1.0, our approach will not yield good results. This metric could be computed prior to deployment; therefore, a user will be aware of the potential risk of incoming false alarms in case its value is less than 1.0. (4) Best Possible Fitness using Full Test Suite (BPFT): fitness of the intersecting elements amongst the failing runs in T. If BPFT < 1.0, our approach will not yield good results. Note that this number cannot be computed in practice as T is not known; we are computing it strictly for analysis purposes. Appendix E presents the complete set of our results, i.e., for each of the 2 modes of constructing the training set (clustering and random), it includes a table for each of the 30 vulnerabilities in addition to a table for each of the 7 applications. Presenting all 74 tables here is not practical; instead we present one sample table (Table 1) and one summary table (Table 2) comprised of selected entries from each of the original tables. The selected entries are meant to correspond to the cases when our approach performed best in terms of both low rates of false alarms and small sizes of the training sets. Table 1 presents the data for vul-3 in Tomcat 3.0 in which the training sets were formed using clustering. Not shown in the table is the value of BPFT which is 1. The fact that the value of BPFT and the values of BPFT0 (corresponding to each of the 5 runs) are 1 suggests that it is likely that the chosen training sets will generate effective signatures, i.e., signatures that closely characterize the vulnerability. Actually all training sets of size 30% or more resulted in no false alarms; therefore, the entry with a training set size of 30% will be selected to be shown in Table 2. Also shown is the average signature length which varies from 2.2 to 4.0, and the elements types which include MCs and BBs only. Table 2 presents selected entries from each of the 74 tables. The 7 rows denoted as All correspond to the cases when all vulnerabilities in a given application are considered (the whole signature set), the other 30 correspond to the cases when individual vulnerabilities are considered. The table presents the number of exploits/ failures, then for both modes of building the training set, it shows the best observed result. We make the following observations regarding the clustering mode: (1) Our approach performed very well except for vul-3 in jigsaw, vul-2 in schedule, and vul-2 and vul-3 in tot_info. That is, for 26 out of 30 cases, it exhibited no false alarms given small training sets. Note that out of the 26 successful cases, 19 involved training sets of size 5%, 3 of size 10%, 2 of size 20%, and 2 of size 30%. Also, the average signature length varied from 1.0 to 4.2 with an average of 2.14; and except for 6 cases, the signatures involved MCs and/or BBs only. This implies that the genetic algorithm was capable, in most cases, to identify small, simple and effective signatures from relatively small training sets. (2) The values of BPFT and BPFT0 is 1 for all entries except vul-2 in schedule, and vul-2 and vul-3 in tot_info (see Appendix E), which explains why our approach performed poorly for these cases. This shortcoming is due to the fact that given the used profile types there were no combinations of elements that highly correlate with the exploits at hand. Using

246

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

Table 1 Data for Tomcat 3.0 – vulnerability 3. |T0 |/|T| (%)

%FP

%FN

ASL

Types

BPFT0 (run1)

BPFT0 (run2)

BPFT0 (run3)

BPFT0 (run4)

BPFT0 (run5)

5 10 20 30 40 50 60 70 80 90

0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

13.33 13.33 13.33 0.00 0.00 0.00 0.00 0.00 0.00 0.00

3.8 3.2 3.4 3.6 2.2 4 2.6 3 3.4 4.0

MC–BB MC–BB MC–BB MC–BB MC–BB MC–BB MC–BB MC–BB MC–BB MC–BB

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

(3)

(4)

(5)

(6)

(7)

more complex profile types might remedy this problem. Also, given that the values of BPFT0 are known to be less than 1.0 prior to deployment; a user of our approach could take precautionary measures to deal with the potential incoming false alarms. As expected, when our approach does not perform well on one or more individual vulnerabilities it also performs poorly on the combined vulnerabilities, as exhibited in the All entries. In regard to the slowdown induced by our matching system, it was not insignificant but not hindering as it varied from 46% to 102%. Specifically, the slowdown was 74.2%, 65.5%, 92.7%, 47%, 102%, 46%, and 60%, for Tomcat 3.0, Tomcat 3.2.1, Jigsaw 2.0.5, print_tokens2, JTidy, schedule, and tot_info, respectively. Note that these numbers correspond to the cases when the whole signature set is considered (i.e., All). Considering vul-3 in jigsaw, Appendix E shows that for the various training sets, it is the rate of false negatives that is excessively high and not the rate of false positives. This means that when forming the training sets, the selected unsafe tests are not adequately characterizing the malicious usage pattern of the application. Also, as shown in Table 2, random performed unexpectedly better than clustering. Therefore, the poor performance here is due to an inadequate training set, i.e., Assumption2 was not satisfied. Note that only 4 unsafe tests are present in T which is likely the reason behind this behavior. In Section 6.2 we expected that clustering would yield better results than random. Table 2 supports that in general except for the case of vul-3 in jigsaw. The implementation of our approach is limited in that it only analyzes (i.e., profiles and monitors) the subject programs themselves, it does not go as far as analyzing code generated by the subject programs, a generally very difficult task. This might pose a problem in our empirical evaluation which involves JSP engines. Following a JSP request, a JSP engine generates a Servlet representing the commands in the JSP file and in some cases the execution of some program elements in the Servlet code might be critical to generating an effective attack signature and later detecting the attack. Given this limitation, our implementation is expected to fail at detecting attacks induced by vul-3 and 4 in Tomcat 3.0, and vul-2 and 3 in Tomcat 3.2.1. It is so because these are the only cases where the vulnerabilities clearly reside inside the JSP file and not in the application code itself. (Note that these four vulnerabilities actually amount to only two distinct ones, as shown in Appendix C). Surprisingly, the results were good for these vulnerabilities. One explanation would be that the malicious JSP requests indirectly induced unusual behavior in the applications’ code that got identified during the signature generation process.

To summarize, our approach worked very well for 26 out of the 30 vulnerabilities but failed for the remaining 4. That is, in 86.67% of the cases it produced no false alarms using training sets of size 30% or less. Note that the average rates of false negatives and false positives were 0.43% and 1.03%, respectively. 6.4. Profiling and cost analysis Table 3 shows the number of profiling elements recorded for each of the seven applications, categorized by element type. For example, as a result of executing the Tomcat 3.0 test suite, a total of 26,137 distinct program elements were tracked and recorded of which 1083 were MCs, 1640 were MCPs, 7130 were BBs, 7533 were BBEs, and 8751 were DUPs. Typically, these numbers differ based on the type and structure of the application, and are indicative of the size of the bit strings operated on in the genetic algorithm. We now discuss the cost of each of the four tasks (see Section 2.1) in our proposed approach. Cost of Task1: since the training set is formed by simply augmenting the existing application’s test suite by the observed attacks, the cost of creating it was insignificant in our study. Cost of Task2 the cost of collecting the test suites profiles for Tomcat 3.0, Tomcat 3.2.1, Jigsaw 2.0.5, print_tokens2, JTidy, schedule, and tot_info was 13 min, 1.1 min, 3.6 min, 83 min, 271 min, 57 min, and 30 min, respectively. Noting that profiling the safe runs of a given test suite is required to be performed only once (for the lifetime of the version of the application), whereas the attack runs must be profiled right after they get discovered in the field. Cost of Task3 the signature generation process took, on average, less than one second for 26 vulnerabilities. But it took considerably more for vul-2 in schedule (63 s), vul-2 in tot_info (30 s), vul-3 in tot_info (20 s), and vul-2 in print_tokens2 (4 s). Cost of Task4 as stated in the previous section incorporating the generated signatures in the IDS induced a slowdown that varied from 46% to 102%. 6.5. Threats to validity Our experimental study involved only seven subject programs from the same computing platform; therefore, it is not possible to draw broad conclusions based on our results. To address this threat to external validity, further empirical studies with a variety of other subject programs from different domains and environments are needed. Regarding the internal threats to the validity of our approach, some of the assumptions we make (Section 2.2) may not always hold: (1) The training set adequately characterizes the usage pattern of the target application (Assumption2).

247

W. Masri et al. / Information and Software Technology 56 (2014) 238–251 Table 2 Selected data from all applications. # Fail

Clustering 0

Random

|T |/|T| (%)

%FP

%FN

ASL

Types

|T0 |/|T| (%)

%FP

%FN

ASL

Types

5 5 30 10 30

0 0 0 0 0

0 0 0 0 0

2.0 1.0 3.6 3.6 –

BB BB MC–BB MC–BB –

5 10 30 5 30

0 0 0 0 0

0 0 0 0 0

1.0 1.0 4.0 3.6 –

BB BB MC–BB MC–BB –

Tomcat 3.2.1 vul-1 18 vul-2 2 vul-3 4 All 24

5 5 5 5

0 0 0 0

0 0 0 0

2.8 2.6 3.8 –

MC–BB MC–BB MC–BB –

5 30 20 30

0 0 0 0

0 0 0 0

1.0 3.0 3.6 –

BB MC–BB MC–BB –

Jigsaw vul-1 vul-2 vul-3 vul-4 All

1 2 4 33 40

5 5 10 5 60

0 0 0 0 0

0 0 10 0 0

3.2 3.2 3.4 2.0 –

MC–BB MC–BB MC–BB BB –

5 5 5 5 5

0 0 0.08 0 0.08

0 0 0 0 0

4.0 3.2 1.8 2.4 –

MC–BB MC–BB MC–BB MC–BB –

JTidy vul-1 vul-2 vul-3 All

83 2 95 180

30 5 10 30

0 0 0 0

0 0 0 0

2.0 2.6 3.4 –

BB MC–BB MC–BB –

40 5 10 40

0 0 0.02 0

0 0 0 0

3.8 2.4 3.0 –

MC–BB BB MC–BB –

print_tokens2 vul-1 205 vul-2 146 vul-3 19 vul-4 20 vul-5 96 vul-6 33 vul-7 29 All 548

5 5 5 20 5 20 10 5

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1.0 1.0 2.0 2.0 1.0 2.0 1.0 –

BB BBE MC–BB MC–BB BB MC–BB MC –

5 5 10 60 30 20 20 5

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1.0 1.0 2.0 2.0 1.0 2.0 1.0 –

BB BBE MC–BB MC–BB BB MC–BB MC –

Schedule vul-1 vul-2 vul-3 All

1070 30 214 1314

5 40 5 40

0 5.5 0 5.5

0 4.7 0 0.11

1.0 4.2 2.0 –

BB MC–BB–DUP BB –

20 100 10 80

0 8.1 0 7.2

0 0 0 0.05

1.0 5.2 2.6 –

BB MC–BB–DUP BB –

tot_info vul-1 vul-2 vul-3 vul-4 vul-5 vul-6 All

20 19 37 1 3 68 148

5 50 20 5 5 5 20

0 14.4 2.78 0 0 0 10.2

0 0 0 0 0 0 1.2

1.0 1.6 2.2 1.0 1.0 1.0 –

DUP BB BB–BBE BB DUP DUP –

30 50 60 5 10 10 60

0 14.4 2.78 0 0 0 14.1

0 0 0 0 6.67 0 0

1.0 2.0 3.0 2.2 1.2 1.8 –

DUP BB BB–BBE MC–BB MC–DUP MC–DUP –

Tomcat 3.0 vul-1 vul-2 vul-3 vul-4 All

150 38 6 4 198

(2) Since at least two samples of any given exploit is always included in the training set, the risk of obtaining a high rate of false negatives is expected to be insignificant. Nevertheless, as exhibited by vul-3 in jigsaw, this assumption is not always easy to satisfy. To improve the quality of a training set the following will be explored: (a) using capture/replay tools to collect more inputs from the field [57], (b) using test case generation tools to automatically create additional inputs [54], and (c) leveraging test visualization tools to augment the training set [26]. It should be noted that this problem would typically stem from the fact that the training set was build upon a test suite that was inadequate to start with. (3) There exists at least one combination of program elements that highly correlates with exploits of any given vulnerability (Assumption3). (4) If no structural combination could characterize the exploit, as it was the case with vul-2 in schedule, and vul-2 and 3 in tot_info, additional information could be used in deriving signatures, such as state and event sequence information.

E.g., def-use pairs and branches could be augmented by computing predicates that relate the values (states) taken by their sources and targets [1]. (5) Given that such combinations exist, the (non-deterministic) genetic algorithm will ultimately identify them (Assumption4). (6) We did not encounter any case where a high fitness signature existed and the genetic algorithm was unable to find it. However, it is worth exploring other search algorithms such as simulated annealing and tabu search. 7. Related work This paper extends the work in [13] by: (1) using combinations of MC’s, MCP’s, BB’s, BBE’s and DUP’s as opposed to just BB’s; (2) conducting signature matching at the byte code level as opposed to the Java level, thus improving the accuracy of matching and reducing the rate of false positives; (3) generating the signature for each vulnerability individually then merging the resulting signatures into a signature set to be monitored online. This has the

248

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

Table 3 Number of program elements per application.

Total MCs MCPs BBs BBEs DUPs

Tomcat 3.0

Tomcat 3.2.1

Jigsaw

print_tokens2

JTidy

schedule

tot_info

26,137 1083 1640 7130 7533 8751

24,438 1030 1637 6485 6777 8509

29,895 1216 2155 7553 7978 10,993

891 22 30 278 310 251

22,110 325 693 4853 5604 10,635

1047 24 37 238 280 468

1276 18 23 271 315 649

potential of improving the efficiency and effectiveness of signature generation; (4) using cluster analysis to build more representative training sets for the purpose of better evaluating our technique; (5) providing more detailed descriptions of the devised algorithms, techniques, and experimental setup; and (6) using additional and larger test suites that included applications with no security focus. The idea of mining for program elements that correlate with failure stems from the area of spectrum-based fault localization [20,34,1,36–38]. This work is different in aim and in that it mines for combinations of elements as opposed to individual ones [37]; also, it does not require localizing the fault or vulnerability. Note that the success of spectrum-based fault localization at localizing faults has been limited [51] as the better techniques still require the developer to examine around 10% of the code, on average. Next we describe work related to our approach that we categorize and list in order of relevance as follows: profile-based techniques, pattern matching techniques, taint-based techniques, and finally anomaly-based techniques; noting that some listed work fall under more than one category. 7.1. Profile-based techniques In previous work, the main author presented an approach to detecting security attacks, whose primary goal is facilitating identification and repair of security vulnerabilities rather than permitting online response to attacks [41]. The approach is based on online capture of execution inputs and offline replay, profiling, and analysis. The high level steps of the approach are as follows: 1. Online capture of n inputs that include both normal and malicious requests. 2. Instrumenting the target program to enable the collection of execution profiles that consist of information flow pairs [11,53,42–44]. 3. Replaying the instrumented program on the captured n inputs and collecting the associated profiles. 4. Selecting for (manual) auditing a subset of executions that exhibited unusual or distinctive profiles. This is done using cluster analysis as follows: (a) choosing a number of clusters c; (b) clustering the n executions based on their profiles; and (c) randomly selecting a single execution from each of the c clusters to be classified as either normal or abnormal. In this manner, the manual effort would be reduced by auditing c executions as opposed to n executions. The above profile/anomaly based approach aims at offline detection of attacks for the purpose of repairing vulnerabilities. This proposed work is different in focus as it aims at online intrusion detection using signature matching. In [32] Martin et al. presented PQL (Program Query Language), a language that allows developers to specify code patterns that characterize given vulnerabilities. The user provided code patterns are then used to generate a static matcher and a dynamic matcher, where the former is inherently more conservative and the latter is more relevant to our work as it performs the matching at run-

time. Aspect-oriented programming is used to instrument the application in order to generate execution traces to be analyzed by the dynamic matcher. This approach resembles our proposed solution as it also monitors executing events, but differs in the following: 1. It requires the user to examine the application code and manually specify the code patterns (i.e., signatures) to be matched, which is demanding on the part of the user and prone to error. In our solution, on the other hand, the signatures are automatically discovered. 2. The use of aspect-oriented programming limits the expressiveness and complexity of the user defined patterns [50]. The patterns discovered by our technique could be complex as they encompass combinations of profiling elements. In [30] Lorenzoli et al. presented a technique that identifies failure contexts and prevents future occurrences of the failures they describe. An outline of the technique follows: 1. The user manually specifies oracles, in the form of JML assertions, which address specific fault types. 2. Static data-flow analysis is used to automatically identify the program points that potentially affect the oracles. 3. The program points are monitored during training to augment the oracles with dynamically generated properties (using Daikon [14]). 4. The user specified oracles and automatically generated properties are then used for failure detection and failure analysis. The methodology of our approach is fundamentally different than what Lorenzoli et al. proposed, and more importantly our approach has two main advantages: (a) it does not require the user to define oracles; and (b) it is oblivious to fault type. 7.2. Pattern matching techniques Exploit-based pattern matching techniques analyze incoming input to extract a pattern to be matched against attack sequences stored in the detection system’s database. The pattern could be a sequence of bytes, or a combination of entities, e.g., Kim and Karp [22] used patterns involving the IP protocol number, the destination port number, and the sequence of bytes. This methodology falls short at detecting variants of a given exploit, which made some researchers shift their focus towards vulnerability-based pattern matching [6,7]. Our proposed pattern matching technique is neither exploit-based nor vulnerability-based, but is more similar to the latter than to the former since our generated signatures characterize the program behavior induced by the vulnerability. Brumley et al. [6] introduced the concept of a vulnerability signature which is a representation for the set of inputs that satisfy a specified vulnerability condition. Given a new detected exploit for an unknown vulnerability, and the tuple (P, T, x, c) where P is the program, x is the exploit string, c is a vulnerability condition satisfied by x (e.g., ‘‘heap overflow at a specific line number’’), and T is

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

the execution trace of P on x. The aim is to generate a vulnerability signature that will match future malicious inputs x0 by examining them without running P. In [6], Brumley et al. proposed a formal static analysis approach that requires only a single sample exploit for creating the vulnerability signature. They presented three variants of their technique to come up with the signatures: Turing-machine, symbolic constraints or Boolean formulae, and regular expressions. These three exhibit trade-offs between detection accuracy and matching overhead: Turing-machines are the most accurate but the most difficult to match against, while regular expressions are the easiest to generate but cannot offer the same precision. A main step of their technique requires computing a program chop [19] with respect to the trace T that includes all paths leading to the vulnerability point described by condition c. Our technique is more advantageous as it does not require any information about the vulnerability, whereas their technique hinges on detailed information, namely, the vulnerability condition and location. The work in [6] is not very scalable as it uses forward symbolic execution to generate a separate signature for each program path an exploit may take. Loops are unrolled a fixed number of times and the generated signature is exponential in size to the number of program paths in the unrolled program. The work in [7] aims to reduce the signature size and generation time, making matching less expensive. The idea is based on weakest preconditions to make signatures more succinct, by determining the initial state that leads to a final state which satisfies the vulnerability condition. This is done by the use of a control-flow graph and a chop over the vulnerability, then the construction of a formula over the initial state, to indicate that its execution will necessarily cause an attack, and the signature becomes this Boolean formula which can be generated and evaluated easily. Basically, the process entails specifying the conditions necessary and sufficient to satisfy a certain vulnerability condition, then building a formula which evaluates to true when executing the point in the program preceding the vulnerability point (the point where the vulnerability is located). This is then applied recursively until the Boolean formula represents the condition(s) leading from the initial state to the vulnerability point, thus allowing evaluation of the input directly at program entry. Here also, our technique is more advantageous as it does not require any information about the vulnerability, whereas the approach in [7] does. As an extension of their previous work [24], Lee et al. [25] employed data mining techniques to capture both the behavior of intrusions and normal activities. Specifically, they developed a framework that provides implementations of algorithms for: 1. Classification: to train classifiers using ‘‘normal’’ and ‘‘abnormal’’ audit data, so that they can predict new unseen data as belonging to the normal class or the abnormal class. The classification rules are generated using RIPPER [10], which selects the unique features that identify the intrusions. These rules can be first inspected and edited by security experts, and then be incorporated into matching systems. 2. Association rules: [2]: to determine the correlation between system features to construct normal usage profiles to be incorporated into anomaly detection systems. 3. Frequent episodes: [31]: to model sequential patterns that can capture what time-based sequence of audit events are frequently occurring together. The discovered patterns could be used in both matching and anomaly based systems. The matching system’s detection rates observed in [25] averaged 80.2% for data that was part of the training set and 37.7% for unseen data. Table 2 shows that the average rates of false negatives and false positives exhibited by our solution are 0.43% and

249

1.03%, respectively. Therefore, even though a detailed comparison of our results to the ones presented in [25] would not be sensible given that they are based on two different data sets, the exhibited low rate of false alarms is an indication that our technique is clearly competitive. Finally, note that our data set involves thousands of features (see Table 3), whereas the data in [25] involves only 41 features. 7.3. Taint-based techniques In [49] Newsome and Song proposed the use of dynamic taint analysis for the automatic generation of exploit-based signatures. They implemented TaintCheck, a tool that enables the user to mark (taint) untrusted inputs to be tracked via dynamic data-flow analysis, in order to detect whether it is used to carry out an attack. TaintCheck allows the detection of attacks that cause sensitive program values to be overwritten with the attacker’s data, i.e., overwrite attacks. Following the detection of an attack, the tool provides information that could be used to automatically generate signatures to be deployed for attack filtering. Our proposed technique has the following advantages: (1) it is more general as it is not limited to overwrite attacks; (2) it does not require the user to pinpoint the inputs to be tainted; and (3) it is noted in [49] that TaintCheck slowed down the target application between 1.5 and 40 times, the overhead imposed by our implementation is much lower as it varied from 46% to 102%. In [61] Yin et al. proposed an approach for the detection and analysis of privacy-breaching malware such as spyware, keyloggers, network sniffers, stealth backdoors, and rootkits. Their tool, Panorama, is also based on dynamic taint analysis. It works by tainting sensitive information, and monitoring taint propagation over the applications, hardware, and the operating system. Taint graphs are generated which show the processes that access tainted data, how the data propagates through the system, and to which file or network connection this data is written to. The user can define policies based on the taint graphs that specify the characteristic behavior of different types of malware. Panorama can then automatically detect attacks by checking the policies against the taint graph of an unknown sample. The advantages of our proposed technique are the same as the three mentioned for TaintCheck, in addition, no policy definition is needed from the user. 7.4. Anomaly-based techniques Anomaly detection techniques [28], which are intended to enable the detection of novel attacks, are based on the assumption that attacks often induce unusual execution behavior that can be distinguished from normal behavior. Most research on host-based and application-based anomaly detection techniques involves the analysis of system call sequences. Forrest et al. [16] presented a form of anomaly detection that involves: (1) characterizing normal program behavior during a training phase by analyzing system call traces and constructing a database of contiguous subsequences of size k (usually k = 6) observed in the traces; and (2) analyzing new traces to identify subsequences not present in the database, which are assumed to indicate possible attacks. Other proposed host-based intrusion detection systems depend on models constructed from static analysis of program code [58]. Such models are prone to mimicry attacks, which could be remedied by using additional runtime information such as call stack information [15], system call arguments [47,23,8], and environment variables and active context during program execution [60]. Mutz et al. [47] applied multiple detection models to system call arguments and introduced a method of combining the anomaly scores from each model into an overall aggregate score. They also conducted a study that compares their results to those of four

250

W. Masri et al. / Information and Software Technology 56 (2014) 238–251

other anomaly-based IDS systems using the MIT Lincoln Lab data set [29]. The first is the system proposed by Forrest et al. [16]. The second [21] extends [16] by involving sets of system calls as opposed to sequences. The third system [52] uses k-nearest neighbor classification to identify outliers in the data set, and the fourth [52] uses cluster analysis to identify the outliers. The study showed that their technique performed best as it exhibited no false negatives but still a considerable number of false positives. Our solution is different than any of the aforementioned techniques as it is signature-based as opposed to anomaly-based. In regard to comparative effectiveness, recall that our average rates of false negatives and false positives are 0.43% and 1.03%, respectively. These low rates indicate that our technique is competitive, in addition, recall from Section 6.5 that if more complex profiling elements were used by our technique the rate of false alarms is expected to be even smaller.

8. Conclusions and future work This work proposes an approach to application-based intrusion detection relying on profile-based signatures. Using this type of signatures, we are able to define the conditions of software attacks, regardless of the format or content of the inducing exploit, by only referring to the set of execution elements that occur during an attack. Our solution, which is also applicable to online failure detection, aims at preventing a system from being attacked in cases when one is aware that attacks due to a certain vulnerability occurred, and the task of understanding, locating, and fixing that vulnerability is underway. The basic prelude to our approach is the ability to profile the execution of an application, i.e. collect exercised program elements, including method calls, method call pairs, basic blocks, basic block edges, and definition–use pairs. The actual construction of the needed signature is a learning process, implemented via a genetic algorithm, which generates from a training set of executions the combinations of elements that lead to unsafe executions. These combinations form the signature, which is fed to a matching system that monitors the application by way of dynamic selective instrumentation and alerts to intrusions when a match is detected. As means of evaluation, our empirical work consists of generating the needed signatures by running the genetic algorithm over varying sizes of training sets and gathering the results when these signatures are used within the intrusion detection modules. The results expected are the number of false alarms issued by the system and the overall increase in running-time due to the activation of this system. Our experiments involved seven Java applications containing 30 vulnerabilities/defects. The results showed that our approach worked very well for 26 vulnerabilities/defects (86.67%) and the overhead imposed by the system is somewhat acceptable as it varied from 46% to 102%. The exhibited average rates of false negatives and false positives were 0.43% and 1.03%, respectively. We believe that 3 out of the 4 cases in which our approach performed poorly, are preventable and can be addressed in future work. In these cases, signatures that characterize the exploits could not be found because we only relied on structural program elements to build our execution profiles. To remedy this shortcoming, we plan to use additional information in deriving signatures, such as state (values of variables) [1] and event sequence information. Also, since the performance of our approach is dependent on the quality of the training sets, we will explore the following: (a) using capture/replay tools to collect inputs from the field [57], (b) using test case generation tools to automatically create new inputs [54], and (c) leveraging test visualization tools to augment them [26]. Finally, at present we have chosen a genetic algorithm as the basic heuristic for the signature generation mechanism, we will also

investigate other techniques that can accomplish the same purpose. Other heuristics, such as simulated annealing or tabu search, are possible alternatives, as well as syntactical pattern recognition algorithms. Acknowledgments This material is based upon work supported by NSF Grant No. 0819987 and by the Lebanese National Counsel for Scientific Research. References [1] R. Abou-Assi, W. Masri, Identifying failure-correlated dependence chains, in: First International Workshop on Testing and Debugging, TeBug/ICST 2011, Berlin, March 2011. [2] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the ACM SIGMOD Conference on Management of Data, 1993, pp. 207–216. [3] Taweesup Apiwattanapong, Alessandro Orso, Mary Jean Harrold: Efficient and precise dynamic impact analysis using execute-after sequences, ICSE, 2005, pp. 432–441. [4] Appendices: . [5] P. Bodik, G. Friedman, L. Biewald, Combining visualization and statistical analysis to improve operator confidence and efficiency for failure detection and localization, in: Proceedings of Second Int’l Conf. Autonomic Computing (ICAC ’05), June 2005, pp. 89–100. [6] D. Brumley, J. Newsome, D. Song, H. Wang, S. Jha, Towards automatic generation of vulnerability-based signatures, in: Proceedings of the 2006 IEEE Symposium on Security and Privacy, Berkeley, California, USA, May 2006, pp. 2–16. [7] D. Brumley, H. Wang, S. Jha, D. Song, Creating vulnerability signatures using weakest preconditions, in: Proceedings of the 20th IEEE Computer SecurityFoundations Symposium (CSF ’07), Venice, Italy, July 2007. [8] A. Chaturvedi, S. Bhaktar, R. Sekar, Improving attack detection in host-based IDS by learning properties of system call arguments, in: Proceedings of the IEEE Symposium on Security and Privacy, 2005. [9] H. Chen, G. Jiang, C. Ungureanu, K. Yoshihira, Online tracking of component interactions for failure detection and localization in distributed systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C 37 (4) (2007) 644–651. [10] W.W. Cohen, Fast effective rule induction, in: Machine Learning: the 12th International Conference, Lake Taho, CA, Morgan Kaufmann, 1995. [11] D.E. Denning, A lattice model of secure information flow, Communications of the ACM 19 (5) (May 1976) 236–242. [12] Hyunsook Do, Sebastian Elbaum, Gregg Rothermel, Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact, Empirical Software Engineering 10 (4) (2005) 405–435. [13] M. El-Ghali, W. Masri, Intrusion detection using signatures extracted from execution profiles, in: 5th International Workshop on Software Engineering for Secure Systems, SESS, Vancouver, Canada, 2009. [14] M.D. Ernst, J. Cockrell, W.G. Griswold, D. Notkin, Dynamically discovering likely program invariants to sup-port program evolution, IEEE Transactions on Software Engineering 27 (2) (2001) 99–123. [15] H. Feng, O. Kolesnikov, P. Fogla, W. Lee, W. Gong, Anomaly detection using call stack information, in: Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, USA, May 2003, p. 62. [16] S. Forrest, S. Hofmeyr, A. Somayaji, T. Longstaff, A sense of self for Unix processes, in: Proceedings of the 1996 IEEE Symposium on Security and Privacy, Los Alamitos, CA, USA, May 1996, pp. 120–128. [17] J. Giffin, S. Jha, B. Miller, Automated discovery of mimicry attacks, in: Proceedings of the 9th International Symposium on Recent Advances in Intrusion Detection (RAID ’06), Hamburg, Germany, September 2006, pp. 41– 60. [18] Todd L. Graves, Mary Jean Harrold, Jung-Min Kim, Adam A. Porter, Gregg Rothermel: an empirical study of regression test selection techniques, ACM Transactions on Software Engineering and Methodology 10 (2) (2001) 184– 208. [19] D. Jackson, E. Rollins, Chopping: a generalization of slicing, in: Proceedings of the Second ACM SIGSOFT Symposium on the Foundations of, Software Engineering, 1994. [20] J. Jones, M.J. Harrold, J. Stasko, Visualization of test information to assist fault localization, in: Proceedings of the 24th International Conference on Software Engineering, Orlando, Florida, USA, May 2002, pp. 467–477. [21] D.-K. Kang, D. Fuller, V. Honavar, Learning classifiers for misuse and anomaly detection using a bag of system calls representation, in: Proceedings of 6th IEEE Systems Man and Cybernetics Information Assurance Workshop (IAW), 2005. [22] H. Kim, B. Karp, Autograph: toward automated, distributed worm signature detection, in: Proceedings of the 13th USENIX Security Symposium, CA, 2004, pp. 271–286.

W. Masri et al. / Information and Software Technology 56 (2014) 238–251 [23] C. Kruegel, D. Mutz, F. Valeur, G. Vigna, On the detection of anomalous system call arguments, in: Proceedings of the 8th European Symposium on Research in Computer Security, Gjovik, Norway, October 2003, pp. 326–343. [24] W. Lee, S.J. Stolfo, Data mining approaches for intrusion detection, in: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January 1998. [25] W. Lee, S.J. Stolfo, K.W. Mok, A data mining framework for building intrusion detection models, in: Proceedings of the 1999 IEEE Symposium on Security and Privacy, IEEE Computer Society, Los Alamitos, CA, 1999, pp. 120–132. [26] David Leon, Andy Podgurski, Lee J. White, Multivariate visualization in observation-based testing, in: Proceedings of the 2000 International Conference on Software Engineering, pages 116–125, New York, NY, USA, 2000, ACM. [27] Z. Li, M. Harman, R.M. Hierons, Search algorithms for regression test case prioritization, IEEE Transactions on Software Engineering 33 (4) (2007) 225– 237. [28] G. Liepins, H.S. Vaccaro, Anomaly detection: purpose and framework, 12th National Computer Security Conference, Baltimore, 1989, pp. 495–504. [29] R. Lippmann, J.W. Haines, D.J. Fried, J. Korba, K. Das, Analysis and results of the 1999 DARPA off-line intrusion detection evaluation, in: Proceedings of Recent Advances in Intrusion Detection, LNCS, Springer, Toulouse, France, 2000, 162– 18. [30] D. Lorenzoli, L. Mariani, M. Pezze, Towards self-protecting enterprise applications, in: Proceedings of the 18th IEEE International Symposium on Software Reliability (ISSRE ’07), Trollhättan, Sweden, November 2007. [31] H. Mannila, H. Toivonen, A.I. Verkamo, Discovering frequent episodes in sequences, in: Proceedings of the 1st International Conference on Knowledge Discovery in Databases and Data Mining, Montreal, Canada, August 1995. [32] M. Martin, B. Livshits, M. Lam, Finding application errors and security flaws using PQL: a Program Query Language, 20th ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, CA, 2005, pp. 365–383. [33] Steve McConnell, Code Complete, second ed., Microsoft Press, 2004. [34] W. Masri, Fault localization based on information flow coverage, Software Testing, Verification and Reliability. 20 (2) (2010) 121–147. [35] W. Masri, Exploiting the empirical characteristics of program dependences for improved forward computation of dynamic slice, Empirical Software Engineering (ESE) 13 (2008) 369–399. [36] W. Masri, R. Abou-Assi, Cleansing test suites from coincidental correctness to enhance fault-localization, in: Third International Conference on Software Testing, Verification and Validation, ICST 2010, Paris, April, 2010. [37] W. Masri, R. Abou-Assi, M. El-Ghali, N. Fatairi, An empirical study of the factors that reduce the effectiveness of coverage-based fault localization, International Workshop on Defects in Large Software Systems, DEFECTS, Chicago, IL, 2009. [38] W. Masri, R. Abou Assi, F. Zaraket, N. Fatairi, Enhancing Fault Localization via Multivariate Visualization, Regression/ICST 2012, Montreal, Canada, April 2012. [39] W. Masri, M. El-Ghali, Test case filtering and prioritization based on coverage of combinations of program elements, in: Seventh International Workshop on Dynamic Analysis, WODA, Chicago, IL, 2009. [40] W. Masri, H. Halabi, An algorithm for capturing variables dependences in test suites, Journal of Systems and Software (JSS) 84 (7) (2011) 1171–1190. [41] W. Masri, A. Podgurski, Application-based anomaly intrusion detection with dynamic information flow analysis, Computers & Security 27 (2008) 176–187. [42] W. Masri, A. Podgurski, Algorithms and tool support for dynamic information flow analysis, Information and Software Technology 51 (February) (2009) 385–404.

251

[43] W. Masri, A. Podgurski, An empirical study of the relationship between information flow and program dependence, Fourth International Workshop on Dynamic Analysis (WODA 2006), Shanghai, China, May 2006. [44] W. Masri, A. Podgurski, Measuring the Strength of Information Flows in Programs, ACM Transactions on Software Engineering and Methodology (TOSEM) 19 (2) (2009). Article 5, October. [45] W. Masri, A. Podgurski, D. Leon, An empirical study of test case filtering techniques based on exercising information flows, IEEE Transactions on Software Engineering 33 (7) (2007) 454. [46] Frederic P. Miller, Agnes F. Vandome, John McBrewster, Abandonware: Computer Software, Copyright, Office Suite, Public Domain, List of Commercial Video Games Released as Freeware, Alpha Press, Orphan Works, 2009. [47] Darren Mutz, Fredrik Valeur, Giovanni Vigna, Christopher Kruegel, Anomalous system call detection, ACM Transactions on Information and System Security 9 (1) (2006) 61–93. [48] James Newsome, David Brumley, Dawn Song, Vulnerability-specific execution filtering for exploit prevention on commodity software, in: Proceedings of the 13th Symposium on Network and Distributed System Security (NDSS ‘06), 2006. [49] James Newsome, Dawn Song, Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software, in: Proceedings of the 12th Annual Network and Distributed System Security Symposium (NDSS ‘05), 2005. [50] Amjad Nusayr, Jonathan Cook, Using AOP for detailed runtime monitoring instrumentation, WODA (2009) 8–14. [51] C. Parnin, A. Orso, Are automated debugging techniques actually helping programmers? ISSTA 2011, pp. 199–209. [52] L. Portnoy, E. Eskin, S. Stolfo, Intrusion detection with unlabeled data using clustering, in: ACM CSS Workshop on Data Mining Applied to Security (DMSA), 2001. [53] A. Sabelfeld, A.C. Myers, Language-based information-flow security, IEEE Journal on Selected Areas in Communications 21 (1) (2003). [54] K. Sen, D. Marinov, G. Agha, Cute: a concolic unit testing engine for C, in: 10th European, Software Engineering Conference, 2005. [55] Forrest Shull, Vic Basili, Barry Boehm, A. Winsor Brown, Patricia Costa, Mikael Lindvall, Dan Port, Ioana Rus, Roseanne Tesoriero, Marvin Zelkowitz, What we have learned about fighting defects, in: Proceedings of the 8th International Symposium on Software Metrics (METRICS ‘02), IEEE Computer Society, Washington, DC, USA, 2002. [56] A. Singh, A.K. Gupta, A hybrid heuristic for the maximum clique problem, Journal of Heuristics 12 (1–2) (2006) 5–22. [57] S. Steven, P. Chandra, B. Fleck, A. Podgurski, jRapture: a capture/replay tool for observation-based testing, 2000 International Symposium on Software Testing and Analysis, Portland, Oregon, August 2000, pp.158–167. [58] D. Wagner, D. Dean, Intrusion detection via static analysis, in: Proceedings of the 2001 IEEE Symposium on Security and Privacy, Oakland, CA, USA, May 2001, pp. 156–168. [59] D. Wagner, P. Soto, Mimicry attacks on host-based intrusion detection systems, in: Proceedings of the 9th ACM Conference on Computer and Communications Security, Washington, DC, USA, November 2002, pp. 255– 264. [60] H. Xu, W. Du, S. Chapin, Context sensitive anomaly monitoring of process control flow to detect mimicry attacks and impossible paths, in: Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID ’04), Sophia Antipolis, France, September 2004, pp. 21–38. [61] H. Yin, D. Song, M. Egele, C. Kruegel, E. Kirda, Panorama: capturing systemwide information flow for malware detection and analysis, The 14th ACM conference on Computer and Communications Security (CCS ’07), VA, November 2007, pp. 116–127.