Sunday, 10 January 2010

The Need for Standards to evaluate Static Analysis tools

In January 2010, the security static analysis space (also called SAST, for Static Application Security Testing; you can download Gartner's Magic Quadrant report from Fortify's website) has a number of security-focused commercial products (and services) for analyzing an application's source code (or binaries): Fortify SCA, IBM with Source Edition (formerly OunceLabs) and Developer Edition, Armorize CodeSecure, CodeScan, Veracode Security Review, Microsoft's CAT.NET, Coverity Static Analysis, Klocwork TruePath, Parasoft Application Security Solution and Art-of-Defence HyperSource. (I didn't include any open source tools because I am not aware of any (actively used) that can perform security-focused taint-flow analysis.)

The problem is that we don't have any standards, methodology or test cases to objectively and pragmatically evaluate, compare and rate these different tools!

That creates a big problem for buyers, because they are not able to make knowledgeable decisions about which is the best tool for the target applications.

For example, one of the most fundamental issues we have when looking at the results from this type of tool is the lack of visibility into what the tool has (or has not) done. Namely, we need to know what the tools know and what the tools don't know. That's the only way we can be assured that the tool(s) actually worked and that the results we have (or don't have) are meaningful.

Key Concept: When we review a tool's report, it's as important to know its blind spots as it is to know what it found. If you have a scan report of an application with NO (i.e. zero) High or Critical issues, is it because there were NO vulnerabilities, or because the scanner had NO VISIBILITY of what's going on in the target application? (The latter is very common if the application uses a framework like Struts or Spring MVC.)

In order to know/predict how effective one of these tools can be, we need standard ways to list its capabilities and to compare/map them to what the target application(s) actually contain (namely the languages and frameworks used).
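
To make the idea of mapping capabilities to a target application concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the ToolProfile and AppProfile records, the blind_spots function and the sample data are hypothetical and do not describe any real product or any existing standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolProfile:
    """Hypothetical, self-declared capability list for a static analysis tool."""
    name: str
    languages: frozenset
    frameworks: frozenset


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of what the target application is built with."""
    name: str
    languages: frozenset
    frameworks: frozenset


def blind_spots(tool, app):
    """Return what the application uses that the tool does not claim to understand."""
    return {
        "languages": sorted(app.languages - tool.languages),
        "frameworks": sorted(app.frameworks - tool.frameworks),
    }


# Made-up sample data, purely for illustration.
xyz_tool = ToolProfile(
    name="XYZ",
    languages=frozenset({"Java", "C#"}),
    frameworks=frozenset({"Servlets"}),
)
target_app = AppProfile(
    name="internal-portal",
    languages=frozenset({"Java"}),
    frameworks=frozenset({"Struts", "Spring MVC"}),
)

print(blind_spots(xyz_tool, target_app))
```

With a shared vocabulary for such profiles, a report showing zero findings could be read next to the list of things the tool could not see in the first place.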

As an example, if a tool has problems following interfaces (i.e. it doesn't follow calls through to the interface implementations), then even before we scan the code we should be able to say, "Well... XYZ tool is going to have a problem with this application."
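
A toy model of the interface problem, as a sketch only: the call graph, the method names and the reachable function below are all hypothetical, and real tools build far richer models. It only shows how a sink hidden behind an interface implementation disappears when the analysis stops at the interface method.

```python
from collections import deque

# Hypothetical toy program, written as "caller -> callees".
# "Repository.save" is an interface method; "SqlRepository.save" implements it.
CALLS = {
    "Controller.handle": ["Repository.save"],      # call made through the interface
    "SqlRepository.save": ["Statement.execute"],   # the sink lives in the implementation
}
IMPLEMENTATIONS = {"Repository.save": ["SqlRepository.save"]}


def reachable(start, follow_interfaces):
    """Breadth-first walk of the call graph, optionally resolving interface calls."""
    seen, queue = {start}, deque([start])
    while queue:
        method = queue.popleft()
        callees = list(CALLS.get(method, []))
        if follow_interfaces:
            callees += IMPLEMENTATIONS.get(method, [])
        for callee in callees:
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def finds_sink(follow_interfaces):
    """True if the walk from the entry point reaches the dangerous sink."""
    return "Statement.execute" in reachable("Controller.handle", follow_interfaces)


print("without interface resolution:", finds_sink(False))
print("with interface resolution:", finds_sink(True))
```

A standard test case along these lines would let a buyer check this capability once, instead of discovering the gap in the middle of an engagement.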

This type of visibility will not only allow the buyers to make much more informed decisions,  but will also allow them to effectively use these tools in their organization (and get higher ROI).

Ultimately, I think that the short-term solution (until the industry and technology mature) is that most large companies will have to buy multiple tools and services! The reason is simple: given the variety of technologies and programming practices that they have internally, only by using multiple tools will they be able to get the coverage and quality of results that they need. (Note that these tools would have to be driven by knowledgeable security teams or 'security savvy developers'.)

Note 1: WASC has done a good job with WASSEC (Web Application Security Scanner Evaluation Criteria). The problem is that, as far as I am aware, there have been no public and peer-reviewed test cases and ratings, which means that anybody wanting to use WASSEC will have to pay somebody (internally or externally) to perform the tool comparison.

Note 2: NIST is also trying to map the tools' performance via its SATE efforts, but my understanding is that vendor participation is not as good as expected and the results are not fully published. (They do seem to have good test cases, which should be included in a 'static analysis test-cases'.)


Andrew Petukhov said...

Hi, Dinis.

Indeed, this is an interesting and complicated matter. Let me share some thoughts on the topic.
The ultimate goal of evaluation would be a vector of features of a SA tool. Each component of the vector measures performance of certain feature during the evaluation.
Having such an ideal goal in mind let's consider one major obstacle to reaching it.
Some tasks in SA are operator-driven. More formally, Feature Performance = Function(web app under analysis, operator skills). Let us say that during an evaluation of a tool CoolestTOOL by some company XYZ, the performance of a feature F was measured as N. How can this result be reproduced or verified without borrowing the tester :)?
Or does it mean that we will have to withdraw a requirement such as "produces repeatable results" from our evaluation methodology?

One possible solution for the time being is to start measuring 'objective' features of SA tools, i.e. features that do not require any customization.
What can be done is as follows:
- make an assumption that the main internal task performed by SA tools during their workflow is the calculation of dependencies (and even slices);
- say that the quality (i.e. precision and completeness) of SA analysis is directly proportional to the quality of dependency analysis;
- develop test cases for SA tools that make it possible to evaluate the performance of a dependency analyzer;
- prove that those test cases are correct and sound (the most difficult and interesting part herein); indeed, this means that we need to develop and justify a methodology for creating such test cases.
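
The steps above could be sketched as a small scoring routine, assuming a test case ships with a hand-verified list of direct data dependencies. The score_dependencies function, the four-line sample program in the comments and the "reported" set are all made up for illustration; this is not an established benchmark format.

```python
def score_dependencies(reported, expected):
    """Return (precision, completeness) of reported dependency pairs.

    Each pair is (dependent, depends_on). Precision is the share of reported
    pairs that are correct; completeness is the share of expected pairs found.
    """
    reported, expected = set(reported), set(expected)
    correct = reported & expected
    precision = len(correct) / len(reported) if reported else 1.0
    completeness = len(correct) / len(expected) if expected else 1.0
    return precision, completeness


# Hypothetical test case. Program under analysis:
#   a = read()
#   b = a + 1
#   c = 5
#   d = b * c
# Hand-verified direct data dependencies:
expected = {("b", "a"), ("d", "b"), ("d", "c")}

# Made-up output of an imaginary analyzer: two correct pairs, two spurious
# ones, and one missed dependency ("d" on "c").
reported = {("b", "a"), ("d", "b"), ("c", "a"), ("c", "b")}

precision, completeness = score_dependencies(reported, expected)
print(f"precision={precision:.2f} completeness={completeness:.2f}")
```

The hard part named in the last bullet stays untouched by this sketch: justifying that the expected set is itself correct and sound.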

Here is my view of the problem. I hope that I was not boring :)

romain said...

Hey Dinis, there is much to say about your blog post.

1. Unfortunately, you list only a few types of SAST. Many tools don't implement taint analysis -- if you go into the Ada/C/C++ world, you won't see much taint-based analysis, but rather other technologies such as symbolic execution, abstract interpretation, etc.
A list of SAST can be found on the NIST SAMATE website:

2. As said on twitter, concerning the WASSEC, I don't believe it's important to have public evaluation of commercial/open-source tools.
Also, WASSEC lists some vulnerabilities that a tool should look for, but we don't provide test cases, so it's nearly impossible to claim that a tool effectively tests for a given problem. E.g., consider the difference between two tools:

Tool A: only tests XSS with a few payloads and does regexp matching of the rendered HTML
Tool B: a smarter engine that automagically crafts attacks and looks at the resulting HTML with a JS engine (or similar, which leads to fewer FPs).

Depending on who you are and what you want, you might very well say that those two tools have the same support for XSS...
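
The gap between those two styles of checking can be sketched in a few lines. This is an illustration under loud assumptions: tool_a_flags and tool_b_flags are hypothetical stand-ins, not the logic of any real scanner, and Tool B here uses Python's standard html.parser instead of the JS engine mentioned above, which is a much weaker substitute.

```python
import re
from html import escape
from html.parser import HTMLParser

PAYLOAD = "<script>alert(1)</script>"


def tool_a_flags(rendered_html):
    """Tool A style: regexp match for the payload marker anywhere in the response."""
    return re.search(r"alert\(1\)", rendered_html) is not None


class _ScriptFinder(HTMLParser):
    """Records whether the marker ends up inside a real <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.executable = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script and "alert(1)" in data:
            self.executable = True


def tool_b_flags(rendered_html):
    """Tool B style (simplified): parse the page and look at its structure."""
    finder = _ScriptFinder()
    finder.feed(rendered_html)
    finder.close()
    return finder.executable


# Two made-up responses: one escapes the reflected input, one does not.
safe_page = "<p>You searched for: " + escape(PAYLOAD) + "</p>"
vulnerable_page = "<p>You searched for: " + PAYLOAD + "</p>"

for name, page in (("safe", safe_page), ("vulnerable", vulnerable_page)):
    print(name, "A:", tool_a_flags(page), "B:", tool_b_flags(page))
```

Both checks would earn the same "tests for reflected XSS" tick in a feature list, yet only the regexp version reports the correctly escaped page as vulnerable.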

Moreover, tools are changing so quickly that an evaluation would only be accurate at the time you make it.

3. NIST SATE is literally an exposition. NIST chooses test cases (real open-source programs that cover different types of functionality and technologies) and asks tool makers to run their SAST on those programs. The goal isn't to compare the tools and claim that one is better than another for a given type of technology, but to see how tools (in general) perform, how many types of weaknesses the tools find, and what the overlap of tool findings is (which resulted in a very small number of findings).
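
As a rough sketch of what measuring the overlap of tool findings can look like: the overlap_report function, the (file, line, weakness) key and the sample findings below are invented for illustration. They are not SATE data and not the method SATE used.

```python
from itertools import combinations


def overlap_report(findings_by_tool):
    """Summarise how much the findings of several tools overlap.

    findings_by_tool maps a tool name to a set of (file, line, weakness) tuples.
    Returns (found_by_any, found_by_all, pairwise_overlap_counts).
    """
    finding_sets = list(findings_by_tool.values())
    found_by_any = set().union(*finding_sets)
    found_by_all = set.intersection(*finding_sets)
    pairwise = {
        (a, b): len(findings_by_tool[a] & findings_by_tool[b])
        for a, b in combinations(sorted(findings_by_tool), 2)
    }
    return found_by_any, found_by_all, pairwise


# Made-up findings from three imaginary tools.
findings = {
    "ToolA": {("login.c", 42, "CWE-89"), ("util.c", 7, "CWE-120")},
    "ToolB": {("login.c", 42, "CWE-89"), ("net.c", 88, "CWE-78")},
    "ToolC": {("login.c", 42, "CWE-89"), ("net.c", 88, "CWE-78"), ("ui.c", 3, "CWE-79")},
}

found_by_any, found_by_all, pairwise = overlap_report(findings)
print(len(found_by_any), "distinct findings,", len(found_by_all), "reported by every tool")
print(pairwise)
```

In practice the difficult step is deciding when two reports describe the same weakness, since tools rarely agree on the exact line or category; the exact-match key used here sidesteps that entirely.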

More generally, as Andrew said, a SAST isn't only an analysis engine that finds weaknesses in a program; it's a suite of functionalities:
- support technologies
- allows users to develop custom checks (or custom rules)
- displays the weaknesses to the user (allows ranking/pruning and explains the problem) and offers reporting capabilities
Ultimately, every one of those elements is important and needs to be tested, but again, the importance of each depends on who you are and how you want to use the SAST (from a simple compliance type of scan to exhaustive security testing).

Just to tell you, NIST SAMATE (organizers of SATE) have been thinking a lot about those problems and there is no easy solution for evaluating SAST... But the last SATE report explains some of the problems we (I was in the SAMATE team at the time) faced:

diniscruz said...

Hi, Andrew and Romain, I just replied to your comments on

dre said...

The commercial SAST vendor, Checkmarx, also just donated to the OWASP Foundation the use of their free SaaS offering (they also have a standalone product).

I'm also curious as to what you guys think of AppCodeScan, Graudit, OWASP Code Crawler, CAT.NET 2.0 Beta, etc. These are free or open-source secure code review tools, correct? Perhaps only CAT.NET does taint tracking...