Sunday, 24 June 2012

In SAST the issue is 'Trace Connection', not 'Scan Size'

One of the 'wrong problem to be solving' paradoxes in the SAST world is the focus vendors put on making their engines able to 'scan large code bases'. It is not a coincidence that the key question I got from the SAST engine guys about the Real-time Vulnerability Creation Feedback inside VisualStudio (with Greens and Reds) was 'Hmm.... interesting, but will it scale to large applications?'

I actually blame the SAST clients for this, since they are the ones asking (and paying for) the wrong question:

 "How can you 'vendor xyz' scan my million lines of code application"

Instead they should be asking:

 "When you scan my code, can you connect the traces?"

'Connecting the traces' means that you are able to scan parts of the application separately and then connect the resulting traces at a later stage.


And this is not just needed for scalability purposes: large code-bases will have tons of air-gaps created by interface-driven/WebServices/Message-Queues architectures, where the 'formula' that connects these layers usually lives in Xml/Config files, Code Attributes, Live Binding, Reflection mappings, etc...
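
To make this more concrete, here is a minimal sketch of the kind of air-gap in question (the MessageDispatcher class and handlers.config file are made up for illustration, they are not code from any real application): the mapping between a message and the class that handles it lives in an Xml file and is resolved via Reflection, so a scanner that only follows method calls loses the trace at the final Invoke.

```csharp
// handlers.config (the 'formula' that connects the layers lives here, not in the code):
//   <handlers>
//     <handler message="TransferFunds" type="Bank.Services.TransferHandler" />
//   </handlers>

using System;
using System.Linq;
using System.Xml.Linq;

public class MessageDispatcher
{
    public object Dispatch(string message, string taintedPayload)
    {
        // Resolve the handler type from the config file (Xml/Config mapping)
        var typeName = XDocument.Load("handlers.config")
                                .Descendants("handler")
                                .First(h => (string)h.Attribute("message") == message)
                                .Attribute("type").Value;

        // Live binding via Reflection: an engine that only follows method calls
        // has no idea which 'Handle' method is actually being invoked here
        var handlerType = Type.GetType(typeName);
        var handler     = Activator.CreateInstance(handlerType);

        // Tainted data crosses the air-gap at this point
        return handlerType.GetMethod("Handle")
                          .Invoke(handler, new object[] { taintedPayload });
    }
}
```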

So without understanding these mappings, a scan of a large code base will have massive gaps in its coverage of that code.

In fact, one of the ways the commercial scanning engines are able to scan big code-bases is by becoming more 'aggressive' in how they handle traces and in what type of analysis they do (usually referred to as 'dropping traces on the floor'). And in the cases where the scanning engines do find large sets of findings or traces (let's say 10,000 findings with 50+ entries each), their GUIs are absolutely not able to handle them (try to load 100,000 or 1M traces in those GUIs :) ). Ironically, if they are not able to find that amount of traces, they probably don't have enough code coverage :). See If you not blowing up the database, you're not testing the whole app for a similar DAST analogy.

Being able to scan in a modular way is very important from both a scalability and an accuracy point of view.

For example, the approach that I took when creating O2's SAST engine was to create files that contain all the relevant code, and then only scan those files; see O2 .NET SAST Engine: MethodStream and CodeStreams for a WebService Method for what this looks like.
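
As a rough illustration of the idea (the ICodeModel helper interface below is an assumption made for this sketch, not the actual O2 MethodStream implementation), the core of it is: start from an entry point, inline the source of every method it transitively calls into one self-contained file, and then scan only that file.

```csharp
using System.Collections.Generic;
using System.Text;

// Assumed abstraction over the code model: how a method's source and callees are
// resolved is engine-specific (in O2 this is done via its .NET AST APIs)
public interface ICodeModel
{
    string GetSource(string methodFqn);
    IEnumerable<string> GetCalledMethods(string methodFqn);
}

public class MethodStreamBuilder
{
    public string BuildStream(ICodeModel code, string entryPointFqn)
    {
        var visited = new HashSet<string>();
        var stream  = new StringBuilder();
        Collect(code, entryPointFqn, visited, stream);
        return stream.ToString();          // one file with all the relevant code, ready to be scanned on its own
    }

    void Collect(ICodeModel code, string methodFqn, HashSet<string> visited, StringBuilder stream)
    {
        if (!visited.Add(methodFqn))       // already inlined, avoid cycles
            return;
        stream.AppendLine(code.GetSource(methodFqn));
        foreach (var callee in code.GetCalledMethods(methodFqn))
            Collect(code, callee, visited, stream);
    }
}
```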

To see a practical example of what I mean by 'Trace Connection' or 'Trace Joining', look at this screenshot taken from this video: O2 Video - Demo Script - HacmeBank Full PoC


In this trace you will see 2 very important 'trace connections':

  • Url to Entry point - the URL of the vulnerability was mapped to the method that contains the Source of Tainted data
  • WebServices invocation - A webservice call that was mapped from the Invoke (on the Web Tier) to the WebMethod (on the WebServices layer).

This is what we need to be doing, since all real-world applications have 'air gaps' that need to be connected. To scale to large code bases, we analyse each module separately, create source/sink rules for them (i.e. their inputs and outputs) and then connect/join those traces (where a Source of a module is a Sink to its users).
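
Here is a sketch of what that join could look like (the types and names are made up; the real connection logic depends on the engine): each module's scan produces traces whose Source and Sink are identified by fully-qualified names, and a mapping rule (for example, the Web Tier's Invoke to the WebServices layer's WebMethod) provides the key that connects a Sink on one layer to a Source on the next.

```csharp
using System.Collections.Generic;
using System.Linq;

public class Trace
{
    public string Source;                  // e.g. "WebTier.Login.btnSubmit_Click(object,EventArgs)"
    public string Sink;                    // e.g. "WebTier.Proxy.AccountService.Invoke(string)"
    public List<string> Steps = new List<string>();
}

public static class TraceJoiner
{
    // sinkToSourceMap is the 'formula' that bridges the air-gap, e.g.
    //   "WebTier.Proxy.AccountService.Invoke(string)" -> "WebServices.AccountService.GetAccounts(string)"
    public static IEnumerable<Trace> Join(IEnumerable<Trace> upstreamTraces,
                                          IEnumerable<Trace> downstreamTraces,
                                          IDictionary<string, string> sinkToSourceMap)
    {
        var bySource = downstreamTraces.ToLookup(t => t.Source);

        foreach (var up in upstreamTraces)
        {
            string downstreamSource;
            if (!sinkToSourceMap.TryGetValue(up.Sink, out downstreamSource))
                continue;                                  // no mapping: the trace stops at this layer

            foreach (var down in bySource[downstreamSource])
                yield return new Trace                     // the connected, end-to-end trace
                {
                    Source = up.Source,
                    Sink   = down.Sink,
                    Steps  = up.Steps.Concat(down.Steps).ToList()
                };
        }
    }
}
```

Note that in this model the long end-to-end traces only ever exist as the concatenation of much smaller per-module traces, which is also what keeps the scans (and what the GUI has to display) manageable.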

Unfortunately (for the current SAST vendors who are still trying to create the 'one click scan engine'), this means that we will need to be able to customise and adapt the scanning engine/rules. We will also need to create specialized tools/scripts per framework, which in essence describe its behaviour.
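
As a sketch of what 'describing a framework's behaviour' could look like as data (a made-up rule format, not any vendor's actual rule schema, and with illustrative method names), the rules for each layer would declare its taint sources, its taint sinks, and the mappings that bridge its air-gaps:

```csharp
using System.Collections.Generic;

// A framework's behaviour described as data: where tainted data enters a layer,
// where it leaves it, and how its sinks map to the next layer's sources
public class FrameworkRules
{
    public string Framework;
    public List<string> TaintSources = new List<string>();
    public List<string> TaintSinks   = new List<string>();
    public Dictionary<string, string> LayerMappings = new Dictionary<string, string>();
}

public static class ExampleRules
{
    // Example rules for the Web Tier of an ASP.NET + WebServices application
    // (the WebTier/WebServices names below are illustrative, not from a real rule pack)
    public static readonly FrameworkRules WebTier = new FrameworkRules
    {
        Framework    = "ASP.NET WebForms",
        TaintSources = { "System.Web.HttpRequest.get_QueryString()",
                         "System.Web.UI.WebControls.TextBox.get_Text()" },
        TaintSinks   = { "WebTier.Proxy.AccountService.Invoke(string)" },
        LayerMappings =
        {
            // the 'formula' that connects the Web Tier Invoke to the WebMethod
            { "WebTier.Proxy.AccountService.Invoke(string)",
              "WebServices.AccountService.GetAccounts(string)" }
        }
    };
}
```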

Note that this is not easy to do: we will need very powerful APIs that expose the SAST engine's capabilities/rules/data. For example, in the past I had to build entire O2 modules just to handle this type of activity.


Like the video below shows, this is like a move from Billions to Trillions (and you can't build a bridge to it). I also like the concept that 'Nature uses Layered Complexity', which is exactly what we need to do in SAST (i.e. we need rules for every layer, and to scan each layer's behaviour separately).



Trillions from MAYAnMAYA on Vimeo.

3 comments:

Dinis Cruz said...

On the topic of connecting traces, here is a cool post from the WhiteHat guys on what they are doing to connect traces: Keyed Collections and Propagation, PI.

Unknown said...

Your post is a bit confusing to me. It seems to start by talking about incremental dataflow analysis (or how to make your DF analysis work when the application is partial), and then about "connecting the traces", which is really about understanding the frameworks (and maybe more generally having a better understanding of the programs).

Just to get back to my question, I was really interested to know if you simply hooked CAT.NET and did a new pass each time the code is changed, or if you did some incremental analysis. The latter is a very interesting area, but I doubt that CAT.NET would support that by default; I was just curious.
Thing is, your concept of real-time defect feedback is interesting, but this is very challenging if you want to go further than grepping the ASTs.

I'm not gonna touch the "scanning large code bases" thing too much; I disagree with you. Applications are large, and if your tool takes a week to scan an app, it totally defeats the purpose of doing static analysis...

Romain
(geez, I need to get used to being called a "sast engine guy" :)

Dinis Cruz said...

'Connecting the Traces' is key to understanding how the frameworks behave. For example, that is what Eric is doing at http://ericsheridan.blogspot.co.uk/2012/06/keyed-collections-and-propagation-pi.html , but note that that is ONLY a very simple scenario of trace connection (and in fact, as I ask in the comments there, I doubt how much coverage of that setter/getter connection he is able to achieve).

Like I mentioned on Twitter, this version of Cat.NET doesn't support incremental analysis, BUT since I'm running it in memory (i.e. I'm not starting the process for every scan), I'm already able to optimize its scan process/speed. Having looked at its code, I know I can create incremental analysis with it. I just need to be able to publish those changes under Apache 2.0.

I have no idea where you got the idea of Grepping the ASTs, since I'm doing a LOT more than that. For example, in the HacmeBank demo I'm using the Fully Qualified names to connect the traces (which you don't get from the AST).

That said, ASTs can be very useful. For example, the type of SAST rule discussed here can probably come mostly from an AST (connected to SAST results): http://www.reddit.com/r/websec/comments/vg466/interesting_potential_problems_with_net_ispostback

You are also disagreeing with me on the wrong thing :). I want real-time compilation of LARGE code bases, and I want that in seconds (not in hours, days or weeks). That said, it is ok to have a one-off cost of some time (days, weeks) to create that environment. But when exposing developers to SAST, we need real-time (or on-save) results.

Like you mentioned, the really interesting topic here is 'Real-time Vulnerability Feedback in the IDE', and that is my benchmark.

First we need to do that for a single file or a smallish solution (which I have), and then we scale up in a way that doesn't degrade the user experience.

This is why I posted the Trillions video: the SAST world (me included) is trying to achieve the same result (the 'climb the mountain' analogy), and it just happens that I'm taking a different path/strategy :)