The architecture of SAST tools: An explainer for developers

More developers will have to fix security issues in the age of shifting left. Here, we break down how SAST tools can help them find and address vulnerabilities.

A schematic diagram depicting the steps an SAST tool takes to scan the source code of an SQL application under an SQL injection attack. The first step is tokenizing the source code, the second is abstracting the source code, the third conducting semantic analysis, the fourth conducting taint analysis, and the last generating a security alert about the SQL injection vulnerability.
|
| 13 minutes

In today’s age of shifting left—an approach to coding that integrates security checks earlier into the software development lifecycle (SDLC)—developers are expected to be proficient at using security tools. This additional responsibility can be overwhelming for developers who don’t specialize in security. The main issue: on top of their normal responsibilities, developers have to sift through many false positive alerts to find and address the real, critical vulnerabilities.

But shifting left isn’t going anywhere. Its benefits have been proven. So, what can developers do to improve their security experience? They can start by understanding how different security tooling works, the latest advancements, and why they matter. By understanding the inner workings of a static application security testing (SAST) tool, developers can better interpret its results and fix vulnerable code, feel empowered to contribute to security discussions and decisions, and improve their relationship with security teams.

In this post, we’ll cover what our security experts, Sylwia Budzynska, Keith Hoodlet, and Nick Liffen, have written about SAST tools—from what they are and how they work—and break down why they’re important to developers who are coding in the age of security-first development.

Why developers and security experts use SAST tools

GitHub Security Researcher Sylwia Budzynska wrote a post about common uses of SAST tools. Here’s a quick recap.

Developers and security experts rely on SAST tools to:

  • Automate source code scanning to prevent vulnerabilities and catch them earlier in the development pipeline.
  • Expand vulnerability detection. Through a technique called variant analysis, SAST tools can find new vulnerabilities by detecting variants of a known vulnerability in different parts of the code base.

  • Assist with manual code reviews. In CodeQL, GitHub’s SAST tool, your code is treated and analyzed as data. This allows you to execute queries against the database to retrieve the data you want from your code, like patterns that highlight potential vulnerabilities. You can run standard CodeQL queries written by GitHub researchers and community contributors, or write your own to conduct a custom analysis.

Screenshot of a security alert on GitHub.com that describes a new CVE, CVE-2023-35947, found with AI-generated models and variant analysis.

GitHub’s security team used a combination of AI-generated models and variant analysis to discover a new vulnerability. Here’s how.

For comprehensive coverage, organizations often use SAST in tandem with other security testing, including:

  • Software composition analysis (SCA), which identifies vulnerabilities in a codebase that stem from a third-party dependency. An SCA tool like Dependabot scans the origins of the third-party code for security threats and licensing requirements. With this information, you can update your codebase by addressing vulnerabilities, attributing credit, or complying with the open source license accordingly. SCA tools can be used at any point in the SDLC. \
  • Dynamic application security testing (DAST) finds security vulnerabilities in an application once it’s running.
  • Interactive application security testing (IAST) combines SAST and DAST to identify vulnerabilities that might be missed by either one alone.

The pros and cons of SAST

Let’s start with the pros:

  • Modern SAST tools can be used early in the SDLC. Ideally, this means that builds and production shouldn’t be disrupted by vulnerable code.
  • Given that organizations typically have one security expert for every 100 developers, vulnerabilities are bound to be overlooked when code reviews are only done manually. SAST tools are designed and known for their ability to analyze the entire codebase, which can augment manual code reviews. This helps teams detect challenging vulnerabilities and improves the speed at which they ship secure and quality code.
  • SAST tools are valued for their ability to trace data flows. This allows them to identify where in the source code sensitive data might leak, check that all data inputs are validated and sanitized, and verify that security protocols are followed when data is stored or transferred.
  • While most SAST tools only trace partial data flows, CodeQL uses a database to represent your source code, so it has a full understanding of how data flows throughout your whole application.
  • Advanced SAST tools can integrate directly in your build process, or CI/CD pipeline, while having access to your codebase. That means your code is automatically scanned for vulnerabilities with every push or build.
  • SAST tools like CodeQL make it easy for developers to address vulnerabilities by providing security alerts that state the exact line of code that triggered the alert, the nature of the alert, properties of the alert (like its severity level), and how to fix the problem.
  • Coupling SAST tooling with generative AI speeds up this process by suggesting an AI-generated code fix to developers. For instance, developers who use Github Advanced Security’s code scanning autofix feature can apply AI-generated code fixes to easily remediate vulnerabilities directly in a pull request.

Discover five easy ways to make developers love security >

Now, some cons. Well, mainly the big one: false positives.

Problems and consequences Solutions
Your SAST tool might match vulnerable patterns in a database to patterns found in comments throughout the source code and in harmless function names. Adding a lexical analysis function–which transforms code into tokens and ignores characters that aren’t related to the semantics of code—filters out pattern matches unrelated to the source code (like patterns found in your code comments).
A legacy SAST tool might not be able to differentiate between input data that comes from a user (and therefore exploitable) and input data that comes from a local source (and therefore benign).

Your SAST tool might not detect when input data has been sanitized or validated as it moves throughout your source code (making the data safe).

Abstracting your code into a hierarchical structure provides a better understanding of where input data enters and is used throughout your code. As a result, the SAST tool will better determine when input data is actually exploitable and raises fewer false positives.
With so many false positives, developers and security experts may lose confidence in the tool’s data and get alert fatigue, which can cause them to skim past critical alerts. A SAST tool with an alert system that can be set with custom and automated triage rules ensures that the most urgent security alerts are addressed first. Engineering teams should also be able to filter and search alerts to sift through all the results and focus on a particular type of alert.

For further reading on false positives, read:

How do SAST tools work?

SQL injections—malicious SQL code that allows users to gain access to sensitive data—are a common vulnerability that is easier to find with SAST than with other testing methods. This is because SAST tools trace data flows, and SQL injections target applications that store and retrieve data in SQL databases.

A schematic diagram depicting an SQL application under an SQL injection attack. The attack vector is shown at the point of data entry by a user. The diagram then depicts the application processing the data with an SQL database and generating an output.
Click diagram to enlarge and save.

Methods used by SAST tools to find vulnerabilities include:

  • Signature-based pattern matching matches patterns of known injection techniques to patterns found in your source code.
  • Semantic analysis not only matches patterns, but also considers how your code was constructed—the surrounding context, logic, and dependencies between different code parts—when scanning for vulnerabilities.
  • Taint analysis tracks the flow of input data throughout your source code to see if it ends up in a function that could be exploited with malicious user input.

🕵🏻‍♀️ Let’s focus on semantic and taint analysis, which provide more flexibility, precision, and broader coverage than signature-based matching (or other legacy methods of static analysis).

We’ll break down how an advanced SAST tool like CodeQL uses semantic and taint analysis to trace a full data flow and find vulnerabilities in your code. 👇

A schematic diagram depicting the steps an SAST tool takes to scan the source code of an SQL application under an SQL injection attack. The first step is tokenizing the source code, the second is abstracting the source code, the third conducting semantic analysis, the fourth conducting taint analysis, and the last generating a security alert about the SQL injection vulnerability.
Click diagram to enlarge and save.

Here’s code that contains a user-controlled parameter, or a parameter where the user submits data:

@postmapping("/sqlinjection/attack11")
@responsebody
public Attackresult completed(@requestparam string name, @requestparam String auth_tan) {
return injectablequeryintegrity(name, auth_tan);
}

1. Tokenize the source code

SAST tools use lexical analysis to transform code into tokens. Because tokens are categorized according to the grammar rules of a programming language, your code becomes a list of standardized parts that makes it easier to analyze. Tokenizing source code allows the SAST tool to conduct a semantic analysis that ignores characters unrelated to the semantics of your code.

2. Abstract the source code

To help decipher the meaning of source code and its structure, most SAST tools visualize your source code as a tree. An abstract syntax tree (AST) transforms lines of your code into a hierarchical structure to show relationships between code, which code belongs with which function, and more.

Here’s what an AST looks like and here’s how to view the AST of your source code.

3. Conduct semantic analysis

Aided by an abstraction of the source code, this analysis allows the SAST tool to understand the code’s meaning and structure. As we mentioned above, a semantic analysis enables CodeQL to ignore tokens that aren’t related to the semantics of your source code. As a result, the SAST tool scans your source code (and not your code comments) for vulnerabilities.

4. Conduct a taint analysis

Earlier, we noted that SQL injections enter your source code through unsanitized or invalidated user input data. But SAST tools look for vulnerabilities in the way your source code handles data, not the data itself. In other words, SAST tools scan the source code written by a developer, not input data entered by a user. This is where taint analysis comes in.

A SAST tool uses taint analysis to do three things:

  • Identify sources, sanitizers, and sinks. Sources are where the input data enters the source code, sanitizers are functions that make input safe, and sinks are functions that if called with unsanitized data could cause harm.
  • Track the flow of input data from a source to a sink. The taint analysis uses abstractions from your source code to follow input data from a source to see if it ends up in a sink or a function that could be exploited with that user input data.
  • Check if input data—whether it’s from the user or a local, benign source— passes through sanitization or validation functions as the input data flows from source to sink.

Advanced SAST tools like CodeQL can evaluate how well these functions actually sanitize or validate the data, and use that judgment to decide whether or not to raise the path as a potential vulnerability. If any input data doesn’t pass through these sanitizing functions, the tool will flag the path as a potential vulnerability.

A schematic diagram shows the four steps of a taint analysis. Two user-controlled parameters, `name` and `auth_tan` pass through a function call, which creates a query statement. The query statement is executed with warnings that it contains user-controlled parameters.
Click diagram to enlarge and save.

Above is an example of how CodeQL traces data flow. The two statements in the last step, Query might include code from this user input is evidence of the CodeQL data flow at work.

The SAST tools recognizes that user-provided data, name and auth_tan, are directly embedded into the SQL query, SELECT * FROM employees WHERE last_name = ‘“ + name + “‘ AND auth_tan = ‘“ + auth_tan + “‘“.

Executing this query might generate a security alert if the user input hasn’t passed through sufficient sanitizers or any sanitizers at all.

📝 Two important notes:

  • Finding a sink is not the same as finding a vulnerability. A sink or a sensitive function on its own isn’t a vulnerability, and many sinks can be used safely. Locating a sink means the tool has found a _potentially _vulnerable source of code in that it’s a function that _could _be exploited if unsanitized input data calls that function.
  • For a vulnerability to exist, the tool must find that unsanitized or invalidated input data can flow from source to sink without going through a sanitizer. To determine if a sink is a vulnerability, the SAST tool must use taint analysis to trace the possible paths that the unsanitized or invalidated input data can take to sensitive functions. (That’s where abstractions of your source code come in handy.) If a SAST tool finds such a data flow, it’ll send an alert about a potential security vulnerability to engineering teams.

🔒 At the end of these four processes, engineering teams will receive security alerts from the SAST tool. To ensure teams address the most urgent security alerts first, it’s important to set custom and automated triage rules and have the ability to filter and search alerts to focus on a particular type.

5. Run custom queries

Another way that some SAST tools can find vulnerabilities is when developers and security experts write queries to search for certain vulnerabilities. CodeQL, in particular, is known for its flexibility in that it allows developers to write custom queries that meet the needs of their codebase.

You can run standard CodeQL queries or write your own to conduct a custom analysis.

Learn how to practice writing your own CodeQL query with these resources:

Automating SAST

Instead of manually pushing a code scan, developers can integrate a modern SAST tool into their current CI/CD pipeline, automating vulnerability code scans with every push or build.

A SAST tool that’s integrated into your build process while having access to your codebase means the tool can better understand the semantic elements of your code and conduct a more comprehensive taint analysis, according to Keith Hoodlet, principal security specialist at GitHub.

Why we need SAST

When you create or work with an application, you need to make sure that it handles data securely. For instance, an educational project designed to be used by students might be subject to the Children’s Online Privacy Protection Act (COPPA), which requires websites that gather data from children under 13 report any breaches to parents.

Wordplay, an educational programming language and web-based IDE is designed to be used by students, which means most project contributors are aspiring developers who are just starting to learn best practices of secure code writing. Consequently, the project needs to protect against data breaches that could expose the projects and identity of those students.

“When students submit pull requests that touch any private data, they need to know that they aren’t shipping common vulnerability patterns that might leak it,” says Amy Ko, founder.

But limited bandwidth makes it difficult to review every single line of submitted code. That’s why the Wordplay team relies on CodeQL to integrate vulnerability checks into its CI/CD pipeline.

“CodeQL is like having a community of experienced security developers regularly code reviewing our work,” Ko adds. “The key reason we use it is to expand our team’s expertise and capacity.”

In addition to using CodeQL to prevent data breaches, the Wordplay team has ideas about how it might use the tool to find patterns that could lead to accessibility issues, like lack of feedback in response to keyboard inputs. For example, CodeQL could be used to identify input sequences that don’t provide feedback.

How SAST helps developers in a new age of coding

Modern SAST tools help developers adapt to this new age of coding. As they take on additional security responsibilities in the shift left movement, developers can rely on SAST tools to trace data flows and locations of exploitable vulnerabilities throughout their projects. What’s more: as developers write more code with the help of AI coding assistants, they can feel more confident that SAST is analyzing their entire source code for vulnerabilities.

Getting to know the workings of a SAST tool, which is one of the most widely used security tools, will give developers a better understanding of its results and security alerts, and that can empower them to actively participate in security discussions and decisions—ultimately benefiting engineering and security teams alike, and organizations overall.

Harness the power of CodeQL. Learn more or get started now.

Written by

Related posts