Why Your SCA is Always Wrong
A breakdown of why your SCA results are always so full of false positives (and sometimes false negatives), and why treating source code as a first class citizen can lead us to the solution.
Why are all SCA tools wrong?
“Like ships in the night
You keep passing me by
Just wasting time
Trying to prove who’s right
And if it all goes crashing into the sea
If it's just you and me
Trying to find the light”
Lyrics by Mat Kearney
Open source software (OSS) makes up the majority of code in modern applications and is one of the top ways vulnerabilities are introduced into application code. Software composition analysis (SCA) tools are commonly used to detect these vulnerabilities. But depending on the languages used by your organization, those SCA results may be incredibly inaccurate. In particular, Python and JavaScript pose a significant problem for traditional SCA tools. To build effective programs and evaluate security tools, AppSec professionals need to understand the basics of dependency management and how SCA tools work (or don't).
Package Managers and Compilers - Ships in the Night
Modern software systems are constructed using a blend of proprietary code (created by the primary developer) and third-party components, many of which stem from open-source packages. Software composition analysis involves scrutinizing a software program to pinpoint the third-party elements it depends on. Knowing the software's composition is essential for both security and maintenance reasons. For instance, it's crucial to ascertain whether a third-party component contains security vulnerabilities or bugs that could compromise the overall system's quality.
In most programming languages, building a software system and managing its third-party dependencies rely on two essential tools working together:
Package Manager
This tool is responsible for organizing and consolidating specific versions of third-party components. Starting from a list provided by the developer that outlines the needed third-party components, the package manager downloads these specified components, including any of their dependent packages. Some widely used package managers are Maven for Java, npm for JavaScript, and pip (backed by the PyPI index) for Python.
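In the Python ecosystem, for example, that developer-provided list is often a requirements.txt file. The package names and version constraints below are purely illustrative:

```
# requirements.txt: the developer's declared list of third-party components
requests==2.31.0        # a direct dependency, pinned to one version
flask>=3.0,<4.0         # a direct dependency with a version range

# Running "pip install -r requirements.txt" resolves and downloads these
# packages plus everything they in turn depend on.
```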
Compiler or Run-time Environment
This tool takes the code supplied by the developer and the packages retrieved by the package manager to either compile or execute the final system.
Together, these tools ensure that the software system is built with all the necessary components and can run efficiently.
In many programming languages, especially those conceived before the surge in open-source software's popularity, the two tools—package manager and compiler/runtime—operate in isolation, much like "ships passing in the night." This implies that the compiler remains oblivious to the packages downloaded by the package manager, while the package manager is unaware of the actual necessities of the compiler. This separation results in dependency definitions primarily being human-driven, making the process susceptible to errors. Notably, Go stands out as an exception to this norm. In Go, package management and the compiler are seamlessly integrated, ensuring cohesive operation.
To better understand the challenges surrounding dependency management in software development, let's delve into specific problems using Python as a primary example, while also drawing parallels to other languages.
Phantom Dependencies
In Python, importing third-party libraries is straightforward. A simple import statement, like from foo import bar, is used. The Python runtime will then search predetermined paths to locate a package named "foo". If located, the required code is imported. The challenge arises when determining how "foo" was downloaded and situated within the Python paths.
Often, a package manager like pip is used with a requirements.txt file declaring "foo" as a requirement. The manager fetches the relevant package from configured public or private repositories. Yet this isn't the only method. A developer might write a script to download "foo" or even copy its files into the project manually. If the code imports a dependency not defined in the package manager's requirements, we call it a "phantom dependency."
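To make this concrete, here is a minimal sketch of a phantom dependency, reusing the hypothetical "foo" package from above (requests and its version pin are only illustrative):

```python
# requirements.txt declares a single dependency:
#
#   requests==2.31.0
#
# app.py nevertheless imports "foo", which was copied into the project
# (or fetched by an ad-hoc script) outside the package manager. It
# resolves fine at runtime but never appears in the manifest.

from foo import bar   # phantom dependency: invisible to pip and to manifest-only SCA
import requests       # declared dependency: visible to both


def fetch(url: str) -> str:
    response = requests.get(url, timeout=10)
    return bar(response.text)
```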
Misused Dependency Scopes
Many package managers permit users to specify if dependencies are for runtime or just development. Such categorizations are termed dependency "scopes." Both runtime and dev dependencies are retrieved during installation, a practice prevalent in both Python and JavaScript. But a pitfall exists. After dependencies are placed in the requisite directories, nothing prohibits a developer from using a "dev" or "test" scoped dependency in the main codebase. The package manager's definitions remain invisible to the compiler or runtime.
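A short sketch of the problem in Python (flask and faker, the file layout, and the versions are all just illustrative stand-ins for "runtime" and "dev/test" scoped packages):

```python
# requirements.txt      -> flask==3.0.0    (runtime scope)
# requirements-dev.txt  -> faker==24.0.0   (dev/test scope)
#
# Once both files are installed, nothing stops production code from
# importing the dev-scoped package: the interpreter never sees scopes.

import flask
from faker import Faker   # declared only as a dev/test dependency


app = flask.Flask(__name__)
fake = Faker()


@app.route("/demo-user")
def demo_user() -> dict:
    # A "test data" library quietly shipping in the runtime path.
    return {"name": fake.name(), "email": fake.email()}
```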
Direct Use of Transitive Dependencies
Let's assume a package manager is in use. When a developer declares a dependency on a package "foo," the manager might fetch dozens, or even hundreds, of other dependencies needed by "foo." While these are essential for the main program's operation, they aren't its "direct" dependencies and shouldn't be used by it directly. However, once fetched, the primary program can freely use any of these transitive dependencies, bypassing the intended constraints.
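A real-world flavor of this, with requests standing in as the only declared dependency (the version pin is illustrative):

```python
# requirements.txt declares only:
#
#   requests==2.31.0
#
# pip also pulls in requests' own dependencies, including urllib3.
# Nothing stops the application from importing urllib3 directly,
# even though it never declared it.

import requests   # direct, declared dependency
import urllib3    # transitive dependency of requests, used directly


def fetch_insecure(url: str) -> bytes:
    # Direct use of the transitive dependency bypasses the declared
    # dependency graph: if requests ever swaps out urllib3, this code
    # breaks, and manifest-only SCA misjudges how the package is used.
    http = urllib3.PoolManager(cert_reqs="CERT_NONE")
    return http.request("GET", url).data
```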
Unused Dependencies
The mirror image of phantom dependencies: there's often a disconnect between what's declared in package management files and what's actively used within the code. Consequently, a package manager might introduce dependencies that the code never actually utilizes. Dependencies are also frequently marked as optional in the package management definitions, applying only to certain target platforms or other parameters. Relying only on package manager information therefore reports a large number of third-party dependencies that are never used.
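For instance (package names, versions, and the environment marker are illustrative):

```python
# requirements.txt
#
#   requests==2.31.0
#   numpy==1.26.0                            # listed "just in case", never imported
#   pywin32==306; sys_platform == "win32"    # platform-conditional, unused on Linux
#
# app.py only ever touches requests:

import requests


def ping(url: str) -> int:
    return requests.get(url, timeout=10).status_code

# A manifest-only SCA tool still reports numpy and pywin32 (and any
# vulnerabilities they carry) as part of this application.
```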
Why Are All SCA Tools Wrong?
The examples provided underscore a significant challenge in software development: relying solely on package management definitions can lead to a skewed understanding of a program's true dependencies. As we will see, SCA tools fail to capture an accurate list of dependencies precisely because they only look at the package manager's definitions and not the actual source code.
Most SCA tools employ a relatively straightforward process to analyze dependencies (a simplified sketch in code follows the list):
- File System Scan: The tools scan the file system to identify package management definition files.
- File Parsing: These files are read and parsed to extract dependency information.
- Assumption of Dependency Accuracy: SCA tools generally assume that the dependencies declared in the package management files are an accurate representation of those employed within the software.
- Dependency Graph Construction: Based on the parsed data, SCA tools create a dependency graph. This visually represents how various dependencies interrelate and the nature of their inclusion in the software.
- SBOM Generation and Vulnerability Analysis: By analyzing the dependency graph, SCA tools generate a Software Bill of Materials (SBOM), a comprehensive list detailing every piece of software within the application. Simultaneously, vulnerability analysis is performed, assessing potential risks within these dependencies.
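The first three steps of this flow can be sketched in a few lines of Python. The requirements.txt parsing is simplified and the vulnerability lookup is left as a placeholder, so treat this as an illustration rather than a real scanner:

```python
# A naive, manifest-only "SCA": scan for requirements.txt files, parse the
# pinned names and versions, and treat that list as the complete inventory
# of the application. The source code itself is never consulted.

from pathlib import Path


def parse_requirements(path: Path) -> dict[str, str]:
    deps: dict[str, str] = {}
    for line in path.read_text().splitlines():
        line = line.split("#")[0].strip()         # drop comments and blank lines
        if not line:
            continue
        name, _, version = line.partition("==")   # ignores markers, extras, ranges
        deps[name.strip().lower()] = version.strip()
    return deps


def manifest_only_inventory(project_root: str) -> dict[str, str]:
    inventory: dict[str, str] = {}
    for manifest in Path(project_root).rglob("requirements*.txt"):
        inventory.update(parse_requirements(manifest))
    return inventory


if __name__ == "__main__":
    for name, version in sorted(manifest_only_inventory(".").items()):
        # Each entry would then be checked against a vulnerability database.
        print(f"{name}=={version}")
```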
By relying strictly on the information provided by package managers, SCA tools miss all of these situations:
- Phantom Dependencies: These lead to false negatives in the tools' outputs. If an SCA tool doesn't recognize or account for a dependency because it's not explicitly listed in the package manager's definition files, yet the program uses it, vulnerabilities in this 'invisible' dependency can go undetected.
- Misused Dependency Scopes: By prioritizing or de-prioritizing dependencies based on their declared scopes (e.g., "test" or "development"), SCA tools might overlook dependencies that are in fact vital in the runtime environment. This might lead to the underestimation of vulnerabilities in what is perceived as "lower priority" components, even if they're actively employed within the software.
- Transitive Dependency Assumptions: When the tools make presumptions about the nature and use of transitive dependencies, there's potential for misguidance. They might assign incorrect severity levels to vulnerabilities or suggest inappropriate remediation methods for issues stemming from these indirect dependencies.
- Unused Dependencies: On the other end of the spectrum, by assuming all declared dependencies are active, SCA tools might generate false positives. They might raise alarms about vulnerabilities in software components that, while present on the file system, aren't utilized in the software's operation. This can lead to unnecessary mitigation efforts.
SCA Through Program Analysis
Given these considerations, an ideal Software Composition Analysis (SCA) tool would (see the sketch after this list):
- Use Source Code as the Ground Truth: As the primary "source of truth," the source code offers the clearest insight into which dependencies are actually called upon and used.
- Correlate with Package Manager Data: After establishing the dependencies from the source code, the tool should cross-reference this data with package manager information. This step is crucial to identify any discrepancies, such as phantom or unused dependencies.
- Correlate with the File System: By comparing the set of dependencies declared by package management manifests with those used in the code and those available on the file system, one can get a complete picture of the actual dependencies in use.
- Highlight Discrepancies: Any variation between the actual code and the package manager definitions should be clearly marked, alerting developers to potential issues like missed vulnerabilities or unnecessary packages.
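A minimal sketch of what the source-first half of this looks like in Python: walk the code with the ast module to collect the modules it actually imports, map those to installed distributions, and diff the result against the declared manifest. The declared set could come from a manifest parser like the naive one sketched earlier; everything here assumes Python 3.10+ and is an illustration, not a production scanner:

```python
import ast
import sys
from importlib.metadata import packages_distributions  # Python 3.10+
from pathlib import Path


def imported_top_level_modules(project_root: str) -> set[str]:
    """Collect the top-level modules the source code actually imports."""
    modules: set[str] = set()
    for source_file in Path(project_root).rglob("*.py"):
        tree = ast.parse(source_file.read_text(), filename=str(source_file))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules - set(sys.stdlib_module_names)  # drop the standard library


def diff_code_vs_manifest(project_root: str, declared: set[str]) -> None:
    """Compare what the code imports against what the manifest declares."""
    dist_for_module = packages_distributions()  # module name -> [distribution, ...]
    used: set[str] = set()
    for module in imported_top_level_modules(project_root):
        used.update(d.lower() for d in dist_for_module.get(module, [module]))
    print("phantom (used, never declared):", sorted(used - declared))
    print("unused  (declared, never used):", sorted(declared - used))
```

Cross-referencing the remaining discrepancies with what is actually present on the file system then completes the picture described above.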
Conclusion
The increasing reliance on third-party components in software development necessitates a profound understanding of a software's composition, both for ensuring security and proper maintenance. While package managers and compilers are critical tools in this process, their historical isolation from each other in many programming languages can result in various challenges, including phantom dependencies, misused dependency scopes, and over-reliance on transitive or unused dependencies. Traditional Software Composition Analysis (SCA) tools, which predominantly lean on package management definitions, often miss these nuances, leading to potential security vulnerabilities or unnecessary mitigation efforts. For effective and accurate software composition analysis, tools must start with the source code as the primary point of reference, cross-referencing it with package manager data to identify discrepancies. By doing so, the software development community can ensure a more accurate, secure, and efficient software composition, affirming that SCA grounded in program analysis is the right approach.