Software Analysis
Introduction
Gordon Fraser
Lehrstuhl für Software Engineering II
Contents
• What is Software Analysis?
• Course Organisation
• First Steps: Character Level Analysis
What is Software Analysis?
The process of extracting information about a program
From its source code or artefacts
(e.g., Java bytecode, execution traces, Git history, …)
Information is extracted using automatic tools
Example
Can you tell me what’s wrong?
Example
Can you tell me what’s wrong?
Example
Can you tell me what’s wrong?
Example
Can you tell me what’s wrong?
Example
Can you tell me what’s wrong?
Goals of Software Analysis
• Establish whether given properties hold for some software
• Structural properties must hold at design time:
- the code is indented using tabs
- class Bar has only visible attributes
- module baz contains more than 500 lines of code
• Behavioural properties must hold while the software is
running:
- method foo always terminates
- in 80% of the runs, statement xyz executes in less than 10 ms
- the program crashes with input 42
Why Software Analysis?
<Insert generic software quality motivation here>
<Insert generic programmer productivity motivation here>
Hierarchy of Analysis
• Deduction
• Observation
• Induction
• Experimentation
Deduction
• Deduction is reasoning from the general to the particular; it lies
at the core of all reasoning techniques.
• In program analysis, deduction is used for reasoning from the
program code (or other abstractions) to concrete runs for
deducing what can or cannot happen.
• Deduction does not require any knowledge about the concrete
runs; hence, the analysis is static
Deduction
1 char* format = “a = %d”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Assume that buf contains the string “a = 0” (an error)
Observation
• Observation allows the programmer to inspect arbitrary
aspects of an individual program run.
• Since an actual run is required, the associated techniques are
called dynamic.
• Observation brings in actual facts of a program execution;
unless the observation process is awed, these facts cannot be
denied.
fl
Observation
1 char* format = “a = %d”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Induction
• Induction is reasoning from the particular to the general.
• In program analysis, induction is used to summarize multiple
program runs to some abstraction that holds for all
considered program runs (i.e., invariants)
• A “program” may also be a piece of code, like a function or loop
body.
Induction
1 char* format = “a = %d”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Example dynamic invariant: a < 2054567 || a % 2 == 1
Experimentation
• Many problems in program understanding can be formulated as a
search for causes:
- What is the cause for buf containing "a = 0”?
• To prove actual causality, one needs two experiments: one where
cause and effect occur, and one where neither cause nor effect
occur.
• The cause must precede the effect, and the cause must be a minimal
difference between these experiments.
• Searching for the actual cause thus requires a series of experiments,
re ning and rejecting hypotheses until the actual cause is isolated.
fi
Experimentation
1 char* format = “a = %d”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Experimentation
1 char* format = “a = %d”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
“a = %d” but a is a oat
fl
Experimentation
1 char* format = “a = %f”;
2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
“a = %d” but a is a oat
fl
Hierarchy of Analysis
• Deduction from an abstraction into the concrete
For instance, analysing program code to deduce what can or cannot happen in
concrete runs.
• Observation of concrete events
e.g. tracing, monitoring or pro ling a program run or using a debugger.
• Induction for summarising observations into an abstraction
an invariant, for example, or some visualisation.
• Experimentation for isolating causes of given effects
e.g. narrowing down failure-inducing circumstances by systematic tests.
fi
Software Analysis
Executable Program Execution Trace
0101011010
1010110101
0101101010
1010101010
1010111111
Entry
x <= y Exit
true
false true
x == y x>0
return False
true true
print
y == 17
true
Source Code
return True
Internal Representation
Analysis Algorithm
Software Analysis
• We’ve written a program P and we want to know...
• Does P satisfy property of interest φ?
- For example, P does not dereference a null pointer, all casts in P are safe, ...
• Manually verifying that P satis es φ may be tedious or
impractical if P is large or complex
• Would be great to write a program that can inspect the source
code of P and determine if φ holds!
fi
Analyser that decides if a variable in a
program has a constant value in any
execution
x = 17;
Analyser that decides if a variable in a
if (TM(j)) program has a constant value in any
x = 18; execution
x has a constant value if and only if the j’th Turing machine does
not halt on empty input. If the hypothetical constant-value
analyser exists, then we have a decision procedure for the
halting problem, which is known to be impossible.
Rice's Theorem
"All non-trivial semantic properties of programs are undecidable"
• Non-trivial: there exists both a program that has that property
and one that does not
• Undecidable: no automated method can determine whether
the property holds for any program
We can abstract the behaviours of the program into a
decidable approximation and attempt the analysis in the
abstract version of the program
Example Analysis
• Given a program P, determine the sign (positive, negative, or zero) of all
of its variables (sign analysis).
• Applications:
• Check for division by 0
• Optimize by storing positive variables as unsigned
integers
• Check for negative values, e.g., negative array indices.
Example Approximation
• Abstract concrete domain of integer values into abstract
domain of signs {⊖, ⊚, ⊕} ∪ {⊤,⊥}
• Step 1: De ne abstraction function abstract() over integer
literals:
• abstract(i) = ⊖ if i < 0
• abstract(i) = ⊚ if i == 0
• abstract(i) = ⊕ if i > 0
fi
Example Analysis
Concrete domain of ints Abstract domain of signs
⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0
⊤ all ints (unknown)
⊥ no ints (unde ned)
fi
Example Analysis
• De ne transfer functions over the abstract domain that show how to
evaluate expressions in the abstract domain:
• ⊕+⊕ =⊕
• ⊖+⊖ =⊖
• ⊚+⊚ =⊚
• ⊚+⊕ =⊕
• ⊚+⊖ =⊖
• … …
• ⊕+⊖ =⊤
• {⊕, ⊖, ⊚, ⊤} + ⊤ = ⊤
• {⊕, ⊖, ⊚, ⊤} / ⊚ = ⊥
fi
Example Analysis
Concrete domain of ints Abstract domain of signs
⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0
x = b ? -1 : 1
⊤ all ints (unknown)
⊥ no ints (unde ned)
fi
Example Analysis
Concrete domain of ints Abstract domain of signs
⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0
x = b ? -1 : 1
⊤ all ints (unknown)
x = y/0 ⊥ no ints (unde ned)
fi
Example Analysis
⊤ Z
{i | i<0} ⊖ ⊚ {0} ⊕ {i | i>0}
⊥ {}
Example Analysis
⊤ Z
⊆ ⊆ ⊆
{i | i<0} ⊖ ⊚ {0} ⊕ {i | i>0}
⊆
⊆ ⊆
⊥ {}
Example Approximation
Now, every integer value and expression has been abstracted
into a {⊖, ⊚, ⊕, ⊤, ⊥}. This is very simplistic (so easy to solve)
but even so it is useful. For example:
• Check for division by zero by looking for occurrences of ⊥
• Optimize the program by storing:
⊕ variables as unsigned integers
⊚ variables false boolean literals
Example Analysis
a = 5; a = ⊕;
b = -3; b = ⊖;
c = a ⨉ b; c = ⊖;
d = 0; d = ⊚;
e = c ⨉ d; e = ⊚;
f = 10 / e; f = ⊥; Detected division by
zero! Just look for
variables that the
analysis maps to ⊥.
Example Analysis
a = 5; a = ⊕;
b = -3; b = ⊖;
c = a + b; c = ⊤;
d = 0; d = ⊚;
e = c - d; e = ⊤; False positive! This
program can never
f = 10 / e; f = ⊤; throw an error, but
the analysis reports
that f may contain any
value (including
unde ned)
fi
Does Program P Satisfy φ?
Overapproximate (sound) analysis
All behaviours of P
Underapproximate
(complete) analysis
A sound static analysis over-approximates: it reports all the
errors but may report some "false alarms".
A complete static analysis under-approximates: it reports
only real errors but may not report all of them.
Which Programs Satisfy φ?
Programs satisfying φ Programs violating φ
P2
P1 P3
Tool A
(sound)
P5 P7
Tool B
(neither)
P4
P6
Tool C (complete)
Soundness and Completeness
Complete Incomplete
Reports all errors Reports all errors
Sound
Reports no false alarms May report false alarms
Undecidable Decidable
May not report all errors May not report all errors
Unsound
Reports no false alarms May report false alarms
Decidable Decidable
Soundiness
• Static analysis tools that give up soundness under
speci c circumstances but still achieve few false
alarms are de ned soundy
• Soundy tools cannot handle speci c features (e.g.,
re ection) but can handle correctly programs that do
not make use of those feature (aim for the common
case)
fl
fi
fi
fi
Does Program P Satisfy φ?
(Static Analysis)
Over-approximate (sound) analysis
All behaviours of P
Under-approximate
(complete) analysis
Does Program P Satisfy φ?
(Dynamic Analysis)
All behaviours of P
Covered program behaviours
(complete) analysis
Example Analysis
a = 5; a = ⊕;
b = -3; b = ⊖;
c = a + b; c = ⊤;
d = 0; d = ⊚;
e = c - d; e = ⊤; False positive! This
program can never
f = 10 / e; f = ⊤; throw an error, but
the analysis reports
that f may contain any
value (including
unde ned)
fi
Dynamic Analysis
def foo(a):
b = -3
foo(5)
c=a+b
foo(3)
d=0
True positive!
e=c-d
f = 10 / e
return f
Which Programs Satisfy φ?
(Static Analysis)
Programs satisfying φ Programs violating φ
P2
P1 P3
Tool A
(sound)
P5 P7
Tool B
(neither)
P4
P6
Tool C (complete)
Which Programs Satisfy φ?
(Dynamic Analysis)
Programs satisfying φ Programs violating φ
P2
P1 P3
Test Suite A
P5 P7
P4 Test Suite B
P6
Static vs Dynamic Analysis
• The main challenge of building a static analysis is choosing a
good abstraction function
• The main challenge of performing a good dynamic analysis is
selecting a representative set of test cases.
Which Programs Satisfy φ?
(Neural Analysis)
Programs satisfying φ Programs violating φ
P2
P1 P3
86%
95% 74%
P5 P7
13% 21%
P4
P6 75%
95%
Contents
• What is Software Analysis?
• Course Organisation
• First Steps: Character Level Analysis
Software Analysis
Executable Program Execution Trace
0101011010
1010110101
0101101010
1010101010
1010111111
Entry
x <= y Exit
true
false true
x == y x>0
return False
true true
print
y == 17
true
Source Code
return True
Internal Representation
Analysis Algorithm
int x = 0;
Characters 1. Introduction, Character-level
if (y > z){
analysis
int x = 0;
Tokens 2. Token-level analysis:
if (y > z){
NLP for Source Code
3. Syntax Analysis
BlockStatement
Syntax Trees
AssignmentStatement IfStatement WhileStatement
x 4 > AssignmentStatement > BlockStatement
y 42 x 10 x 0 AssignmentStatement
Entry
x
x
-
1
4. Control ow Analysis
Control Flow 5. Data ow Analysis
true
x <= y
false true
x == y
6. Dependence Graphs
false
print
false
x>0
true
Data Flow 7. Interprocedural Analysis
true
y == 17
8. Symbolic Execution, Dynamic
false
Entry
Dependency
returnx <=
False return True
Symbolic Execution
y Exit
true
false true
Exit x == y x>0
return False
9. Dynamic Analysis
true true
print
y == 17
Executions
true
return True
fl
fl
Passing Requirements
• There is no exam
• Three programming assignments (“Portfolio”)
• Short viva at the end of the semester
• Registration on HISQIS: 28.04. - 23.05.
• Detailed grading criteria will be speci ed for each assignment
• Tests for your implementation are part of the grading criteria!
fi
Tentative Assignments
• Assignment 1 (Week 4):
Code metrics + prediction
• Assignment 2 (Week 7):
Interprocedural sign analysis — nding division by zero errors
• Assignment 3 (Week 10):
Dynamic program slicing
• Java (Non-negotiable)
fi
Exercise Sessions
• Content: Assignment presentation,
Q&A, and discussion
Patric Feldmeier
Wednesday 14:00-16:00
Information and Literature
• Lecture slides on StudIP
• Jupyter Notebooks with code examples
Contents
• What is Software Analysis?
• Course Organisation
• First Steps: Character Level Analysis
Code Clones
• Code fragment: A sequence of code lines
• Code clone: A code fragment in source les that is identical or
similar to another
• Number 1 on Beck and Fowler’s “Stink Parade of Bad Smells”
• Between 7%-23% of the code in a typical software system is
cloned
fi
Why Clone?
• Development Strategy
• Simple reuse by Copy/Paste
• Forking (reuse of similar solutions with the hope that they will be diverged signi cantly with the
evolution of the system)
• Delay in restructuring
• Maintenance Bene ts
• Risk in developing new code
• Speed up maintenance
• Overcoming Limitations
• Lack of reuse mechanism of programming languages
• Signi cant efforts in writing reusable code
• Time limit assigned to developers
• Lack of ownership of the code to be reused
• Developer’s lack of knowledge in a problem domain
fi
fi
fi
Drawbacks of Clones
• Increased probability of bug propagation
• Increased probability of introducing a new bug
• Increased probability of bad design
• Increased dif culty in system improvement/modi cation
• Increased maintenance cost
• Increased resource requirements
fi
fi
Advantages and Applications
of Detecting Code Clones
• Improves software quality
• Detects library candidates
• Helps in program understanding
• Finds usage pattern
• Detects plagiarism and copyright infringement
Clone Types
• Type-1 clones
Identical code fragments but may have some variations in whitespace, layout,
and comments
• Type-2 clones
Syntactically equivalent fragments with some variations in identi ers, literals,
types, whitespace, layout and comments
• Type-3 clones
Syntactically similar code with inserted, deleted, or updated statements
• Type-4 clones
Semantically equivalent, but syntactically different
fi
Clone Detection Strategies
• Text matching
• No program structure is taken into consideration
• Type-1 clones & some Type-2 clones
• Exact string match vs Ambiguous match (e.g., Longest Common Subsequence match,)
• —> See Jupyter Notebook
• Token sequence matching
• No program structure is taken into account, either
• Type-1 and Type-2 clones
• Graph matching
• Type-1, Type-2, and Type-3 clones; some Type-4
• Syntactic and semantic understanding (AST, CFG, PDG)
• Metrics based
Contents
• What is Software Analysis?
• Course Organisation
• Character Level Analysis