0% found this document useful (0 votes)
22 views65 pages

1 Introduction

The document provides an overview of software analysis, which involves extracting information from source code or artifacts using automated tools. It discusses various aspects such as goals, methods (deduction, observation, induction, experimentation), and the importance of soundness and completeness in static and dynamic analysis. Additionally, it introduces concepts like Rice's Theorem and examples of program analysis techniques to determine properties of interest in software.

Uploaded by

0xefe4
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views65 pages

1 Introduction

The document provides an overview of software analysis, which involves extracting information from source code or artifacts using automated tools. It discusses various aspects such as goals, methods (deduction, observation, induction, experimentation), and the importance of soundness and completeness in static and dynamic analysis. Additionally, it introduces concepts like Rice's Theorem and examples of program analysis techniques to determine properties of interest in software.

Uploaded by

0xefe4
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 65

Software Analysis

Introduction

Gordon Fraser
Lehrstuhl für Software Engineering II
Contents

• What is Software Analysis?

• Course Organisation

• First Steps: Character Level Analysis


What is Software Analysis?

The process of extracting information about a program

From its source code or artefacts


(e.g., Java bytecode, execution traces, Git history, …)

Information is extracted using automatic tools


Example

Can you tell me what’s wrong?


Example

Can you tell me what’s wrong?


Example

Can you tell me what’s wrong?


Example

Can you tell me what’s wrong?


Example

Can you tell me what’s wrong?


Goals of Software Analysis
• Establish whether given properties hold for some software

• Structural properties must hold at design time:


- the code is indented using tabs
- class Bar has only visible attributes
- module baz contains more than 500 lines of code

• Behavioural properties must hold while the software is


running:
- method foo always terminates
- in 80% of the runs, statement xyz executes in less than 10 ms
- the program crashes with input 42
Why Software Analysis?
<Insert generic software quality motivation here>

<Insert generic programmer productivity motivation here>


Hierarchy of Analysis

• Deduction
• Observation
• Induction
• Experimentation
Deduction

• Deduction is reasoning from the general to the particular; it lies


at the core of all reasoning techniques.

• In program analysis, deduction is used for reasoning from the


program code (or other abstractions) to concrete runs for
deducing what can or cannot happen.

• Deduction does not require any knowledge about the concrete


runs; hence, the analysis is static
Deduction

1 char* format = “a = %d”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);

Assume that buf contains the string “a = 0” (an error)


Observation

• Observation allows the programmer to inspect arbitrary


aspects of an individual program run.

• Since an actual run is required, the associated techniques are


called dynamic.

• Observation brings in actual facts of a program execution;


unless the observation process is awed, these facts cannot be
denied.
fl
Observation

1 char* format = “a = %d”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Induction

• Induction is reasoning from the particular to the general.

• In program analysis, induction is used to summarize multiple


program runs to some abstraction that holds for all
considered program runs (i.e., invariants)

• A “program” may also be a piece of code, like a function or loop


body.
Induction

1 char* format = “a = %d”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);

Example dynamic invariant: a < 2054567 || a % 2 == 1


Experimentation
• Many problems in program understanding can be formulated as a
search for causes:
- What is the cause for buf containing "a = 0”?

• To prove actual causality, one needs two experiments: one where


cause and effect occur, and one where neither cause nor effect
occur.

• The cause must precede the effect, and the cause must be a minimal
difference between these experiments.

• Searching for the actual cause thus requires a series of experiments,


re ning and rejecting hypotheses until the actual cause is isolated.
fi
Experimentation

1 char* format = “a = %d”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);
Experimentation

1 char* format = “a = %d”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);

“a = %d” but a is a oat


fl
Experimentation

1 char* format = “a = %f”;


2 if (p)
3 a = compute_value();
4 sprintf(buf, format, a);

“a = %d” but a is a oat


fl
Hierarchy of Analysis
• Deduction from an abstraction into the concrete
For instance, analysing program code to deduce what can or cannot happen in
concrete runs.

• Observation of concrete events


e.g. tracing, monitoring or pro ling a program run or using a debugger.

• Induction for summarising observations into an abstraction


an invariant, for example, or some visualisation.

• Experimentation for isolating causes of given effects


e.g. narrowing down failure-inducing circumstances by systematic tests.
fi
Software Analysis
Executable Program Execution Trace

0101011010
1010110101
0101101010
1010101010
1010111111

Entry

x <= y Exit

true
false true

x == y x>0
return False

true true

print
y == 17

true

Source Code
return True

Internal Representation

Analysis Algorithm
Software Analysis
• We’ve written a program P and we want to know...

• Does P satisfy property of interest φ?


- For example, P does not dereference a null pointer, all casts in P are safe, ...

• Manually verifying that P satis es φ may be tedious or


impractical if P is large or complex

• Would be great to write a program that can inspect the source


code of P and determine if φ holds!
fi
Analyser that decides if a variable in a
program has a constant value in any
execution
x = 17;
Analyser that decides if a variable in a
if (TM(j)) program has a constant value in any
x = 18; execution

x has a constant value if and only if the j’th Turing machine does
not halt on empty input. If the hypothetical constant-value
analyser exists, then we have a decision procedure for the
halting problem, which is known to be impossible.
Rice's Theorem
"All non-trivial semantic properties of programs are undecidable"

• Non-trivial: there exists both a program that has that property


and one that does not

• Undecidable: no automated method can determine whether


the property holds for any program

We can abstract the behaviours of the program into a


decidable approximation and attempt the analysis in the
abstract version of the program
Example Analysis
• Given a program P, determine the sign (positive, negative, or zero) of all
of its variables (sign analysis).

• Applications:

• Check for division by 0

• Optimize by storing positive variables as unsigned


integers

• Check for negative values, e.g., negative array indices.


Example Approximation
• Abstract concrete domain of integer values into abstract
domain of signs {⊖, ⊚, ⊕} ∪ {⊤,⊥}

• Step 1: De ne abstraction function abstract() over integer


literals:

• abstract(i) = ⊖ if i < 0

• abstract(i) = ⊚ if i == 0

• abstract(i) = ⊕ if i > 0
fi
Example Analysis
Concrete domain of ints Abstract domain of signs

⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0
⊤ all ints (unknown)

⊥ no ints (unde ned)


fi
Example Analysis
• De ne transfer functions over the abstract domain that show how to
evaluate expressions in the abstract domain:
• ⊕+⊕ =⊕
• ⊖+⊖ =⊖
• ⊚+⊚ =⊚
• ⊚+⊕ =⊕
• ⊚+⊖ =⊖
• … …
• ⊕+⊖ =⊤
• {⊕, ⊖, ⊚, ⊤} + ⊤ = ⊤
• {⊕, ⊖, ⊚, ⊤} / ⊚ = ⊥
fi
Example Analysis
Concrete domain of ints Abstract domain of signs

⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0

x = b ? -1 : 1
⊤ all ints (unknown)

⊥ no ints (unde ned)


fi
Example Analysis
Concrete domain of ints Abstract domain of signs

⊕ positive ints
x=5
⊖ negative ints
x = -5
⊚ zero
x=0

x = b ? -1 : 1
⊤ all ints (unknown)

x = y/0 ⊥ no ints (unde ned)


fi
Example Analysis

⊤ Z

{i | i<0} ⊖ ⊚ {0} ⊕ {i | i>0}

⊥ {}
Example Analysis

⊤ Z
⊆ ⊆ ⊆

{i | i<0} ⊖ ⊚ {0} ⊕ {i | i>0}



⊆ ⊆
⊥ {}
Example Approximation

Now, every integer value and expression has been abstracted


into a {⊖, ⊚, ⊕, ⊤, ⊥}. This is very simplistic (so easy to solve)
but even so it is useful. For example:

• Check for division by zero by looking for occurrences of ⊥

• Optimize the program by storing:


⊕ variables as unsigned integers
⊚ variables false boolean literals
Example Analysis

a = 5; a = ⊕;
b = -3; b = ⊖;
c = a ⨉ b; c = ⊖;
d = 0; d = ⊚;
e = c ⨉ d; e = ⊚;
f = 10 / e; f = ⊥; Detected division by
zero! Just look for
variables that the
analysis maps to ⊥.
Example Analysis

a = 5; a = ⊕;
b = -3; b = ⊖;
c = a + b; c = ⊤;
d = 0; d = ⊚;
e = c - d; e = ⊤; False positive! This
program can never
f = 10 / e; f = ⊤; throw an error, but
the analysis reports
that f may contain any
value (including
unde ned)
fi
Does Program P Satisfy φ?
Overapproximate (sound) analysis

All behaviours of P

Underapproximate
(complete) analysis

A sound static analysis over-approximates: it reports all the


errors but may report some "false alarms".

A complete static analysis under-approximates: it reports


only real errors but may not report all of them.
Which Programs Satisfy φ?
Programs satisfying φ Programs violating φ

P2
P1 P3
Tool A
(sound)

P5 P7

Tool B
(neither)
P4

P6

Tool C (complete)
Soundness and Completeness
Complete Incomplete

Reports all errors Reports all errors


Sound

Reports no false alarms May report false alarms

Undecidable Decidable

May not report all errors May not report all errors
Unsound

Reports no false alarms May report false alarms

Decidable Decidable
Soundiness

• Static analysis tools that give up soundness under


speci c circumstances but still achieve few false
alarms are de ned soundy

• Soundy tools cannot handle speci c features (e.g.,


re ection) but can handle correctly programs that do
not make use of those feature (aim for the common
case)
fl
fi
fi
fi
Does Program P Satisfy φ?
(Static Analysis)

Over-approximate (sound) analysis

All behaviours of P

Under-approximate
(complete) analysis
Does Program P Satisfy φ?
(Dynamic Analysis)

All behaviours of P

Covered program behaviours


(complete) analysis
Example Analysis

a = 5; a = ⊕;
b = -3; b = ⊖;
c = a + b; c = ⊤;
d = 0; d = ⊚;
e = c - d; e = ⊤; False positive! This
program can never
f = 10 / e; f = ⊤; throw an error, but
the analysis reports
that f may contain any
value (including
unde ned)
fi
Dynamic Analysis

def foo(a):
b = -3
foo(5)
c=a+b
foo(3)
d=0
True positive!
e=c-d
f = 10 / e
return f
Which Programs Satisfy φ?
(Static Analysis)

Programs satisfying φ Programs violating φ

P2
P1 P3
Tool A
(sound)

P5 P7

Tool B
(neither)

P4

P6

Tool C (complete)
Which Programs Satisfy φ?
(Dynamic Analysis)

Programs satisfying φ Programs violating φ

P2
P1 P3
Test Suite A

P5 P7

P4 Test Suite B

P6
Static vs Dynamic Analysis

• The main challenge of building a static analysis is choosing a


good abstraction function

• The main challenge of performing a good dynamic analysis is


selecting a representative set of test cases.
Which Programs Satisfy φ?
(Neural Analysis)

Programs satisfying φ Programs violating φ

P2
P1 P3
86%
95% 74%

P5 P7

13% 21%

P4

P6 75%
95%
Contents

• What is Software Analysis?

• Course Organisation

• First Steps: Character Level Analysis


Software Analysis
Executable Program Execution Trace

0101011010
1010110101
0101101010
1010101010
1010111111

Entry

x <= y Exit

true
false true

x == y x>0
return False

true true

print
y == 17

true

Source Code
return True

Internal Representation

Analysis Algorithm
int x = 0;
Characters 1. Introduction, Character-level
if (y > z){
analysis
int x = 0;
Tokens 2. Token-level analysis:
if (y > z){
NLP for Source Code
3. Syntax Analysis
BlockStatement

Syntax Trees
AssignmentStatement IfStatement WhileStatement

x 4 > AssignmentStatement > BlockStatement

y 42 x 10 x 0 AssignmentStatement

Entry
x

x
-

1
4. Control ow Analysis
Control Flow 5. Data ow Analysis
true
x <= y

false true
x == y

6. Dependence Graphs
false
print

false
x>0
true
Data Flow 7. Interprocedural Analysis
true
y == 17

8. Symbolic Execution, Dynamic


false
Entry

Dependency
returnx <=
False return True

Symbolic Execution
y Exit

true
false true

Exit x == y x>0
return False

9. Dynamic Analysis
true true

print
y == 17

Executions
true

return True
fl
fl
Passing Requirements
• There is no exam

• Three programming assignments (“Portfolio”)

• Short viva at the end of the semester

• Registration on HISQIS: 28.04. - 23.05.

• Detailed grading criteria will be speci ed for each assignment

• Tests for your implementation are part of the grading criteria!


fi
Tentative Assignments

• Assignment 1 (Week 4):


Code metrics + prediction

• Assignment 2 (Week 7):


Interprocedural sign analysis — nding division by zero errors

• Assignment 3 (Week 10):


Dynamic program slicing

• Java (Non-negotiable)
fi
Exercise Sessions

• Content: Assignment presentation,


Q&A, and discussion

Patric Feldmeier

Wednesday 14:00-16:00
Information and Literature

• Lecture slides on StudIP

• Jupyter Notebooks with code examples


Contents

• What is Software Analysis?

• Course Organisation

• First Steps: Character Level Analysis


Code Clones

• Code fragment: A sequence of code lines

• Code clone: A code fragment in source les that is identical or


similar to another
• Number 1 on Beck and Fowler’s “Stink Parade of Bad Smells”
• Between 7%-23% of the code in a typical software system is
cloned
fi
Why Clone?
• Development Strategy
• Simple reuse by Copy/Paste

• Forking (reuse of similar solutions with the hope that they will be diverged signi cantly with the
evolution of the system)

• Delay in restructuring

• Maintenance Bene ts
• Risk in developing new code

• Speed up maintenance

• Overcoming Limitations
• Lack of reuse mechanism of programming languages

• Signi cant efforts in writing reusable code

• Time limit assigned to developers

• Lack of ownership of the code to be reused

• Developer’s lack of knowledge in a problem domain


fi
fi
fi
Drawbacks of Clones
• Increased probability of bug propagation

• Increased probability of introducing a new bug

• Increased probability of bad design

• Increased dif culty in system improvement/modi cation

• Increased maintenance cost

• Increased resource requirements


fi
fi
Advantages and Applications
of Detecting Code Clones

• Improves software quality

• Detects library candidates

• Helps in program understanding

• Finds usage pattern

• Detects plagiarism and copyright infringement


Clone Types
• Type-1 clones
Identical code fragments but may have some variations in whitespace, layout,
and comments

• Type-2 clones
Syntactically equivalent fragments with some variations in identi ers, literals,
types, whitespace, layout and comments

• Type-3 clones
Syntactically similar code with inserted, deleted, or updated statements

• Type-4 clones
Semantically equivalent, but syntactically different
fi
Clone Detection Strategies
• Text matching
• No program structure is taken into consideration

• Type-1 clones & some Type-2 clones

• Exact string match vs Ambiguous match (e.g., Longest Common Subsequence match,)

• —> See Jupyter Notebook

• Token sequence matching


• No program structure is taken into account, either

• Type-1 and Type-2 clones

• Graph matching
• Type-1, Type-2, and Type-3 clones; some Type-4

• Syntactic and semantic understanding (AST, CFG, PDG)

• Metrics based
Contents

• What is Software Analysis?

• Course Organisation

• Character Level Analysis

You might also like