INFO 554x Data Analytics For Security: Spring '24
[Figure: two event sequences involving nodes N1–N12 plotted over time (0–16)]
Goal: to identify potential "collusions" among the entities responsible for these two events
Logical View
[Diagram: User ↔ Internet ↔ Business Application; requests flow from the user through the Internet to the application and responses flow back, passing through firewalls along the way]
Within this pipeline there are several points through which the request and response pass, leading to several opportunities in the end-to-end process for data collection to help understand when a cyber threat may occur in this process.
Computational Model – Password Theft
Data Analytics Methods for Persistent Threats
Data discretization:
Segregates the data into time-segment bins representing the multiple time periods, which makes it easier to evaluate the persistence of threats across the periods the bins represent.
Suspicious network traffic is captured by an Intrusion Detection System (IDS) in an alert log recording the source and destination IPs, timestamps, and priority.
High-priority threats can be obvious suspicious activities, while low-priority threats are potential alerts that may or may not be real. Such high-priority alerts, when looked at in combination over a long period of time, may turn out to be persistent.
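A minimal sketch of this binning step (assuming the alert log is already loaded into a pandas DataFrame; the column names and values below are hypothetical):

    import pandas as pd

    # Hypothetical IDS alert log: source, destination, priority, timestamp
    alerts = pd.DataFrame({
        "src_ip":   ["10.0.0.5", "10.0.0.5", "172.16.0.9"],
        "dst_ip":   ["192.168.1.2", "192.168.1.2", "192.168.1.7"],
        "priority": ["high", "high", "low"],
        "ts": pd.to_datetime(["2024-01-03 02:10", "2024-02-14 02:12",
                              "2024-02-20 13:00"]),
    })

    # Discretize the continuous timestamps into monthly time-segment bins
    alerts["time_bin"] = alerts["ts"].dt.to_period("M")

    # One dataset per bin, ready for per-period frequent-pattern mining
    binned = {b: df for b, df in alerts.groupby("time_bin")}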
Frequent Patterns:
Analyzing threats individually may not reveal the overall persistence of a threat. It is important to study the data from an overall perspective and identify frequent patterns over a given timeframe to discover persistent threats.
Thus, the data can be mined using Association Rule Mining techniques to isolate the Persistent High Priority Threats (PHPT) from the individual high-priority threats. The PHPT are consistent and may indicate APT behaviors.
Data Analytics Methods for Persistent
Threats - Persistent Threats
Persistent threats are high-priority, unusual threats that stay consistent over a long period of time.
The binned datasets and their corresponding frequent patterns can be intersected with each other to isolate the high-priority threats that recur across bins and detect the potential persistent threats (see the sketch below).
Persistent threat patterns have the following key characteristics:
• Consistent over time (occurs repeatedly)
• Single source (same key players)
• Individually each is a threat
• Unusual (the pattern may be repeated at an unusual time of the day, or a single source may access different sources repeatedly at the same time of the day)
• Non-obvious (non-overlapping)
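One way to realize the intersection step (a sketch, assuming the frequent high-priority patterns for each time bin have already been mined into Python sets; the source–destination pairs are hypothetical):

    # Frequent high-priority (src, dst) patterns mined per time bin
    bin_patterns = {
        "2024-01": {("10.0.0.5", "192.168.1.2"), ("172.16.0.9", "192.168.1.7")},
        "2024-02": {("10.0.0.5", "192.168.1.2")},
        "2024-03": {("10.0.0.5", "192.168.1.2"), ("10.0.0.8", "192.168.1.4")},
    }

    # Patterns recurring in every bin are candidate Persistent High Priority
    # Threats (PHPT): same key players, consistent over time
    persistent = set.intersection(*bin_patterns.values())
    print(persistent)  # {('10.0.0.5', '192.168.1.2')}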
Definition: Association Rule
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Mining Association Rules
∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
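A short sketch of the two-step approach on the transaction table above, using the third-party mlxtend library (assumed installed):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    transactions = [
        ["Bread", "Milk"],
        ["Bread", "Diaper", "Beer", "Eggs"],
        ["Milk", "Diaper", "Beer", "Coke"],
        ["Bread", "Milk", "Diaper", "Beer"],
        ["Bread", "Milk", "Diaper", "Coke"],
    ]

    # One-hot encode the transactions into a boolean item matrix
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)

    # Step 1: frequent itemset generation (support >= minsup)
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)

    # Step 2: generate high-confidence rules, e.g. {Milk, Diaper} -> {Beer}
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence"]])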
Illustrating Apriori Principle
[Figure: the itemset lattice from the null set through A, B, C, D, E up to ABCDE; once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned]
Graph theory
Network Monitoring
Essential for Network Management
• Router and Firewall policy
• Detecting abnormal/error in networking
• Access control
• Security Management
• Detecting abnormal traffic
• Traffic log for future forensic analysis

Tcpdump
• Unix-based command-line tool used to intercept packets, including filtering to just the packets of interest (example commands below)
• Reads "live traffic" from an interface specified using the -i option …
• … or from a previously recorded trace file specified using the -r option (you create these when capturing live traffic using the -w option)

Tshark
• Tcpdump-like capture program that comes w/ Wireshark
• Very similar behavior & flags to tcpdump

Wireshark
• GUI for displaying tcpdump/tshark packet traces
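For example (the interface name eth0 and the file name capture.pcap are assumptions; only the -i, -r, and -w options described above are used):

    # Capture live traffic on interface eth0 and save it to a trace file
    tcpdump -i eth0 -w capture.pcap

    # Later, read the recorded trace back, filtered to Telnet traffic
    tcpdump -r capture.pcap tcp port 23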
Network Layered Structure
What is the Internet?
[Diagram: the Internet as a layered structure, with networks interconnected by IP running over physical links]
Capture filters
A capture filter takes the form of a series of primitive expressions connected
by conjunctions (and/or) and optionally preceded by not:
[not] primitive [and|or [not] primitive ...]
Examples:
• tcp port 23 and host 10.0.0.5 (Telnet traffic to or from host 10.0.0.5)
• tcp port 23 and not src host 10.0.0.5
Primitives:
• [src|dst] host <host>
• ether [src|dst] host <ehost>
• gateway host <host>
• [src|dst] net <net> [{mask <mask>}|{/<len>}]
• [tcp|udp] [src|dst] port <port>
• less|greater <length>
• ip|ether proto <protocol>
Filtering While viewing
Other features
• Follow stream
• Flow Graph
• Conversations
Preprocessing techniques
Discretization is a process of mapping continuous attributes into nominal attributes. The main objective of the discretization process is to discover a set of cut points, which divide the range into a small number of intervals.
Normalization
•Standardization: Normalize numerical features to have zero mean and unit variance.
•Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1) for consistent comparisons.
Timestamp Alignment:
•Synchronization of Time Stamps: Ensure that time stamps across different devices or sources
are synchronized for accurate temporal analysis.
Flow Aggregation:
•Flow-level Aggregation: Aggregate packet-level data into flows (source-destination pairs) to
simplify analysis and reduce data volume (see the sketch after this list).
Sessionization:
•Session Identification: Group packets into sessions to analyze interactions between hosts over a
defined time window.
Traffic Segmentation:
•Segmentation by Protocol: Analyze different protocols separately to understand their specific
characteristics and behaviors.
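A minimal sketch of flow-level aggregation (assuming packet records are already in a pandas DataFrame; column names are hypothetical):

    import pandas as pd

    packets = pd.DataFrame({
        "src":   ["10.0.0.5", "10.0.0.5", "172.16.0.9"],
        "dst":   ["192.168.1.2", "192.168.1.2", "192.168.1.7"],
        "bytes": [60, 1500, 40],
    })

    # Aggregate packet-level records into flows keyed by source-destination pair
    flows = packets.groupby(["src", "dst"]).agg(
        n_packets=("bytes", "size"),
        total_bytes=("bytes", "sum"),
    ).reset_index()
    print(flows)  # one row per flow instead of one row per packet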
Feature selection and feature extraction
• Both reduce dimensionality
• Feature selection performs the removal of features that are irrelevant or redundant for subsequent data representation and classification.
• Feature extraction consists of projecting the original data set into a new space in which the linear dependence between features (the axes or variables of the new space) is minimized, thereby reducing the number of required features.
• Feature selection
• Wrapper, Filters, Hybrids
• Forward-backward elimination, variance based, mutual information, information gain
• Feature extraction
• PCA, ICA, SVD
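As an illustration of feature extraction, a PCA sketch with scikit-learn (the data here is random and purely illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 10)   # 100 samples, 10 original features

    # Project onto a new space whose axes are linearly uncorrelated,
    # keeping only the 3 directions of highest variance
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(X)          # shape: (100, 3)
    print(pca.explained_variance_ratio_)      # variance captured per axis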
Eigenvalues & eigenvectors
› Vectors x having the same direction as Ax are called eigenvectors of A (A is an n×n matrix).
› In the equation Ax = λx, λ is called an eigenvalue of A.
Example:
A = [2 3]   x = [3]   Ax = [12] = 4x
    [2 1]       [2]        [ 8]
so x = (3, 2) is an eigenvector of A with eigenvalue λ = 4.
[Worked example, garbled in extraction: computing an orthonormal basis u1, u2, u3 (with u2, u3 spanning Null(AAᵀ)), and the Range(A), Rank(A), and Null(A) of a matrix via its eigen/singular value decomposition]
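The eigenvector example above can be checked numerically with NumPy (a sketch; np.linalg.eig is the standard dense eigensolver):

    import numpy as np

    A = np.array([[2, 3],
                  [2, 1]])

    # x = (3, 2) satisfies Ax = 4x, so it is an eigenvector with eigenvalue 4
    x = np.array([3, 2])
    print(A @ x)                   # [12  8] == 4 * [3 2]

    vals, vecs = np.linalg.eig(A)  # all eigenpairs: eigenvalues 4 and -1
    print(vals)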
Fuzzy Set and Fuzzy Cluster
Clustering methods (what we discussed last semester): every data object is assigned to exactly one cluster
Some applications may need fuzzy or soft cluster assignment
Ex. An e-game could belong to both entertainment and software
Methods: fuzzy clusters and probabilistic model-based clusters
Fuzzy cluster: A fuzzy set S: FS : X → [0, 1] (membership value between 0 and 1)
Fuzzy C-Means (FCM) is a data clustering technique wherein each data point belongs to a
cluster to some degree that is specified by a membership grade
Example: Attack category is defined as a fuzzy mapping
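A sketch of the FCM membership-grade computation (only the membership-update step, with cluster centers held fixed; full FCM alternates updating centers and memberships; data and centers here are made up):

    import numpy as np

    def fcm_memberships(X, centers, m=2.0):
        # u[i, j] in [0, 1]: degree to which point i belongs to cluster j;
        # each row sums to 1 (standard FCM membership formula, fuzzifier m)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # guard against zero distance
        inv = d ** (-2.0 / (m - 1.0))
        return inv / inv.sum(axis=1, keepdims=True)

    X = np.array([[0.0, 0.0], [1.0, 0.1], [5.0, 5.0]])
    centers = np.array([[0.5, 0.0], [5.0, 5.0]])
    print(fcm_memberships(X, centers))   # soft (fuzzy) cluster assignment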
Linear Separators
Which of the linear separators is optimal?
Then the margin can be expressed through the (rescaled) w and b as: ρ = 2r = 2/‖w‖
Linear SVMs Mathematically (cont.)
Then we can formulate the quadratic optimization problem:
Find w and b such that ρ = 2/‖w‖ is maximized
and for all (xi, yi), i = 1..n : yi(wᵀxi + b) ≥ 1
Non-linear SVMs
› Datasets that are linearly separable with some noise work
out great
[Figure: 1-D datasets plotted on the x-axis, motivating the mapping of data that is not linearly separable into a higher-dimensional space]
The “Kernel Trick”
› The linear classifier relies on inner product between vectors K(xi,xj)=xiTxj
› If every datapoint is mapped into high-dimensional space via some transformation Φ:
x → φ(x), the inner product becomes:
K(xi,xj)= φ(xi) Tφ(xj)
› A kernel function is a function that is equivalent to an inner product in some feature
space.
› Example:
2-dimensional vectors x = [x1 x2]; let K(xi,xj) = (1 + xiᵀxj)²
Need to show that K(xi,xj) = φ(xi)ᵀφ(xj):
K(xi,xj) = (1 + xiᵀxj)² = 1 + xi1²xj1² + 2xi1xj1xi2xj2 + xi2²xj2² + 2xi1xj1 + 2xi2xj2
= [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
= φ(xi)ᵀφ(xj), where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
› Thus, a kernel function implicitly maps data to a high-dimensional space (without the
need to compute each φ(x) explicitly).
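The identity above is easy to verify numerically (a sketch in NumPy; the test vectors are arbitrary):

    import numpy as np

    def K(x, y):
        # Polynomial kernel K(x, y) = (1 + x.y)^2
        return (1.0 + x @ y) ** 2

    def phi(x):
        # The explicit feature map from the derivation above
        x1, x2 = x
        r2 = np.sqrt(2)
        return np.array([1.0, x1**2, r2*x1*x2, x2**2, r2*x1, r2*x2])

    xi, xj = np.array([1.0, 2.0]), np.array([3.0, 0.5])
    print(K(xi, xj), phi(xi) @ phi(xj))   # both print 25.0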
Examples of Kernel Functions
Linear: K(xi, xj) = xiᵀxj
Mapping Φ: x → φ(x), where φ(x) is x itself
Gaussian (radial-basis function): K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))
Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian);
combination of functions for support vectors is the separator.
Higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear
separators in it correspond to non-linear separators in original space.
Graph Algorithms – Community
● A community in a network is a structure in which vertices are densely connected internally
Girvan-Newman Algorithm
● Detects communities by progressively removing edges from the
original network
● The algorithm's steps for community detection are summarized
below
1. Calculate betweenness of all existing edges in the network
2. Remove edge(s) with the highest betweenness
3. Calculate betweenness of all edges affected by the removal
4. Repeat steps 2 and 3 until no edges remain
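A sketch using the Girvan–Newman implementation that ships with NetworkX (assuming the networkx package; the karate-club graph is a standard built-in example network):

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    G = nx.karate_club_graph()

    # girvan_newman() removes highest-betweenness edges and yields the
    # resulting community splits, one level at a time
    splits = girvan_newman(G)
    first_split = next(splits)            # the first division into communities
    print([sorted(c) for c in first_split])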
Log Data
Log data refers to records of events that occur within a system or
network, such as user logins, file accesses, network connections, system
errors, and administrative actions.
Logs may be structured (e.g., JSON or key-value records) or unstructured (free-form text); see the parsing sketch below.
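A minimal sketch of turning an unstructured log line into structured fields (the line follows the traditional BSD-syslog layout; real formats vary by device and vendor):

    import re

    line = ("Apr 21 03:22:15 web01 sshd[2012]: "
            "Failed password for root from 10.0.0.99 port 4321 ssh2")

    # timestamp, host, process[pid], free-text message
    pattern = re.compile(
        r"(?P<ts>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s"
        r"(?P<proc>[\w./-]+)(?:\[(?P<pid>\d+)\])?:\s(?P<msg>.*)"
    )
    m = pattern.match(line)
    if m:
        print(m.group("host"), m.group("proc"), m.group("msg"))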
Log Collection and Management
Log collection involves gathering log data from various sources, including
systems, applications, network devices, and security appliances. Several
methods are commonly used for log collection:
•Agents: Software agents installed on individual systems or devices to collect
and forward log data to a centralized log management system.
•Syslog: Syslog is a standard protocol used for sending log messages over a
network. Devices such as routers, switches, firewalls, and servers can generate
syslog messages and transmit them to a syslog server for centralized storage and
analysis.
•APIs (Application Programming Interfaces): Many applications and services
provide APIs for programmatic access to log data. Organizations can use APIs to
extract log data from applications, cloud services, and third-party platforms for
centralized storage and analysis.
ELK (Elasticsearch, Logstash, and Kibana)
•The ELK Stack, consisting of Elasticsearch, Logstash,
and Kibana, is a powerful combination of open-source
tools used for log management, analytics, and
visualization.
Logstash
Elasticsearch Elements
A node is simply a running instance of Elasticsearch.
• Data Nodes
• Master Nodes
A cluster is a collection of nodes running Elasticsearch.
Index(Noun): "index" refers to a collection of documents that share
similar characteristics.
Index(Verb): "index" refers to the action of adding documents to an
Elasticsearch index.
Each shard is technically a standalone search engine. There are two
types of shards: primary and replica.
In Elasticsearch, data is stored in the form of
documents, which are JSON objects. Each document
represents a single data entity and is stored in an index.
CRUD operations and queries
Create, Read, Update, Delete
Example: a match query (full-text search on an analyzed field):
{
  "match": {
    "description": "high-performance smartphone"
  }
}
Example: a wildcard query (pattern matching on a term):
{
  "wildcard": {
    "name": "smart*"
  }
}
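A sketch of running the match query above against Elasticsearch's REST _search endpoint (the URL http://localhost:9200 and the index name products are assumptions):

    import requests

    resp = requests.post(
        "http://localhost:9200/products/_search",
        json={"query": {"match": {"description": "high-performance smartphone"}}},
        timeout=10,
    )
    # Each hit carries its document id and the stored JSON document
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_id"], hit["_source"])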
Text Analysis/ Text analyzer
Text analysis enables Elasticsearch to perform full-text search,
where the search returns all relevant results rather than just exact
matches.
Types of Malware
• Mass malware
• Intended to infect as many machines as possible
• Most common type
• Targeted malware
• Tailored to a specific target
• Very difficult to detect, prevent, and remove
• Requires advanced analysis
• Ex: Stuxnet
Viruses
Virus propagates by infecting other programs
Automatically creates copies of itself, but to propagate, a human has to
run an infected program
Self-propagating viruses are often called worms
Encrypted viruses: constant decryptor followed by the encrypted
virus body
Polymorphic viruses: each copy creates a new random encryption
of the same virus body
The decryptor code remains constant and can be detected
Historical note: “Crypto” virus decrypted its body by brute-force key
search to avoid explicit decryptor code
Virus Detection
Simple anti-virus scanners
Look for signatures (fragments of known virus code)
Heuristics for recognizing code associated with viruses
Example: polymorphic viruses often use decryption loops
Integrity checking to detect file modifications
Keep track of file sizes, checksums, keyed HMACs of contents
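A toy sketch of signature-based scanning (the signature bytes and file name are made up; real scanners use large signature databases plus the heuristics above):

    # Search files for byte fragments of known virus code
    SIGNATURES = {
        "Example.Virus.A": bytes.fromhex("deadbeef90909090"),
    }

    def scan(path):
        data = open(path, "rb").read()
        return [name for name, sig in SIGNATURES.items() if sig in data]

    print(scan("suspect.bin"))   # names of any matching signatures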
Mutation Techniques
Real Permutating Engine/RPME, ADMutate, etc.
Large arsenal of obfuscation techniques
Instructions reordered, branch conditions reversed, different register
names, different subroutine order
Jumps and NOPs inserted in random places
Garbage opcodes inserted in unreachable code areas
Instruction sequences replaced with other instructions that have the same
effect, but different opcodes
Mutate SUB EAX, EAX into XOR EAX, EAX or
MOV EBP, ESP into PUSH ESP; POP EBP
Dynamic Analysis
Run the malware and monitor its effect
• Basic dynamic analysis
• Easy but requires a safe test environment
• Not effective on all malware
• Advanced Dynamic Analysis
• Run code in a debugger
• Examines internal state of a running malicious executable
PE Header
• Information about the code
• Type of application
• Required library functions
• Space requirements
The PE header lists every library and function that will be loaded; their names can reveal what the program does (see the sketch after the section list below).
Normal programs import many DLLs, while malware often imports very few.
Important PE Sections
.text – instructions for the CPU to execute
.rdata – imports & exports
.data – global data
.rsrc – strings, icons, images, menus
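A sketch of inspecting imports with the third-party pefile library (assumptions: pefile is installed and suspect.exe has an import table):

    import pefile

    pe = pefile.PE("suspect.exe")

    # List every DLL and imported function named in the PE header
    for entry in pe.DIRECTORY_ENTRY_IMPORT:
        print(entry.dll.decode())
        for imp in entry.imports:
            if imp.name:                 # ordinal-only imports have no name
                print("   ", imp.name.decode())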
Levels of abstraction:
• Hardware
• Microcode
• Machine code
• Low-level languages
• High-level languages
• Interpreted languages
Disassembly
• Malware on a disk is in binary form at the machine code level
• Disassembly converts the binary form to assembly language
• IDA Pro is the most popular disassembler
• Graph and Text Mode
• Highlighting, functions and data calls & references, imports, exports
Why Perform Dynamic Analysis?
Dynamic analysis is efficient and will show you exactly what the malware does: running the malware deliberately while monitoring the results.
It requires a safe environment:
• Real machine vs. virtual machine
• Snapshots
• Monitor system state
Boosting
• The term 'Boosting' refers to a family of algorithms which convert weak learners into strong learners.
• To find a weak rule, we apply the base learning algorithm with a different distribution over the training data. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process: after many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
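A sketch of this idea with scikit-learn's AdaBoost (depth-1 decision "stumps" as the weak learners; note the estimator parameter was named base_estimator in older scikit-learn versions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # Each boosting round reweights the training distribution and adds one
    # weak stump; the rounds combine into a single strong classifier
    clf = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=50,
        random_state=0,
    ).fit(X, y)
    print(clf.score(X, y))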
Gradient Boosting
• Gradient boosting trains many models sequentially. Each new model gradually minimizes the loss function (y = ax + b + e, where the error term 'e' needs special attention) of the whole system using the Gradient Descent method.
• The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.
• The principal idea behind this algorithm is to construct new base learners that are maximally correlated with the negative gradient of the loss function associated with the whole ensemble.
o Y = M(x) + error
o error = G(x) + error2
o error2 = H(x) + error3
o Y = M(x) + G(x) + H(x) + error3
o Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
The objective combines the two terms, Obj(Ft) = L(Ft) + Ω(Ft), where L(Ft) is our loss function and Ω(Ft) is our regularization function.
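A sketch of the stage-wise idea in the equations above (for squared loss, fitting the residual is the same as fitting the negative gradient; small decision trees play the roles of M, G, H):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

    # Y = M(x) + G(x) + H(x) + ...: each new tree fits the previous residual
    pred, lr = np.zeros_like(y), 0.5
    for _ in range(3):                          # stages M, G, H
        tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
        pred += lr * tree.predict(X)            # lr plays the alpha/beta/gamma role
    print(np.mean((y - pred) ** 2))             # training error shrinks per stage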
DBSCAN Algorithm
Input: The data set D
Parameters: ε (Eps), MinPts
For each object p in D
    if p is a core object and not processed then
        C = retrieve all objects density-reachable from p
        mark all objects in C as processed
        report C as a cluster
    else mark p as outlier
    end if
End For
Core, Border & Outlier
[Figure: example points labeled Core, Border, and Outlier]
› Given ε (Eps) and MinPts, categorize the objects into three exclusive groups.
A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points that are at the interior of a cluster.
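A runnable counterpart to the pseudocode using scikit-learn (a sketch; eps corresponds to ε, min_samples to MinPts, and the data is synthetic):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(0, 0.3, (50, 2)),           # dense cluster near the origin
        rng.normal(0, 0.3, (50, 2)) + [4, 4],  # dense cluster near (4, 4)
        [[10.0, -10.0]],                       # an isolated outlier
    ])

    db = DBSCAN(eps=0.8, min_samples=5).fit(X)
    print(db.labels_)    # cluster ids; label -1 marks outliers (noise)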
Neural Networks
Neural Networks are functions: NN : X → Y, where X is the input space (e.g., ℝⁿ or {0,1}ⁿ) and Y is the output space (e.g., ℝ or {0,1}).
Robust approach to approximating real-valued, discrete-valued and vector-valued target functions.
[Diagram: a small feedforward network; the weights on its connections are the trainable parameters]
The Backpropagation Algorithm
Create a fully connected three-layer network. Initialize weights.
Until all examples produce the correct output within ε (or other criteria)
For each example in the training set do:
1. Compute the network output for this example
2. Compute the error between the output and target value
3. Propagate the error backward and update the weights (a sketch follows below)
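A compact NumPy sketch of these steps (a tiny 2-3-1 sigmoid network trained on XOR with squared error and a fixed learning rate; layer sizes and seed are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    t = np.array([[0], [1], [1], [0]], float)        # XOR targets
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    bias = lambda a: np.hstack([a, np.ones((len(a), 1))])

    W1 = rng.normal(size=(3, 3))   # 2 inputs + bias -> 3 hidden units
    W2 = rng.normal(size=(4, 1))   # 3 hidden + bias -> 1 output

    for _ in range(20000):
        h = sig(bias(X) @ W1)                     # 1. forward pass
        o = sig(bias(h) @ W2)
        d_o = (o - t) * o * (1 - o)               # 2. error at the output
        d_h = (d_o @ W2[:-1].T) * h * (1 - h)     # 3. backpropagate ...
        W2 -= 0.5 * bias(h).T @ d_o               #    ... and update weights
        W1 -= 0.5 * bias(X).T @ d_h

    print(o.round(2))   # approaches the XOR targets [0, 1, 1, 0]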
Recurrent Neural Networks (RNNs)
Definition of the RNN (in the standard form): the hidden state is updated from the previous hidden state and the current input, ht = f(W·h(t−1) + U·xt), and each output is read from the hidden state, yt = g(V·ht).
[Diagram: the RNN unrolled over time, with inputs x1…xt, hidden states h1…ht, and outputs y1…yt]
Types of RNN (e.g., one-to-one, one-to-many, many-to-one, many-to-many)
LSTM Network Architecture