
INFO 554x

Data Analytics for Security


Spring ’24
Introduction
Cybersecurity refers to securing valuable electronic assets, and physical assets with electronic access, against unauthorized access. These assets may include personal devices, networked devices, information assets, and infrastructural assets, among others. Cybersecurity risks arise from many sources, including:
• Applications with several dependencies,
• Logical errors in software code (such as Heartbleed),
• Organizational risks involving multiple partners (such as in the cyber-attacks at Target and the Pacific Northwest National Laboratory [PNNL]),
• Lack of user awareness of cybersecurity risks (such as in social engineering and phishing),
• Personality traits of the individuals using the systems (phishing), and
• Inherent issues in the Internet protocols being used.
Multi-dimensional View of Threats
• Computer networks evolve over time, and communication patterns change over time.
• Attacks may have a spatial pattern: sources and destinations in certain key geographic locations are more important for monitoring and preventing an attack.
• Any type of attack has common underpinnings of how it is carried out.
• Events become relevant when they occur together; they become relevant through proximities rather than causation.
• Two events are in close proximity based on:
  • Source proximity
  • Destination proximity
  • Spatial distance
  • Temporal proximity or delay

[Figure: events at nodes N1-N12 plotted against time (0-16), illustrating temporal and source/destination proximity between events]
Goal : to identify potential “collusions” among the entities responsible for these two events

Logical View

[Figure: User ↔ Internet ↔ Business Application, with requests flowing toward the application and responses flowing back]

The logical view of a user requesting access to a business application appears fairly straightforward. Within this pipeline, however, the request and response can pass through several points, leading to several opportunities in the end-to-end process for data collection that helps determine when a cyber threat may occur.
Physical View

[Figure: the physical path of a request and response between a user and a business application. The request passes from the user through an Internet service provider router, Internet routers with routing tables, an Internet exchange, a Domain Name System lookup, and the enterprise firewall. Inside the enterprise, switches, firewalls, and intrusion detection systems (IDS) sit in front of public-facing systems such as the web server, and internal systems such as the enterprise server, databases, and file stores.]
Advanced Persistent Threat

Computational Model: Password Theft

Data Analytics Methods for Persistent Threats
Data discretization:
Segregates the data into time-segment bins representing multiple time periods, which makes it easier to evaluate the persistence of threats across the time periods represented by the bins.
Suspicious network traffic is captured by an intrusion detection system (IDS) in an alert log recording the source and destination IPs, timestamps, and priority.
High-priority threats can be obvious suspicious activities, while low-priority threats can be potential alerts that may or may not be real. High-priority alerts, when looked at in combination over a long period of time, may turn out to be persistent.
Frequent patterns:
Analyzing threats individually may not reveal the overall persistence of a threat. It is important to study the data from an overall perspective and identify frequent patterns in a given timeframe to discover persistent threats. Thus, the data can be mined using association rule mining techniques to isolate Persistent High-Priority Threats (PHPT) from the individual high-priority threats. PHPTs are consistent and may indicate APT behavior.
Data Analytics Methods for Persistent Threats
Persistent threats are high-priority, unusual threats that stay consistent over a long period of time.
The binned datasets and their corresponding frequent patterns can be intersected with each other to isolate the non-intersecting high-priority threats and detect the potential persistent threats.
Persistent threat patterns have the following key characteristics:
• Consistent over time (occur repeatedly)
• Single source (same key players)
• Individually each is a threat
• Unusual (the pattern may be repeated at an unusual time of day, or a single source may access different sources repeatedly at the same time of day)
• Non-obvious (non-overlapping)
Definition: Association Rule

Example transaction database:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Association rule: an implication expression of the form X → Y, where X and Y are itemsets.
  Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics:
  • Support (s): the fraction of transactions that contain both X and Y.
  • Confidence (c): measures how often items in Y appear in transactions that contain X.
• Example, for {Milk, Diaper} → {Beer}:
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having:
• support ≥ minsup threshold
• confidence ≥ minconf threshold
Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
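To make the metrics concrete, here is a minimal Python sketch that computes support and confidence for the example rule {Milk, Diaper} → {Beer} over the transaction table above (the function names are our own):

```python
# Minimal sketch: support and confidence over the example transaction table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y)                 # 2/5 = 0.4
c = support(X | Y) / support(X)    # 2/3 ~= 0.67
print(f"support={s:.2f}, confidence={c:.2f}")
```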
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent.

The Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
Illustrating Apriori Principle

[Figure: the itemset lattice from the null set through A, B, C, D, E up to ABCDE. If AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned from the search.]
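A compact sketch of level-wise frequent-itemset generation with Apriori pruning follows; this is our own illustrative implementation, not a library call:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset generation with Apriori pruning."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= minsup]
    frequent = list(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ...and prune any candidate that has an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Keep only candidates meeting the minsup threshold
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n >= minsup]
        frequent += current
        k += 1
    return frequent
```

For example, with minsup = 0.6 on the table above, {Bread, Beer} is infrequent, so the candidate {Bread, Diaper, Beer} is pruned without ever counting its support.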
Graph theory
Network Monitoring
Essential for network management:
• Router and firewall policy, including filtering
• Detecting abnormal conditions and errors in the network
• Access control
Security management:
• Detecting abnormal traffic
• Traffic logs for future forensic analysis

Tools:
• Tcpdump: Unix-based command-line tool used to intercept packets, including filtering to just the packets of interest. Reads "live traffic" from an interface specified using the -i option, or from a previously recorded trace file specified using the -r option (you create these when capturing live traffic using the -w option).
• Tshark: tcpdump-like capture program that comes with Wireshark; very similar behavior and flags to tcpdump.
• Wireshark: GUI for displaying tcpdump/tshark packet traces.
Network Layered Structure
What is the Internet? A stack of layers:
• Application: Web, Email, VOIP
• Transport: TCP, UDP
• Network: IP
• Data Link: Ethernet, cellular
• Physical: link
Capture filters
A capture filter takes the form of a series of primitive expressions connected by conjunctions (and/or) and optionally preceded by not:
[not] primitive [and|or [not] primitive ...]
Examples (see the sketch below):
• tcp port 23 and host 10.0.0.5 (Telnet traffic to or from host 10.0.0.5)
• tcp port 23 and not src host 10.0.0.5 (Telnet traffic not originating from 10.0.0.5)
Primitives:
• [src|dst] host <host>
• ether [src|dst] host <ehost>
• gateway host <host>
• [src|dst] net <net> [{mask <mask>}|{/<len>}]
• [tcp|udp] [src|dst] port <port>
• less|greater <length>
• ip|ether proto <protocol>
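The same BPF filter syntax can be driven from Python; a minimal sketch using the third-party scapy library (an assumption: scapy is installed and the script runs with capture privileges):

```python
# Sketch: applying the capture-filter syntax above from Python via scapy.
from scapy.all import sniff

# Same BPF primitives as tcpdump: Telnet traffic to/from 10.0.0.5
packets = sniff(filter="tcp port 23 and host 10.0.0.5",
                prn=lambda pkt: pkt.summary(),  # print a one-line summary per packet
                count=10)                       # stop after 10 packets
```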
Filtering While Viewing

Other features:
• Follow stream
• Flow Graph
• Conversations
Preprocessing techniques
Discretization is a process of mapping continuous attributes into nominal attributes. The main objective of the discretization process is to discover a set of cut points that divide the range into a small number of intervals.
Normalization
•Standardization: Normalize numerical features to have zero mean and unit variance.
•Min-Max Scaling: Scale features to a specific range (e.g., 0 to 1) for consistent comparisons.
Timestamp Alignment:
•Synchronization of Time Stamps: Ensure that time stamps across different devices or sources
are synchronized for accurate temporal analysis.
Flow Aggregation:
•Flow-level Aggregation: Aggregate packet-level data into flows (source-destination pairs) to
simplify analysis and reduce data volume.
Sessionization:
•Session Identification: Group packets into sessions to analyze interactions between hosts over a
defined time window.
Traffic Segmentation:
•Segmentation by Protocol: Analyze different protocols separately to understand their specific
characteristics and behaviors.
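A sketch of the discretization and normalization steps above using scikit-learn (an assumed dependency); flows is a stand-in numeric feature matrix:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler, MinMaxScaler

flows = np.array([[120, 0.2], [4800, 0.9], [310, 0.4], [9700, 0.7]])  # e.g. bytes, duration

# Discretization: map continuous attributes into a small number of intervals
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
discretized = bins.fit_transform(flows)

# Standardization: zero mean, unit variance per feature
standardized = StandardScaler().fit_transform(flows)

# Min-max scaling: squeeze each feature into [0, 1]
scaled = MinMaxScaler().fit_transform(flows)
```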
Feature selection and feature extraction
• Both reduce dimensionality.
• Feature selection removes features that are irrelevant or redundant for subsequent data representation and classification.
• Feature extraction projects the original data set into a new space in which the linear dependence between features (the axes or variables of the new space) is minimized, thereby reducing the number of required features.
• Feature selection methods:
  • Wrappers, filters, hybrids
  • Forward/backward elimination, variance-based, mutual information, information gain
• Feature extraction methods:
  • PCA, ICA, SVD
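A brief feature-extraction sketch using scikit-learn's PCA (an assumed dependency); X stands in for any numeric feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)        # projection onto top covariance eigenvectors
print(pca.explained_variance_ratio_)    # variance captured by each component
```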
Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (where A is an n × n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.
Example:

$$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}\begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4\begin{pmatrix} 3 \\ 2 \end{pmatrix}$$

• Of particular interest: the eigenvalues and eigenvectors of the covariance matrix.
SVD - Basics
THEOREM: It is always possible to decompose a matrix A into A = UΣV^T, where:
• U, Σ, V are unique (if the singular values are unique)
• U, V are column-orthonormal (i.e., the columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix)
• Σ: the singular values are positive, and sorted in decreasing order

The SVD of an m-by-n matrix A is given by the formula A = UΣV^T, where:
• U is an m-by-n matrix of the orthonormal eigenvectors of AA^T
• V^T is the transpose of an n-by-n matrix containing the orthonormal eigenvectors of A^T A
• Σ is an n-by-n diagonal matrix of the singular values, which are the square roots of the eigenvalues of A^T A
How to compute SVD (by hand)
• Range of A: the span of the columns of U corresponding to non-zero singular values.
• Rank of A: the number of non-zero singular values.
• Null space of A: the span of the columns of V corresponding to zero singular values.
• Span: the span of a set of vectors is the set of all possible linear combinations of those vectors.

$$A = U\Sigma V^T \;\Rightarrow\; AA^T = (U\Sigma V^T)(V\Sigma U^T) = U\Sigma^2 U^T \;\Rightarrow\; \sigma_i(A) = \sqrt{\lambda_i(AA^T)}$$
$$A = U\Sigma V^T \;\Rightarrow\; A^T A = (V\Sigma U^T)(U\Sigma V^T) = V\Sigma^2 V^T \;\Rightarrow\; \sigma_i(A) = \sqrt{\lambda_i(A^T A)}$$
Example:

$$A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \\ 0 & 0 \end{pmatrix}, \qquad W = A^T A = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}$$

Solving det(W − λI) = 0 gives λ₁ = 4, λ₂ = 0, so the singular values are σ₁ = 2, σ₂ = 0, with eigenvectors

$$v_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad v_2 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} -1 \\ 1 \end{pmatrix}$$

The first left singular vector is

$$u_1 = \tfrac{1}{\sigma_1} A v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{pmatrix}$$

and u₂, u₃ form an orthonormal basis for Null(AA^T), where AA^T has rows (2, 2, 0), (2, 2, 0), (0, 0, 0):

$$u_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \\ 0 \end{pmatrix}, \qquad u_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$

Here Range(A) = span{u₁}, Rank(A) = 1, and Null(A) = span{v₂}. Putting it together:

$$A = U\Sigma V^T = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} & 0 \\ 1/\sqrt{2} & 1/\sqrt{2} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
Fuzzy Set and Fuzzy Cluster
• The clustering methods we discussed last semester assign every data object to exactly one cluster.
• Some applications may need fuzzy or soft cluster assignment. Ex.: an e-game could belong to both entertainment and software.
• Methods: fuzzy clusters and probabilistic model-based clusters.
• Fuzzy cluster: a fuzzy set S is a mapping FS : X → [0, 1] (each membership value lies between 0 and 1).
• Fuzzy C-Means (FCM) is a data clustering technique wherein each data point belongs to a cluster to some degree specified by a membership grade (see the sketch below).
• Example: an attack category can be defined as a fuzzy mapping.
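A minimal numpy sketch of the FCM update equations (illustrative, not production code; the function and parameter names are our own):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Each row of the returned U holds one point's membership grades (sum to 1)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    return centers, U
```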
Linear Separators
Which of the linear separators is optimal?

Maximum Margin Classification


Linear SVM Mathematically
Let the training set {(xᵢ, yᵢ)}ᵢ₌₁..ₙ, xᵢ ∈ Rᵈ, yᵢ ∈ {−1, 1}, be separated by a hyperplane with margin ρ. Then for each training example (xᵢ, yᵢ):
  wᵀxᵢ + b ≤ −ρ/2 if yᵢ = −1
  wᵀxᵢ + b ≥ ρ/2 if yᵢ = 1
which is equivalent to yᵢ(wᵀxᵢ + b) ≥ ρ/2.

For every support vector xₛ the above inequality is an equality. After rescaling w and b by ρ/2 in the equality, we obtain that the distance between each xₛ and the hyperplane is

  r = yₛ(wᵀxₛ + b) / ‖w‖ = 1 / ‖w‖

Then the margin can be expressed through the (rescaled) w and b as ρ = 2r = 2/‖w‖.
Linear SVMs Mathematically (cont.)
Then we can formulate the quadratic optimization problem:

  Find w and b such that ρ = 2/‖w‖ is maximized, and for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1

Which can be reformulated as:

  Find w and b such that Φ(w) = ‖w‖² = wᵀw is minimized, and for all (xᵢ, yᵢ), i = 1..n: yᵢ(wᵀxᵢ + b) ≥ 1
Soft Margin Classification
What if the training set is not linearly separable? Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
Non-linear SVMs
• Datasets that are linearly separable with some noise work out great.
• But what are we going to do if the dataset is just too hard?
• How about mapping the data to a higher-dimensional space?

[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping each point x to (x, x²)]
The "Kernel Trick"
• The linear classifier relies on the inner product between vectors: K(xᵢ, xⱼ) = xᵢᵀxⱼ
• If every datapoint is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
• A kernel function is a function that is equivalent to an inner product in some feature space.
• Example: 2-dimensional vectors x = [x₁ x₂]; let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)². We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
  K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)² = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
  = [1, xᵢ₁², √2·xᵢ₁xᵢ₂, xᵢ₂², √2·xᵢ₁, √2·xᵢ₂]ᵀ [1, xⱼ₁², √2·xⱼ₁xⱼ₂, xⱼ₂², √2·xⱼ₁, √2·xⱼ₂]
  = φ(xᵢ)ᵀφ(xⱼ), where φ(x) = [1, x₁², √2·x₁x₂, x₂², √2·x₁, √2·x₂]
• Thus, a kernel function implicitly maps data to a high-dimensional space (without the need to compute each φ(x) explicitly).
Examples of Kernel Functions
• Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ. Mapping Φ: x → φ(x), where φ(x) is x itself.
• Polynomial of power p: K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)ᵖ. Mapping Φ: x → φ(x), where φ(x) has C(d + p, p) dimensions.
• Gaussian (radial-basis function): K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²). Mapping Φ: x → φ(x), where φ(x) is infinite-dimensional: every point is mapped to a function (a Gaussian); a combination of such functions for the support vectors is the separator.
• The higher-dimensional space still has intrinsic dimensionality d (the mapping is not onto), but linear separators in it correspond to non-linear separators in the original space.
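A sketch of these kernels as exposed through scikit-learn's SVC (an assumed dependency). Note sklearn's polynomial kernel is (γ·xᵢᵀxⱼ + coef0)^degree, so degree=2, coef0=1 approximates the example above; the ring-shaped toy data is not linearly separable:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # ring vs. core: not linearly separable

linear = SVC(kernel="linear").fit(X, y)
poly   = SVC(kernel="poly", degree=2, coef0=1).fit(X, y)   # roughly (1 + x_i . x_j)^2
rbf    = SVC(kernel="rbf", gamma=0.5).fit(X, y)            # Gaussian kernel

for name, clf in [("linear", linear), ("poly", poly), ("rbf", rbf)]:
    print(name, clf.score(X, y))    # the non-linear kernels should fit far better
```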
Graph Algorithms: Community
• A community in a network is a structure in which vertices are densely connected internally.
Girvan-Newman Algorithm
• Detects communities by progressively removing edges from the original network.
• The algorithm's steps for community detection are summarized below (see the sketch that follows):
1. Calculate the betweenness of all existing edges in the network
2. Remove the edge(s) with the highest betweenness
3. Recalculate the betweenness of all edges affected by the removal
4. Repeat steps 2 and 3 until no edges remain
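A usage sketch with networkx's built-in implementation (an assumed dependency), run on the classic karate-club benchmark graph:

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()          # classic 2-community benchmark graph
communities = girvan_newman(G)      # generator: one partition per level of edge removal
first_split = next(communities)     # the communities after the first full split
print([sorted(c) for c in first_split])
```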
Log Data
Log data refers to records of events that occur within a system or network, such as user logins, file accesses, network connections, system errors, and administrative actions. Its uses include:
• Detecting Security Incidents
• Facilitating Incident Response
• Supporting Compliance
• Enabling Threat Hunting
• Improving Security Posture
Challenges in Log Data Analysis:
Despite its benefits, log data analysis poses several challenges, including:
Volume and Variety: Organizations generate vast amounts of log data from diverse sources, making it
challenging to manage, process, and analyze effectively.
Complexity and Noise: Log data often contains a mix of legitimate events, false positives, and irrelevant
information, requiring sophisticated analysis techniques to distinguish between normal and suspicious
activities.
Timeliness and Scalability: Analyzing log data in real-time and at scale requires efficient log collection,
processing, and storage mechanisms to handle large volumes of data and meet the demands of
dynamic computing environments.
Integration and Correlation: Integrating log data from disparate sources and correlating events across
different systems can be complex, requiring standardized log formats, interoperable tools, and robust
correlation techniques.
Privacy and Data Protection: Log data may contain sensitive information, such as user credentials, IP addresses, and personally identifiable information (PII), raising privacy concerns and compliance challenges related to data protection laws and regulations.
Types of Logs
Access Logs:
Examples: Web server access logs, database access logs, application access logs.
Use Cases: Access logs are crucial for monitoring user activity, tracking resource
usage, and identifying suspicious access attempts or unauthorized activities.
Event Logs:
Examples: Windows Event Logs, syslog messages, application event logs.
Use Cases: Event logs are essential for troubleshooting system issues, diagnosing errors,
and monitoring the health and performance of IT infrastructure.
Audit Logs:
Examples: Authentication logs, authorization logs, configuration change logs.
Use Cases: Audit logs play a critical role in compliance monitoring, regulatory compliance, and forensic
investigations. They help organizations detect and investigate security incidents, maintain audit trails,
and demonstrate compliance with industry regulations and standards.

Logs may also be classified as structured vs. unstructured.
Log Collection and Management
Log collection involves gathering log data from various sources, including
systems, applications, network devices, and security appliances. Several
methods are commonly used for log collection:
•Agents: Software agents installed on individual systems or devices to collect
and forward log data to a centralized log management system.
•Syslog: Syslog is a standard protocol used for sending log messages over a
network. Devices such as routers, switches, firewalls, and servers can generate
syslog messages and transmit them to a syslog server for centralized storage and
analysis.
•APIs (Application Programming Interfaces): Many applications and services
provide APIs for programmatic access to log data. Organizations can use APIs to
extract log data from applications, cloud services, and third-party platforms for
centralized storage and analysis.
ELK (Elasticsearch, Logstash, and
Kibana)
•The ELK Stack, consisting of Elasticsearch, Logstash,
and Kibana, is a powerful combination of open-source
tools used for log management, analytics, and
visualization.
Logstash
Elasticsearch Elements
• A node is simply a running instance of Elasticsearch. Node roles include data nodes and master nodes.
• A cluster is a collection of nodes running Elasticsearch.
• Index (noun): a collection of documents that share similar characteristics.
• Index (verb): the action of adding documents to an Elasticsearch index.
• Each shard is technically a standalone search engine. There are two types of shards: primary and replica.
• In Elasticsearch, data is stored in the form of documents, which are JSON objects. Each document represents a single data entity and is stored in an index.
CRUD operations and queries
Create, Read, Update, Delete. Example query clauses:

A full-text match query:
{
  "match": {
    "description": "high-performance smartphone"
  }
}

A wildcard query:
{
  "wildcard": {
    "name": "smart*"
  }
}
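As a rough sketch of how these operations look from Python, using the official elasticsearch client (8.x-style API; the host, index name, ids, and documents are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create: index a document
es.index(index="products", id="1",
         document={"name": "smartwatch", "description": "high-performance companion"})

# Read: fetch it back by id
doc = es.get(index="products", id="1")

# Update: change a field
es.update(index="products", id="1", doc={"name": "smartwatch v2"})

# Query: the same match / wildcard bodies shown above
hits = es.search(index="products", query={"wildcard": {"name": "smart*"}})

# Delete
es.delete(index="products", id="1")
```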
Text Analysis / Text Analyzers
Text analysis enables Elasticsearch to perform full-text search, where the search returns all relevant results rather than just exact matches.
• An analyzer, whether built-in or custom, is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.
• Character filters preprocess the stream of characters before it is passed to the tokenizer.
• A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
• Token filters accept a stream of tokens from a tokenizer and can modify, delete, or add tokens.
Inverted index
• The inverted index is a data structure used in information retrieval systems, including search engines like Elasticsearch, to efficiently store and retrieve documents based on their textual content.
• It is called "inverted" because it reverses the relationship between terms and documents compared to the original documents: it maps each term to the documents containing it.
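A toy sketch of the structure: a map from each term to the set of document ids containing it (the documents and the trivial tokenizer are illustrative):

```python
from collections import defaultdict

docs = {
    1: "failed login from external host",
    2: "login succeeded for admin",
    3: "external host scanned ports",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():      # trivial "analyzer": lowercase + whitespace
        index[token].add(doc_id)

# Term lookup is now a set operation instead of a scan over all documents
print(index["login"])                        # {1, 2}
print(index["external"] & index["host"])     # {1, 3}
```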
The Growth of a Threat
• Self-replicating program: Creeper
• Large-scale pandemics: Morris worm (infected 10% of the Internet)
• Email virus + social engineering: Xmas Exec
• Malware mass email campaigns: Love Letter, Melissa
• Multiple vectors of infection, attacks against AV software, combined infection vectors, dangerous payloads: Code Red, Nimda
• Sophisticated engineering, use of crypto: Conficker
• Social network/cell phone worms, Stuxnet, …
Types of Malware
Backdoor, botnet, downloader, information-stealing malware, launcher, rootkit, scareware.
• Spam-sending malware: the attacker rents the machine out to spammers
• Worms or viruses: malicious code that can copy itself and infect additional computers
• Mass malware: intended to infect as many machines as possible; the most common type
• Targeted malware: tailored to a specific target; very difficult to detect, prevent, and remove; requires advanced analysis (ex.: Stuxnet)
Viruses
• A virus propagates by infecting other programs: it automatically creates copies of itself, but to propagate, a human has to run an infected program. Self-propagating viruses are often called worms.
• Encrypted viruses: a constant decryptor followed by the encrypted virus body.
• Polymorphic viruses: each copy creates a new random encryption of the same virus body. The decryptor code remains constant and can be detected.
• Historical note: the "Crypto" virus decrypted its body by brute-force key search to avoid explicit decryptor code.
Virus Detection
• Simple anti-virus scanners:
  • Look for signatures (fragments of known virus code)
  • Heuristics for recognizing code associated with viruses (example: polymorphic viruses often use decryption loops)
  • Integrity checking to detect file modifications: keep track of file sizes, checksums, keyed HMACs of contents
• Generic decryption and emulation:
  • Emulate CPU execution for a few hundred instructions, then recognize the known virus body after it has been decrypted
  • Does not work very well against viruses with mutating bodies, or viruses not located near the beginning of the infected executable
Mutation Techniques
• Real Permutating Engine/RPME, ADMutate, etc.
• Large arsenal of obfuscation techniques:
  • Instructions reordered, branch conditions reversed, different register names, different subroutine order
  • Jumps and NOPs inserted in random places
  • Garbage opcodes inserted in unreachable code areas
  • Instruction sequences replaced with other instructions that have the same effect but different opcodes (e.g., mutate SUB EAX, EAX into XOR EAX, EAX, or MOV EBP, ESP into PUSH ESP; POP EBP)
• The result: there is no constant, recognizable virus body.
Malware Analysis
• Dissecting malware to understand
• How it works
• How to identify it
• How to defeat or eliminate it
• A critical part of incident response
• Exactly what happened
• Ensure you’ve located all infected machines and files
• How to measure and contain the damage
• Find signatures for intrusion detection systems
Static vs. Dynamic Analysis
Static analysis examines malware without running it:
• Basic static analysis: view the malware without looking at instructions (antivirus scanning, hashes, the file's strings, functions, and headers)
• Advanced static analysis: reverse-engineering with a disassembler; complex, requires an understanding of assembly code
Dynamic analysis runs the malware and monitors its effects:
• Basic dynamic analysis: easy, but requires a safe test environment; not effective on all malware
• Advanced dynamic analysis: run the code in a debugger; examines the internal state of a running malicious executable
PE Header
Information about the code:
• Type of application
• Required library functions
• Space requirements

The PE header lists every library and function that will be loaded; their names can reveal what the program does. Normal programs import many DLLs; malware often imports very few.
Important PE Sections
• .text: instructions for the CPU to execute
• .rdata: imports & exports
• .data: global data
• .rsrc: strings, icons, images, menus

Levels of abstraction:
• Hardware
• Microcode
• Machine code
• Low-level languages
• High-level languages
• Interpreted languages
Disassembly
• Malware on disk is in binary form, at the machine-code level
• Disassembly converts the binary form to assembly language
• IDA Pro is the most popular disassembler:
  • Graph and text modes
  • Highlighting; function, data-call, and cross-reference views; imports; exports
Why Perform Dynamic Analysis?
Dynamic analysis is efficient and will show you exactly what the malware does: running malware deliberately while monitoring the results. It requires a safe environment:
• Real machine vs. virtual machine
• Snapshots
• Monitoring of system state
Boosting
• The term 'boosting' refers to a family of algorithms which convert weak learners into strong learners.
• To find a weak rule, we apply a base learning algorithm with a different distribution each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process; after many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.
Gradient Boosting
• Gradient boosting trains many models sequentially. Each new model gradually minimizes the loss function of the whole system (y = ax + b + e, where the error term 'e' needs special attention) using the gradient descent method.
• The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.
• The principal idea behind this algorithm is to construct each new base learner to be maximally correlated with the negative gradient of the loss function of the whole ensemble:
  o Y = M(x) + error
  o error = G(x) + error2
  o error2 = H(x) + error3
  o Y = M(x) + G(x) + H(x) + error3
  o Y = alpha * M(x) + beta * G(x) + gamma * H(x) + error4
• For regression with square loss (see the sketch below):
  residual ⇔ negative gradient
  fit h to residual ⇔ fit h to negative gradient
  update F based on residual ⇔ update F based on negative gradient
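A minimal sketch of the residual-fitting loop for square loss, using scikit-learn regression trees as the base learners (an assumed dependency; the data and constants are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

lr, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())               # initial model: a constant
trees = []
for _ in range(n_rounds):
    residual = y - F                         # = negative gradient of square loss
    h = DecisionTreeRegressor(max_depth=2).fit(x, residual)  # fit h to residual
    F += lr * h.predict(x)                   # update F based on the residual
    trees.append(h)
```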
What Algorithm Does XGBoost Use?
• The XGBoost library implements the gradient boosting decision tree algorithm.
• In the XGBoost package, at the t-th step we are tasked with finding the tree Ft that minimizes the objective function

  Obj(Ft) = L(Ft) + Ω(Ft)

  where L(Ft) is our loss function and Ω(Ft) is our regularization function.
• Regularization: XGBoost has an option to penalize complex models through both L1 and L2 regularization, which helps prevent overfitting (see the sketch below).
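A usage sketch with the xgboost package (an assumed dependency); reg_alpha and reg_lambda switch on the L1/L2 penalties, and the data is synthetic:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      reg_alpha=0.1,    # L1 penalty on leaf weights
                      reg_lambda=1.0)   # L2 penalty on leaf weights
model.fit(X, y)
print(model.score(X, y))
```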
Bagging vs Boosting
Bagging (Bootstrap Aggregating):
•Trains multiple models independently on different
subsets of data (with replacement).
Boosting:
•Trains models sequentially, focusing on examples
that previous models misclassified. Adjusts weights
of misclassified examples.
Basics of Probability
Conditional probability:
  Formula: P(A|B) = P(A∩B) / P(B)
  Meaning: the probability of event A occurring given that event B has occurred.
Joint probability:
  Formula: P(A∩B)
  Meaning: the probability of both events A and B occurring simultaneously.
Bayes' theorem:
  Formula: P(A|B) = P(B|A)·P(A) / P(B)
Markov Models
• Set of states: {s1, s2, …, sN}
• The process moves from one state to another, generating a sequence of states si1, si2, …, sik, …
• Markov chain property: the probability of each subsequent state depends only on the previous state:
  P(sik | si1, si2, …, sik−1) = P(sik | sik−1)
• To define a Markov model, the following probabilities have to be specified: the transition probabilities aij = P(si | sj) and the initial probabilities πi = P(si)
• For hidden Markov models, also a matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si)

[Figure: a weather example with hidden states Low and High (pressure) and observations Rain and Dry; transition probabilities such as 0.3/0.7 and 0.2/0.8 and observation probabilities such as 0.6/0.4 label the arcs. A simulation sketch follows.]
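A minimal simulation sketch of a two-state Rain/Dry chain; the transition values are taken from the example figure, and the initial distribution is an assumption:

```python
import numpy as np

states = ["Rain", "Dry"]
A = np.array([[0.3, 0.7],    # row i: P(next state | current state i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial probabilities (assumed uniform)

rng = np.random.default_rng(0)
s = rng.choice(2, p=pi)
seq = [states[s]]
for _ in range(9):           # Markov property: next state depends only on current one
    s = rng.choice(2, p=A[s])
    seq.append(states[s])
print(" -> ".join(seq))
```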
Using HMMs
• Evaluation problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the probability that model M has generated sequence O. (Forward-backward algorithm; see the sketch below.)
• Decoding problem. Given the HMM M = (A, B, π) and the observation sequence O = o1 o2 … oK, calculate the most likely sequence of hidden states si that produced this observation sequence O. (Viterbi algorithm.)
• Learning problem. Given some training observation sequences O = o1 o2 … oK and the general structure of the HMM (the numbers of hidden and visible states), determine the HMM parameters M = (A, B, π) that best fit the training data. (Baum-Welch algorithm.)
• O = o1 … oK denotes a sequence of observations, with each ok ∈ {v1, …, vM}.
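As an illustration of the evaluation problem, a minimal numpy sketch of the forward algorithm; the transition and observation numbers follow the weather example figure, and the initial distribution is an assumption:

```python
import numpy as np

A  = np.array([[0.3, 0.7],   # hidden-state transitions: rows = Low, High
               [0.2, 0.8]])
B  = np.array([[0.6, 0.4],   # observation probs: P(Rain|state), P(Dry|state)
               [0.4, 0.6]])
pi = np.array([0.5, 0.5])    # initial distribution (assumed)

obs = [0, 0, 1]              # Rain, Rain, Dry

alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]    # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
print(alpha.sum())                   # P(O | M)
```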
User Behavior Monitoring
User behavior monitoring is a cybersecurity practice focused on observing,
analyzing, and understanding the actions and activities of individuals
within a computer system or network.
Significance in Cyber security
• Early Threat Detection
• Prevention of Insider Threats
• Enhanced Incident Response
• Compliance and Governance
Key Components of User Behavior Monitoring
User Activity Logs:
User Authentication Patterns
Privilege Escalation Monitoring
Access Patterns and Permissions
Endpoint Behavior Analysis
Density-Connectivity
• Density-reachability is not symmetric, so it is not good enough to describe clusters.
• Density-connected: a pair of points p and q are density-connected if they are commonly density-reachable from a point o.
• Density-connectivity is symmetric.

[Figure: points p and q, each density-reachable from a common point o, are density-connected]
DBSCAN Algorithm
Input: the data set D
Parameters: ε, MinPts
For each object p in D:
  if p is a core object and not processed then
    C = retrieve all objects density-reachable from p
    mark all objects in C as processed
    report C as a cluster
  else mark p as outlier
  end if
End for
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups:
• A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are in the interior of a cluster.
• A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.

[Figure: example with ε = 1 unit and MinPts = 5, showing core, border, and outlier points]
DBSCAN: The Algorithm
• Arbitrarily select a point p.
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
• If p is a core point, a cluster is formed.
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed (a usage sketch follows).

[Figure: resulting clusters]
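A usage sketch with scikit-learn's DBSCAN (an assumed dependency), mapping ε and MinPts to the eps and min_samples parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=4.0, scale=0.3, size=(50, 2))
noise = rng.uniform(-2, 6, size=(5, 2))
X = np.vstack([blob1, blob2, noise])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(set(labels))    # cluster ids; -1 marks noise/outlier points
```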
Neural Networks
Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element (the linear threshold unit).

[Figure: a feed-forward network with input, hidden, and output layers]
Neural Networks
Neural networks are functions NN: X → Y, providing a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions.

[Figure: an example network with trainable parameters (weights) feeding threshold units T3, T4, T5]
The Backpropagation Algorithm
Create a fully connected three-layer network. Initialize the weights.
Until all examples produce the correct output within ε (or another stopping criterion is met), for each example in the training set do:
1. Compute the network output for this example
2. Compute the error between the output and the target value
3. For each output unit k, compute its error term
4. For each hidden unit, compute its error term
5. Update the network weights
End epoch

(A minimal numpy sketch of these steps follows.)
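A minimal numpy sketch of these steps for one sigmoid hidden layer with square loss; the architecture, data, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # XOR-like target

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.5

for epoch in range(2000):
    # Steps 1-2: forward pass and error
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Step 3: output-unit error term, delta_k = (o - t) * o * (1 - o)
    d_out = (out - y) * out * (1 - out)
    # Step 4: hidden-unit error term, delta_h = (sum_k delta_k w_kh) * h * (1 - h)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Step 5: weight updates (averaged over the batch)
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);  b1 -= lr * d_h.mean(axis=0)
```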
Recurrent Neural Networks (RNNs)
Definition of the RNN:

[Figure: an unrolled RNN. At each step t, the hidden state ht is computed from the input xt and the previous hidden state ht−1, and produces the output yt.]
Types of RNN

LSTM Network Architecture
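A minimal sketch of an LSTM classifier in Keras (assuming TensorFlow is available; the sequence length, feature count, layer sizes, and random data are placeholders):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

T, F = 20, 8                                # timesteps, features per step
X = np.random.rand(100, T, F)               # placeholder sequence data
y = np.random.randint(0, 2, size=100)       # placeholder binary labels

model = keras.Sequential([
    layers.LSTM(32, input_shape=(T, F)),    # hidden state carries info across timesteps
    layers.Dense(1, activation="sigmoid"),  # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```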
