Recommendations:
Go through the burger menu and analyze the details page of each option:
Pay attention to Settings and Deployment, e.g. PaaS, AWS, and deploying the agent.
Most of the questions are common sense: if you just read through the options you will quickly tell
which answer is correct. Around a quarter of them include screenshots, where you have to
analyze what’s going on.
I have also added both decks. The PDP ones were enough for me; however, if you want more detail,
look at the vILT ones as well. Any questions, as always, reach out.
Architecture
• Session storage: file based
• Default quota: 30 GB
• Time series: Cassandra
• 1 minute intervals – 14 days
• 5 minute intervals – 28 days
• 1 hour intervals – 400 days
• 1 day intervals – 5 years
• Visits: Elasticsearch
• 30 days, can be extended
• Agent listen ports: 443/8443/9999
• What are Secure Gateways used for?
• Connection bundling (for firewall rule simplicity) and network traffic aggregation
• What is the difference between Managed and SaaS?
• Managed runs within the customer’s environment and requires an outbound-only connection
OneAgent
• Host monitoring
• Process monitoring
• Network monitoring
• Log file monitoring*
• Application monitoring for Java, .NET, PHP on Linux and Node.js*
• Web Server monitoring for Apache, IIS and Nginx*
• Plugin execution*
• Feels like one Agent for customers because of
• Single installer
• Auto update or one-touch manual update
• Different capabilities are completely abstracted
• When you need the details:
• Customer sees different libraries loading or several services running
• Customer sees several uplinks
Root Access
• Installer
• Installing Process Agent library in a system library directory
• Setting up /etc/ld.so.preload for injecting Process Agent globally.
• Modifying SELinux policies to allow global injection of Process Agent.
• Agent (OS Agent) needs root rights for
• Accessing list of open sockets for every process
• Accessing list of libraries loaded for every process
• Accessing name and path of executable file for every process
• Accessing command line parameters for every process
These are necessary for horizontal topology, correlating network agent data with processes and
process type recognition.
• Agent for Network needs root rights for
• Initially opening raw socket to capture network traffic. After initialization the root rights
can be dropped.
Smartscape
Application
1. User experience as measured at the endpoint, such as a browser or mobile device
2. How software is presented to the end user
Service
1. A set of code that accepts requests and returns results
2. The result of instrumenting a process
3. The “code layer” which requires “deep dive”
Process
1. A currently executing computer program
2. A means for code to request computing resources
Host
1. A physical or virtualized operating system
2. The source of compute, memory, and storage resources
Topology
• Host
• PGI (Process Group Instance), in principle a continuous representation of a process on a
host
• PG (Process Group), a logical group of PGIs that belong to the same family, e.g. 5 Tomcats
forming a PG
• Service Instance (SI), one service discovered and running on a PGI
• Service, a logical group of Service Instances that serve one Service as a cluster on
distributed PGIs
• Application (not shown in diagram)
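To make the containment between these entities easier to remember, here is a minimal sketch of the topology as plain data classes. All class, field, and function names are made up for illustration and are not Dynatrace API names; the entity definitions follow the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServiceInstance:
    # Name of the logical Service this instance belongs to.
    service: str


@dataclass
class ProcessGroupInstance:
    # Name of the logical Process Group (PG) this PGI belongs to.
    process_group: str
    # Host the PGI runs on.
    host: str
    # Service Instances discovered on this PGI.
    service_instances: List[ServiceInstance] = field(default_factory=list)


def group_by_process_group(
    pgis: List[ProcessGroupInstance],
) -> Dict[str, List[ProcessGroupInstance]]:
    """A PG is the set of PGIs that share the same process group name."""
    groups: Dict[str, List[ProcessGroupInstance]] = {}
    for pgi in pgis:
        groups.setdefault(pgi.process_group, []).append(pgi)
    return groups


def hosts_serving(pgis: List[ProcessGroupInstance], service: str) -> List[str]:
    """A Service is the cluster of its Service Instances across distributed PGIs."""
    return sorted(
        {
            pgi.host
            for pgi in pgis
            if any(si.service == service for si in pgi.service_instances)
        }
    )
```

For example, two Tomcat PGIs on different hosts form one PG, and a service running on both of them is one Service with two Service Instances.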
Traversing your Stack
Dynatrace Network Monitoring – What is it?
Covers 3 fundamental aspects of network communication
Which processes consume most of my network resources?
Network utilization breakdown
Which processes experience network degradation problems?
Network quality – retransmissions
Can everyone talk and connect to their parties?
Process network connectivity
Environment agnostic – any TCP/Ethernet communication can be monitored:
physical, virtual, cloud
Dynatrace agent installs the pcap/WinPcap library (open source)
If pcap is installed – we use what’s already there
Pcap allows listening to all packets on Ethernet interface(s)
Network monitoring is executed as a separate service (process): the “OneAgent network monitoring”
service
OneAgent intelligence allows matching observed traffic to running processes to provide unique
value
Network monitoring process has negligible overhead, depending on the traffic
Customers with up to 100-150Mbps of traffic have seen 0.5% CPU consumption
Internal throttling mechanisms disable functionality in case of problems (> 5% CPU overhead observed)
What are process groups?
Cluster of processes belonging together
Tomcat cluster, JBoss cluster, WebSphere cluster
Run the same software
Service boundaries (later)
Should be stable, continuous! (Deployment, version upgrade)
Processes vs Process group instance
A process group on a host is a process group instance
Normally one process with chart continuity (restart, crash, redeploy)
What are services
Server-side code executed within a process group
Web containers, Web services or custom code that customer deployed
“~Agent tiers”
All server-side requests are monitored via services
All code level information of requests is in services
Services are “detected” on a PurePath
Services = entry points
What are key requests
Requests can be marked as ‘Key Request’
Key Requests get special privileges
Custom thresholds – think SLAs
Historical data guaranteed – data kept in Cassandra
Always baselined by AI
Databases in Dynatrace
Critical to understand overall query performance for FDI
Critical to understand which single query is causing an issue
Treated as external service – monitor calls rather than DB processes themselves
Queries are treated as a Service – able to implement baselines
Database process treated as a Process – view log files, availability alerting
Capable of getting process metrics via OOTB plugins
Service Analysis
– Performance Details
What is it?
Detailed overview of a service’s performance
Starting point for further analysis
When would I use it?
Understand the overall performance over time
Beginning manual hotspot and failure analysis
Landing view for several problem root causes
– Service Flow
What is it?
Overview of all services and queues that a selected service makes requests to and the
time spent within those services
When would I use it?
Understand the call chain sequence of a service
View all the response time contributors for a service
– Service Backtrace
What is it?
A view that shows information about who makes calls to a particular service
When would I use it?
Understand what services call the selected service
Analyze the performance of a service from the perspective of the calling clients.
– Response Time Hotspots
What is it?
A feature that allows you to break down time spent in any service or even individual
requests
When would I use it?
Performance analysis of any instrumented service
Understand total impact of code, DB queries, calls to other services
Analyzing performance degradation during problems related to a service or request
– Response Time Distribution
What is it?
A feature that allows you to quickly view the variance in request duration
When would I use it?
Easily view performance outliers and pick them out for deeper analysis
Quickly view changes in performance duration during problems vs normal behavior
– Failure Analyzer
What is it?
View the details of failures occurring at any instrumented tier.
When would I use it?
Ad hoc failure analysis of any instrumented service
Analyzing failure rate increase during problems related to a service or request
RUM
When? Agentless vs agent-based RUM
Agentless RUM – manually injected
No root access
No Dynatrace RUM-supported server-side technology
Hosted web application
Agent-based RUM
Whenever possible – correlation of server side, agent hours and visits
What is Apdex Rating?
Apdex is a universal standard that is used to measure user satisfaction with application
performance.
Default threshold for all apps is 3 seconds
Define your own thresholds for user satisfaction for individual actions
How it works –
Value = 1 Perfect!
Value < 0.5 Poor
Benchmarking and comparison
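As a rough illustration of how a value between 0 and 1 comes about, the sketch below applies the standard Apdex formula: samples at or below the threshold T count as satisfied, samples between T and 4T count half as tolerating, and anything slower counts as frustrated. The function name `apdex` and the sample durations are made up for illustration; the 3-second default is the threshold noted above. This is a sketch of the public Apdex definition, not of Dynatrace internals.

```python
from typing import Sequence


def apdex(durations_s: Sequence[float], threshold_s: float = 3.0) -> float:
    """Return the Apdex score for a set of response times in seconds."""
    if not durations_s:
        raise ValueError("at least one sample is required")
    # Satisfied: response time <= T
    satisfied = sum(1 for d in durations_s if d <= threshold_s)
    # Tolerating: T < response time <= 4T
    tolerating = sum(
        1 for d in durations_s if threshold_s < d <= 4 * threshold_s
    )
    # Frustrated samples (> 4T) contribute nothing to the numerator.
    return (satisfied + tolerating / 2) / len(durations_s)


# Two satisfied (1 s, 2 s), one tolerating (5 s), one frustrated (20 s):
# (2 + 1/2) / 4 = 0.625
score = apdex([1.0, 2.0, 5.0, 20.0])
```

A score of 1 means every sample was satisfied; a score below 0.5 means frustrated samples dominate.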
Mobile App
Android –
Gradle Plugin
Auto-Instrumentation
iOS –
CocoaPods
Libraries / -ObjC linker flag
Tagging
Add metadata to hosts, process groups, and services
Use cases
Filters
Alert notifications
Can be done
Manually
In bulk
Automatic
via Dynatrace API
AI
Graph-algorithm based correlation of events
Detected topology represents the basis for edge-weights
Vertical infrastructure topology
Horizontal call relation topology
Apply weights on graph edges
based on knowledge base
The ranking decides about the probability of root cause candidates, shown on the right.
Frequent Issue Detection
Hypothesis: There are unhealthy situations that are normal for ops
An unimportant disk has been full for several weeks
Regular backup process triggers CPU spikes
Ruxit detects such regular events
On basis of a daily and weekly moving window
Once discovered only notify user if severity increases
Frequent issue detection uses one week moving window
One week without frequent issue means reset to start
Reports
• There are two types of reports currently included in the product:
• Service quality reports
• Availability reports
• Reports are generated weekly, Sunday nights at Midnight
• The Unread filter on the Reports page will display only unread reports
• You can share reports with anyone. Just type an email address and click Send.
• When you share a report with a non-Dynatrace user, the user receives a message with a
private link that allows them to view your report without logging into Dynatrace.
Service Quality Reports
• Summarize the monitoring insights compiled over the past week
• Offer an overview of your applications, services, infrastructure utilization, performance
problems, and the impact of performance problems on your customers
• Give insights into hot spots and make it easy to share insights with others
• Reports are structured in such a way that even non-technical team members can understand
them.
• Reports include four sections: overall environment, applications, services, and infrastructure.
For each report section, a score shows you how well your monitoring stack components have
performed over the past week.
Event severity types: highest – Availability
lowest – Info
https://www.dynatrace.com/support/help/how-to-use-dynatrace/data-privacy-and-security/data-privacy/data-retention-periods#timeseries-metrics
Database reference default period – ready after 7 days
Browser clickpath interval – minimum granularity to see (“resolution”)
Host groups – anomaly detection uses the global settings when enabled. An individual host has its own
settings, and they can be overridden by these settings. Host groups are also used to group hosts and make it
easier to monitor a group of hosts.
Types of ActiveGates
Kubernetes and OpenShift need a private ActiveGate
Basic types: Environment and Cluster
Bundle OneAgent traffic
SaaS only – Environment ActiveGate
Managed – both types; Cluster is more important
ActiveGates do not store any log files
Maintenance windows
Problem detection during a maintenance window
Alert – do not alert, and disable
Timeframe for problem updates in the Dynatrace mobile app
Dynatrace mobile app – 72 hours to update the timeframe
Anomaly detection timeframe
Learns every day to get a pattern.
One week – accumulates 7 days of learning to create a reference
session duration
first action to last action. No activity is not computed.
4 minutes to be usable to analyse
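The session duration rule above (first action to last action, with nothing counted outside that span) can be sketched as follows. The function name and the timestamps are hypothetical and only illustrate the rule as written in these notes.

```python
from datetime import datetime, timedelta
from typing import Sequence


def session_duration(action_times: Sequence[datetime]) -> timedelta:
    """Duration from the first user action to the last user action."""
    if not action_times:
        raise ValueError("a session needs at least one user action")
    # Time before the first action or after the last action is not counted.
    return max(action_times) - min(action_times)


actions = [
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 10, 2, 30),
    datetime(2024, 1, 1, 10, 7, 0),
]
duration = session_duration(actions)  # 7 minutes
```

A session with a single user action therefore has a duration of zero under this rule.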
You can disable OneAgent monitoring on the individual host settings page
key/ value pair – to destination
Session end – user closes browser
Baseline time to be available – takes 2 hours to be ready to use
user session visible
time – 4 minutes
Mobile App time Update problems – 72 hours
Network connectivity metrics – established vs refused – in percentage
Connection attempts – connections refused – connections timeout
cookie – monitor site / website usage / track user
HTTP Monitor smallest time intervals – 1 minute
Conversion goals – session duration, destination – number of actions, user action
service reports time to be ready: each week – 7 days
Quality reports – successful
Synthetic locations – private -
Cookie on browser to identify user
Oneagent interaction
Management zones:
Supported browsers
The supported browser for the Dynatrace Synthetic Recorder is Google Chrome (latest version,
backwards compatible).
The browser used for executing browser monitors from public locations is listed on the
Frequency and locations page when you create or edit a browser monitor.
options environment privacy
IP address gps coordinates
personal data on site
mask user actions
Network Overview page
traffic – connectivity – retransmissions
ex agentless code injection on webpage
Key request 1 minute – 14 days
Conversion goals – destination, user action, session duration, number of actions
Mobile application screenshots
app usage: user experience, crashes, network performance, called services
Browser clickpath time – 5 minutes
5% above CPU – stop monitoring
Network metrics
network traffic, responsiveness, connectivity
https://www.dynatrace.com/support/help/how-to-use-dynatrace/hosts/monitoring/measures-for-host-health
Script mode for HTTP monitor configuration - json
In addition to the configuration in the UI (Visual mode), you can use Script mode to
configure your HTTP monitors. In this mode, you can access the underlying JSON script
of your monitor. If you're a Synthetic Monitoring power user, this will make your life a lot
easier and allow you to speed up monitor creation and management. Use the script
editor to quickly find specific events (steps) or adapt locators across the entire script.
Memory metrics – memory used, page faults
processes vs services
services duplicated problem in detection
needs to be adjusted in custom process groups detection
Anomaly detection for services
• Service load drops
• Response time
• Service load spikes
• Failure rate
Reference period – 7 days
Dynatrace compares current behavior to a reference period up to 7 days. If the
data in this period is invalid (e.g. because of major architectural changes) you
can reset this period and Dynatrace will immediately adapt to the new setup.
Current period: 7 days
mark to pin to dashboard user action
event types:
https://www.dynatrace.com/support/help/how-to-use-dynatrace/problem-detection-and-analysis/basic-concepts/event-types
reports
log time to auto discovery - 60 secs
Default metrics for mobile applications – app usage
Listed on the Hosts page: virtual machines, tags created like process groups, problems, physical
machines with OneAgent
Opaque services – Dynatrace can detect them
How does Dynatrace detect external services like unmonitored hosts? – When they send and
receive requests
It is incorrect because Dynatrace can detect external dependencies by requests
opaque
synthetic monitor – can use last 24 hours
synthetic monitor types
performance thresholds – 24 hs
Baseline cube – baseline calculation 2 hours to database everyday
Baseline for application vs baseline for services
External service definition
https://www.dynatrace.com/support/help/how-to-use-dynatrace/transactions-and-services/service-detection-and-naming
Time series Managed metrics – kept for 5 years
A traffic spike error needs to be running at 20% in one week to raise alerts
Key user actions
With the key user action feature, you can customize the Apdex thresholds for each of
these user actions. You can use this feature to monitor key actions with a dedicated
dashboard tile and track historical trends.
• Mark key action
• change
web request services
user cookie deleted new session starts
Network overhead – pauses for 3 and doubles the target if CPU usage keeps increasing, up to a 45 min
pause
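The pause-and-double behavior in the note above is a capped exponential backoff. The sketch below lists the resulting pause lengths under two assumptions that are not in the notes: the initial "3" is in minutes (only the 45-minute cap carries a unit), and the pause doubles on every consecutive trigger. The function name is hypothetical, and the real OneAgent throttling logic may differ.

```python
from typing import List


def pause_schedule(initial_min: int = 3, cap_min: int = 45) -> List[int]:
    """Pause lengths in minutes: doubled on each trigger, capped at cap_min."""
    pauses = []
    pause = initial_min
    while pause < cap_min:
        pauses.append(pause)
        pause *= 2  # CPU usage kept increasing: double the pause
    pauses.append(cap_min)  # never pause longer than the cap
    return pauses


schedule = pause_schedule()  # [3, 6, 12, 24, 45]
```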
Events and problems
OS__UNEXPECTED
host autotag
host metadata:
Configure key user action Apdex
Create the key user action first – then edit the created key user action
memory dumps
– 2 hours
map ip location to private ips
Host oneagent updates
Oneagent update setting – global
or individual on host
problem severity
Smartscape needs 72 hrs
process groups – same tasks – same tech
Dynatrace RUM (real user monitoring) – user actions – application contained in user session
user session – user actions for multi applications
granularity table – 72 days 1 hour
Host groups creation
oneagent install – oneagentctl cmd
Live user sessions differ from completed ones by color
Management zone – separates resources so that, for a given user, some are blocked and others are not. Gives
someone permission to see them
monitor PHP on a single host – need to disable global to enable in only one
metrics performance key
• User action duration
• Visually complete
• Speed index
• DOM interactive
• Load event end
• Load event start
• HTML downloaded
• Time to first byte
• Largest contentful paint
• Cumulative layout shift
• First input delay (RUM web only)
synthetic settings – NOT O/S
responsetime hotspots at services
Metadata – auto
Tags – manual
cumbersome
RUM
statements about user session properties
It is available for mobile apps – it can record mobile apps
user session contains all
application is all user actions+
application
application
answer time degradation – 10% of all requests
Automatic hotspot analysis
auto hotspot – cpu intense
users that are affected by crashes
key action – extends retention time of full historical data
Differences between Process and Host
custom service, what can you use to define an entry point?
Methods and integration
User auth options in managed
VMWare – oneagent needs to be installed in each of VMs
OneAgent sends data through port 8443
JVM must be restarted to work
tags to process groups
WRONG! YOU CAN TAG PROCESS. THIS QUESTION WAS UPDATED
Analyze crash error using log viewer finding out more about the crash
Vmware – Openstack – private active gate for SaaS
monitor database statements – database calls
If you increase thresholds you are compromising monitoring. Nothing in monitoring can be
increased, because of the SLAs and SLOs defined.
detection thresholds can be overridden – breaCHES THRESHOLDS
greater and higher
VMWare Panel
KEY ACTION – THRESHOLD - EXPECTATIONS!
you cannot manage databases – sql server – no manage!
FILTER USER SESSION FOR A BROWSER – by browser
AI anomaly detection does it automatically
Visual – sunburst
mobile app crash grouped by:
Event is raw data. Problems contain events. AI analyzes and classifies those events.
AI correlates such events to identify a root cause.
Events need a severity level in order to avoid spurious alarms. A simple event cannot be a
problem.
Problems contain correlated events.
Having external dependencies does not mean that you cannot access the code.
It shows correlated entities and analyzes dependencies among events. This is done by Davis.