IT Infrastructure Monitoring Guide

The document provides recommendations and details on various Dynatrace concepts: - It recommends analyzing burger menu options and deployment details like PaaS and AWS. Most questions just require reading the options to determine the obvious answer. - It describes Dynatrace architecture including session storage, time series storage, visits storage, agents, and secure gateways. - It explains the differences between managed and SaaS offerings and root access needs for the Dynatrace agent. - It provides overviews of key concepts like Smartscape, services, processes, hosts, and topology.

Uploaded by

antony vasquez
Copyright
© All Rights Reserved

Recommendations:

Go through the burger menu and analyze the details page of each option:

Pay attention to Settings and Deployment, e.g. PaaS, AWS, and deploying the agent.
Most of the questions are common sense: if you read through the options, you can quickly tell
which answer is obviously correct. Around a quarter of them include screenshots and require you to
analyze what is going on.

I have also added both decks. The PDP ones were enough for me; however, if you want more detail,
look at the vILT ones as well. As always, reach out with any questions.
Architecture

• Session Storage: file based
• Default quota 30 GB
• Time Series: Cassandra
• 1 minute intervals – 14 days
• 5 minute intervals – 28 days
• 1 hour intervals – 440 days
• 1 day intervals – 5 years
• Visits: Elasticsearch
• 30 days, can be extended
• Agent listen port 443/8443/9999
• What are Secure Gateways used for?
• Connection bundling – for firewall rule simplicity
• Network traffic aggregation

• What is the difference between Managed and SaaS?

• Managed runs within a customer’s environment and requires an outbound-only connection
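
The time series retention tiers listed above can be read as a lookup: the older the data, the coarser the finest interval still available. The following is a minimal sketch of that lookup, not Dynatrace code. The function name is hypothetical, and "5 years" is approximated as 5 × 365 days.

```python
from datetime import timedelta

# (interval, retention) pairs taken from the list above, finest first.
# Assumption: "5 years" is approximated as 5 * 365 days.
RETENTION = [
    (timedelta(minutes=1), timedelta(days=14)),
    (timedelta(minutes=5), timedelta(days=28)),
    (timedelta(hours=1), timedelta(days=440)),
    (timedelta(days=1), timedelta(days=5 * 365)),
]


def finest_interval(data_age):
    """Return the finest interval still retained for data of the given age.

    Returns None when the data is older than the longest retention tier.
    """
    for interval, kept_for in RETENTION:
        if data_age <= kept_for:
            return interval
    return None
```

For example, data that is 20 days old is no longer available at 1 minute resolution, so `finest_interval(timedelta(days=20))` returns the 5 minute interval.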

One Agent
• Host monitoring
• Process monitoring
• Network monitoring
• Log file monitoring*
• Application monitoring for Java, .NET, PHP on Linux and Node.js*
• Web Server monitoring for Apache, IIS and Nginx*
• Plugin execution*
• Feels like one Agent for customers because of
• Single installer
• Auto update or one-touch manual update
• Different capabilities are completely abstracted
• When you need the details:
• Customer sees different libraries loading or several services running
• Customer sees several uplinks
Root Access
• Installer
• Installing Process Agent library in a system library directory
• Setting up /etc/ld.so.preload for injecting Process Agent globally.
• Modify SELinux policies to allow global injection of Process Agent.

• Agent (OS Agent) needs root rights for
• Accessing list of open sockets for every process
• Accessing list of libraries loaded for every process
• Accessing name and path of executable file for every process
• Accessing command line parameters for every process
These are necessary for horizontal topology, correlating network agent data with processes, and
process type recognition.

• Agent for Network needs root rights for
• Initially opening a raw socket to capture network traffic. After initialization, the root rights
can be dropped.

Smartscape
Application
1. User experience as measured at the endpoint, such as a browser or mobile device
2. How software is presented to the end user
Service
1. A set of code that accepts requests and returns results
2. The result of instrumenting a process
3. The “code layer” which requires “deep dive”
Process
1. A currently executing computer program
2. A means for code to request computing resources
Host
1. A physical or virtualized operating system
2. The source of compute, memory, and storage resources
Topology
• Host
• PGI (Process Group Instance), in principle a continuous representation of a process on a
host
• PG (Process Group), a logical group of PGIs that belong to the same family, e.g. 5 Tomcats
forming a PG
• Service Instance (SI), one service discovered and running on a PGI
• Service, a logical group of Service Instances that serve one Service as a cluster on
distributed PGIs
• Application (not shown in diagram)
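
To make the relationship between these entities concrete, here is a small illustrative model, assuming nothing about how Dynatrace stores topology internally. The class and function names are hypothetical. It reuses the "5 Tomcats forming a PG" example from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessGroupInstance:
    """One process group as it runs on one host (a PGI)."""
    host: str
    process_group: str
    service_names: tuple = ()  # one Service Instance per name on this PGI


def process_groups(pgis):
    """Group PGIs into Process Groups: same family, possibly many hosts."""
    groups = defaultdict(list)
    for pgi in pgis:
        groups[pgi.process_group].append(pgi)
    return dict(groups)


def services(pgis):
    """Group Service Instances into Services spanning distributed PGIs."""
    result = defaultdict(list)
    for pgi in pgis:
        for name in pgi.service_names:
            result[name].append((pgi.host, pgi.process_group))
    return dict(result)


# Five Tomcats on five hosts form one Process Group; each runs one
# instance of a (hypothetical) "booking" service.
tomcats = [
    ProcessGroupInstance(f"host-{i}", "tomcat", ("booking",))
    for i in range(1, 6)
]
```

Here `process_groups(tomcats)` yields a single "tomcat" group with five instances, and `services(tomcats)` yields a single "booking" service backed by five Service Instances.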

Traversing your Stack

Dynatrace Network Monitoring – What is it?

Covers 3 fundamental aspects of network communication

• Which processes consume most of my network resources?
• Network utilization breakdown
• Which processes experience network degradation problems?
• Network quality – retransmissions
• Can everyone talk and connect to their parties?
• Process network connectivity
Environment agnostic – any TCP/Ethernet communication can be monitored:
• physical, virtual, cloud
• Dynatrace agent installs the pcap/winpcap library (open source)
• If pcap is installed – we use what’s already there
• Pcap allows listening to all packets on Ethernet interface(s)
• Network monitoring is executed as a separate service (process): “OneAgent network monitoring”
service
• OneAgent intelligence allows matching observed traffic to running processes to provide unique
value
Network monitoring process has negligible overhead, depending on the traffic
Customers with up to 100-150 Mbps of traffic have seen 0.5% CPU consumption
Internal throttling mechanisms disable functionality in case of problems (> 5% CPU overhead observed)

What are process groups?

• Cluster of processes belonging together
• Tomcat cluster, JBoss cluster, WebSphere cluster
• Run the same software
• Service boundaries (later)
Should be stable, continuous! (Deployment, version upgrade)

Processes vs Process group instance

• A process group on a host is a process group instance
• Normally one process with chart continuity (restart, crash, redeploy)
What are services
• Server-side code executed within a process group
• Web containers, Web services or custom code that the customer deployed
• “~Agent tiers”
• All server-side requests are monitored via services
• All code-level information of requests is in services
• Services are “detected” on a PurePath
• Services = entry points
What are key requests
• Requests can be marked as ‘Key Request’
• Key Requests get special privileges
• Custom thresholds – think SLAs
• Historical data guaranteed – data kept in Cassandra
• Always baselined by AI
Databases in Dynatrace
• Critical to understand overall query performance for FDI
• Critical to understand which single query is causing an issue
• Treated as external service – monitor calls rather than DB processes themselves
• Queries are treated as a Service – able to implement baselines
• Database process treated as a Process – view log files, availability alerting
• Capable of getting process metrics via OOTB plugins

Service Analysis
– Performance Details
• What is it?
• Detailed overview of a service’s performance
• Starting point for further analysis
• When would I use it?
• Understand the overall performance over time
• Beginning manual hotspot and failure analysis
• Landing view for several problem root causes
– Service Flow
• What is it?
• Overview of all services and queues that a selected service makes requests to and the
time spent within those services
• When would I use it?
• Understand the call chain sequence of a service
• View all the response time contributors for a service
– Service Backtrace
• What is it?
• A view that shows information about who makes calls to a particular service
• When would I use it?
• Understand what services call the selected service
• Analyze the performance of a service from the perspective of the calling clients.
– Response Time Hotspots
• What is it?
• A feature that allows you to break down time spent in any service or even individual
requests
• When would I use it?
• Performance analysis of any instrumented service
• Understand total impact of code, DB queries, calls to other services
• Analyzing performance degradation during problems related to a service or request
– Response Time Distribution
• What is it?
• A feature that allows you to quickly view the variance in request duration
• When would I use it?
• Easily view performance outliers and pick them out for deeper analysis
• Quickly view changes in performance duration during problems vs normal behavior
– Failure Analyzer
• What is it?
• View the details of failures occurring at any instrumented tier.
• When would I use it?
• Ad hoc failure analysis of any instrumented service
• Analyzing failure rate increase during problems related to a service or request
RUM
When? Agentless vs agent-based RUM
• Agentless RUM – manually injected
• No root access
• No Dynatrace RUM-supported server-side technology
• Hosted web application
• Agent-based RUM
• Whenever possible – correlation of server side, agent hours and visits
What is Apdex Rating?
• Apdex is a universal standard that is used to measure user satisfaction with application
performance.
• Default threshold for all apps is 3 seconds
• Define your own thresholds for user satisfaction for individual actions
• How it works –
• Value = 1 Perfect!
• Value < 0.5 Poor
• Benchmarking and comparison
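
As a worked illustration of how an Apdex value comes about, the sketch below applies the standard Apdex formula: samples at or below the threshold T count as satisfied, samples up to 4T count as tolerating (weighted one half), and anything slower counts as frustrated. The 3-second default comes from the notes above. The function name and the sample durations are made up for the example.

```python
def apdex(durations_s, threshold_s=3.0):
    """Compute an Apdex score in the range 0..1 from durations in seconds.

    satisfied:  duration <= T
    tolerating: T < duration <= 4T (counts half)
    frustrated: duration > 4T (counts zero)
    """
    if not durations_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for d in durations_s if d <= threshold_s)
    tolerating = sum(
        1 for d in durations_s if threshold_s < d <= 4 * threshold_s
    )
    return (satisfied + tolerating / 2) / len(durations_s)
```

With durations of 1 s, 2 s, 5 s and 13 s and T = 3 s, two samples are satisfied, one is tolerating and one is frustrated, so the score is (2 + 0.5) / 4 = 0.625.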
Mobile App
• Android –
• Gradle Plugin
• Auto-Instrumentation
• iOS –
• CocoaPods
• Libraries / -ObjC linker flag
Tagging
• Add metadata to hosts, process groups, and services
• Use cases
• Filters
• Alert notifications
• Can be done
• Manually
• In bulk
• Automatically
• via Dynatrace API
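
Tagging through the API might look roughly like the sketch below, which only builds the HTTP request and does not send it. Treat the details as assumptions to verify against the current API reference: the `/api/v2/tags` path, the `entitySelector` query parameter, the `Api-Token` authorization scheme and the payload shape are recalled from the Dynatrace Environment API v2, not taken from these notes. The base URL, token and tag values are placeholders.

```python
import json
import urllib.parse
import urllib.request


def build_tag_request(base_url, api_token, entity_selector, tags):
    """Build (but do not send) a request that adds custom tags to entities.

    Assumed endpoint: POST {base_url}/api/v2/tags?entitySelector=...
    Assumed body:     {"tags": [{"key": ..., "value": ...}, ...]}
    """
    query = urllib.parse.urlencode({"entitySelector": entity_selector})
    body = json.dumps(
        {"tags": [{"key": k, "value": v} for k, v in tags.items()]}
    )
    return urllib.request.Request(
        url=f"{base_url}/api/v2/tags?{query}",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; nothing is sent over the network here.
request = build_tag_request(
    "https://example.invalid/e/ENV_ID", "TOKEN", 'type("HOST")', {"team": "ops"}
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, once the endpoint and token scope have been confirmed.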
AI
• Graph-algorithm based correlation of events
• Detected topology represents the basis for edge weights
• Vertical infrastructure topology
• Horizontal call relation topology
• Apply weights on graph edges
• based on knowledge base
• The ranking decides the probability of root cause candidates, shown on the right.
Frequent Issue Detection
• Hypothesis: There are unhealthy situations that are normal for ops
• An unimportant disk has been full for several weeks
• Regular backup process triggers CPU spikes
• Ruxit detects such regular events
• On the basis of a daily and weekly moving window
• Once discovered, only notify the user if severity increases
• Frequent issue detection uses a one-week moving window
• One week without the frequent issue means reset to start
Reports
• There are two types of reports currently included in the product:
• Service quality reports
• Availability reports
• Reports are generated weekly, Sunday nights at Midnight
• The Unread filter on the Reports page will display only unread reports
• You can share reports with anyone. Just type an email address and click Send.
• When you share a report with a non-Dynatrace user, the user receives a message with a
private link that allows them to view your report without logging into Dynatrace.
Service Quality Reports
• Summarize the monitoring insights compiled over the past week
• Offer an overview of your applications, services, infrastructure utilization, performance
problems, and the impact of performance problems on your customers
• Give insights into hot spots and make it easy to share insights with others
• Reports are structured in such a way that even non-technical team members can understand
them.
• Reports include four sections: overall environment, applications, services, and infrastructure.
For each report section, a score shows you how well your monitoring stack components have
performed over the past week.

Event type severity: highest – availability; lowest – info
https://www.dynatrace.com/support/help/how-to-use-dynatrace/data-privacy-and-security/data-privacy/data-retention-periods#timeseries-metrics

Database reference period (default) – ready after 7 days

Browser clickpath interval – minimum granularity that can be seen (“resolution”)

Host groups – anomaly detection uses the global settings when enabled. An individual host has its
own settings, which can be overridden by these settings. Host groups are also used to group hosts,
which makes it easier to monitor a group of hosts.
Types of ActiveGates

Kubernetes and OpenShift need a private ActiveGate

Basic types: Environment and Cluster
Bundles OneAgent traffic
SaaS only – Environment ActiveGate
Managed – both types – Cluster is more important

ActiveGates do not store any log files


Maintenance windows

Problem detection during a maintenance window

Alert – do not alert and disable

Timeframe update for problems in the Dynatrace mobile app

Dynatrace mobile app – 72 hours to update the timeframe
Anomaly detection timeframe
Learns every day to get a pattern.
One week – accumulates 7 days of learning to create a reference

Session duration
From first action to last action. Periods of no activity are not counted.
4 minutes to be usable for analysis

You can disable OneAgent monitoring on the individual host settings page


key/ value pair – to destination

Session end – user closes browser

Baseline time to be available – takes 2 hours to be ready to use

User session visible

Time – 4 minutes
Mobile app time to update problems – 72 hours

Network connectivity metrics – established vs refused – in percentage

Connection attempts – connections refused – connections timeout
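
One plausible reading of the connectivity percentage is the share of connection attempts that were neither refused nor timed out. The sketch below works under that assumption; the exact formula Dynatrace uses is not stated in these notes, and the function name is hypothetical.

```python
def connectivity_percent(attempts, refused, timed_out):
    """Share of connection attempts that were established, in percent.

    Assumption: established = attempts - refused - timed_out.
    Returns None when there were no attempts at all.
    """
    if attempts <= 0:
        return None
    failed = refused + timed_out
    if failed > attempts:
        raise ValueError("failed connections exceed attempts")
    return 100.0 * (attempts - failed) / attempts
```

For instance, 200 attempts with 6 refused and 4 timed out gives a connectivity of 95%.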

cookie – monitor site / website usage / track user


HTTP monitor smallest time interval – 1 minute

Conversion goals – session duration, destination – number of actions, user action


Service reports time to be ready: each week – 7 days
Quality reports – successful

Synthetic locations – private -

Cookie on browser to identify user


OneAgent interaction

Management zones:
Supported browsers
The supported browser for the Dynatrace Synthetic Recorder is Google Chrome (latest version,
backwards compatible).
The browser used for executing browser monitors from public locations is listed on the
Frequency and locations page when you create or edit a browser monitor.
options environment privacy

IP address gps coordinates


personal data on site
mask user actions
Network Overview page
traffic – connectivity – retransmissions

e.g. agentless code injection on a web page

Key request 1 minute – 14 days

Conversion goals – destination, user action, session duration, number of actions


Mobile application screenshots
app usage: user experience, crashes, network performance, called services
Browser clickpath time – 5 minutes

Above 5% CPU – monitoring stops

Network metrics
network traffic, responsiveness, connectivity

https://www.dynatrace.com/support/help/how-to-use-dynatrace/hosts/monitoring/measures-for-host-health
Script mode for HTTP monitor configuration – JSON

In addition to the configuration in the UI (Visual mode), you can use Script mode to
configure your HTTP monitors. In this mode, you can access the underlying JSON script
of your monitor. If you're a Synthetic Monitoring power user, this will make your life a lot
easier and allow you to speed up monitor creation and management. Use the script
editor to quickly find specific events (steps) or adapt locators across the entire script.

Memory metrics – memory used, page faults

Processes vs services

Duplicated services are a detection problem

This needs to be adjusted in custom process group detection

Anomaly detection for services

• Service load drops


• Response time
• Service load spikes
• Failure rate 
Reference period – 7 days
Dynatrace compares current behavior to a reference period up to 7 days. If the
data in this period is invalid (e.g. because of major architectural changes) you
can reset this period and Dynatrace will immediately adapt to the new setup.
Current period: 7 days

Mark a user action to pin it to the dashboard

Event types:
https://www.dynatrace.com/support/help/how-to-use-dynatrace/problem-detection-and-analysis/basic-concepts/event-types

reports
Log auto-discovery time – 60 seconds

Default metrics for mobile applications – app usage

Listed on the Hosts page: virtual machines, tags created like process groups, problems, physical
machines with OneAgent

Opaque services – Dynatrace can detect them

How does Dynatrace detect external services like unmonitored hosts? – When they send and
receive requests

It is incorrect because Dynatrace can detect external dependencies by requests


opaque

Synthetic monitor – can use the last 24 hours

Synthetic monitor types

Performance thresholds – 24 hours

Baseline cube – baseline calculation: 2 hours, written to the database every day

Baseline for application vs baseline for services


External service definition

https://www.dynatrace.com/support/help/how-to-use-dynatrace/transactions-and-services/service-detection-and-naming

Time series metrics (Managed) – kept for 5 years

A traffic spike error needs to be running 20% in one week to raise alerts
Key user actions
With the key user action feature, you can customize the Apdex thresholds for each of
these user actions. You can use this feature to monitor key actions with a dedicated
dashboard tile and track historical trends.
• Mark key action

• change

web request services

User cookie deleted – a new session starts


Network overhead – pauses for 3 and doubles the pause if CPU usage keeps increasing, up to a
45 min pause
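
The pause behavior in the note above (start at 3, double while CPU usage keeps rising, cap at 45 minutes) resembles a capped exponential backoff. Below is a minimal sketch of the resulting pause sequence. It assumes that the initial "3" is also in minutes, which the note does not state, and the function name is hypothetical.

```python
def pause_schedule(first_pause_min=3, max_pause_min=45):
    """List successive pauses, doubling each time and capped at the maximum.

    Assumption: both values are in minutes; the note only gives a unit
    for the 45 minute cap.
    """
    schedule = []
    pause = first_pause_min
    while pause < max_pause_min:
        schedule.append(pause)
        pause *= 2
    schedule.append(max_pause_min)  # the next doubling would exceed the cap
    return schedule
```

With the defaults this yields pauses of 3, 6, 12, 24 and finally 45 minutes.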

Events and problems

OS__UNEXPECTED
host autotag

host metadata:

Configure key user action Apdex

Create the key user action first – then edit the created key user action

Memory dumps – 2 hours
Map IP locations to private IPs

Host OneAgent updates

OneAgent update setting – global or individual on host
Problem severity

Smartscape needs 72 hours

Process groups – same tasks – same technology

DRUM (real user monitoring) – user actions – application contained in a user session
User session – user actions for multiple applications
Granularity table – 72 days 1 hour

Host groups creation

OneAgent install – oneagentctl command

Live user sessions differ from completed ones by color

Management zone – separates resources so that some are blocked for a given user and others are
not. Gives someone permission to see them

Monitor PHP on a single host – need to disable the global setting to enable it on only one
Key performance metrics
• User action duration
• Visually complete
• Speed index
• DOM interactive
• Load event end
• Load event start
• HTML downloaded
• Time to first byte
• Largest contentful paint
• Cumulative layout shift
• First input delay (RUM web only)

synthetic settings – NOT O/S


Response time hotspots at services

Metadata – auto
Tags – manual
cumbersome
RUM
Statements about user session properties
It is available for mobile apps – it can record mobile apps

user session contains all


application is all user actions+
application

application
Response time degradation – 10% of all requests
Automatic hotspot analysis

Auto hotspot – CPU intense


users that are affected by crashes

key action – extends retention time of full historical data


Differences between Process and Host

Custom service: what can you use to define an entry point?

Methods and integration

User auth options in Managed

VMware – OneAgent needs to be installed in each of the VMs

OneAgent sends data through port 8443

JVM must be restarted to work

Tags to process groups

WRONG! You can tag processes. This question was updated

Analyze a crash error using the log viewer to find out more about the crash
VMware – OpenStack – private ActiveGate for SaaS

Monitor database statements – database calls

If you increase thresholds, you are compromising monitoring. Nothing in monitoring can be
increased because of the SLAs and SLOs defined.
Detection thresholds can be overridden – breaches thresholds
greater and higher

VMware panel
Key action – threshold – expectations!

You cannot manage databases – SQL Server – no manage!

Filter user sessions for a browser – by browser
AI anomaly detection does it automatically

Visual – sunburst
Mobile app crashes grouped by:

An event is raw data. Problems contain events. The AI analyzes and classifies those events.
The AI correlates such events to determine a root cause.
Events need a severity level in order to avoid pointless alarms. A simple event cannot be a
problem.
Problems contain correlated events.
Having external dependencies does not mean that you cannot access the code.
It shows correlated entities and analyzes dependencies among events. This is done by Davis.
