Db2 Performance
Triage for Db2 performance problems
Toby Haynes, Matt Emmerton
IBM Canada
1 © 2016 IBM Corporation
Introduction
●
Getting Started
●
Data collection
●
Deductions
●
Finding the right people
●
Solutions
2 © 2017 IBM Corporation
Getting Started
3 © 2017 IBM Corporation
Getting Started
●
Ensure that the problem is clearly stated
– “My database is slow” is not a useful description
– Performance support is NOT a free database tuning service
●
Find out when the problem started
– Knowing WHEN a problem started can make the analysis MUCH easier
– Changes in OS, configuration, new database deployments, new hardware, new
storage solutions – you NEED to know.
●
Some customers will say “nothing” has changed
●
You may need to collect evidence and ask about changes in specific areas later
●
Collecting the right information the first time is KEY.
4 © 2017 IBM Corporation
Setting expectations
●
Performance work is rarely “Just One Thing”
●
Don’t assume that the first “issue” resolved will “fix” the
problem
●
Many performance investigations point to a list of possible
causes
– Keep good notes
– Iterate on ONE issue at a time
– Making many changes at once wastes more time than it saves
5 © 2017 IBM Corporation
What is important to the customer?
●
Customers often care about specific metrics
– Specific query elapsed time (good)
– Commits per second (good)
– Bufferpool hit ratios (mostly good but not the whole story)
– Latch wait times (sometimes too specific...)
●
Get a clear statement on what performance is expected
– Previous “good” data is preferable
– Beware “impossible” demands
6 © 2017 IBM Corporation
Tactics
●
The game plan is:
– Collect the evidence
– Think (part I)
– Identify one or more areas that are “bad”
– Think (part II)
– Get the right people involved
– Solve
– Close
7 © 2017 IBM Corporation
Tactics
●
Always start at the largest scales
– Analysis starts looking at measurements across each of the database members
●
Identify “Big” areas
– CPU
– Wait Time
– Extreme I/O
●
Drill down
– Follow the most obvious lead
– Be prepared to back up and move on
8 © 2017 IBM Corporation
Data Collection
9 © 2017 IBM Corporation
Collect the evidence
●
db2support
– -purescale
●
Make sure this is used for ANY purescale cluster
●
100% Critical for any complaint about database recovery time
– Must be collected shortly after any recovery event (e.g. reboot)
– -collect 0
●
Configuration, catalog, db2look and db2set
– -collect 1
●
Adds formatted explain
10 © 2017 IBM Corporation
Collect the evidence
●
db2mon
– A set of a shell script and multiple SQL scripts, run to collect current activity on a Db2
instance
●
Supports pureScale and BLU, different versions for 9.7, 10.1, 10.5 and 11.1
– Oracle-mode ORA supported as well
●
Available from the Db2 performance team
– Do NOT send the perl script out to customers – just the generated files
– Db2mon hosted in IBM GitHub Enterprise
●
Updated sporadically – changes are coming soon
●
Likely to be available in sqllib/samples/perf from v11.1 mod3 fixpack3
11 © 2017 IBM Corporation
Db2mon
●
Successor to psmon
– Uses MON_GET_* table functions to obtain extensive data set
– Automatically calculates the differences between the performance counters
– Does not use snapshots
●
Snapshots are deprecated
●
Covers many aspects of database performance
– Resource usage
– Processing time
– Wait time
– Hit ratios
– I/O times
– Configuration
12 © 2017 IBM Corporation
Db2mon
Battle-tested in multiple
●
Aggregate CF command execution counts
Bufferpool data and index hit ratios
●
●
●
Bufferpool read statistics
●
Bufferpool sizes
critsits
●
Bufferpool statistics by tablespace
●
Bufferpool write statistics
●
CF-side command execution counts and average response times
●
CF-side total command execution counts
●
CF structure size information
CF system resource information
More than 60 reports
●
●
Client activity (active connections have at least 1 stmt s)
● ●
Combined results from 29 tables
●
Combined results from 58 tables
●
Count of group bufferpool full conditions
●
Current lock waits at end of capture
●
Currently executing SQL at start of capture (non-zero metrics only)
●
Database configuration settings
●
Database log read times
●
Database log write times
●
Database manager configuration settings
●
Database memory set information @ end
●
Database system resource usage information
●
DB2 registry variable settings
●
Deadlocks, lock timeouts and lock escalations
●
Disk read and write I O times
●
Group bufferpool data and index hit ratios
●
Group bufferpool invalid page statistics
●
Group bufferpool IO statistics by tablespace
●
Instance information
●
IO statistics per stmt - top statements by execution time
●
Latch wait metrics
●
Memory pool information @ end
●
Other database log statistics
●
Page reclaim metrics for index and data pages
●
Page reclaim metrics for SMP pages
●
Partial early aggregation distincts
●
Physical and logical page reads and writes at connection level
●
Processing times at database level
●
Round-trip CF command execution counts and average response times
●
Row level statistics per stmt - top statements by execution time
●
Sequences information
●
Size of database
●
Sort metrics at database level
●
Sort statistics per stmt - top statements by execution time
●
Tablespace data page prefetching statistics for group bufferpool
●
Tablespace prefetching statistics
●
Tablespace properties
●
Tablespace to bufferpool mapping
●
Tablespace usage over monitoring interval
●
Throughput metrics at database level
●
Time breakdown at database level (wait + processing)
●
Time spent executing by package
●
Top SQL statements by execution time, aggregated by PLANID
●
Top SQL statements by execution time
●
Various metrics at connection level
●
Various table level metrics
●
Wait time breakdown for top SQL statements by execution time
●
Wait times at connection level
●
Wait times at database level
●
Workload balancing server list
13 © 2017 IBM Corporation
Db2mon - dos and gotchas
●
db2mon needs
– MON_REQ_METRICS BASE or EXTENDED
– MON_ACT_METRICS BASE or EXTENDED
– MON_OBJ_METRICS BASE or EXTENDED
●
Db2mon can collect data over
– run for N seconds or
– “pause” until “Enter”
●
Counters can wrap over long periods of time
– Negative numbers in time columns are a BIG hint
– Impossible values are another
●
Rule of thumb: Keep the collection time under 5 minutes
14 © 2017 IBM Corporation
Deductions
15 © 2017 IBM Corporation
Capacity or configuration?
●
Look at the system
– CPU usage, specifically user+sys and IO wait
– Physical Memory (and swap usage)
– Disk I/O
– Network I/O
– Any virtualization
●
Most customers have some monitoring on some or all of these areas
– Get it if it exists
– DSM may hold historical data if deployed
●
Take notes of any part of the system that is showing high
usage/utilization
16 © 2017 IBM Corporation
Database or application
●
Slow queries or low transaction rates can be due to many factors
– Server (or part of the server)
– Network between client and server
– Client application
●
Compare
– Time spent from one request to the next
●
CLIENT_IDLE_WAIT_TIME
– Time spent on the server per request
●
TOTAL_RQST_TIME
●
It’s a client or networking issue if this ratio is too high
– CLIENT_IDLE_WAIT_TIME / TOTAL_RQST_TIME > 5
17 © 2017 IBM Corporation
Db2mon: Client Idle Wait Time
| MEMBER | TOTAL | TOTAL | TOTAL | TOT | ACTIVE | ACTIVE | ACTIVE | ACTIVE | ACTIVE |
| | CLIENTS | CIWT | RQST | CIWT | CLIENTS | RQ PER S | CIWT | RQST | CIWT |
| | | | | RQ | | | | | RQ |
| | | | | RATIO | | | | | RATIO |
|--------+---------+----------+---------+-------+---------+----------+----------+---------+--------|
| 0 | 1029 | 26976266 | 3062576 | 8.80 | 886 | 24376.47 | 24302776 | 2985602 | 8.13 |
| 1 | 838 | 25094808 | 2421562 | 10.36 | 731 | 21120.67 | 22000322 | 2326539 | 9.45 |
●
High Total CIWT indicates that the clients are not making
sequential requests quickly
– GOTCHA: max_coordagents is set to 120 – so roughly 6 connections are waiting for
each agent
18 © 2017 IBM Corporation
Waiting for …
●
Db2 provides WAIT TIME counters (mostly in milliseconds) for many
areas
– Total wait time can be found in IDLE_WAIT_TIME
– Some wait is inevitable (especially in pureScale)
– Investigate any area with high wait times
●
db2mon provides wait time
– Per database member
– Per application connection
– Per statement
●
Find the “big picture” wait and look to see where it is coming from
●
DSM dashboard includes wait time charts
19 © 2017 IBM Corporation
Wait Time Analysis
●
Start from “Wait times at database level”
●
Wait percentage can be around 70% on pureScale and be “fine”
●
Look for all the wait coming from ONE or TWO areas
– General:
●
Lock, Latch, Log Disk, Pool Read, Direct Writes, FCM, TCP/IP
– PureScale-specific
●
CF, Reclaims
20 © 2017 IBM Corporation
db2mon: Wait Time at Database Level
| MEMBER | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT |
| | RQST | LOCK | GLB | LTCH | LG | RCLM | CF | POOL | DIR | DIR | FCM | TCPIP | DIAG |
| | WAIT | | LOCK | | DSK | | | R | R | W | | | |
|--------+-------+-------+------+------+-------+------+-------+------+------+------+------+-------+------|
| 0 | 75.49 | 16.18 | 0.24 | 0.07 | 20.98 | 2.84 | 32.80 | 0.01 | 0.00 | 0.00 | 0.20 | 2.33 | 0.00 |
| 1 | 76.54 | 17.70 | 1.99 | 0.04 | 19.36 | 7.03 | 29.46 | 0.01 | 0.00 | 0.00 | 0.02 | 2.81 | 0.00 |
●
Moderately busy system
– 75% Request Wait time (RQST_WAIT)
●
30% CF message
●
20% Log Disk
●
17% Lock
21 © 2017 IBM Corporation
Wait Time per SQL statement
●
Extreme Wait in one/few statements
– Identify the area (lock/latch/CF/log disk)
– Identify any common features of the “bad” statements e.g.
●
Similar columns
●
Same table
●
Same tablespace
●
Same bufferpool
●
Consider:
– Obtain formatted explain plans for these statements and consult Optimizer team
22 © 2017 IBM Corporation
Hit Ratios
●
Hang on – surely performance is all about hit ratios in bufferpools
– Undersized bufferpools show up as
●
Normally shows up as high pool read time
●
Many direct reads → high direct read time
●
Db2mon has “Bufferpool data and index hit ratios”
– Index hit ratio is usually more important than data page hit ratios
– Aim for 95% or better
●
PureScale “Group bufferpool data and index hit ratios”
– Any hits here avoid disk I/O
– Good enough can be quite low (40%) for some workloads
●
Too many customers have too many bufferpools
– Stick to one bufferpool per page size
– Configure with AUTOMATIC size
23 © 2017 IBM Corporation
Too much work (CPU)
●
If the CPU utilization is very high on a system
– Look for HOT parts of the system in the db2mon reports
– “Processing Times at database level”
●
Compiling
●
Section processing
●
Sort processing
●
Commit
●
Rollback
●
Column (BLU) processing
●
Connection processing
24 © 2017 IBM Corporation
Most expensive SQL statements (CPU)
●
Performance of the whole system can be badly impacted by a
few bad statements
– Use the “Top SQL statements by execution time” report to
see the worst offenders
– BEWARE:
●
Section Time is in milliseconds
●
CPU time is in microseconds
– It is common for the top five statements to take half the work
25 © 2017 IBM Corporation
Most expensive SQL statements (OLTP)
●
For OLTP systems, we expect a few rows accessed per statement
execution
●
Check the “ Row level statistics per stmt” report:
– Queries with many rows modified or read should be checked
– especially look for suitable indexes and explain plans
26 © 2017 IBM Corporation
db2mon: Row level statistics per statement
| MEM | NUM | AVG | AVG ROWS | AVG | STMT TEXT |
| | EXEC | ROWS | READ | ROWS | |
| | | MOD | | RET | |
| | | | | | |
|-----+------+------+----------+-------+---------------------------------------|
| 1 | 80 | 0.0 | 232.0 | 229.0 | SELECT V.CATEGORY, V.KEY, V.VALUE FRO |
| 1 | 81 | 0.0 | 171.0 | 27.0 | SELECT CAST(P.ID AS INTEGER) AS ID, C |
| 1 | 7843 | 0.0 | 144.0 | 1.1 | SELECT R.ID, R.PARTY_ID, R.TYPE FROM |
| 1 | 6917 | 0.0 | 144.0 | 1.0 | SELECT COUNT(*) AS COUNT FROM FTM.RS_ |
●
Statements returning 1 row may be fine if
– Aggregation
– Distinct
●
Statements including count(*) will need appropriate indexes
27 © 2017 IBM Corporation
Most expensive SQL statements (Analytics)
●
Long running queries are common in Analytic workloads
– Gather Explain data
●
Talk to the Optimizer team
– Consider moving row → column organized tables
– QUERY_OPT may need tweaking per query or overall
– There are many db2set parameters for fine tuning
28 © 2017 IBM Corporation
Too much work (CPU)
●
Drill down to appropriate areas
– Compiling → e.g. package cache, query optimization levels
– Section processing → (this is where the work should be!)
●
Check for hot statements in the “Top Sql statements by execution time”
– Sort processing → e.g. sort heap configuration, explain plans
– Commit → e.g. page cleaners, disk I/O times
– Rollback → e.g. locks, deadlocks, contention and configuration
– Column (BLU) processing → e.g. memory, cdetrace
– Connection processing → e.g. network, connection pooling
29 © 2017 IBM Corporation
Latch wait
●
High latch wait may be
– “business as usual” or
– a need for better configuration
●
Latch wait above 10% is definitely worth examining
– Per latch under “Latch wait metrics” in db2mon
●
Usually one or two latches dominate the time waiting
– Latchtrace may be needed
30 © 2017 IBM Corporation
db2mon: Latch Wait Times (pureScale – 30% latch wait)
| MEM | LATCH NAME | TOT | TOT | TIME |
| | | EXT | EXT | PER |
| | | LATCH | LATCH | LATCH |
| | | WAIT | WAITS | WAIT |
| | | TIME | | MS |
| | | MS | | |
|-----+-----------------------------------------------+--------+--------+-------|
| 2 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 279240 | 191130 | 1.46 |
| 3 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 271005 | 190274 | 1.42 |
| 1 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 270853 | 196191 | 1.38 |
| 6 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 265425 | 197303 | 1.34 |
| 5 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 255242 | 194439 | 1.31 |
| 0 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 246942 | 186820 | 1.32 |
| 7 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 242398 | 192810 | 1.25 |
| 4 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 228679 | 193971 | 1.17 |
| 7 | SQLO_LT_SQLB_BPD__WARLatch | 9639 | 20566 | 0.46 |
| 1 | SQLO_LT_SQLB_BPD__WARLatch | 8096 | 20196 | 0.40 |
| 0 | SQLO_LT_SQLB_BPD__WARLatch | 8005 | 20494 | 0.39 |
| 2 | SQLO_LT_SQLB_BPD__WARLatch | 7887 | 19878 | 0.39 |
| 6 | SQLO_LT_SQLB_BPD__WARLatch | 7809 | 20047 | 0.38 |
| 4 | SQLO_LT_SQLB_BPD__WARLatch | 7716 | 19645 | 0.39 |
| 5 | SQLO_LT_SQLB_BPD__WARLatch | 7511 | 20411 | 0.36 |
| 3 | SQLO_LT_SQLB_BPD__WARLatch | 7258 | 19943 | 0.36 |
| 3 | SQLO_LT_SQLB_PTBL__pool_table_latch | 3142 | 1966 | 1.59 |
| 1 | SQLO_LT_SQLB_PTBL__pool_table_latch | 2528 | 1966 | 1.28 |
| 7 | SQLO_LT_SQLP_DBCB__transsem | 473 | 1575 | 0.30 |
| 3 | SQLO_LT_sqmObjMetricsTable__m_objMetricsLatch | 469 | 2337 | 0.20 |
| 4 | SQLO_LT_SQLP_DBCB__transsem | 406 | 1570 | 0.25 |
31 © 2017 IBM Corporation
db2mon: Latch Wait Metrics (GDPC – 1% latch wait)
| MEM | LATCH NAME | TOT | TOT | TIME |
| | | EXT | EXT | PER |
| | | LATCH | LATCH | LATCH |
| | | WAIT | WAITS | WAIT MS |
| | | TIME | | |
| | | MS | | |
|-----+--------------------------------------------------------+--------+-------+---------|
| 0 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 137679 | 1707 | 80.65 |
| 1 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 116643 | 1745 | 66.84 |
| 0 | SQLO_LT_SQLB_POOL_ADDRESS_INFO__pool_address_latch | 43214 | 31 | 1394.00 |
| 1 | SQLO_LT_SQLB_POOL_ADDRESS_INFO__pool_address_latch | 39632 | 31 | 1278.45 |
| 0 | SQLO_LT_SQLB_BPD__WARLatch | 19480 | 318 | 61.25 |
| 1 | SQLO_LT_SQLB_BPD__WARLatch | 16049 | 350 | 45.85 |
| 0 | SQLO_LT_SQLD_SEQ__seqLatch | 1901 | 14 | 135.78 |
| 0 | SQLO_LT_SQLI_INX_PAGE_CACHE__ipcLatch | 1666 | 4 | 416.50 |
| 1 | SQLO_LT_SQLD_SEQ__seqLatch | 986 | 10 | 98.60 |
| 1 | SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch | 804 | 21 | 38.28 |
| 0 | SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch | 460 | 16 | 28.75 |
| 0 | SQLO_LT_SQLD_CHAIN__fullChainLatchNS | 112 | 10 | 11.20 |
| 1 | SQLO_LT_SQLE_GSS_HEADER__latch | 36 | 5 | 7.20 |
| 1 | SQLO_LT_SQLI_INX_PAGE_CACHE__ipcLatch | 11 | 2 | 5.50 |
| 0 | SQLO_LT_SQLE_GSS_HEADER__latch | 10 | 4 | 2.50 |
| 1 | SQLO_LT_NO_IDENTITY | 0 | 63 | 0.00 |
| 1 | SQLO_LT_SMemPool__MemLatchType__latch | 0 | 4 | 0.00 |
| 1 | SQLO_LT_SQLP_LHSH__hshlatch | 0 | 1 | 0.00 |
32 © 2017 IBM Corporation
Getting the right people
●
Fixing performance issues often involves more than just
Perf_Non_Opt
●
Secondary assistance often required
– Optimizer
– SAL
– Index
– Bufferpool
– DMS
33 © 2017 IBM Corporation
Solution Examples
34 © 2017 IBM Corporation
Extreme GDPC
●
Huge separation between sites GDPC cluster
– Sites separated by 65km cable
– 650 microsecond ping
●
Customer test suite runs in
– 2 minutes on Db2 ESE
– 30 minutes on GDPC cluster (better hardware)
35 © 2017 IBM Corporation
Extreme GDPC (continued)
●
Database level symptoms
●
60% → 90% Wait time (sporadic)
– Pool Read Wait Time 0% → 81%
– CF Wait Time 1% → 51%
– FCM Wait Time 2 → 57%
●
High Pool Read: enlarge bufferpool
– Result: Customer test suite runs in 4 minutes
36 © 2017 IBM Corporation
Monster 8-member pureScale cluster
●
Critsit
– Larger than average system
– 100% XA transactions
– Target:
●
Need 180,000 transactions per second
●
After extensive tuning, reached 115,000 transactions per second
●
Database level symptoms
– 80% Wait Time
●
33% Latch wait time
●
28% CF wait time
●
19% Log Disk wait time
37 © 2017 IBM Corporation
Monster 8-member pureScale cluster (continued)
●
Focus on each area seen in Wait Time
– Latch: SQLO_LT_SQLB_BPD__bpdLatch_SX
●
“Business as usual” Bufferpool Page latch
– Check CF execution times
●
High ReadCastoutClass times – up to milliseconds!
– Engage SAL and CF teams
●
Identified SINGLE mutex for all castout activity
●
Code change in ca-server
– Multiple mutexes controlled by environment variable
●
Other tuning
– Set QUERY_OPT back to 5 (customer had configured this to 3)
●
Result
– 202,000 transactions per second
38 © 2017 IBM Corporation
Resources
39 © 2017 IBM Corporation
Resources
●
People!
●
Dozens of presentations available from the Performance team
●
Best Practices, even five years old, are still valuable reading
●
Extensive tool sets available for chewing through performance
data
40 © 2017 IBM Corporation