0% found this document useful (0 votes)
65 views40 pages

Handling Performance PMRs

The document outlines a systematic approach to diagnosing and resolving performance issues in Db2 databases. It emphasizes the importance of clear problem definition, thorough data collection, and iterative analysis to identify root causes. Key tools and metrics for performance evaluation, such as db2mon and various wait time analyses, are discussed to aid in effective troubleshooting.

Uploaded by

kanmanda1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
65 views40 pages

Handling Performance PMRs

The document outlines a systematic approach to diagnosing and resolving performance issues in Db2 databases. It emphasizes the importance of clear problem definition, thorough data collection, and iterative analysis to identify root causes. Key tools and metrics for performance evaluation, such as db2mon and various wait time analyses, are discussed to aid in effective troubleshooting.

Uploaded by

kanmanda1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 40

Db2 Performance

Triage for Db2 performance problems


Toby Haynes, Matt Emmerton
IBM Canada

1 © 2016 IBM Corporation


Introduction


Getting Started

Data collection

Deductions

Finding the right people

Solutions

2 © 2017 IBM Corporation


Getting Started

3 © 2017 IBM Corporation


Getting Started


Ensure that the problem is clearly stated
– “My database is slow” is not a useful description
– Performance support is NOT a free database tuning service

Find out when the problem started
– Knowing WHEN a problem started can make the analysis MUCH easier
– Changes in OS, configuration, new database deployments, new hardware, new
storage solutions – you NEED to know.

Some customers will say “nothing” has changed

You may need to collect evidence and ask about changes in specific areas later


Collecting the right information the first time is KEY.

4 © 2017 IBM Corporation


Setting expectations


Performance work is rarely “Just One Thing”

Don’t assume that the first “issue” resolved will “fix” the
problem

Many performance investigations point to a list of possible
causes
– Keep good notes
– Iterate on ONE issue at a time
– Making many changes at once wastes more time than it saves

5 © 2017 IBM Corporation


What is important to the customer?


Customers often care about specific metrics
– Specific query elapsed time (good)
– Commits per second (good)
– Bufferpool hit ratios (mostly good but not the whole story)
– Latch wait times (sometimes too specific...)

Get a clear statement on what performance is expected
– Previous “good” data is preferable
– Beware “impossible” demands

6 © 2017 IBM Corporation


Tactics


The game plan is:
– Collect the evidence
– Think (part I)
– Identify one or more areas that are “bad”
– Think (part II)
– Get the right people involved
– Solve
– Close

7 © 2017 IBM Corporation


Tactics

Always start at the largest scales
– Analysis starts looking at measurements across each of the database members

Identify “Big” areas
– CPU
– Wait Time
– Extreme I/O

Drill down
– Follow the most obvious lead
– Be prepared to back up and move on

8 © 2017 IBM Corporation


Data Collection

9 © 2017 IBM Corporation


Collect the evidence


db2support
– -purescale

Make sure this is used for ANY purescale cluster

100% Critical for any complaint about database recovery time
– Must be collected shortly after any recovery event (e.g. reboot)

– -collect 0

Configuration, catalog, db2look and db2set
– -collect 1

Adds formatted explain

10 © 2017 IBM Corporation


Collect the evidence

db2mon
– A set of a shell script and multiple SQL scripts, run to collect current activity on a Db2
instance

Supports pureScale and BLU, different versions for 9.7, 10.1, 10.5 and 11.1
– Oracle-mode ORA supported as well

Available from the Db2 performance team
– Do NOT send the perl script out to customers – just the generated files

– Db2mon hosted in IBM GitHub Enterprise



Updated sporadically – changes are coming soon

Likely to be available in sqllib/samples/perf from v11.1 mod3 fixpack3

11 © 2017 IBM Corporation


Db2mon

Successor to psmon
– Uses MON_GET_* table functions to obtain extensive data set
– Automatically calculates the differences between the performance counters
– Does not use snapshots

Snapshots are deprecated


Covers many aspects of database performance
– Resource usage
– Processing time
– Wait time
– Hit ratios
– I/O times
– Configuration

12 © 2017 IBM Corporation


Db2mon
Battle-tested in multiple

Aggregate CF command execution counts
Bufferpool data and index hit ratios


Bufferpool read statistics

Bufferpool sizes

critsits

Bufferpool statistics by tablespace

Bufferpool write statistics

CF-side command execution counts and average response times

CF-side total command execution counts

CF structure size information
CF system resource information

More than 60 reports



Client activity (active connections have at least 1 stmt s)
● ●
Combined results from 29 tables

Combined results from 58 tables

Count of group bufferpool full conditions

Current lock waits at end of capture

Currently executing SQL at start of capture (non-zero metrics only)

Database configuration settings

Database log read times

Database log write times

Database manager configuration settings

Database memory set information @ end

Database system resource usage information

DB2 registry variable settings

Deadlocks, lock timeouts and lock escalations

Disk read and write I O times

Group bufferpool data and index hit ratios

Group bufferpool invalid page statistics

Group bufferpool IO statistics by tablespace

Instance information

IO statistics per stmt - top statements by execution time

Latch wait metrics

Memory pool information @ end

Other database log statistics

Page reclaim metrics for index and data pages

Page reclaim metrics for SMP pages

Partial early aggregation distincts

Physical and logical page reads and writes at connection level

Processing times at database level

Round-trip CF command execution counts and average response times

Row level statistics per stmt - top statements by execution time

Sequences information

Size of database

Sort metrics at database level

Sort statistics per stmt - top statements by execution time

Tablespace data page prefetching statistics for group bufferpool

Tablespace prefetching statistics

Tablespace properties

Tablespace to bufferpool mapping

Tablespace usage over monitoring interval

Throughput metrics at database level

Time breakdown at database level (wait + processing)

Time spent executing by package

Top SQL statements by execution time, aggregated by PLANID

Top SQL statements by execution time

Various metrics at connection level

Various table level metrics

Wait time breakdown for top SQL statements by execution time

Wait times at connection level

Wait times at database level

Workload balancing server list

13 © 2017 IBM Corporation


Db2mon - dos and gotchas


db2mon needs
– MON_REQ_METRICS BASE or EXTENDED
– MON_ACT_METRICS BASE or EXTENDED
– MON_OBJ_METRICS BASE or EXTENDED

Db2mon can collect data over
– run for N seconds or
– “pause” until “Enter”

Counters can wrap over long periods of time
– Negative numbers in time columns are a BIG hint
– Impossible values are another

Rule of thumb: Keep the collection time under 5 minutes

14 © 2017 IBM Corporation


Deductions

15 © 2017 IBM Corporation


Capacity or configuration?


Look at the system
– CPU usage, specifically user+sys and IO wait
– Physical Memory (and swap usage)
– Disk I/O
– Network I/O
– Any virtualization

Most customers have some monitoring on some or all of these areas
– Get it if it exists
– DSM may hold historical data if deployed

Take notes of any part of the system that is showing high
usage/utilization

16 © 2017 IBM Corporation


Database or application


Slow queries or low transaction rates can be due to many factors
– Server (or part of the server)
– Network between client and server
– Client application

Compare
– Time spent from one request to the next

CLIENT_IDLE_WAIT_TIME
– Time spent on the server per request

TOTAL_RQST_TIME


It’s a client or networking issue if this ratio is too high
– CLIENT_IDLE_WAIT_TIME / TOTAL_RQST_TIME > 5

17 © 2017 IBM Corporation


Db2mon: Client Idle Wait Time
| MEMBER | TOTAL | TOTAL | TOTAL | TOT | ACTIVE | ACTIVE | ACTIVE | ACTIVE | ACTIVE |
| | CLIENTS | CIWT | RQST | CIWT | CLIENTS | RQ PER S | CIWT | RQST | CIWT |
| | | | | RQ | | | | | RQ |
| | | | | RATIO | | | | | RATIO |
|--------+---------+----------+---------+-------+---------+----------+----------+---------+--------|
| 0 | 1029 | 26976266 | 3062576 | 8.80 | 886 | 24376.47 | 24302776 | 2985602 | 8.13 |
| 1 | 838 | 25094808 | 2421562 | 10.36 | 731 | 21120.67 | 22000322 | 2326539 | 9.45 |


High Total CIWT indicates that the clients are not making
sequential requests quickly
– GOTCHA: max_coordagents is set to 120 – so roughly 6 connections are waiting for
each agent

18 © 2017 IBM Corporation


Waiting for …


Db2 provides WAIT TIME counters (mostly in milliseconds) for many
areas
– Total wait time can be found in IDLE_WAIT_TIME
– Some wait is inevitable (especially in pureScale)
– Investigate any area with high wait times

db2mon provides wait time
– Per database member
– Per application connection
– Per statement

Find the “big picture” wait and look to see where it is coming from

DSM dashboard includes wait time charts

19 © 2017 IBM Corporation


Wait Time Analysis


Start from “Wait times at database level”

Wait percentage can be around 70% on pureScale and be “fine”

Look for all the wait coming from ONE or TWO areas
– General:

Lock, Latch, Log Disk, Pool Read, Direct Writes, FCM, TCP/IP
– PureScale-specific

CF, Reclaims

20 © 2017 IBM Corporation


db2mon: Wait Time at Database Level
| MEMBER | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT | PCT |
| | RQST | LOCK | GLB | LTCH | LG | RCLM | CF | POOL | DIR | DIR | FCM | TCPIP | DIAG |
| | WAIT | | LOCK | | DSK | | | R | R | W | | | |
|--------+-------+-------+------+------+-------+------+-------+------+------+------+------+-------+------|
| 0 | 75.49 | 16.18 | 0.24 | 0.07 | 20.98 | 2.84 | 32.80 | 0.01 | 0.00 | 0.00 | 0.20 | 2.33 | 0.00 |
| 1 | 76.54 | 17.70 | 1.99 | 0.04 | 19.36 | 7.03 | 29.46 | 0.01 | 0.00 | 0.00 | 0.02 | 2.81 | 0.00 |


Moderately busy system
– 75% Request Wait time (RQST_WAIT)

30% CF message

20% Log Disk

17% Lock

21 © 2017 IBM Corporation


Wait Time per SQL statement

Extreme Wait in one/few statements
– Identify the area (lock/latch/CF/log disk)
– Identify any common features of the “bad” statements e.g.

Similar columns

Same table

Same tablespace

Same bufferpool


Consider:
– Obtain formatted explain plans for these statements and consult Optimizer team

22 © 2017 IBM Corporation


Hit Ratios

Hang on – surely performance is all about hit ratios in bufferpools
– Undersized bufferpools show up as

Normally shows up as high pool read time

Many direct reads → high direct read time


Db2mon has “Bufferpool data and index hit ratios”
– Index hit ratio is usually more important than data page hit ratios
– Aim for 95% or better

PureScale “Group bufferpool data and index hit ratios”
– Any hits here avoid disk I/O
– Good enough can be quite low (40%) for some workloads

Too many customers have too many bufferpools
– Stick to one bufferpool per page size
– Configure with AUTOMATIC size

23 © 2017 IBM Corporation


Too much work (CPU)


If the CPU utilization is very high on a system
– Look for HOT parts of the system in the db2mon reports
– “Processing Times at database level”

Compiling

Section processing

Sort processing

Commit

Rollback

Column (BLU) processing

Connection processing

24 © 2017 IBM Corporation


Most expensive SQL statements (CPU)


Performance of the whole system can be badly impacted by a
few bad statements
– Use the “Top SQL statements by execution time” report to
see the worst offenders
– BEWARE:

Section Time is in milliseconds

CPU time is in microseconds
– It is common for the top five statements to take half the work

25 © 2017 IBM Corporation


Most expensive SQL statements (OLTP)

For OLTP systems, we expect a few rows accessed per statement
execution

Check the “ Row level statistics per stmt” report:
– Queries with many rows modified or read should be checked
– especially look for suitable indexes and explain plans

26 © 2017 IBM Corporation


db2mon: Row level statistics per statement
| MEM | NUM | AVG | AVG ROWS | AVG | STMT TEXT |
| | EXEC | ROWS | READ | ROWS | |
| | | MOD | | RET | |
| | | | | | |
|-----+------+------+----------+-------+---------------------------------------|
| 1 | 80 | 0.0 | 232.0 | 229.0 | SELECT V.CATEGORY, V.KEY, V.VALUE FRO |
| 1 | 81 | 0.0 | 171.0 | 27.0 | SELECT CAST(P.ID AS INTEGER) AS ID, C |
| 1 | 7843 | 0.0 | 144.0 | 1.1 | SELECT R.ID, R.PARTY_ID, R.TYPE FROM |
| 1 | 6917 | 0.0 | 144.0 | 1.0 | SELECT COUNT(*) AS COUNT FROM FTM.RS_ |


Statements returning 1 row may be fine if
– Aggregation
– Distinct

Statements including count(*) will need appropriate indexes

27 © 2017 IBM Corporation


Most expensive SQL statements (Analytics)

Long running queries are common in Analytic workloads
– Gather Explain data

Talk to the Optimizer team
– Consider moving row → column organized tables
– QUERY_OPT may need tweaking per query or overall
– There are many db2set parameters for fine tuning

28 © 2017 IBM Corporation


Too much work (CPU)


Drill down to appropriate areas
– Compiling → e.g. package cache, query optimization levels
– Section processing → (this is where the work should be!)

Check for hot statements in the “Top Sql statements by execution time”
– Sort processing → e.g. sort heap configuration, explain plans
– Commit → e.g. page cleaners, disk I/O times
– Rollback → e.g. locks, deadlocks, contention and configuration
– Column (BLU) processing → e.g. memory, cdetrace
– Connection processing → e.g. network, connection pooling

29 © 2017 IBM Corporation


Latch wait

High latch wait may be
– “business as usual” or
– a need for better configuration

Latch wait above 10% is definitely worth examining
– Per latch under “Latch wait metrics” in db2mon

Usually one or two latches dominate the time waiting
– Latchtrace may be needed

30 © 2017 IBM Corporation


db2mon: Latch Wait Times (pureScale – 30% latch wait)
| MEM | LATCH NAME | TOT | TOT | TIME |
| | | EXT | EXT | PER |
| | | LATCH | LATCH | LATCH |
| | | WAIT | WAITS | WAIT |
| | | TIME | | MS |
| | | MS | | |
|-----+-----------------------------------------------+--------+--------+-------|
| 2 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 279240 | 191130 | 1.46 |
| 3 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 271005 | 190274 | 1.42 |
| 1 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 270853 | 196191 | 1.38 |
| 6 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 265425 | 197303 | 1.34 |
| 5 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 255242 | 194439 | 1.31 |
| 0 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 246942 | 186820 | 1.32 |
| 7 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 242398 | 192810 | 1.25 |
| 4 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 228679 | 193971 | 1.17 |
| 7 | SQLO_LT_SQLB_BPD__WARLatch | 9639 | 20566 | 0.46 |
| 1 | SQLO_LT_SQLB_BPD__WARLatch | 8096 | 20196 | 0.40 |
| 0 | SQLO_LT_SQLB_BPD__WARLatch | 8005 | 20494 | 0.39 |
| 2 | SQLO_LT_SQLB_BPD__WARLatch | 7887 | 19878 | 0.39 |
| 6 | SQLO_LT_SQLB_BPD__WARLatch | 7809 | 20047 | 0.38 |
| 4 | SQLO_LT_SQLB_BPD__WARLatch | 7716 | 19645 | 0.39 |
| 5 | SQLO_LT_SQLB_BPD__WARLatch | 7511 | 20411 | 0.36 |
| 3 | SQLO_LT_SQLB_BPD__WARLatch | 7258 | 19943 | 0.36 |
| 3 | SQLO_LT_SQLB_PTBL__pool_table_latch | 3142 | 1966 | 1.59 |
| 1 | SQLO_LT_SQLB_PTBL__pool_table_latch | 2528 | 1966 | 1.28 |
| 7 | SQLO_LT_SQLP_DBCB__transsem | 473 | 1575 | 0.30 |
| 3 | SQLO_LT_sqmObjMetricsTable__m_objMetricsLatch | 469 | 2337 | 0.20 |
| 4 | SQLO_LT_SQLP_DBCB__transsem | 406 | 1570 | 0.25 |
31 © 2017 IBM Corporation
db2mon: Latch Wait Metrics (GDPC – 1% latch wait)
| MEM | LATCH NAME | TOT | TOT | TIME |
| | | EXT | EXT | PER |
| | | LATCH | LATCH | LATCH |
| | | WAIT | WAITS | WAIT MS |
| | | TIME | | |
| | | MS | | |
|-----+--------------------------------------------------------+--------+-------+---------|
| 0 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 137679 | 1707 | 80.65 |
| 1 | SQLO_LT_SQLB_BPD__bpdLatch_SX | 116643 | 1745 | 66.84 |
| 0 | SQLO_LT_SQLB_POOL_ADDRESS_INFO__pool_address_latch | 43214 | 31 | 1394.00 |
| 1 | SQLO_LT_SQLB_POOL_ADDRESS_INFO__pool_address_latch | 39632 | 31 | 1278.45 |
| 0 | SQLO_LT_SQLB_BPD__WARLatch | 19480 | 318 | 61.25 |
| 1 | SQLO_LT_SQLB_BPD__WARLatch | 16049 | 350 | 45.85 |
| 0 | SQLO_LT_SQLD_SEQ__seqLatch | 1901 | 14 | 135.78 |
| 0 | SQLO_LT_SQLI_INX_PAGE_CACHE__ipcLatch | 1666 | 4 | 416.50 |
| 1 | SQLO_LT_SQLD_SEQ__seqLatch | 986 | 10 | 98.60 |
| 1 | SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch | 804 | 21 | 38.28 |
| 0 | SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch | 460 | 16 | 28.75 |
| 0 | SQLO_LT_SQLD_CHAIN__fullChainLatchNS | 112 | 10 | 11.20 |
| 1 | SQLO_LT_SQLE_GSS_HEADER__latch | 36 | 5 | 7.20 |
| 1 | SQLO_LT_SQLI_INX_PAGE_CACHE__ipcLatch | 11 | 2 | 5.50 |
| 0 | SQLO_LT_SQLE_GSS_HEADER__latch | 10 | 4 | 2.50 |
| 1 | SQLO_LT_NO_IDENTITY | 0 | 63 | 0.00 |
| 1 | SQLO_LT_SMemPool__MemLatchType__latch | 0 | 4 | 0.00 |
| 1 | SQLO_LT_SQLP_LHSH__hshlatch | 0 | 1 | 0.00 |

32 © 2017 IBM Corporation


Getting the right people


Fixing performance issues often involves more than just
Perf_Non_Opt

Secondary assistance often required
– Optimizer
– SAL
– Index
– Bufferpool
– DMS

33 © 2017 IBM Corporation


Solution Examples

34 © 2017 IBM Corporation


Extreme GDPC

Huge separation between sites GDPC cluster
– Sites separated by 65km cable
– 650 microsecond ping

Customer test suite runs in
– 2 minutes on Db2 ESE
– 30 minutes on GDPC cluster (better hardware)

35 © 2017 IBM Corporation


Extreme GDPC (continued)

Database level symptoms

60% → 90% Wait time (sporadic)
– Pool Read Wait Time 0% → 81%
– CF Wait Time 1% → 51%
– FCM Wait Time 2 → 57%

High Pool Read: enlarge bufferpool
– Result: Customer test suite runs in 4 minutes

36 © 2017 IBM Corporation


Monster 8-member pureScale cluster


Critsit
– Larger than average system
– 100% XA transactions
– Target:

Need 180,000 transactions per second

After extensive tuning, reached 115,000 transactions per second


Database level symptoms
– 80% Wait Time

33% Latch wait time

28% CF wait time

19% Log Disk wait time

37 © 2017 IBM Corporation


Monster 8-member pureScale cluster (continued)


Focus on each area seen in Wait Time
– Latch: SQLO_LT_SQLB_BPD__bpdLatch_SX

“Business as usual” Bufferpool Page latch
– Check CF execution times

High ReadCastoutClass times – up to milliseconds!
– Engage SAL and CF teams

Identified SINGLE mutex for all castout activity

Code change in ca-server
– Multiple mutexes controlled by environment variable


Other tuning
– Set QUERY_OPT back to 5 (customer had configured this to 3)

Result
– 202,000 transactions per second

38 © 2017 IBM Corporation


Resources

39 © 2017 IBM Corporation


Resources

People!

Dozens of presentations available from the Performance team

Best Practices, even five years old, are still valuable reading

Extensive tool sets available for chewing through performance
data

40 © 2017 IBM Corporation

You might also like