Exadata MAA

The document presents an overview of Oracle's Exadata Database Machine and its Maximum Availability Architecture (MAA), emphasizing the importance of high availability in data centers to minimize downtime costs. It details the features and benefits of Exadata's engineered systems, including built-in high availability, data protection mechanisms, and efficient lifecycle management. The presentation highlights the critical role of MAA in ensuring continuous availability and protecting against data corruption and storage failures.


Exadata Database Machine :

Maximum Availability Architecture (MAA)


Technical Presentation

Exadata and MAA Product Management


January 2025
Agenda

1. Why focus on Maximum Availability?
2. What is Maximum Availability Architecture?
3. Maximum Availability Architecture features in Exadata
4. Exadata Lifecycle Operations
5. Summary

4 Copyright © 2025, Oracle and/or its affiliates


Why focus on Maximum Availability?



$350K
Average Cost of downtime per hour

Source: Gartner, Data Center Knowledge, IT Process Institute, Forrester Research



$10M
Average Cost of unplanned data center outage or
disaster

Source: Gartner, Data Center Knowledge, IT Process Institute, Forrester Research



87 hours
Average Downtime per year

Source: Gartner, Data Center Knowledge, IT Process Institute, Forrester Research



91%
Percentage of companies that have experienced
an unplanned data center outage in the last 24
months

Source: Gartner, Data Center Knowledge, IT Process Institute, Forrester Research



What is Maximum Availability Architecture?



Oracle Maximum Availability Architecture (MAA)
Standardized Reference Architectures for Never-Down Deployments

• Reference architectures: Bronze, Silver, Gold, Platinum
• Customer insights and expert recommendations
• 24/7 HA features, configuration and operational practices
• Continuous availability: Application Continuity, Online Redefinition, Edition-based Redefinition
• Data protection: Flashback, RMAN, ZDLRA + ZRCV
• Active replication (production site to replicated site): Active Data Guard, GoldenGate, Full Stack DR
• Scale out & Lifecycle: RAC, FPP, Globally Distributed Database, Zero Downtime Migration (ZDM)
• Deployment choices: Generic Systems, Engineered Systems, BaseDB, ExaDB/ExaCC, Autonomous DB

MAA Reference Architectures
Availability service levels

• Bronze (dev, test, prod): single instance DB; restartable; backup/restore
• Silver (prod/departmental): Bronze + database HA with RAC; Application Continuity
• Gold (business critical): Silver + DB replication with Active Data Guard
• Platinum (mission critical): Gold + GoldenGate; Edition-Based Redefinition

All tiers possible with on-premises and cloud.


MAA Exadata Features



Hardware and Software Engineered together :

✓ Performance
✓ Manageability
✓ Availability


Oracle Exadata Database Machine : Built-in High Availability

• Redundant Database Servers


– Active-Active highly available clustered servers
– Hot-swappable power supplies, fans and flash cards
– Redundant power distribution units
– Integrated HA software/firmware stack

• Redundant Network
– Redundant 100Gb/s RoCE and switches
– Client access using HA bonded networks
– Integrated HA software/firmware stack

• Redundant Storage Grid


– Data mirrored across storage servers
– Hot-swappable power supplies, fans, M.2 drives and flash cards
– Redundant, non-blocking I/O paths
– Integrated HA software/firmware stack
Exadata : X11M

• 100 Gb active-active RDMA over Converged Ethernet (RoCE) private network
• 1.25TB low latency Exadata RDMA Memory (XRMEM) per Storage Server
• Data Acceleration reduces read latency to <14μs
• 3 storage tiers:
– Exadata RDMA Memory
– Performance-optimized Flash
– Capacity-optimized Flash or Hard Disk
• Baremetal or KVM Based Virtualization

[Figure: storage tier pyramid, from faster and higher cost per GB at the top: Exadata RDMA Memory, Performance-Optimized Flash, Capacity-Optimized Flash, Disk]


Exadata: The MAA Platform of Choice
Evolution: We Continue to Protect your Service Level from the Most Difficult HA Problems

X11M Database Server
• Zero impact major Linux upgrades, e.g. OL8 in Exadata release 23.1
• Zero impact security software upgrades including STIG compliance
• MS (Management Server) alerting of key Database and Grid Infrastructure software incidents

X11M Storage Server
• Low I/O latency preservation during unplanned and planned outages
• Tightly integrated hardware & software with auto repair of sick storage
• Exadata X10M and X11M Extreme Flash storage server with both performance and capacity-optimized flash
• 14 microseconds to retrieve a database I/O from storage server XRMEM Cache

Human Error Prevention!
• MAA Best Practice Full Stack Compliance Checks with Exachk

Exadata: The MAA Platform of Choice
Evolution: Metrics and More Made Easy

How do I really know what is going on inside Exadata?

• Performance data, including Exadata metrics, has been available since Exadata's inception, but it was sometimes difficult to consume and understand

• Enter Real Time Insight in Exadata release 22.1. Simply zoom into one of the dashboards to observe performance trends or shine a bright light on performance anomalies


Exadata : Built-in High Availability
I/O error prevention with
Automatic LED support for disk removal Exadata disk scrubbing / ASM corruption repair
Redundancy Check during power down
Failure Monitoring on database servers Reduced brownout for instance recovery
I/O latency capping for reads and writes ILOM hang detection and repair Custom Diagnostic Package for Cell Alerts
Updating database nodes with patchmgr Optimized and Faster Exadata Patching Blue OK-to-remove LED light notification
Exadata HARD Auto online
Automated repair from controller cache failure
Cell-to-Cell Rebalance Preserves Flash Cache Priority rebalance support
Drop hard disk for replacement Exadata Elastic Configuration
Cell to Cell offload for Disk Repair
Redundancy protection on cell shutdown Fast network failure detection Flash and Disk Life Cycle Management Alerts
Fastest Redo Apply and Instance Recovery
Exadata Smart Write Back Efficient resilver rebalance after flash failure

I/O hang detection and repair Health factor on predictively failed disks
I/O and Network Resource Management
Exachk full stack healthcheck with critical issues alerts
Elimination of false positive drive failures
EM failure reporting
VLAN support and automation
Active Active RoCE Network
Drop BBU for Replacement Exadata Smart Flash Logging
Disk confinement
Cell-to-Cell Rebalance Data Accelerator Cache preservation Redundancy protection on cellsrv shutdown
Corruption prevention with HARD support Smart Write Back Flash Cache persistence
Auto disk management Cell I/O timeout threshold
Automatic ASM mirror read on I/O error corruption Appliance mode support Cell Alert Summary



Data Protection
Quality Of Service and Performance
Brownout
Lifecycle Management


What are Data Corruptions?

• Physical corruptions, aka media corruption
– Checksum of the database block doesn’t correspond with its contents
– Header corrupt
– Block contains zeros
– …

• Logical corruptions
– Database block with correct checksum but logically inconsistent
– Structure below the header is corrupt
– Lost write
– Row locked by nonexistent transaction
– …

• Often occur silently

Source: Wikipedia
Exadata : Data Protection
Corruption Detection & Prevention

When a network packet in the I/O path between DB server and storage node is corrupted

• Storage cell prevents the write
• ASM retries by re-sending the packet

✓ Application never encounters corruptions
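
The detect, reject and resend behaviour described on this slide can be illustrated with a toy model. This is a minimal sketch, not Exadata's implementation: `StorageCell`, `send_with_retry`, the CRC32 checksum and the `corrupt_first_n` fault injector are all hypothetical names chosen for illustration.

```python
import zlib


class StorageCell:
    """Toy storage cell (hypothetical) that refuses writes failing a checksum."""

    def __init__(self):
        self.blocks = {}

    def write(self, block_id, payload, checksum):
        # Reject the write if the payload was damaged in flight.
        if zlib.crc32(payload) != checksum:
            return False
        self.blocks[block_id] = payload
        return True


def send_with_retry(cell, block_id, payload, corrupt_first_n=0, max_attempts=3):
    """Resend the packet until the cell accepts it; return the attempt count."""
    checksum = zlib.crc32(payload)
    for attempt in range(1, max_attempts + 1):
        wire = payload
        if attempt <= corrupt_first_n:
            # Simulate network corruption by flipping the first byte.
            wire = bytes([payload[0] ^ 0xFF]) + payload[1:]
        if cell.write(block_id, wire, checksum):
            return attempt
    raise IOError("write rejected on every attempt")
```

In this model the corrupted packet never reaches `cell.blocks`, so a later reader only ever sees the clean payload.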


Exadata : Data Protection
Corruption Repair

If an application update in the database encounters corruption

• Database reads from the ASM mirror
• Repairs the corruption using the good copy

✓ This repair happens without impacting other database processes and the application
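
The read-from-mirror-and-repair idea can be sketched as follows. The function name `read_with_repair`, the list-of-copies representation and the CRC32 check are assumptions made for illustration; they do not describe ASM internals.

```python
import zlib


def read_with_repair(copies, checksum):
    """Return a valid copy of a block and overwrite any corrupt copies with it.

    copies: list of bytes objects, index 0 being the primary extent
            (hypothetical in-memory stand-in for mirrored extents).
    """
    good = next((c for c in copies if zlib.crc32(c) == checksum), None)
    if good is None:
        raise IOError("no valid copy found")
    for i, c in enumerate(copies):
        if zlib.crc32(c) != checksum:
            copies[i] = good  # repair in place from the good mirror
    return good
```

Only the damaged copy is touched, which mirrors the slide's point that the repair does not disturb other work.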


Exadata : Data Protection
Storage Failures

On the storage cells, what happens if a drive is reported as failed but has not really failed?
• Automatic power cycle of the drive / flash to avoid false positive drive failure

• When a storage failure occurs, redundancy is impacted
• Restoration of redundancy is prioritized in database-aware order to preserve data
• Order of Priority
1. Control Files
2. Online logs
3. Archivelogs
4. ASM SPfile
5. Database SPfile
6. TDE key store
7. OCR
8. Standby Redo Logs
9. Wallet
10. Datafiles
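
The priority order above amounts to a sort by a fixed ranking. A minimal sketch, assuming the file types are represented by the labels used on the slide (`restoration_order` is a hypothetical helper, not an ASM API):

```python
# Priority list copied from the slide, highest priority first.
RESTORE_PRIORITY = [
    "Control Files",
    "Online logs",
    "Archivelogs",
    "ASM SPfile",
    "Database SPfile",
    "TDE key store",
    "OCR",
    "Standby Redo Logs",
    "Wallet",
    "Datafiles",
]


def restoration_order(file_types):
    """Sort the given file types so that redundancy is restored in priority order."""
    rank = {name: i for i, name in enumerate(RESTORE_PRIORITY)}
    return sorted(file_types, key=lambda t: rank[t])
```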


Exadata : Data Protection
Efficient Rebalance with Service Level Protection

ASM Power Limit: Intelligent and flexible rebalance power setting

• Testing in MAA labs to find the best balance between redundancy restoration and service level protection
• MAA best practice asm_power_limit
– Default 4 (total across clusters) at deployment time
– Never set asm_power_limit = 0
– Dynamically modifiable (see table for recommended max) using
alter diskgroup <diskgroup name> rebalance modify power <value>

Recommended MAX asm_power_limit

Oracle Database 23ai: 96
Oracle Database 21c and earlier: 64
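
As a worked example of the guidance above, the sketch below caps a requested rebalance power at the recommended maximum and builds the statement shown on the slide. The helper names and the use of a numeric major version (23 for 23ai) are assumptions for illustration.

```python
def recommended_max_power(db_major_version):
    """Recommended MAX asm_power_limit from the slide's table."""
    return 96 if db_major_version >= 23 else 64


def rebalance_statement(diskgroup, power, db_major_version):
    """Build the ALTER DISKGROUP statement with the power capped at the maximum."""
    if power <= 0:
        # The slide warns never to set the power limit to 0.
        raise ValueError("power must be greater than 0")
    capped = min(power, recommended_max_power(db_major_version))
    return f"alter diskgroup {diskgroup} rebalance modify power {capped}"
```

For instance, requesting power 128 on a 19c system yields a statement with power 64 in this sketch.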

Credit : LilGoldWmn https://www.freeimages.com/photo/balance-1218685


Exadata : Data Protection
Exadata ASM configuration best practices

High Redundancy HIGHLY recommended

• Disks are constantly getting larger
• During rolling cell software updates, 2 copies remain
• Double partner disk failures are rare but possible
• Particularly important for older systems with aging disks
• User data is stored in Primary Extent and 2 mirrored copies
• 5 failure groups required for Voting files and ASM metadata
• OCR stored in primary and 2 mirrored copies


Exadata : Data Protection
Exadata ASM configuration best practices

High Redundancy requires at least 1 disk group with 5 failure groups

• An Eighth and Quarter Rack has 3 Storage Servers
• Storage Servers alone provide only 3 failure groups

Solution : ASM quorum disk on Database Servers

• Implemented automatically when deployed through OEDA
• Uses iSCSI based ‘quorum failure groups’
• Managed with quorumdiskmgr (if needed)

High Redundancy HIGHLY recommended
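
The arithmetic behind the quorum disk solution is simple: each storage server contributes one failure group, and the shortfall up to five is made up by quorum failure groups. A minimal sketch (the helper name is hypothetical):

```python
REQUIRED_FAILURE_GROUPS = 5  # from the slide: High Redundancy needs 5 failure groups


def quorum_failgroups_needed(storage_servers):
    """Number of quorum failure groups needed to reach five failure groups."""
    return max(0, REQUIRED_FAILURE_GROUPS - storage_servers)
```

With the three storage servers of an Eighth or Quarter Rack, this gives two quorum failure groups.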


Exadata : Data Protection
Exadata ASM configuration best practices Eighth & Quarter Rack High Redundancy

Exadata Host 1

exa1domU1 (RAC 1)
/dev/exadata_quorum/QD_DATAC1_EXA1DOMU1
/dev/exadata_quorum/QD_DATAC1_EXA2DOMU1

exa1domU2 (RAC 2)
/dev/exadata_quorum/QD_DATAC2_EXA1DOMU2
/dev/exadata_quorum/QD_DATAC2_EXA2DOMU2

Exadata Host 2

exa2domU1 (RAC 1)
/dev/exadata_quorum/QD_DATAC1_EXA1DOMU1
/dev/exadata_quorum/QD_DATAC1_EXA2DOMU1

exa2domU2 (RAC 2)
/dev/exadata_quorum/QD_DATAC2_EXA1DOMU2
/dev/exadata_quorum/QD_DATAC2_EXA2DOMU2


Exadata : Data Protection
Storage Failures

• Exadata includes automated operations for disk maintenance when disks fail or have been proactively marked as problematic

• ASM automatically restores redundancy before balancing data on disk
– Reduces window when some data may have reduced redundancy

• If a disk needs to be dropped manually, the administrator can specify MAINTAIN REDUNDANCY to rebalance data before dropping the corresponding ASM disks
– Preserves redundancy in addition to regular checks performed by DROP FOR REPLACEMENT


Exadata : Data Protection
M.2 Fast Failure Protection and Online Replacement ( X7 and newer)

Two M.2 drives for OS and cell software

M.2 drives protected with Intel RSTe RAID

Can be replaced online so user data does not have to be taken offline


Exadata : Data Protection
Online Flash Replacement (X7 and newer)

• Open chassis and replace online; no outage needed from storage server

• For failed drive: replace when ready

• For online drive:
CellCLI> alter physicaldisk FLASH_2_2 drop for replacement;

• After replacement no customer interaction needed


Exadata : Data Protection
Hardware Assisted Resilient Data

• Exadata includes Hardware Assisted Resilient Data (HARD) checks to prevent corruption for specific file types:
– Spfile
– Controlfiles
– Log files
– Datafiles
– Data Guard Broker Files

• When a HARD check fails, the corrupted data is not written

• Works transparently after enabling DB_BLOCK_CHECKSUM

• Active during ASM Rebalance or ASM Resync


Exadata : Data Protection
Disk Scrubbing

• Inspects and repairs hard disks during idle time
– Checks for bad sectors on the disks
– Executed by Exadata Storage Software
– If bad sectors are found, the storage server requests the mirror copy from ASM to perform the repair

• Automatic and dynamic execution
– Scheduled by default bi-weekly
– When disks are idle ( < 25% busy )
– Automatically backs off when application needs I/O resources
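
The scheduling rules above can be read as a simple predicate. This is a sketch under stated assumptions: "bi-weekly" is interpreted as every 14 days, and `should_scrub` with its parameters is a hypothetical helper rather than a storage server interface.

```python
SCRUB_INTERVAL_DAYS = 14     # assumption: "bi-weekly" read as every two weeks
IDLE_BUSY_THRESHOLD = 0.25   # from the slide: disks are idle when < 25% busy


def should_scrub(days_since_last_scrub, disk_busy_fraction):
    """Scrub only when a scrub is due and the disk is currently idle."""
    due = days_since_last_scrub >= SCRUB_INTERVAL_DAYS
    idle = disk_busy_fraction < IDLE_BUSY_THRESHOLD
    return due and idle
```

Re-evaluating the predicate as load changes gives the back-off behaviour: once application I/O pushes the disk to 25% busy or more, scrubbing pauses.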

https://blogs.oracle.com/exadata/post/exadata-disk-scrubbing
Conclusion Data Protection

Rest assured Exadata has you covered :

• Corruption Detection, Prevention & Repair


• H.A.R.D.
• Scrubbing
• Online flash replacement
• M.2 Fast Failure Protection and Online Replacement
• High Redundancy
• Do not service LED
• Efficient Rebalance
• Automatic Power cycle of (potentially sick) flash/ drives

Credit : David Clode


https://unsplash.com/photos/Yg_sNKOiXvY



Data Protection
Quality Of Service and Performance
Brownout
Lifecycle Management


Exadata : Quality of Service & Performance
I/O Latency Capping

• High I/O latency can have detrimental performance impact

• Exadata detects high I/O latency and redirects reads and writes to other devices
– High latency read I/O redirected to partner cell
– High latency write I/O cancelled and temporarily written to flash on same cell

[Chart: LGWR delay after hung I/O, in seconds. Exadata: 1; traditional storage: 30]
Exadata : Quality of Service & Performance
Storage Server Disk Confinement

• Exadata constantly monitors disk performance and health
– Poor performance is often a precursor to disk failure
• Disks identified with poor performance are confined and I/O is directed to an alternative mirror
• Storage Server automatically runs disk health check
• If the disk is deemed healthy
– Disk is returned to service and resynchronized
• If the disk is deemed unhealthy
– Disk is dropped, data is rebalanced to maintain redundancy and the blue service LED is lit
– Disk can then be replaced
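
The confinement flow above behaves like a small state machine. The sketch below models it with hypothetical state names and action strings; it illustrates the sequence on the slide and is not the storage server's actual logic.

```python
from enum import Enum


class DiskState(Enum):
    ACTIVE = "active"
    CONFINED = "confined"
    DROPPED = "dropped"


def step(state, poor_performance=False, health_check_ok=None):
    """Return (next_state, actions) for one transition of the confinement flow."""
    if state is DiskState.ACTIVE and poor_performance:
        return DiskState.CONFINED, ["direct I/O to alternative mirror", "run disk health check"]
    if state is DiskState.CONFINED and health_check_ok is True:
        return DiskState.ACTIVE, ["return disk to service", "resynchronize"]
    if state is DiskState.CONFINED and health_check_ok is False:
        return DiskState.DROPPED, ["drop disk", "rebalance data", "light blue service LED"]
    return state, []
```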


Exadata : Quality of Service & Performance
Smart Storage with I/O Resource Manager (IORM)

• IORM configures and manages Storage Server I/O related resources when contention occurs
• I/O tagged and prioritized based on IORM Plan for Database (CDB/PDB/Non-CDB) or Cluster
• Tag includes
– Database/PDB/Cluster name
– Purpose
– Priority
• Useful in mixed and consolidated workload environments
• Can be combined with Database Resource Manager

CellCLI> ALTER IORMPLAN -
         dbplan=((name=prod, share=16), -
                 (name=dw, share=4), -
                 (name=prod_test, share=2), -
                 (name=DEFAULT, share=1))
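
To see what the share values in the plan above mean relative to each other, the sketch below converts them into fractions of the total. It assumes shares act as relative weights when all databases compete for I/O at once; `share_fractions` is a hypothetical helper, not part of CellCLI.

```python
def share_fractions(dbplan):
    """Convert IORM share values into each database's fraction of the total."""
    total = sum(dbplan.values())
    return {name: share / total for name, share in dbplan.items()}


# Share values taken from the ALTER IORMPLAN example on the slide.
plan = {"prod": 16, "dw": 4, "prod_test": 2, "DEFAULT": 1}
fractions = share_fractions(plan)
```

With 23 shares in total, `prod` holds 16/23 of the weight and `DEFAULT` 1/23 under this reading.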


Exadata : Quality of Service & Performance
Flash Cache : Write Back or Write Through

[Figure: read and write paths for Write-Through and Write-Back flash cache]

Application I/O rarely hits the hard disks or Capacity-Optimized Flash

MAA Best Practice
Exadata : Quality of Service & Performance
Smart Flash Log

• Eliminates write latency outliers
– Redo log write latency is critical for OLTP database performance
– Writes to Flash Cache and Flash Log on different flash devices simultaneously
– Fastest device acknowledges write
• Eliminates storage as log write bottleneck
– Online and standby redo logs automatically and transparently cached in write-back Smart Flash Cache
– Increases log write throughput by writing to flash instead of disk
– Benefits workloads that read the online redo logs such as GoldenGate
• Beneficial when multiple concurrent workloads require hard disk I/O bandwidth (e.g. backups)
• Asynchronous flush to capacity-optimized flash or HDD

[Figure: Database Server writing to the Storage Server's Flash Cache (FLASH#1) and Flash Log (FLASH#2), flushed later to Flash/Disk]


Exadata : Quality of Service & Performance
Smart Flash and Hard Disk Replacement

• After flash or hard disk replacement, a “health factor” is set on the affected hard disks.

• While the health factor is on, reads are satisfied from a healthy partner cell and Exadata software continues warming up the flash cache on the cell that had its storage replaced.

• When the flash cache is sufficiently warmed up, the health factor status is removed.

• This feature enables consistent, low I/O latency after storage replacement that in turn maintains application service levels.


Exadata : Quality of Service & Performance
SLAs maintained during planned maintenance or unplanned outages

Performance is Time, Time is Money

• Exadata flash cache state preserved during ASM rebalance operations. One practical example is the resync that occurs during cell software rolling updates.

• Intelligent routing of I/O requests to cell providing the best service after flash and disk failure and repair

• Applicable to both unplanned outages and planned maintenance


Exadata : Quality of Service & Performance
Database Tier I/O Cancel

Database Tier
• Database Tier I/O Latency Capping ✓

Storage Tier
• Slow I/O? Cell I/O Latency Capping ✓
• Hung I/O? I/O Hang detection / repair ✓
• Sick disk? Disk confinement ✓

Undiscovered hardware / software issue?

[Figure: I/Os are pumping between the Database Tier and the Storage Tier]


Exadata ASM Reserved Space for Rebalance

• ASM requires space to allow for rebalancing of data in the event of a failure
– Ensures rebalance is successful
– Restores redundancy
• Space to ensure rebalance is successful is not reserved
– Reports ORA-15041 if there is not enough space to complete rebalance

REQUIRED_MIRROR_FREE_MB
• Depends on the number of failure groups and ASM version
• Applies to any disk group and any redundancy (HIGH or NORMAL)
• Same for all media types and hardware generations*

* Exadata X10M and newer Extreme Flash has hardware-specific requirements

Grid Infrastructure Version | Number of Failure Groups | Required % Free of Disk Group Capacity
12.1.0                      | Any                      | 15
12.2, 18.1+                 | less than 5              | 15
12.2, 18.1+                 | 5 or more                | 9

Number of Failure Groups (8 ASM disks / FG) | Redundancy | Required % Free of Disk Group Capacity to Successfully Rebalance after a single physical disk failure
less than 5                                 | NORMAL     | 15%
less than 5                                 | HIGH       | 29%
5 or more                                   | NORMAL     | 9%
5 or more                                   | HIGH       | 11%

• X10M and newer EF cells have four physical flash disks with two ASM disks per physical flash disk. Therefore, a flash card failure will result in two ASM disks being dropped.
• GI/ASM 19c and newer with patch 34281503
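
The two tables on this slide can be expressed as lookups. This is a sketch under assumptions: the Grid Infrastructure version is passed as a `(major, minor)` tuple, and the second table (8 ASM disks per failure group, split by redundancy) is taken to be the X10M-and-newer Extreme Flash case mentioned in the footnote. Both function names are hypothetical.

```python
def required_free_pct(failure_groups, gi_version=(19, 0)):
    """Required % free of disk group capacity, by GI version and failure group count."""
    if gi_version < (12, 2):
        return 15  # 12.1.0: 15% regardless of the number of failure groups
    return 15 if failure_groups < 5 else 9


# Assumed to apply to X10M and newer Extreme Flash (8 ASM disks per failure group).
_EF_REQUIRED = {
    (False, "NORMAL"): 15,
    (False, "HIGH"): 29,
    (True, "NORMAL"): 9,
    (True, "HIGH"): 11,
}


def required_free_pct_ef(failure_groups, redundancy):
    """Required % free to rebalance after a single physical disk failure."""
    return _EF_REQUIRED[(failure_groups >= 5, redundancy.upper())]
```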


Exadata Smart Rebalance

• Smart Rebalance affects High Redundancy disk groups when a failure occurs
• If disk group has required free space
– Data is rebalanced and redundancy restored
• If disk group DOES NOT have required free space
– Disk is offlined and rebalance deferred
– Disk is re-mirrored efficiently from partner disks once replaced
• Reduces data movement and extra I/O at failure time if more capacity is required for database storage

Smart Rebalance is a safety-net. MAA strongly recommends maintaining sufficient free space
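
The Smart Rebalance decision reduces to one comparison between free space and required free space. A minimal sketch with hypothetical names and return strings:

```python
def on_disk_failure(free_mb, required_free_mb):
    """Decide how a High Redundancy disk group reacts to a disk failure."""
    if free_mb >= required_free_mb:
        return "rebalance now and restore redundancy"
    return "offline disk and defer rebalance until the disk is replaced"
```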


ASM Disk Partnering
Concept

• ASM utilizes disk partnerships to choose disks for placing extents and their mirror copies

• Each disk partners with 8 other disks
– For less than 5 cells, all partners are from 2 cells
– For 5 or more cells, all partners are from 4 cells

• The Primary Extent is then mirrored
– to two of these partners for High Redundancy
– to one of these partners for Normal Redundancy

• Read IO provided by 8 disks
– Used by rebalance, rebuild, resync, resilver, disk/flash warmup operations


ASM 23ai Increases Number of Disk Partners

• Each disk partners with 24 other disks on four cells

• Read IO provided by 24 disks
– Benefits rebalance, rebuild, resync, resilver, disk/flash warmup operations

• Automatically managed by ASM
– New partnering scheme not applied during upgrade
– Partners updated by following operations: ADD DISK, ADD FAILGROUP (add cell), REBALANCE

Results in up to 3x faster redundancy restoration


What about other storage configurations?

• Different Storage Server configurations utilize different partnership values

Number of cells | Storage Server Type      | Number of disks per cell | Number of partners (pre-23ai) | Number of partners (23ai)
3               | 1/8th Rack High Capacity | 6                        | 8                             | 12
3 or 4          | High Capacity            | 12                       | 8                             | 12
5 or more       | High Capacity            | 12                       | 8                             | 24
3 or 4          | Extreme Flash            | 8                        | 8                             | 8
5 or more       | Extreme Flash            | 8                        | 8                             | 16

• Benefits following operations
– REBUILD – disk failure
– RESYNC – cell patching
– RESILVER – flash card failure
– Disk/Flash WARMUP

Note – when adding a 5th cell to a configuration, rebalance will run longer as the increased number of partners is applied


Exadata : Quality of Service & Performance
Capacity Planning : Memory Configuration

Memory swapping can cause performance and stability issues

Correct memory configuration avoids :
• Swapping
• Instability

Credit : Kathy
https://unsplash.com/photos/R7nSPG8edVI
Exadata : Quality of Service & Performance
Exadata built for speed

• Smart Scan
• Smart Flash Cache
• Storage Index
• “The fastest I/O operation is the one that you
don’t need to do”

• Hybrid Columnar Compression


• In-Memory Columnar Format
• RDMA
• Real-Time Insight



Exadata : Quality of Service & Performance
RDMA Network Fabric

2 Active – Active ports in every RDMA Network Fabric Adapter

2 RDMA Network Fabric Switches in every Exadata single rack

22 Ports per switch used for internal cluster network, cabled ensuring no single point of failure exists

• Only to be used for Exadata purposes
• Settings on switch level not to be changed
• ZFS systems recommended to be connected through Top Of Rack (ToR) switches, for scalability and flexibility reasons

[Figure: two RDMA Network Fabric Adapters connected to an RDMA Network Fabric Switch]


Exadata : Quality of Service & Performance
Automatic Workload Prioritization

RDMA Fabric implements automatic Quality of Service (QoS)

Separate QoS lanes for specific traffic


• Critical I/O – LGWR
• Disk reads
• Disk writes



RoCE Network Resilience

Exadata RoCE IPs need to be highly available
• Each server has a dual-port RoCE NIC with each port connected to a different Leaf Switch
• Automatically failed over if a switch port is “down”

Unhealthy switches or network may leave ports “up” but network traffic stalled and unable to flow
• Switch misconfiguration
• Excessive pause frames

Network traffic stalls may result in database instability or outages

The ExaPortMon process runs on the host and monitors the live traffic of both RoCE ports
• Migrates IP to operational port if stall detected
• Returns IP to original port when upstream issue is resolved

[Figure: server with RoCE IPs 192.168.1.1 and 192.168.1.2 connected to two Leaf Switches and a Spine Switch, with an operator configuration error on one path]


Exadata : Quality of Service & Performance
Instant Failure Detection (IFD)

• Traditional systems use software to check availability
– May cause performance issues under high load
– Rely on TCP timeouts

• Exadata uses RDMA to check server availability
– Instant Failure Detection
– Utilizes 4 RDMA paths between servers for redundancy
– Database ↔ Storage Servers
– Database ↔ Database Servers

• If all four paths are unavailable after a short period the server is evicted

Sub-second notification vs up to 1-minute timeout on non-Exadata platforms

[Figure: two servers, each with HCA Port #1 and HCA Port #2, connected over RDMA]
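
The eviction rule can be written as a predicate over the four path states. This is an illustrative sketch only: the slide says "a short period" without giving a value, so `grace_period_s` and its default are hypothetical, as is the function name.

```python
def should_evict(paths_ok, unavailable_for_s, grace_period_s=1.0):
    """Evict a server only if every RDMA path has been down for the grace period.

    paths_ok: four booleans, one per RDMA path to the server.
    grace_period_s: hypothetical stand-in for the slide's "short period".
    """
    if len(paths_ok) != 4:
        raise ValueError("expected the state of exactly four paths")
    return not any(paths_ok) and unavailable_for_s >= grace_period_s
```

A single surviving path is enough to keep the server in the cluster in this model.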


Exadata Secure RDMA Fabric Isolation for RoCE

Exadata Secure Fabric for RoCE systems implements network isolation for Virtual Machines while allowing access to common Exadata Storage Servers
• Each VM cluster is assigned a private network
• VM clusters cannot communicate with each other
• All VMs can communicate to the shared storage infrastructure
• Security cannot be bypassed
– Enforcement done by the network card on every packet
– Rules programmed by hypervisor automatically


Conclusion Quality of Service

• Cell side I/O Latency Capping


• Cell disk confinement
• Smart Storage with I/O Resource Manager (IORM)
• Smart Flash Logging
• Smart Flash Log Write-Back
• Smart Flash replacement
• Exadata RDMA Memory Data Accelerator
• RDMA Network Fabric
• RDMA QoS
• Instant Failure Detection
• Exadata Secure RDMA Fabric Isolation
Credit : Towfiqu barbhuiya
https://unsplash.com/photos/0ZUoBtLw3y4



Data Protection
Quality Of Service and Performance
Brownout
Lifecycle Management


Exadata : Brownout
Blackout vs Brownout

• Blackout: complete service level interruption
• Brownout: significant service level degradation

Lost productivity & Lost revenue

Systems are complex and an issue in one layer can cascade to other layers

Our Engineered Systems and MAA best practices are designed and tuned to tackle this


Exadata : Brownout
Traditional System

• Each layer has its own failure detection and timeouts
• Usually fault detection times are additive
e.g. upon a storage controller crash, it takes 2 SCSI time-outs for the DB server to detect this failure

[Figure: traditional stack with Clusterware timeout, SCSI timeout across SAN/LAN, and proprietary protocol timeouts between two storage controllers]


Cell Controller Cache Failure Handling
Automated Data Loss Prevention

Failed cache controllers can be complicated on custom built systems and earlier Exadata systems

Before Exadata 21.2, a user had to recover from a failed controller cache with manual steps:
• Answering a cryptic question on the console about how to proceed
• Ensuring grid disks were force dropped before the controller was replaced

Credit : Liam Riby
https://unsplash.com/photos/j8h2_9UDqrM
Cell Controller Cache Failure Handling
Automated Data Loss Prevention

Using Exadata 21.2 and higher, repair from controller cache failure is handled automatically by doing the following
• Detecting the problem before cell services start post crash
• Disabling access to the grid disks
• Recovering the failed disks

Credit : Connor McSheffrey
https://unsplash.com/photos/MIspM6HIit8


Exadata : Brownout
Brownouts & Blackouts : Flex ASM
Oracle Flex ASM enables Oracle ASM instances to run on a separate physical server from the database servers.

• Enables continuous RDBMS ↔ ASM communication
• After ASM instance crash no need for a service failover
• Completely transparent to the application with no service level impact
• On Exadata Cardinality is set to ALL

[Figure: DBHOST 1, DBHOST 2 and DBHOST 3 running DB1 Inst 1 to 3 and ASM1 to ASM3 on top of ASM Storage]


Exadata : Brownout
Brownouts Reduction for Client Network Port Failure

Brownout associated with active/passive client access network port failure is extremely low.

LACP “Active / Active” can also be configured but needs changes on network infrastructure

[Figure: client access network with an ACTIVE CONNECTION and a PASSIVE CONNECTION]


Exadata : Brownout
Smart Handshake for Storage Server Shutdown

• When a storage server is shut down, the diskmon process in the Grid Infrastructure on the database server is notified

• No blackout when the storage tier is shut down for maintenance

[Figure: diskmon linking the Database Tier and the Storage Tier]


Exadata : Brownout
Smart OLTP Caching
Quick review of Exadata data access tiers first…

Database Buffer Cache
1. Data read into buffer cache
2. DBWR evicts a buffer to free up space in buffer cache

• Cell with primary mirror remains populated in super low latency Data Accelerator
• Cell with secondary mirror populated in low latency flash cache
• Cell with tertiary mirror located on high latency hard disk throughout

[Figure: data access tiers labelled Sizzling, Hot, Warm and Cold]


Exadata : Brownout
Smart OLTP Caching – Storage Failure

• Application reading data from


primary mirror
• Storage failure on cell containing
primary mirror
• Retrieve data from secondary mirror
on flash with low latency and
populate super low latency Data
Accelerator
• Tertiary mirror continues to provide
protection when Murphy strikes
• After repair of storage failure and
flash cache warm up, return to
primary copy

ASM rebalance, resync, and resilver always preserve flash cache state when moving extents
Exadata : Brownout
Cell-to-Cell Rebalance Preserves Data Accelerator Population

• Rebalance happens due to disk failure
• Primary mirror was cached in Data Accelerator
• Primary mirror moves to another cell
• Cache in Data Accelerator follows
• Latency is preserved, so the end user stays happy

Credit : Jacob Vizek https://unsplash.com/photos/ibvHQnpk4LE

Data Accelerator    Data Accelerator



Exadata : Brownout
Cisco RoCE Spine Switch Software Update

• MAA team tested spine switch reboots in Multi Rack configurations

• Exadata 21.2.*
• No blackouts
• Significantly reduced brownout



Lifecycle Management
Data Protection
Brownout
Quality of Service and Performance



Exadata : Life Cycle Management
Exachk

• Recommendations from Exachk come straight from the field and engineering, and are discussed in weekly meetings

• It is crucial to always have the latest version
• Keeps track of critical issues
• Issues in certain releases
• Exachk does not run if its version is older than 180 days

• Highly recommended to run Exachk
• Once a month
• Before and after any major configuration change, e.g. patching or storage addition

• Best practice health check
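
The two time-based rules on this slide (a 180-day version limit and a monthly run) can be expressed as a small check. This is a minimal sketch and not part of Exachk: the function name, the reminder texts, and the approximation of "once a month" as 30 days are assumptions, and the dates must be supplied by the caller.

```python
from datetime import date

MAX_VERSION_AGE_DAYS = 180   # from the slide: older versions do not run
RUN_INTERVAL_DAYS = 30       # "once a month", approximated as 30 days (assumption)


def exachk_reminders(release_date, last_run, today):
    """Return reminders based on the Exachk version age and the last run date."""
    reminders = []
    if (today - release_date).days > MAX_VERSION_AGE_DAYS:
        reminders.append("update Exachk: version is older than 180 days")
    if (today - last_run).days > RUN_INTERVAL_DAYS:
        reminders.append("run Exachk: last run was more than a month ago")
    return reminders


if __name__ == "__main__":
    for line in exachk_reminders(
        release_date=date(2024, 1, 1),
        last_run=date(2024, 12, 20),
        today=date(2025, 1, 1),
    ):
        print(line)
```

The check does not cover the event-driven rule (before and after a major configuration change); that still has to be part of the change procedure itself.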



Exadata : Life Cycle Management
Exachk

[Figure: "Outdated Exachk"]



Exadata : Life Cycle Management
Exachk: top observed pain points

• Huge pages not set correctly
• In later DB releases we check whether an SGA > 32 GB is used without huge pages configured
• If that is the case, the instance doesn't start

• Redundancy recommendation not followed

• Critical issue that is already fixed in later releases
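
As a rough illustration of the huge pages arithmetic, the sketch below converts an SGA size into a page count. It assumes the default 2 MiB huge page size on x86-64 Linux; the function name and the optional headroom parameter are hypothetical, and real sizing should follow Oracle's own guidance rather than this calculation.

```python
import math

HUGEPAGE_SIZE_MIB = 2  # default huge page size on x86-64 Linux


def hugepages_for_sga(sga_gib, headroom_pages=0):
    """Number of 2 MiB huge pages needed to hold an SGA of `sga_gib` GiB.

    `headroom_pages` is an optional safety margin chosen by the administrator.
    """
    sga_mib = sga_gib * 1024
    return math.ceil(sga_mib / HUGEPAGE_SIZE_MIB) + headroom_pages


if __name__ == "__main__":
    for sga in (32, 64, 128):
        print(f"SGA {sga} GiB -> {hugepages_for_sga(sga)} huge pages")
```

When several instances share a server, their SGA sizes add up before the division, and the resulting count is what the Linux `vm.nr_hugepages` setting has to accommodate.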



Exadata : Life Cycle Management
Exawatcher : Graphing

A picture says more than a thousand words



Exadata : Life Cycle Management
Monitoring by Enterprise Manager 13c



Exadata : Life Cycle Management
Exadata Real-Time Insight

• Automatically stream up-to-the-second metric observations from all servers in your Exadata fleet

• Feed customizable monitoring dashboards for real-time analysis and problem-solving

• Comprehensive : > 200 Exadata Software & Hardware Metrics

• Proactive issue detection and real-time decision making



Exadata : Life Cycle Management
Exadata Real-Time Insight

• Automatically stream up-to-the-second metric observations from all servers in your Exadata fleet
• Feed customizable monitoring dashboards for real-time analysis and problem-solving

• Comprehensive
• 200+ Exadata Software & Hardware Metrics
• Fine-grained metrics can be collected as often as every 1 second
• Integrated
• Integrated with popular time-series and observability platforms
• Stream fine-grained metrics to user-defined endpoints in real time
• Insightful
• Enables proactive issue detection and real-time decision making

https://blogs.oracle.com/exadata/post/real-time-insight-quick-start
Exadata : Life Cycle Management
Exadata Real-Time Insight – Sample Dashboards Code

• Oracle Samples repository on GitHub.com contains example Real-Time Insight dashboards.


• The following dashboard code is included (Grafana/Prometheus):
• Exadata Cluster
• Compute
• Storage Server
• Cell Disk
• Flash Cache
• Smart Scan
• Network

• https://github.com/oracle-samples/oracle-db-examples/tree/main/exadata/insight



Exadata : Life Cycle Management
Exadata Real-Time Insight – Sample Dashboards

• Exadata Cluster: Provides a cluster-wide view that shows metrics for compute nodes and storage servers
• Compute: Provides a compute-node view that shows CPU and network utilization for the compute nodes
• Storage Server: Provides a storage-server-centric view that focuses on storage server CPU and I/O metrics, as well as Exadata metrics for Smart Flash Cache, Smart Flash Log, and Smart I/O
• Cell Disk: Shows cell disk I/O metrics on the storage server
• Flash Cache: Shows flash cache metrics on the storage server
• Smart Scan: Shows smart scan metrics on the storage server



Exadata : Life Cycle Management
Exadata AWR support



Exadata : Life Cycle Management
Custom Diagnostic Package for Storage Server Alerts



Exadata : Life Cycle Management
Backup

• Back up your databases :-)
• ZDLRA or ZFS appliance with RMAN are recommended

• Back up the KVM host and KVM guest
• For more details, see the end of the presentation

• Test your backups

[Figure labels: Backup, Restore]

Schrödinger Backup :
The condition of any backup is unknown until a restore is attempted

Credit : Fabienne Fierens


Exadata Live Update
Increase security and minimize database server and VM reboots

Exadata System Software provides operating system, firmware, and Exadata software updates that are crucial for the optimal and secure operation of Exadata Database Servers and Oracle Database

Updates are applied in a rolling fashion across database servers

Exadata Live Update applies updates online and defers any remaining work to occur at a scheduled time

Exadata Live Update uses familiar Linux technologies, including RPM and ksplice, to apply updates online to database servers/VMs, avoiding the need to reboot

[Figure label: Exadata Database Server]



Exadata Live Update Options

Exadata Live Update provides multiple options based on the Common Vulnerability Scoring System (CVSS).
When using Exadata Live Update, you choose from the following options:

highcvss   Applies only security updates to address vulnerabilities with a CVSS score of 7 or greater
allcvss    Applies only security updates to address vulnerabilities with any CVSS score
full       Performs a full update, which includes all security-related updates and all other non-security updates. Equivalent to regular updates applied with a server/VM reboot

$ patchmgr --dbnodes kvm_guests.lst --upgrade --repo <repo.zip location> --rolling \
    --target_version 24.1.0.0.0.240517 --live-update-target highcvss|allcvss|full
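
To make the difference between the three options concrete, the sketch below models which updates each option would select. It is an illustration of the rules as stated on this slide, not how patchmgr is implemented: the `Update` class, the `selected_updates` function, and the sample package names and scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Update:
    name: str
    cvss: Optional[float]  # None marks a non-security update


def selected_updates(updates, target):
    """Names of the updates that a given Live Update option would include."""
    if target == "highcvss":
        return [u.name for u in updates if u.cvss is not None and u.cvss >= 7.0]
    if target == "allcvss":
        return [u.name for u in updates if u.cvss is not None]
    if target == "full":
        return [u.name for u in updates]
    raise ValueError(f"unknown option: {target!r}")


if __name__ == "__main__":
    # Made-up sample data for illustration.
    sample = [
        Update("pkg-a", 9.8),
        Update("pkg-b", 4.3),
        Update("pkg-c", None),
    ]
    for option in ("highcvss", "allcvss", "full"):
        print(option, selected_updates(sample, option))
```

Each option is a superset of the previous one, which is why `full` is described as equivalent to a regular update.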



Viewing Outstanding Work

Not all update content can be applied online, or activated without a reboot
• e.g. firmware, booting with the latest kernel, JDK

These updates are called ‘outstanding work’ and are staged for activation at the next graceful shutdown

Use patchmgr --live-update-list-outstanding-work to show outstanding items

$ patchmgr --dbnodes kvm_guests.lst --live-update-list-outstanding-work

***
Summary of outstanding work for Exadata Live Update:
exdpm1adm01vm01.example.com: (*) 2024-08-15 00:17:08: Exadata Live Update outstanding work is scheduled for completion at the next reboot
  - The Linux kernel will be updated from version 5.4.17-2136.330.7.5.el8uek to 5.4.17-2136.333.5.1.el8uek.
    Current Uptrack kernel version: 5.4.17-2136.333.5.1.el8uek.x86_64
  - New package uptrack-updates-5.4.17-2136.333.5.1.el8uek.x86_64 (version 20240725-0) will be installed.
Applying Outstanding Work

By default, outstanding work is applied during the next graceful shutdown

Administrators can use patchmgr --live-update-schedule-outstanding-work to:

• Specify the reboot window - "YYYY-MM-DD HH24:MM:SS TZ"

$ patchmgr --dbnodes kvm_guests.lst --live-update-schedule-outstanding-work \
    "2024-11-04 22:00:00 AEDT"

• Defer applying outstanding work - 'never'

$ patchmgr --dbnodes kvm_guests.lst --live-update-schedule-outstanding-work never

• Reset a previously set schedule to the default behavior - 'reset'

$ patchmgr --dbnodes kvm_guests.lst --live-update-schedule-outstanding-work reset

Oracle recommends outstanding work be applied at least every 3 months
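
When scheduling is scripted, the timestamp argument is easy to get wrong because it contains spaces. The sketch below builds the argument list for the three forms shown on this slide, using only the flags that appear there. The helper name `schedule_args` is hypothetical, the timezone abbreviation is passed in as plain text, and the sketch only constructs the arguments; it does not run patchmgr.

```python
from datetime import datetime


def schedule_args(nodes_file, when, tz_abbrev=None):
    """Build the patchmgr argument list for scheduling outstanding work.

    `when` is either a datetime (then `tz_abbrev`, e.g. "AEDT", is required)
    or one of the keywords "never" and "reset".
    """
    if isinstance(when, datetime):
        if not tz_abbrev:
            raise ValueError("a timezone abbreviation is required with a timestamp")
        value = f"{when:%Y-%m-%d %H:%M:%S} {tz_abbrev}"
    elif when in ("never", "reset"):
        value = when
    else:
        raise ValueError(f"unsupported schedule value: {when!r}")
    return [
        "patchmgr",
        "--dbnodes", nodes_file,
        "--live-update-schedule-outstanding-work", value,
    ]


if __name__ == "__main__":
    print(schedule_args("kvm_guests.lst", datetime(2024, 11, 4, 22, 0, 0), "AEDT"))
    print(schedule_args("kvm_guests.lst", "never"))
    print(schedule_args("kvm_guests.lst", "reset"))
```

Keeping the arguments as a list rather than one string means the timestamp stays a single argument when handed to something like `subprocess.run`, with no shell quoting involved.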


Exadata Live Update Best Practices

Database Server/VM Backup


• patchmgr automatically creates a system backup during all updates to allow for fast rollback if required
• Additional administrator-managed backups are recommended

Graceful reboots
• Include vm_maker --stop_domain/--start_domain operations, host restart (shutdown -r), a short press of the power button on the server, etc.
• Restarting the physical database server also restarts VMs
• Useful (but not required) to align VM and physical server reboots
• Avoid resetting VMs and physical servers while outstanding work is being applied

Use Database MAA features, including Transparent Application Continuity, to mask planned reboots from applications and users



Exadata Live Update
Applying monthly maintenance releases - examples

Quarterly Update Windows (Recommended)

Month       Release   Update type           Reboot
August      24.1.3    Full Update           Server/VM reboot
September   24.1.4    Exadata Live Update   No reboot
October     24.1.5    Exadata Live Update   No reboot
November    24.1.6    Full Update           Server/VM reboot
December    24.1.7    Exadata Live Update   No reboot
January     24.1.8    Exadata Live Update   No reboot
February    24.1.9    Full Update           Server/VM reboot
March       24.1.10   Exadata Live Update   No reboot

Bi-Yearly Update Windows

Month       Release   Update type           Reboot
August      24.1.3    Full Update           Server/VM reboot
September   24.1.4    Exadata Live Update   No reboot
October     24.1.5    Exadata Live Update   No reboot
November    24.1.6    Exadata Live Update   No reboot
December    24.1.7    Exadata Live Update   No reboot
January     24.1.8    Exadata Live Update   No reboot
February    24.1.9    Full Update           Server/VM reboot
March       24.1.10   Exadata Live Update   No reboot
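
Both example calendars follow the same rule: every Nth monthly release is applied as a full update with a reboot, and the releases in between are applied with Exadata Live Update. The sketch below reproduces that rule; the function name `update_plan` and its parameter are hypothetical, and the release numbers are the ones from the examples above.

```python
def update_plan(releases, full_every):
    """Label each monthly release as a full update or a Live Update.

    `full_every` is the number of months between full updates
    (3 for quarterly windows, 6 for twice-yearly windows).
    """
    plan = []
    for index, release in enumerate(releases):
        if index % full_every == 0:
            plan.append((release, "Full Update", "Server/VM reboot"))
        else:
            plan.append((release, "Exadata Live Update", "No reboot"))
    return plan


if __name__ == "__main__":
    releases = [f"24.1.{n}" for n in range(3, 11)]  # 24.1.3 through 24.1.10
    for label, every in (("Quarterly", 3), ("Bi-yearly", 6)):
        print(label)
        for release, kind, reboot in update_plan(releases, every):
            print(f"  {release:<8} {kind:<20} {reboot}")
```

The plan assumes the first release in the list is applied as a full update, as in both examples.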



Exadata : Life Cycle Management
Planned Maintenance

The Exadata patchmgr utility can be used to patch the whole hardware stack :
• Storage cells
• RoCE switches
• Admin switches
• Bare metal and KVM host
• KVM guest

Fleet Patching & Provisioning : the tool for out-of-place patching
• Database homes
• Grid infrastructure and combined GI + DB patching
• Also Exadata patching
• www.oracle.com/goto/FPP
• One tool to patch / upgrade your whole Oracle DB stack

[Figure labels: Gold Image Repository; 19.3.0.0, 19.15.0.0, 21.5.0.0]



Exadata : Further Reading

Backup
• https://www.oracle.com/technetwork/database/availability/recovery-appliance-maint-practices-4487388.pdf

KVM Virtualization
• https://www.oracle.com/a/tech/docs/exadata-kvm-overview.pdf

Life Cycle Management


• https://www.oracle.com/a/tech/docs/exadata-software-maintenance-2022.pdf

Security
• https://www.oracle.com/a/tech/docs/exadata-maximum-security-architecture.pdf

Exadata Real-Time Insight


• https://blogs.oracle.com/exadata/post/exadata-real-time-insight



Reference
Useful Resources

Exadata Product Management Blog - https://blogs.oracle.com/exadata/


MOS Note Reference Blog - https://blogs.oracle.com/exadata/post/exadata-mos-notes

Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)
Oracle Exadata Database Machine EXAchk (Doc ID 1070954.1)
Oracle Exadata Best Practices (Doc ID 757552.1)
Exadata Critical Issues (Doc ID 1270094.1)
Exadata Patching Overview and Patch Testing Guidelines (Doc ID 1262380.1)
The ASM Priority Rebalance feature - An Example (Doc ID 1968607.1)
Physical and Logical Block Corruptions. All you wanted to know about it. (Doc ID 840978.1)
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)
Understanding ASM Capacity and Reservation of Free Space in Exadata (Doc ID 1551288.1)



Exadata MAA : Conclusion

Solid as a rock Out of this world performance

Credit : Zoltan Tasi https://unsplash.com/photos/QxjEi8Fs9Hg Credit : Space X https://unsplash.com/photos/OHOU-5UVIYQ



Thank you



Our mission is to help people see
data in new ways, discover insights,
unlock endless possibilities.
