0% found this document useful (0 votes)

272 views54 pages

Brocade Resiliency

Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard FC media can significantly impact the performance of FC fabrics and the applications they support. However, with the proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate or protect against the effects of these misbehaving components to provide better fabric resiliency.

Uploaded by

Abhishek

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

272 views54 pages

Brocade Resiliency

Uploaded by

Abhishek

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 54

Front cover

Fabric Resiliency Best Practices

Chad Collie
Michael Hrencecin
David Lutz
Ian MacQuarrie
Shawn Wright

Redpaper
Fabric Resiliency Best Practices

This IBM® Redpaper™ publication describes preferred practices for deploying and using
advanced Brocade Fabric Operating System (FOS) features to identify, monitor, and protect
Fibre Channel (FC) SANs from problematic devices and media behavior.

FOS: This paper focuses on the FOS command options and features that are available
from versions 7.2 - 7.4, but also covers other features such as bottleneck detection, port
fencing, and Fabric Watch.

© Copyright IBM Corp. 2011, 2017. All rights reserved. ibm.com/redbooks 3

Introduction
Faulty or improperly configured devices, misbehaving hosts, and faulty or substandard FC
media can significantly impact the performance of FC fabrics and the applications they
support. In most real-world scenarios, these issues cannot be corrected or completely
mitigated within the fabric itself. Instead, the behavior must be addressed directly. However,
with the proper knowledge and capabilities, the fabric can often identify and, in some cases,
mitigate or protect against the effects of these misbehaving components to provide better
fabric resiliency.

This document concentrates specifically on Brocade Fabric Vision features (and related
capabilities) that help provide optimum fabric resiliency. Although some Fabric Vision features
have been available since FOS V7.0, most of the features are available since FOS V7.2.

For more information about the features that are described in this publication, see the product
documents that are appropriate for your FOS release. They are available to registered users
at:
http://my.brocade.com
򐂰 SAN Design and Best Practices
򐂰 Fabric OS Administrator’s Guide
򐂰 Fabric OS Command Reference Manual
򐂰 Fabric OS Monitoring and Alerting Policy Suites Configuration Guide
򐂰 Fabric OS Flow Vision Configuration Guide
򐂰 Brocade Network Advisor Administrator’s Guide

Factors affecting fabric resiliency

There are several common types of abnormal behavior originating from fabric components or
attached devices:
򐂰 Faulty media (fiber-optic cables and Small Form-Factor Pluggables [SFPs]/optics): Faulty
media can cause frame loss due to excessive cyclic redundancy check (CRC) errors,
invalid transmission words, and other conditions, which can result in I/O failure and
application performance degradation.
򐂰 Misbehaving devices, links, or switches: Occasionally, a condition arises where a device
(server or storage array) or link (inter-switch link (ISL)) behaves erratically and causes
disruptions in the fabric. If not immediately addressed, this situation might result in severe
stress on the fabric.
򐂰 Congestion: Congestion is caused by latencies or insufficient link bandwidth. End devices
that do not respond as quickly as expected can cause the fabric to hold frames for
excessive periods, which can result in application performance degradation or, in extreme
cases, I/O failure.
򐂰 Credit loss: Credit loss happens when the receiving end of a link fails to acknowledge a
request to transmit a frame because no buffers are available to receive the frame.

4 Fabric Resiliency Best Practices

Faulty media
In addition to high-latency devices causing disruptions to data centers, fabric problems are
often the result of faulty media. Faulty media can include bad cables, SFPs, extension
equipment, receptacles, patch panels, improper connections, and so on. Media can fault on
any SAN port type and fail, often unpredictably and intermittently, making it even harder to
diagnose. Faulty media involving server/host and storage device ports (F_Ports) results in an
impact to the end device attached to the F_Port and to devices communicating with this
device.

Failures on ISLs or E_Ports can have an even greater impact. Many flows (host and target
pairs) can simultaneously traverse a single E_Port. In large fabrics, this can be hundreds or
thousands of flows. If there is a media failure involving one of these links, it is possible to
disrupt some or all of the flows that use the path. Severe cases of faulty media, such as a
disconnected cable, can result in a complete failure of the media, which effectively brings a
port offline. This situation is typically easy to detect and identify. When it occurs on an F_Port,
the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe
failures on E_Ports typically only result in a minor drop in bandwidth because the fabric
automatically uses redundant paths. Also, error reporting that is built into FOS readily
identifies the failed link and port, allowing for simple corrective action and repair. With
moderate cases of faulty media, failures occur, but the port can remain online or transition
between online and offline. This situation can cause repeated errors, which can occur
indefinitely or until the media fails. When these types of failures occur on E_Ports, the result
can be devastating because there can be repeated errors that impact many flows, which can
result in significant impacts to applications that last for prolonged durations.

Signatures of these types of failures include:

򐂰 CRC errors on frames
򐂰 Invalid Transfer Words (includes encoder out errors)
򐂰 State Changes (ports going offline or online repeatedly)
򐂰 Credit loss (complete loss of credit on a virtual channel (VC) on an E_Port prevents traffic
from flowing on that VC, resulting in frame loss and I/O failures for devices that use the
VC)

Misbehaving devices
Another common class of abnormal behavior originates from high-latency end devices (host
or storage). A high-latency end device is one that does not respond as quickly as expected
and thus causes the fabric to hold frames for excessive periods. This situation can result in
application performance degradation or, in extreme cases, I/O failure. Common examples of
moderate device latency include disk arrays that are overloaded and hosts that cannot
process data as fast as requested. Misbehaving hosts, for example, become more common
as hardware ages. Bad host behavior is usually caused by defective host bus adapter (HBA)
hardware, bugs in the HBA firmware, and problems with HBA drivers. Storage ports can
produce the same symptoms due to defective interface hardware or firmware issues. Some
arrays deliberately reset their fabric ports if they are not receiving host responses within their
specified timeout periods. Severe latencies are caused by badly misbehaving devices that
stop receiving, accepting, or acknowledging frames for excessive periods. However, with the
proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate
or protect against the effects of these misbehaving components to provide better fabric
resiliency.

5
Congestion
Congestion occurs when the traffic being carried on a link exceeds its capacity. Sources of
congestion might be links, hosts, or storage responding more slowly than expected.
Congestion is typically due to either fabric latencies or insufficient link bandwidth capacity. As
FC link bandwidth has increased from one to 16 Gbps, instances of insufficient link bandwidth
capacities have radically decreased. Latencies, particularly device latencies, are the major
source of congestion in today’s fabrics due to their inability to promptly return buffer credits to
the switch.

Device-based latencies
A device experiencing latency responds more slowly than expected. The device does not
return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to
support the offered load, even though the offered load is less than the maximum physical
capacity of the link that is connected to the device.

Figure 1 illustrates the condition where a buffer backup on ingress port 6 on B1 causes
congestion upstream on S1, port 3. When all available credits are exhausted, the switch port
that is connected to the device must hold additional outbound frames until a buffer credit is
returned by the device.

Figure 1 Device latency example

When a device does not respond in a timely fashion, the transmitting switch is forced to hold
frames for longer periods, resulting in high buffer occupancy, which results in the switch
lowering the rate at which it returns buffer credits to other transmitting switches. This effect
propagates through switches (and potentially multiple switches, when devices attempt to send
frames to devices that are attached to the switch with the high-latency device) and ultimately
affects the fabric.

Figure 2 on page 7 shows how latency on a switch can propagate through the fabric.

6 Fabric Resiliency Best Practices

Figure 2 Latency on a switch can propagate through the fabric

Note: The impact to the fabric (and other traffic flows) varies based on the severity of the
latency that is exhibited by the device. The longer the delay that is caused by the device in
returning credits to the switch, the more severe the problem.

Moderate device latencies

Moderate device latencies from the fabric perspective are defined as those not severe
enough to cause frame loss. If the time between successive credit returns by the device is
between a few hundred microseconds to tens of milliseconds, the device exhibits mild to
moderate latencies because this delay is typically not enough to cause frame loss. This
situation does cause a drop in application performance but typically does not cause frame
drops or I/O failures.

The effect of moderate device latencies on host applications might still be profound, based on
the average disk service times that are expected by the application. Mission-critical
applications that expect average disk service times of, for example, 10 ms, are severely
affected by storage latencies in excess of the expected service times. Moderate device
latencies have traditionally been difficult to detect in the fabric. Advanced monitoring
capabilities that are implemented in Brocade ASICs and FOS have made these moderate
device latencies much easier to detect by providing the following information and alerts:
򐂰 Switches in the fabric generate Fabric Performance Impact (FPI) Alerts if FPI is enabled
on the affected ports.
򐂰 Elevated tim_txcrd_z counts on the affected F_Port, that is, the F_Port where the affected
device is connected.
򐂰 Potentially elevated tim_txcrd_z counts on all E_Ports carrying the flows to and from the
affected F_Port/ device.

7
Note: tim_txcrd_z is defined as the number of times that the port was polled and that the
port was unable to transmit frames because the transmit Buffer-to-Buffer Credit (BBC) was
zero. The purpose of this statistic is to detect congestion or a device that is affected by
latency. This parameter is sampled at intervals of 2.5 microseconds, and the counter is
incremented if the condition is true. Each sample represents 2.5 microseconds of time with
zero Tx BBC. tim_txcrd_z counts are not an absolute indication of significant congestion or
latencies and are just one of the factors in determining whether real latencies or fabric
congestion are present. Some level of congestion is to be expected in a large production
fabric and is reflected in tx_crd_z counts. The Brocade FPI feature was introduced to
remove uncertainty around identifying congestion in a fabric.

Note: tim_latency_vc is a Brocade Gen 5 Condor3 ASIC counter that measures the
latency time that a frame incurs in the transmit queue of its corresponding VC. The
purpose of this statistic is to directly measure the frame transmit latency of a switch port.
Each unit of the counter value represents 250 nanoseconds of latency. The Brocade FPI
feature uses this counter to enhance the detection of devices introducing latency into the
fabric.

Severe device latencies

Severe device latencies result in frame loss, which triggers the host Small Computer System
Interface (SCSI) stack to detect failures and to retry I/Os. This process can take tens of
seconds (possibly as long as 30 - 60 seconds), which can cause a noticeable application
delay and potentially results in application errors. If the time between successive credit
returns by the device is in excess of 100 ms, the device is exhibiting severe latency. When a
device exhibits severe latency, the switch is forced to hold frames for excessively long periods
(on the order of hundreds of milliseconds). When this time becomes greater than the
established timeout threshold, the switch drops the frame (per FC standards). Frame loss in
switches is also known as Fibre Channel Class 3 (IBM C3®) discards or timeouts.

Because the effect of device latencies often spreads through the fabric, frames can be
dropped due to timeouts, not just on the F_Port to which the misbehaving device is
connected, but also on E_Ports carrying traffic to the F_Port. Dropped frames typically cause
I/O errors that result in a host retry, which can result in significant decreases in application
performance. The implications of this behavior are compounded and exacerbated by the fact
that frame drops on the affected F_Port (device) result not only in I/O failures to the
misbehaving device (which are expected), but also on E_Ports, which might cause I/O failures
for unrelated traffic flows involving other hosts (and typically are not expected).

Latencies on ISLs
Latencies on ISLs are usually the result of back pressure from latencies elsewhere in the
fabric. The cumulative effect of many individual device latencies can result in slowing the link.
The link itself might be producing latencies, if it is a long-distance link with distance delays or
there are too many flows that use the same ISL. Although each device might not appear to be
a problem, the presence of too many flows with some level of latency across a single ISL or
trunked ISL might become a problem. Latency on an ISL can ripple through other switches in
the fabric and affect unrelated flows.

FOS can provide alerts and information indicating possible ISL latencies in the fabric, through
one or more of the following items:
򐂰 Switches in the fabric generate FPI Alerts if FPI is enabled on the affected ports.
򐂰 C3 transmit discards (er_tx_c3_timeout) on the device E_Port or EX_Port carrying the
flows to and from the affected F_Port or device.

8 Fabric Resiliency Best Practices

򐂰 Brocade MAPS alerts, if they are configured for C3 timeouts.
򐂰 Elevated tim_txcrd_z counts on the affected E_Port, which also might indicate congestion.
򐂰 C3 receives discards (er_rx_c3_timeout) on E_Ports in the fabric containing flows of a
high-latency F_Port.

Credit loss
Buffer credits are a part of the FC flow control and the mechanism that Fibre Channel
connections use to track the number of frames that are sent to the receiving port. Every time
a frame is sent, the credit count is reduced by one. When the sending port runs out of credits,
it is not allowed to send any more frames to the receiving port. When the receiving port
successfully receives a frame, it tells the sending port that it has the frame by returning an
r_rdy primitive. When the sending port receives an r_rdy, it increments the credit count. Credit
loss occurs when either the receiving port does not recognize a frame (usually due to bit
errors), so it does not return an r_rdy, or the sending port does not recognize the r_rdy
(usually due to link synchronization issues).

FC links are never perfect, so the occasional credit loss can occur, but it becomes an issue
only when all available credits are lost. Credit loss can occur on both external and internal FC
links. When credit loss occurs on external links, it is usually caused by faulty media, and credit
lost on internal ports is usually associated with jitter, which in most cases is adjusted for by
the internal adapter firmware. The switch automatically tries to recover from a complete loss
of credit on external links after 2 seconds by issuing a link reset. For the switch to perform
automatic recovery from internal link credit loss, the Credit Loss Detection and Recovery
feature must be enabled.

Designing resiliency into the fabric

This document is not intended to cover the general set of design considerations that are
required for designing a Storage Area Network (SAN), but there are a set of technologies that
should be considered to ensure that the fabric is resilient by design. This section includes
preferred practices for each of the following areas:
򐂰 Inter-switch link trunking
򐂰 Routing policies
򐂰 Edge Hold Time
򐂰 Credit recovery tools
򐂰 Dynamic port naming

For more information, see “Preferred implementation” on page 26.

General topics that are related to the architecture, topology, and capacity planning for a SAN
are described in SAN Design and Best Practices, found at:

http://my.brocade.com

Inter-switch link trunking

Trunking optimizes the use of bandwidth by allowing a group of ISLs to merge into a single
logical link, which is called a trunk group. Traffic is distributed dynamically and in order over
this trunk group, achieving greater performance with fewer links. Within the trunk group,
multiple physical ports appear as a single port, thus simplifying management.

9
Trunking improves system reliability by maintaining in-order delivery of data and avoiding I/O
retries if one link within the trunk group fails.

Trunking provides excellent protection from credit lost on ISLs. If credit loss occurs on an ISL,
frames continue to flow by using the other link until the switch can detect the credit loss
(typically 2 seconds) and perform a link reset to recover the credits.

More IT environments are relying on server virtualization technologies that can share host
adapter connections. Specifically, N_Port ID Virtualization (NPIV) allows many clients
(servers, guest, or hosts) to use a single physical port on the SAN. Each of these
communications paths from server, virtual or otherwise, is a data flow that must be
considered when planning for how many interswitch links are needed. These virtualized
environments often lead to a situation where there are a many data flows from the edge
switches potentially leading to frame-based congestion if there are not enough ISL or trunk
resources.

To avoid frame-based congestion in environments where there are many data flows between
switches, it is better to create several two-link trunks than one large trunk with multiple links.
For example, it is better to have two 2-link trunk groups than one 4-link trunk group.

Routing policies
The routing policy determines the route or path frames take when traversing the fabric. There
are three routing policies available:
򐂰 The default exchange-based routing (EBR)
򐂰 Port-based routing (PBR)
򐂰 Device-based routing (DBR)

EBR is the preferred routing policy for FCP fabrics. Before 2013, cascaded FICON
configurations supported only static PBR across ISLs. In this case, the ISL (route) for a given
port was assigned statically based on a round-robin algorithm at fabric login (FLOGI) time.
PBR can result in some ISLs being overloaded. In mid-2013, IBM z Systems added support
for DBR, which spread the routes across ISLs based on a device ID hash value. With the z13
release in mid-2015, IBM added FICON Dynamic Routing (FIDR), which supports Brocade
EBR to improve load balancing for cascaded FICON across ISLs.

The prerequisite z13 driver levels, adapter features, storage, and FOS levels to support FIDR
are included in the following white paper:
https://community.brocade.com/dtscp75322/attachments/dtscp75322/MainframeSolutions
/186/1/FICON%20Dynamic%20Routing%20White%20Paper%202016-08.pdf

FICON cascaded configurations with z13 and all other appropriate prerequisites should use
EBR. All other FICON cascaded configurations should use DBR.

For more information about the FICON Dynamic Routing feature, see Get More Out of Your IT
Infrastructure with IBM z13 I/O Enhancements, REDP-5134, found at:
http://www.redbooks.ibm.com/abstracts/redp5134.html?Open

Note: For FICON it should be DBR (if z/OS and z System supports DBR) regardless if it is
a FICON/FCP intermix or not. If z System does not support DBR then it must be PBR
regardless of intermix or not.

10 Fabric Resiliency Best Practices

Edge Hold Time
Hold time (HT) is the amount of time a Class 3 frame can remain in a queue before being
dropped while waiting for credit to be given for transmission.

Experience shows that when high latencies occur even on a single initiator or device in a
fabric, not only does the port that is attached to this initiator device see Class 3 frame
discards, but the resulting back pressure due to the lack of credit can build up in the fabric,
causing other flows that are not directly related to the high latency device to have their frames
discarded at ISLs.

Edge Hold Time (EHT) allows an overriding value for default HT that is applied to individual
F_Ports on Gen 5 FC platforms or all ports on an individual ASIC for 8 Gbps platforms if any
of the ports on that ASIC are operating as F_Ports.

Setting a lower EHT can be used to reduce the likelihood of this back pressure into the fabric
by assigning this lower HT value only for edge ports (initiators or targets). The lower EHT
value ensures that frames are dropped at the initiator or target port where the credit is lacking
before the higher default HT value that is used at the ISLs expires. This action can localize the
impact of a high latency port to just the single edge where the initiator or target is, preventing
the lack of credit from spreading into the fabric and impacting other unrelated flows.

Like HT, the EHT is configured for the entire switch, and is not configurable on individual ports
or ASICs. Whether the EHT or HT values are used on a port depends on the particular
platform and ASIC, and the type of port and also other ports that are on the same ASIC.

EHT is enabled by default in FOS V7.0 and later and there is no additional license that must
be configured.

Behavior
All Brocade Gen 5 platforms (16 Gbps) can set the HT value based on the type of each port
basis for ports on Gen 5 ASICs:
򐂰 All F-ports are programmed with the alternate EHT.
򐂰 All E_Ports are programmed with the default HT value (500 ms).

The same EHT value that is set for the switch is programmed into all F_Ports on that switch.
Different EHT values cannot be programmed on an individual port basis.

If 8 Gbps blades are installed into a Gen 5 platform (that is, an FC8-64 blade in a DCX 8510),
the same EHT value is programmed into all ports on the ASIC:
򐂰 If any single port on an ASIC is an E_Port, the default HT value (500 ms) value is
programmed into the ASIC, and all ports (E_Ports and F_Ports) use this one value.
򐂰 If all ports on an ASIC are F_Ports, the entire ASIC is programmed with the alternate EHT
value.

When deploying Virtual Fabrics, a unique EHT value can be independently configured for
each Logical Switch for Gen 5 Platforms running FOS V7.1 or later. When deploying Virtual
Fabrics with FOS V7.0, the EHT value that is configured into the default switch is the value
that is used for all Logical Switches. 8 Gbps blades that are installed in a Gen 5 platform
continue to use the Default Logical Switch configured value for all ports on those blades
regardless of which Logical Switches those ports are assigned to.

11
Preferred settings
Starting with FOS V7.0, the default EHT value is set to a moderate value of 220 ms. This
default EHT value is appropriate for most all environments.

The lowest EHT value of 80 ms can provide more protection from misbehaving initiators
compared to the default value, but this aggressive setting is preferable only for fabrics that are
well maintained and when a more aggressive monitoring and protection strategy is being
deployed. Additionally, this lowest value should be configured only on edge switches
composed entirely of initiators (with no device target ports) because a frame drop has more
significance for a target device than an initiator because multiple initiators typically
communicate with a single target port. Frame drops on target ports usually result in “SCSI
Transport” error messages being generated in server logs. Multiple frame drops from the
same target port can affect multiple servers in what appears to be a random fabric or storage
problem. Because the source of the error is not obvious, this situation can result in time that is
wasted for determining the source of the problem. Extra care should be taken to avoid
applying this lowest EHT, especially on switches where targets are deployed.

Credit recovery tools

FC traffic uses credit-based flow control where each side of a given connection provides a
number of buffers (referred to as “buffer to buffer” (BB)). These buffers are advertised to the
partner side of a connection as credits, indicating how many frames can be outstanding.
Credits are replenished when transmissions are successful and acknowledged. In rare error
scenarios situations, credit-based flow control acknowledgements are either not sent or not
received, leading to a “credit loss” condition. If the problem causing this failure is persistent,
credit loss can result in a stall of traffic.

FC credit-based recovery applies to external switch ports and back-end ports (ports that are
connected to the core blade or core blade back-end ports) that are used for traffic within a
switch. Traffic stalls on these internal back-end ports can have a wide impact, particularly
when they impact virtual circuits of an ISL. Starting with FOS V6.4.2 and V7.0, Brocade
introduced enhanced credit recovery tools to mitigate this type of problem. These tools can be
enabled to automatically reset back-end ports when a loss of credits is detected on internal
ports.

As a preferred practice, explicitly enable the credit recovery tools for internal ports because
this function is not enabled by default.

There are two main choices for how the recovery can proceed when enabled:
򐂰 An escalating recovery based on the results of a single link reset only (onLrOnly)
򐂰 A threshold-based approach that uses multiple link resets (onLrThresh).

When used with the onLrOnly option, the recovery mechanism takes the following escalating
actions:
򐂰 When it detects credit loss, it performs a link reset and logs a RASlog message (RAS
Cx-1014).
򐂰 If the link reset fails to recover the port, the port reinitializes. A RASlog message is
generated (RAS Cx-1015). The port reinitialization does not fault the blade.
򐂰 If the port fails to reinitialize, the port is faulted. A RASlog message (RAS Cx-1016) is
generated.
򐂰 If a port is faulted and there are no more online back-end ports in the trunk, the core blade
is faulted. (The port blade is always faulted.) A RASlog message is generated (RAS
Cx-1017).

12 Fabric Resiliency Best Practices

When used with the onLrThresh option, recovery is attempted through repeated link resets
and a count of the link resets is kept. If the threshold of more than the configured threshold
value (by using the -lrthreshold option) per hour is reached, the blade is faulted (RAS
Cx-1018). Regardless of whether the link reset occurs on the port blade or on the core blade,
the port blade is always faulted.

As a preferred practice, enable the credit tools with the onLrOnly recovery option.

Enabling credit recovery tools before Fabric OS V7.2

Before FOS V7.2, the credit recovery tool features were referred to as the “bottleneck credit
tool” and this function was managed with options to the bottleneck monitoring command.
Specifically, the --cfgcredittools parameter on the bottleneckMon command is used to
enable or disable credit recovery of back-end ports, and the --showcredittools parameter is
used to display the configuration.

Example 1 shows the use of the bottleneckmon --cfgcredittools command to enable

credit recovery on back-end ports by using the recovery option onLrOnly.

Example 1 The bottleneckmon --cfgcredittools command example

myswitch:admin> bottleneckmon --cfgcredittools -intport -recover onLrOnly
myswitch:admin> bottleneckmon --showcredittools
Internal port credit recovery is Enabled with LrOnly
myswitch:admin>

Enabling credit recovery tools in Fabric OS V7.2 and later

The creditRecovMode command was introduced in FOS V7.2.0 to manage all credit
recovery-related tools. Furthermore, as of FOS V7., all credit recovery options were
deprecated from the bottleneckMon command.

Example 2 shows the use of the creditRecovMode command to enable credit recovery on
back-end ports by using the recovery option onLrOnly.

Example 2 The creditRecovMode command

myswitch:admin> creditrecovmode --cfg onLrOnly
myswitch:admin> creditrecovmode --show
Internal port credit recovery is Enabled with LrOnly
myswitch:admin>

Dynamic port naming

Every port on a Brocade FC switch has a port name that by default is slot number port
number. The port name is displayed in many of the switch event messages, MAPS alert, and
Network Advisor dashboards.

Using a more meaningful port name makes these messages and dashboards more
meaningful and makes identifying external devices that are causing fabric problems easier
and quicker to identify.

The problem is that manually setting port names to more meaningful names is
labor-intensive, and typically only done with scripts to set the port name to the alias name of
the attached device.

13
With FOS V7.4, Brocade introduced dynamic port names that dynamically set the port name
to <switch name>.<port type>.<port index>.<alias name>. Dynamic port name is enabled by
using the configure command and setting dynamic port name to on.

In FOS V8, enhancements were made to allow configuring the dynamic port name by using
any of the following fields:
򐂰 Switch Name
򐂰 Port Type
򐂰 Port Index
򐂰 F_Port Alias
򐂰 FDMI Host name
򐂰 Remote Switch Name
򐂰 Slot / Port Number

Example 3 shows examples of both dynamically and manually set port names.

Example 3 The switchshow command showing port names

SANC_DCX1:FID32:dlutz> switchshow -portname

103 2 39 20:67:00:05:1e:d0:b5:05 SANC_DCX1.E_PORT.103

133 7 5 20:85:00:05:1e:d0:b5:05 CLSS14_HBA3
134 7 6 20:86:00:05:1e:d0:b5:05 SANC_DCX1.F_PORT.134.(null)
135 7 7 20:87:00:05:1e:d0:b5:05 DS5300_B2
134 7 6 20:86:00:05:1e:d0:b5:05 SANC_DCX1.F_PORT.134.(null)
145 7 17 20:91:00:05:1e:d0:b5:05 SANC_DCX1.F_PORT.145.XIV3_M5P
147 7 19 20:93:00:05:1e:d0:b5:05 SANC_DCX1.F_PORT.147.DS4800_B2
192 8 0 20:c0:00:05:1e:d0:b5:05 SANC_DCX1.(none).192
208 8 16 20:d0:00:05:1e:d0:b5:05 SANC_DCX1.F_PORT.208.(null)

Note: Ports that have manually set port names are not updated with dynamic port names.
To remove a manually set port name, you must reset the port configuration with the
portcfgdefault command, which resets all port parameters, including the port name, to
their default values.

Preferred settings
Enable dynamic port names.

Maintaining an optimal FC SAN environment

In each subsequent release of FOS, Brocade has added and enhanced features to assist with
monitoring, protecting, and troubleshooting fabrics. Several of these features have been
around since FOS vV7.0. The majority of the features have been available since FOS V7.2.
Starting in FOS V7.2, this set of features is referred to as Fabric Vision. When implemented
and administered properly, these features can dramatically improve the reliability and
resiliency of the fabric.

14 Fabric Resiliency Best Practices

This section focuses on the Fabric Vision features that are available in FOS versions 7.2 to
7.4, and specifically on the following subset of features that apply to monitoring and alerting:
򐂰 Bottleneck Detection
򐂰 Fabric Performance Impact (FPI)
򐂰 Monitoring Alerting Policy Suite (MAPS)
򐂰 Port Fencing
򐂰 Network Advisor Dashboards

Bottleneck Detection
Bottleneck Detection provides monitoring and detection of devices that introduce latency or
congestion into the fabric. This function was originally introduced in FOS V6.3.0 and has been
significantly enhanced in subsequent FOS releases in terms of effectiveness, reporting, and
configuration.

Starting in FOS V7.3, a new feature that is called FPI was introduced, which provides an
enhanced implementation of the Bottleneck Detection function and provides integration with
MAPS.

In FOS V7.3, the original Bottleneck Detection feature is still available for use, but it cannot be
used with FPI. The original Bottleneck Detection feature and FPI cannot be enabled
concurrently. As a preferred practice, use FPI over the Bottleneck Detection feature if the
Fabric Vision license is in place to enable it.

Starting with FOS V7.4, the original Bottleneck Detection feature was removed and FPI
becomes the only option to use this important function.

Enabling and disabling Bottleneck Detection

When Bottleneck Detection is enabled, RASlog alerts can be enabled to be sent when the
bottleneck conditions at a port exceed a specified threshold. To do so, complete the following
steps:
1. On the switch with target port connections, log in with administrator privileges.
2. Run the bottleneckmon --enable command to enable Bottleneck Detection on an F_Port
or FL_Port.
bottleneckmon --enable
[ -alert ] [ -thresh threshold ] [ -time window ] [ -qtime quiet_time]
[slot/]portlist [[slot/]portlist]...

If the alert parameter is not specified, alerts are not sent, but a history of bottleneck
conditions for the port can be viewed. The thresh, time, and qtime parameters are also
ignored if the alert parameter is not specified.

Use the default values for the thresh (0.1), time (300), and qtime (300) parameters.

Displaying the history of bottlenecks on a port

Use the bottleneckmon –show command to display a history of bottleneck conditions for an
individual port by completing the following steps:
1. Connect to the switch to which the target port belongs and log in as an administrator.
2. Run the bottleneckmon --show command to display a history of the bottleneck severity for
a specific port.

15
Example 4 shows the bottleneck history for port 3 in 5-second windows over a period of 30
seconds.

Example 4 Results of the bottleneckmon --show command

fcr_saturn1:root> bottleneckmon --show -interval 5 -span 30 3
=============================================================
Mon Jun 15 18:54:35 UTC 2010
=============================================================
From To affected secs
=============================================================
Jun 15 18:54:30 Jun 15 18:54:35 80.00%
Jun 15 18:54:25 Jun 15 18:54:30 40.00%
Jun 15 18:54:20 Jun 15 18:54:25 0.00%
Jun 15 18:54:15 Jun 15 18:54:20 0.00%
Jun 15 18:54:10 Jun 15 18:54:15 20.00%
Jun 15 18:54:05 Jun 15 18:54:10 80.00%

Bottleneck alert example

Example 5 shows a Bottleneck Detection alert on an F_Port.

Example 5 Example Bottleneck Detection alert on an F_Port

2010/03/16-03:40:47, [AN-1003], 21760, FID 128, WARNING, sw0, Latency bottleneck
at slot 0, port 38. 100.00 percent of last 300 seconds were affected. Avg. time
b/w transmits 80407.3975 us.

Preferred settings
On switches running FOS V6.2 through FOS V7.2, enable bottleneck monitoring.

Fabric Performance Impact

FPI monitoring was introduced in Brocade FOS V7.3.0 with monitoring for device latency
conditions on F_Ports. It was enhanced in Brocade FOS V7.4.0 with added support for
congestion detection on both E_Ports and F_Ports. Brocade FOS V7.4 also added automatic
mitigation capabilities through Slow Drain Device Quarantine (SDDQ) and Port Toggle
actions. FPI monitoring requires a Fabric Vision license and is supported on 8 Gbps and 16
Gbps platforms with FOS V7.3 and FOS V7.4. Starting with FOS V8.0, FPI monitoring does
not require a license on 16 Gbps and 32 Gbps platforms.

FPI detects different severity levels of latency and reports two latency states:
򐂰 The IO_FRAME_LOSS state is a severe level of latency. In this state, frame timeouts
either have occurred or are likely to occur. Administrators should take immediate action to
prevent application interruption.
򐂰 The IO_PERF_IMPACT state is a moderate level of latency. In this state, device-based
latencies can negatively impact the overall network performance.

FOS V7.4 added the IO_LATENCY_CLEAR state so that administrators are alerted when the
latency conditions clear. Administrators should act to mitigate the effect of latency devices.
The separate states enable administrators to apply different MAPS actions for different
severity levels. For example, administrators can configure SDDQ or the port toggling action
for the IO_FRAME_LOSS state and email alert action for IO_PERF_IMPACT.

16 Fabric Resiliency Best Practices

The SDDQ action sends the PID that is associated with the slow draining device to all the
switches in a fabric. All switches move the traffic that is destined for the slow draining device
into a low-priority VC. As a result, buffer credits on the regular, medium priority VC are freed
for traffic that is destined to other devices. This situation effectively removes the impact of the
slow drain device to the fabric performance without disruption to traffic. The result of this
action is that slow drain devices are isolated in a quarantine but remain online. This gives
administrators more time to find permanent solutions for the slow draining device problem. To
use the SDDQ action, the switches in the fabric are required to run FOS V7.4 with Fabric
Vision license and enable QoS on all the ports in the flow path.

The port toggle action disables a port for a short and user-configurable duration, and then
enables the port. The port toggle action can recover slow draining devices such as those
caused by a faulty host adapter. In addition, the port toggle action can induce multi-pathing
software (MPIO) to trigger traffic failing over to the alternative path to prevent severe
performance degradation. By using the SDDQ or port toggle actions, administrators can
monitor for device-based latency and automatically mitigate the problem when such
conditions are detected by FPI.

Preferred settings
On switches running FOS V7.3, enable FPI; on switches with FOS V7.4 and higher, enable
FPI and set up MAPS to quarantine the port for IO_FRAME_LOSS events by using the SDDQ
option (see “Monitoring Alerting Policy Suite setup” on page 31).

Note: To use SDDQ, quality of service (QoS) must be enabled on all switches, which is the
factory shipped default.

Monitoring Alerting Policy Suite

MAPS provides a new, easy-to-use solution for policy-based threshold monitoring and
alerting. MAPS proactively monitors the health and performance of the SAN infrastructure to
ensure application uptime and availability. By leveraging pre-built, policy-based rules
templates, MAPS simplifies threshold configuration, monitoring, and alerting. Organizations
can configure the entire fabric (or multiple fabrics) at one time by using common rules and
policies, or customize policies for specific ports or switch elements, all through a single
interface. The integrated network advisor dashboards display an overall switch health report,
along with details about out-of-policy conditions, to help administrators quickly pinpoint
potential issues and easily identify trends and other behaviors occurring on a switch or fabric.

MAPS was introduced in FOS V7.2 and replaces Fabric Watch as the preferred monitoring
tool. In FOS V7.4, Fabric Watch is no longer available.

Note: A Fabric Watch and Advanced Performance Monitoring or Fabric Vision license is
required to use MAPS.

Enabling MAPS monitoring with Network Advisor by using one of the default profiles is quick
and easy and provides effective monitoring of key metrics on every switch in the fabric. You
can enable MAPS by going to Network Advisor and clicking Monitor → Fabric Vision →
MAPS-Enable, and select and enable a default MAPS policy and distribute the policy to all
switches that are managed by Network Advisor by clicking Monitor → Fabric Vision →
MAPS → Configure.

17
Figure 3 shows the Network Advisor Fabric Vision MAPS menu.

Figure 3 Network Advisor Fabric Vision MAPS menu

If the switches were running Fabric Watch when MAPS is enabled, the Fabric Watch
configuration is converted to MAPS rules and a policy for the active and default Fabric Watch
settings is created. If you had a customized Fabric Watch set of rules, you can use the Fabric
Watch MAPS policy that was created instead of the default MAPS policy.

Note: Fabric Watch thresholds can be converted to MAPS rules/policies only on FOS V7.2
or V7.3. After the switch is upgraded to FOS V7.4, the Fabric Watch settings are lost.

MAPS alerts in the default policy generate emails and SNMP alerts. It is a preferred practice
to configure SNMP alerts to be sent to an SNMP manager, or configure email alerts to have
emails that are sent to key personnel to notify them of MAPS alerts.

Smaller installations that do not have Network Advisor running FOS V7.2 or V7.3 can enable
MAPS by running mapsconfig --enablemaps. In FOS V7.4 and higher, MAPS is enabled by
default, but unless you have a license, you can use only the limited base monitoring policy.

Note: It is common after enabling MAPS to see CPU utilization alerts. These can be
normal, as described in “CPU utilization” on page 45.

Custom policy for high availability fabrics

Fabrics that require high availability require strict monitoring. As a preferred practice, use a
custom set of MAPS rules that combine the thresholds from a moderate policy and an
aggressive policy into a single custom policy. The custom policy has aggressive rules that log
minor threshold violations so that you can detect marginal conditions and act. The custom
policy also has moderate rules to alert you through SNMP or email for conditions that need
more immediate attention, and a set of safety net rules to start automatic actions, such as for
port fencing to act automatically for severe issues.

Use the moderate default policy as a base and then customize the MAPS thresholds that are
shown in Table 1. Port thresholds should be customized for the Non_E_F_PORTS,
ALL_E_PORTS, ALL_OTHER_F_PORTS, ALL_HOSTS_PORTS, and
ALL_TARGET_PORTS groups.

Table 1 Custom MAPS port metrics thresholds

Type Timebase OP RASLOG Alert Fence (host)

C3TXTO min ge 3 20

CRC min ge 10 20 40

CRC hour 240

ITW min ge 20 40

18 Fabric Resiliency Best Practices

Type Timebase OP RASLOG Alert Fence (host)

LF min ge 3 5

LOSS_SIGNAL min ge 3

LOSS_SYNC min ge 3

LR min ge 5 10 20

LR hour 60

PE min ge 37 7

RX / TX / UTIL hour ge 75 90

STATE_CHG min ge 5 10

Note: Thresholds in the fence column should be created only in the ALL_HOST_PORTS
group.

Set the EPORT_DOWN and FAB_SEG thresholds to greater than or equal to 1 to create
alerts for every E_Port link that is down or switch segmentation by using the thresholds that
are shown in Table 2.

Table 2 Custom MAPS switch metric thresholds

Type Timebase OP RASLOG and alert

EPORT_DOWN min ge 1

FAB_SEG min ge 1

Preferred settings
Enable MAPS with the default moderate policy on all switches running FOS V7.2 or higher.
For fabrics that have high availability targets, create custom rules to provide additional
monitoring for marginal issues.

Port Fencing
The Brocade MAPS feature provides the ability to protect against faulty components and
conditions that impact links by automatically blocking ports when predefined thresholds are
reached.

Enabling MAPS rules with the Port Fencing option should be used with care so that fencing
ports happens only on ports that have severe issues. As a preferred practice, only MAPS
rules for host ports for the CRC and Link Rest thresholds be enabled for port fencing. Also,
run with the port fencing rules that are set up by the port fencing action as disabled for a
period to monitor the port fencing rules to ensure that they achieve the wanted effect.

Configuring Port Fencing

Complete the following steps:
1. From the MAPS configuration, click Monitor → Fabric Vision → MAPS → Configure,
select the MAPS policy to modify, and click Edit.
2. Select the rule to modify and click the left arrow, or create a rule and select the Fence
check box to enable the port fencing action.

19
3. Click the right arrow to transfer the rule to the selected policy.
Figure 4 shows the editing of a MAPS rule.

Figure 4 MAPS edit rule

4. After the rules are modified or created, activate the MAPS policy and monitor to ensure
that the rules are operating properly. Then, enable the port fencing facility. From the MAPS
configuration, click Actions and select the Fence check box.

Figure 5 on page 21 shows the activate MAPS policy actions.

20 Fabric Resiliency Best Practices

Figure 5 MAPS activate actions

Preferred settings
Enable port fencing on well-managed fabrics with high availability targets, and only on host
ports link reset and CRC metrics.

Network Advisor dashboards

Network Advisor 12 introduced dashboards and was further enhanced in subsequent
releases. Dashboards are a visual way to view key fabric metrics to help quickly identify
issues.

The dashboard displays different widgets that contain switch and port status, port thresholds,
performance monitors. and other items. Network Advisor comes with some standard
dashboards, such as Product Status and Traffic and SAN Port Health, and you can create
additional custom dashboards.

A dashboard provides a high-level overview of the network and the current states of managed
devices. You can easily check the status of the devices in the network. The dashboard also
provides several features to help you quickly access reports, device configurations, and
system event logs.

21
The custom Fabric Health dashboard that is shown in Figure 6 has several widgets defined
that can quickly show the current state and health of the fabric. At the top, the Scope field
defines which switches and what time frame is used to populate the widgets. Widgets such as
the Out of Range widget shows the number of ports that had violations for each category for
the selected time range, and by double-clicking the category, dialog boxes open where you
can drill down to the specific details for the violations. Similarly, you can use the Events
widgets to click the event severity to display the individual event messages.

Figure 6 Network Advisor Fabric Health dashboard

Note: For more information about setting up dashboards and configuring the dashboard
widgets, see the Brocade Network Advisor SAN User Manual for your release by searching
in the Brocade Document Library:
http://www.brocade.com/en/support/document-library.html

One of the more powerful features of the dashboards is the ability to select the time frame or
which fabric is used to populate the widgets. You can set the time frame for the last 24 hours
to see what issues occurred in the past day to monitor for marginal issues that might be
occurring, or dial down the scope to 30 minutes to focus on the current metrics when trying to
investigate a problem currently happening.

Figure 7 on page 23 shows the dashboard time and fabric scope selection window.

22 Fabric Resiliency Best Practices

Figure 7 Network Advisor Dashboard scope selection

Another useful feature of the dashboard widgets is the ability to double-click most of the
widget metrics to see additional details or a graph of the metric. Figure 8 shows the ITW port
widget. By double-clicking the port name, a chart showing the ITW occurrences opens.
Figure 8 shows the Host port ITW widget showing 427 ITWs.

Figure 8 Network Advisor Dashboard ITW widget

23
Figure 9 shows the ITW graph after double-clicking the port on the ITW widget.

Figure 9 Network Advisor Dashboard ITW graph

As a preferred practice, create a customized dashboard to monitor the overall fabric health,
and a dashboard to monitor port metrics. Optionally, create specialized custom dashboards to
show port metrics for storage devices, server ports, and ISLs. These dashboards are used
during major incidents and can help identify whether there is a storage port, server port, or
ISL that is causing a problem.

Preferred settings
Create Fabric Health and Ports dashboards with the widgets that are shown in Table 3.

Table 3 Suggested Product Status and Ports dashboards widgets

Dashboard name Status widget Performance widget

Fabric Health SAN Status Top Product CPU

SAN Inventory Top Product Memory

Events

Status

Out of Range Violations

Port Health Violations

Ports Port Health Violations Top Port C3 Discards

Top Port C3 Discards RX TO

Top Port C3 Discards TX TO

Top Port CRC

Top Port ITW

Top Port Too Long Errors

Top Port Utilization Percent

24 Fabric Resiliency Best Practices

Optionally, create Host, Storage, and ISL port dashboards, as shown in Table 4, which can be
useful when doing problem determination.

Table 4 Optional Host, Storage, and ISL dashboard widgets

Dashboard Status widget Performance widget

Host Ports Initiator Port Health Violations Top Initiator Ports C3 Discards

Initiator Bottleneck Ports Top Initiator Ports CRC Errors

Top Initiator Ports Encode Error

Top Initiator Port Sync Loss

Top Initiator Port ITWs

Storage Ports Target Port Health Violations Top Target Ports C3 Discards

Target Bottleneck Ports Top Target Ports CRC Errors

Top Target Ports Encode Error

Top Target Port Sync Loss

Top Target Port ITWs

ISL Ports ISL Port Health Violations Top ISL Ports C3 Discards

ISL Bottleneck Ports Top ISL Ports CRC Errors

Top ISL Ports Encode Error

Top ISL Port Sync Loss

Top ISL Ports C3 Discards RX

Top ISL Ports C3 Discards TX

25
Summary of preferred practices
Here is a summary of the preferred features and capabilities to improve the overall resiliency
of FOS-based FC fabric environments:
򐂰 Enable an appropriate routing policy.
򐂰 Configure an appropriate In Order Delivery (IOD) setting.
򐂰 Configure an appropriate Dynamic Load Sharing (DLS) setting.
򐂰 Verify Edge Hold Time.
򐂰 Enable Credit Recovery Tool.
򐂰 Enable Dynamic Port name.
򐂰 Enable Fabric Performance Impact or Bottleneck monitoring.
򐂰 Enable MAPS monitoring and alerting.
򐂰 Enable Slow Drain Device Quarantine.
򐂰 Configure and use Network Advisor Dashboards.

Preferred implementation
This section describes the preferred sequence for implementing the fabric resiliency features
that are provided by the Brocade FOS along with the preferred configuration values.

Note: The preferred sequence and associated thresholds that are presented are identified
for most environments. It is possible that specific environments might require alternative
settings to meet specific requirements.

Enabling the routing policy

Enable the appropriate routing policy based on what environment the fabric supports.

Enable EBR for open systems environments by using the Advanced Performance Tuning
Policy (aptpolicy) command.

Note: EBR is the default routing policy.

Example 6 shows the aptpolicy command that is used to set the EBR policy.

Example 6 The aptpolicy command for exchange-based routing

DCX1_Default:FID128:dlutz> aptpolicy 3
Policy updated successfully.

DCX1_Default:FID128:dlutz> aptpolicy
Current Policy: 3

3 : Default Policy
1: Port Based Routing Policy
2: Device Based Routing Policy (FICON support only)
3: Exchange Based Routing Policy

Enable EBR policy for switches that support FICON only or FICON and open systems
environments.

26 Fabric Resiliency Best Practices

Example 7 shows the aptpolicy command that is used to set the DBR policy.

Example 7 The aptpolicy command for device-based routing

DCX1_Default:FID128:dlutz> aptpolicy 2
Policy updated successfully.

DCX1_Default:FID128:dlutz> aptpolicy
Current Policy: 2

3 : Default Policy
1: Port Based Routing Policy
2: Device Based Routing Policy (FICON support only)
3: Exchange Based Routing Policy

Enabling In Order Delivery

Disable IOD for switches unless you are using FICON.

Example 8 shows the iodset and iodshow commands to enable frame IOD.

Example 8 The iodset and iodshow commands

DCX1_SANA:FID16:dlutz> iodset
IOD is set

DCX1_SANA:FID16:dlutz> iodshow
IOD is set

Example 9 shows the iodreset and iodshow commands to disable frame IOD.

Example 9 The iodreset command

DCX1_SANA:FID16:dlutz> iodreset
IOD is not set

DCX1_SANA:FID16:dlutz> iodshow
IOD is not set

Enabling Dynamic Load Sharing

DLS is enabled when the exchange-based or EBR policies are used and cannot be changed.

Lossless should be enabled. Example 10 shows the dlsset command to enable lossless.

Example 10 The dlsset command to enable lossless

DCX1_Default:FID128:dlutz> dlsset --enable -lossless
DLS and Lossless are set

DCX1_Default:FID128:dlutz> dlsshow
DLS is set with Lossless enabled
E_Port Balance Priority is not set

27
E_Port balance priority should be enabled. Example 11 shows the dlsset commands to
enable E_Port balance priority.

Example 11 The dlsset commands to enable E_port balance priority

DCX1_SANA:FID16:dlutz> dlsset --enable -eportbal
E_Port Balance Priority is set

DCX1_SANA:FID16:dlutz> dlsshow
DLS is set with Lossless enabled
E_Port Balance Priority is set

Enabling Edge Hold Time

Enable EHT on switches to 220ms.

Note: To update the EHT setting on switches with virtual fabrics, run the configure
command from all logical switches.

Example 12 shows the configure command that is used to set the EHT.

Example 12 The configure command to set Edge Hold Time

DCX1_SANA:FID16:dlutz> configure

Not all options will be available on an enabled switch.

To disable the switch, use the "switchDisable" command.

Configure...

Fabric parameters (yes, y, no, n): [no] y

WWN Based persistent PID (yes, y, no, n): [no]

Allow XISL Use (yes, y, no, n): [no]
Location ID: (0..4) [0]
Edge Hold Time(Low(80ms), Medium(220ms), High(500ms): (80..500) [220] 220
Remote Fosexec feature: (on, off): [off]
High Integrity Fabric Mode (yes, y, no, n): [no]
D-Port Parameters (yes, y, no, n): [no]

Enabling the Credit Recovery Tool

Enable the Credit Recovery Tool with the link reset only (LROnly) option.

Example 13 shows the bottleneck commands to enable credit tools and display the credit
tools’ current setting for FOS V6.4 through FOS V7.2.

Example 13 The bottleneckmon cfgcredittools command for Fabric OS V6.4 through Fabric OS V7.2
IBM_2005_BK5:dlutz> bottleneckmon --cfgcredittools -intport -recover onLrOnly

IBM_2005_BK5:dlutz> bottleneckmon --showcredittools

Internal port credit recovery is Enabled with LrOnly

28 Fabric Resiliency Best Practices

Example 14 shows the creditrecovmode command to enable credit tools and display the
credit tools settings for FOS V7.3.

Example 14 The creditrecovmode command for Fabric OS V7.3

DCX1_SANA:FID16:dlutz> creditrecovmode --cfg on onLrOnly
DCX1_SANA:FID16:dlutz> creditrecovmode --fe_crdloss on

DCX1_SANA:FID16:dlutz> creditrecovmode --show

Internal port credit recovery is Enabled with LrOnly
LR threshold (not currently activated): 2
Fault Option (not currently activated): EDGEBLADE
C2 FE Complete Credit Loss Detection is Enabled

Example 15 shows the creditrecovmode command to enable credit tools and display the
credit tools settings for FOS V7.4.

Example 15 The creditrecovmode command for Fabric OS V7.4

DCX1_SANA:FID16:dlutz> creditrecovmode --cfg on onLrOnly
DCX1_SANA:FID16:dlutz> creditrecovmode --fe_crdloss on
DCX1_SANA:FID16:dlutz> creditrecovmode --be_crdloss on
DCX1_SANA:FID16:dlutz> creditrecovmode --be_losync on

DCX1_SANA:FID16:dlutz> creditrecovmode --show

Internal port credit recovery is Enabled with LrOnly
LR threshold (not currently activated): 2
Fault Option (not currently activated): EDGEBLADE
C2 FE Complete Credit Loss Detection is Enabled

Enabling the dynamic port name

On switches running FOS V7.4 and higher, enable the dynamic port name.

Note: To enable the dynamic port name on switches with virtual fabrics, run the configure
command from all logical switches.

29
Example 16 shows the configure commands that are used to enable the dynamic port name.

Example 16 The configure command to enable the dynamic port name

F48a_Default:FID128:dlutz> configure

Not all options will be available on an enabled switch.

To disable the switch, use the "switchDisable" command.

Configure...

Fabric parameters (yes, y, no, n): [no] y

WWN Based persistent PID (yes, y, no, n): [no]

Allow XISL Use (yes, y, no, n): [yes]
Dynamic Portname (on, off): [on] on
Edge Hold Time(Low(80ms), Medium(220ms), High(500ms), UserDefined(80-500ms):
(80..500) [220]

Enabling bottleneck monitoring or Fabric Performance Impact

On switches running FOS V6.4 through FOS V7.2, enable bottleneck monitoring, and on
switches with FOS V7.3 or higher, enable FPI.

Enabling bottleneck monitoring

Example 17 shows the bottleneckmon enable command to enable bottleneck monitoring on
FOS V6.4 through FOS V7.2.

Example 17 The bottleneckmon enable command

SANA_DCX2:FID16:dlutz> bottleneckmon --enable -lthresh 0.1 -cthresh 0.5 -time 300
-qtime 300 -alert

SANA_DCX2:FID16:dlutz> bottleneckmon --status

Bottleneck detection - Enabled
==============================

Switch-wide sub-second latency bottleneck criterion:

====================================================
Time threshold - 0.800
Severity threshold - 50.000

Switch-wide alerting parameters:

================================
Alerts - Yes
Latency threshold for alert - 0.100
Congestion threshold for alert - 0.500
Averaging time for alert - 300 seconds
Quiet time for alert - 300 seconds

Enabling Fabric Performance Impact

Example 18 on page 31 shows the mapsconfig command that enables FPI on FOS V7.3 and
higher.

30 Fabric Resiliency Best Practices

Example 18 The mapsconfig command that enables Fabric Performance Impact
DCX1_SANA:FID16:dlutz> mapsconfig --enableFPImon

DCX1_SANA:FID16:dlutz> mapsconfig --show

Configured Notifications: RASLOG,SNMP,EMAIL,SW_CRITICAL,SW_MARGINAL
Mail Recipient: Not Configured
FPI Monitoring: Enabled

Enabling Monitoring Alerting Policy Suite

On switches running FOS V7.2 or higher, use the MAPS to monitor the switches over Fabric
Watch. To ensure that consistent settings and policies are enabled on all switches in the
fabric, use Network Advisor.

Monitoring Alerting Policy Suite setup

To set up MAPS, complete the following steps:
1. Log in to Network Advisor and click Monitor → Fabric Vision → MAPS → Enable.
Figure 10 shows the Network Advisor menu options to enable MAPS.

Figure 10 Network Advisor Monitoring Alerting Policy Suite Enable menu

2. Select the switches that you want to enable MAPS on by selecting them in the Available
Switches pane and click the right arrow to move them to the Selected Switches pane. After
all the switches that you want to enable MAPS on are selected, click OK to enable MAPS
on those switches.

31
Figure 11 shows the Network Advisor MAPS enable switch selection window.

Figure 11 Network Advisor MAPS enable switch selection

3. After MAPS is enabled, click Monitor → Fabric Vision → MAPS → Configure.

Figure 12 shows the Network Advisor menu options to open the MAPS configuration
window.

Figure 12 Network Advisor MAPS open Configure menu

32 Fabric Resiliency Best Practices

4. In the MAPS Configuration window, select the switches on which to enable policy actions.
To select multiple switches, hold the Ctrl key while selecting the switches. After the
switches are selected, click Actions.
Figure 13 shows the MAPS Configuration window.

Figure 13 Network Advisor MAPS Configuration

33
5. In the MAPS Policy Actions dialog box, select the RAS Log Event, SNMP Trap, E-mail,
Switch Status Marginal, Switch Status Critical, and SFP Status Marginal check boxes,
and for switches with FOS V7.4 and higher, select FPI Actions and SDDQ. Click OK.

Note: Do not enable the Fence action.

Figure 14 shows the MAPS Policy Actions dialog box.

Figure 14 Network Advisor MAPS Policy Actions

6. In the MAPS Configuration dialog box, expand the list of available policies for each of the
switches. Select the dft_conservative_policy for each switch. To select a policy for each
switch, hold the Ctrl key while selecting the policies. After policies for each switch are
selected, click Activate.
Figure 15 on page 35 shows the MAPS Configuration dialog box with
dft_conservativy_policy selected.

34 Fabric Resiliency Best Practices

Figure 15 Network Advisor MAPS Configuration with conservative policy selected

MAPS is now enabled with the default conservative policy.

Enabling the MAPS setup for high availability fabrics

For switches that require strict monitoring to provide highly available fabrics, implement
custom MAPS rules for the port and switch thresholds that are listed in the tables in this
section by completing the following steps:
1. Create a customer MAPS policy by making a copy of the default moderate policy by
running the mapspolicy --clone command.
2. Display existing rules in the newly created customer policy by running the mapspolicy
--show <policyname> command.
3. Remove any existing default maps rules for metrics that are shown in the tables in this
section (if any custom rules will be created) by running the mapspolicy --delrule
commands.
4. Create customer rules by running the mapsrule --create command, and add these rules
to the newly created custom policy.

35
Table 5 shows the custom MAPS port metric thresholds.

Hint: To help identify which existing default rules must be removed, run the following
command:

mapspolicy --show dflt_moderate_policy | grep LOSS_SIGNAL

Table 5 Custom MAPS port metrics thresholds

Type Timebase OP RASLOG Alert Fence (host)

C3TXTO min ge 3 20

CRC min ge 10 20 40

CRC hour 20 240

ITW min ge 20 40

LF min ge 3 5

LOSS_SIGNAL min ge 3

LOSS_SYNC min ge 3

LR min ge 5 10 20

LR hour 60

PE min ge 37 7

RX / TX / UTIL hour ge 75 90

STATE_CHG min ge 5 10

Table 6 shows the custom MAPS switch metric thresholds.

Table 6 Custom MAPS switch metric thresholds

Type Timebase OP RASLOG and alert

EPORT_DOWN min ge 1

FAB_SEG min ge 1

Example 19 shows the sample commands to create custom MAPS policy and rules.

Example 19 Sample commands to create custom MAPS policy and rules

dlutz> mapspolicy --clone dflt_moderate_policy -name custom_policy

dlutz> mapspolicy --show dflt_moderate_policy | grep CRC

dlutz> mapspolicy --delrule custom_policy --rulename defALL_HOST_PORTSCRC_10

dlutz> mapspolicy --delrule custom_policy --rulename defALL_HOST_PORTSCRC_20

dlutz> mapsrule --create ALL_HOST_CRC_Log -policy custom_policy -group

ALL_HOST_PORTS -monitor CRC -timebase Min -op ge -value 10 -action RASLOG

36 Fabric Resiliency Best Practices

dlutz> mapsrule --create ALL_HOST_CRC_Alert -policy custom_policy -group
ALL_HOST_PORTS -monitor CRC -timebase Min -op ge -value 20 -action
RASLOG,EMAIL,SNMP

dlutz> mapsrule --create ALL_HOST_CRC_Fence -policy custom_policy -group

ALL_HOST_PORTS -monitor CRC -timebase Min -op ge -value 40 -action
RASLOG,EMAIL,SNMP,FENCE

dlutz> mapsrule --create ALL_HOST_CRC_FenceH -policy custom_policy -group

ALL_HOST_PORTS -monitor CRC -timebase Hour -op ge -value 120 -action
RASLOG,EMAIL,SNMP,FENCE

dlutz> mapsrule --create SWICTH_EportDown_Alert -policy custom_policy -group

switch -monitor EPORT_DOWN -timebase min -op ge -value 1 -action RASLOG,EMAIL,SNMP

dlutz> mapspolicy --enable custom_policy

Note: For more information about MAPS and MAPS commands, see the Brocade MAPS
Administration Guide, found at:
http://my.brocade.com

Enabling Slow Drain Device Quarantine

On switches with FOS V7.4 and higher, enable SDDQ for the MAPS IO_FRAME_LOSS
events.

Note: To use the SDDQ switch, QOS must be enabled on all switches.

Complete the following steps:

1. To verify whether QOS is active on the ISL switch ports, run the islshow command on the
switch and look for QOS next to each ISL link. You can view the port QOS setting by running
the portcfgshow command. If QOS is not enabled, run the portcfgqos command.
Example 20 show the islshow command on an ISL that has QOS active.

Example 20 The islshow command showing the QOS setting

SANA_DCX1:FID16:dlutz> islshow
1:199-> 28 10:00:00:05:33:99:12:02 20 SANA_DCX2 sp: 4.000G bw:
4.000G TRUNK QOS

37
Example 21 shows the portcfgshow command on ports with the default QOS AutoEnable
setting.

Example 21 The portcfgshow command showing the QOS setting

SANA_DCX1:FID16:dlutz> portcfgshow
Ports of Slot 2 16 17 18 19 20 21 22 23 29 30 31
----------------------+---+---+---+---+-----+---+---+---+---+---+---
Speed AN AN AN AN AN AN AN AN AN AN AN
Fill Word(On Active) 0 0 0 0 0 0 0 0 0 0 0
Fill Word(Current) 0 0 0 0 0 0 0 0 0 0 0
AL_PA Offset 13 .. .. .. .. .. .. .. .. .. .. ..

QOS Port AE AE AE AE AE AE AE AE AE AE AE
EX Port .. .. .. .. .. .. .. .. .. .. ..

2. To enable SDDQ, update the appropriate IO_FRAME_LOSS MAPS rules to use the
SDDQ action.
3. Start the MAPS configuration dialog box by clicking Monitor → Fabric Vision →
MAPS → Configure. In the MAPS Configure dialog box, select the appropriate MAPS
policy and click Edit. Edit the IO_FRAME_LOSS rule on the FPI tab and select the SDDQ
check box.
Figure 16 shows the FPI rule IO_FRAME_LOSS with the SDDQ action enabled.

Figure 16 Network Advisor update MAPS Fabric Performance Impact IO_FRAME_LOSS rule

Note: Run several weeks with this rule enabled but with the SDDQ facility disabled to
ensure that the rule works as expected.

38 Fabric Resiliency Best Practices

4. Enable the SDDQ facility on the MAPS configuration dialog box by selecting the fabric and
clicking Actions.
Figure 17 shows the MAPS Policy actions dialog box with the FPI SDDQ action enabled.

Figure 17 Network Advisor MAPS Policy Actions with SDDQ enabled

Creating dashboards
Complete the following steps:
1. Create Fabric Health and Ports dashboards with the widgets that are shown in Table 7.

Table 7 Suggested Product Status and Ports dashboards widgets

Dashboard name Status widget Performance widget

Fabric Health SAN Status Top Product CPU

SAN Inventory Top Product Memory

Events

Status

Out of Range Violations

Port Health Violations

Ports Port Health Violations Top Port C3 Discards

Top Port C3 Discards RX TO

39
Dashboard name Status widget Performance widget

Top Port C3 Discards TX TO

Top Port CRC

Top Port ITW

Top Port Too Long Errors

Top Port Utilization Percent

2. Optionally, create a Host, Storage, and ISL port dashboards (which can be useful when
doing problem determination), as shown in Table 8.

Table 8 Optional Host, Storage, and ISL dashboard widgets

Dashboard Status widget Performance widget

Host Ports Initiator Port Health Violations Top Initiator Ports C3 Discards

Initiator Bottleneck Ports Top Initiator Ports CRC Errors

Top Initiator Ports Encode

Error

Top Initiator Port Sync Loss

Top Initiator Port ITWs

Storage Ports Target Port Health Violations Top Target Ports C3 Discards

Target Bottleneck Ports Top Target Ports CRC Errors

Top Target Ports Encode Error

Top Target Port Sync Loss

Top Target Port ITWs

ISL Ports ISL Port Health Violations Top ISL Ports C3 Discards

ISL Bottleneck Ports Top ISL Ports CRC Errors

Top ISL Ports Encode Error

Top ISL Port Sync Loss

40 Fabric Resiliency Best Practices

Dashboard Status widget Performance widget

Top ISL Ports C3 Discards RX

Top ISL Ports C3 Discards TX

3. Create a dashboard by selecting the My Dashboard group and clicking Add.

Figure 18 shows the Network Advisor window that is used to add a dashboard.

Figure 18 Network Advisor: Add dashboard

41
4. Enter the dashboard name in the Name entry field in the Add Dashboard dialog box and
click OK.
Figure 19 shows the Add Dashboard dialog box.

Figure 19 Network Advisor Add Dashboard dialog box

5. Use the Customize Dashboard tool to add widgets to the empty dashboard.
Figure 20 shows the icon to start the Customize Dashboard tool.

Figure 20 Network Advisor Customer Customize Dashboard icon

6. To add widgets to the dashboard, select the required widgets by checking the check box
next to the widget titles.
Figure 21 on page 43 shows the Customize Dashboard Status dialog box.

42 Fabric Resiliency Best Practices

Figure 21 Network Advisor Customer Dashboard Status dialog box

7. Select the Performance tab to add performance widgets.

Figure 22 shows the Customize Dashboard Performance dialog box.

Figure 22 Network Advisor Customize Dashboard Performance dialog box

43
Figure 23 shows the completed dashboard with widgets.

Figure 23 Network Advisor Fabric Health dashboard

Access gateway
BladeCenter and chassis-style systems typically have embedded switches that are installed
in them. These switches can operate in native fabric mode or access gateway (AG) mode. AG
mode uses NPIV to connect the devices in the chassis to the network instead of native fabric
mode, which operates as a standard switch, requires its own fabric domain, and requires a
copy of the name server and configuration databases. In AG mode, the embedded switch
does not use any of these items.

Embedded switches can support trunking, which usually requires an optional license.
Trunking allows transparent failover and failback within the trunk group. Trunked links are
more efficient and can distribute I/O more evenly across all the links in the trunk group.

Run embedded switches in AG mode. For chassis with high throughput or high availability
goals, use trunking.

For more information, see Brocade Access Gateway Administrator’s Guide, found at:
http://my.brocade.com

44 Fabric Resiliency Best Practices

CPU utilization
When MAPS is enabled, one of the default rules is the Chassis_CPU rule, which generates a
RASlog message and sends an SNMP or email alert when the switch CPU utilization exceeds
the threshold, which is typically 80%. For more details, see Example 22.

Example 22 The default Chassis CPU MAPS rule

SANA_DCX2:FID16:dlutz> mapsrule --show Chassis_CPU
Rule Data:
RuleName: Chassis_CPU
Condition: CHASSIS(CPU/None>=80)
Actions: raslog,snmp,email
Associated Policies:

It is common for switch CPU utilization to exceed the 80% threshold and reach 99% utilization
while performing tasks such as gathering supportsave data or doing SFP statistics polling.
This results in MAPS messages being generated.

Figure 24 shows the MAPS 1003 messages that are created when the CPU utilization
threshold is exceeded.

2016/06/19-02:09:50, [MAPS-1003], 4924, SLOT 6 | FID 128, WARNING, SW, Chassis,

Condition=CHASSIS(CPU>=80.00), Current Value:[CPU,100.00 %],
RuleName=Chassis_CPU, Dashboard Category=Switch Resource .
2016/06/19-02:13:50, [MAPS-1003], 4926, SLOT 6 | FID 128, WARNING, SW, Chassis,
Condition=CHASSIS(CPU>=80.00), Current Value:[CPU,100.00 %],
RuleName=Chassis_CPU, Dashboard Category=Switch Resource .
2016/06/19-02:15:50, [MAPS-1003], 4927, SLOT 6 | FID 128, WARNING, SW, Chassis,
Condition=CHASSIS(CPU>=80.00), Current Value:[CPU,96.00 %],
RuleName=Chassis_CPU, Dashboard Category=Switch Resource .
2016/06/19-02:17:50, [MAPS-1003], 4929, SLOT 6 | FID 128, WARNING, SW, Chassis,
Condition=CHASSIS(CPU>=80.00), Current Value:[CPU,92.00 %],
RuleName=Chassis_CPU, Dashboard Category=Switch Resource .
Figure 24 Example of MAPS CPU utilization messages

High CPU utilization is not a problem if the tasks that use the CPU release the CPU when
high priority requests, such as port logins or name server queries, occur. If the high CPU
utilization occurs only for short periods, it is typically not a problem. The MAPS high CPU
utilization messages do not indicate whether the duration is a short or a long duration.

A good way to identify the actual CPU utilization is to use the real-time or historical CPU
usage charts in Network Advisor.

45
Figure 25 shows a picture of the CPU utilization widget.

Figure 25 Network Advisor CPU Utilization by Time widget

This example is of a switch that does not have a CPU utilization issue. The chart shows a
CPU utilization spike, which would have resulted in creating some MAPS-1003 messages, but
because the spike is a single short duration event, it is not considered to be an issue.

The most common cause of high CPU utilization is due to external management applications
making requests by using the Ethernet management port. Having more than one Network
Advisor managing the switches or performance gathering products such as IBM Spectrum™
Control directly probing the switches are examples of applications that can cause high CPU
utilization.

Only one Network Advisor application should be managing the switches, and performance
applications such as IBM Spectrum Control™ should get their data from Network Advisor (if
supported). If the performance application being used must get its data from the switch, then
only one instance of the application should be getting its data from the switch.

Frame Viewer
Frames that are discarded due to hold-time timeout are sent to the CPU for processing.
During subsequent CPU processing, information about the frame, such as SID, DID, and
transmit port number, is retrieved and logged. This information is maintained for a certain
fixed number of frames.

Frame Viewer captures only FC frames that are dropped due to a timeout that is received on
an Edge ASIC (an ASIC with front-end (FE) ports). If the frame is dropped due to any other
reason, it is not captured by Frame Viewer. If the frame is dropped due to timeout on an Rx
buffer on a Core ASIC, the frame is not captured by Frame Viewer. Timeout is defined as a
frame that lives in an Rx buffer for longer than the HT default of 500 ms or the EHT value
custom setting.

46 Fabric Resiliency Best Practices

Figure 23 shows the framelog command to show C3 frame Tx timeouts.

Example 23 The framelog command output

framelog --show -n 1200
=========================================================================
Sun Oct 16 23:49:07 EDT 2016
=========================================================================
Log TX RX
timestamp port port SID DID SFID DFID Type Count
=========================================================================
Sep 18 05:56:05 2/22 -1/-1 0x14a040 0x0a4e00 128 128 timeout 20
Sep 18 05:56:04 9/46 -1/-1 0x14a040 0x0a4e00 128 128 timeout 20
Sep 18 05:39:08 -1/-1 3/16 0x14a040 0x0a4e00 128 128 timeout 2
Sep 18 05:39:08 9/46 -1/-1 0x14a040 0x0a4e00 128 128 timeout 20
Sep 04 05:21:56 9/46 -1/-1 0x14a040 0x0a4e00 128 128 timeout 20
Sep 04 05:21:55 -1/-1 1/1 0x140100 0x0ae940 128 128 timeout 8
Aug 03 12:00:44 2/22 -1/-1 0x14a940 0x0a1d03 128 128 timeout 9
Aug 03 12:00:44 2/22 -1/-1 0x149a40 0x0a3900 128 128 timeout 11
Aug 02 04:47:23 -1/-1 4/2 0x143200 0x0a7100 128 128 timeout 8
Aug 02 04:47:23 2/22 -1/-1 0x14acc0 0x0a2d00 128 128 timeout 2
Aug 02 04:47:23 2/22 -1/-1 0x14a940 0x0a1d03 128 128 timeout 17
Aug 02 04:47:23 2/22 -1/-1 0x142000 0x0ad1c0 128 128 timeout 1

Forward Error Correction

Brocade Gen 5 (16 Gbps) platforms support Forward Error Correction (FEC) that
automatically corrects bit errors. This function enhances the link reliability and improves
resiliency with the presence of marginal media. FEC is preferred between all supported
devices. FEC is mandatory on Gen 6 link (32 Gbps) speed.

FEC on Gen 5 can correct up to 11-bit errors in every 2112-bit transmission in a 10 Gbps/16
Gbps data stream in both frames and primitives. FEC is enabled by default on the back-end
(BE) links of Condor 3 ASIC-based switches and blades and minimizes the loss of credits on
BE links. FEC is also enabled by default on FE links when connected to another FEC-capable
device. FEC on Gen 6 uses a more robust coding algorithm that corrects up to seven 10-bit
streams and detects up to fourteen 10-bit streams, without the requirement that the errors be
in a burst. FEC is mandatory on Gen 6 platforms for 32 Gbps speed to ensure that the
bit-error rate stays within the standard requirement. Condor 4 ASIC automatically turns on
FEC when a port operates at 32 Gbps speed and cannot be disabled.

Enable FEC on 10 Gbps/16 Gbps connections when both ends of the link support it.

47
Authors
This paper was produced by a team of specialists from around the world, working at the IBM
International Technical Support Organization. The content is based on Brocade
documentation and is presented in a form that specifically identifies IBM preferred practices.

Chad Collie
IBM Systems

Michael Hrencecin
IBM Systems

David Lutz
IBM GTS

Ian MacQuarrie
IBM Systems

Shawn Wright
IBM Systems

This project was managed by:

Jon Tate
IBM ITSO

Thanks also go to:

Serge Monney
IBM GTS

Special thanks to Brocade for their support of this paper in terms of equipment and support in
many areas, and to the following people at Brocade:

Silviano Gaona, Tom Chen, Brian Steffler

Now you can become a published author, too!

Here's an opportunity to spotlight your skills, grow your career, and become a published
author—all at the same time! Join an ITSO residency project and help write a book in your
area of expertise, while honing your experience using leading-edge technologies. Your efforts
will help to increase product acceptance and customer satisfaction, as you expand your
network of technical contacts and relationships. Residencies run from two to six weeks in
length, and you can participate either in person or as a remote resident working from your
home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

48 Fabric Resiliency Best Practices

Stay connected to IBM Redbooks
򐂰 Find us on Facebook:
http://www.facebook.com/IBMRedbooks
򐂰 Follow us on Twitter:
http://twitter.com/ibmredbooks
򐂰 Look for us on LinkedIn:
http://www.linkedin.com/groups?home=&gid=2130806
򐂰 Explore new IBM Redbooks® publications, residencies, and workshops with the IBM
Redbooks weekly newsletter:
https://www.redbooks.ibm.com/Redbooks.nsf/subscribe?OpenForm
򐂰 Stay current on recent Redbooks publications with RSS Feeds:
http://www.redbooks.ibm.com/rss.html

49
50 Fabric Resiliency Best Practices
Notices

This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US

INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION “AS IS”

WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.

The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.

Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.

Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.

This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.

Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
C3® IBM Spectrum Control™ Redbooks (logo) ®
FICON® IBM z13® z13™
IBM® Redbooks®
IBM Spectrum™ Redpaper™

The following terms are trademarks of other companies:

Other company, product, or service names may be trademarks or service marks of others.

52 Fabric Resiliency Best Practices

Back cover

REDP-4722-03

ISBN 073845589X

Printed in U.S.A.

®
ibm.com/redbooks

KB How To Troubleshoot A Slow Drain Device in A SAN
No ratings yet
KB How To Troubleshoot A Slow Drain Device in A SAN
4 pages
5 Minute Troubleshooting
No ratings yet
5 Minute Troubleshooting
9 pages
Cisco Slow Drain PDF
No ratings yet
Cisco Slow Drain PDF
61 pages
SAN Workshop for IT Professionals
No ratings yet
SAN Workshop for IT Professionals
14 pages
Slow Drain Detection in Cisco
No ratings yet
Slow Drain Detection in Cisco
61 pages
Brocade BCFA 250 Preparing BCFA Certified Professionals For The 16Gbps BCA Exam
No ratings yet
Brocade BCFA 250 Preparing BCFA Certified Professionals For The 16Gbps BCA Exam
130 pages
SAN101 Brocade
No ratings yet
SAN101 Brocade
46 pages
Brocade SAN Zoning Guide
No ratings yet
Brocade SAN Zoning Guide
34 pages
05 Business Continuity RemoteReplication
No ratings yet
05 Business Continuity RemoteReplication
44 pages
Brocade Fibre Channel Buffer Credits and Frame Management WP
No ratings yet
Brocade Fibre Channel Buffer Credits and Frame Management WP
20 pages
Brocade Commands: Hemant Hemant
No ratings yet
Brocade Commands: Hemant Hemant
9 pages
SAN Switch Cheat Sheet
No ratings yet
SAN Switch Cheat Sheet
13 pages
h14556 Dell Emc Powermax Vmax All Flash SRDF Metro Overview Best Practices
No ratings yet
h14556 Dell Emc Powermax Vmax All Flash SRDF Metro Overview Best Practices
120 pages
VMAX3 Local Replication Fundamentals - SRG
No ratings yet
VMAX3 Local Replication Fundamentals - SRG
41 pages
Non-Disruptive Migration VMAX All Flash - VMAX3 PDF
No ratings yet
Non-Disruptive Migration VMAX All Flash - VMAX3 PDF
61 pages
Isilonfundamentals m1 4 Final Studentguide
100% (1)
Isilonfundamentals m1 4 Final Studentguide
55 pages
Ig, PG and SG
No ratings yet
Ig, PG and SG
6 pages
Nvme™ and Nvme-Of™ in Enterprise Arrays
No ratings yet
Nvme™ and Nvme-Of™ in Enterprise Arrays
33 pages
Configure E-ports on Brocade FC Switch
No ratings yet
Configure E-ports on Brocade FC Switch
5 pages
tr-3816 Oracle and NetApp DR Total Solution
No ratings yet
tr-3816 Oracle and NetApp DR Total Solution
58 pages
ECS - ECS Upgrade Procedures-ECS 2.0.x.x To 2.1.x.x Operating System Online Update
No ratings yet
ECS - ECS Upgrade Procedures-ECS 2.0.x.x To 2.1.x.x Operating System Online Update
22 pages
SRDF Interfamily Connectivity Information
No ratings yet
SRDF Interfamily Connectivity Information
15 pages
SCSI iSCSI RAID SAN FC
No ratings yet
SCSI iSCSI RAID SAN FC
74 pages
Cisco Case Study Emc SRDF Fcip
No ratings yet
Cisco Case Study Emc SRDF Fcip
12 pages
EMC VNX Data
No ratings yet
EMC VNX Data
21 pages
Brocade SAN Troubleshooting (SAN-TS 300-ILT) Training
No ratings yet
Brocade SAN Troubleshooting (SAN-TS 300-ILT) Training
2 pages
EMC VMAX Gatekeepers
No ratings yet
EMC VMAX Gatekeepers
5 pages
(VMAX) What Is New With VMAX?: Print Email
No ratings yet
(VMAX) What Is New With VMAX?: Print Email
2 pages
Resume Rashid Adiyarambath 29 March 2019
No ratings yet
Resume Rashid Adiyarambath 29 March 2019
5 pages
Centera Foundation Student Resource Guide
No ratings yet
Centera Foundation Student Resource Guide
41 pages
SRDF Architecture PDF
No ratings yet
SRDF Architecture PDF
12 pages
Introduction To Fibre Channel Over Ethernet (Fcoe) : A Detailed Review
No ratings yet
Introduction To Fibre Channel Over Ethernet (Fcoe) : A Detailed Review
10 pages
HPE 3PAR RCFC Setup Guide
No ratings yet
HPE 3PAR RCFC Setup Guide
3 pages
SRDF Solutions
No ratings yet
SRDF Solutions
192 pages
DD Final
No ratings yet
DD Final
854 pages
BRM Support Day - March 2010-Vmware
No ratings yet
BRM Support Day - March 2010-Vmware
120 pages
Fibre Channel SAN Essentials
No ratings yet
Fibre Channel SAN Essentials
30 pages
Fibre Channel For Connectrix - SRG
No ratings yet
Fibre Channel For Connectrix - SRG
59 pages
Dell Unity - Software Upgrade Procedures-Performing Software Upgrades
No ratings yet
Dell Unity - Software Upgrade Procedures-Performing Software Upgrades
28 pages
DCX 8510 4 HardwareManual
No ratings yet
DCX 8510 4 HardwareManual
156 pages
Chapter 14
No ratings yet
Chapter 14
30 pages
An "Apples To Apples" Comparison: June, 2016
No ratings yet
An "Apples To Apples" Comparison: June, 2016
66 pages
Unisphere For VMAX Product Guide V1.5.1
67% (3)
Unisphere For VMAX Product Guide V1.5.1
534 pages
NFS File Migration To Isilon PDF
No ratings yet
NFS File Migration To Isilon PDF
46 pages
HP 3par Storeserv Storage: The Only Storage Architecture You Will Ever Need
No ratings yet
HP 3par Storeserv Storage: The Only Storage Architecture You Will Ever Need
16 pages
Cisco MDS SAN Zoning Best Practices - Cisco Community
No ratings yet
Cisco MDS SAN Zoning Best Practices - Cisco Community
11 pages
Smart Zoning for XtremIO & Cisco
No ratings yet
Smart Zoning for XtremIO & Cisco
7 pages
FC Fs 2 PDF
No ratings yet
FC Fs 2 PDF
498 pages
3PAR Remote Copy With VMware Site Recovery Manager
No ratings yet
3PAR Remote Copy With VMware Site Recovery Manager
15 pages
Brocade SAN Switches & Directors: Business Unit or Product Name
No ratings yet
Brocade SAN Switches & Directors: Business Unit or Product Name
34 pages
Brocade Switch Command Guide
No ratings yet
Brocade Switch Command Guide
5 pages
Fabric Resiliency Best Practices
No ratings yet
Fabric Resiliency Best Practices
76 pages
Fabric Best Practise
No ratings yet
Fabric Best Practise
86 pages
Fabric Resiliency Best Practices
No ratings yet
Fabric Resiliency Best Practices
24 pages
Brocade Fabric Notification Fundamentals StudentGuide FN120 1120
No ratings yet
Brocade Fabric Notification Fundamentals StudentGuide FN120 1120
29 pages
FCIA Fibre Channel Performance Congestion Slowdrain and Over Utilization Final
No ratings yet
FCIA Fibre Channel Performance Congestion Slowdrain and Over Utilization Final
30 pages
Oversubscription and Density Best Practices
No ratings yet
Oversubscription and Density Best Practices
10 pages
Student Guide FC 120 FC Basics L
No ratings yet
Student Guide FC 120 FC Basics L
66 pages
Cisco CRS-1 Fabric Troubleshooting
No ratings yet
Cisco CRS-1 Fabric Troubleshooting
16 pages
Building and Scaling BROCADE SAN Fabrics: Design and Best: Practices Guide
No ratings yet
Building and Scaling BROCADE SAN Fabrics: Design and Best: Practices Guide
31 pages
Logical Config Concepts Ds 8000
100% (1)
Logical Config Concepts Ds 8000
106 pages
Implementing The IBM Storwize V7000 and IBM Spectrum Virtualize V7.8
No ratings yet
Implementing The IBM Storwize V7000 and IBM Spectrum Virtualize V7.8
744 pages
Basic Concepts of Netapp Ontap 9
No ratings yet
Basic Concepts of Netapp Ontap 9
259 pages
Unisphere For VMAX Implementation and Management
No ratings yet
Unisphere For VMAX Implementation and Management
75 pages
Unisphere For VMAX Implementation and Management
No ratings yet
Unisphere For VMAX Implementation and Management
75 pages
Transport Layer: A Note On The Use of These PPT Slides
No ratings yet
Transport Layer: A Note On The Use of These PPT Slides
109 pages
Bibliography & Abbreviations
No ratings yet
Bibliography & Abbreviations
5 pages
Computer Networking: Network Design & Architecture
No ratings yet
Computer Networking: Network Design & Architecture
27 pages
Common Transmission Resource Management On MBSC (SRAN7.0 - 01)
No ratings yet
Common Transmission Resource Management On MBSC (SRAN7.0 - 01)
23 pages
(Notes Index) : Unit - I Fundamentals & Link Layer
No ratings yet
(Notes Index) : Unit - I Fundamentals & Link Layer
8 pages
Network Layer Design & Routing
No ratings yet
Network Layer Design & Routing
101 pages
M.SC (Computer Science), GGCA, Syllabus-New
No ratings yet
M.SC (Computer Science), GGCA, Syllabus-New
32 pages
Computer Networks Exam Solutions
No ratings yet
Computer Networks Exam Solutions
2 pages
Game Theory in Communication Networks
No ratings yet
Game Theory in Communication Networks
6 pages
Network Congestion Solutions
No ratings yet
Network Congestion Solutions
9 pages
Computer Networks and Security (18CS52) Notes
67% (9)
Computer Networks and Security (18CS52) Notes
218 pages
Lab Manual
No ratings yet
Lab Manual
42 pages
NCIIT 12 Proceedings
100% (1)
NCIIT 12 Proceedings
86 pages
100 COMPUTER NETWORKS Multiple Choice Questions and Answer
No ratings yet
100 COMPUTER NETWORKS Multiple Choice Questions and Answer
17 pages
Unit-V Ece 15
No ratings yet
Unit-V Ece 15
82 pages
CNIP
100% (2)
CNIP
39 pages
BSC Timer Configuration Guide
100% (4)
BSC Timer Configuration Guide
32 pages
Edgecore AI Server 20250110
No ratings yet
Edgecore AI Server 20250110
25 pages
GATE Computer Networks Quiz
No ratings yet
GATE Computer Networks Quiz
54 pages
CN Lab 6 Sem
No ratings yet
CN Lab 6 Sem
70 pages
Teletraffic Theory: Traffic Modeling
No ratings yet
Teletraffic Theory: Traffic Modeling
35 pages
MCA Program Regulations Guide
No ratings yet
MCA Program Regulations Guide
32 pages
6 Computer Engineer 6 Level 076-2-12 Final PDF
No ratings yet
6 Computer Engineer 6 Level 076-2-12 Final PDF
9 pages
The General Comparison Between AIMD and AIAD Congestion Control Algorithms
No ratings yet
The General Comparison Between AIMD and AIAD Congestion Control Algorithms
8 pages
IPPB Mock Test - 01
No ratings yet
IPPB Mock Test - 01
10 pages
Network Congestion & QoS Guide
No ratings yet
Network Congestion & QoS Guide
27 pages
zOVN SIGCOMM2013
No ratings yet
zOVN SIGCOMM2013
13 pages
Computer Science Projects - Final Year Projects, NCCT
No ratings yet
Computer Science Projects - Final Year Projects, NCCT
23 pages
VF Germanychapter 9: Technical Requirements "Network Performance & Optimisation Aspects"
No ratings yet
VF Germanychapter 9: Technical Requirements "Network Performance & Optimisation Aspects"
8 pages
Sagem - Link F-H Microwave Installation and Operation Manual (2008)
50% (2)
Sagem - Link F-H Microwave Installation and Operation Manual (2008)
134 pages

Brocade Resiliency

Uploaded by

Brocade Resiliency

Uploaded by

Front cover

Fabric Resiliency Best Practices

© Copyright IBM Corp. 2011, 2017. All rights reserved. ibm.com/redbooks 3

Factors affecting fabric resiliency

4 Fabric Resiliency Best Practices

Signatures of these types of failures include:

Figure 1 Device latency example

6 Fabric Resiliency Best Practices

Moderate device latencies

Severe device latencies

8 Fabric Resiliency Best Practices

Designing resiliency into the fabric

For more information, see “Preferred implementation” on page 26.

Inter-switch link trunking

10 Fabric Resiliency Best Practices

Credit recovery tools

12 Fabric Resiliency Best Practices

Enabling credit recovery tools before Fabric OS V7.2

Example 1 shows the use of the bottleneckmon --cfgcredittools command to enable

Example 1 The bottleneckmon --cfgcredittools command example

Enabling credit recovery tools in Fabric OS V7.2 and later

Example 2 The creditRecovMode command

Dynamic port naming

Example 3 The switchshow command showing port names

103 2 39 20:67:00:05:1e:d0:b5:05 SANC_DCX1.E_PORT.103

Maintaining an optimal FC SAN environment

14 Fabric Resiliency Best Practices

Enabling and disabling Bottleneck Detection

Displaying the history of bottlenecks on a port

Example 4 Results of the bottleneckmon --show command

Bottleneck alert example

Example 5 Example Bottleneck Detection alert on an F_Port

Fabric Performance Impact

16 Fabric Resiliency Best Practices

Monitoring Alerting Policy Suite

Figure 3 Network Advisor Fabric Vision MAPS menu

Custom policy for high availability fabrics

Table 1 Custom MAPS port metrics thresholds

CRC hour 240

18 Fabric Resiliency Best Practices

Table 2 Custom MAPS switch metric thresholds

Configuring Port Fencing

Figure 4 MAPS edit rule

Figure 5 on page 21 shows the activate MAPS policy actions.

20 Fabric Resiliency Best Practices

Network Advisor dashboards

Figure 6 Network Advisor Fabric Health dashboard

22 Fabric Resiliency Best Practices

Figure 8 Network Advisor Dashboard ITW widget

Figure 9 Network Advisor Dashboard ITW graph

Table 3 Suggested Product Status and Ports dashboards widgets

Fabric Health SAN Status Top Product CPU

SAN Inventory Top Product Memory

Out of Range Violations

Port Health Violations

Ports Port Health Violations Top Port C3 Discards

Top Port C3 Discards RX TO

Top Port C3 Discards TX TO

Top Port CRC

Top Port ITW

Top Port Link Failures

Top Port Link Resets

Top Port Too Long Errors

Top Port Utilization Percent

24 Fabric Resiliency Best Practices

Table 4 Optional Host, Storage, and ISL dashboard widgets

Initiator Bottleneck Ports Top Initiator Ports CRC Errors

Top Initiator Ports Encode Error

Top Initiator Port Link Failures

Top Initiator Port Sync Loss

Top Initiator Port Link Resets

Top Initiator Port ITWs

Target Bottleneck Ports Top Target Ports CRC Errors

Top Target Ports Encode Error

Top Target Port Link Failures

Top Target Port Sync Loss