Brocade Resiliency
Chad Collie
Michael Hrencecin
David Lutz
Ian MacQuarrie
Shawn Wright
Redpaper
Fabric Resiliency Best Practices
This IBM® Redpaper™ publication describes preferred practices for deploying and using
advanced Brocade Fabric Operating System (FOS) features to identify, monitor, and protect
Fibre Channel (FC) SANs from problematic devices and media behavior.
FOS: This paper focuses on the FOS command options and features that are available in versions 7.2 through 7.4, but also covers other features such as bottleneck detection, port fencing, and Fabric Watch.
This document concentrates specifically on Brocade Fabric Vision features (and related
capabilities) that help provide optimum fabric resiliency. Although some Fabric Vision features
have been available since FOS V7.0, most of the features are available since FOS V7.2.
For more information about the features that are described in this publication, see the product
documents that are appropriate for your FOS release. They are available to registered users
at:
http://my.brocade.com
SAN Design and Best Practices
Fabric OS Administrator’s Guide
Fabric OS Command Reference Manual
Fabric OS Monitoring and Alerting Policy Suites Configuration Guide
Fabric OS Flow Vision Configuration Guide
Brocade Network Advisor Administrator’s Guide
Failures on ISLs or E_Ports can have an even greater impact. Many flows (host and target
pairs) can simultaneously traverse a single E_Port. In large fabrics, this can be hundreds or
thousands of flows. If there is a media failure involving one of these links, it is possible to
disrupt some or all of the flows that use the path. Severe cases of faulty media, such as a
disconnected cable, can result in a complete failure of the media, which effectively brings a
port offline. This situation is typically easy to detect and identify. When it occurs on an F_Port,
the impact is specific to flows involving the F_Port. E_Ports are typically redundant, so severe
failures on E_Ports typically only result in a minor drop in bandwidth because the fabric
automatically uses redundant paths. Also, error reporting that is built into FOS readily
identifies the failed link and port, allowing for simple corrective action and repair. With
moderate cases of faulty media, failures occur, but the port can remain online or transition
between online and offline. This situation can cause repeated errors, which can occur
indefinitely or until the media fails completely. When these types of failures occur on E_Ports, the result can be devastating: the repeated errors affect many flows, which can significantly impact applications for prolonged periods.
Misbehaving devices
Another common class of abnormal behavior originates from high-latency end devices (host
or storage). A high-latency end device is one that does not respond as quickly as expected
and thus causes the fabric to hold frames for excessive periods. This situation can result in
application performance degradation or, in extreme cases, I/O failure. Common examples of
moderate device latency include disk arrays that are overloaded and hosts that cannot
process data as fast as requested. Misbehaving hosts, for example, become more common
as hardware ages. Bad host behavior is usually caused by defective host bus adapter (HBA)
hardware, bugs in the HBA firmware, and problems with HBA drivers. Storage ports can
produce the same symptoms due to defective interface hardware or firmware issues. Some
arrays deliberately reset their fabric ports if they are not receiving host responses within their
specified timeout periods. Severe latencies are caused by badly misbehaving devices that
stop receiving, accepting, or acknowledging frames for excessive periods. However, with the
proper knowledge and capabilities, the fabric can often identify and, in some cases, mitigate
or protect against the effects of these misbehaving components to provide better fabric
resiliency.
Congestion
Congestion occurs when the traffic being carried on a link exceeds its capacity. Sources of
congestion might be links, hosts, or storage responding more slowly than expected.
Congestion is typically due to either fabric latencies or insufficient link bandwidth capacity. As FC link bandwidth has increased from 1 Gbps to 16 Gbps, instances of insufficient link bandwidth
capacities have radically decreased. Latencies, particularly device latencies, are the major
source of congestion in today’s fabrics due to their inability to promptly return buffer credits to
the switch.
Device-based latencies
A device experiencing latency responds more slowly than expected. The device does not
return buffer credits (through R_RDY primitives) to the transmitting switch fast enough to
support the offered load, even though the offered load is less than the maximum physical
capacity of the link that is connected to the device.
Figure 1 illustrates the condition where a buffer backup on ingress port 6 on B1 causes
congestion upstream on S1, port 3. When all available credits are exhausted, the switch port
that is connected to the device must hold additional outbound frames until a buffer credit is
returned by the device.
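The credit accounting described above can be sketched in a few lines. This is an illustrative model only, not Brocade code; the class and method names are invented for the example. It shows how a port that has exhausted its credits must hold outbound frames until the device returns an R_RDY.

```python
# Sketch (not Brocade code): minimal buffer-to-buffer credit accounting.
# A transmitter may send a frame only while it holds at least one credit;
# each R_RDY returned by the receiver restores one credit. A slow device
# that delays R_RDYs forces the switch port to hold frames.

class TxPort:
    def __init__(self, credits):
        self.credits = credits      # credits granted by the attached device
        self.held = []              # frames waiting for a credit

    def send(self, frame):
        """Transmit immediately if a credit is available, else hold the frame."""
        if self.credits > 0:
            self.credits -= 1
            return True
        self.held.append(frame)
        return False

    def r_rdy(self):
        """Device returned a credit; drain one held frame if any."""
        self.credits += 1
        if self.held:
            self.held.pop(0)
            self.credits -= 1

port = TxPort(credits=2)
results = [port.send(f"frame{i}") for i in range(4)]
print(results)          # first two frames go out, the rest are held
print(len(port.held))   # 2 frames held awaiting credits
port.r_rdy()            # slow device finally returns one credit
print(len(port.held))   # 1
```

The held-frame queue is exactly the "high buffer occupancy" that propagates back pressure upstream.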
When a device does not respond in a timely fashion, the transmitting switch is forced to hold
frames for longer periods, resulting in high buffer occupancy, which results in the switch
lowering the rate at which it returns buffer credits to other transmitting switches. This effect
propagates through switches (and potentially multiple switches, when devices attempt to send
frames to devices that are attached to the switch with the high-latency device) and ultimately
affects the fabric.
Figure 2 on page 7 shows how latency on a switch can propagate through the fabric.
Note: The impact to the fabric (and other traffic flows) varies based on the severity of the
latency that is exhibited by the device. The longer the delay that is caused by the device in
returning credits to the switch, the more severe the problem.
The effect of moderate device latencies on host applications might still be profound, based on
the average disk service times that are expected by the application. Mission-critical
applications that expect average disk service times of, for example, 10 ms, are severely
affected by storage latencies in excess of the expected service times. Moderate device
latencies have traditionally been difficult to detect in the fabric. Advanced monitoring
capabilities that are implemented in Brocade ASICs and FOS have made these moderate
device latencies much easier to detect by providing the following information and alerts:
Switches in the fabric generate Fabric Performance Impact (FPI) Alerts if FPI is enabled
on the affected ports.
Elevated tim_txcrd_z counts on the affected F_Port, that is, the F_Port where the affected
device is connected.
Potentially elevated tim_txcrd_z counts on all E_Ports carrying the flows to and from the affected F_Port/device.
Note: tim_txcrd_z is defined as the number of times that the port was polled and that the
port was unable to transmit frames because the transmit Buffer-to-Buffer Credit (BBC) was
zero. The purpose of this statistic is to detect congestion or a device that is affected by
latency. This parameter is sampled at intervals of 2.5 microseconds, and the counter is
incremented if the condition is true. Each sample represents 2.5 microseconds of time with
zero Tx BBC. tim_txcrd_z counts are not an absolute indication of significant congestion or
latencies and are just one of the factors in determining whether real latencies or fabric
congestion are present. Some level of congestion is to be expected in a large production fabric and is reflected in tim_txcrd_z counts. The Brocade FPI feature was introduced to
remove uncertainty around identifying congestion in a fabric.
Note: tim_latency_vc is a Brocade Gen 5 Condor3 ASIC counter that measures the
latency time that a frame incurs in the transmit queue of its corresponding VC. The
purpose of this statistic is to directly measure the frame transmit latency of a switch port.
Each unit of the counter value represents 250 nanoseconds of latency. The Brocade FPI
feature uses this counter to enhance the detection of devices introducing latency into the
fabric.
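The unit conversions in the two notes above are easy to get wrong, so a short sketch may help. The function names are illustrative only; the constants (2.5 µs per tim_txcrd_z sample, 250 ns per tim_latency_vc unit) come from the definitions above.

```python
# Sketch: converting the raw ASIC counters described above into time.
# tim_txcrd_z is sampled every 2.5 microseconds; each count is one sample
# observed with zero Tx buffer credits. tim_latency_vc counts units of
# 250 nanoseconds of transmit-queue latency.

SAMPLE_US = 2.5          # tim_txcrd_z sampling interval, microseconds
LATENCY_UNIT_NS = 250    # tim_latency_vc unit, nanoseconds

def zero_credit_percent(tim_txcrd_z_delta, interval_seconds):
    """Percent of the polling interval spent at zero transmit credit."""
    zero_credit_s = tim_txcrd_z_delta * SAMPLE_US / 1_000_000
    return 100.0 * zero_credit_s / interval_seconds

def vc_latency_us(tim_latency_vc):
    """Transmit-queue latency represented by a tim_latency_vc value, in microseconds."""
    return tim_latency_vc * LATENCY_UNIT_NS / 1000

# 120,000 zero-credit samples over a 1-second interval:
print(zero_credit_percent(120_000, 1.0))   # 30.0 -> no credit 30% of the time
print(vc_latency_us(4000))                 # 1000.0 -> 1 ms of queuing latency
```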
Because the effect of device latencies often spreads through the fabric, frames can be
dropped due to timeouts, not just on the F_Port to which the misbehaving device is
connected, but also on E_Ports carrying traffic to the F_Port. Dropped frames typically cause
I/O errors that result in a host retry, which can result in significant decreases in application
performance. The implications of this behavior are compounded and exacerbated by the fact
that frame drops on the affected F_Port (device) result not only in I/O failures to the
misbehaving device (which are expected), but also on E_Ports, which might cause I/O failures
for unrelated traffic flows involving other hosts (and typically are not expected).
Latencies on ISLs
Latencies on ISLs are usually the result of back pressure from latencies elsewhere in the
fabric. The cumulative effect of many individual device latencies can result in slowing the link.
The link itself might be producing latencies, if it is a long-distance link with distance delays or
there are too many flows that use the same ISL. Although each device might not appear to be
a problem, the presence of too many flows with some level of latency across a single ISL or
trunked ISL might become a problem. Latency on an ISL can ripple through other switches in
the fabric and affect unrelated flows.
FOS can provide alerts and information indicating possible ISL latencies in the fabric, through
one or more of the following items:
Switches in the fabric generate FPI Alerts if FPI is enabled on the affected ports.
C3 transmit discards (er_tx_c3_timeout) on the device E_Port or EX_Port carrying the
flows to and from the affected F_Port or device.
Credit loss
Buffer credits are a part of the FC flow control and the mechanism that Fibre Channel
connections use to track the number of frames that are sent to the receiving port. Every time
a frame is sent, the credit count is reduced by one. When the sending port runs out of credits,
it is not allowed to send any more frames to the receiving port. When the receiving port
successfully receives a frame, it tells the sending port that it has the frame by returning an
r_rdy primitive. When the sending port receives an r_rdy, it increments the credit count. Credit
loss occurs when either the receiving port does not recognize a frame (usually due to bit
errors), so it does not return an r_rdy, or the sending port does not recognize the r_rdy
(usually due to link synchronization issues).
FC links are never perfect, so the occasional credit loss can occur, but it becomes an issue
only when all available credits are lost. Credit loss can occur on both external and internal FC
links. When credit loss occurs on external links, it is usually caused by faulty media, and credit loss on internal ports is usually associated with jitter, which in most cases is adjusted for by
the internal adapter firmware. The switch automatically tries to recover from a complete loss
of credit on external links after 2 seconds by issuing a link reset. For the switch to perform
automatic recovery from internal link credit loss, the Credit Loss Detection and Recovery
feature must be enabled.
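The stuck-at-zero detection and 2-second link reset described above can be sketched as follows. This is an illustrative model of the behavior, not Brocade's implementation; the class and timing structure are invented for the example.

```python
# Sketch: detecting complete credit loss on a link and recovering with a
# link reset after 2 seconds, per the behavior described above.

LINK_RESET_AFTER_S = 2.0

class Link:
    def __init__(self, max_credits):
        self.max_credits = max_credits
        self.credits = max_credits
        self.zero_since = None   # timestamp when credits first hit zero
        self.link_resets = 0

    def poll(self, now):
        """Called periodically; issue a link reset if stuck at zero credits."""
        if self.credits > 0:
            self.zero_since = None
        elif self.zero_since is None:
            self.zero_since = now
        elif now - self.zero_since >= LINK_RESET_AFTER_S:
            self.link_resets += 1
            self.credits = self.max_credits   # link reset restores credits
            self.zero_since = None

link = Link(max_credits=8)
link.credits = 0            # all credits lost (e.g., corrupted R_RDYs)
for t in [0.0, 1.0, 2.0, 3.0]:
    link.poll(t)
print(link.link_resets)     # 1 -> one recovery link reset issued
print(link.credits)         # 8 -> credits restored
```

Occasional single-credit loss leaves the port running at reduced capacity; only the stuck-at-zero case triggers this recovery.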
General topics that are related to the architecture, topology, and capacity planning for a SAN
are described in SAN Design and Best Practices, found at:
http://my.brocade.com
Trunking improves system reliability by maintaining in-order delivery of data and avoiding I/O
retries if one link within the trunk group fails.
Trunking provides excellent protection from credit loss on ISLs. If credit loss occurs on an ISL,
frames continue to flow by using the other link until the switch can detect the credit loss
(typically 2 seconds) and perform a link reset to recover the credits.
More IT environments are relying on server virtualization technologies that can share host
adapter connections. Specifically, N_Port ID Virtualization (NPIV) allows many clients
(servers, guests, or hosts) to use a single physical port on the SAN. Each of these
communications paths from server, virtual or otherwise, is a data flow that must be
considered when planning for how many interswitch links are needed. These virtualized
environments often lead to a situation where there are many data flows from the edge switches, potentially leading to frame-based congestion if there are not enough ISL or trunk
resources.
To avoid frame-based congestion in environments where there are many data flows between
switches, it is better to create several two-link trunks than one large trunk with multiple links.
For example, it is better to have two 2-link trunk groups than one 4-link trunk group.
Routing policies
The routing policy determines the route or path frames take when traversing the fabric. There
are three routing policies available:
The default exchange-based routing (EBR)
Port-based routing (PBR)
Device-based routing (DBR)
EBR is the preferred routing policy for FCP fabrics. Before 2013, cascaded FICON
configurations supported only static PBR across ISLs. In this case, the ISL (route) for a given
port was assigned statically based on a round-robin algorithm at fabric login (FLOGI) time.
PBR can result in some ISLs being overloaded. In mid-2013, IBM z Systems added support
for DBR, which spread the routes across ISLs based on a device ID hash value. With the z13
release in mid-2015, IBM added FICON Dynamic Routing (FIDR), which supports Brocade
EBR to improve load balancing for cascaded FICON across ISLs.
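The difference between the three policies is what each one keys its route selection on. The sketch below illustrates this with a simple hash; real switches use ASIC hash functions over FSPF equal-cost paths, and the function and ISL names here are invented for the example.

```python
# Sketch of what each routing policy keys on when choosing among
# equal-cost ISLs. Illustrative only, not the ASIC hash.

def pick_isl(policy, isls, src_port, src_id, dst_id, exchange_id):
    if policy == "PBR":      # port-based: static route per ingress port
        key = src_port
    elif policy == "DBR":    # device-based: keyed on the SID/DID pair
        key = hash((src_id, dst_id))
    elif policy == "EBR":    # exchange-based: keyed on SID/DID/OXID
        key = hash((src_id, dst_id, exchange_id))
    else:
        raise ValueError(policy)
    return isls[key % len(isls)]

isls = ["isl0", "isl1", "isl2", "isl3"]
# Under EBR, different exchanges between the same pair can use different ISLs:
ebr_routes = {pick_isl("EBR", isls, 5, 0x010500, 0x020400, oxid) for oxid in range(64)}
print(len(ebr_routes) > 1)   # True -> exchanges spread across ISLs
# Under DBR, every exchange between the same pair takes the same ISL:
dbr_routes = {pick_isl("DBR", isls, 5, 0x010500, 0x020400, oxid) for oxid in range(64)}
print(len(dbr_routes))       # 1
```

This is why EBR balances load best: each exchange is hashed independently, whereas PBR can leave some ISLs overloaded when busy ingress ports happen to map to the same route.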
The prerequisite z13 driver levels, adapter features, storage, and FOS levels to support FIDR
are included in the following white paper:
https://community.brocade.com/dtscp75322/attachments/dtscp75322/MainframeSolutions/186/1/FICON%20Dynamic%20Routing%20White%20Paper%202016-08.pdf
FICON cascaded configurations with z13 and all other appropriate prerequisites should use
EBR. All other FICON cascaded configurations should use DBR.
For more information about the FICON Dynamic Routing feature, see Get More Out of Your IT
Infrastructure with IBM z13 I/O Enhancements, REDP-5134, found at:
http://www.redbooks.ibm.com/abstracts/redp5134.html?Open
Note: For FICON, use DBR (if z/OS and z Systems support DBR) regardless of whether it is a FICON/FCP intermix. If z Systems does not support DBR, then PBR must be used, regardless of intermix.
Experience shows that when high latencies occur even on a single initiator or device in a
fabric, not only does the port that is attached to this initiator device see Class 3 frame
discards, but the resulting back pressure due to the lack of credit can build up in the fabric,
causing other flows that are not directly related to the high latency device to have their frames
discarded at ISLs.
Edge Hold Time (EHT) allows the default Hold Time (HT) value to be overridden for individual F_Ports on Gen 5 FC platforms, or for all ports on an individual ASIC on 8 Gbps platforms if any of the ports on that ASIC are operating as F_Ports.
Setting a lower EHT can be used to reduce the likelihood of this back pressure into the fabric
by assigning this lower HT value only for edge ports (initiators or targets). The lower EHT
value ensures that frames are dropped at the initiator or target port where the credit is lacking
before the higher default HT value that is used at the ISLs expires. This action can localize the
impact of a high latency port to just the single edge where the initiator or target is, preventing
the lack of credit from spreading into the fabric and impacting other unrelated flows.
Like HT, the EHT is configured for the entire switch, and is not configurable on individual ports
or ASICs. Whether the EHT or HT values are used on a port depends on the particular
platform and ASIC, and the type of port and also other ports that are on the same ASIC.
EHT is enabled by default in FOS V7.0 and later and there is no additional license that must
be configured.
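The reason a lower EHT localizes the damage can be shown with the two timer values. The sketch below is illustrative only; the function name is invented, and the 220 ms / 500 ms figures are the defaults described above.

```python
# Sketch: why a lower Edge Hold Time localizes drops. With EHT = 220 ms on
# F_Ports and the default HT = 500 ms on E_Ports, a frame stalled by a
# slow device times out at the edge before the ISL's timer ever expires.

DEFAULT_HT_MS = 500   # hold time on E_Ports (ISLs)
EHT_MS = 220          # default Edge Hold Time on F_Ports (FOS V7.0+)

def where_dropped(stall_ms):
    """Which port class drops a frame stalled for stall_ms milliseconds."""
    if stall_ms >= EHT_MS:
        return "F_Port (edge)"   # the edge timer always expires first
    if stall_ms >= DEFAULT_HT_MS:
        return "E_Port (ISL)"    # unreachable while EHT < HT
    return "delivered"

print(where_dropped(100))   # delivered
print(where_dropped(300))   # F_Port (edge) -> drop localized to the edge
```

Because EHT < HT, the E_Port branch can never fire for a frame stalled at the edge, which is exactly the protection the feature provides: ISLs keep their credits and unrelated flows are unaffected.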
Behavior
All Brocade Gen 5 platforms (16 Gbps) set the HT value on a per-port-type basis for ports on Gen 5 ASICs:
All F_Ports are programmed with the alternate EHT value.
All E_Ports are programmed with the default HT value (500 ms).
The same EHT value that is set for the switch is programmed into all F_Ports on that switch.
Different EHT values cannot be programmed on an individual port basis.
If 8 Gbps blades are installed into a Gen 5 platform (that is, an FC8-64 blade in a DCX 8510),
the same EHT value is programmed into all ports on the ASIC:
If any single port on an ASIC is an E_Port, the default HT value (500 ms) value is
programmed into the ASIC, and all ports (E_Ports and F_Ports) use this one value.
If all ports on an ASIC are F_Ports, the entire ASIC is programmed with the alternate EHT
value.
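The selection rules above can be condensed into two small functions. This is an illustrative restatement of the stated behavior, not switch code; the function names are invented for the example.

```python
# Sketch of the HT-selection rules above. Gen 5 ASICs choose per port type;
# an 8 Gbps ASIC in a Gen 5 chassis uses one value for the whole ASIC.

DEFAULT_HT_MS = 500

def gen5_hold_time(port_type, eht_ms):
    """Gen 5 ASIC: F_Ports get the EHT, E_Ports the default HT."""
    return eht_ms if port_type == "F" else DEFAULT_HT_MS

def gen4_asic_hold_time(port_types, eht_ms):
    """8 Gbps ASIC: if any port is an E_Port, every port uses the default HT."""
    return DEFAULT_HT_MS if "E" in port_types else eht_ms

print(gen5_hold_time("F", 220))                  # 220
print(gen5_hold_time("E", 220))                  # 500
print(gen4_asic_hold_time(["F", "F", "E"], 220)) # 500 -> one E_Port forces 500 ms
print(gen4_asic_hold_time(["F", "F", "F"], 220)) # 220
```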
When deploying Virtual Fabrics, a unique EHT value can be independently configured for
each Logical Switch for Gen 5 Platforms running FOS V7.1 or later. When deploying Virtual
Fabrics with FOS V7.0, the EHT value that is configured into the default switch is the value
that is used for all Logical Switches. 8 Gbps blades that are installed in a Gen 5 platform
continue to use the Default Logical Switch configured value for all ports on those blades
regardless of which Logical Switches those ports are assigned to.
Preferred settings
Starting with FOS V7.0, the default EHT value is set to a moderate value of 220 ms. This default EHT value is appropriate for almost all environments.
The lowest EHT value of 80 ms can provide more protection from misbehaving initiators
compared to the default value, but this aggressive setting is preferable only for fabrics that are
well maintained and when a more aggressive monitoring and protection strategy is being
deployed. Additionally, this lowest value should be configured only on edge switches composed entirely of initiators (with no device target ports): a frame drop has more significance for a target device than for an initiator because multiple initiators typically communicate with a single target port. Frame drops on target ports usually result in “SCSI Transport” error messages being generated in server logs. Multiple frame drops from the same target port can affect multiple servers in what appears to be a random fabric or storage problem. Because the source of the error is not obvious, this situation can result in wasted time determining the source of the problem. Extra care should be taken to avoid applying this lowest EHT, especially on switches where targets are deployed.
FC credit-based recovery applies to external switch ports and back-end ports (ports that are
connected to the core blade or core blade back-end ports) that are used for traffic within a
switch. Traffic stalls on these internal back-end ports can have a wide impact, particularly
when they impact virtual circuits of an ISL. Starting with FOS V6.4.2 and V7.0, Brocade
introduced enhanced credit recovery tools to mitigate this type of problem. These tools can be
enabled to automatically reset back-end ports when a loss of credits is detected on internal
ports.
As a preferred practice, explicitly enable the credit recovery tools for internal ports because
this function is not enabled by default.
There are two main choices for how the recovery can proceed when enabled:
An escalating recovery based on the results of a single link reset only (onLrOnly)
A threshold-based approach that uses multiple link resets (onLrThresh).
When used with the onLrOnly option, the recovery mechanism takes the following escalating
actions:
When it detects credit loss, it performs a link reset and logs a RASlog message (RAS
Cx-1014).
If the link reset fails to recover the port, the port reinitializes. A RASlog message is
generated (RAS Cx-1015). The port reinitialization does not fault the blade.
If the port fails to reinitialize, the port is faulted. A RASlog message (RAS Cx-1016) is
generated.
If a port is faulted and there are no more online back-end ports in the trunk, the core blade
is faulted. (The port blade is always faulted.) A RASlog message is generated (RAS
Cx-1017).
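The escalation sequence above maps cleanly to a small function. This is a sketch of the documented sequence and RASlog IDs only; a real switch determines the success or failure of each step from the hardware, whereas here they are inputs.

```python
# Sketch of the escalating onLrOnly recovery sequence described above,
# with the corresponding RASlog IDs.

def recover(link_reset_ok, reinit_ok, other_backend_ports_online):
    """Return the RASlog messages emitted while escalating recovery."""
    logs = ["Cx-1014: credit loss detected, link reset issued"]
    if link_reset_ok:
        return logs
    logs.append("Cx-1015: link reset failed, port reinitialized")
    if reinit_ok:
        return logs
    logs.append("Cx-1016: reinitialization failed, port faulted")
    if not other_backend_ports_online:
        logs.append("Cx-1017: last back-end port in trunk, core blade faulted")
    return logs

print(recover(True, True, True))    # stops after the link reset
print(recover(False, False, False)) # escalates all the way to a blade fault
```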
As a preferred practice, enable the credit tools with the onLrOnly recovery option.
Example 2 shows the use of the creditRecovMode command to enable credit recovery on
back-end ports by using the recovery option onLrOnly.
Using more meaningful port names makes these messages and dashboards easier to interpret, and makes it easier and quicker to identify external devices that are causing fabric problems.
The problem is that manually setting meaningful port names is labor-intensive; typically, it is done only with scripts that set the port name to the alias name of the attached device.
With FOS V7.4, Brocade introduced dynamic port names that dynamically set the port name
to <switch name>.<port type>.<port index>.<alias name>. Dynamic port name is enabled by
using the configure command and setting dynamic port name to on.
In FOS V8, enhancements were made to allow configuring the dynamic port name by using
any of the following fields:
Switch Name
Port Type
Port Index
F_Port Alias
FDMI Host name
Remote Switch Name
Slot / Port Number
Example 3 shows examples of both dynamically and manually set port names.
Note: Ports that have manually set port names are not updated with dynamic port names.
To remove a manually set port name, you must reset the port configuration with the
portcfgdefault command, which resets all port parameters, including the port name, to
their default values.
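The default dynamic port name format described above is a simple composition of the listed fields. The sketch below illustrates it; the field values are made up for the example.

```python
# Sketch: composing a FOS V7.4-style dynamic port name in the documented
# <switch name>.<port type>.<port index>.<alias name> format.
# The switch name, index, and alias below are invented examples.

def dynamic_port_name(switch_name, port_type, port_index, alias):
    """Build the default dynamic port name string."""
    return f"{switch_name}.{port_type}.{port_index}.{alias}"

name = dynamic_port_name("DCX1", "F", 37, "payroll_hba1")
print(name)   # DCX1.F.37.payroll_hba1
```

A name like this immediately tells an operator which switch, port type, port, and attached device a MAPS alert refers to.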
Preferred settings
Enable dynamic port names.
Bottleneck Detection
Bottleneck Detection provides monitoring and detection of devices that introduce latency or
congestion into the fabric. This function was originally introduced in FOS V6.3.0 and has been
significantly enhanced in subsequent FOS releases in terms of effectiveness, reporting, and
configuration.
Starting in FOS V7.3, a new feature that is called FPI was introduced, which provides an
enhanced implementation of the Bottleneck Detection function and provides integration with
MAPS.
In FOS V7.3, the original Bottleneck Detection feature is still available for use, but it cannot be
used with FPI. The original Bottleneck Detection feature and FPI cannot be enabled
concurrently. As a preferred practice, use FPI over the Bottleneck Detection feature if the
Fabric Vision license is in place to enable it.
Starting with FOS V7.4, the original Bottleneck Detection feature was removed and FPI
becomes the only option to use this important function.
If the alert parameter is not specified, alerts are not sent, but a history of bottleneck
conditions for the port can be viewed. The thresh, time, and qtime parameters are also
ignored if the alert parameter is not specified.
Use the default values for the thresh (0.1), time (300), and qtime (300) parameters.
Example 4 shows the bottleneck history for port 3 in 5-second windows over a period of 30
seconds.
Preferred settings
On switches running FOS V6.3 through FOS V7.2, enable bottleneck monitoring.
FPI detects different severity levels of latency and reports two latency states:
The IO_FRAME_LOSS state is a severe level of latency. In this state, frame timeouts
either have occurred or are likely to occur. Administrators should take immediate action to
prevent application interruption.
The IO_PERF_IMPACT state is a moderate level of latency. In this state, device-based latencies can negatively impact the overall network performance. Administrators should act to mitigate the effect of the latency devices.
FOS V7.4 added the IO_LATENCY_CLEAR state so that administrators are alerted when the latency conditions clear.
The separate states enable administrators to apply different MAPS actions for different
severity levels. For example, administrators can configure SDDQ or the port toggling action
for the IO_FRAME_LOSS state and email alert action for IO_PERF_IMPACT.
The port toggle action disables a port for a short and user-configurable duration, and then
enables the port. The port toggle action can recover slow-draining devices, such as those caused by a faulty host adapter. In addition, the port toggle action can induce multipathing software (MPIO) to fail traffic over to an alternative path to prevent severe
performance degradation. By using the SDDQ or port toggle actions, administrators can
monitor for device-based latency and automatically mitigate the problem when such
conditions are detected by FPI.
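The per-state action mapping suggested above can be expressed as data. The state names below are FOS's FPI states; the action names and the classification inputs are hypothetical, chosen only to make the mapping concrete.

```python
# Sketch of mapping FPI states to MAPS actions as suggested above.
# Action names and classification inputs are illustrative assumptions.

ACTIONS = {
    "IO_FRAME_LOSS": ["SDDQ", "email"],   # severe: quarantine + notify
    "IO_PERF_IMPACT": ["email"],          # moderate: notify only
    "IO_LATENCY_CLEAR": ["email"],        # FOS V7.4+: condition cleared
}

def classify(frame_timeouts, moderate_latency):
    """Pick an FPI state from hypothetical per-port indicators."""
    if frame_timeouts:
        return "IO_FRAME_LOSS"
    if moderate_latency:
        return "IO_PERF_IMPACT"
    return "IO_LATENCY_CLEAR"

state = classify(frame_timeouts=True, moderate_latency=True)
print(state, ACTIONS[state])   # IO_FRAME_LOSS ['SDDQ', 'email']
```

Keeping the mapping as data mirrors how MAPS itself separates rule conditions from actions, so severity levels can be retargeted without touching the detection logic.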
Preferred settings
On switches running FOS V7.3, enable FPI; on switches with FOS V7.4 and higher, enable
FPI and set up MAPS to quarantine the port for IO_FRAME_LOSS events by using the SDDQ
option (see “Monitoring Alerting Policy Suite setup” on page 31).
Note: To use SDDQ, quality of service (QoS) must be enabled on all switches, which is the
factory shipped default.
MAPS was introduced in FOS V7.2 and replaces Fabric Watch as the preferred monitoring
tool. In FOS V7.4, Fabric Watch is no longer available.
Note: A Fabric Watch and Advanced Performance Monitoring or Fabric Vision license is
required to use MAPS.
Enabling MAPS monitoring with Network Advisor by using one of the default profiles is quick
and easy and provides effective monitoring of key metrics on every switch in the fabric. You
can enable MAPS by going to Network Advisor and clicking Monitor → Fabric Vision → MAPS → Enable, then selecting and enabling a default MAPS policy and distributing the policy to all switches that are managed by Network Advisor by clicking Monitor → Fabric Vision → MAPS → Configure.
Figure 3 shows the Network Advisor Fabric Vision MAPS menu.
If the switches were running Fabric Watch when MAPS is enabled, the Fabric Watch
configuration is converted to MAPS rules and a policy for the active and default Fabric Watch
settings is created. If you had a customized Fabric Watch set of rules, you can use the Fabric
Watch MAPS policy that was created instead of the default MAPS policy.
Note: Fabric Watch thresholds can be converted to MAPS rules/policies only on FOS V7.2
or V7.3. After the switch is upgraded to FOS V7.4, the Fabric Watch settings are lost.
MAPS alerts in the default policy generate emails and SNMP alerts. It is a preferred practice
to configure SNMP alerts to be sent to an SNMP manager, or configure email alerts to have
emails that are sent to key personnel to notify them of MAPS alerts.
Smaller installations that do not have Network Advisor and are running FOS V7.2 or V7.3 can enable MAPS by running mapsconfig --enablemaps. In FOS V7.4 and higher, MAPS is enabled by
default, but unless you have a license, you can use only the limited base monitoring policy.
Note: It is common after enabling MAPS to see CPU utilization alerts. These can be
normal, as described in “CPU utilization” on page 45.
Use the moderate default policy as a base and then customize the MAPS thresholds that are
shown in Table 1. Port thresholds should be customized for the Non_E_F_PORTS, ALL_E_PORTS, ALL_OTHER_F_PORTS, ALL_HOST_PORTS, and ALL_TARGET_PORTS groups.
Counter         Time base  Op  Threshold values
C3TXTO          min        ge  3, 20
CRC             min        ge  10, 20, 40
ITW             min        ge  20, 40
LF              min        ge  3, 5
LOSS_SIGNAL     min        ge  3
LOSS_SYNC       min        ge  3
LR              min        ge  5, 10, 20
LR              hour       ge  60
PE              min        ge  37, 7
RX / TX / UTIL  hour       ge  75, 90
STATE_CHG       min        ge  5, 10
Note: Thresholds in the fence column should be created only in the ALL_HOST_PORTS
group.
Set the EPORT_DOWN and FAB_SEG thresholds to greater than or equal to 1 to create
alerts for every E_Port link that is down or switch segmentation by using the thresholds that
are shown in Table 2.
Counter     Time base  Op  Threshold
EPORT_DOWN  min        ge  1
FAB_SEG     min        ge  1
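MAPS rules are essentially (counter, time base, operator, threshold, action) tuples evaluated against per-interval counter deltas. The sketch below represents a few such rules as data and checks a sample against them; the rule values and the evaluator are illustrative examples, not the MAPS implementation.

```python
# Sketch: representing MAPS-style rules as data and checking a counter
# sample against them. Rule fields mirror the tables above (counter,
# time base, operator ge, threshold, action); values are examples only.

RULES = [
    {"counter": "CRC", "timebase": "min", "threshold": 10, "action": "alert"},
    {"counter": "CRC", "timebase": "min", "threshold": 40, "action": "fence"},
    {"counter": "EPORT_DOWN", "timebase": "min", "threshold": 1, "action": "alert"},
]

def evaluate(counter, value):
    """Return the actions fired by a per-timebase counter delta (op: ge)."""
    return [r["action"] for r in RULES
            if r["counter"] == counter and value >= r["threshold"]]

print(evaluate("CRC", 12))        # ['alert'] -> above alert, below fence
print(evaluate("CRC", 50))        # ['alert', 'fence']
print(evaluate("EPORT_DOWN", 1))  # ['alert']
```

Layering a lower alert threshold under a higher fence threshold, as in the CRC rows, gives administrators warning before the port is actually blocked.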
Preferred settings
Enable MAPS with the default moderate policy on all switches running FOS V7.2 or higher.
For fabrics that have high availability targets, create custom rules to provide additional
monitoring for marginal issues.
Port Fencing
The Brocade MAPS feature provides the ability to protect against faulty components and
conditions that impact links by automatically blocking ports when predefined thresholds are
reached.
Enabling MAPS rules with the Port Fencing option should be done with care so that ports are fenced only when they have severe issues. As a preferred practice, enable port fencing only for the host port MAPS rules for the CRC and Link Reset thresholds. Also, run with the port fencing rules in place but the port fencing action disabled for a period to monitor the rules and ensure that they achieve the desired effect.
3. Click the right arrow to transfer the rule to the selected policy.
Figure 4 shows the editing of a MAPS rule.
4. After the rules are modified or created, activate the MAPS policy and monitor to ensure
that the rules are operating properly. Then, enable the port fencing facility. From the MAPS
configuration, click Actions and select the Fence check box.
Preferred settings
Enable port fencing on well-managed fabrics with high availability targets, and only on host
ports link reset and CRC metrics.
The dashboard displays different widgets that contain switch and port status, port thresholds, performance monitors, and other items. Network Advisor comes with some standard
dashboards, such as Product Status and Traffic and SAN Port Health, and you can create
additional custom dashboards.
A dashboard provides a high-level overview of the network and the current states of managed
devices. You can easily check the status of the devices in the network. The dashboard also
provides several features to help you quickly access reports, device configurations, and
system event logs.
The custom Fabric Health dashboard that is shown in Figure 6 has several widgets defined
that can quickly show the current state and health of the fabric. At the top, the Scope field
defines which switches and what time frame are used to populate the widgets. Widgets such as the Out of Range widget show the number of ports that had violations for each category for the selected time range; by double-clicking a category, dialog boxes open where you
can drill down to the specific details for the violations. Similarly, you can use the Events
widgets to click the event severity to display the individual event messages.
Note: For more information about setting up dashboards and configuring the dashboard
widgets, see the Brocade Network Advisor SAN User Manual for your release by searching
in the Brocade Document Library:
http://www.brocade.com/en/support/document-library.html
One of the more powerful features of the dashboards is the ability to select the time frame or
which fabric is used to populate the widgets. You can set the time frame for the last 24 hours
to see what issues occurred in the past day to monitor for marginal issues that might be
occurring, or narrow the scope to 30 minutes to focus on the current metrics when investigating a problem that is happening now.
Figure 7 on page 23 shows the dashboard time and fabric scope selection window.
Another useful feature of the dashboard widgets is the ability to double-click most of the
widget metrics to see additional details or a graph of the metric. Figure 8 shows the ITW port
widget. By double-clicking the port name, a chart showing the ITW occurrences opens.
Figure 8 shows the Host port ITW widget showing 427 ITWs.
Figure 9 shows the ITW graph after double-clicking the port on the ITW widget.
As a preferred practice, create a customized dashboard to monitor the overall fabric health,
and a dashboard to monitor port metrics. Optionally, create specialized custom dashboards to
show port metrics for storage devices, server ports, and ISLs. These dashboards are used
during major incidents and can help identify whether there is a storage port, server port, or
ISL that is causing a problem.
Preferred settings
Create Fabric Health and Ports dashboards with the widgets that are shown in Table 3.

Dashboard name   Status widget                      Performance widget
Fabric Health    Events, Status
Host Ports       Initiator Port Health Violations   Top Initiator Ports C3 Discards
Storage Ports    Target Port Health Violations      Top Target Ports C3 Discards
ISL Ports        ISL Port Health Violations         Top ISL Ports C3 Discards
Summary of preferred practices
Here is a summary of the preferred features and capabilities to improve the overall resiliency
of FOS-based FC fabric environments:
Enable an appropriate routing policy.
Configure an appropriate In Order Delivery (IOD) setting.
Configure an appropriate Dynamic Load Sharing (DLS) setting.
Verify Edge Hold Time.
Enable Credit Recovery Tool.
Enable Dynamic Port name.
Enable Fabric Performance Impact or Bottleneck monitoring.
Enable MAPS monitoring and alerting.
Enable Slow Drain Device Quarantine.
Configure and use Network Advisor Dashboards.
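As a quick audit of these practices, the current state of most settings can be displayed from the FOS CLI. The following sketch lists the display commands used elsewhere in this paper; command availability varies by FOS release, so treat it as a checklist rather than a definitive script:

```
switch:admin> aptpolicy                    # routing policy (3 = Exchange Based Routing)
switch:admin> iodshow                      # In Order Delivery setting
switch:admin> dlsshow                      # DLS / Lossless / E_Port balance state
switch:admin> mapspolicy --show -summary   # active MAPS policy (option form may differ by release)
```

Run these from each logical switch when virtual fabrics are enabled, because several of the settings are scoped per logical switch.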
Preferred implementation
This section describes the preferred sequence for implementing the fabric resiliency features
that are provided by the Brocade FOS along with the preferred configuration values.
Note: The preferred sequence and associated thresholds that are presented here are
suitable for most environments. Specific environments might require alternative
settings to meet particular requirements.
Enable EBR for open systems environments by using the Advanced Performance Tuning
Policy (aptpolicy) command.
Example 6 shows the aptpolicy command output with the EBR policy (policy 3) active.
DCX1_Default:FID128:dlutz> aptpolicy
Current Policy: 3
3 : Default Policy
1: Port Based Routing Policy
2: Device Based Routing Policy (FICON support only)
3: Exchange Based Routing Policy
Enable the Device Based Routing (DBR) policy (policy 2) for switches that support FICON only
or both FICON and open systems environments. Example 7 shows the aptpolicy command
output with the DBR policy active.
DCX1_Default:FID128:dlutz> aptpolicy
Current Policy: 2
3 : Default Policy
1: Port Based Routing Policy
2: Device Based Routing Policy (FICON support only)
3: Exchange Based Routing Policy
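The examples above display only the active policy. Changing the policy requires the switch to be disabled first. A minimal sketch (the policy numbers match the listings above):

```
switch:admin> switchdisable
switch:admin> aptpolicy 3     # 3 = Exchange Based Routing; use 2 for FICON (Device Based Routing)
switch:admin> switchenable
switch:admin> aptpolicy       # verify that Current Policy shows the new value
```

Because the change is disruptive, schedule it for a maintenance window or apply it before the switch joins the fabric.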
Example 8 shows the iodset and iodshow commands to enable frame IOD.
DCX1_SANA:FID16:dlutz> iodshow
IOD is set
Example 9 shows the iodreset and iodshow commands to disable frame IOD.
DCX1_SANA:FID16:dlutz> iodreset
IOD is not set
DCX1_SANA:FID16:dlutz> iodshow
IOD is not set
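For completeness, a sketch of the enable sequence that pairs with the iodreset example above; iodset enables frame IOD and iodshow verifies it:

```
DCX1_SANA:FID16:dlutz> iodset
DCX1_SANA:FID16:dlutz> iodshow
IOD is set
```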
Lossless should be enabled. Example 10 shows the dlsshow command output with lossless enabled.
DCX1_Default:FID128:dlutz> dlsshow
DLS is set with Lossless enabled
E_Port Balance Priority is not set
E_Port balance priority should be enabled. Example 11 shows the dlsshow command output
with E_Port balance priority enabled.
DCX1_SANA:FID16:dlutz> dlsshow
DLS is set with Lossless enabled
E_Port Balance Priority is set
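The dlsshow listings above display state only. A hedged sketch of the enable commands follows; the lossless option is documented in the Fabric OS Command Reference, but the flag name for E_Port balance priority is an assumption that you should verify for your release:

```
switch:admin> dlsset --enable -lossless       # enable DLS with lossless failover
switch:admin> dlsset --enable -eportbalance   # ASSUMED flag name for E_Port balance priority
switch:admin> dlsshow                         # verify the resulting state
```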
Note: To update the EHT setting on switches with virtual fabrics, run the configure
command from all logical switches.
Example 12 shows the configure command that is used to set the EHT.
Configure...
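The configure transcript above is truncated. The EHT value is set in the Fabric parameters section of the interactive configure dialog; the following sketch is illustrative, because the exact prompt wording and defaults differ between FOS releases:

```
switch:admin> configure
Configure...
  Fabric parameters (yes, y, no, n): [no] y
    Edge Hold Time (80..500 ms): [220] 220
```

Remember to repeat this from every logical switch when virtual fabrics are enabled, as the note above states.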
Example 13 shows the bottleneck commands to enable credit tools and display the credit
tools’ current setting for FOS V6.4 through FOS V7.2.
Example 13 The bottleneckmon cfgcredittools command for Fabric OS V6.4 through Fabric OS V7.2
IBM_2005_BK5:dlutz> bottleneckmon --cfgcredittools -intport -recover onLrOnly
Example 15 shows the creditrecovmode command to enable credit tools and display the
credit tools settings for FOS V7.4.
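The FOS V7.4 example body is not reproduced here. As a hedged sketch of the replacement command, the option names below follow the pattern of the V6.4 - V7.2 bottleneckmon command, so confirm them in the FOS V7.4 Command Reference:

```
IBM_2005_BK5:dlutz> creditrecovmode --cfg onLrOnly   # enable back-end credit recovery (syntax assumed)
IBM_2005_BK5:dlutz> creditrecovmode --show           # display the current credit recovery settings
```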
Note: To enable the dynamic port name on switches with virtual fabrics, run the configure
command from all logical switches.
Example 16 shows the configure commands that are used to enable the dynamic port name.
Configure...
2. Select the switches that you want to enable MAPS on by selecting them in the Available
Switches pane and click the right arrow to move them to the Selected Switches pane. After
all the switches that you want to enable MAPS on are selected, click OK to enable MAPS
on those switches.
Figure 11 shows the Network Advisor MAPS enable switch selection window.
5. In the MAPS Policy Actions dialog box, select the RAS Log Event, SNMP Trap, E-mail,
Switch Status Marginal, Switch Status Critical, and SFP Status Marginal check boxes,
and for switches with FOS V7.4 and higher, select FPI Actions and SDDQ. Click OK.
6. In the MAPS Configuration dialog box, expand the list of available policies for each of the
switches. Select the dft_conservative_policy for each switch. To select a policy for each
switch, hold the Ctrl key while selecting the policies. After policies for each switch are
selected, click Activate.
Figure 15 on page 35 shows the MAPS Configuration dialog box with
dft_conservative_policy selected.
Table 5 shows the custom MAPS port metric thresholds.

Monitor          Timebase  Op  Thresholds
C3TXTO           min       ge  3, 20
CRC              min       ge  10, 20, 40
ITW              min       ge  20, 40
LF               min       ge  3, 5
LOSS_SIGNAL      min       ge  3
LOSS_SYNC        min       ge  3
LR               min       ge  5, 10, 20
LR               hour      ge  60
PE               min       ge  3, 7
RX / TX / UTIL   hour      ge  75, 90
STATE_CHG        min       ge  5, 10
EPORT_DOWN       min       ge  1
FAB_SEG          min       ge  1

Hint: To help identify which existing default rules must be removed, run the following
command:
Example 19 shows the sample commands to create custom MAPS policy and rules.
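Example 19 itself is not reproduced above. As an illustration of the general pattern, the following sketch creates a custom policy and one rule that uses the CRC threshold from Table 5. The rule and policy names are invented for this example, and the exact option set should be confirmed in the Brocade MAPS Administration Guide for your release:

```
switch:admin> mapspolicy --create custom_policy      # hypothetical policy name
switch:admin> mapsrule --create crc_min_ge_10 -group ALL_PORTS -monitor CRC \
                -timebase min -op ge -value 10 -action RASLOG -policy custom_policy
switch:admin> mapspolicy --enable custom_policy      # activate after all rules are added
```

Repeat the mapsrule step for each row of Table 5, adding escalating actions (for example, FENCE) at the higher thresholds as appropriate for your environment.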
Note: For more information about MAPS and MAPS commands, see the Brocade MAPS
Administration Guide, found at:
http://my.brocade.com
Note: To use the SDDQ feature, QOS must be enabled on all switches.
Example 21 shows the portcfgshow command on ports with the default QOS AutoEnable
setting.
SANA_DCX1:FID16:dlutz> portcfgshow
Ports of Slot 2 16 17 18 19 20 21 22 23 29 30 31
----------------------+---+---+---+---+-----+---+---+---+---+---+---
Speed AN AN AN AN AN AN AN AN AN AN AN
Fill Word(On Active) 0 0 0 0 0 0 0 0 0 0 0
Fill Word(Current) 0 0 0 0 0 0 0 0 0 0 0
AL_PA Offset 13 .. .. .. .. .. .. .. .. .. .. ..
QOS Port AE AE AE AE AE AE AE AE AE AE AE
EX Port .. .. .. .. .. .. .. .. .. .. ..
2. To enable SDDQ, update the appropriate IO_FRAME_LOSS MAPS rules to use the
SDDQ action.
3. Start the MAPS configuration dialog box by clicking Monitor → Fabric Vision →
MAPS → Configure. In the MAPS Configure dialog box, select the appropriate MAPS
policy and click Edit. Edit the IO_FRAME_LOSS rule on the FPI tab and select the SDDQ
check box.
Figure 16 shows the FPI rule IO_FRAME_LOSS with the SDDQ action enabled.
Figure 16 Network Advisor update MAPS Fabric Performance Impact IO_FRAME_LOSS rule
Note: Run for several weeks with this rule enabled but with the SDDQ action disabled to
ensure that the rule triggers as expected.
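On the CLI, the equivalent configuration enables the SDDQ action globally and then relies on the FPI rule to trigger it. The following sketch assumes FOS V7.4 command forms, so verify the action names and the quarantine display command for your release:

```
switch:admin> mapsconfig --actions RASLOG,SNMP,EMAIL,SW_MARGINAL,SW_CRITICAL,SFP_MARGINAL,SDDQ
switch:admin> sddquarantine --show      # list any ports currently quarantined by SDDQ
```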
Creating dashboards
Complete the following steps:
1. Create Fabric Health and Ports dashboards with the widgets that are shown in Table 7.

   Dashboard name   Status widget     Performance widget
   Fabric Health    Events, Status

2. Optionally, create Host, Storage, and ISL port dashboards (which can be useful when
   doing problem determination), as shown in Table 8.

   Dashboard name   Status widget                      Performance widget
   Host Ports       Initiator Port Health Violations   Top Initiator Ports C3 Discards
   Storage Ports    Target Port Health Violations      Top Target Ports C3 Discards
   ISL Ports        ISL Port Health Violations         Top ISL Ports C3 Discards
4. Enter the dashboard name in the Name entry field in the Add Dashboard dialog box and
click OK.
Figure 19 shows the Add Dashboard dialog box.
5. Use the Customize Dashboard tool to add widgets to the empty dashboard.
Figure 20 shows the icon to start the Customize Dashboard tool.
6. To add widgets to the dashboard, select the required widgets by checking the check box
next to the widget titles.
Figure 21 on page 43 shows the Customize Dashboard Status dialog box.
Figure 23 shows the completed dashboard with widgets.
Access gateway
BladeCenter and chassis-style systems typically have embedded switches that are installed
in them. These switches can operate in native fabric mode or access gateway (AG) mode. AG
mode uses NPIV to connect the devices in the chassis to the network instead of native fabric
mode, which operates as a standard switch, requires its own fabric domain, and requires a
copy of the name server and configuration databases. In AG mode, the embedded switch
does not use any of these items.
Embedded switches can support trunking, which usually requires an optional license.
Trunking allows transparent failover and failback within the trunk group. Trunked links are
more efficient and can distribute I/O more evenly across all the links in the trunk group.
Run embedded switches in AG mode. For chassis with high throughput or high availability
goals, use trunking.
For more information, see Brocade Access Gateway Administrator’s Guide, found at:
http://my.brocade.com
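Converting an embedded switch to AG mode is disruptive (the switch is disabled and its configuration is cleared), so plan it for initial deployment. A minimal sketch:

```
switch:admin> switchdisable
switch:admin> ag --modeenable     # convert from native fabric mode to Access Gateway mode
switch:admin> ag --modeshow       # after the switch comes back, verify that AG mode is enabled
```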
It is common for switch CPU utilization to exceed the 80% threshold and reach 99% utilization
while performing tasks such as gathering supportsave data or doing SFP statistics polling.
This results in MAPS messages being generated.
Figure 24 shows the MAPS-1003 messages that are created when the CPU utilization
threshold is exceeded.
High CPU utilization is not a problem if the tasks that use the CPU release the CPU when
high priority requests, such as port logins or name server queries, occur. If the high CPU
utilization occurs only for short periods, it is typically not a problem. The MAPS high CPU
utilization messages do not indicate whether the duration is a short or a long duration.
A good way to identify the actual CPU utilization is to use the real-time or historical CPU
usage charts in Network Advisor.
Figure 25 shows the CPU utilization widget.
This example is from a switch that does not have a CPU utilization issue. The chart shows a
CPU utilization spike, which would have generated some MAPS-1003 messages, but
because the spike is a single short-duration event, it is not considered an issue.
The most common cause of high CPU utilization is external management applications
making requests through the Ethernet management port. Examples include more than one
Network Advisor instance managing the switches, or performance-gathering products such
as IBM Spectrum™ Control directly probing the switches.
Only one Network Advisor application should manage the switches, and performance
applications such as IBM Spectrum Control™ should get their data from Network Advisor (if
supported). If the performance application being used must get its data directly from the
switch, only one instance of the application should do so.
Frame Viewer
Frames that are discarded due to hold-time timeout are sent to the CPU for processing.
During subsequent CPU processing, information about the frame, such as SID, DID, and
transmit port number, is retrieved and logged. This information is maintained for a certain
fixed number of frames.
Frame Viewer captures only FC frames that are dropped due to a timeout that is received on
an Edge ASIC (an ASIC with front-end (FE) ports). If the frame is dropped due to any other
reason, it is not captured by Frame Viewer. If the frame is dropped due to timeout on an Rx
buffer on a Core ASIC, the frame is not captured by Frame Viewer. Timeout is defined as a
frame that lives in an Rx buffer for longer than the HT default of 500 ms or the EHT value
custom setting.
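The captured discard information can be displayed with the Frame Viewer CLI. A brief sketch follows; the filter option in the second command is an assumption, because the available filters vary by FOS release:

```
switch:admin> framelog --show               # list recently discarded frames (SID, DID, Tx port, timestamp)
switch:admin> framelog --show -txport 2/5   # ASSUMED filter: limit output to one transmit port
```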
FEC on Gen 5 can correct up to 11-bit errors in every 2112-bit transmission in a 10 Gbps/16
Gbps data stream in both frames and primitives. FEC is enabled by default on the back-end
(BE) links of Condor 3 ASIC-based switches and blades and minimizes the loss of credits on
BE links. FEC is also enabled by default on FE links when connected to another FEC-capable
device. FEC on Gen 6 uses a more robust coding algorithm that corrects up to seven 10-bit
streams and detects up to fourteen 10-bit streams, without the requirement that the errors be
in a burst. FEC is mandatory on Gen 6 platforms for 32 Gbps speed to ensure that the
bit-error rate stays within the standard requirement. Condor 4 ASIC automatically turns on
FEC when a port operates at 32 Gbps speed and cannot be disabled.
Enable FEC on 10 Gbps/16 Gbps connections when both ends of the link support it.
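On Gen 5 platforms, FEC on front-end links can be controlled per port. A hedged sketch follows; the option spelling is based on the FOS V7.x portcfgfec command, so confirm it in the Command Reference for your release:

```
switch:admin> portcfgfec --enable -FEC 2/5   # enable FEC on slot 2, port 5 (both ends must support it)
switch:admin> portcfgfec --show 2/5          # display the FEC state for the port
```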
Authors
This paper was produced by a team of specialists from around the world, working at the IBM
International Technical Support Organization. The content is based on Brocade
documentation and is presented in a form that specifically identifies IBM preferred practices.
Chad Collie
IBM Systems
Michael Hrencecin
IBM Systems
David Lutz
IBM GTS
Ian MacQuarrie
IBM Systems
Shawn Wright
IBM Systems
Jon Tate
IBM ITSO
Serge Monney
IBM GTS
Special thanks to Brocade for their support of this paper in terms of equipment and
assistance in many areas, and to the following people at Brocade:
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries:
C3®, FICON®, IBM®, IBM Spectrum™, IBM Spectrum Control™, IBM z13®, Redbooks®,
Redbooks (logo)®, Redpaper™, z13™
Other company, product, or service names may be trademarks or service marks of others.
REDP-4722-03
ISBN 073845589X
Printed in U.S.A.
ibm.com/redbooks