5 Minute Troubleshooting

The document provides guidance on initial troubleshooting steps that can be taken on Brocade equipment before opening a support case. It recommends checking the health status of switches using commands like switchstatusshow and switchshow, verifying port statuses and SFP details using commands like sfpshow, and reviewing link error counters with porterrshow to help identify potential cable, SFP, or firmware issues. Taking these initial troubleshooting steps can often resolve issues quickly without involving support.

5-minute initial troubleshooting on Brocade equipment
created by ELonden on Jun 18, 2013 11:22 PM, last modified by ELonden on Sep 4, 2013 4:15 PM
Version 2
Very often the HDS support organisation (GSC) gets involved in cases where a massive amount of
host logs, array dumps, and FC and IP traces has been collected, easily adding up to many
gigabytes of data. This is then accompanied by a very terse problem description such as "I
have a problem with my host, can you check?".
I'm sure the intention behind providing us all that data is good, but the problem is the lack
of detail about the problem itself. We do require a detailed explanation of what the problem
is, when it occurred, and whether it is still ongoing.

There are also things you can do yourself before opening a case with HDS. On many occasions
you'll find that the feedback you get from us within 10 minutes results in either the problem
being fixed or a simple workaround that reduces its impact. Further troubleshooting can then
be done in a somewhat less stressful time frame.

This example provides some bullet points on what you can do on a Brocade platform. (Mainly
since many of the problems I see are related to fabric issues and my job is primarily focused
on storage networking.)

First of all, take a look at the overall health of the switch:

Command: switchstatusshow
Explanation: Provides an overview of the general components of the switch. These all need to
show up as HEALTHY and not (as shown here) as "Marginal".

Sample output:

    Sydney_ILAB_DCX-4S_LS12
    Switch Health Report
    Switch Name: Sydney_ILAB
    IP address: 10.129.2.143
    SwitchState: MARGINAL
    Duration: 214:29

    Power supplies monitor MAR
    Temperatures monitor   HEA
    Fans monitor           HEALTH
    WWN servers monitor    HE
    CP monitor             HEALTH
    Blades monitor         HEALT
    Core Blades monitor    HEALT
    Flash monitor          HEALTH
    Marginal ports monitor HEA
    Faulty ports monitor   HEALT
    Missing SFPs monitor   HEA
    Error ports monitor    HEALT

    All ports are healthy

Command: switchshow
Explanation: Provides a general overview of the logical switch status (no physical
components) plus a list of ports and their status.

The switchState should always be Online.

The switchDomain should have a unique ID in the fabric.

If zoning is configured it should be in the "ON" state.

Ports that are connected and operational should all show "Online". If you see a port showing
"No_Sync" while the port is not disabled, there is likely a cable or SFP/HBA problem.

If you have configured FabricWatch to enable port fencing you'll see indications like the one
shown here for port 75.

Obviously, for any port to work it should be enabled.

Sample output:

    Sydney_ILAB_DCX-4S_LS12
    switchName: Sydney_ILAB_
    switchType: 77.3
    switchState: Online
    switchMode: Native
    switchRole: Principal
    switchDomain: 143
    switchId: fffc8f
    switchWwn: 10:00:00:05:1e:
    zoning: ON (Brocade)
    switchBeacon: OFF
    FC Router: OFF
    Fabric Name: FID 128
    Allow XISL Use: OFF
    LS Attributes: [FID: 128, Bas
    Mode 0]

    Index Slot Port Address Media
    ===========================
     0  1  0  8f0000  id  4G
        10:00:00:05:1e:36:02:bc "BR4
        master)
     1  1  1  8f0100  id  N8
        50:06:0e:80:06:cf:28:59
     2  1  2  8f0200  id  N8
        50:06:0e:80:06:cf:28:79
     3  1  3  8f0300  id  N8
        50:06:0e:80:06:cf:28:39
     4  1  4  8f0400  id  4G
     5  1  5  8f0500  id  N2
        50:06:0e:80:14:39:3c:15
     6  1  6  8f0600  id  4G
     7  1  7  8f0700  id  4G
     8  1  8  8f0800  id  N8
        50:06:0e:80:13:27:36:30

    75  2 11  8f4b00  id  N8
        State Change threshold excee
    76  2 12  8f4c00  id  N4
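
If you save the switchshow output to a file, the "No_Sync but not disabled" check described above is easy to script. The following is only a minimal sketch: the sample rows are hypothetical, and it assumes the State column is the seventh whitespace-separated field of a port row (after Index, Slot, Port, Address, Media and Speed), which may differ on non-bladed switches or other FOS versions.

```python
# Hypothetical switchshow port rows. Assumption: columns are
# Index Slot Port Address Media Speed State Proto [flags...]
SAMPLE = """\
  0    1    0   8f0000   id    4G    Online      FC  E-Port
  1    1    1   8f0100   id    N8    Online      FC  F-Port
  4    1    4   8f0400   id    4G    No_Sync     FC
  6    1    6   8f0600   id    4G    No_Sync     FC  Disabled
"""


def ports_needing_attention(text):
    """Return (slot, port, state) for ports in No_Sync that are not disabled."""
    flagged = []
    for line in text.splitlines():
        fields = line.split()
        # Port rows start with a numeric index and have at least 7 columns.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        slot, port, state = fields[1], fields[2], fields[6]
        disabled = "Disabled" in fields[7:]
        if state == "No_Sync" and not disabled:
            flagged.append((slot, port, state))
    return flagged


if __name__ == "__main__":
    for slot, port, state in ports_needing_attention(SAMPLE):
        print(f"slot {slot} port {port}: {state}, check cable/SFP/HBA")
```

Ports flagged this way are the ones to inspect with sfpshow next.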
Command: sfpshow <slot>/<port>
Explanation: One of the most important pieces of a link, irrespective of mode and distance, is
the SFP. On newer hardware and software it provides a lot of information on the overall health
of the link.

With older FOS code levels there could be a discrepancy between what was displayed in this
output and what was actually plugged into the port. The reason was that the SFPs are only
polled every now and then for status and update information. If a port was persistently
disabled it was not updated at all, so in theory you could plug in another SFP and sfpshow
would still display the old info. With FOS 7.0.1 and up this has been corrected and you can
now also see the latest polling time per SFP.

The question we often get is: "What should these values be?" The answer is "It depends". As
you can imagine, a shortwave 4G SFP requires less current than a longwave 100 km SFP, so in
essence the SFP specs should be consulted. As a rule of thumb, signal quality depends on the
TX power value minus the link-loss budget. The result should be within the RX power
specifications of the receiving SFP.

Also check the Current and Voltage of the SFP. If an SFP is broken, the indication is often
that it draws no power at all and you'll see these two values drop to zero.

Sample output:

    Sydney_ILAB_DCX-4S_LS12
    Identifier: 3 SFP
    Connector: 7 LC
    Transceiver: 540c4040000000
    Encoding: 1 8B10B
    Baud Rate: 85 (units 100 m
    Length 9u: 0 (units km)
    Length 9u: 0 (units 100 me
    Length 50u (OM2): 5 (units
    Length 50u (OM3): 0 (units
    Length 62.5u:2 (units 10 me
    Length Cu: 0 (units 1 mete
    Vendor Name: BROCADE
    Vendor OUI: 00:05:1e
    Vendor PN: 57-1000012-01
    Vendor Rev: A
    Wavelength: 850 (units nm)
    Options: 003a Loss_of_Sig
    BR Max: 0
    BR Min: 0
    Serial No: UAF110480000NY
    Date Code: 101125
    DD Type: 0x68
    Enh Options: 0xfa
    Status/Ctrl: 0x80
    Alarm flags[0,1] = 0x5, 0x0
    Warn Flags[0,1] = 0x5, 0x0
                                Ala
                                low
    Temperature: 25 Centigrad
    Current: 6.322 mAmps
    Voltage: 3290.2 mVolts
    RX Power: -3.2 dBm (476.
        1000.0 uW
    TX Power: -3.3 dBm (472.
        562.3 uW

    State transitions: 1
    Last poll time: 06-20-2013 ES
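
The rule of thumb above (TX power minus link loss should land inside the receiver's RX specification) can be written down as a small calculation. This is a sketch only: the TX value is taken from the sample output, but the link loss and the RX limits are made-up illustration values, so use the figures from your own SFP data sheet and cable plant.

```python
def dbm_to_uw(dbm):
    """Convert optical power from dBm to microwatts (0 dBm = 1 mW = 1000 uW)."""
    return 1000.0 * 10 ** (dbm / 10.0)


def expected_rx_dbm(tx_dbm, link_loss_db):
    """Power arriving at the receiver: transmit power minus the loss on the link."""
    return tx_dbm - link_loss_db


def within_rx_spec(tx_dbm, link_loss_db, rx_min_dbm, rx_max_dbm):
    """True if the expected RX power falls inside the receiver's specified range."""
    rx = expected_rx_dbm(tx_dbm, link_loss_db)
    return rx_min_dbm <= rx <= rx_max_dbm


if __name__ == "__main__":
    tx = -3.3        # TX Power from the sfpshow sample above
    loss = 2.0       # assumed link loss in dB (hypothetical)
    rx_min = -12.0   # assumed receiver sensitivity in dBm (hypothetical)
    rx_max = 0.0     # assumed receiver overload limit in dBm (hypothetical)

    print(f"TX {tx} dBm is about {dbm_to_uw(tx):.1f} uW")
    print(f"Expected RX: {expected_rx_dbm(tx, loss):.1f} dBm")
    print("Within RX spec:", within_rx_spec(tx, loss, rx_min, rx_max))
```

The dBm-to-uW conversion is the same one sfpshow applies when it prints both units next to each other; small differences come from rounding of the displayed dBm value.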
Command: porterrshow
Explanation: For link state counters this is the most useful command on the switch. There is,
however, a perception that this command provides a "silver bullet" for solving port and link
issues, and that is not the case. Basically it provides a snapshot of the content of the LESB
(Link Error Status Block) of a port at that particular point in time. It does not tell us when
these counters accumulated or over which time frame. So in order to create a sensible picture
of the status of the ports we need a baseline. This baseline can be created by resetting all
counters so they start from zero. To do this, issue the "statsclear" command on the CLI.

Sample output:

    Sydney_ILAB_DCX-4S_LS128:FID128:a
              frames      enc   crc   link  loss  loss  frjt  fbsy
            tx      rx    in    err   fail  sync  sig
      0:  100.1m  53.4m     0     0     0     0     0     0     0
      1:  466.6k 154.5k     0     0     0     0     0     0     0
      2:  476.9k 973.7k     0     0     0     0     0     0     0
      3:  474.2k 155.0k     0     0     0     0     0     0     0

There are 7 columns you should pay attention to from a physical perspective.

enc_in - Encoding errors inside frames. These are errors that happen at the FC-1 layer when
encoding 8 bits to 10 and back or, with 10G and 16G FC, from 64 bits to 66 and back. Since
these happen on bits that are part of a data frame, they are counted in this column.

crc_err - An enc_in error might lead to a CRC error; however, this column shows frames that
have been marked as invalid because of a CRC error earlier in the datapath. According to the
FC specifications it is up to the implementer whether to discard the frame right away or to
mark it as invalid and send it to the destination anyway. There are pros and cons to both
approaches. So basically, if you see crc_err increase in this column it means the port has
received a frame with an incorrect CRC, but the corruption occurred further upstream.

crc_g_eof - This column is the same as crc_err, however the incoming frames are NOT marked as
invalid. If you see these, most often the enc_in counter increases as well, but not
necessarily. If the enc_in and/or enc_out column increases as well, there is a physical link
issue which could be resolved by cleaning connectors, replacing a cable or (in rare cases)
replacing the SFP and/or HBA. If the enc_in and enc_out columns do NOT increase, there is an
issue between the SERDES chip and the SFP which causes the CRC to mismatch the frame. This is
a firmware issue which could be resolved by upgrading to the latest FOS code. There are a
couple of defects listed to track these.

enc_out - Similar to enc_in, this is the same encoding error, however it occurred outside
normal frame boundaries, i.e. no host I/O frame was impacted. This may seem harmless, but be
aware that a lot of primitive signals and sequences travel in between normal data frames and
are paramount for Fibre Channel operations. Primitives which regulate credit flow (R_RDY and
VC_RDY) and signal clock synchronization are especially important. If this column increases
on any port you'll likely run into performance problems sooner or later, or you will see a
problem with link stability and sync errors (see below).

Link_Fail - This means a port has received a NOS (Not Operational) primitive sequence from
the remote side and needs to change the port operational state to LF1 (Link Fail 1), after
which the recovery sequence needs to commence. (See the FC-FS standards specification for
that.)

Loss_Sync - Loss of synchronization. The transmitter and receiver sides of the link maintain
clock synchronization based on primitive signals which start with a certain bit pattern
(K28.5). If the receiver is not able to sync its baud rate to the point where it can
distinguish these primitives, it will lose sync and hence cannot determine when a data frame
starts.

Loss_Sig - Loss of Signal. This column shows a drop of light, i.e. no light (or insufficient
RX power) is observed for over 100 ms, after which the port goes into a non-active state. This
counter often increases when the link-loss budget is overdrawn. Say, for instance, the TX side
sends out light at -4 dBm and the receiver's lower sensitivity threshold is -12 dBm, which
leaves a budget of 8 dB. If the quality of the cable attenuates the signal to a value lower
than that threshold, you will see the port bounce very often and this counter increase.
Another culprit is often unclean connectors, patch panels and badly made fibre splices. These
ports should be shut down immediately and the cabling plant should be checked. Replacing
cables and/or bypassing patch panels is often a quick way to find out where the problem is.

The other columns are more related to protocol issues and/or performance problems, which can
be the result of a physical problem but not its cause. In short, look at the 7 columns
mentioned above and check that none of the values increases on any port.
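
The reasoning around the CRC columns can be condensed into a small decision helper. This is a sketch of the logic described above, not an official diagnostic: the function name and the dictionary keys are my own choice, and the input is assumed to be the increase of each counter since your last baseline.

```python
def classify_crc(delta):
    """Classify a port from counter increases since the baseline.

    `delta` maps column names (illustrative keys, not literal CLI headers)
    to the increase observed for one port.
    """
    enc_rising = delta.get("enc_in", 0) > 0 or delta.get("enc_out", 0) > 0

    if delta.get("crc_g_eof", 0) > 0:
        if enc_rising:
            # CRC errors together with encoding errors point at the physical link.
            return "physical: clean connectors, replace cable, rarely SFP/HBA"
        # CRC errors without encoding errors point at the SERDES/SFP interaction.
        return "firmware: check FOS release notes and consider an upgrade"
    if delta.get("crc_err", 0) > 0:
        # Frame was already marked invalid before it reached this port.
        return "upstream: corruption happened earlier in the datapath"
    if enc_rising:
        return "encoding: encoding errors without CRC errors, keep watching"
    return "clean: no CRC or encoding increase"


if __name__ == "__main__":
    print(classify_crc({"crc_g_eof": 12, "enc_in": 40}))
    print(classify_crc({"crc_g_eof": 3}))
    print(classify_crc({"crc_err": 7}))
```

Treat the result as a pointer for where to look first; the counters still need to be read against a known baseline.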

============================================
too_short/too_long - Indicates a protocol error where SOF or EOF are observed too soon or too
late. These two columns rarely increase.

bad_eof - Bad End-of-Frame. This column indicates an issue where the sender has observed an
abnormality in a frame or its transceiver while the frame header and portions of the payload
were already sent to the destination. The only way for a transceiver to notify the destination
is to invalidate the frame. It truncates the frame and adds an EOFni or EOFa to the end. This
signals the destination that the frame is corrupt and should be discarded.

F_Rjt and F_Bsy are often seen in FICON environments where control frames could not be
processed in time or are rejected based on fabric configuration or fabric status.

c3timout (tx/rx) - These counters indicate that a port is not able to forward a frame to its
destination in time. They show either a problem downstream of this port (tx) or a problem on
this port, where it has received a frame meant to be forwarded to another port inside the
same switch (rx). Frames are ALWAYS discarded at the RX side (since that's where the buffers
hold the frame). The tx column is an aggregate of all rx ports that need to send frames via
this port according to the routing tables created by FSPF.

pcs_err - Physical Coding Sublayer. These values represent encoding errors on 16G platforms
and above. Since 16G speeds have changed to 64b/66b encoding/decoding, there is a separate
control structure that takes care of this.
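
To put the two encoding schemes mentioned in this list side by side, here is a small worked calculation of how much of the line rate carries payload bits. It is plain arithmetic on the 8-to-10 and 64-to-66 ratios and says nothing about actual throughput on a given platform.

```python
from fractions import Fraction


def encoding_efficiency(data_bits, line_bits):
    """Fraction of transmitted bits that carry data for a given line code."""
    return Fraction(data_bits, line_bits)


if __name__ == "__main__":
    for name, data_bits, line_bits in (("8b/10b", 8, 10), ("64b/66b", 64, 66)):
        eff = encoding_efficiency(data_bits, line_bits)
        print(f"{name}: {eff} of the line rate is data ({float(eff) * 100:.2f}%)")
```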

As a best practice it is wise to keep a record of these port errors and create a new baseline
every week. This allows you to quickly identify errors and solve them before they become a
problem with an extended resolution time. Make sure you do this fabric-wide to maintain
consistency across all switches in that fabric.
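
Comparing a new snapshot against the stored baseline is straightforward to automate once the counters are collected. The sketch below assumes you have already parsed porterrshow into one dictionary per port; the column keys are illustrative names for the 7 physical columns, and the "k", "m" and "g" suffix handling is an assumption based on the abbreviated values seen in the sample output.

```python
SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9}

# Illustrative keys for the 7 physical columns discussed above.
PHYSICAL = ("enc_in", "crc_err", "crc_g_eof", "enc_out",
            "link_fail", "loss_sync", "loss_sig")


def parse_count(text):
    """Convert a counter such as '466.6k' or '0' to a number."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIX:
        return float(text[:-1]) * SUFFIX[text[-1]]
    return float(text)


def increased(baseline, current):
    """Return {port: {column: increase}} for physical columns that grew."""
    report = {}
    for port, now in current.items():
        before = baseline.get(port, {})
        grown = {}
        for col in PHYSICAL:
            diff = parse_count(now.get(col, "0")) - parse_count(before.get(col, "0"))
            if diff > 0:
                grown[col] = diff
        if grown:
            report[port] = grown
    return report


if __name__ == "__main__":
    # Hypothetical snapshots for two ports.
    last_week = {0: {"enc_in": "0", "loss_sig": "0"}, 1: {"enc_out": "1.2k"}}
    today = {0: {"enc_in": "12", "loss_sig": "0"}, 1: {"enc_out": "1.2k"}}
    print(increased(last_week, today))
```

Because abbreviated values like "1.2k" are rounded, small increases on a large counter can go unnoticed, which is one more reason to reset the counters when you create a baseline.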

Make sure that all of these physical issues are solved first. No software can compensate for
hardware problems, and the HDS support organization will give you this task anyway before
starting work on the issue.

As for which information to collect please refer to https://tuf.hds.com where you will find pages
for all GSC supported products and a method on how to collect these.

Regards,
Erwin
