How to Troubleshoot OSPF Flapping on All Sessions on an S6720-EI Switch

Abstract

OSPF neighbors occasionally flap on several switches at the same time. The network consists of seven S6720-30C-EI-24S-DC switches running OSPF between them. Manually disconnecting and reconnecting, or restarting, some interfaces makes the OSPF neighbors recover.

Issue Description

Version: S6720 V200R010C00SPC600

For the purpose of this case, we focus on the configurations, logs, and troubleshooting of only some devices in the topology: 016, 023, and 006-OUL.

===============Device 016C0 ===============

!Software Version V200R010C00SPC600

interface LoopBack0

 description lo0

 ip address X.X.0.16 255.255.255.255

#

ospf 1 router-id X.X.0.16

 import-route direct type 1

 import-route static type 1

 import-route ospf 16 type 1

 area 0.0.0.0

  network X.X.16.176 0.0.0.3

  network X.X.16.220 0.0.0.3

  network X.X.18.200 0.0.0.3

  network X.X.100.208 0.0.0.3

#

ospf 16 router-id X.X.0.16

 import-route direct type 1

 import-route static type 1

 import-route ospf 1 type 1

 area X.X.16.0

  network X.X.16.216 0.0.0.3

#

cpu-defend policy test

 car packet-type arp-request cir 4096 cbs 800000

 car packet-type arp-reply cir 4096 cbs 800000

 car packet-type ospf cir 4096 cbs 800000

 car packet-type ospf-hello cir 4096 cbs 800000

 car packet-type ttl-expired cir 128 cbs 48128

 car packet-type icmp cir 1024 cbs 192512

 car packet-type arp-miss cir 1024 cbs 400000

#




cpu-defend-policy test global

cpu-defend-policy test

#

===============023C0 ===============

!Software Version V200R010C00SPC600

interface LoopBack0

 description lo

 ipv6 enable

 ip address X.X.0.23 255.255.255.255

#

ospf 1 router-id X.X.0.23

 import-route direct type 1

 import-route static type 1

 area 0.0.0.0

  network X.X.6.140 0.0.0.3

  network X.X.23.136 0.0.0.3

  network X.X.23.168 0.0.0.3

  network X.X.23.188 0.0.0.3

  network X.X.23.216 0.0.0.3

  network X.X.23.220 0.0.0.3

  network X.X.23.240 0.0.0.3

  network X.X.35.192 0.0.0.3

  network X.X.151.132 0.0.0.3

  network X.X.157.152 0.0.0.3

  network X.X.165.156 0.0.0.3

#

cpu-defend policy test

 car packet-type arp-request cir 4096 cbs 800000

 car packet-type arp-reply cir 4096 cbs 800000

 car packet-type ospf cir 4096 cbs 800000

 car packet-type ospf-hello cir 4096 cbs 800000

 car packet-type ttl-expired cir 128 cbs 800000

 car packet-type icmp cir 1024 cbs 96256

 car packet-type arp-miss cir 1024 cbs 400000

#

cpu-defend-policy test global

cpu-defend-policy test

#

===============006-OUL===============

!Software Version V200R010C00SPC600




interface LoopBack0

 description lo0

 ipv6 enable

 ip address X.X.0.6 255.255.255.255

#

ospf 1 router-id X.X.0.6

 import-route direct type 1

 import-route static type 1

 import-route ospf 2 type 1

 area 0.0.0.0

  network X.X.6.132 0.0.0.3

  network X.X.6.140 0.0.0.3

  network X.X.6.168 0.0.0.3

  network X.X.6.176 0.0.0.3

  network X.X.6.184 0.0.0.3

#

ospf 2 router-id X.X.0.6

 default-route-advertise type 1

 import-route direct type 1

 import-route static type 1

 import-route ospf 1 type 1

 area X.X.6.0

  network X.X.6.144 0.0.0.3

  network X.X.6.180 0.0.0.3

  network X.X.6.188 0.0.0.3

#

cpu-defend policy test

 car packet-type arp-request cir 4096 cbs 800000

 car packet-type arp-reply cir 4096 cbs 800000

 car packet-type ospf cir 4096 cbs 800000

 car packet-type ospf-hello cir 4096 cbs 800000

 car packet-type ttl-expired cir 128 cbs 48128

 car packet-type icmp cir 1024 cbs 192512

 car packet-type arp-miss cir 1024 cbs 400000

#

cpu-defend-policy test global

cpu-defend-policy test

 

Alarm Information

Analysis of the network logs shows OSPF neighbors flapping on device 016:

 Oct 11 2017 10:43:34-03:00 016C0 - Workcenter %%01ADPIPV4/4/CPCAR_TTL1_DROP(l)[38]:The number of packets sent to the CPU exceed the threshold 20000.(SLOT=0, CPCAR TYPE=CPCAR_TTL1, DiscardPacketCount=24258, Reason="A routing loop may occur")

Oct 11 2017 10:43:34-03:00 016C0 - Workcenter %%01DEFD/4/CPCAR_DROP_MPU(l)[39]:Rate of packets to cpu exceeded the CPCAR limit on the MPU. (Protocol=ttl-expired, CIR/CBS=128/48128, ExceededPacketCount=24258)

Oct 11 2017 10:41:12-03:00 016C0 - Workcenter %%01RM/4/IPV4_DEFT_RT_CHG(l)[40]:IPV4 default Route is changed. (ChangeType=Delete, InstanceId=0, Protocol=OSPF, ExitIf=XGigabitEthernet0/0/4, Nexthop=X.19.18.201, Neighbour=0.0.0.0, Preference=150, Label=NULL, Metric=0)

Oct 11 2017 10:41:12-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[41]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborRouterId=X.19.0.18, NeighborAreaId=0, NeighborInterface=XGigabitEthernet0/0/4,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=2017-10-11 10:41:12-03:00)

Oct 11 2017 10:41:12-03:00 016C0 - Workcenter %%01OSPF/3/NBR_CHG_DOWN(l)[42]:Neighbor event:neighbor state changed to Down. (ProcessId=1, NeighborAddress=X.19.18.201, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[43]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborRouterId=X.19.252.16, NeighborAreaId=0, NeighborInterface=XGigabitEthernet0/0/10,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=2017-10-11 10:41:11-03:00)

Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_CHG_DOWN(l)[44]:Neighbor event:neighbor state changed to Down. (ProcessId=1, NeighborAddress=X.19.16.222, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[45]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborRouterId=X.18.0.5, NeighborAreaId=0, NeighborInterface=XGigabitEthernet0/0/3,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=2017-10-11 10:41:11-03:00)

Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_CHG_DOWN(l)[46]:Neighbor event:neighbor state changed to Down. (ProcessId=1, NeighborAddress=X.19.16.178, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Oct 11 2017 10:41:10-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[47]:Neighbor state leaves full or changed to Down. (ProcessId=16, NeighborRouterId=X.19.18.254, NeighborAreaId=3323138048, NeighborInterface=XGigabitEthernet0/0/2,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=2017-10-11 10:41:10-03:00)

Oct 11 2017 10:41:10-03:00 016C0 - Workcenter %%01OSPF/3/NBR_CHG_DOWN(l)[48]:Neighbor event:neighbor state changed to Down. (ProcessId=16, NeighborAddress=X.19.16.218, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)

Oct 11 2017 10:41:02-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[49]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborRouterId=X.19.254.151, NeighborAreaId=0, NeighborInterface=XGigabitEthernet0/0/5,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen, NeighborChangeTime=2017-10-11 10:41:02-03:00)

Oct 11 2017 10:41:02-03:00 016C0 - Workcenter %%01OSPF/3/NBR_CHG_DOWN(l)[50]:Neighbor event:neighbor state changed to Down. (ProcessId=1, NeighborAddress=X.19.100.209, NeighborEvent=InactivityTimer, NeighborPreviousState=Full, NeighborCurrentState=Down)
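The neighbor-down entries above can be summarized with a short script to confirm that the flaps are simultaneous and device-wide rather than tied to one link. The following is a minimal Python sketch, not part of the original analysis; the function names are hypothetical, and the sample lines are trimmed copies of the five NBR_DOWN_REASON entries from device 016.

```python
import re
from datetime import datetime

# Trimmed copies of the five NBR_DOWN_REASON entries from device 016
# (only the fields used below are kept).
SAMPLE_LOG = """\
Oct 11 2017 10:41:12-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[41]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborInterface=XGigabitEthernet0/0/4,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen)
Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[43]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborInterface=XGigabitEthernet0/0/10,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen)
Oct 11 2017 10:41:11-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[45]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborInterface=XGigabitEthernet0/0/3,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen)
Oct 11 2017 10:41:10-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[47]:Neighbor state leaves full or changed to Down. (ProcessId=16, NeighborInterface=XGigabitEthernet0/0/2,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen)
Oct 11 2017 10:41:02-03:00 016C0 - Workcenter %%01OSPF/3/NBR_DOWN_REASON(l)[49]:Neighbor state leaves full or changed to Down. (ProcessId=1, NeighborInterface=XGigabitEthernet0/0/5,NeighborDownImmediate reason=Neighbor Down Due to Inactivity, NeighborDownPrimeReason=Hello Not Seen)
"""

# One match per log line: timestamp, OSPF process, interface, prime reason.
ENTRY = re.compile(
    r"^(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2})[-+]\d{2}:\d{2} "
    r".*?NBR_DOWN_REASON.*?ProcessId=(?P<pid>\d+),"
    r".*?NeighborInterface=(?P<ifname>[^,]+),"
    r".*?NeighborDownPrimeReason=(?P<reason>[^,)]+)",
    re.MULTILINE,
)


def parse_neighbor_down(log_text):
    """Extract one dict per NBR_DOWN_REASON entry."""
    events = []
    for m in ENTRY.finditer(log_text):
        events.append({
            "time": datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S"),
            "process": int(m["pid"]),
            "interface": m["ifname"],
            "reason": m["reason"].strip(),
        })
    return events


def summarize(events):
    """Count affected interfaces and measure how close together the events are."""
    times = [e["time"] for e in events]
    return {
        "events": len(events),
        "interfaces": len({e["interface"] for e in events}),
        "processes": sorted({e["process"] for e in events}),
        "reasons": sorted({e["reason"] for e in events}),
        "spread_seconds": int((max(times) - min(times)).total_seconds()),
    }


if __name__ == "__main__":
    print(summarize(parse_neighbor_down(SAMPLE_LOG)))
```

On the sample, all five neighbors are on different interfaces, span both OSPF processes, share the reason "Hello Not Seen", and go down within 10 seconds of each other, which points to a cause inside the device rather than on a single link.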

The device log on 006-OUL shows OSPF retransmission records and inbound port congestion at the time of the fault:

Oct 11 2017 10:40:33-03:00 006C0 - UOL Glete OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.19.6.169, AddressLessIf=0, NbrIfIpAddress=X.19.6.170, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=5, LsdbLsid=X.229.236.64, LsdbRouterId=X.18.0.38, ProcessId=1, RouterId=X.18.0.6, IfNeighbor=X.18.0.98, PacketType=4, InstanceName=)

Oct 11 2017 10:40:42-03:00 006C0 - UOL Glete OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.19.6.169, AddressLessIf=0, NbrIfIpAddress=X.19.6.170, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=5, LsdbLsid=X.18.0.38, LsdbRouterId=X.18.0.38, ProcessId=1, RouterId=X.18.0.6, IfNeighbor=X.18.0.98, PacketType=4, InstanceName=)

Oct 11 2017 10:40:45-03:00 006C0 - UOL Glete OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.19.6.133, AddressLessIf=0, NbrIfIpAddress=X.19.6.134, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=5, LsdbLsid=X.19.89.192, LsdbRouterId=X.18.0.89, ProcessId=1, RouterId=X.18.0.6, IfNeighbor=X.19.0.3, PacketType=4, InstanceName=)

Oct 11 2017 10:40:50-03:00 006C0 - UOL Glete OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.19.6.169, AddressLessIf=0, NbrIfIpAddress=X.19.6.170, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=1, LsdbLsid=X.18.0.25, LsdbRouterId=X.18.0.25, ProcessId=1, RouterId=X.18.0.6, IfNeighbor=X.18.0.98, PacketType=4, InstanceName=)

The interface counters on 006-OUL show a large number of pause frames on the interface connected to switch 023:

XGigabitEthernet0/0/5 current state : UP

Line protocol current state : UP

(output omitted)

Input:  156855421911 packets, 139958323690721 bytes

  Unicast:               156852854426,  Multicast:                     2077592

  Broadcast:                       12,  Jumbo:                           11638

 Discard:                     291314,  Pause:                          133472

  Frames:                           0

 Total Error:                  53457

  CRC:                          48296,  Giants:                              0

  Runts:                          215,  DropEvents:                          0

  Alignments:                       0,  Symbols:                          4946

  Ignoreds:                         0

Output:  38694671191 packets, 19234046696551 bytes

  Unicast:                38681103576,  Multicast:                     1879413

 Broadcast:                       22,  Jumbo:   
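When several interfaces need to be checked, the Discard and Pause counters can be pulled out of the saved interface output with a small parser. This is a minimal Python sketch with hypothetical function names; the sample text is the input-side counter block of XGigabitEthernet0/0/5 shown above.

```python
import re

# Input-side counters copied from the XGigabitEthernet0/0/5 output above.
SAMPLE_INPUT = """\
Input:  156855421911 packets, 139958323690721 bytes
  Unicast:               156852854426,  Multicast:                     2077592
  Broadcast:                       12,  Jumbo:                           11638
  Discard:                     291314,  Pause:                          133472
  Frames:                           0
  Total Error:                  53457
  CRC:                          48296,  Giants:                              0
  Runts:                          215,  DropEvents:                          0
  Alignments:                       0,  Symbols:                          4946
  Ignoreds:                         0
"""

# A counter is a name (letters and spaces) followed by a colon and a number.
COUNTER = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(\d+)")


def parse_counters(text):
    """Return {counter name: value} for every 'Name: number' pair in the text."""
    return {name: int(value) for name, value in COUNTER.findall(text)}


def nonzero(counters, watched=("Discard", "Pause")):
    """Return the watched counters that are greater than zero."""
    return {name: counters[name] for name in watched if counters.get(name, 0) > 0}


if __name__ == "__main__":
    print(nonzero(parse_counters(SAMPLE_INPUT)))
```

The watched list defaults to the two counters this case relies on; other names from the output can be passed in the same way.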

 

Handling Process

Normally, no input packets are discarded on the ports. The device ports are configured with flow control, and the port shows a large number of PAUSE frames. We therefore configured flow control on the ports in the lab and tried to generate many pause frames.

After testing for some time, OSPF flapped on different ports at the same time, and we saw the same abnormal records as in the live network: OSPF retransmissions and input packet discards.

OSPF retransmission records in the lab:

Oct 18 2017 11:31:00-03:00 023C0 - Barra Funda OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.X.157.153, AddressLessIf=0, NbrIfIpAddress=X.X.157.200, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=X.X.58.0, LsdbRouterId=X.X.0.1, ProcessId=1, RouterId=X.X.0.23, IfNeighbor=X.X.0.2, PacketType=4, InstanceName=)

Oct 18 2017 11:31:09-03:00 023C0 - Barra Funda OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.X.157.153, AddressLessIf=0, NbrIfIpAddress=X.X.157.200, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=X.X.58.0, LsdbRouterId=X.X.0.1, ProcessId=1, RouterId=X.X.0.23, IfNeighbor=X.X.0.2, PacketType=4, InstanceName=)

Oct 18 2017 11:31:18-03:00 023C0 - Barra Funda OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.X.157.153, AddressLessIf=0, NbrIfIpAddress=X.X.157.200, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=X.X.58.0, LsdbRouterId=X.X.0.1, ProcessId=1, RouterId=X.X.0.23, IfNeighbor=X.X.0.2, PacketType=4, InstanceName=)

Oct 18 2017 11:31:25-03:00 023C0 - Barra Funda OSPF/4/IFRETX:OID 1.3.6.1.2.1.14.16.2.10: An OSPF packet is retransmitted on a non-virtual interface. (IfIpAddress=X.X.157.153, AddressLessIf=0, NbrIfIpAddress=X.X.157.200, NbrAddressLessIf=0, LsdbAreaId=0.0.0.0, LsdbType=3, LsdbLsid=x.x.66.0, LsdbRouterId=X.X.0.1, ProcessId=1, RouterId=X.X.0.23, IfNeighbor=X.X.0.2, PacketType=4, InstanceName=)

Input packet discard records in the lab:

%2017-Oct-17 14:04:20.000.1-00:00 023C0 - Barra Funda 01IFPDT/4/PKT_INDISCARD_ABNL(D)[1202]:-Slot=CCC0/0; Input-discard in the last interval is over threshold. (IfName=[STRING], Threshold=[ULONG], Discard=[STRING], Interval=[ULONG](s))

%2017-Oct-17 15:10:56.450.1-00:00 023C0 - Barra Funda 01IFPDT/4/PKT_INDISCARD_NL(D)[1300]:-Slot=CCC0/0; Input-discard in the last interval is under threshold. (IfName=[STRING], Threshold=[ULONG], Discard=[STRING], Interval=[ULONG](s))

%2017-Oct-18 10:44:21.280.1-00:00 023C0 - Barra Funda 01IFPDT/4/PKT_INDISCARD_ABNL(D)[1593]:-Slot=CCC0/0; Input-discard in the last interval is over threshold. (IfName=[STRING], Threshold=[ULONG], Discard=[STRING], Interval=[ULONG](s))

%2017-Oct-18 17:14:23.380.1-00:00 023C0 - Barra Funda 01IFPDT/4/PKT_INDISCARD_ABNL(D)[868]:-Slot=CCC0/0; Input-discard in the last interval is over threshold. (IfName=[STRING], Threshold=[ULONG], Discard=[STRING

Root Cause

With flow control enabled, a device that receives many pause frames holds the packets waiting to be sent in the chip's packet buffer and transmits them more slowly. The chip's packet buffer is shared by all ports, so if too many pause frames arrive, the buffer fills up and the device can no longer forward any packets, including OSPF packets (Hello packets, which act as keepalives, and LS Updates). The OSPF packets are discarded and the OSPF neighbors flap. This can happen on any device that uses the flow control feature. When it happens on one of the seven switches, all of that switch's OSPF sessions flap and the whole network has to reconverge. Manually disconnecting and reconnecting, or restarting, some of the device's interfaces empties the buffer, after which forwarding and OSPF work normally again.
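The mechanism can be illustrated with a toy model. The Python sketch below is only an illustration under stated assumptions, not the switch's actual buffer management: the buffer size, fill rate, and port labels are hypothetical, and the 10-second hello and 40-second dead intervals are the usual OSPF defaults on broadcast and point-to-point links, assumed here because the case does not list the timers.

```python
def simulate(pause_seconds, ports=("p1", "p2", "p3"),
             buffer_cells=1000, fill_per_second=50,
             hello_interval=10, dead_interval=40, duration=120):
    """Return {port: second at which its OSPF neighbor is declared Down}.

    All numeric defaults are hypothetical; only the shared-buffer idea
    comes from the root-cause description.
    """
    used = 0                              # cells in use in the shared buffer
    last_hello = {p: 0 for p in ports}    # last second a Hello got through
    down_at = {}
    for t in range(1, duration + 1):
        if t <= pause_seconds:
            # Egress is paused: queued packets pile up in the shared buffer.
            used = min(buffer_cells, used + fill_per_second)
        else:
            # Pause ends (or the interface is reset): the buffer drains.
            used = 0
        buffer_full = used >= buffer_cells
        for p in ports:
            if p in down_at:
                continue
            # A Hello is due every hello_interval; it is lost on every port
            # while the shared buffer is full.
            if t % hello_interval == 0 and not buffer_full:
                last_hello[p] = t
            if t - last_hello[p] >= dead_interval:
                down_at[p] = t
    return down_at


if __name__ == "__main__":
    print("long pause :", simulate(pause_seconds=100))
    print("short pause:", simulate(pause_seconds=15))
```

In this model a long pause takes every port's neighbor down in the same second, because all ports depend on the same buffer, while a short pause that ends before the dead interval expires causes no flap.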

Solution

Remove the flow control configuration on the S6720-EI. With flow control disabled, the issue no longer occurs in the network.
