S5700

S5700 stack member becomes unreachable because a large number of MAC entries are deleted

Issue Description

When the issue occurred, an S5700 stack member intermittently became inaccessible. Status information about this stack member could not be obtained through commands, and the fault did not clear automatically. After the stack member was powered off and restarted, the fault disappeared.
The command output showed that information about one stack member could not be obtained. The following uses the display environment command as an example.

[Figure: output of the display environment command (image not reproduced)]

The preceding command output shows that temperature information can be obtained normally for all stack members except the one with slot ID 4. That is, obtaining temperature information about the stack member with slot ID 4 failed.

Alarm Information

None

Handling Process

1. Check the process of obtaining stack member status information in a stack.
In an S5700SI stack, the master switch obtains status information through remote procedure calls (RPC), and stack members exchange data by sending interprocess communication (IPC) messages. Because temperature information about a stack member could not be obtained, a fault occurred during RPC invocation. RPC uses IPC messages to exchange information, so the IPC message exchange process may have been abnormal.

2. Analyze the IPC processing flow.
Restarting the stack member re-initialized its software and rectified the fault, so an error had occurred during software processing. In addition, because powering off and restarting the stack member was enough to rectify the fault, the fault was located on the stack member.
3. View message queue statistics on the master switch.

[Figure: message queue statistics on the master switch (image not reproduced)]

The preceding command output shows that messages had accumulated in the VLAN, L2MA, and CXQO queues. The L2MA message queue (the MAC synchronization task message queue) was full, indicating that the IPC tasks of stack members were suspended and could not process IPC messages. As a result, messages accumulated on the master switch.
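The accumulation described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the switch software: the function names, the queue capacity, and the choice to count a message as rejected when the queue is full are all hypothetical. It shows that a bounded queue stays shallow while the receiving side keeps consuming, and fills up once the receiver stops.

```python
import queue


def send_to_member(q, msg):
    """Try to enqueue a message; return False if the bounded queue is full."""
    try:
        q.put_nowait(msg)
        return True
    except queue.Full:
        return False


def simulate(capacity, messages, consumer_running):
    """Return (final queue depth, number of messages that could not be queued).

    Hypothetical model: a healthy receiver consumes one pending message
    before each new send; a suspended receiver consumes nothing.
    """
    q = queue.Queue(maxsize=capacity)
    rejected = 0
    for i in range(messages):
        if consumer_running and not q.empty():
            q.get_nowait()  # healthy receiver drains one message
        if not send_to_member(q, i):
            rejected += 1
    return q.qsize(), rejected


if __name__ == "__main__":
    print("receiver healthy  :", simulate(4, 10, True))
    print("receiver suspended:", simulate(4, 10, False))
```

In this model, a full queue on the sending side is a symptom of a stalled receiver, which matches the reasoning that led the analysis from the master switch to the stack member.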

4. Analyze the reason for IPC task suspension.
Because the fault occurred on a stack member, we checked the black box of the stack member.

[Figure: black box information of the stack member (image not reproduced)]

The preceding command output shows that an infinite loop existed. Detailed information about the infinite loop is as follows:

[s5700_ST_5ET-diagnose]display deadloop 20 slot 4

============ Task Infinite Loop Information Begin ============
Dopra Version                    = DOPRA V100R006C09CP0671
Application Version              = UnConfig
Task Infinite Loop Type          = Task overrun
Task Infinite Loop Handle        = Reset system
Task Infinite Loop CpuId         = 0
Overrun Task Name                = DELM
Overrun Task VOS ID              = 21
Overrun Task Osal ID             = 0x06299840
Task Overrun Threshold           = 30000 (ms)
Task Has-run Time                = 30000 (ms)
Task Infinite Loop Occur Time    = [2014.05.28  18:14:02]
Task Infinite Loop Occur Cputick = [0x00023868, 0x456585a5]

The task that entered the infinite loop is DELM, which deletes MAC addresses. While the loop runs, the mv_l2_del_addr_by_port function holds the semaphore that protects MAC entries. Other tasks that need to operate on MAC entries, such as the IPC task, are suspended because the semaphore is unavailable. Because the infinite loop never ends, the semaphore is never released and the IPC task stays suspended, resulting in the fault.
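The blocking pattern described above can be reproduced in miniature. The following is a minimal sketch, assuming a plain mutex stands in for the MAC-entry semaphore; the function names are hypothetical, and the stuck task waits on an event instead of spinning so that the example terminates. It is not the device code.

```python
import threading

# Stands in for the semaphore that protects MAC entries (assumption).
mac_table_lock = threading.Lock()


def delm_task(stop_event, started):
    """Model of the stuck deletion task: takes the lock and never finishes
    its work until stop_event is set (which models restarting the member)."""
    with mac_table_lock:
        started.set()
        stop_event.wait()


def ipc_task_try(timeout):
    """Model of the IPC task: needs the same lock to touch MAC entries.
    Returns True if the lock was obtained within the timeout."""
    got = mac_table_lock.acquire(timeout=timeout)
    if got:
        mac_table_lock.release()
    return got


def demo():
    stop = threading.Event()
    started = threading.Event()
    worker = threading.Thread(target=delm_task, args=(stop, started))
    worker.start()
    started.wait()                      # the lock is now held by delm_task
    blocked = not ipc_task_try(0.1)     # IPC task cannot get the lock
    stop.set()                          # models the power-off and restart
    worker.join()
    recovered = ipc_task_try(0.1)       # lock is free again
    return blocked, recovered


if __name__ == "__main__":
    print(demo())
```

The sketch uses a timeout only so that the blocked state can be observed. A task that waits on the semaphore without a timeout, as the IPC task apparently did, stays suspended for as long as the holder keeps running.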
5. Analyze the reason for an infinite loop.
A code walk-through showed that when the deletion of a large number of MAC entries was triggered in a short period, messages notifying the deletion of MAC addresses accumulated in the message queue. Due to a software bug, the DELM task kept reading the message queue status while the messages were accumulated. Consequently, the DELM task entered an infinite loop.
The infinite loop was triggered by the deletion of MAC entries. Log analysis showed that the S5700s frequently received STP TC messages from Eth-Trunk 5. After an S5700 receives a TC message, it deletes the MAC entries of the related interfaces.
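The actual defective code is not shown in this case. As a purely illustrative sketch of one way such a bug can look, the following compares a loop that only checks the queue status with a loop that removes each message it handles. All names are hypothetical, and an iteration cap is added so that the faulty variant terminates in the example.

```python
from collections import deque


def buggy_delm(msgs, max_iterations):
    """Checks whether messages are pending but never dequeues one,
    so the loop condition can never become false on its own.
    Returns (iterations used, messages still queued)."""
    iterations = 0
    while len(msgs) > 0 and iterations < max_iterations:
        iterations += 1  # status is read again; nothing is consumed
    return iterations, len(msgs)


def fixed_delm(msgs, max_iterations):
    """Dequeues one message per iteration, so the backlog shrinks.
    Returns (iterations used, messages still queued)."""
    iterations = 0
    while msgs and iterations < max_iterations:
        msgs.popleft()  # handle one MAC deletion notification
        iterations += 1
    return iterations, len(msgs)


if __name__ == "__main__":
    print("buggy:", buggy_delm(deque(range(5)), 1000))
    print("fixed:", fixed_delm(deque(range(5)), 1000))
```

Without the cap, the first variant would run until a watchdog intervened, which is consistent with the 30000 ms task overrun threshold reported in the black box output.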
6. Conclusion
When a device was triggered to delete a large number of MAC entries, a software bug prevented other tasks from obtaining the semaphore of MAC entries. The IPC task was suspended while applying for the semaphore, so the master switch could not access the stack member.
7. As a workaround, the stp edged-port enable command was run on the related ports to reduce TC messages, after which the issue disappeared.
8. The patch for this software bug will be released at the end of July 2014 to resolve this issue completely.

Root Cause

1. High CPU usage
2. Stack cable problem
3. Software bug

Suggestions

When running STP on switches, configure stp edged-port enable on interfaces that connect to PCs and servers to avoid frequent MAC address flushing.

 
