Why does high CPU usage on the NE40E cause abnormal traffic statistics in the NMS?

Abstract

In the third-party monitoring tool (NMS), the traffic graph for an NE40E interface shows pronounced peaks. These peaks do not reflect the actual traffic on the interface, so the reported statistics are abnormal.


Handling Process
  1. Check the logs for packet drops. The logs show that CP CAR is dropping a large number of packets:
Line 48119: Oct 22, 2019 10:18:43-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590236]:Slot=1;The CP CAR dropped packets is detected to slide into a warning state(TypeID=183, ProtocolName=183, Threshold=30000, Interval=600, Dropped-Packets=179314).

Line 48132: Oct 22, 2019 10:24:38-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590249]:Slot=1;The CP CAR dropped packets is detected to slide into a warning state(TypeID=1700, ProtocolName=1700, Threshold=153600, Interval=300, Dropped-Packets=230735).

Line 48140: Oct 22 2019 10:28:38-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590257]:Slot=1;The CP CAR dropped packets is detected to slide into a warning state(TypeID=15, ProtocolName=ipv4Arp, Threshold=30000, Interval=600, Dropped-Packets=275775)
  2. Review the cpu-defend statistics. A large amount of ARP traffic is observed there as well.

  3. Review the equipment version. The version is V600R009C20SPC600. According to the release notes for that version, there is a known problem in the ARP processing logic, so the customer is advised to install the V600R009SPH071 patch and check whether ARP processing improves. After the patch is applied, however, the behavior remains the same.
  4. Based on the observed behavior, it is suspected that MIB collection and sending are running slowly and not keeping to their interval. The MIBs are collected and sent every 1 minute, but the collection takes longer than 1 minute, so each report accumulates more traffic than one interval's worth. For example, if the collection interval is 1 minute but the process takes 2 minutes, the traffic of two intervals is reported in a single interval. The first interval then shows high traffic and the next interval appears empty. The suspected cause of this behavior is high CPU usage.
  5. Review the CPU usage in slot 1.

The FECD process shows high CPU usage because many packets are being sent to the CPU of the main board. The FECD process is responsible for transferring packets between the LPU and the MPU.

  6. The interface is found to be configured with no ARP rate limit, so all ARP packets are sent to the CPU, which slows down the SNMP process that collects and sends the MIBs. The command “arp rate-limit 0” means that no limit is defined.

  7. It is recommended to set a limit to reduce the CPU usage. Several values were tested. With a rate limit of 300, CPU usage by the FECD process is reduced by half, and the graphs in the monitoring tool are then displayed consistently.
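
The effect suspected in the handling process (a 1-minute poll reading counters that are refreshed only every 2 minutes) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names, the 100 Mbit/s traffic rate and the counter model are hypothetical and are not taken from the NE40E or from the NMS in this case.

```python
# Hypothetical model: traffic is constant, the NMS polls every 60 s, but the
# device refreshes the interface counter only every 120 s (slow collection).

def seen_counters(true_rate_bps, poll_interval_s, refresh_interval_s, n_polls):
    """Return the byte counter value visible to the NMS at each poll."""
    bytes_per_s = true_rate_bps // 8
    values = []
    for k in range(n_polls):
        t = k * poll_interval_s
        # The counter only reflects traffic up to the last completed refresh.
        last_refresh = (t // refresh_interval_s) * refresh_interval_s
        values.append(bytes_per_s * last_refresh)
    return values


def reported_rates(counters, nominal_interval_s):
    """Compute rates (bit/s) as a poller that assumes a fixed interval would."""
    return [
        (curr - prev) * 8 / nominal_interval_s
        for prev, curr in zip(counters, counters[1:])
    ]


TRUE_RATE_BPS = 100_000_000  # assumed constant 100 Mbit/s, for illustration only

counters = seen_counters(TRUE_RATE_BPS, poll_interval_s=60,
                         refresh_interval_s=120, n_polls=5)
rates = reported_rates(counters, nominal_interval_s=60)
average = sum(rates) / len(rates)

for minute, rate in enumerate(rates, start=1):
    print(f"interval {minute}: {rate / 1e6:.0f} Mbit/s")
print(f"average: {average / 1e6:.0f} Mbit/s")
```

Under these assumptions the poller reports 0 and 200 Mbit/s in alternating intervals while the average still equals the real 100 Mbit/s, which matches the pattern of peaks and empty intervals seen in the monitoring tool.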

A very high number of ARP requests is being sent to the CPU of the NE40E. As a result, the SNMP process becomes slow, which causes traffic statistics to be reported abnormally to the NMS.
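
To see which protocol is stressing the CPU, the CP CAR log entries quoted in the handling process can be summarized with a short script. The following is a minimal sketch: the regular expression is derived only from the three log lines in this case, the function names are hypothetical, and other log types or software versions may format the fields differently.

```python
import re

# Field layout assumed from the CPCARALARMEDLOG entries quoted in this case.
CPCAR_RE = re.compile(
    r"CPCARALARMEDLOG.*?Slot=(?P<slot>\d+);"
    r".*?ProtocolName=(?P<protocol>[^,]+),\s*"
    r"Threshold=(?P<threshold>\d+),\s*"
    r"Interval=(?P<interval>\d+),\s*"
    r"Dropped-Packets=(?P<dropped>\d+)"
)

LOG_LINES = [
    "Oct 22, 2019 10:18:43-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590236]:Slot=1;"
    "The CP CAR dropped packets is detected to slide into a warning state"
    "(TypeID=183, ProtocolName=183, Threshold=30000, Interval=600, Dropped-Packets=179314).",
    "Oct 22, 2019 10:24:38-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590249]:Slot=1;"
    "The CP CAR dropped packets is detected to slide into a warning state"
    "(TypeID=1700, ProtocolName=1700, Threshold=153600, Interval=300, Dropped-Packets=230735).",
    "Oct 22 2019 10:28:38-03:00 NE40E %%01DEFEND/6/CPCARALARMEDLOG(l)[3590257]:Slot=1;"
    "The CP CAR dropped packets is detected to slide into a warning state"
    "(TypeID=15, ProtocolName=ipv4Arp, Threshold=30000, Interval=600, Dropped-Packets=275775)",
]


def parse_cpcar(line):
    """Extract the CP CAR fields from one log line, or return None if it does not match."""
    match = CPCAR_RE.search(line)
    if match is None:
        return None
    return {
        "slot": int(match["slot"]),
        "protocol": match["protocol"],
        "threshold": int(match["threshold"]),
        "interval": int(match["interval"]),
        "dropped": int(match["dropped"]),
    }


def rank_by_excess(lines):
    """Sort parsed entries by how far the drops exceed the alarm threshold."""
    entries = [entry for entry in map(parse_cpcar, lines) if entry is not None]
    return sorted(entries, key=lambda e: e["dropped"] / e["threshold"], reverse=True)


ranked = rank_by_excess(LOG_LINES)
for entry in ranked:
    ratio = entry["dropped"] / entry["threshold"]
    print(f"slot {entry['slot']} {entry['protocol']}: "
          f"{entry['dropped']} dropped, {ratio:.1f}x threshold")
```

With the three entries from this case, the ipv4Arp entry ranks first, which is consistent with ARP being the dominant traffic sent to the CPU.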

Solution

Set a limit on the number of ARP packets sent to the CPU. Test different values until the CPU usage is reduced without affecting end-user services. In this case, a rate limit of 300 packets per second was set.

To set the limit, you only need to run the following command on the interface:

arp rate-limit 300

The default ARP rate limit is 20 pps. To restore the default value, you only need to run the following command on the interface:

undo arp rate-limit
