Fault Identification and Proactive Prediction

In traditional network O&M, problems can only be solved after they occur. If faults can be predicted, potential problems can be detected in advance. This enables the implementation of necessary measures, such as network hardening or rectification, to mitigate the impact on services.


FIGURE 10.9 Fault baseline and exception detection.

Based on big data analysis, intelligent O&M can predict certain network faults and issue warnings in advance. Not every failure indicates a fault: for example, users may experience network access failures that are not caused by network faults. As illustrated in Figure 10.9, the SDN controller generates a baseline by training on historical big data. Failures and exceptions that fall within this baseline are attributed to terminal behavior rather than to network faults. The system automatically flags exceptions only when they fall outside the baseline range. It then promptly identifies the fault pattern and root cause and notifies O&M personnel, so that the fault can be handled before users notice it.
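The baseline idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controller's actual algorithm: the baseline is assumed to be a simple band of mean plus or minus k standard deviations over historical hourly failure counts, and the names `build_baseline`, `is_exception`, and the sample data are hypothetical.

```python
from statistics import mean, stdev

def build_baseline(history, k=3.0):
    """Derive a (low, high) baseline range from historical failure counts.

    Assumption: the band is mean +/- k sample standard deviations.
    A real controller would likely use a richer, time-aware model.
    """
    mu = mean(history)
    sigma = stdev(history)
    return max(0.0, mu - k * sigma), mu + k * sigma

def is_exception(value, baseline):
    """Flag a value only when it falls outside the baseline range."""
    low, high = baseline
    return value < low or value > high

# Hypothetical history: access failures per hour on ordinary days.
history = [4, 6, 5, 7, 5, 6, 4, 5, 6, 5]
baseline = build_baseline(history)

for count in (6, 8, 40):
    status = "exception" if is_exception(count, baseline) else "within baseline"
    print(f"{count} failures/hour: {status}")
```

With this sample history, a handful of failures per hour stays within the baseline and raises no alarm, while a sudden jump to 40 failures per hour is flagged for pattern and root cause analysis.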

Fault Locating and Root Cause Analysis

Network O&M personnel are responsible for maintaining networks. When a fault occurs, they need to quickly identify the cause, rectify the fault, and minimize the impact on services. Traditional fault locating is difficult because it relies heavily on manual analysis of massive data and on personal experience. In intelligent O&M, the SDN controller can use protocol tracing to graphically display the packet exchange process at the time of the fault, enabling O&M personnel to locate the fault quickly.

For example, if a user encounters network access difficulties or failures, the protocol tracing function of the SDN controller can visualize the entire user access process across its three phases (association, authentication, and DHCP). By analyzing the result and duration of each protocol interaction phase, the SDN controller can then quickly determine where the user access error occurs and locate the fault precisely, as illustrated in Figure 10.10.
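The per-phase analysis described above might look like the following minimal sketch. It assumes a trace is available as a list of per-phase records with a result and a duration; the `PhaseResult` type, the `locate_fault` function, and the duration thresholds are all hypothetical and do not reflect any real controller API.

```python
from dataclasses import dataclass

# The three user access phases named in the text, in order.
PHASES = ("association", "authentication", "dhcp")

# Hypothetical per-phase duration limits in milliseconds.
SLOW_THRESHOLD_MS = {
    "association": 100.0,
    "authentication": 1000.0,
    "dhcp": 2000.0,
}

@dataclass
class PhaseResult:
    phase: str
    success: bool
    duration_ms: float

def locate_fault(trace):
    """Return the first phase that is missing, failed, or slow, else 'ok'."""
    by_phase = {r.phase: r for r in trace}
    for phase in PHASES:
        result = by_phase.get(phase)
        if result is None:
            return f"{phase}: not reached"
        if not result.success:
            return f"{phase}: failed"
        if result.duration_ms > SLOW_THRESHOLD_MS[phase]:
            return f"{phase}: slow"
    return "ok"

# Hypothetical trace: the user associates and authenticates,
# but never obtains an IP address.
trace = [
    PhaseResult("association", True, 12.0),
    PhaseResult("authentication", True, 340.0),
    PhaseResult("dhcp", False, 5000.0),
]
print(locate_fault(trace))
```

Checking the phases in order matters: a failure in an early phase makes later phases unreachable, so the first abnormal phase is the most useful one to report.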

Network O&M Mode Transformed by Intelligent O&M

With the powerful SDN controller and mobile apps, imagine how convenient it will be for network administrators to manage networks in the near future.


FIGURE 10.10 Protocol tracing.

Suppose you are a network administrator. On Wednesday, you open the controller dashboard to check for normal indicators and faults. You can perform these tasks on your mobile app if you are, for example, in a meeting.

At 10:00 a.m., the platform pushes a new device version. You review its changes and discover that they solve a previous issue. You therefore schedule the upgrade for 10:00 p.m. on Friday, because there are few network users at night and on weekends.

At 3:00 p.m., the platform indicates that a large number of users failed to access the network, and you receive an SMS notification at the same time. You click the link to open the details page and discover that DHCP addresses could not be obtained. You then log in to the DHCP server and realize that the same problem has occurred before. The new server version has not been released, but a solution has been provided. You restore the server by following the instructions in the solution. After the operation is complete, you receive a call from a colleague, explain the situation, and tell them that the fault has been rectified and the network will recover soon. You then send a group message through the internal communication platform to notify colleagues of the situation.

At 6:00 p.m. on Friday, you leave work on time. At 10:00 p.m., your mobile app sends a message to notify you that the upgrade has started. At 10:30 p.m., the app sends a message to notify you of upgrade completion and sends network O&M reports both before and after the upgrade. The reports indicate that the APs are working properly.

On Sunday morning, the construction team starts to reconstruct the equipment room as scheduled. At 11:00 a.m., you receive a major alarm indicating that 50 APs went offline because the switch interfaces connected to the APs are abnormal. You contact the construction personnel to check the equipment room status and realize that the power supply of the switch was accidentally turned off during construction. One minute after the power supply is restored, the network recovers.

 