Big Data Analytics Process

The big data analytics platform is the core of the security collaboration solution. Using the Huawei Cybersecurity Intelligence System (CIS) as an example, Figure 8.7 shows the big data analytics process during security collaboration.

1. Data collection

Data collection includes both log collection and traffic collection, which are implemented by the log collector and the flow probe, respectively.

The log collection process includes log receiving, categorization, formatting, and forwarding, while the traffic collection process includes traffic collection, protocol resolution, file restoration, and traffic metadata reporting.

FIGURE 8.7 Big data analytics process during security collaboration.
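The log-collection steps described above (receiving, categorization, formatting, and forwarding) can be sketched as a simple pipeline. The field names and device types here are illustrative assumptions, not CIS's actual schema.

```python
# Minimal sketch of a log collector pipeline (hypothetical field names):
# receive -> categorize -> format (normalize) -> forward.
import json

def categorize(raw_log: dict) -> str:
    # Assumed categorization by the reporting device type.
    return raw_log.get("device_type", "unknown")

def normalize(raw_log: dict) -> dict:
    # Map heterogeneous vendor fields onto one normalized schema.
    return {
        "category": categorize(raw_log),
        "timestamp": raw_log.get("ts") or raw_log.get("time"),
        "src_ip": raw_log.get("src") or raw_log.get("source_ip"),
        "message": raw_log.get("msg", ""),
    }

def forward(normalized: dict) -> str:
    # Serialize the normalized log for forwarding to the analytics platform.
    return json.dumps(normalized, sort_keys=True)

raw = {"device_type": "firewall", "ts": "2021-05-01T12:00:00Z",
       "src": "10.0.0.8", "msg": "connection denied"}
print(forward(normalize(raw)))
```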

2. Big data processing

Big data processing includes data preprocessing, distributed storage, and distributed indexing. Data preprocessing formats the normalized logs reported by the collector and the traffic metadata reported by the flow probe, supplements related context information (including users, geographical locations, and areas), and releases the formatted data to the distributed bus. Distributed storage stores the formatted data, classifying heterogeneous data of different types, such as normalized logs, traffic metadata, and packet capture (PCAP) files, for storage. The stored data are used mainly for threat detection and visualization. Distributed indexing creates indexes for key formatted data, providing keyword-based quick search services for visualized investigation and analysis.
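The preprocessing step above, supplementing context and releasing the result to the distributed bus, might look like the following sketch. The lookup tables and the in-memory list standing in for the bus are illustrative assumptions.

```python
# Sketch of data preprocessing: enrich a normalized log with context
# (user, geographical location) before publishing it to a distributed
# bus. The lookup tables are stand-ins for real context sources.
USER_BY_IP = {"10.0.0.8": "alice"}
GEO_BY_IP = {"10.0.0.8": "HQ-Floor3"}

def enrich(record: dict) -> dict:
    # Supplement user and location context keyed by source IP address.
    ip = record.get("src_ip")
    record["user"] = USER_BY_IP.get(ip, "unknown")
    record["location"] = GEO_BY_IP.get(ip, "unknown")
    return record

bus = []  # stand-in for the distributed messaging bus

def publish(record: dict) -> None:
    bus.append(record)

publish(enrich({"src_ip": "10.0.0.8", "event": "login"}))
```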

3. Threat detection

The analyzer performs multidimensional threat analysis on the collected and processed big data in order to identify threats.

4. Threat display

The analyzer displays the results of threat identification on the Graphical User Interface (GUI), enabling users to intuitively understand the entire network’s security situation. However, some security threats still require manual analysis and identification.

5. Threat interworking

The analyzer generates an interworking policy based on the results of suspicious-threat analysis and delivers it to all network elements (NEs) on the network. This policy contains precise control instructions, enabling the NEs to block any suspicious threats.
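A "precise control instruction" in such a policy could take a shape like the following sketch. The field names and action set are hypothetical, not the actual CIS policy format.

```python
# Hypothetical sketch: turn one suspicious-analysis result into a
# precise control instruction for delivery to the NEs.
def build_policy(threat: dict) -> dict:
    # Block traffic between the compromised host and the suspicious peer.
    return {
        "action": "block",
        "match": {"src_ip": threat["host"], "dst_ip": threat["peer_ip"]},
        "reason": threat["type"],
    }

policy = build_policy({"host": "10.0.0.8",
                       "peer_ip": "203.0.113.9",
                       "type": "C&C communication"})
```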

Principles of Big Data Analytics

1. Mail anomaly detection

Mail anomaly detection extracts mail traffic metadata from historical data. It analyzes Simple Mail Transfer Protocol (SMTP), Post Office Protocol Version 3 (POP3), and Internet Message Access Protocol (IMAP) information such as the recipient, sender, mail server, mail body, and mail attachment. It then detects mail anomalies in offline mode, such as sender and recipient anomalies, malicious mail downloads, mail server access anomalies, and mail body URL anomalies, based on sandbox file inspection results.
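Two of the checks above (a sender anomaly and a malicious attachment confirmed by the sandbox) can be sketched as below. The metadata field names and the sandbox verdict table are illustrative assumptions.

```python
# Sketch of offline mail-anomaly checks on SMTP/POP3/IMAP metadata.
# The verdict set and field names are illustrative, not CIS's schema.
SANDBOX_MALICIOUS = {"d41d8cd98f00b204e9800998ecf8427e"}

def mail_anomalies(meta: dict) -> list:
    findings = []
    # Sender anomaly: sender's domain differs from the claimed domain.
    if meta["sender"].split("@")[-1] != meta["claimed_domain"]:
        findings.append("sender anomaly")
    # Malicious attachment: MD5 matches a sandbox-flagged file.
    if meta.get("attachment_md5") in SANDBOX_MALICIOUS:
        findings.append("malicious attachment")
    return findings
```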

2. Web anomaly detection

Web anomaly detection recognizes web penetration and abnormal communication. It extracts HTTP traffic metadata from historical data and analyzes HTTP fields, including the URL, User-Agent, Referer, and Message-Digest Algorithm 5 (MD5) values of uploaded and downloaded files, in order to detect anomalies in offline mode, such as malicious files, access to unusual websites, and non-browser traffic, based on sandbox file inspection results.
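The non-browser-traffic and malicious-file checks might be sketched as follows. The browser token list and the bad-hash table are illustrative assumptions.

```python
# Sketch: flag non-browser HTTP traffic via User-Agent, and known-bad
# downloads via file MD5, from HTTP metadata. Lists are illustrative.
BROWSER_TOKENS = ("Mozilla", "Chrome", "Safari", "Edge")
BAD_MD5 = {"9e107d9d372bb6826bd81d3542a419d6"}

def web_anomalies(meta: dict) -> list:
    findings = []
    ua = meta.get("user_agent", "")
    # Non-browser clients (scripts, implants) rarely send browser tokens.
    if not any(tok in ua for tok in BROWSER_TOKENS):
        findings.append("non-browser traffic")
    # MD5 of a transferred file matches a sandbox-confirmed bad file.
    if meta.get("file_md5") in BAD_MD5:
        findings.append("malicious file")
    return findings
```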

3. C&C anomaly detection

C&C anomaly detection analyzes DNS, HTTP, and Layer 3/Layer 4 protocol traffic to detect C&C communication anomalies. DNS traffic-based C&C anomaly detection adopts a machine learning method: it trains on sample data to generate a classifier model, and then uses that model to identify communications that access Domain Generation Algorithm (DGA) domain names in the customer network, thereby discovering zombie hosts or abnormal APT behavior in the C&C phase. C&C anomaly detection based on Layer 3/Layer 4 protocol traffic analyzes the characteristics of information flows between C&C Trojan horses and external devices, differentiates C&C communication flows from normal flows, and performs traffic detection to discover C&C communication flows in the network. HTTP traffic-based C&C anomaly detection uses statistical analysis: it records each time an intranet host accesses the same destination IP address and domain name, calculates the interval between connections, and periodically checks for changes that might reveal abnormal external connections from the intranet host.
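The statistical HTTP check, recording connection times per destination and looking at the regularity of the intervals, can be sketched as a beaconing detector. The coefficient-of-variation test and its threshold are illustrative assumptions, not the method CIS actually uses.

```python
# Sketch of the statistical HTTP C&C check: near-constant intervals
# between connections to the same destination suggest automated
# beaconing rather than human browsing. Threshold is illustrative.
from statistics import mean, pstdev

def is_beaconing(timestamps, max_jitter=0.1):
    # Need several connections before regularity is meaningful.
    if len(timestamps) < 4:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Coefficient of variation: low means highly regular callbacks.
    return pstdev(gaps) / mean(gaps) < max_jitter
```

For example, a host calling back every 60 seconds yields near-zero variation and is flagged, while ordinary browsing produces irregular gaps and is not.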

4. Covert tunnel anomaly detection

Covert tunnel anomaly detection identifies the transmission of unauthorized data by compromised hosts using normal protocols and tunnels. The detection methods include Ping Tunnel detection, DNS Tunnel detection, and file anti-evasion detection. Ping Tunnel detection analyzes and compares Internet Control Message Protocol (ICMP) payloads transmitted between a pair of source and destination IP addresses within a certain time window to detect abnormal Ping Tunnel communications. DNS Tunnel detection checks the validity of domain names in DNS packets between a pair of source and destination IP addresses within a certain time window, and analyzes the DNS request and response frequency to detect abnormal DNS Tunnel communications. File anti-evasion detection analyzes and compares file types in traffic metadata to detect inconsistencies between file types and file name extensions.
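A domain-name validity check of the kind DNS Tunnel detection performs might look like the sketch below: tunneled data tends to produce unusually long, high-entropy labels. The length and entropy thresholds are illustrative assumptions.

```python
# Sketch of a DNS Tunnel validity check: data smuggled in DNS queries
# tends to appear as very long, high-entropy leftmost labels.
import math
from collections import Counter

def label_entropy(label: str) -> float:
    # Shannon entropy (bits per character) of the label.
    counts = Counter(label)
    n = len(label)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_like_tunnel(qname: str, max_len=40, max_entropy=4.0) -> bool:
    # Inspect the leftmost label, where tunnel payloads are encoded.
    first = qname.split(".")[0]
    return len(first) > max_len or label_entropy(first) > max_entropy
```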

5. Traffic baseline anomaly detection

Traffic baseline anomaly detection identifies abnormal access between intranet hosts or regions (between intranet and extranet regions, between an intranet region and the Internet, between intranet hosts, between an intranet host and the Internet, and between an intranet host and a region). The traffic baseline is a rule for access between intranet hosts, between regions, or between the intranet and external network. It specifies whether access is allowed within a given time range and, if allowed, the access frequency range and the traffic volume range.

The traffic baseline can be obtained through system auto-learning or defined by users. System auto-learning refers to the system automatically collecting the access and traffic information between intranet hosts, between regions, and between the intranet and external network within a time period (for example, one month) and generating a traffic baseline from the information gathered (an appropriate floating range is automatically set for the traffic data). A user-defined traffic baseline refers to a user manually configuring the access and traffic rules between intranet hosts, between regions, and between the intranet and external network. Traffic baseline anomaly detection loads the auto-learned and user-defined traffic baselines to memory, collects statistics on and analyzes traffic data in online mode, and exports anomaly events once inconsistencies have been detected between network behaviors and the traffic baselines.
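Auto-learning a baseline with a floating range, then checking live traffic against it, can be sketched as follows. The 20% floating margin is an illustrative assumption; the text says only that an appropriate range is set automatically.

```python
# Sketch of baseline auto-learning with a floating range: learn the
# observed traffic volume (e.g., bytes per hour) between two regions
# over a learning period, then flag departures from the learned range.
def learn_baseline(samples, margin=0.2):
    # Widen the observed min/max by an assumed 20% floating margin.
    lo, hi = min(samples), max(samples)
    return (lo * (1 - margin), hi * (1 + margin))

def is_anomalous(value, baseline):
    lo, hi = baseline
    return not (lo <= value <= hi)

# One month of hourly volumes would be learned the same way; four
# samples keep the example short.
baseline = learn_baseline([100, 120, 110, 130])
```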

6. Event correlation analysis

Event correlation analysis determines the correlation and time sequence relationship between events, in order to detect effective attacks. Event correlation analysis uses a high-performance traffic computing engine, which obtains normalized logs directly from the distributed messaging bus, stores them in memory, and analyzes the logs based on correlation analysis rules. Some correlation analysis rules are preset in the system, but users can also customize their own specific correlation analysis rules. If multiple logs match the same correlation analysis rule, the system considers these logs to be correlated, exports an anomaly event, and records the original logs in the event.
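An in-memory correlation rule of the kind described, multiple logs from the same source matching one rule within a time window, might be sketched as below. The rule itself (brute-force alarm followed by a successful login) and the field names are illustrative assumptions.

```python
# Sketch of one correlation rule over normalized logs held in memory:
# a brute-force alarm and a successful login from the same source
# within a window are correlated into a single anomaly event.
def correlate(logs, window=300):
    events = []
    by_src = {}
    for log in sorted(logs, key=lambda l: l["t"]):
        by_src.setdefault(log["src"], []).append(log)
    for src, seq in by_src.items():
        types = {l["type"] for l in seq}
        if {"brute_force", "login_success"} <= types and \
           seq[-1]["t"] - seq[0]["t"] <= window:
            # Export the anomaly event and record the original logs.
            events.append({"src": src,
                           "rule": "brute-force-then-login",
                           "logs": seq})
    return events
```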

7. Advanced threat determination

Advanced threat determination correlates, evaluates, and determines anomalies in order to generate advanced threat characteristics, providing data for threat monitoring and attack chain visualization. Specifically, it identifies and classifies anomalies based on attack chain stages, and establishes the time sequence and correlation relationships of anomalies through host IP addresses, file MD5 values, and URLs based on the time each anomaly occurred. It then determines whether advanced threats exist based on the predefined behavior determination mode, provides scores and evaluation results based on the severity, impact scope, and credibility of the associated anomalies, and finally generates threat events.
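Scoring correlated anomalies by severity, impact scope, and credibility could be sketched as a weighted combination. The weights, the 0-100 scale, and the threshold are illustrative assumptions; the text does not specify the actual scoring formula.

```python
# Sketch of threat scoring from correlated anomalies: a weighted
# combination of severity, impact scope, and credibility, each assumed
# to be on a 0-100 scale. Weights and threshold are illustrative.
def score(anomalies, weights=(0.5, 0.3, 0.2)):
    ws, wi, wc = weights
    # Score the threat by its worst associated anomaly.
    return max(ws * a["severity"] + wi * a["scope"] + wc * a["credibility"]
               for a in anomalies)

def is_advanced_threat(anomalies, threshold=70):
    # A threat event is generated when the score crosses the threshold.
    return score(anomalies) >= threshold
```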
