# IoT Data Analytical Technologies

Data analytics (DA) is the process of examining large and small datasets with varying data properties to extract meaningful conclusions and actionable insights. These conclusions usually take the form of trends, patterns, and statistics that help organizations use their data proactively to make effective decisions.

## Approaches and Methods for Data Analytics in IoT

The numbered items are approaches and the underlying methods are highlighted beneath each.

i. *Applied Statistics:* In this approach, data is collected and analyzed through sampling to generalize metrics about a population. It uses the following methods for data analysis:

*Sigma Analysis*: This is a simple but powerful way to use statistics for detecting outliers in real time. Once the average value of a measurement has been characterized, it is often helpful to characterize its variance as well. Knowing the variance makes it possible to check each new observation in real time and determine how many standard deviations (sigmas) it lies from the mean.
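As a minimal sketch of sigma analysis on a stream of readings, the following keeps a running mean and variance using Welford's incremental algorithm and flags any reading that lies more than a chosen number of sigmas from the mean. The names `RunningSigma` and `detect_outliers`, the threshold of 3 sigmas, and the warm-up length are illustrative assumptions, not values taken from the text.

```python
import math


class RunningSigma:
    """Tracks mean and variance incrementally (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; reported as 0.0 until two points exist.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

    def sigmas(self, x):
        """Number of standard deviations x lies from the running mean."""
        s = self.std()
        return (x - self.mean) / s if s > 0 else 0.0


def detect_outliers(readings, threshold=3.0, warmup=5):
    """Return (index, value) pairs that lie beyond `threshold` sigmas."""
    tracker = RunningSigma()
    flagged = []
    for i, x in enumerate(readings):
        if tracker.n >= warmup and abs(tracker.sigmas(x)) > threshold:
            flagged.append((i, x))  # flagged readings do not update the stats
        else:
            tracker.update(x)
    return flagged


# Hypothetical temperature readings with one obvious spike.
readings = [20.0, 20.1, 19.9, 20.2, 19.8, 20.0, 35.0, 20.1]
print(detect_outliers(readings))
```

Excluding flagged readings from the running statistics is a design choice: it keeps a single spike from inflating the variance and masking later outliers.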

*Statistical Hypothesis Testing*: This is a method for testing whether an observation of a designed random variable is statistically significant, that is, unlikely to have occurred by chance alone. It is a powerful way to determine whether a measured value is likely to be meaningful for making a business decision.

*Analysis of Variance*: This method determines whether differences exist between the means of different groups. It is similar in application to statistical hypothesis testing, but is useful for comparing multiple groups for statistical significance.
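To make the analysis of variance idea concrete, here is a small sketch that computes the one-way ANOVA F statistic by hand: the ratio of between-group variance to within-group variance. The function name `one_way_anova_f` and the sample groups are assumptions for illustration; turning F into a p-value requires the F distribution, which this sketch leaves out.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups.

    Assumes at least two groups, more observations than groups, and some
    variation inside the groups (otherwise the denominator is zero).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(x for g in groups for x in g) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of each observation around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical readings from three sensor groups.
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```

A large F suggests that at least one group mean differs from the others by more than the within-group scatter would explain.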

*Lag Variogram*: This method determines the periodicity of a process, which is useful in characterizing processes of unknown period or duration.
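One way to sketch a lag variogram is the empirical semivariance: for each lag h, half the mean squared difference between observations h steps apart. For a periodic process the semivariance dips toward zero when the lag equals the period. The function names and the synthetic sine series below are assumptions; the period estimate simply picks the lag with the smallest semivariance, so it assumes `max_lag` is less than twice the true period.

```python
import math


def lag_variogram(series, max_lag):
    """Return {lag: semivariance} for lags 1..max_lag (max_lag < len(series))."""
    result = {}
    for h in range(1, max_lag + 1):
        sq_diffs = [(series[i + h] - series[i]) ** 2
                    for i in range(len(series) - h)]
        result[h] = sum(sq_diffs) / (2 * len(sq_diffs))
    return result


def estimate_period(series, max_lag):
    """Lag with the smallest semivariance, taken as the period estimate."""
    gamma = lag_variogram(series, max_lag)
    return min(gamma, key=gamma.get)


# Synthetic signal that repeats every 12 samples.
signal = [math.sin(2 * math.pi * t / 12) for t in range(120)]
print(estimate_period(signal, max_lag=20))
```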

ii. *Probability Theory:* This approach involves the analysis of random processes related to a population to characterize likely or expected observations. It uses the following methods for data analysis:

*Markov Chain Modeling*: This method characterizes transition states in a process where the future state depends only on the current state, and is powerful when expected transitions involve a finite number of states.
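The Markov property described above can be sketched by estimating transition probabilities from an observed sequence of states and then propagating a state distribution one step at a time. The state names and the helper functions `fit_transitions` and `step` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_transitions(sequence):
    """Estimate P(next state | current state) from consecutive pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {target: c / sum(row.values()) for target, c in row.items()}
        for state, row in counts.items()
    }


def step(distribution, transitions):
    """Advance a {state: probability} distribution by one transition."""
    nxt = defaultdict(float)
    for state, p in distribution.items():
        for target, q in transitions.get(state, {}).items():
            nxt[target] += p * q
    return dict(nxt)


# Hypothetical device states logged over time.
log = ["idle", "active", "active", "idle", "active", "fault", "idle"]
P = fit_transitions(log)

# Starting from "idle", where is the device likely to be two steps later?
two_steps = step(step({"idle": 1.0}, P), P)
print(two_steps)
```

Because the next state depends only on the current one, repeated calls to `step` are enough to project the distribution any number of steps ahead.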

*Decision Tree Modeling*: This popular structure uses a branching graph to model all possible consequences of a process, with an associated probability on each branch. It is useful for visualizing and characterizing downstream probabilities at the leaves of the tree.
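The downstream probability at a leaf is the product of the branch probabilities along the path to it. The sketch below walks a small tree and collects those products. The nested-dictionary layout, the function name `leaf_probabilities`, and the example outcomes are all assumptions made for illustration.

```python
def leaf_probabilities(tree, prob=1.0, path=()):
    """Return {path-to-leaf: probability of reaching that leaf}.

    `tree` maps an outcome label to (branch probability, subtree), where
    the subtree is None at a leaf.
    """
    leaves = {}
    for label, (p, subtree) in tree.items():
        new_path = path + (label,)
        if subtree is None:
            leaves[new_path] = prob * p
        else:
            leaves.update(leaf_probabilities(subtree, prob * p, new_path))
    return leaves


# Hypothetical process: a sensor either works or faults; a fault either
# clears by itself or needs a service visit.
tree = {
    "sensor_ok": (0.9, None),
    "sensor_fault": (0.1, {
        "auto_recover": (0.7, None),
        "needs_service": (0.3, None),
    }),
}

for path, p in leaf_probabilities(tree).items():
    print(" -> ".join(path), round(p, 4))
```

If the branch probabilities at every node sum to 1, the leaf probabilities sum to 1 as well, which is a quick consistency check on the model.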

iii. *Unsupervised Machine Learning:* This approach includes algorithms that find hidden patterns or structures in large, unlabeled datasets using clustering and other statistically “heavy” methods.

*Clustering:* This method discovers patterns in data by forming clusters whose elements are more similar to one another than to any of the elements in other clusters.

*Data Mining:* This is an automated process for identifying anomalies or hidden rules in data based purely on statistics. Data mining approaches typically place little reliance on theory or subject matter expertise. Data mining can be useful for developing hypotheses, but may be dangerous as a holistic solution.
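As one concrete instance of the clustering method described above, the following is a minimal k-means sketch: points are repeatedly assigned to their nearest centroid, and each centroid is then moved to the mean of its members. k-means is only one of many clustering algorithms and is chosen here as an assumption; the starting centroids are passed in explicitly so that the result is deterministic.

```python
import math


def kmeans(points, initial_centroids, max_iterations=20):
    """Plain k-means on coordinate tuples; returns (centroids, clusters)."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, clusters


# Hypothetical two-dimensional sensor readings forming two groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (8.0, 8.0), (8.2, 7.8), (7.8, 8.2)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```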

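*Random Forest Modeling:* This method is an ensemble variant of decision tree modeling in which many trees are constructed, each from a random sample of the data and a random subset of attributes. Rather than selecting a single optimal tree, the model output combines the predictions of all the trees, typically by majority vote when predicting classes. Because the trees are trained on labeled classes, random forests are usually treated as a supervised method.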

iv. *Supervised Machine Learning:* This approach uses algorithms that learn from labeled examples, programmatically capturing the hidden preferences and rules behind human decision-making and reasoning so that they can be applied to new data.

*Classification:* This method identifies which class an element belongs to given a training set of classes based on attributes of that element and comparison to other elements in each class.
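A simple illustration of classification by comparison with labeled elements is the k-nearest-neighbors rule: a new element takes the majority class of the k training elements closest to it. The choice of k-nearest neighbors, the function name `knn_classify`, and the temperature and humidity readings are assumptions for this sketch.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Predict the majority label among the k training points nearest to query.

    `training` is a list of (attribute tuple, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (temperature, humidity) readings with known classes.
training = [
    ((20.0, 30.0), "normal"),
    ((21.0, 32.0), "normal"),
    ((19.5, 29.0), "normal"),
    ((35.0, 80.0), "overheat"),
    ((36.0, 82.0), "overheat"),
    ((34.0, 79.0), "overheat"),
]

print(knn_classify(training, (20.5, 31.0)))
print(knn_classify(training, (35.5, 81.0)))
```

In practice the attributes would usually be scaled first so that no single attribute dominates the distance.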

*Predictive Coding:* This method trains an algorithm to recognize which attributes of an event or data element matter most, based on a human reviewer indicating which elements from random subsets are the most meaningful.

*Reinforced Learning:* This is a hybrid machine learning method where a training set is identified by an unsupervised algorithm. The training set is supervised by a predictive coding process where a human reinforces or discourages learning and refinement.

v. *Natural Language Processing (NLP):* The NLP approach adds structure, computation, and quantities to traditional language to create analytic opportunity.

*Term Frequency/Document Frequency Matrix:* This method characterizes how anomalous a document is based on the ratios of words used in that document to the ratios of words used in all documents throughout a corpus.

*Sentiment Analysis:* This includes methods that determine the sentiment of written text based on the words used and the structure of the speech.

*Topic Tagging:* This method includes algorithms that determine the topic of a document based on the associations of words used and comparisons to “word bags” of interest.
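The term frequency/document frequency idea above is commonly implemented as TF-IDF weighting, which is assumed here: a term's frequency within a document is multiplied by the logarithm of the inverse of the fraction of documents containing it. Terms that appear everywhere score zero, while terms concentrated in a few documents score high. The whitespace tokenizer and the sample maintenance notes are simplifications for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical maintenance notes; the third one uses unusual terms.
notes = [
    "pump pressure normal",
    "pump pressure normal",
    "pump vibration alarm",
]
for row in tf_idf(notes):
    print(row)
```

A document whose highest-weighted terms are rare across the corpus, such as the third note here, is a candidate for being flagged as anomalous.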

vi. *Network Analysis:* This approach analyzes the structure of a network graph to determine the relationships between nodes and edges.

*Network Descriptive Statistics:* This method calculates descriptive measures to characterize network position and examine the change and evolution of the position over time.
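Two common descriptive measures can be sketched directly from an edge list: degree centrality (the share of other nodes a node is connected to) and network density (the share of possible edges that exist). The choice of these two measures, the function names, and the gateway and sensor node names are assumptions; the sketch treats the graph as undirected, without self-loops, and with at least two nodes.

```python
def degree_centrality(edges):
    """Return {node: fraction of the other nodes it is directly linked to}."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def density(edges):
    """Fraction of all possible undirected edges that are present."""
    nodes = {node for edge in edges for node in edge}
    unique_edges = {frozenset(edge) for edge in edges}
    n = len(nodes)
    return 2 * len(unique_edges) / (n * (n - 1))


# Hypothetical network: three sensors attached to one gateway.
edges = [("gateway", "s1"), ("gateway", "s2"),
         ("gateway", "s3"), ("s1", "s2")]
print(degree_centrality(edges))
print(density(edges))
```

Recomputing these measures on snapshots of the edge list taken at different times shows how a node's position changes as the network evolves.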

vii. *Geospatial Statistics:* This approach provides analysis of data that has geographical or spatial relevance.

*Kernel Density:* This method graphically measures the point density of multiple observations in multiple dimensions. Kernel density can be extended to linear density and other creative variants; however, it is ultimately useful for “hot spot” characterization in two and three dimensions.
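A minimal two-dimensional kernel density sketch places a Gaussian "bump" on every observation and sums the bumps at the location of interest; high values mark hot spots. The Gaussian kernel, the single shared bandwidth, and the sample coordinates are assumptions made for illustration.

```python
import math


def kernel_density(points, query, bandwidth=1.0):
    """Gaussian kernel density estimate at `query` from 2-D `points`."""
    h2 = bandwidth ** 2
    # Normalization for an isotropic 2-D Gaussian, averaged over all points.
    norm = 1.0 / (2 * math.pi * h2 * len(points))
    total = sum(
        math.exp(-((query[0] - x) ** 2 + (query[1] - y) ** 2) / (2 * h2))
        for x, y in points
    )
    return norm * total


# Hypothetical event locations: a tight group near the origin, one far away.
events = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.1, -0.5), (10.0, 10.0)]
print(kernel_density(events, (0.0, 0.0)))
print(kernel_density(events, (5.0, 5.0)))
```

Evaluating the estimate over a grid of query locations yields the surface from which a hot spot map is typically drawn; the bandwidth controls how smooth that surface is.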

*Local Outlier Factor:* This method determines how likely an observation is given the proximity to its nearest neighbors, and is a powerful method to look for outlier observations in dense data that is very regularly measured. Although nearest neighbors can be considered spatially, they can also be extended to temporal proximity, scalar proximity, etc.
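The local outlier factor can be sketched in a few steps: find each point's k nearest neighbors, estimate a local density from the reachability distances to them, and compare that density with the densities of the neighbors. Scores near 1 indicate a point about as dense as its surroundings, and scores well above 1 indicate an outlier. This is a simplified sketch that assumes no duplicate points and ignores ties in neighbor distance; the value of k and the sample points are illustrative.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point (requires len(points) > k)."""
    n = len(points)
    # k nearest neighbors of each point as (distance, index), nearest first.
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        neighbors.append(dists[:k])
    # Distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    def local_reachability_density(i):
        # Reachability distance to neighbor j: never less than j's k-distance.
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Four points on a unit square plus one distant observation.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
for point, score in zip(points, local_outlier_factor(points, k=2)):
    print(point, round(score, 2))
```

The same code applies to temporal or scalar proximity by changing what the coordinates represent, for example using timestamps as a one-element tuple.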