IoT Big Data Management

Data can be defined as a systematic record of the values observed for a particular activity, with the different values represented together in a set. Data is a collection of facts and figures used for a specific purpose such as a survey or analysis [3]. When data is arranged in an organized form, it can be called information. Big data refers to large, complex datasets that must be processed and analyzed to uncover valuable information that can benefit businesses and organizations. The term big data refers to:

■ A massive amount of data that grows exponentially with time.

■ Voluminous data that cannot be processed or analyzed using conventional data processing techniques.

■ Data whose handling draws on data mining, data storage, data analysis, data sharing, and data visualization concepts.

Based on its structure, big data can be classified into the following types:

i. Structured: Structured data refers to data that can be processed, stored, and retrieved in a fixed format. It is highly organized information that can be readily stored in and retrieved from a database by simple search algorithms. For instance, an employee table in a company database organizes employee details, job positions, salaries, and so on into fixed fields.

ii. Unstructured: Unstructured data refers to data that lacks any specific form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data.

iii. Semi-structured: Semi-structured data combines both of the forms mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
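As a rough illustration of the three categories, the Python sketch below holds similar employee information in each form. All names, field labels, and sample values are hypothetical and chosen only for this example; it is meant to show how access differs across the three forms, not how any particular database stores them.

```python
import json
import re

# Structured: every record follows the same fixed set of columns.
COLUMNS = ("employee_id", "name", "position", "salary")
structured_rows = [
    (1, "A. Rivera", "Engineer", 85000),
    (2, "B. Chen", "Analyst", 70000),
]

# Semi-structured: tags label each element, but records need not
# share the same fields.
semi_structured = json.loads("""
[
  {"name": "A. Rivera", "position": "Engineer", "skills": ["C", "Python"]},
  {"name": "B. Chen", "office": {"city": "Lyon"}}
]
""")

# Unstructured: free text with no schema and no tags.
unstructured = "Hi team, B. Chen moves to the Analyst role next month."


def column(rows, name):
    """Read one column from fixed-format rows by its position."""
    idx = COLUMNS.index(name)
    return [row[idx] for row in rows]


# Fixed format: a column lookup always works.
salaries = column(structured_rows, "salary")
# Tagged format: a field may be absent from some records.
positions = [record.get("position") for record in semi_structured]
# Free text: content has to be searched or parsed.
mentions_role = re.search(r"\bAnalyst\b", unstructured) is not None
```

The structured lookup never has to check whether a field exists, the semi-structured lookup must tolerate missing tags, and the unstructured text can only be searched.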

Data management refers to the management of information and data for secure and structured access and storage. It embodies the creation of knowledge management policies, analysis, and architecture; database management system (DBMS) integration; information or data security; and information supply, identification, segregation, and storage [4]. Data management encompasses various techniques that ensure the smooth control and flow of data from creation through processing and utilization to deletion. It is enforced through an organized infrastructure of technological resources and a governing framework that outlines the executive processes used throughout the lifecycle of the data. This is a large area, and the term serves as an overarching label for an entire segment of IoT.

IoT Data Lifecycle

The lifecycle of data in an IoT system is illustrated in Figure 10.4. It proceeds from data production through aggregation, transfer, optional filtering, and preprocessing, and finally to storage and archiving. Querying and analysis are endpoints that initiate (request) and consume data production; however, data production can also be set to be “pushed” to IoT-consuming services.

Figure 10.4 IoT data lifecycle and data management.

Querying: Data-intensive systems rely on querying as the core process to access and retrieve data. In the context of IoT, a query can be issued either to request real-time data to be collected for temporal monitoring or to retrieve a certain view of the data stored within the system. The first case is typical when a (mostly localized) real-time request for data is needed. The second case represents more globalized views of data and in-depth analysis of trends and patterns.
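The two kinds of query can be contrasted with a minimal Python sketch. The function names, the callables standing in for sensors, and the list standing in for a data store are all assumptions made for illustration.

```python
def query_live(sensors, sensor_ids):
    """Real-time query: ask the named sensors for a fresh value now."""
    return {sensor_id: sensors[sensor_id]() for sensor_id in sensor_ids}


def query_stored(store, start, end):
    """Stored-view query: records with start <= timestamp < end."""
    return [r for r in store if start <= r["timestamp"] < end]


# Hypothetical sensors, modeled as zero-argument callables.
sensors = {"door": lambda: 1, "hall": lambda: 0}

# Hypothetical stored history.
store = [
    {"sensor_id": "door", "timestamp": 10, "value": 0},
    {"sensor_id": "door", "timestamp": 20, "value": 1},
    {"sensor_id": "hall", "timestamp": 30, "value": 0},
]

live = query_live(sensors, ["door"])   # localized, current state
window = query_stored(store, 15, 35)   # broader view over stored data
```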

Production: Data production involves sensing and transfer of data by the “Things” within the IoT framework and reporting this data to interested parties periodically (as in a subscribe/notify model), pushing it up the network to aggregation points and subsequently to database servers, or sending it as a response triggered by queries that request the data from sensors and smart objects. Data is usually time-stamped and possibly geo-stamped, can be in the form of simple key-value pairs, or may contain rich audio/image/video content, with varying degrees of complexity.
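A minimal sketch of data production, assuming a hypothetical `Sensor` class and `Reading` record that are not part of any particular IoT framework: each reading is a time-stamped, optionally geo-stamped set of key-value pairs, produced either on request (pull) or pushed to subscribers in a subscribe/notify style.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Reading:
    sensor_id: str
    values: Dict[str, float]                        # simple key-value payload
    timestamp: float                                # time stamp
    location: Optional[Tuple[float, float]] = None  # optional geo stamp


class Sensor:
    """Hypothetical smart object that produces readings."""

    def __init__(self, sensor_id, location=None):
        self.sensor_id = sensor_id
        self.location = location
        self._subscribers: List[Callable[[Reading], None]] = []

    def subscribe(self, callback):
        """Register an interested party for pushed readings."""
        self._subscribers.append(callback)

    def sample(self, values, timestamp=None):
        """Pull: produce one reading in response to a query."""
        ts = time.time() if timestamp is None else timestamp
        return Reading(self.sensor_id, dict(values), ts, self.location)

    def publish(self, values, timestamp=None):
        """Push: produce a reading and notify every subscriber."""
        reading = self.sample(values, timestamp)
        for callback in self._subscribers:
            callback(reading)
        return reading


received = []
sensor = Sensor("temp-01", location=(48.85, 2.35))
sensor.subscribe(received.append)
sensor.publish({"temperature_c": 21.5}, timestamp=1000.0)        # pushed
answer = sensor.sample({"temperature_c": 22.0}, timestamp=1001.0)  # pulled
```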

Collection: The sensors and smart objects within the IoT may store data for a certain time interval or report it to governing components. Data may be collected at concentration points or gateways within the network where it is further filtered and processed, and possibly fused into compact forms for efficient transmission. Wireless communication technologies such as Zigbee, Wi-Fi, and cellular are used by objects to send data to collection points.

Aggregation/Fusion: Transmitting raw data out of the network in real-time is often prohibitively expensive given the increasing data streaming rates and the limited bandwidth. Aggregation and fusion techniques deploy summarization and merging operations in real-time to compress the volume of data to be stored and transmitted.
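One simple form of summarization is to replace each fixed-size window of raw readings with a few statistics. The sketch below assumes numeric readings and a window size picked arbitrarily for illustration; it shows summarization only, and fusing readings from several sources would need additional logic.

```python
from statistics import mean


def summarize(window):
    """Reduce one window of raw values to a compact summary."""
    return {
        "count": len(window),
        "min": min(window),
        "max": max(window),
        "mean": mean(window),
    }


def aggregate(readings, window_size):
    """Compress a sequence into one summary per fixed-size window."""
    return [
        summarize(readings[start:start + window_size])
        for start in range(0, len(readings), window_size)
    ]


raw = [20.0, 21.0, 22.0, 23.0, 30.0, 31.0]
summaries = aggregate(raw, window_size=3)
# Six raw values are sent up the network as two summaries.
```

The trade-off is that individual readings are no longer recoverable from the summaries, so the window size bounds the time resolution available to later analysis.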

Delivery: As data is filtered, aggregated, and possibly processed either at the concentration points or at the autonomous virtual units within the IoT, the results of these processes may need to be sent further up the system, either as final responses or for storage and in-depth analysis. Wired or wireless broadband communications may then be used to transfer data to permanent data stores.

Preprocessing: IoT data may come from different sources with varying formats and structures. Data may need to be preprocessed to handle missing data, remove redundancies, and integrate data from different sources into a unified schema before being committed to storage. Preprocessing is a known procedure in data mining called data cleaning. Schema integration does not imply brute-force fitting of all the data into a fixed relational (tables) schema, but rather a more abstract definition of a consistent method to access data without having to customize access for each source’s data format(s). Probabilities at different levels in the schema may be added at this phase to IoT data items to handle uncertainty that may be present in data, or to deal with the lack of trust that may exist in data sources.
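The preprocessing steps named above can be sketched as follows. The two source formats (`vendor_a`, `vendor_b`), the field names, and the per-source trust values are hypothetical. Dropping records with missing values is only one possible policy (imputation is another), and a single per-source confidence is a simplified stand-in for the probabilities the text mentions.

```python
def normalize(record, source):
    """Map a source-specific record onto one common set of keys."""
    if source == "vendor_a":  # assumed format: id, temp_c, ts
        return {
            "sensor_id": record["id"],
            "temperature_c": record.get("temp_c"),
            "timestamp": record["ts"],
        }
    if source == "vendor_b":  # assumed format: sensor, temp_f, time
        temp_f = record.get("temp_f")
        temp_c = None if temp_f is None else (temp_f - 32) * 5 / 9
        return {
            "sensor_id": record["sensor"],
            "temperature_c": temp_c,
            "timestamp": record["time"],
        }
    raise ValueError(f"unknown source: {source}")


def preprocess(batches, trust):
    """Clean and integrate batches keyed by source name."""
    seen = set()
    cleaned = []
    for source, records in batches.items():
        for record in records:
            row = normalize(record, source)
            if row["temperature_c"] is None:
                continue  # missing data: dropped in this sketch
            key = (row["sensor_id"], row["timestamp"])
            if key in seen:
                continue  # redundancy: duplicate reading
            seen.add(key)
            row["confidence"] = trust[source]  # uncertainty annotation
            cleaned.append(row)
    return cleaned


batches = {
    "vendor_a": [
        {"id": "a1", "temp_c": 20.0, "ts": 100},
        {"id": "a1", "temp_c": 20.0, "ts": 100},  # duplicate
        {"id": "a2", "temp_c": None, "ts": 100},  # missing value
    ],
    "vendor_b": [{"sensor": "b1", "temp_f": 212.0, "time": 100}],
}
trust = {"vendor_a": 0.9, "vendor_b": 0.6}
cleaned = preprocess(batches, trust)
```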

Storage/Update and Archiving: This phase handles the efficient storage and organization of data, as well as the continuous update of data with new information as and when it becomes available. Archiving refers to the offline long-term storage of data that is not immediately needed for the system’s ongoing operations. At the core of centralized storage is the deployment of storage structures that adapt to the various data types and the frequency of data capture. RDBMSs are a popular choice involving the organization of data into a table schema with predefined interrelationships and metadata for efficient retrieval at later stages.

NoSQL key-value stores are gaining popularity as storage technologies for their support of big data storage without relying on relational schema or strong consistency requirements typical of relational database systems. Storage can also be decentralized for autonomous IoT systems, where data is retained at the objects that generate it and is not sent up the system. However, due to limited capabilities of such objects, storage capacity remains limited in comparison to the centralized storage model.
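The contrast between the two storage styles can be sketched with Python's standard library. Here `sqlite3` stands in for an RDBMS and a plain dictionary stands in for a key-value store; the table layout, key format, and sample values are assumptions made for illustration, and the dictionary does not model the distribution or consistency behavior of a real NoSQL system.

```python
import json
import sqlite3

# Relational: a fixed table schema declared before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE readings (
        sensor_id     TEXT NOT NULL,
        timestamp     REAL NOT NULL,
        temperature_c REAL NOT NULL,
        PRIMARY KEY (sensor_id, timestamp)
    )
    """
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("s1", 100.0, 20.5), ("s1", 160.0, 21.0), ("s2", 100.0, 19.0)],
)
conn.commit()
s1_avg = conn.execute(
    "SELECT AVG(temperature_c) FROM readings WHERE sensor_id = ?",
    ("s1",),
).fetchone()[0]

# Key-value: no shared schema; each value is an opaque document.
kv_store = {}


def put(key, value):
    kv_store[key] = json.dumps(value)


def get(key):
    return json.loads(kv_store[key])


put("s1:100", {"temperature_c": 20.5})
put("cam7:100", {"codec": "jpeg", "width": 640, "height": 480})
frame_meta = get("cam7:100")
```

The relational table supports declarative queries such as the average above but requires every row to fit the declared columns, whereas the key-value store accepts differently shaped values at the cost of lookups being limited to the key.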

Processing/Analysis: This phase involves the ongoing retrieval and analysis operations performed on stored and archived data to gain insights into historical data and predict future trends, or to detect data defects that may trigger further investigation or action. Task-specific preprocessing may be needed to filter and clean data before any meaningful operations can occur. When an IoT subsystem is autonomous and does not require permanent storage of its data, but rather retains the processing and storage in the network, in-network processing may be performed in response to real-time or localized queries.
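Two of the analysis goals mentioned here, detecting data defects and predicting trends, can be illustrated with deliberately simple techniques. The sketch below flags values far from the median and extrapolates a least-squares line; both methods, the tolerance, and the sample values are illustrative choices, not methods prescribed by the text.

```python
from statistics import mean, median


def find_defects(values, tolerance):
    """Indices whose value differs from the median by more than tolerance."""
    mid = median(values)
    return [i for i, v in enumerate(values) if abs(v - mid) > tolerance]


def linear_forecast(values, steps_ahead=1):
    """Fit a least-squares line to (index, value) pairs and extrapolate."""
    n = len(values)
    xs = list(range(n))
    x_bar = mean(xs)
    y_bar = mean(values)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + steps_ahead)


# One reading is far from the rest and is flagged for investigation.
suspect = find_defects([20.0, 20.5, 95.0, 21.0, 20.2], tolerance=5.0)

# A steadily rising history is extrapolated one step ahead.
forecast = linear_forecast([10.0, 12.0, 14.0, 16.0])
```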

Looking back at Figure 10.4, the flow of data may take one of three paths. The first, for autonomous systems within the IoT, proceeds from query to production to in-network processing and then delivery. The second starts from production, proceeds to collection and filtering/aggregation/fusion, and ends with data delivery to the initiating (possibly global or near real-time) queries. The third extends the path from production to aggregation to include preprocessing, permanent data storage and archival, and in-depth processing and analysis. In the next section, the need for data management solutions that surpass the current capabilities of traditional data management is highlighted in light of the previously outlined lifecycle [8,9].
