BIG DATA CHALLENGES
Data volumes continue to grow, and so do the possibilities of what can be done with so much raw data available. However, organizations need to know what they can do with that data and how far they can go in building insight for their consumers, products, and services. While big data offers many advantages, it comes with its own set of concerns and challenges. It involves a new set of complex technologies that are still in the emergent stages of development and evolution.
Some of the commonly faced issues include insufficient knowledge of the technologies involved, data confidentiality, and inadequate analytical capabilities within organizations. Many enterprises face a shortage of skills for dealing with big data technologies. Today, few people are trained to work with big data, which may become an even bigger problem.
These are not the only challenges, though. There are others: some become apparent after organizations begin to move into the big data space, and some while they are still laying out the roadmap for it. Some of the challenges of big data are discussed in the following sections.
Handling a Large Amount of Data
Size is certainly a concern when it comes to big data. Managing large and rapidly increasing volumes of data has been an issue for decades. This challenge was earlier addressed by developing faster processors to keep pace with growing data volumes. However, data volume is now scaling faster than compute resources, and CPU speed has stagnated: due to power constraints, clock speeds have largely stalled, and instead of doubling their clock frequency every 18-24 months, processors are now built with more cores.
There has been a huge explosion in the data accessible. Look back a few years and compare it with today, and you will see an exponential increase in the data that enterprises can access. They have statistics on everything, from what a consumer likes, to how they react to a particular track, to the amazing restaurant that opened up in Italy last weekend.
This data exceeds the amount that can be stored and computed, as well as retrieved. The challenge is not so much its availability as the management of this data. With estimates claiming that by 2020 stored data would stretch 6.6 times the distance between the earth and the moon, this is definitely a challenge.
Along with the rise in unstructured or disorganized data, there has also been a rise in the number of data formats: video, audio, social media, smart device data, and so on, to name just a few.
Some of the contemporary approaches developed to manage this data are hybrids of relational databases combined with NoSQL databases. An example is MongoDB, which is an integral part of the MEAN stack. There are also distributed computing systems such as Hadoop to help manage big data volumes.
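The flexibility that makes NoSQL document databases attractive here can be sketched in a few lines of Python. The tiny in-memory store below is only an illustration of the document model that systems like MongoDB use (the class and its methods are invented for this sketch, not a real MongoDB API): records with completely different shapes, such as video metadata and social media posts, can live in the same collection with no fixed schema.

```python
# A tiny in-memory "document store" illustrating the schema
# flexibility of NoSQL systems such as MongoDB: each record is a
# free-form document, so new fields and formats can be added
# without altering a fixed schema. Illustrative sketch only.
class DocumentStore:
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)

    def find(self, **criteria):
        # Return every document whose fields match all criteria.
        return [d for d in self.docs
                if all(d.get(k) == v for k, v in criteria.items())]

store = DocumentStore()
# Documents with entirely different shapes coexist in one collection.
store.insert({"type": "video", "codec": "h264", "duration_s": 120})
store.insert({"type": "post", "text": "great pasta in Italy!", "likes": 42})

videos = store.find(type="video")
print(len(videos))  # → 1
```

A relational table would force both records into one rigid column layout; the document model simply stores whatever fields each record has.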
Netflix is a content streaming platform based on Node.js that allows users to view different series for a subscription charge. With the growing load of content and the complex formats available on the platform, they need a stack that can handle the storage and retrieval of the data. Hence, they used the MEAN stack, and with its non-relational database model, they can manage the data.
Security Aspects and Constraints
A lot of organizations report that they face trouble with data security; this turns out to be a bigger challenge for them than many other data-related problems. The data that comes into an enterprise is made available from a wide range of sources, some of which cannot be trusted to be secure and compliant with organizational standards.
They need to use a variety of data collection approaches to keep up with data needs. This in turn leads to inconsistencies and inaccuracies in the data, and then in the outcomes of the analysis. A figure such as annual turnover for the retail industry can come out differently when analyzed from different sets of inputs. A business will need to reconcile the differences and narrow them down to an answer that is valid and useful.
This data is made available from various sources and therefore has potential security problems. You never know which channel of data is compromised; a compromised channel undermines the security of the data available in the organization and gives hackers a chance to move in. Hence, it is necessary to introduce data security best practices for secure data collection, storage, and retrieval, and to eliminate inconsistency in the data.
Efficiently Processing Unstructured and Semi-Structured Data
Databases and warehouses are ill-suited to processing unstructured and semi-structured data. With big data, read and write operations are highly concurrent across large numbers of users. As the size of the database grows, algorithms may become inefficient or break down. The CAP (Consistency, Availability, Partition tolerance) theorem states that it is impossible for a distributed system to provide all three guarantees; we can choose only two out of three.
The difficulty of storing such data in a rigid form makes it difficult to process, which led to the emergence of new processing mechanisms such as NoSQL. It is worth noting that the definition of big data keeps changing to include new details that are becoming very important to consider. Big data is commonly characterized by the following five Vs:
1. Variety
2. Volume
3. Veracity
4. Value
5. Velocity
Data must be structured as a first step in data analysis. Example: a patient in a hospital might have one record per medical report or lab test, one per surgical operation, one per hospital admission, or one covering the hospital's lifetime interaction with the patient. The number of surgical operations and lab tests per record would differ for every patient. These design choices have successively less structure and, conversely, successively greater variety. Given that big data is acquired from numerous sources with a variety of structures, structuring these data before analysis is almost impossible. Thus, it is one of the next important challenges of big data.
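The structuring step described above can be sketched concretely. Assuming a simplified patient document (the field names below are invented for illustration), each patient carries a varying number of lab tests and operations, and flattening turns these ragged documents into fixed-column rows that analysis tools can consume.

```python
# Patient documents with varying structure: different patients
# have different numbers of lab tests and operations.
patients = [
    {"id": "p1", "lab_tests": ["blood", "x-ray"], "operations": ["appendectomy"]},
    {"id": "p2", "lab_tests": ["blood"], "operations": []},
]

def flatten(patients):
    """Flatten ragged patient documents into uniform three-column rows."""
    rows = []
    for p in patients:
        for test in p["lab_tests"]:
            rows.append({"patient": p["id"], "event": "lab_test", "detail": test})
        for op in p["operations"]:
            rows.append({"patient": p["id"], "event": "operation", "detail": op})
    return rows

rows = flatten(patients)
# Every row now has the same columns, however many tests or
# operations each patient originally had.
print(len(rows))  # → 4
```

This is trivial for two patients; the challenge the text describes is doing the equivalent across millions of documents whose structure is not known in advance.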
The main idea behind big data is to extract useful insight by performing specific computations. However, it is important to secure and protect these computations against any attempt to change or skew the extracted results, and to avoid loss of data. It is also important to protect the systems from any attempt to spy on the nature or the number of performed computations.
Big data is stored in many nodes belonging to many clusters that are scattered all over the world. All communication between clusters and nodes takes place over ordinary public and private networks. However, if someone can tamper with the inter-node communication, it becomes easy to extract valuable information. Therefore, it is a real challenge for big data tools to adopt secure network protocols in order to protect the interaction between different parties.
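One building block of such protocols is message authentication. The sketch below uses Python's standard `hmac` module to show the idea: a node signs each message with a shared key, and the receiver rejects anything whose tag does not verify, so traffic tampered with on the public network between clusters is detected. The shared key and message format are illustrative; in practice the key would come from the cluster's key-management system and transport would also be encrypted (e.g. TLS).

```python
import hmac
import hashlib

SHARED_KEY = b"cluster-shared-secret"  # illustrative only; never hard-code keys

def sign(message: bytes) -> str:
    """Compute an HMAC-SHA256 tag for an inter-node message."""
    return hmac.new(SHARED_KEY, message, hashlib.sha256).hexdigest()

def verify(message: bytes, tag: str) -> bool:
    """Constant-time check that the tag matches the message."""
    # compare_digest avoids leaking information via timing side channels.
    return hmac.compare_digest(sign(message), tag)

msg = b"node-7: block 42 replicated"
tag = sign(msg)
print(verify(msg, tag))                                # → True
print(verify(b"node-7: block 42 deleted", tag))        # → False
```

An attacker who alters a message in transit cannot produce a valid tag without the shared key, so the forged message is rejected.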
In a big data context, access to the data should be managed by a strong access control system to prevent any malicious party from gaining access to the storage servers. That is, only a node with sufficient administrative rights should be able to manage and process any content. Furthermore, any modification of a cluster's state, such as the addition or deletion of nodes, should be verified by an authentication mechanism to protect the system from malicious nodes.
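A minimal sketch of this rule, with entirely invented role names and functions, might look as follows: only nodes holding an administrative role may approve cluster-membership changes, and requests from any other node are refused.

```python
# Illustrative role table and checks; real systems would back this
# with an authentication service, not an in-memory dict.
NODE_ROLES = {"node-1": "admin", "node-2": "worker"}

def can_manage(node_id: str) -> bool:
    """Only nodes with the admin role may manage cluster state."""
    return NODE_ROLES.get(node_id) == "admin"

def change_membership(requesting_node: str, action: str, target: str) -> str:
    """Approve an add/delete of a node only for authorized requesters."""
    if not can_manage(requesting_node):
        raise PermissionError(f"{requesting_node} may not {action} nodes")
    return f"{action} {target} approved by {requesting_node}"

print(change_membership("node-1", "add", "node-9"))
# → add node-9 approved by node-1
```

A worker node attempting the same call (`change_membership("node-2", "add", "node-9")`) raises `PermissionError`, which is the point: membership changes from unauthorized nodes never reach the cluster.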
The concept of big data analytics is mainly based on parallelism: the large dataset is gathered and processed across different clusters, which are sets of distributed servers around the world acting as one powerful machine. The main issue with this topology is that it is very hard to know the exact location of storage and processing, which can result in many security problems and regulatory breaches. The main challenge for big data solutions is to be able to distribute storage and processing in accordance with regulations and data sensitivity.
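The parallelism described above can be sketched in miniature: split the data into partitions, let a pool of workers (standing in for cluster nodes) process each partition independently, and combine the partial results. This is a toy, single-machine illustration of the split/process/merge pattern, not a distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

# Split a large dataset into partitions, one per "node".
data = list(range(1_000))
partitions = [data[i:i + 250] for i in range(0, len(data), 250)]

def process_partition(part):
    # Each worker computes a partial aggregate on its own shard.
    return sum(part)

# The pool of workers stands in for the distributed cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, partitions))

# Merge the partial results into the final answer.
total = sum(partials)
print(total)  # → 499500
```

In a real deployment, frameworks such as Hadoop perform the same split/process/merge steps across machines, which is exactly where the location-of-processing problem the text raises comes from: each partition may land in a different jurisdiction.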