Big Data

Growth of and digitization of global information-storage capacity[1]
Big data is data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. There are a number of concepts associated with big data: originally there were three concepts, volume, variety and velocity.[2] Other concepts later attributed to big data are veracity (i.e., how much noise is in the data)[3] and value.[4]
Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. "There is little doubt that the quantities of data now available are indeed large, but that’s not the most relevant characteristic of this new data ecosystem."[5] Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."[6] Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, fintech, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics,[7] connectomics, complex physics simulations, biology and environmental research.[8]
Data sets grow rapidly - in part because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks.[9][10] The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[11] as of 2012, every day 2.5 exabytes (2.5×10^18) of data are generated.[12] Based on an IDC report prediction, the global data volume will grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020.[13] By 2025, IDC predicts there will be 163 zettabytes of data.[14] One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.[15]
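As a rough illustration of the growth rates cited above, the figures can be checked with a little arithmetic. The short Python sketch below is only a back-of-the-envelope aid (the variable names are ours, and the constants are simply the numbers quoted in this section); it converts the daily 2012 rate into a yearly total and derives the growth rate implied by the IDC forecast.

```python
# Back-of-the-envelope check of the growth figures cited in the text.
# The constants come from the article; everything else is arithmetic.
import math

EXABYTES_PER_DAY_2012 = 2.5          # "2.5 exabytes ... every day" (2012)
ZB_2013, ZB_2020 = 4.4, 44.0         # IDC forecast: 4.4 ZB -> 44 ZB

# Daily rate expressed as zettabytes per year (1 ZB = 1000 EB).
zb_per_year_2012 = EXABYTES_PER_DAY_2012 * 365 / 1000
print(f"~{zb_per_year_2012:.2f} ZB generated in 2012")

# Implied annual growth factor and doubling time for the 2013-2020 forecast.
years = 2020 - 2013
annual_growth = (ZB_2020 / ZB_2013) ** (1 / years)      # ~1.39, i.e. ~39% per year
doubling_months = 12 * math.log(2) / math.log(annual_growth)
print(f"annual growth ~{annual_growth:.2f}x, doubling every ~{doubling_months:.0f} months")
```

Under these cited figures, the forecast implies data volume doubling roughly every two years, somewhat faster than the 40-month doubling of storage capacity quoted for earlier decades.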
Relational database management systems and desktop statistics and visualization software packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers".[16] What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."[17]

Definition

Visualization created by IBM of daily Wikipedia edits. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.
The term has been in use since the 1990s, with some giving credit to John Mashey for coining or at least making it popular.[18][19][20] Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.[21] Big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data.[22] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many exabytes of data.[23] Big data requires a set of techniques and technologies with new forms of integration to reveal insights from datasets that are diverse, complex, and of a massive scale.[24]
A 2016 definition states that "Big data represents the information assets characterized by such a high volume, velocity and variety to require specific technology and analytical methods for its transformation into value".[25] Additionally, a new V, veracity, is added by some organizations to describe it,[26] a revisionism challenged by some industry authorities.[27] The three Vs (volume, variety and velocity) have been further expanded to other complementary characteristics of big data.[28][29]
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and notes, "This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of some of the guarantees and capabilities made by Codd’s relational model."[32]
The growing maturity of the concept more starkly delineates the difference between "big data" and "Business Intelligence":[33]
  • Business Intelligence uses descriptive statistics with data with high information density to measure things, detect trends, etc.
  • Big data uses inductive statistics and concepts from nonlinear system identification[34] to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density[35] to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors (see the sketch after this list).[34][36]
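To make the contrast above concrete, the following Python sketch uses synthetic, illustrative data (none of the numbers come from a cited source): a descriptive summary of a small, dense data set in the style of a Business Intelligence report, followed by a simple regression that infers a relationship from a large, noisy, low-information-density data set.

```python
# Illustrative contrast between descriptive and inductive statistics.
# Synthetic data only; not tied to any dataset mentioned in the article.
import numpy as np

rng = np.random.default_rng(0)

# --- Business-Intelligence style: descriptive statistics on dense data ---
monthly_sales = rng.normal(loc=100_000, scale=5_000, size=12)
print("mean monthly sales:", monthly_sales.mean())
print("average month-over-month change:", np.diff(monthly_sales).mean())

# --- Big-data style: inductive statistics on a large, noisy data set ---
# Infer a "law" (here a simple linear relationship) from a million noisy points.
n = 1_000_000
ad_spend = rng.uniform(0, 10, size=n)
revenue = 3.0 * ad_spend + rng.normal(scale=8.0, size=n)   # weak signal, heavy noise
slope, intercept = np.polyfit(ad_spend, revenue, deg=1)
print(f"inferred relationship: revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```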

Characteristics

Big data can be described by the following characteristics:[28][29]
Volume
The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion.
Velocity
The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time.
Veracity
The data quality of captured data can vary greatly, affecting the accuracy of analysis.[37]
Factory work and cyber-physical systems may have a 6C system:
  • Connection (sensor and networks)
  • Cloud (computing and data on demand)[38][39]
  • Cyber (model and memory)
  • Content/context (meaning and correlation)
  • Community (sharing and collaboration)
  • Customization (personalization and value)
Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. For example, to manage a factory one must consider both visible and invisible issues with various components. Information generation algorithms must detect and address invisible issues such as machine degradation, component wear, etc. on the factory floor.[40][41]
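As a hedged illustration of what detecting such an "invisible issue" might look like in the simplest possible case, the sketch below flags gradual machine degradation when the rolling mean of a vibration signal drifts above a baseline. The sensor values, window size, and threshold are all invented for the example and are not from any cited source.

```python
# Toy degradation detector: flag a machine when the rolling mean of a
# vibration reading drifts above a baseline threshold.
# Signal, window size, and threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.normal(1.0, 0.05, size=500)                           # healthy operation
worn = rng.normal(1.0, 0.05, size=500) + np.linspace(0, 0.4, 500)    # slow wear
vibration = np.concatenate([baseline, worn])

window, threshold = 50, 1.2
rolling_mean = np.convolve(vibration, np.ones(window) / window, mode="valid")
first_alert = np.argmax(rolling_mean > threshold)
if rolling_mean[first_alert] > threshold:
    print(f"degradation flagged around sample {first_alert + window - 1}")
else:
    print("no degradation detected")
```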

Architecture

Big data repositories have existed in many forms, often built by corporations with a special need. Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s. For many years, WinterCorp published a largest database report.[42]
Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991, so the definition of big data continuously evolves according to Kryder's Law. Teradata installed the first petabyte-class RDBMS-based system in 2007. As of 2017, there are a few dozen petabyte-class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data. Since then, Teradata has added unstructured data types including XML, JSON, and Avro.
In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. ECL uses an "apply schema on read" method to infer the structure of stored data when it is queried, instead of when it is stored. In 2004, LexisNexis acquired Seisint Inc.[43] and in 2008 acquired ChoicePoint, Inc.[44] and their high-speed parallel processing platform. The two platforms were merged into HPCC (or High-Performance Computing Cluster) Systems and in 2011, HPCC was open-sourced under the Apache v2.0 License. Quantcast File System was available about the same time.[45]
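The "apply schema on read" idea can be sketched in a few lines of generic Python (this is an illustration of the general technique, not ECL or HPCC code): records are stored as raw text, and a schema of field names and types is applied only when a query reads them.

```python
# Generic schema-on-read sketch (not ECL/HPCC): raw records are stored as-is,
# and a schema (field names + types) is applied only at query time.
raw_store = [
    "1001,Alice,2020-03-01,19.99",
    "1002,Bob,2020-03-02,5.00",
]

def read_with_schema(lines, schema):
    """Apply a (name, type) schema to raw CSV-like lines as they are read."""
    for line in lines:
        values = line.split(",")
        yield {name: cast(value) for (name, cast), value in zip(schema, values)}

# The same raw bytes can be read under different schemas by different queries.
orders_schema = [("id", int), ("customer", str), ("date", str), ("amount", float)]
total = sum(row["amount"] for row in read_with_schema(raw_store, orders_schema))
print(f"total: {total:.2f}")
```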
CERN and other physics experiments have collected big data sets for many decades, usually analyzed via high performance computing (supercomputers) rather than the commodity map-reduce architectures usually meant by the current "big data" movement.
In 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful,[46] so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.[47] Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reduce).
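A minimal, single-machine sketch of the Map and Reduce steps described above is shown below. It is a pedagogical toy, not the Google or Hadoop implementation: the map step emits key/value pairs for each chunk of input, and the reduce step gathers and aggregates them; a real framework would run those steps across many nodes.

```python
# Toy MapReduce-style word count: map over chunks of input, then gather
# and reduce by key. Single process here; real frameworks distribute the
# map and reduce steps across many parallel nodes.
from collections import defaultdict
from itertools import chain

documents = ["big data needs parallel processing",
             "map then reduce the data"]

def map_step(doc):
    # Emit (word, 1) pairs for one chunk of input.
    return [(word, 1) for word in doc.split()]

def reduce_step(pairs):
    # Gather values by key and aggregate them.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

mapped = chain.from_iterable(map_step(doc) for doc in documents)  # Map
word_counts = reduce_step(mapped)                                 # Reduce
print(word_counts)
```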
MIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled "Big Data Solution Offering".[48] The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.[49]
Studies in 2012 showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end user by using a front-end application server.[50]
Big data analytics for manufacturing applications is marketed as a 5C architecture (connection, conversion, cyber, cognition, and configuration).[51]
The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.[52][53]

Technologies

A 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as techniques for analyzing data, big data technologies, and visualization.[54]
Multidimensional big data can also be represented as tensors, which can be more efficiently handled by tensor-based computation,[55] such as multilinear subspace learning.[56] Additional technologies being applied to big data include massively parallel-processing (MPP) databases, search-based applications, data mining,[57] distributed file systems, distributed databases, cloud and HPC-based infrastructure (applications, storage and computing resources)[58] and the Internet. Although many approaches and technologies have been developed, it still remains difficult to carry out machine learning with big data.[59]
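As a small, hedged illustration of the tensor idea (numpy only; real multilinear subspace learning uses dedicated algorithms and libraries, and the shapes here are invented), multidimensional measurements can be held as a single tensor and "unfolded" along one mode before applying familiar matrix methods such as the SVD.

```python
# Represent multidimensional data as a tensor and unfold it along one mode.
# The shapes and the use of a truncated SVD are illustrative, not a full
# multilinear subspace learning algorithm.
import numpy as np

rng = np.random.default_rng(2)
# e.g. (users, items, time) measurements as a 3-way tensor
tensor = rng.normal(size=(30, 20, 10))

# Mode-1 unfolding: keep the first axis, collapse the others into columns.
mode1 = tensor.reshape(30, 20 * 10)

# A low-rank basis for mode 1 via truncated SVD, a common building block
# of tensor decompositions.
u, s, vt = np.linalg.svd(mode1, full_matrices=False)
basis = u[:, :5]
print("mode-1 basis shape:", basis.shape)   # (30, 5)
```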
Some MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.[60]
DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called Ayasdi.[61]
The practitioners of big data analytics processes are generally hostile to slower shared storage,[62] preferring direct-attached storage (DAS) in its various forms from solid state drive (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures—storage area network (SAN) and network-attached storage (NAS)—is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.
Real or near-real-time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of an FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques.
There are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.[63]

Big data virtualization

Big data virtualization is a way of gathering data from a few sources in a single layer. The gathered data layer is virtual. Unlike other methods, most of the data remains in place and is taken on demand directly from the source systems.[64]
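A hedged sketch of this idea follows; the class, source names, and query interface are invented for illustration and do not correspond to any particular product. The point is simply that the virtual layer exposes one query entry point while leaving the data in the source systems and fetching it only when a query asks for it.

```python
# Minimal data-virtualization sketch: data stays in the "source systems"
# (here just an in-memory list behind a callable) and is pulled on demand
# by the virtual layer. All names are illustrative.
class VirtualLayer:
    def __init__(self):
        self.sources = {}            # name -> callable returning rows

    def register(self, name, fetch):
        self.sources[name] = fetch   # nothing is copied at registration time

    def query(self, name, predicate):
        # Data is fetched from the source only when the query runs.
        return [row for row in self.sources[name]() if predicate(row)]

crm_rows = [{"customer": "Alice", "region": "EU"},
            {"customer": "Bob", "region": "US"}]

layer = VirtualLayer()
layer.register("crm", lambda: crm_rows)
print(layer.query("crm", lambda r: r["region"] == "EU"))
```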
