Abstract—We are living in the information age, and we need to keep
information about every aspect of our lives. This information can be any kind
of data. Keeping this data for further analysis and computation is a difficult
task, and several research efforts have tried to make it useful for future
computation. Processing or analysing such a huge amount of data is
challenging. All existing technologies have certain performance bottlenecks
and overheads, and challenges such as scalability arise. Spark is a commonly
used data analysis framework. MapReduce is a computing paradigm and a popular
model for distributed data analysis. This paper reviews some big data
technologies and how they handle big data, studies some of their performance
bottlenecks and the methods proposed to prevent them, and discusses the
Resilient Distributed Dataset (RDD) and how it is optimized.
Keywords: Analytics; MapReduce; Resilient Distributed Dataset (RDD); DataFrames
I. INTRODUCTION
This new media age has seen today's enterprises grow at an exponential rate,
and the data and databases they use have grown with them. The result is a big
data problem: industries are unable to manage or process this data within the
required time. Data is generated by social networking sites, by business
transactions, and by many other sources, and it can be structured as well as
unstructured [1]. Processing or analysing such a huge amount of data is a
challenging task, and the explosion of data has become a major challenge in
science and engineering. Datasets are growing fast, and no existing solution
manages this bulk of data easily. Existing solutions, some based on files and
some on databases, still fail to handle and analyse the data properly and to
make it usable in the future. Once a dataset exceeds the capacity of the
system, its analysis becomes difficult and performance is limited. Data
analysis evaluates certain attributes, extracts and transforms the necessary
data, which ranges from simple to complex types, and performs multiple complex
joins on these datasets. The explosion in data size makes it inefficient to
store and process the data, which creates several challenges in preserving it
for future computation. One such challenge is the limited scalability of I/O,
since scalability determines the performance of an application.
II. BIG DATA TECHNOLOGIES
A. BIG DATA
Big data is data so large that it cannot be processed using traditional
computing techniques. The volume of data that Facebook and YouTube handle
falls into this category. The data size ranges from terabytes to petabytes,
and the data can be structured or unstructured. Big data is important because,
in this growing economy, data is also growing in abundance owing to the heavy
use of social networking sites, mobile devices and networks. There is often a
need to analyse this data and extract the required information within a short
time.
B. HADOOP
Hadoop is a fault-tolerant, cost-effective, flexible and scalable computing
solution, and many industries use it to analyse their datasets. It is open
source software that allows distributed processing of large datasets across
clusters of computers using a simple programming model. It includes HDFS
(Hadoop Distributed File System), a distributed file system that provides
fault tolerance, runs on commodity hardware, and stores large datasets across
multiple machines in a cluster. It has a master/slave architecture.
C. MAP REDUCE
MapReduce is a computing paradigm and a popular model for distributed data
analysis that simplifies data processing on large clusters. It is popular
because it is easy to use: a programmer with no experience in distributed
computing systems can make use of it. It provides fault tolerance and load
balancing and hides the details of parallelization. Programmers use MapReduce
to write many kinds of programs over different data types. It offers high
performance on large clusters of computers, and it is a simple model with a
powerful interface for large-scale computation in a distributed system;
programs written in MapReduce are parallelized automatically.
MapReduce, proposed by Google, provides a scalable, distributed framework for
processing large amounts of data across large clusters. The data is divided
and distributed to multiple nodes, each node is assigned a particular task,
and the tasks execute in parallel. MapReduce has two parts, a mapper and a
reducer. The mapper splits up the data and emits it as intermediate key/value
pairs. These intermediate key/value pairs are the input to the reducer, which
combines them to form the output. The results of a MapReduce job are stored in
a distributed file system.
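The mapper and reducer roles described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop code: the function names (`mapper`, `shuffle`, `reducer`, `run_job`) and the word-count task are assumptions chosen for illustration, and the grouping step that a real framework performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit an intermediate (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    intermediate = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())

word_counts = run_job(["big data needs big clusters",
                       "data moves between clusters"])
print(word_counts)
```

In a real cluster each call to `mapper` would run on the node that holds the corresponding input split, and only the grouped pairs would cross the network.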
Hadoop is an open source implementation of MapReduce and follows a
master-slave model. There is a JobTracker node, also known as the master node,
and TaskTracker nodes, called worker nodes. The client interacts with the
JobTracker, which accepts jobs from the client, performs the operations the
client requests, and decomposes each job into tasks. A TaskTracker contains
slots, and each slot holds one task. The TaskTracker receives tasks from the
JobTracker and executes them, and on completion it sends a notification back
to the JobTracker using a heartbeat message. If a task fails, it is executed
again.
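The re-execution of failed tasks can be sketched as a simple retry loop. This is an illustrative model only, not Hadoop's scheduler: `run_with_retries`, `flaky_execute`, the task names and the `max_attempts` limit are all hypothetical, and the heartbeat exchange is reduced to an exception raised by the worker.

```python
def run_with_retries(tasks, execute, max_attempts=3):
    """Hypothetical master loop: re-queue any task whose worker reports failure."""
    results, attempts = {}, {}
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = execute(task)
        except RuntimeError:
            if attempts[task] >= max_attempts:
                raise  # give up after repeated failures
            pending.append(task)  # schedule the failed task again
    return results, attempts

failed_once = set()

def flaky_execute(task):
    """Simulated worker that loses 'split-2' the first time it runs."""
    if task == "split-2" and task not in failed_once:
        failed_once.add(task)
        raise RuntimeError("worker lost")
    return f"{task}: done"

results, attempts = run_with_retries(["split-1", "split-2", "split-3"],
                                     flaky_execute)
print(results, attempts)
```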
Hadoop contains another component, the Hadoop Distributed File System (HDFS),
whose design is based on the Google File System. HDFS stores large amounts of
data distributed across a number of machines, and it also handles very large
files. It supports high streaming read performance. It is a block-structured
file system: files are divided into blocks, which are distributed and stored
across the nodes of a Hadoop cluster. It also has a master/slave architecture.
The NameNode manages files and provides clients with access to them. HDFS
replicates blocks so that no block is lost. Each DataNode reports to the
NameNode with heartbeat messages.
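A minimal sketch of block splitting and replication follows. The round-robin placement is an assumption made for brevity and is not the rack-aware policy that HDFS actually uses; the function names, the tiny block size and the DataNode names are likewise hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication):
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }

def blocks_still_readable(placement, failed_node):
    """True if every block keeps at least one replica on a healthy node."""
    return all(any(node != failed_node for node in nodes)
               for nodes in placement.values())

blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"],
                           replication=3)
survives_failure = blocks_still_readable(placement, failed_node="dn3")
print(placement, survives_failure)
```

The mapping from block to nodes plays the role of the metadata that the NameNode keeps; the DataNodes hold only the block contents.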
III. PERFORMANCE EVALUATION
Many studies have aimed at improving the performance of data analytics
frameworks, but most of them failed to understand properly the factors that
affect system performance. Kay Ousterhout, in the work Making Sense of
Performance in Data Analytics Frameworks, developed a method for identifying
the performance bottlenecks in a distributed computing framework and used it
to analyse the performance of the Spark framework on two SQL benchmarks and a
production workload. The conclusion was that the CPU is often the bottleneck
and that improving network performance can improve job completion time by only
2 percent. Hadoop and Spark are widely used to improve performance.
Identifying performance bottlenecks is usually hard because of parallelism:
many tasks run at the same time.
The author in [6] discusses enabling big data analytics in the hybrid cloud
using iterative MapReduce. There are different cloud computing models: private
cloud, public cloud, and a combination of the two called hybrid cloud. Hybrid
clouds can be used for big data analytics and can improve iterative MapReduce
applications. Using the cloud also extends the environment, for example by
hiring some resources from off-premises.
The author in [8] introduced an approach that identifies and removes data
redundancy, minimizes the data that gets replicated, and compresses the data
to a greater degree. Using local storage instead of a parallel file system
leaves a large amount of related data on local disks. Local storage has
limited capacity, is prone to failure, and has difficulty handling multiple
simultaneous requests, so replication is usually used to provide high
availability. The author proposes treating redundancy elimination and
replication as a single co-optimized phase.
In many high-performance computing platforms, input and output are the main
performance bottleneck [9]. The author proposes a collective input/output
strategy called layout-aware collective input/output (LACIO), which optimizes
input/output performance and provides integration. Input/output can degrade
performance by causing high latency. LACIO improves the performance of current
systems through certain file system calls and hence improves the input/output
performance of parallel systems.
Big data analytics plays an important role in many fields such as science,
medicine and healthcare. As data sizes have increased greatly, an efficient
method for computation and analysis is needed. In-memory platforms always face
challenges such as data shuffling, which adversely affects the performance and
scalability of the system. A commonly used in-memory platform is Apache Spark.
Massive amounts of data are transferred, and in some cases these platforms
must also operate with scarce memory, so efficient scheduling of the data
transfer is required to make shuffling easier. Many existing solutions give
suboptimal performance and resource usage. The author in [11] presents a
memory-optimized data shuffling strategy that adapts dynamically to the
computation with minimum memory utilization.
The author in [7] discusses optimizing shuffle performance in Spark. Spark is
a commonly used framework for in-memory computation, and its shuffle
performance should be optimized so that it performs better than Hadoop.
Several bottlenecks affect performance, and existing solutions have identified
some of them. The author identifies bottlenecks and proposes alternatives that
mitigate the operating system overheads associated with them. One is shuffle
file consolidation, a simple solution that led to a twofold improvement in
overall job completion time. In MapReduce, shuffling occurs between the map
and reduce phases. Shuffling is part of the reduce phase and does little with
the data itself. It can cause operating system overheads and input and output
over the network. Several studies aim to avoid the operating system overhead;
since the shuffle phase is not connected to the semantics of the data,
additional storage can significantly reduce the overhead due to the data
transferred. Combiners pre-aggregate intermediate results before they reach
the reducer.
The author in [10] discusses enhancing Spark to improve performance in the
shuffle phase using RDMA-based Apache Spark on high-performance networks.
Apache Spark is the most commonly used system for in-memory processing of
real-time data: data can be loaded into memory and queried repeatedly. It
provides the RDD abstraction, which supports data lineage. The shuffle phase
causes overhead, and much research is under way to mitigate it. The author of
[10] proposes a data transfer strategy for the shuffle phase that consumes the
least amount of memory, gives a design for RDMA-based Apache Spark, and
proposes a block transfer service plugin that supports three shuffle schemes
in Spark: sort, hash and tungsten-sort. Future work is expected to remove the
remaining bottlenecks in Spark and introduce further methods to improve
performance.
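To make the cost of the shuffle phase concrete, the following sketch counts how many records cross from map tasks to reduce partitions with and without map-side combining. It is a toy model under stated assumptions: `partition_for` is a hypothetical deterministic partitioner (not the one used by Hadoop or Spark), and the two map inputs are invented for illustration.

```python
from collections import Counter

def partition_for(key, num_reducers):
    """Hypothetical deterministic partitioner: byte sum of the key modulo reducers."""
    return sum(key.encode("utf-8")) % num_reducers

def map_with_combiner(records):
    """Pre-aggregate identical keys on the map side before anything is sent."""
    return Counter(records)

def shuffle(map_outputs, num_reducers):
    """Route every (key, count) record to its reduce partition and count transfers."""
    partitions = [Counter() for _ in range(num_reducers)]
    transferred = 0
    for output in map_outputs:
        for key, count in output.items():
            partitions[partition_for(key, num_reducers)][key] += count
            transferred += 1
    return partitions, transferred

map_inputs = [["a", "b", "a", "a"], ["b", "b", "c"]]
records_without_combiner = sum(len(records) for records in map_inputs)
combined = [map_with_combiner(records) for records in map_inputs]
partitions, records_with_combiner = shuffle(combined, num_reducers=2)
print(records_without_combiner, records_with_combiner, partitions)
```

Fewer transferred records means fewer shuffle files and less network I/O, which is the effect the optimizations surveyed above aim for by other means.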
A. RESILIENT DISTRIBUTED DATASETS
A Resilient Distributed Dataset (RDD) is a memory abstraction with which
in-memory computations can be performed. Existing computing frameworks cannot
handle iterative algorithms and interactive data mining tools efficiently, and
this motivated RDDs. Keeping the data in memory improves performance, and RDDs
provide fault tolerance. RDDs offer coarse-grained transformations rather than
fine-grained ones. RDDs are implemented in Spark, a commonly used computing
platform. MapReduce has been widely used for data analytics, where users write
distributed programs in a fault-tolerant manner. RDDs allow data reuse by
letting intermediate results remain in memory. Defining a programming
interface is a challenge for RDDs. Because an RDD is based on coarse-grained
transformations, it applies the same operation to many data items. Lost data
can be recovered quickly, because an RDD records the lineage from which it was
derived. RDDs can be created from data in stable storage or from other RDDs.
Spark exposes RDDs through a language-integrated API and computes them lazily.
The persist method is used to keep an RDD for future operations. Spark keeps
RDDs in memory and spills them to disk when memory is short. A comparison
between RDDs and distributed shared memory is given in [3].
RDD: data stored in lineage; fine-grained fault recovery.
Distributed shared memory: consistency depends on the application and is not
guaranteed; requires checkpoints and program rollback, with overhead.
Fig. 1. Comparison of RDD and distributed shared memory.
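The RDD behaviour described in this section (lazy evaluation, persistence, and recovery from lineage) can be sketched with a small single-machine class. `ToyRDD` is a hypothetical stand-in written for illustration and is not Spark's RDD API; in particular, `lose_partition` simply simulates the loss of cached data.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, lazy, rebuilt from lineage."""

    def __init__(self, source=None, parent=None, fn=None, label="source"):
        self._source = tuple(source) if source is not None else None
        self._parent, self._fn, self._label = parent, fn, label
        self._cache = None
        self._persist = False

    def map(self, f):
        # Transformations only record what to do; nothing runs yet.
        return ToyRDD(parent=self, fn=lambda rows: [f(x) for x in rows], label="map")

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [x for x in rows if pred(x)],
                      label="filter")

    def persist(self):
        self._persist = True
        return self

    def lineage(self):
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self._label]

    def collect(self):
        # An action: use the cached result if present, else recompute from lineage.
        if self._cache is not None:
            return list(self._cache)
        rows = (list(self._source) if self._parent is None
                else self._fn(self._parent.collect()))
        if self._persist:
            self._cache = tuple(rows)
        return rows

    def lose_partition(self):
        self._cache = None  # simulate losing the in-memory copy

calls = []

def square(x):
    calls.append(x)
    return x * x

evens_squared = ToyRDD(source=range(1, 7)).map(square).filter(lambda x: x % 2 == 0).persist()
calls_before_action = len(calls)       # still zero: evaluation is lazy
first = evens_squared.collect()        # computes and caches
second = evens_squared.collect()       # served from the cache
evens_squared.lose_partition()
recovered = evens_squared.collect()    # recomputed from the recorded lineage
print(first, recovered, evens_squared.lineage(), len(calls))
```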
In distributed computing, a dataset is split into smaller parts that are
divided among nodes to achieve speedup and improve efficiency. The author in
[12] proposes a novel coded data delivery scheme for the case of no excess
storage; the scheme exploits a new coding opportunity called leftover
combining to reduce communication overhead.
IV. WORKING PRINCIPLE
Consider the RDD as a machine language, and the Dataset and DataFrame as
wrappers on top of it. Datasets and DataFrames come with RDD optimization
based on the distribution of the system. An RDD will be computed in the case
of two, four or eight distributed systems, but computer-based optimization
comes from the Dataset and DataFrame, which create an optimized RDD from this
RDD. Datasets and DataFrames are based on a dynamic language principle and can
be used from Python, Scala, Java and R.
Fig. 2. Primary memory lookup operation.
A big data Hadoop system is a distributed, file-based system. Data shuffling
is a high-I/O operation: a job has an input, a map, a reduce and an output,
with a shuffle phase in between. This is a store-and-forward pattern, which
can involve a lot of input/output operations; to avoid them, the shuffled data
has to go into primary storage. In big data systems, heavy input/output occurs
because the data has to pass through multiple file storage systems, which
makes the store-and-forward mechanism expensive. To remove I/O-based lookups,
the data needs to go to primary storage. Consider the task of adding six
numbers, 1+2+3+4+5+6, where the numbers are stored in files. One value is read
and saved, another value is read and saved separately as a second process, and
the addition happens in between, in a shuffle phase, using yet another file
store. Every action or operation then needs an I/O-bound operation. Instead,
all these operations can take place in memory, with the data saved in the
system's primary memory, so that they are done without looking at secondary
memory.
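The six-number example can be made concrete with a short sketch that counts file operations. It is only an illustration of the store-and-forward idea: the file name `partial.txt` and both function names are hypothetical, and a single local file stands in for the distributed file storage discussed above.

```python
import os
import tempfile
from functools import reduce

def sum_store_and_forward(numbers, workdir):
    """Write each partial sum to a file and read it back before the next step."""
    path = os.path.join(workdir, "partial.txt")
    io_ops = 0
    with open(path, "w") as f:
        f.write(str(numbers[0]))
    io_ops += 1
    for n in numbers[1:]:
        with open(path) as f:            # fetch the stored partial result
            partial = int(f.read())
        io_ops += 1
        with open(path, "w") as f:       # store the new partial result
            f.write(str(partial + n))
        io_ops += 1
    with open(path) as f:
        total = int(f.read())
    io_ops += 1
    return total, io_ops

def sum_in_memory(numbers):
    """Keep every partial sum in primary memory; no file access is needed."""
    return reduce(lambda a, b: a + b, numbers)

numbers = [1, 2, 3, 4, 5, 6]
with tempfile.TemporaryDirectory() as workdir:
    disk_total, disk_io_ops = sum_store_and_forward(numbers, workdir)
memory_total = sum_in_memory(numbers)
print(disk_total, disk_io_ops, memory_total)
```

Both versions produce the same total; the difference lies in the number of trips to secondary storage.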
What we do is a primary memory lookup operation, which is a DAG operation: a
DAG structure is computed for every operation. In Spark this is done using
RDDs. Different computers perform different operations, and all of them are
attached to primary memory; the figure shows the RDD distribution. An RDD is
immutable. It holds its lineage, the hierarchy of the data, which is saved in
the environment. Whatever data is in an RDD remains there; we do not edit it
each time, so there is no chance of the data changing. If x = abc, there is no
chance of x becoming xyz, and there is no need to check each time whether it
has changed. The data is stored in memory, and therefore data shuffling is not
required. Next we discuss optimizing RDDs. Recent versions of Spark come with
RDD optimization.
Fig. 3. RDD distribution.
A. OPTIMIZING RDD
In the distributed store, we query the distributed dataset using RDDs and
Spark SQL. No matter which language or API we use, whether we write in Java or
Scala or use SQL, Datasets or DataFrames, the first thing that happens is the
construction of a logical plan, which describes the structure of the
computation: take this data, read it from this source, do a join, do a filter,
and so on. We do DataFrame operations on static data, which is easy to
understand as a batch.
In physical planning, Spark automatically runs queries in a streaming fashion,
so they are executed continuously. The logical plan tells us the structure of
the computation, and the physical plan tells us how it will be performed after
optimization. The Catalyst optimizer takes the accurate plan and turns it into
an incremental query plan; the incrementalization is done by Spark.
There are four stages of the plan:
a) Parsed Logical Plan: It is the logical plan.
b) Analyzed Logical Plan: It is the logical plan combined with catalog
information, that is, the names and types of the datasets.
c) Optimized Logical Plan: Optimization is applied to reduce the number of
rows and simplify the query; we will do the filter operation and then
d) Physical Plan: A regression model is used for analyzing cost; the cost
model is similar to a regression cost model.
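The idea of choosing among equivalent plans by cost can be sketched as follows. This is not Spark's cost model: the two hand-written plans, the `users` and `orders` tables, and the cost measure (rows scanned plus row comparisons) are assumptions for illustration, and the sketch measures the work actually done, whereas a real optimizer estimates it before execution.

```python
def join(left, right, key):
    """Nested-loop join; returns joined rows and the number of row comparisons."""
    out, work = [], 0
    for l in left:
        for r in right:
            work += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, work

def select(rows, pred):
    """Filter rows; the cost is the number of rows scanned."""
    return [r for r in rows if pred(r)], len(rows)

def plan_join_then_filter(users, orders):
    joined, c1 = join(users, orders, "user_id")
    result, c2 = select(joined, lambda r: r["country"] == "IN")
    return result, c1 + c2

def plan_filter_then_join(users, orders):
    filtered, c1 = select(users, lambda r: r["country"] == "IN")
    result, c2 = join(filtered, orders, "user_id")
    return result, c1 + c2

users = [{"user_id": 1, "country": "IN"}, {"user_id": 2, "country": "US"},
         {"user_id": 3, "country": "DE"}, {"user_id": 4, "country": "FR"}]
orders = [{"user_id": u, "amount": a}
          for u, a in [(1, 10), (1, 20), (2, 5), (3, 7), (4, 9), (4, 1)]]

plans = {"join_then_filter": plan_join_then_filter,
         "filter_then_join": plan_filter_then_join}
outcomes = {name: plan(users, orders) for name, plan in plans.items()}
costs = {name: cost for name, (_, cost) in outcomes.items()}
best_plan = min(costs, key=costs.get)   # the least-cost plan is selected
print(costs, best_plan)
```

Both plans return the same rows, so the choice between them is purely a matter of cost.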
For an RDD we get many candidate optimization plans, which can be different
DAG operations, each with a cost. These DAG sets are created based on the
datasets. We apply the data, find which DAG has the least cost, select that
DAG, and create the RDD from it. Here, RDD creation is itself an optimization.
The data is in a table, and the table has attributes. If these attributes are
unresolved, rules called Catalyst rules resolve them. Developers optimize the
RDD based on these attributes, and the optimization then takes place in the
Catalyst optimizer. A DataFrame is data organized in tables of named columns,
designed to make data processing easier. DataFrames allow developers to
transform the data. The DataFrame uses the Catalyst tree transformation
framework, which analyses the logical plan and optimizes it. For analysis it
uses Catalyst rules, with which it resolves attributes that are not yet
resolved. It generates many physical plans with the help of physical operators
and selects one of them using the cost model.
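Attribute resolution by rule-based tree transformation can be sketched in a few lines. The node classes (`UnresolvedAttribute`, `Attribute`, `GreaterThan`, `Literal`), the `transform` helper and the catalog contents are hypothetical names invented for this illustration; they mirror the idea of the analysis step but are not Catalyst's actual classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UnresolvedAttribute:
    name: str

@dataclass(frozen=True)
class Attribute:
    name: str
    dtype: str

@dataclass(frozen=True)
class Literal:
    value: object

@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object

def transform(node, rule):
    """Apply `rule` bottom-up to every node and return a new, rewritten tree."""
    if isinstance(node, GreaterThan):
        node = GreaterThan(transform(node.left, rule), transform(node.right, rule))
    return rule(node)

def resolve_with(catalog):
    """Build a rule that replaces unresolved attributes using catalog types."""
    def rule(node):
        if isinstance(node, UnresolvedAttribute):
            if node.name not in catalog:
                raise KeyError(f"cannot resolve column {node.name!r}")
            return Attribute(node.name, catalog[node.name])
        return node
    return rule

catalog = {"age": "int", "name": "string"}   # hypothetical table schema
parsed = GreaterThan(UnresolvedAttribute("age"), Literal(21))
analyzed = transform(parsed, resolve_with(catalog))
print(analyzed)
```

Because the nodes are immutable, the rule produces a new tree and leaves the parsed plan untouched, which is the same property the paper attributes to RDDs.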
V. CONCLUSION
Many studies have aimed at improving the performance of data analytics
frameworks. MapReduce, proposed by Google, provides a distributed framework
for processing across large clusters: the data is divided among multiple
nodes, and each node is assigned a particular task. Hadoop is a
fault-tolerant, cost-effective, flexible and scalable computing solution. It
includes HDFS (Hadoop Distributed File System), a distributed file system that
provides fault tolerance, runs on commodity hardware, and stores large
datasets across multiple machines in a cluster.
Apache Spark is the most commonly used system for in-memory processing of
real-time data: data can be loaded into memory and queried repeatedly. It
provides the RDD abstraction, which supports data lineage. A Resilient
Distributed Dataset (RDD) is a memory abstraction with which in-memory
computations can be performed; it was motivated by the fact that existing
computing frameworks cannot handle iterative algorithms and interactive data
mining tools efficiently. RDDs allow data reuse by letting intermediate
results remain in memory. Defining a programming interface is a challenge for
RDDs. Spark exposes RDDs through a language-integrated API and computes them
lazily.
Data shuffling is a high-I/O operation. What we do instead is a primary memory
lookup operation, which is a DAG operation: a DAG structure is computed for
every operation. In Spark this is done using RDDs, and Spark produces an
optimized RDD. The data is in a table, and the table has attributes. If these
attributes are unresolved, rules called Catalyst rules resolve them.
Developers optimize the RDD based on these attributes, and the optimization
then takes place in the Catalyst optimizer. Hence the resulting RDD is an
optimized RDD.
REFERENCES
Aditya B. Patel, Manashvi Birla, Ushma Nair, “Addressing Big Data
Problem Using Hadoop and Map Reduce,” 2012 Nirma University International
Conference On Engineering, NUiCONE-2012, December 2012.
J. Clerk Maxwell, A Treatise on Electricity and Magnetism, 3rd
ed., vol. 2. Oxford: Clarendon, 1892, pp.68-73.
I.S. Jacobs and C.P. Bean, “Fine particles, thin films and
exchange anisotropy,” in Magnetism, vol. III, G.T. Rado and H. Suhl, Eds. New
York: Academic, 1963, pp. 271-350.
K. Elissa, “Title of paper if known,” unpublished.
R. Nicole, “Title of paper with only first word capitalized,” J.
Name Stand. Abbrev., in press.
Y. Yorozu, M. Hirano, K. Oka, and Y. Tagawa, “Electron
spectroscopy studies on magneto-optical media and plastic substrate interface,”
IEEE Transl. J. Magn. Japan, vol. 2, pp. 740-741, August 1987 Digests 9th
Annual Conf. Magnetics Japan, p. 301, 1982.
M. Young, The Technical Writer’s Handbook. Mill Valley, CA:
University Science, 1989.