Hadoop Performance Evaluation In Cluster Environment
Belay, Fitsum (2017)
Belay, Fitsum
Lahden ammattikorkeakoulu
2017
All rights reserved
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2017112618180
Abstract
With the growth of the internet, a huge amount of data is being produced every second. Companies rely on data analytics to expand their business and to stay competitive in the market. Over time, big data analytics technologies have become more affordable for small companies. Unfortunately, small companies often find it difficult to make the best use of these resources, either because of wrong assumptions about big data or because they cannot meet the infrastructure requirements that big data analysis involves.
There is a general assumption that big data is only for big businesses, which is not true. Companies are often unable to use their existing infrastructure to implement big data analytics and consequently miss an opportunity for growth. The purpose of this study was to encourage small companies to consider big data in their expansion strategies by showing them how big data analytics can assist their business using the existing infrastructure.
One of the objectives of this thesis was to evaluate the performance of a Hadoop cluster in terms of input/output (I/O). This test gives a preliminary idea of how fast the cluster performs in terms of I/O and data throughput. The performance was measured by feeding the cluster data sets of different sizes and by changing the number of DataNodes in the cluster. Throughout the whole process, the Hadoop core components were investigated.
According to the results, a multi-node cluster achieves better average throughput than a single-node Hadoop cluster. It can be concluded that even with inexpensive infrastructure, it is possible to process large volumes of data by optimizing the existing resources.
Several factors affect the performance of a cluster, including the number of files the cluster handles and the processing power of the nodes. However, the network and hardware factors that might degrade performance were not considered in this thesis.