
Big data storing in AWS using HADOOP

Palit, Rajib (2024)

 
File: Palit_Rajib.pdf (1.453 MB)


All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024121034256
Abstract
Big data refers to large, complex datasets comprising structured, semi-structured, and unstructured information, which are difficult to manage and store in conventional storage systems. In this project, big data was stored on Amazon Web Services (AWS), a cloud-based storage platform, using Hadoop (the High Availability Distributed Object Oriented Platform) together with MapReduce. The setup of an Amazon EMR cluster was demonstrated for storing large datasets and performing the required processing, and S3 buckets and EC2 key pairs were also employed extensively throughout the project. Because AWS is a cloud-based storage solution, the size of the dataset introduced no additional complications. Hadoop was used for the EMR cluster creation, which allowed massive amounts of data to be processed quickly. Once the EMR cluster had been created, a large dataset was uploaded to an S3 bucket for further processing, and separate folders were created in the bucket for the output, the code script, and the error logs. A code script demonstrating data retrieval from the large dataset was written using the Python PySpark module. After the job completed, the output and log file were saved in the S3 bucket; a success message in the log file confirmed that the job had finished correctly, and the EMR cluster was terminated once all processing was done. Hadoop, MapReduce, and AWS together simplified the processing of very large amounts of data.
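The retrieval step described above (a PySpark script filtering rows out of a large dataset stored in S3) can be sketched in miniature. The snippet below is an illustrative stand-in, not the thesis's actual script: it implements the same row-filtering logic with Python's standard-library csv module so it runs without a cluster, and the inline comments note the hypothetical PySpark equivalents one would use on EMR. The sample data, column names, and S3 paths are all invented for illustration.

```python
import csv
import io

# Invented sample data standing in for the large dataset uploaded to the S3 bucket.
# On EMR one would instead read from S3 with PySpark, e.g.:
#   df = spark.read.csv("s3://example-bucket/input/dataset.csv", header=True)
RAW = """id,city,amount
1,Helsinki,120
2,Kokkola,75
3,Helsinki,40
"""

def retrieve_rows(text, column, value):
    """Return rows whose `column` equals `value`.

    Mirrors the PySpark filter step: df.filter(df[column] == value).
    """
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row[column] == value]

rows = retrieve_rows(RAW, "city", "Helsinki")
# On EMR, the result would then be written back to the output folder, e.g.:
#   result.write.csv("s3://example-bucket/output/", header=True)
print(len(rows))  # → 2
```

The same pattern scales on EMR because PySpark distributes the filter across the cluster's nodes, while the output and log folders in the S3 bucket collect the results and the job's success or error messages.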
Collections
  • Opinnäytetyöt (Avoin kokoelma)