
Big data storing in AWS using HADOOP

Palit, Rajib (2024)

 
File: Palit_Rajib.pdf (1.453 MB)


All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2024121034256
Abstract
Big data refers to large, complex datasets comprising structured, semi-structured, and unstructured information, which are difficult to manage and store in conventional storage systems. In this project, big data was stored on Amazon Web Services (AWS), a cloud-based storage platform, using Hadoop (the High Availability Distributed Object Oriented Platform) together with MapReduce. The setup of an Amazon EMR cluster was demonstrated for storing large datasets and performing the required processing, and S3 buckets and EC2 key pairs were also employed extensively throughout the project. Because AWS is a cloud-based storage solution, the size of the dataset introduced no additional complications. Hadoop was used for the EMR cluster creation, which allowed massive amounts of data to be processed quickly. Once the EMR cluster had been created, a large dataset was uploaded to an S3 bucket for further processing, and separate folders were created in the bucket for the output, the code script, and the error logs. A code script demonstrating data retrieval from the large dataset was written using the Python PySpark module. After the job completed, the output and log file were saved in the S3 bucket; a success message in the log file confirmed that the job had finished correctly, and the EMR cluster was terminated once all processing was done. Hadoop, MapReduce, and AWS together simplified the processing of very large amounts of data.
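The retrieval step described above (a PySpark script filtering rows out of a large dataset stored in S3) can be sketched in miniature. The snippet below is an illustrative stand-in, not the thesis's actual script: it implements the same row-filtering logic with Python's standard-library csv module so it runs without a cluster, and the inline comments note the hypothetical PySpark equivalents one would use on EMR. The sample data, column names, and S3 paths are all invented for illustration.

```python
import csv
import io

# Invented sample data standing in for the large dataset uploaded to the S3 bucket.
# On EMR one would instead read from S3 with PySpark, e.g.:
#   df = spark.read.csv("s3://example-bucket/input/dataset.csv", header=True)
RAW = """id,city,amount
1,Helsinki,120
2,Kokkola,75
3,Helsinki,40
"""

def retrieve_rows(text, column, value):
    """Return rows whose `column` equals `value`.

    Mirrors the PySpark filter step: df.filter(df[column] == value).
    """
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row[column] == value]

rows = retrieve_rows(RAW, "city", "Helsinki")
# On EMR, the result would then be written back to the output folder, e.g.:
#   result.write.csv("s3://example-bucket/output/", header=True)
print(len(rows))  # → 2
```

The same pattern scales on EMR because PySpark distributes the filter across the cluster's nodes, while the output and log folders in the S3 bucket collect the results and the job's success or error messages.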
Collections
  • Opinnäytetyöt (Avoin kokoelma)