
Designing and developing a data processing pipeline for archiving sensitive human data

Kataja, Teemu (2018)

 


Kataja, Teemu
Metropolia Ammattikorkeakoulu
2018
Creative Commons License
Creative Commons Attribution-NonCommercial-ShareAlike 1.0 Finland
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-201803093204
Abstract
This thesis was done for CSC – IT Center for Science Ltd, which is Finland’s national IT center and one of the 21 ELIXIR Europe partners. ELIXIR is an intergovernmental project of the European Union (EU) that aims to standardize practices and facilitate the creation of a modern and service-oriented bioinformatics infrastructure to enable scientists to do research more efficiently. The work done in this thesis is part of an ELIXIR project work package to create data submission tools for sensitive human data.

The aim of this thesis was to create a scalable and modular Python workflow for processing and archiving transmitted genomic datasets. The project produced a set of self-contained scripts that serve as modules in automated workflows built on Luigi, a Python library that chains tasks into a dataflow pipeline. Before the technical part, the thesis introduces the background of genomic research in Europe and the cumbersome practices currently in place.
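
The task-based workflow described above can be sketched roughly as follows. This is a standard-library imitation of the requires/output/run pattern that Luigi tasks follow; it is neither Luigi's API nor the thesis code. The class names, the two steps (checksum, then archive), and the directory layout are all hypothetical and chosen only for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class Task:
    """Tiny stand-in for a workflow task (an assumption, not Luigi's base class)."""

    def requires(self):
        # Tasks that must be complete before this one runs.
        return []

    def output(self) -> Path:
        raise NotImplementedError

    def complete(self) -> bool:
        # A task counts as done once its output file exists.
        return self.output().exists()

    def run(self) -> None:
        raise NotImplementedError


class ChecksumFile(Task):
    """Hypothetical step: write the SHA-256 digest of a received file."""

    def __init__(self, source: Path, workdir: Path):
        self.source = source
        self.workdir = workdir

    def output(self) -> Path:
        return self.workdir / (self.source.name + ".sha256")

    def run(self) -> None:
        self.workdir.mkdir(parents=True, exist_ok=True)
        digest = hashlib.sha256(self.source.read_bytes()).hexdigest()
        self.output().write_text(digest)


class ArchiveFile(Task):
    """Hypothetical step: copy the file to an archive directory after checksumming."""

    def __init__(self, source: Path, workdir: Path, archive_dir: Path):
        self.source = source
        self.workdir = workdir
        self.archive_dir = archive_dir

    def requires(self):
        return [ChecksumFile(self.source, self.workdir)]

    def output(self) -> Path:
        return self.archive_dir / self.source.name

    def run(self) -> None:
        self.archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(self.source, self.output())


def build(task: Task, log=None):
    """Run dependencies first and skip completed tasks; return names of tasks run."""
    log = [] if log is None else log
    if task.complete():
        return log
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)
    return log


def demo():
    """Build the same pipeline twice in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "dataset.bin"
        source.write_bytes(b"example payload")
        task = ArchiveFile(source, root / "work", root / "archive")
        first = build(task)   # runs the checksum step, then the archive step
        second = build(task)  # nothing left to do, output already exists
        return first, second
```

Because each step only declares its inputs and its output, a step can be swapped or a new one inserted without touching the others, which is the kind of modularity the abstract attributes to the workflow. Re-running the pipeline skips work that is already done.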

The work consisted of designing and developing entirely new software components that will run on supercomputers to process and archive genomic datasets. Design and development were carried out in CSC’s cloud computing environment, with version control on GitHub. Because of the nature of EU projects, the GitHub repository containing the scripts is public and can be viewed by anyone. Direct links can be found in the references and appendices of the thesis.

The scripts were tested at the end of the project, both in mock-up end-to-end testing and in a real dataset transmission. They were deemed to function well, and because they are well documented, the data processing pipeline should be easy to maintain. The modular, object-oriented design also leaves room for future changes. This set of scripts will replace an outdated set of shell scripts that were used to archive datasets. The old scripts were undocumented and hard to read, and required the effort of three maintenance engineers. The thesis therefore creates immediate value by freeing up that workforce, and the scripts are also available to all ELIXIR partners in Europe.
Collections
  • Theses
Theses and publications of the universities of applied sciences