Designing and developing a data processing pipeline for archiving sensitive human data
Kataja, Teemu (2018)
Kataja, Teemu
Metropolia Ammattikorkeakoulu
2018

Creative Commons Attribution-NonCommercial-ShareAlike 1.0 Suomi
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-201803093204
Abstract
This thesis was done for CSC – IT Center for Science Ltd, which is Finland’s national IT center and one of the 21 ELIXIR Europe partners. ELIXIR is an intergovernmental project of the European Union (EU) that aims to standardize practices and facilitate the creation of a modern and service-oriented bioinformatics infrastructure to enable scientists to do research more efficiently. The work done in this thesis is part of an ELIXIR project work package to create data submission tools for sensitive human data.
The aim of this thesis was to create a scalable and modular Python workflow for processing and archiving transmitted genomic datasets. The project produced a set of self-contained scripts that are used as modules in automated workflows built on the Python library Luigi, which provides a convenient pipeline for a task-based dataflow. Before the technical part of the thesis, the background of genomic research in Europe is introduced, along with the cumbersome practices that are currently in place.
The work consisted of designing and developing completely new software components that will be used on supercomputers to process and archive genomic datasets. Software design and development were conducted in CSC’s cloud computing environment, and version control was handled with GitHub. Due to the nature of EU projects, the GitHub repository that contains the produced scripts is public and can be viewed by anyone. Direct links can be found in the references and appendices.
The scripts created in this thesis were tested at the end of the project in mock-up end-to-end testing as well as in a real dataset transmission. The scripts were deemed to function well, and because they are well documented, maintaining the data processing pipeline will be easy. Future changes are also possible thanks to the modular, object-oriented (OOP) design. This set of scripts will replace an outdated set of shell scripts that were used to archive datasets. The old scripts were undocumented and hard to read, and they required the work of three maintenance engineers. Thus, this thesis creates immediate value by freeing up workforce for other tasks, and the scripts are also available to all ELIXIR partners in Europe.
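The task-based dataflow described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis repository and does not use Luigi itself; it is a standard-library stand-in that only imitates the requires/output/run pattern Luigi tasks follow. The class names (`ReceiveDataset`, `ChecksumDataset`, `ArchiveDataset`), the file names, and the `build` helper are all hypothetical and chosen purely for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class Task:
    """Hypothetical stand-in for a workflow task (not Luigi's actual API)."""

    def __init__(self, workdir: Path):
        self.workdir = workdir

    def requires(self):
        # Tasks that must be complete before this one runs.
        return []

    def output(self) -> Path:
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self) -> bool:
        # A task counts as done once its output file exists.
        return self.output().exists()


class ReceiveDataset(Task):
    """Simulates a transmitted dataset by writing a placeholder file."""

    def output(self):
        return self.workdir / "dataset.txt"

    def run(self):
        self.output().write_text("example genomic payload\n")


class ChecksumDataset(Task):
    """Stores a SHA-256 checksum next to the received dataset."""

    def requires(self):
        return [ReceiveDataset(self.workdir)]

    def output(self):
        return self.workdir / "dataset.txt.sha256"

    def run(self):
        data = (self.workdir / "dataset.txt").read_bytes()
        self.output().write_text(hashlib.sha256(data).hexdigest())


class ArchiveDataset(Task):
    """Copies the checked dataset into an (assumed) archive directory."""

    def requires(self):
        return [ChecksumDataset(self.workdir)]

    def output(self):
        return self.workdir / "archive" / "dataset.txt"

    def run(self):
        self.output().parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(self.workdir / "dataset.txt", self.output())


def build(task, executed=None):
    """Run dependencies first, skip completed tasks, return names of tasks run."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(build(ArchiveDataset(Path(tmp))))
```

Because each task declares only its own dependencies and output, a step can be replaced or a new one added without touching the others, which is the kind of modularity the abstract attributes to the pipeline. Running `build` a second time on the same directory does nothing, since every output already exists.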