QUALITY CONTROL IN GPCR PROTEIN FAMILY LABELING AS A DATA MINING PROBLEM : Knowledge engineering for bioinformatics

Kolobaeva, Aleksandra

Kaakkois-Suomen ammattikorkeakoulu

2017

The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-2017052610464

Abstract

Biology and medicine are becoming strongly data-dependent sciences in which advances are, more than ever, based on data acquired by sophisticated machinery and methods. This is especially true of bioinformatics, which deals with -omics data, including proteomics, the field of the current project.

This study addressed quality control of curated protein databases and can therefore be considered a knowledge engineering problem.

G protein-coupled receptors (GPCRs) are a large super-family of cell membrane proteins of interest to biology in general and pharmacology in particular. One of its families, class C, is of specific interest to pharmacology and drug design. This family is known to be quite heterogeneous, and discriminating between its sub-families is a difficult problem, as it must rely on their primary amino acid sequences. We were interested not so much in sub-family discrimination with a standard classification approach per se as in exploring sequence misclassification behavior. More precisely, we used well-known data mining classification techniques to isolate sequences that were misclassified very often and that were almost always assigned to the same wrong sub-family.
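
The idea of isolating consistently misclassified sequences can be illustrated with a minimal sketch. It is not the procedure used in the thesis: the function name, the two thresholds, the sequence identifiers and the sub-family labels ("mGlu", "CaSR", "GABAB") are all assumptions chosen for illustration, and the predictions are taken as given, for example from repeated cross-validation runs of any classifier.

```python
from collections import Counter


def consistent_misclassifications(true_labels, predictions_per_run,
                                  min_error_rate=0.8, min_consistency=0.9):
    """Flag sequences that are often misclassified into one wrong sub-family.

    true_labels: dict mapping sequence id -> label stored in the database.
    predictions_per_run: list of dicts, one per classifier run, mapping
        sequence id -> predicted label.
    Both thresholds are hypothetical defaults, not values from the thesis.
    """
    flagged = {}
    for seq_id, true_label in true_labels.items():
        # Collect every prediction made for this sequence across runs.
        preds = [run[seq_id] for run in predictions_per_run if seq_id in run]
        if not preds:
            continue
        wrong = [p for p in preds if p != true_label]
        error_rate = len(wrong) / len(preds)
        if not wrong or error_rate < min_error_rate:
            continue
        # Share of the wrong predictions that go to the single most common
        # wrong label: high values mean the errors are consistent.
        top_label, top_count = Counter(wrong).most_common(1)[0]
        consistency = top_count / len(wrong)
        if consistency >= min_consistency:
            flagged[seq_id] = (top_label, error_rate, consistency)
    return flagged


# Illustrative data only: three made-up sequences over ten runs.
true_labels = {"seq_A": "mGlu", "seq_B": "mGlu", "seq_C": "CaSR"}
runs = [
    {
        "seq_A": "mGlu" if i == 0 else "GABAB",      # wrong 9/10, same label
        "seq_B": "mGlu",                             # always correct
        "seq_C": "mGlu" if i % 2 == 0 else "GABAB",  # always wrong, but split
    }
    for i in range(10)
]
flagged = consistent_misclassifications(true_labels, runs)
```

In this toy data only seq_A is flagged: it is misclassified in most runs and always into the same wrong sub-family, which is the pattern that might prompt a curator to re-examine its label. seq_C is wrong just as often, but its errors are spread over two sub-families, so it does not meet the consistency threshold.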

I hope that this work will be a useful step towards assisting protein database curators in their quality control duties by providing them with knowledge management tools.

Collections

Theses