QUALITY CONTROL IN GPCR PROTEIN FAMILY LABELING AS A DATA MINING PROBLEM : Knowledge engineering for bioinformatics
Kolobaeva, Aleksandra (2017)
Kolobaeva, Aleksandra
Kaakkois-Suomen ammattikorkeakoulu
2017
All rights reserved
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2017052610464
Abstract
Biology and medicine are becoming strongly data-dependent sciences in which advances are, more than ever, based on data acquired by sophisticated machinery and methods. This is especially true of bioinformatics, which deals with -omics data, including proteomics, the field of the current project.
This study addressed quality control of curated protein databases and can therefore be considered a knowledge engineering problem.
G protein-coupled receptors (GPCRs) are a large super-family of cell membrane proteins of interest to biology in general and pharmacology in particular. One of its families, class C, is of specific interest to pharmacology and drug design. This family is known to be quite heterogeneous, and discriminating between its several sub-families is a difficult problem because it must rely on their primary amino acid sequences. We were interested not so much in sub-family discrimination using a standard classification approach per se as in exploring sequence misclassification behavior. More precisely, we used well-known data mining classification techniques to isolate sequences that were very often misclassified and almost always assigned to the same wrong sub-family, that is, consistently misclassified.
I hope that this work will be a useful step towards assisting protein database curators in their quality control duties by providing them with knowledge management tools.
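The idea of isolating consistently misclassified sequences can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the amino acid composition features, the nearest-centroid classifier, the repeated random train/test splits, the two thresholds, and the toy sequences (which are invented, not real GPCR data). A sequence is flagged when it is misclassified in most of the rounds in which it is tested and the wrong predictions mostly agree on a single sub-family.

```python
import random
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

def nearest_centroid(train, query):
    """Predict the label whose mean feature vector is closest to `query`."""
    sums, sizes = {}, Counter()
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        sizes[label] += 1

    def dist(label):
        return sum((s / sizes[label] - q) ** 2
                   for s, q in zip(sums[label], query))

    return min(sums, key=dist)

def consistent_misclassifications(seqs, labels, rounds=50, test_frac=0.3,
                                  min_error_rate=0.8, min_agreement=0.8,
                                  seed=0):
    """Return (index, curated label, most frequent wrong label, error rate)
    for sequences that are misclassified often and to the same wrong label."""
    rng = random.Random(seed)
    feats = [composition(s) for s in seqs]
    tested = Counter()            # how often each sequence was held out
    wrong = defaultdict(Counter)  # wrong predictions per held-out sequence
    idx = list(range(len(seqs)))
    for _ in range(rounds):
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))
        test, train = idx[:cut], idx[cut:]
        train_set = [(feats[i], labels[i]) for i in train]
        for i in test:
            tested[i] += 1
            pred = nearest_centroid(train_set, feats[i])
            if pred != labels[i]:
                wrong[i][pred] += 1
    flagged = []
    for i, preds in wrong.items():
        n_wrong = sum(preds.values())
        top_label, top_count = preds.most_common(1)[0]
        if (n_wrong / tested[i] >= min_error_rate
                and top_count / n_wrong >= min_agreement):
            flagged.append((i, labels[i], top_label, n_wrong / tested[i]))
    return sorted(flagged)

# Toy data: sub-family "A" is rich in A/L, sub-family "B" in K/E.
# The last sequence looks like "B" but carries the curated label "A".
seqs = ["AAALLLAALL", "ALALALALAA", "LLLAAALLAL", "AALLAALLAA", "LALALLAALA",
        "KKEEKKEEKE", "EKEKEKEKKK", "KEKKEEKEKE", "EEKKEEKKEK", "KKKEEEKKEE",
        "KEKEKKEEKK"]
labels = ["A"] * 5 + ["B"] * 5 + ["A"]

flagged = consistent_misclassifications(seqs, labels)
for index, curated, suggested, rate in flagged:
    print(index, curated, suggested, rate)
```

In this sketch a flagged sequence is only a candidate for review: a consistent disagreement between the classifier and the curated label may point to a labeling error, but it may equally reflect the limits of the chosen features or classifier.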
