Data mining on protein sequences: n-gram analysis of ordered and disordered protein regions


dc.contributor.advisor Mitić, Nenad
dc.contributor.author Alshafah, Samira
dc.date.accessioned 2018-12-13T16:41:54Z
dc.date.available 2018-12-13T16:41:54Z
dc.date.issued 2018
dc.identifier.uri http://hdl.handle.net/123456789/4746
dc.description.abstract Proteins with intrinsically disordered regions are involved in a large number of key cell processes, including signaling, transcription, and chromatin remodeling. On the other hand, such proteins have been observed in people suffering from neurological and cardiovascular diseases, as well as various malignancies. Experimentally determining disordered regions in proteins is a very expensive and long-term process. As a consequence, various computer programs for predicting the positions of disordered regions in proteins have been developed and are constantly being improved. In this thesis, a new method for determining amino acid sequences that characterize ordered/disordered regions is presented. The material used in the research includes 4076 viruses with more than 190000 proteins. The proposed method is based on defining a correspondence between the characteristics of n-grams (including both repeats and palindromic sequences) and their belonging to ordered/disordered protein regions. Positions of ordered/disordered regions are predicted using three different predictors. The features of the repetitive strings used in the research include mole fractions, fractional differences, and z-values. In addition, the data mining techniques of association rules and classification were applied to both repeats and palindromes. The results obtained by all techniques show a high level of agreement for short sequences of length less than 6, while the level of agreement grows up to the maximum as the length of the sequences increases. The high reliability of the results obtained by the data mining techniques shows that there are n-grams, both repeating sequences and palindromes, which uniquely characterize the disordered/ordered regions of proteins. The obtained results were verified by comparison with results based on n-grams from the DisProt database, which contains the positions of experimentally verified disordered regions of proteins. The results can be used both for the fast localization of disordered/ordered regions in proteins and for further improving existing programs for their prediction. en_US
dc.description.provenance Submitted by Slavisha Milisavljevic (slavisha) on 2018-12-13T16:41:54Z No. of bitstreams: 1 ThesisSamira_Alshafah.pdf: 3106746 bytes, checksum: 1b8ab175aa8f27e8329b10d92a26ee16 (MD5) en
dc.description.provenance Made available in DSpace on 2018-12-13T16:41:54Z (GMT). No. of bitstreams: 1 ThesisSamira_Alshafah.pdf: 3106746 bytes, checksum: 1b8ab175aa8f27e8329b10d92a26ee16 (MD5) Previous issue date: 2018 en
dc.language.iso en en_US
dc.publisher Beograd en_US
dc.title Data mining on protein sequences: n-gram analysis of ordered and disordered protein regions en_US
mf.author.birth-date 1978-12-29
mf.author.birth-place Zawia en_US
mf.author.birth-country Libya en_US
mf.author.residence-state Libya en_US
mf.author.citizenship Libya en_US
mf.author.nationality Libya en_US
mf.subject.area Computer Science en_US
mf.subject.keywords n-gram, data mining, ordered/disordered regions, association rules, proteins en_US
mf.subject.subarea Data Mining en_US
mf.contributor.committee Malkov, Saša
mf.contributor.committee Beljanski, Miloš
mf.university.faculty Mathematics faculty en_US
mf.document.pages 111 en_US
mf.document.location Beograd en_US
mf.document.genealogy-project No en_US
mf.university Belgrade en_US
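
The n-gram features named in the abstract (repeats, palindromes, and per-region fractions) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names, the per-residue label string `mask` ('O' for ordered, 'D' for disordered), and the fractional-difference formula (f_D - f_O) / f_O are assumptions made for illustration only; the record itself does not define these quantities.

```python
from collections import Counter


def is_palindrome(s):
    """True if the string reads the same forwards and backwards."""
    return s == s[::-1]


def region_counts(seq, mask, n):
    """Count overlapping n-grams that lie entirely in one region type.

    `mask` is a hypothetical per-residue label string of the same length
    as `seq`: 'O' = ordered, 'D' = disordered (e.g. from a predictor).
    n-grams that straddle a region boundary are skipped.
    """
    counts = {"O": Counter(), "D": Counter()}
    for i in range(len(seq) - n + 1):
        labels = set(mask[i:i + n])
        if len(labels) == 1:
            counts[labels.pop()][seq[i:i + n]] += 1
    return counts


def fractions(counter):
    """Relative frequency of each n-gram within one region type."""
    total = sum(counter.values())
    return {g: c / total for g, c in counter.items()} if total else {}


def fractional_difference(f_dis, f_ord, gram):
    """Assumed definition: (f_D - f_O) / f_O; None if f_O is zero."""
    fo = f_ord.get(gram, 0.0)
    if fo == 0.0:
        return None
    return (f_dis.get(gram, 0.0) - fo) / fo


# Toy input (made up for illustration, not data from the thesis).
seq = "AKKAAKKAGG"
mask = "DDDDDDDDOO"
counts = region_counts(seq, mask, 4)
repeats = [g for g, c in counts["D"].items() if c >= 2]
palindromes = [g for g in counts["D"] if is_palindrome(g)]
f_dis = fractions(counts["D"])
```

Skipping n-grams that cross a region boundary is one possible design choice; assigning them by majority label would be another.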

Files in this item

Files Size Format
ThesisSamira_Alshafah.pdf 3.106Mb PDF
