Zusammenfassung:
|
Application of association rule and support vector machine
technique for T cell epitope prediction
Abstract: Data mining is an interdisciplinary subfield of computer science that draws on
various scientific disciplines, such as database systems, statistics, machine
learning, and artificial intelligence. The main task of data mining is the automatic
and semi-automatic analysis of large quantities of data to extract previously
unknown, nontrivial, and interesting patterns. Rapid development in the fields of
immunology, genomics, proteomics, molecular biology, and other related areas has
caused a large increase in biological data. Drawing conclusions from these data requires
sophisticated computational analyses. Without automatic extraction methods,
it is almost impossible to investigate and analyze these data.
Currently, one of the most active problems in immunoinformatics is T-cell
epitope identification. Identification of T-cell epitopes, especially dominant T-cell
epitopes widely represented in the population, is of immense relevance to vaccine
development and to detecting immunological patterns characteristic of autoimmune
diseases. Epitope-based vaccines are of great importance in combating infectious and
chronic diseases and various types of cancer. Experimental methods for identifying
T-cell epitopes are expensive, time-consuming, and not applicable to large-scale
research (especially not to choosing the optimal group of epitopes for
vaccines that cover the whole population or for personalized vaccines).
Computational and mathematical models for T-cell epitope prediction, based
on MHC-peptide binding, are crucial for the systematic investigation and
identification of T-cell epitopes in large datasets and for complementing expensive and
time-consuming experimentation [16]. T cells (T lymphocytes) recognize protein
antigens only when they are degraded into peptide fragments and complexed with Major
Histocompatibility Complex (MHC) molecules on the surface of antigen-presenting
cells [1]. The binding of these peptides (potential epitopes) to MHC molecules and
their presentation to T cells is a crucial (and the most selective) step in both cellular
and humoral adaptive immunity. Numerous methodologies for identifying these
epitopes currently exist.
In this PhD thesis, the methods discussed are based exclusively on peptide sequence
binding to MHC molecules. The thesis describes existing methodologies for T-cell epitope
prediction, their shortcomings, and some of the available databases
of experimentally determined linear T-cell epitopes. New models for T-cell
epitope prediction are developed using data mining techniques, and an extensive analysis
is carried out of whether disorder and hydropathy prediction methods can help in
understanding epitope processing and presentation. Accurate computational
prediction of T-cell epitopes, which is the aim of this thesis, can greatly expedite epitope
screening by reducing costs and experimental effort. The thesis deals with
predictive data mining tasks (classification and regression) and descriptive data
mining tasks (clustering, association rules, and sequence analysis).
The newly developed models, which are the main contribution of the dissertation, are
comparable in performance to the best currently existing methods, and even better
in some cases. The developed models are based on the support vector machine
technique for classification and regression problems. A new approach to extracting
the most important physicochemical properties that influence the classification
of MHC-binding ligands is also presented. For that purpose, new
clustering-based classification models built on the k-means clustering
technique are developed.
The second part of the thesis concerns the establishment of rules and associations
for T-cell epitopes that belong to different protein structures. The task of this part
of the research was to find out whether disorder and hydropathy prediction methods
could help in understanding epitope processing and presentation. The results of
applying an association rule technique and of a thorough analysis of a large
protein dataset, in which T-cell epitopes, protein structure, and hydropathy were
determined computationally using publicly available tools, are presented. During
the research for this thesis, a new extendable open-source software system was
developed that supports bioinformatics research and has wide applications in
predicting various protein characteristics.
Parts of this thesis are described in the works [71][82][45][42][43][44][72][73], which
have been published or submitted for publication in several journals. The dissertation is
organized as follows:
Section 1 introduces the problem of identifying T-cell epitopes,
the importance of mathematical and computational methods in this area,
as well as the importance of T-cell epitopes to the immune system and the basis of
its functioning.
Section 2 describes in detail the data mining techniques used in the
thesis to develop the new models.
Section 3 provides an overview of existing methods for predicting T-cell
epitopes and explains how they work. It points out the shortcomings of
existing methods, which motivated the development of new models
for T-cell epitope prediction. Some of the publicly available databases of
experimentally determined MHC-binding peptides and T-cell epitopes are
also described.
Section 4 presents the newly developed models for epitope prediction. The
developed models include three new encoding schemes that represent peptide
sequences as vectors, a form more suitable as input to models based
on data mining techniques.
Section 5 reports the results of the new classification and regression models.
The new models are compared with each other as well as with currently existing
methods for T-cell epitope prediction.
Section 6 presents the research results on the relationship between T-cell epitopes
and ordered and disordered regions in proteins. This chapter summarizes
results that are presented in more detail in the published works
[71][82][45][44].
Section 7 concludes the dissertation with a discussion of the potential
significance of the obtained results and some directions for future work. |