ONE: Obtaining training data: The Experimental Design
The first work package aims to define the experimental variables and to design
the overall procedure and guidelines for obtaining the training data from the
patients involved. Because the accuracy of the treatment response prediction
model is critical in clinical settings, the variables involved in obtaining the
training data, such as the type or stage of cancer or the treatment being
administered, must be defined precisely; doing so also helps drive the
application of stratified, and even
For the first task, the experimental
variables are defined as tissues sampled from the primary as well as the
secondary tumors of stage IV colorectal cancer patients receiving chemotherapeutic
regimens containing fluorouracil (5-FU).
Tissues will be sampled before treatment initiation, which is therefore the time point
from which the treatment response of future patients will be predicted.
The second task further defines the type of
input data to obtain from the tissue samples. The BayCount model will be applied, which factorizes a gene
expression matrix to infer the heterogeneous subclones present in each tissue
sample. Using RNA-Seq counts and a negative binomial model, it first estimates
the total number of subclones across all samples by means of maximum
likelihood. It then calculates the proportion of each subclone in
every sample, as well as the gene expression pattern within each
subclone, while accounting for systematic variation and gene-specific
bias, which yields a normalized version of the data.
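The decomposition described above can be pictured as a mixture: the expected
expression of a gene in a bulk sample is the proportion-weighted sum of its
expression in each subclone. The following is a minimal sketch of that forward
relationship only, using made-up numbers. It is not BayCount itself, and the
names `expected_bulk_expression`, `subclone_expression` and `proportions` are
hypothetical.

```python
def expected_bulk_expression(subclone_expression, proportions):
    """Mix per-subclone expression profiles into one expected bulk profile.

    subclone_expression: dict of subclone name -> list of per-gene expression
    proportions: dict of subclone name -> fraction of the sample (must sum to 1)
    """
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("subclone proportions must sum to 1")
    n_genes = len(next(iter(subclone_expression.values())))
    bulk = [0.0] * n_genes
    for clone, profile in subclone_expression.items():
        weight = proportions.get(clone, 0.0)
        for gene_index, value in enumerate(profile):
            bulk[gene_index] += weight * value
    return bulk


# Illustrative (made-up) values: two subclones, three genes.
subclone_expression = {
    "clone_A": [100.0, 10.0, 0.0],
    "clone_B": [20.0, 50.0, 40.0],
}
proportions = {"clone_A": 0.75, "clone_B": 0.25}
bulk = expected_bulk_expression(subclone_expression, proportions)
```

The factorization runs this relationship in reverse: given only the bulk
counts of many samples, it estimates both the subclone profiles and their
proportions.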
The third task concerns labeling the training data. After completion of the
treatment plan, each patient’s response will be assessed using the RECIST
criteria, which classify it as a complete response (CR), a partial
response (PR), progressive disease (PD) or stable disease (SD). Patients
graded as PD or SD will be sampled again, as was done before the treatment, to
identify, and also validate, which of the subclones present before the treatment
survived the regimen. These subclones will be labeled as “resistant”, and
all other subclones that appear to have disappeared after treatment in the same
patient will be labeled as “sensitive”. These labeled subclones constitute our training data. Because
BayCount reports the subclonal proportions for each patient, we can
also investigate whether the resistance of a subclone depends on its proportion
as well as its expression.
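The labeling rule above could be sketched as follows. This is an illustration
under stated assumptions, not a fixed protocol: it assumes that subclones have
already been matched between the pre- and post-treatment samples of the same
patient, and the function name `label_subclones` and the detection threshold
of 0.01 are hypothetical choices.

```python
def label_subclones(pre_proportions, post_proportions, detection_threshold=0.01):
    """Label each pre-treatment subclone as 'resistant' or 'sensitive'.

    A subclone counts as 'resistant' if it is still detectable after treatment.
    The detection threshold is an assumption made for this sketch.
    """
    labels = {}
    for clone in pre_proportions:
        survived = post_proportions.get(clone, 0.0) >= detection_threshold
        labels[clone] = "resistant" if survived else "sensitive"
    return labels


# Made-up proportions for one patient graded PD or SD.
pre = {"clone_A": 0.60, "clone_B": 0.30, "clone_C": 0.10}
post = {"clone_A": 0.00, "clone_B": 0.85, "clone_C": 0.15}
labels = label_subclones(pre, post)
```

Keeping the pre-treatment proportions next to the labels makes it possible to
ask later whether small subclones are more often resistant than large ones.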
TWO: Feature selection and data processing
The second work package focuses on feature selection, a
procedure that narrows down the number of features, in this case genes, to be
used as input. Some of the genes measured are irrelevant to the
conditions under analysis; keeping them aggravates the “curse of dimensionality”,
increases computational costs and introduces noise into the data.
Here, genes are selected manually so that the classifier is trained on the most
relevant ones.
Accordingly, we can use prior knowledge to narrow down the
number of genes, for example to those involved in the cell cycle, since we are
studying cancerous cells. Commonly mutated genes in cancer include oncogenes such as the
RAS genes, tumor suppressor genes such as TP53, and DNA repair genes. We
can also include genes that are possible targets of the chemotherapy. In
this case, previous studies have shown that amplification of the thymidylate
synthase gene renders human colon cancer cell lines resistant to 5-FU
drugs, whose mechanism depends on acting as a pyrimidine
analog antimetabolite that inhibits the synthesis of deoxythymidine monophosphate (dTMP),
eventually interrupting DNA synthesis.
We can likewise include genes whose mutations are currently known to
be useful biomarkers for colorectal cancer, such as
APC and beta-catenin, both of which are involved in the Wnt signaling
pathway, and BRAF, which is involved in the MAPK pathway; stimulation of the first pathway
activates the other.
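A knowledge-driven selection of this kind amounts to filtering the expression
data down to a curated gene panel. The sketch below shows one way to do that.
The panel is illustrative only: the gene symbols (KRAS and NRAS for the RAS
family, TYMS for thymidylate synthase, CTNNB1 for beta-catenin) are examples
chosen here, not a final list, and `select_features` is a hypothetical helper.

```python
# Illustrative panel built from the gene classes named in the text.
PRIOR_KNOWLEDGE_PANEL = {"KRAS", "NRAS", "TP53", "TYMS", "APC", "CTNNB1", "BRAF"}


def select_features(expression_by_gene, panel=PRIOR_KNOWLEDGE_PANEL):
    """Keep only panel genes and report panel genes absent from the data."""
    selected = {
        gene: values
        for gene, values in expression_by_gene.items()
        if gene in panel
    }
    missing = sorted(panel - set(expression_by_gene))
    return selected, missing


# Made-up expression values for two subclones.
expression = {
    "TP53": [5, 7],
    "GAPDH": [900, 880],
    "TYMS": [40, 12],
    "BRAF": [3, 4],
}
selected, missing = select_features(expression)
```

Reporting the missing panel genes is a deliberate choice: a gene that was
expected but never measured would otherwise drop out of the model silently.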
THREE: Model selection
In this section, we apply several well-known machine learning
models to classify our data. Because the training data
obtained from the patients are already labeled as either “resistant” or
“sensitive”, the models can be trained in a supervised manner. Unlike
unsupervised learning, which aims to discover
unknown classes from the inherent variation of the data, a supervised algorithm can use the
labels to produce a better-fitting classification model
for the patients whose treatment response will later be predicted.
There are two main tasks in this work package. The first will choose
and train different algorithms to accurately classify our data, while the
second will examine their performance in order to select the best one to apply
to our test data.
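One common way to carry out the second task is k-fold cross-validation, where
each candidate model is repeatedly trained on part of the labeled data and
scored on the held-out remainder. The sketch below assumes this evaluation
scheme, which the text does not prescribe. The nearest-centroid classifier is
only a placeholder standing in for the real candidates, and all function names
and data values are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(train_fn, X, y, k=5, seed=0):
    """Mean held-out accuracy of the classifier returned by train_fn."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predict = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def train_nearest_centroid(X, y):
    """Placeholder classifier: predict the label of the closest class mean."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        def squared_distance(label):
            return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
        return min(centroids, key=squared_distance)

    return predict


# Made-up, well-separated toy data: two features per subclone.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.3], [5.1, 5.0], [4.9, 4.7]]
y = ["sensitive"] * 5 + ["resistant"] * 5
accuracy = cross_validated_accuracy(train_nearest_centroid, X, y, k=5)
```

Because `cross_validated_accuracy` only needs a function that returns a
predictor, each candidate algorithm can be scored on the same folds and the
results compared directly.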
One of the most commonly used supervised algorithms is the support
vector machine (SVM). It has the advantage of being able to compute
both linear and non-linear classifications while limiting over-fitting and
retaining its ability to generalize. It is well founded mathematically and
can perform with high accuracy, especially given a lot of training data. It is
also a discriminative approach to learning: it is better suited to predicting
classes than to explaining the reasons behind the classification. SVMs work only with labeled data and focus on the data
points, called support vectors, that lie closest to the boundary between the classes;
the separating hyperplane is chosen to maximize its distance from these points.
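For reference, the maximum-margin idea can be written as the standard
textbook soft-margin problem below. This is the generic formulation, not
necessarily the exact variant that will be used here. The notation is an
assumption of this sketch: $x_i$ is the feature vector of subclone $i$,
$y_i \in \{-1, +1\}$ encodes its label (for example “sensitive” and
“resistant”), and $C > 0$ controls the penalty on margin violations.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad  & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i,
                          \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

The margin between the classes has width $2 / \lVert w \rVert$, so minimizing
$\lVert w \rVert$ maximizes it. Only the points with an active constraint, the
support vectors, determine the solution.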
For our high dimensional data,
Another commonly used learning algorithm is the random forest
(RF), which, like the SVM, performs with high accuracy. RF is a
variation of the decision tree methodology, a greedy approach to
classification or regression tasks. The power of RF lies in that it repeatedly
draws random samples from the training data, with replacement, as well as random subsets of the
variables, and grows numerous trees from those subsets. It
then classifies the data based on the individual “votes”, or the averaged value,
of those weak trees, giving a robust result and a model that is less prone
to over-fitting.
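The bootstrap-and-vote mechanism can be illustrated with a deliberately
simplified sketch. It is not a full random forest: each “tree” here is a
single-threshold stump on one randomly chosen variable, whereas a real RF
grows deeper trees and samples variables at every split. The function names
and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature):
    """Fit a depth-1 tree: one threshold on one feature, chosen greedily."""
    values = sorted({row[feature] for row in X})
    best = None
    # Candidate thresholds are midpoints between consecutive observed values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(lab != left_label for lab in left)
        errors += sum(lab != right_label for lab in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the feature is constant in this bootstrap sample
        label = majority(y)
        return (feature, 0.0, label, label)
    return (feature, best[1], best[2], best[3])


def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_samples) for _ in range(n_samples)]  # with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([X[i] for i in sample], [y[i] for i in sample], feature))
    return forest


def predict_forest(forest, x):
    """Classify x by majority vote over all stumps."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Made-up, well-separated toy data: two variables per subclone.
X = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5], [0.3, 0.7], [0.6, 0.1],
     [5.2, 5.9], [5.7, 5.1], [5.4, 5.5], [5.9, 5.3], [5.1, 5.8]]
y = ["sensitive"] * 5 + ["resistant"] * 5
forest = fit_forest(X, y)
low_prediction = predict_forest(forest, [0.5, 0.5])
high_prediction = predict_forest(forest, [5.5, 5.5])
```

Each stump sees a different resample of the data and a different variable, so
their individual errors tend to differ; the vote averages those errors out,
which is the source of the robustness described above.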