The mechanism of leukemia, as in other cancers, is closely related to gene expression levels. For example, chronic myelogenous leukemia (CML) is caused by the constant activation of a tyrosine kinase due to gene mutations. Identifying the proteins associated with a disease, or the disease-associated genes, provides clues for developing effective drugs against it. In the case of CML, this led to the tyrosine kinase inhibitor imatinib, an early example of a molecularly targeted drug.
In general, when a genotype based on gene expression levels shares certain features with a patient's phenotype, those features are considered to be the disease-associated genes. Based on this idea, in 1999 Golub et al. proposed classifying diseases using genotypes as features [1].
Determining the association between genotypes and phenotypes is essentially an unsupervised learning problem or, in more recent terms, a self-supervised one. Here we use clustering, one of the basic methods of unsupervised learning, to classify the phenotypes of patients based on their genotypes.
We use the dataset of acute lymphocytic leukemia (ALL) and acute myeloid leukemia (AML) by Golub et al. [1]. Specifically, we use the training set, which consists of the expression levels of 7,129 genes from 27 ALL and 11 AML patients.
We define a
More generally, for the
After constructing the first cluster, the distance between each vector that does not belong to the cluster and the cluster is calculated, and a new cluster
We use Ward's definition as the linkage [2]. That is, when we newly construct a cluster
where
We see that the definition of Ward linkage is a natural extension of the Euclidean distance.
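For reference, one widely used way to write Ward linkage is the Lance-Williams recurrence below. Treating it as the form intended here is an assumption, and the symbols $s$, $t$, $v$ and $n$ are introduced only for this sketch. When clusters $s$ and $t$ are merged, the distance from the merged cluster to any other cluster $v$ is updated as

```latex
\[
d(s \cup t,\, v) =
\sqrt{\frac{(n_s + n_v)\, d(s, v)^2 + (n_t + n_v)\, d(t, v)^2 - n_v\, d(s, t)^2}
           {n_s + n_t + n_v}}
\]
% n_c is the number of vectors in cluster c.
% Equivalent closed form in terms of the cluster centroids \mu_A, \mu_B:
\[
d(A, B) = \sqrt{\frac{2\, n_A n_B}{n_A + n_B}}\; \lVert \mu_A - \mu_B \rVert
\]
```

With $n_A = n_B = 1$ the prefactor in the closed form equals 1, so the distance between two single vectors is exactly their Euclidean distance, which is the sense in which the linkage extends the metric.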
Continuing the operation of constructing a new cluster from the vectors or clusters in the nearest neighborhood using the metric and the linkage in this way, we obtain a cluster
Here we implement the algorithm with NumPy in order to understand how it works.
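First we define a dictionary C that records the number of vectors in each cluster. Initially every vector forms its own cluster of size 1.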
import numpy as np

# C maps each cluster index to its size; X has one row per patient.
C = dict()
for i in range(X.shape[0]):
    C[i] = 1
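We also define the matrix D that stores the distances between clusters. Because the distance is symmetric, each pair is stored only once.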
class DistanceMatrix(object):
    """Symmetric distance table that stores each pair once, keyed as (min, max)."""

    def __init__(self):
        self.matrix = dict()

    def __setitem__(self, key, value):
        i, j = key
        if i > j:
            i, j = j, i
        self.matrix[i, j] = value

    def __getitem__(self, key):
        i, j = key
        if i == j:
            # The distance from a cluster to itself is zero.
            return 0
        if i > j:
            i, j = j, i
        return self.matrix[i, j]
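Since Ward linkage coincides with the Euclidean distance for clusters that consist of a single vector, this matrix D is initialized with the pairwise Euclidean distances between the vectors.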
D = DistanceMatrix()
for i in range(X.shape[0]):
    for j in range(i + 1, X.shape[0]):
        D[i, j] = euclidean(X[i, :], X[j, :])
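The initial distances can also be computed in one vectorized step, which is a convenient cross-check for the double loop. The following is a minimal sketch, not the author's code: the function name `pairwise_euclidean` and the small array `X_demo` are hypothetical, and the result is a dense NumPy array rather than the dictionary-backed matrix used in the text.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the full symmetric matrix of Euclidean distances between rows of X."""
    # Gram-matrix identity: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding can leave tiny negative values; clip them before the square root.
    np.maximum(sq_dist, 0.0, out=sq_dist)
    np.fill_diagonal(sq_dist, 0.0)
    return np.sqrt(sq_dist)


# Tiny illustrative input (not the Golub et al. data).
X_demo = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
P = pairwise_euclidean(X_demo)
```

The Gram-matrix form avoids building an intermediate array of shape (patients, patients, genes), which matters when each row has several thousand entries.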
We then repeatedly create a new cluster from the closest pair of vectors or clusters in the distance matrix D, updating D after each merge.
for k in range(X.shape[0] - 1):
    # Find the two clusters in the closest neighborhood.
    minimum = np.inf
    for i in C.keys():
        for j in C.keys():
            if i < j and D[i, j] < minimum:
                minimum = D[i, j]
                x, y = i, j
    # Create a new cluster from x and y; its size is the sum of their sizes.
    new = X.shape[0] + k
    C[new] = C[x] + C[y]
    # Update the distance matrix with the distances to the new cluster.
    for i in C.keys():
        if i < new:
            D[i, new] = ward(x, y, i, D, C)
    # Clusters x and y are included in the new cluster.
    del C[x], C[y]
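The loop calls two helpers, `euclidean` and `ward`, whose definitions are not shown in the text. The following is a minimal sketch of what they might look like. It assumes that `ward` applies the Lance-Williams recurrence for Ward linkage, with the argument order taken from the call `ward(x, y, i, D, C)`; the function bodies are an illustration, not the author's code. SciPy provides a function named `euclidean` in `scipy.spatial.distance`, but whether the original used it is not stated.

```python
import numpy as np


def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))


def ward(x, y, i, D, C):
    """Distance from cluster i to the cluster obtained by merging x and y.

    Assumed form: the Lance-Williams recurrence for Ward linkage.
    D[p, q] is the current distance between clusters p and q,
    C[p] is the number of vectors in cluster p.
    """
    n_x, n_y, n_i = C[x], C[y], C[i]
    total = n_x + n_y + n_i
    squared = ((n_x + n_i) * D[x, i] ** 2
               + (n_y + n_i) * D[y, i] ** 2
               - n_i * D[x, y] ** 2) / total
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(squared, 0.0)))


# Worked example on three hypothetical points: merge 0 and 1, measure to 2.
points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
D_demo = {(p, q): euclidean(points[p], points[q])
          for p in range(3) for q in range(p + 1, 3)}
C_demo = {0: 1, 1: 1, 2: 1}
merged_to_2 = ward(0, 1, 2, D_demo, C_demo)
```

In this example the merged cluster has its centroid at (1, 0), which is 3 away from the third point, and the centroid form of Ward linkage gives sqrt(2·2·1/3)·3 = sqrt(12), the same value the recurrence returns.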
The results computed on the Golub et al. dataset are shown in Figure 1. The clustering result can be represented as a phylogenetic tree or dendrogram due to its hierarchical structure.
Figure 1. Dendrogram representation of the results of hierarchical clustering by gene expression levels for each patient.
If we cut the tree into three clusters at a threshold distance of 70.0, shown as a dashed line in Figure 1, we can separate the ALL T-cell, ALL B-cell, and AML phenotypes almost cleanly, even though this is unsupervised learning that does not use the phenotypes as labels.
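Cutting the tree at a threshold can be sketched by stopping the merges as soon as the closest pair of clusters is farther apart than the threshold. Because merge distances under Ward linkage never decrease, this yields the same clusters as cutting the finished dendrogram at that height. The code below is a NumPy-only sketch under stated assumptions: the Ward update is taken to be the Lance-Williams recurrence, the function name `cut_ward_clustering` is hypothetical, and the data are a tiny synthetic example rather than the Golub et al. dataset.

```python
import numpy as np


def cut_ward_clustering(X, threshold):
    """Merge clusters with Ward linkage until the closest pair is farther
    apart than `threshold`; return the surviving clusters as index lists."""
    n = X.shape[0]
    members = {i: [i] for i in range(n)}  # cluster id -> row indices
    dist = {(i, j): float(np.linalg.norm(X[i] - X[j]))
            for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > 1:
        ids = sorted(members)
        # Closest pair among the clusters that are still active.
        d_xy, x, y = min((dist[i, j], i, j)
                         for a, i in enumerate(ids) for j in ids[a + 1:])
        if d_xy > threshold:
            break  # every remaining merge would lie above the cut
        n_x, n_y = len(members[x]), len(members[y])
        for i in ids:
            if i in (x, y):
                continue
            n_i = len(members[i])
            d_xi = dist[min(x, i), max(x, i)]
            d_yi = dist[min(y, i), max(y, i)]
            # Lance-Williams update for Ward linkage (assumed form).
            sq = ((n_x + n_i) * d_xi ** 2 + (n_y + n_i) * d_yi ** 2
                  - n_i * d_xy ** 2) / (n_x + n_y + n_i)
            dist[i, next_id] = float(np.sqrt(max(sq, 0.0)))
        members[next_id] = members.pop(x) + members.pop(y)
        next_id += 1
    return sorted(sorted(m) for m in members.values())


# Three tight, well-separated synthetic groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [10.0, 0.0], [10.1, 0.0], [10.0, 0.1],
                   [0.0, 10.0], [0.1, 10.0], [0.0, 10.1]])
clusters = cut_ward_clustering(X_demo, threshold=5.0)
```

The threshold of 5.0 is chosen for this toy input only; a suitable value for real data has to be read off the dendrogram, as the dashed line in Figure 1 illustrates.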
For comparison, Figure 2 shows the result of projecting the vector representations of the genotypes into two dimensions using principal component analysis (PCA), which, like hierarchical clustering, is an unsupervised learning method.
Figure 2. Projection of the vectors of gene expression levels by PCA into two dimensions.
Although PCA, a linear method, separates the ALL patients from the AML patients, it does not clearly discriminate between the T-cell and B-cell patients within ALL.
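A two-dimensional PCA projection of this kind can be sketched with NumPy's singular value decomposition. This is an illustration under stated assumptions, not the code used to produce Figure 2: the function name `pca_project` is hypothetical, and the input is random synthetic data with one artificially shifted group, standing in for the expression matrix.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Synthetic stand-in for the expression matrix (NOT the Golub et al. data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(38, 50))
X_demo[:27, 0] += 5.0  # shift one hypothetical group along one feature
Z = pca_project(X_demo)
```

Each row of `Z` holds the two coordinates of one sample, which can then be drawn as a scatter plot colored by phenotype.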