A fuzzy clustering algorithm, where clusters
are derived from fuzzy sets [1],
can be considered as a generalization of the traditional
cluster procedures [2].
Such clusters are derived assigning to each compound a
number between zero and one called the membership degree
of the object. A compound is defined by the degrees
of membership, ranged between 0 and 1, to each cluster, and
the sum of all these
values is equal to 1. In the proposed procedure [3],
the fuzzy C-means algorithm [4]
has been designed to produce the fuzzy clustering of the data sets.
The fundamental role of cluster analysis is to achieve
the self partitioning of the data and to look for compounds
in the data base which can be considered
as close enough to be identified as a cluster.
Each cluster can be identified
as a set of compounds close to each other with regard
to the molecular descriptors
provided. A synthetic representation can be based on a clear
separation of the clusters.
Therefore, instead to inspect all the compounds in the database
to understand and analyze their chemical properties, it is
only required to select typical compounds
representing each cluster, for example those ones that show
the highest membership degrees.
The main steps of a fuzzy clustering algorithm are the following:
- choice of the number of clusters C;
- determination of the starting position of each cluster
center in the descriptor space;
- building an "ambiguity matrix" that includes, for each compound,
the degrees of membership of each cluster;
- calculation of the new positions of the cluster centers;
- comparing new and old center positions. If their distances
are lower than a cutoff value, the algorithm is stopped,
otherwise go to step 3.
In order to present the Fuzzy Clustering approach in a
simplified way, let us consider an example in which the series
S is constituted by nine compounds k and represented by
the vector V = (1, 2, 3, 4, 6, 8, 9, 10, 11);
each element V(k) defines the distance between the origin O
and the compound k.
Starting cluster positions
The number of clusters used is very important as it defines
their selectivity, i.e. the number of compound categories
represented in each cluster. Two clusters are used in this
example (Figure 1). The first step consists to choose
the initial position of each cluster and a usual way to define them is:
|
|
|
|
(1) |
where C1,0 and C2,0 represent the initial
center positions of the clusters 1 et 2, and Z is a random value
defined as V(K)min <
Z < V(K)max.
For example, let us use Z = 4.1, then C1,
0 = 2.55 and
C2,0 = 7.55.
Figure 1. Center positions of the clusters 1 and 2
for the series S.
Computing membership degrees
The next phase consists in computing the membership degrees for
each compounds. Two steps are necessary:
- evaluating the distances dCi(K) between all
compounds et each cluster center Ci, by using
a parameter m > 0 that tunes out noise from
the data. A high value of m allows reducing the contribution
of the compounds with low membership degrees to the cluster
center determination.
|
|
(2) |
- building an ambiguity matrix U (Table 1),
where the components elements UCi(K),
representing, for a given compound k,
the degree of membership of the clusters Ci,
are defined by:
|
|
(3) |
Table 1. Starting ambiguity matrix for the series
S with m = 2.
| Compound |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
| DC1(K) |
1.55 |
0.55 |
0.45 |
1.45 |
3.45 |
5.45 |
6.45 |
7.45 |
8.45 |
| DC2(K) |
6.55 |
5.55 |
4.55 |
3.551 |
0.55 |
0.45 |
1.45 |
2.45 |
3.45 |
| UC1(K) |
0.81 |
0.91 |
0.91 |
0.71 |
0.31 |
0.08 |
0.18 |
0.25 |
0.29 |
| UC2(K) |
0.19 |
0.09 |
0.09 |
0.29 |
0.69 |
0.92 |
0.82 |
0.75 |
0.71 |
Computing new cluster centers
After defining the ambiguity matrix, the new position of the
cluster centers are computed by:
|
|
(4) |
This procedure is repeated until the difference between the new and old
positions is lower than a cutoff value A (i.e., |Ci,0-Ci,1| < A). In the
latter case, the algorithm is stopped and the final ambiguity matrix is computed.
In the example proposed here, with A = 0.1, the end condition is filled
after 9 iterations.
The cluster centers are C1,9 = 5.25,
C2,9 = 6.75, and the related ambiguity matrix
is represented in Table 2.
Table 2. Final ambiguity matrix for the series S.
| Compound |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
| UC1(K) |
0.58 |
0.61 |
0.64 |
0.71 |
0.50 |
0.29 |
0.36 |
0.40 |
0.42 |
| UC2(K) |
0.42 |
0.39 |
0.36 |
0.29 |
0.50 |
0.71 |
0.64 |
0.61 |
0.58 |
The final step consists in defining a threshold for the membership
degrees, in order to define if a compound is really associated with
a given cluster. The ambiguity matrix is then changed in a binary
matrix (Table 3) and the threshold can be modified to make the
clusters more or less selective. In this case, it is possible
to analyze the cluster overlapping or if a compound belongs
to two or more clusters. For example, for the series S, the
second compound belongs to both clusters for a threshold of
0.4 and only to the first cluster for a threshold of 0.6.
Table 3. Binary representation of fuzzy clustering results by
using threshold values.
| Threshold
| Compound
| 1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
| 0.4 |
UC1(K) |
|
1 |
1 |
1 |
1 |
0 |
0 |
1 |
1 |
| UC2(K) |
1 |
0 |
0 |
1 |
1 |
1 |
1 |
1 |
| 0.6 |
UC1(K) |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
0 |
| UC2(K) |
0 |
0 |
0 |
0 |
1 |
1 |
1 |
0 |
References
- L.A Zadeh., Fuzzy sets and their applications to classification and clustering, in J. Van Ryzin (Ed.), Classification and Clustering, Academic Press, New York, 1977, pp. 251-299.
- R.C. Cubes, A.K. Jain, Clustering methodologies in exploratory data analysis, in M. Yovits (Ed.), Advances in computers, Academic Press, New York, 1980, pp. 113-228.
- F. Ros, K. Audouze, M. Pintore, J.R. Chretien, Hybrid System for Virtual Screening: Interest of Fuzzy Clustering Applied to Olfaction, SAR QSAR Env. Res. (2000) 11, 281-300.
- R.J. Hathaway, J.C. Bezdek, Nerf C-Means: Non Euclidean Relational Fuzzy Clustering, Pattern Recognition (1994) 27, 429-437.
|