>home     >contact     >sitemap     >login
BIOCHEMICS CONSULTING SAS BCX logo
  About us Our Team Expertise Services
Expertise


Data mining

    Fuzzy logic

    Genetic algorithms

    Neural networks

Molecular modeling

Examples of achievements

Selected publications

Fuzzy clustering


A fuzzy clustering algorithm, where clusters are derived from fuzzy sets [1], can be considered as a generalization of the traditional cluster procedures [2]. Such clusters are derived assigning to each compound a number between zero and one called the membership degree of the object. A compound is defined by the degrees of membership, ranged between 0 and 1, to each cluster, and the sum of all these values is equal to 1. In the proposed procedure [3], the fuzzy C-means algorithm [4] has been designed to produce the fuzzy clustering of the data sets.

The fundamental role of cluster analysis is to achieve the self partitioning of the data and to look for compounds in the data base which can be considered as close enough to be identified as a cluster. Each cluster can be identified as a set of compounds close to each other with regard to the molecular descriptors provided. A synthetic representation can be based on a clear separation of the clusters. Therefore, instead to inspect all the compounds in the database to understand and analyze their chemical properties, it is only required to select typical compounds representing each cluster, for example those ones that show the highest membership degrees.

The main steps of a fuzzy clustering algorithm are the following:

  1. choice of the number of clusters C;
  2. determination of the starting position of each cluster center in the descriptor space;
  3. building an "ambiguity matrix" that includes, for each compound, the degrees of membership of each cluster;
  4. calculation of the new positions of the cluster centers;
  5. comparing new and old center positions. If their distances are lower than a cutoff value, the algorithm is stopped, otherwise go to step 3.

In order to present the Fuzzy Clustering approach in a simplified way, let us consider an example in which the series S is constituted by nine compounds k and represented by the vector V = (1, 2, 3, 4, 6, 8, 9, 10, 11); each element V(k) defines the distance between the origin O and the compound k.

Starting cluster positions

The number of clusters used is very important as it defines their selectivity, i.e. the number of compound categories represented in each cluster. Two clusters are used in this example (Figure 1). The first step consists to choose the initial position of each cluster and a usual way to define them is:

   

(1)


where C1,0 and C2,0 represent the initial center positions of the clusters 1 et 2, and Z is a random value defined as V(K)min <  Z < V(K)max. For example, let us use Z = 4.1, then C1, 0 = 2.55 and C2,0 = 7.55.

Figure 1. Center positions of the clusters 1 and 2 for the series S.


Computing membership degrees

The next phase consists in computing the membership degrees for each compounds. Two steps are necessary:

  1. evaluating the distances dCi(K) between all compounds et each cluster center Ci, by using a parameter m > 0 that tunes out noise from the data. A high value of m allows reducing the contribution of the compounds with low membership degrees to the cluster center determination.

  2.  

    (2)


  3. building an ambiguity matrix U (Table 1), where the components elements UCi(K), representing, for a given compound k, the degree of membership of the clusters Ci, are defined by:

  4.  

    (3)


Table 1. Starting ambiguity matrix for the series S with m = 2.
Compound 1 2 3 4 5 6 7 8 9
DC1(K) 1.55 0.55 0.45 1.45 3.45 5.45 6.45 7.45 8.45
DC2(K) 6.55 5.55 4.55 3.551 0.55 0.45 1.45 2.45 3.45
UC1(K) 0.81 0.91 0.91 0.71 0.31 0.08 0.18 0.25 0.29
UC2(K) 0.19 0.09 0.09 0.29 0.69 0.92 0.82 0.75 0.71

Computing new cluster centers

After defining the ambiguity matrix, the new position of the cluster centers are computed by:

 

(4)


This procedure is repeated until the difference between the new and old positions is lower than a cutoff value A (i.e., |Ci,0-Ci,1< A). In the latter case, the algorithm is stopped and the final ambiguity matrix is computed. In the example proposed here, with A = 0.1, the end condition is filled after 9 iterations. The cluster centers are C1,9 = 5.25, C2,9 = 6.75, and the related ambiguity matrix is represented in Table 2.

Table 2. Final ambiguity matrix for the series S.
Compound 1 2 3 4 5 6 7 8 9
UC1(K) 0.58 0.61 0.64 0.71 0.50 0.29 0.36 0.40 0.42
UC2(K) 0.42 0.39 0.36 0.29 0.50 0.71 0.64 0.61 0.58

The final step consists in defining a threshold for the membership degrees, in order to define if a compound is really associated with a given cluster. The ambiguity matrix is then changed in a binary matrix (Table 3) and the threshold can be modified to make the clusters more or less selective. In this case, it is possible to analyze the cluster overlapping or if a compound belongs to two or more clusters. For example, for the series S, the second compound belongs to both clusters for a threshold of 0.4 and only to the first cluster for a threshold of 0.6.

Table 3. Binary representation of fuzzy clustering results by using threshold values.
   Threshold       Compound       1       2       3       4       5       6       7       8       9   
0.4    UC1(K)
1
1
0
0
1 1 1 1 0 0 1 1
   UC2(K) 1 0 0 1 1 1 1 1
0.6    UC1(K) 1 1 1 0 0 0 0 0
   UC2(K) 0 0 0 0 1 1 1 0


References

  1. L.A Zadeh., Fuzzy sets and their applications to classification and clustering, in J. Van Ryzin (Ed.), Classification and Clustering, Academic Press, New York, 1977, pp. 251-299.
  2. R.C. Cubes, A.K. Jain, Clustering methodologies in exploratory data analysis, in M. Yovits (Ed.), Advances in computers, Academic Press, New York, 1980, pp. 113-228.
  3. F. Ros, K. Audouze, M. Pintore, J.R. Chretien, Hybrid System for Virtual Screening: Interest of Fuzzy Clustering Applied to Olfaction, SAR QSAR Env. Res. (2000) 11, 281-300.
  4. R.J. Hathaway, J.C. Bezdek, Nerf C-Means: Non Euclidean Relational Fuzzy Clustering, Pattern Recognition (1994) 27, 429-437.

TOP
Biochemics Consulting SAS
Pépinière d’Entreprises - 111 bd Duhamel de Monceau
45 166 Olivet Cedex - France

® BCX 2004, All Rights Reserved.
- Biochemics Consulting SAS - 16 rue Leonard de Vinci - F 45074 ORLEANS Cedex 2 - Capital: 100.000€ - RCS SIRET 441 128 980 000 1 4 -