
FCMeans

Adrian Quintana edited this page Dec 11, 2017 · 1 revision

classify_fcmeans

(syntax changed as of version 1.2)

Purpose

FCMeans stands for "Fuzzy c-Means". It is a standard clustering algorithm.

Usage


$ classify_fcmeans ...


Parameters

  • `-i` The input data file (raw file). It should be a text file in which each row is a data item and each column is a variable:
 3 1000
 12 34 54
 -12 45 76
 ...
 32 45 76

The first line indicates the dimension of the vectors (in this case 3) and the number of vectors (in this case 1000). Note that vector components (variables) are separated by spaces. Optionally, an extra last column can hold a label for the vector. Example:

 3 1000
 12 34 54     labelA
 -12 45 76    labelB
 ...
 32 45 76     labelN
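The format above can be produced with a few lines of Python. This is a minimal sketch, not part of the FCMeans distribution: the function name `format_fcmeans_input` is hypothetical, and it assumes one vector per line and labels that contain no spaces.

```python
def format_fcmeans_input(vectors, labels=None):
    """Return raw-file text: a 'dimension count' header, then one vector per line.

    Hypothetical helper; assumes labels (if given) contain no whitespace.
    """
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    if labels is not None and len(labels) != len(vectors):
        raise ValueError("labels must match the number of vectors")

    lines = [f"{dim} {len(vectors)}"]  # header: dimension, number of vectors
    for idx, vec in enumerate(vectors):
        fields = [str(x) for x in vec]
        if labels is not None:
            fields.append(str(labels[idx]))  # optional label in the last column
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


data = [[12, 34, 54], [-12, 45, 76], [32, 45, 76]]
text = format_fcmeans_input(data, labels=["labelA", "labelB", "labelN"])
print(text, end="")
```

Writing `text` to a file such as `test.dat` gives an input that follows the layout described above.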
  • `-o` The output cluster centers. This parameter sets the base name for the generated output files. FCMeans produces several files with different information, all sharing this base name but with different extensions. The generated files are:
    • [basename.cod] The resulting cluster centers, in the same format as the input data. Example:
 3 4
 11 31 52     labelA
 -10 43 71    labelB
 -13 -1 10    labelC
 29 39 71     labelD

The first line indicates the dimension of the vectors (in this case 3) and the number of clusters (in this case 4). The remaining lines are the cluster centers.
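Since the cluster centers file shares its layout with the input data, one small reader can handle both. The following is a sketch under the assumption that each vector sits on its own line and that any label is a single token after the numeric columns; `parse_vector_file` is a hypothetical name.

```python
def parse_vector_file(text):
    """Parse a 'dimension count' header plus rows; return (vectors, labels).

    Hypothetical helper; labels is a list with None where a row has no label.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows or len(rows[0]) < 2:
        raise ValueError("missing 'dimension count' header")
    dim, count = int(rows[0][0]), int(rows[0][1])

    vectors, labels = [], []
    for fields in rows[1:]:
        if len(fields) < dim:
            raise ValueError("row has fewer components than the declared dimension")
        vectors.append([float(x) for x in fields[:dim]])
        labels.append(fields[dim] if len(fields) > dim else None)

    if len(vectors) != count:
        raise ValueError("number of rows does not match the header")
    return vectors, labels


sample_cod = """3 4
11 31 52 labelA
-10 43 71 labelB
-13 -1 10 labelC
29 39 71 labelD
"""
centers, names = parse_vector_file(sample_cod)
print(len(centers), names)
```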

  • [basename.inf] Information file about the parameters used, the resulting validity functionals and the resulting quantization error. It will look like this:
 Fuzzy c-Means Clustering Algorithm (FCMeans)
 Input data file : test.dat
 Cluster centers output file : test.cod
 Algorithm information output file : test.inf
 Number of feature vectors: 93
 Number of variables: 3
 Number of clusters: 4
 Input data normalized
 Fuzzy constant (m) = 2
 Total number of iterations = 1000
 Stopping criteria (eps) = 1e-07
 Quantization error : 0.864522

 Validity Functionals :
 Partition coefficient (max) (F) : 0.95415
 Partition entropy (min) (H) : 0.089624
 Non-Fuzzy Index (max) (NFI) : 0.938866
 Compactness and Separation index (min) (S) : 0.428229

The validity functionals are criteria for evaluating the quality of the partition. They can be used to select the most appropriate number of clusters for a given problem. FCMeans implements the following validity functionals:

      • Partition Coefficient (F) validity functional

For U in Mfc (fuzzy partition space):

 1/C <= F <= 1
 for F = 1, U is hard (zeros and ones only)
 for F = 1/C, U = 1/C*ones(C,n)

(The higher this value, the better the partition.) For more information see: J.C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981.

      • Partition Entropy (H) validity functional

For U in Mfc (fuzzy partition space):

 0 <= H <= log(C)
 for H = 0, U is hard
 for H = log(C), U = 1/C*ones(C,n)
 0 <= 1 - F <= H (strict inequality if U is not hard)

(The lower this value, the better the partition.) For more information see: J.C. Bezdek, "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York, 1981.

      • Non-fuzzy index (NFI) validity functional

(The higher this value, the better the partition.) For more information see: M. Roubens, "Pattern Classification Problems and Fuzzy Sets", Fuzzy Sets and Systems, 1:239-253, 1978.

      • Compactness and separation index (S) validity functional

(The lower this value, the better the partition.) For more information see: X.L. Xie and G. Beni, "A Validity Measure for Fuzzy Clustering", IEEE Trans. PAMI, 13(8):841-847, 1991.
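To make the four functionals concrete, here is a minimal Python sketch of their textbook definitions from the references cited above. It is not taken from the FCMeans source. The membership matrix `u` is assumed to be a list of C rows with n entries each, the natural logarithm is assumed for H, the expression used for NFI is an assumption (it does reproduce the sample `.inf` values shown earlier), and S uses squared memberships as in Xie and Beni's original definition.

```python
import math


def partition_coefficient(u):
    """F = (1/n) * sum of squared memberships; 1/C <= F <= 1."""
    n = len(u[0])
    return sum(x * x for row in u for x in row) / n


def partition_entropy(u):
    """H = -(1/n) * sum of u*log(u), with 0*log(0) taken as 0 (natural log assumed)."""
    n = len(u[0])
    return -sum(x * math.log(x) for row in u for x in row if x > 0.0) / n


def non_fuzzy_index(u):
    """NFI = (C*F - 1) / (C - 1); assumed form, 0 for uniform U and 1 for hard U."""
    c = len(u)
    return (c * partition_coefficient(u) - 1.0) / (c - 1.0)


def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def xie_beni(u, data, centers):
    """S = sum_ik u_ik^2 * |x_k - v_i|^2 / (n * min_{i != j} |v_i - v_j|^2)."""
    n, c = len(data), len(centers)
    compactness = sum(u[i][k] ** 2 * sqdist(data[k], centers[i])
                      for i in range(c) for k in range(n))
    separation = min(sqdist(centers[i], centers[j])
                     for i in range(c) for j in range(i + 1, c))
    return compactness / (n * separation)


hard = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # C = 2, n = 3, zeros and ones only
uniform = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]  # U = 1/C * ones(C, n)
print(partition_coefficient(hard), partition_coefficient(uniform))
print(partition_entropy(hard), partition_entropy(uniform))

# Consistency check against the sample .inf file: C = 4, F = 0.95415
nfi_from_sample = (4 * 0.95415 - 1) / (4 - 1)
print(nfi_from_sample)  # approximately 0.938867, matching the reported NFI
```

The hard and uniform matrices hit the bounds listed above: F is 1 and 1/C, H is 0 and log(C).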

    • [basename.his] Information about the number of input vectors assigned to each cluster center. It is like a histogram of the resulting clusters. The file contains two columns: the first column is the number of the cluster and the second column is the number of input vectors assigned to it.
    • [basename.err] Average quantization error for each cluster. The file contains two columns: the first column is the number of the cluster and the second column is the average quantization error for that cluster.
  • `` The input cluster centers file (optional). This is useful when the cluster centers are to be initialized with a set of predefined values, typically when the algorithm is run several times and the output of one run is used as input to the next.
  • `-saveclusters` Save a file for each cluster with a list of the input items assigned to it. Each file contains the indexes of the input vectors assigned to that cluster. Example: if 5 clusters are used, 5 files named `[basename].[ClusterIndex]` (`[basename].0`, `[basename].1`, etc.) will be generated.
  • `-c` Number of clusters (it should satisfy 1 < c < number of data items)
  • `` Number of iterations (Default = 1000)
  • `-eps` Stopping criterion (Default = `1e-7`). The algorithm stops when the cluster center values change by no more than `eps` between iterations, or when the maximum number of iterations is reached.
  • `-norm` Normalize input data (Default = No)
  • `-verb` Information level while running:
    • `0` No information (default)
    • `1` Progress bar
    • `` Changes between iterations
  • `-m` Fuzzy constant. It should be a value > 1 (Default = 2)
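How the number of clusters, the fuzzy constant, the stopping criterion and the iteration limit interact can be sketched with the textbook fuzzy c-means iteration. This is an illustration, not the `classify_fcmeans` implementation: the function names are hypothetical, starting from the first c input vectors is an assumption (the optional input cluster centers play the role of `initial_centers`), and measuring "change" as the largest absolute change in any center coordinate is also an assumption.

```python
def sqdist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def memberships(data, centers, m):
    """u[i][k]: membership of vector k in cluster i (each column sums to 1)."""
    c = len(centers)
    u = [[0.0] * len(data) for _ in range(c)]
    power = 1.0 / (m - 1.0)
    for k, x in enumerate(data):
        d2 = [sqdist(x, v) for v in centers]
        zeros = [i for i, d in enumerate(d2) if d == 0.0]
        if zeros:  # vector coincides with one or more centers
            for i in zeros:
                u[i][k] = 1.0 / len(zeros)
            continue
        inv = [(1.0 / d) ** power for d in d2]
        total = sum(inv)
        for i in range(c):
            u[i][k] = inv[i] / total
    return u


def update_centers(data, u, m, previous):
    """Each center is the mean of the data weighted by u^m."""
    dim = len(data[0])
    centers = []
    for i, row in enumerate(u):
        w = [x ** m for x in row]
        total = sum(w)
        if total == 0.0:  # degenerate cluster: keep the old center
            centers.append(list(previous[i]))
            continue
        centers.append([sum(w[k] * data[k][d] for k in range(len(data))) / total
                        for d in range(dim)])
    return centers


def fcmeans(data, c, m=2.0, eps=1e-7, max_iter=1000, initial_centers=None):
    if not 1 < c < len(data):
        raise ValueError("need 1 < c < number of data items")
    if m <= 1.0:
        raise ValueError("fuzzy constant must be > 1")
    # Assumption: start from the first c vectors unless centers are supplied.
    centers = [list(v) for v in (initial_centers or data[:c])]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        u = memberships(data, centers, m)
        new_centers = update_centers(data, u, m, centers)
        change = max(abs(a - b) for old, new in zip(centers, new_centers)
                     for a, b in zip(old, new))
        centers = new_centers
        if change <= eps:  # stopping criterion
            break
    return centers, memberships(data, centers, m), iteration


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, u, iterations = fcmeans(points, c=2)
print(sorted(centers), iterations)
```

On this toy data the two centers settle close to the means of the two groups of points, and the loop ends as soon as the centers move by no more than `eps`.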

Examples and notes

Example 1: Clustering a set of data stored in "test.dat" file into 5 clusters


$ classify_fcmeans -i test.dat -o test -c 5


In this case the run uses the following parameters (those not given on the command line take their default values):


Input data file : test.dat
Output file name : test
Number of clusters = 5
Fuzzy constant = 2
Total number of iterations = 1000
Stopping criteria (eps) = 1e-07
verbosity level = 0
Do not normalize input data 


So, we are going to generate 5 clusters (-c 5) with the default fuzzy constant of 2 (equivalent to -m 2.0). The algorithm will stop when the cluster centers change by no more than 1e-7 between iterations (equivalent to -eps 1e-7). In this case no textual information will be written to the output console (equivalent to -verb 0).

As a result, the FCMeans application will generate the following output files:

  • test.cod The final cluster centers file in the format described above
  • test.inf Information file about the parameters used, the validity functionals and the resulting quantization error
  • test.his Information about the number of input vectors assigned to each cluster. It is like a histogram
  • test.err Average quantization error for each cluster

Example 2: Clustering a set of data stored in "test.dat" file into 4 clusters, saving the cluster information


$ classify_fcmeans -i test.dat -o test -c 4 -m 1.5 -saveclusters -norm -verb 1


In this case the run uses the following parameters (those not given on the command line take their default values):


Input data file : test.dat
Output file name : test
Number of clusters = 4
Fuzzy constant = 1.5
Total number of iterations = 1000
Stopping criteria (eps) = 1e-07
verbosity level = 1
Normalize input data


So, we are going to generate 4 clusters (-c 4) with a fuzzy constant equal to 1.5 (-m 1.5). The algorithm will stop when the cluster centers change by no more than 1e-7 between iterations (the default, equivalent to -eps 1e-7). Since the -saveclusters parameter is used, a list of the input data assigned to each cluster is stored in the files test.0 to test.3. In this case a progress bar and the elapsed/estimated time will be shown in the output console (-verb 1). The input data will be normalized (-norm).

In this case, the following files are going to be generated:

  • test.cod The final cluster centers file in the format described above
  • test.inf Information file about the parameters used, the validity functionals and the resulting quantization error
  • test.his Information about the number of input vectors assigned to each cluster. It is like a histogram
  • test.err Average quantization error for each cluster
  • test.0 to test.3 Each file is a list of the input data vectors assigned to the corresponding cluster
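Because the validity functionals are meant to help choose the number of clusters, a natural follow-up is to run the program for several values of -c and compare the reported values. The sketch below is one possible way to do that from Python. It uses only the -i, -o and -c options shown in the examples, assumes the `.inf` file has the layout shown earlier with one field per line, and all function names and the `test_c<N>` base names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Patterns follow the sample .inf report; the exact layout is an assumption.
FIELDS = {
    "quantization_error": r"Quantization error\s*:\s*(\S+)",
    "F": r"Partition coefficient \(max\) \(F\)\s*:\s*(\S+)",
    "H": r"Partition entropy \(min\) \(H\)\s*:\s*(\S+)",
    "NFI": r"Non-Fuzzy Index \(max\) \(NFI\)\s*:\s*(\S+)",
    "S": r"Compactness and Separation index \(min\) \(S\)\s*:\s*(\S+)",
}


def parse_inf(text):
    """Extract the numeric summary values from the text of an .inf file."""
    values = {}
    for key, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            values[key] = float(match.group(1))
    return values


def fcmeans_command(data_file, base_name, clusters):
    """Build the argument list for one run, using only options shown above."""
    return ["classify_fcmeans", "-i", data_file, "-o", base_name, "-c", str(clusters)]


def scan_cluster_counts(data_file, counts):
    """Run classify_fcmeans once per cluster count and collect the .inf values."""
    results = {}
    for c in counts:
        base = f"test_c{c}"  # hypothetical base name
        subprocess.run(fcmeans_command(data_file, base, c), check=True)
        results[c] = parse_inf(Path(base + ".inf").read_text())
    return results


SAMPLE_INF = """Quantization error : 0.864522

Validity Functionals :
Partition coefficient (max) (F) : 0.95415
Partition entropy (min) (H) : 0.089624
Non-Fuzzy Index (max) (NFI) : 0.938866
Compactness and Separation index (min) (S) : 0.428229
"""
summary = parse_inf(SAMPLE_INF)
print(summary)
```

With `classify_fcmeans` installed, `scan_cluster_counts("test.dat", range(2, 7))` would return one dictionary per cluster count, which can then be compared on whichever functional is preferred (for example the lowest S or the highest NFI).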

--Main.AlfredoSolano - 24 Jan 2007
