Methods of cluster analysis. Cluster analysis is a set of methods for dividing data into groups of objects with similar characteristics.


What cluster analysis is

Good day. I have great respect for people who are passionate about their work.

My friend Maxim belongs to this category. He constantly works with figures, analyzes them, and prepares the relevant reports.

Yesterday we had lunch together, and for almost half an hour he told me about cluster analysis: what it is and in what cases its application is reasonable and expedient. Well, what about me?

I have a good memory, so I will pass all of this on to you in as close to its original and most informative form as I can.

Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a task of multivariate data classification.

There are about 100 different clustering algorithms; however, the most commonly used are hierarchical cluster analysis and k-means clustering.

Where is cluster analysis used? In marketing, this is the segmentation of competitors and consumers.

In management: dividing personnel into groups with different levels of motivation, classifying suppliers, and identifying similar production situations in which defects occur.

In medicine: the classification of symptoms, patients, and drugs. In sociology: the division of respondents into homogeneous groups. Indeed, cluster analysis has proven itself in practically every sphere of human life.

The beauty of this method is that it works even when there is little data and the requirements for normality of the distributions of the random variables, and other requirements of classical methods of statistical analysis, are not met.

Let us explain the essence of cluster analysis without resorting to strict terminology:
Let's say you conducted a survey of employees and want to determine how you can most effectively manage your staff.

That is, you want to divide employees into groups and for each of them to allocate the most effective control levers. At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To solve the problem, it is proposed to use hierarchical cluster analysis.

As a result, we will get a tree, looking at which we must decide how many classes (clusters) we want to split the staff into.
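As an illustration, such a hierarchical tree can be built with SciPy; a minimal sketch, assuming the survey answers are numeric scores (the data here is randomly generated, purely hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# 17 hypothetical respondents x 7 questionnaire items, scored 1..10
responses = rng.integers(1, 11, size=(17, 7))

# Ward linkage on Euclidean distances builds the hierarchical tree
tree = linkage(responses, method="ward")

dendrogram(tree, labels=[f"resp{i + 1}" for i in range(17)])
plt.ylabel("joining distance")
plt.show()
```

Looking at the resulting dendrogram, one decides where to cut it, and hence how many clusters to keep.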

Suppose that we decide to divide the staff into three groups; then, to study the respondents who fell into each cluster, we get a table with the following content:


Let us explain how the above table is formed. The first column contains the number of the cluster, that is, the group whose data is reflected in the row.

For example, the first cluster is 80% male. 90% of the first cluster fall into the age group from 30 to 50 years old, and 12% of respondents believe that benefits are very important. And so on.

Let's try to make portraits of respondents of each cluster:

  1. The first group consists mainly of men of mature age occupying leadership positions. The social package (MED, LGOTI, TIME free time) does not interest them. They prefer to receive a good salary rather than help from the employer.
  2. The second group, on the contrary, prefers the social package. It consists mainly of older people occupying low-level positions. Salary is certainly important to them, but there are other priorities.
  3. The third group is the "youngest". Unlike the previous two, it shows an obvious interest in learning and opportunities for professional growth. This category of employees has a good chance of joining the first group soon.

Thus, when planning a campaign to introduce effective personnel management methods, it is clear that in our situation the social package for the second group can be increased at the expense of, for example, wages.

If we talk about which specialists should be sent for training, then we can definitely recommend paying attention to the third group.

Source: http://www.nickart.spb.ru/analysis/cluster.php

Features of cluster analysis

In this context, a cluster is the set of transactions executed at a given asset price over a certain period of time. The resulting volume of purchases and sales is indicated by a number inside the cluster.

The bar of any timeframe (TF) usually contains several clusters. This allows you to see in detail the volumes of purchases and sales, and their balance, in each individual bar at each price level.


A change in the price of one asset inevitably entails a chain of price movements on other instruments as well.

Attention!

In most cases, a trend movement is recognized only when it is already rapidly developing, and entering the market with the trend at that point risks landing in a corrective wave.

For successful trades, it is necessary to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph.

With the help of cluster analysis, you can see the activity of market participants inside even the smallest price bar. This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes for each asset price level.

In the market there is a constant confrontation between the interests of sellers and buyers. And every smallest price movement (tick) is the move to a compromise - the price level - which suits both parties at the moment.

But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then the next moment, most likely, there will be buyers.

The number of completed transactions at neighboring price levels is also not the same. And yet, first, the market situation is reflected in the total volume of transactions, and only then on the price.

If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

To successfully apply cluster analysis, you first need to understand what a cluster and a delta are.


A cluster is a price movement divided into the levels at which transactions with known volumes were made. The delta shows the difference between the buying and the selling occurring in each cluster.

Each cluster and its delta allow you to determine whether buyers or sellers dominate the market at a given time.

It is enough to calculate the total delta, the difference between purchases and sales. If the delta is negative, the market is oversold and sell transactions are in excess. When the delta is positive, buyers clearly dominate the market.

The delta itself can take on a normal or critical value. The value of the delta volume above the normal value in the cluster is highlighted in red.

If the delta is moderate, then this characterizes a flat state in the market. With a normal delta value, a trend movement is observed in the market, but a critical value is always a harbinger of a price reversal.
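As a sketch of the arithmetic involved, assuming per-cluster buy and sell volumes are available (the threshold values below are illustrative placeholders, not real trading parameters):

```python
def cluster_delta(buys: int, sells: int) -> int:
    """Delta = executed buy volume minus executed sell volume."""
    return buys - sells

def classify_delta(delta: int, normal: int = 500, critical: int = 2000) -> str:
    """Map a delta magnitude to the three regimes described above."""
    magnitude = abs(delta)
    if magnitude < normal:
        return "moderate: flat market"
    if magnitude < critical:
        return "normal: trend movement"
    return "critical: possible price reversal"

d = cluster_delta(buys=1800, sells=600)   # positive -> buyers dominate
print(d, classify_delta(d))               # 1200 normal: trend movement
```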

Forex trading with cluster analysis

To get the maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. Indeed, in this case, you can notice the very beginning of the transition from a flat to a trend movement and be able to get the most profit.

The cluster chart is more visual, on it you can see significant levels of accumulation and distribution of volumes, build support and resistance levels. This allows the trader to find the exact entry to the trade.

Using the delta, one can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside the bar of any TF.

This is especially important when approaching significant support or resistance levels. Cluster judgments are the key to understanding the market.

Source: http://orderflowtrading.ru/analitika-rynka/obemy/klasternyy-analiz/

Areas and features of application of cluster analysis

The term cluster analysis (first introduced by Tryon, 1939) actually includes a set of different classification algorithms.

A common question asked by researchers in many fields is how to organize observed data into visual structures, i.e., how to develop taxonomies.

According to the modern system accepted in biology, man belongs to primates, mammals, amniotes, vertebrates and animals.

Note that in this classification, the higher the level of aggregation, the less similarity between members in the corresponding class.

Man has more similarities with other primates (e.g., apes) than with "distant" members of the mammals (e.g., dogs), and so on.

Note that the previous discussion refers to clustering algorithms, but does not mention anything about testing for statistical significance.

In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters”.

There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about the classes, but are still in the descriptive stage of the study.

Attention!

It should be understood that cluster analysis finds the "most plausibly meaningful solution".

Therefore, testing for statistical significance is not really applicable here, even in cases where p-levels are known (as, for example, in the K-means method).

The clustering technique is used in a wide variety of fields. Hartigan (1975) has provided an excellent overview of the many published studies containing results obtained by cluster analysis methods.

For example, in the field of medicine, the clustering of diseases, treatment of diseases, or symptoms of diseases leads to widely used taxonomies.

In the field of psychiatry, the correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc. is critical to successful therapy. In archeology, using cluster analysis, researchers are trying to establish taxonomies of stone tools, funeral objects, etc.

There are wide applications of cluster analysis in marketing research. In general, whenever it is necessary to classify "mountains" of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

Tree Clustering

The example in the Primary Purpose section explains the purpose of the join (tree clustering) algorithm.

The purpose of this algorithm is to combine objects (for example, animals) into sufficiently large clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. The diagram starts with each object forming its own class (on the left side of the diagram).

Now imagine that gradually (in very small steps) you "weaken" your criterion for what objects are unique and what are not.

In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (combine) more and more clusters of increasingly different elements.

Finally, in the last step, all objects are merged together. In these charts, the horizontal axes represent the pooling distance (in vertical dendrograms, the vertical axes represent the pooling distance).

So, for each node in the graph (where a new cluster is formed), you can see the distance at which the corresponding elements are linked into a new single cluster.

When the data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by various branches.

As a result of successful analysis by the join method, it becomes possible to detect clusters (branches) and interpret them.

Joining, or tree clustering, forms clusters using some measure of dissimilarity or distance between objects. These distances can be defined in one-dimensional or multidimensional space.

For example, if you must cluster the dishes served in a cafe, you can take into account the calories they contain, their price, a subjective assessment of taste, etc.

The most direct way to calculate distances between objects in a multidimensional space is to calculate Euclidean distances.

If you have a two- or three-dimensional space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

However, the joining algorithm does not "care" whether the distances "provided" for this are real or some other derived distance measure that is more meaningful to the researcher; the challenge for researchers is to select the right measure for each specific application.

Euclidean distance. This seems to be the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

$$d(x, y) = \sqrt{\sum_{i}(x_i - y_i)^2}$$

Note that the Euclidean distance (and its square) is calculated from the original data, not from the standardized data.

This is the usual way of calculating it, which has certain advantages (for example, the distance between two objects does not change when a new object is introduced into the analysis, which may turn out to be an outlier).

Attention!

However, distances can be greatly affected by differences between the axes from which the distances are calculated. For example, if one of the axes is measured in centimeters and you then convert it to millimeters (by multiplying the values by 10), the final Euclidean distance (or the square of the Euclidean distance) calculated from the coordinates changes dramatically, and, as a result, the results of the cluster analysis can differ greatly from the previous ones.

The square of the Euclidean distance. Sometimes you may want to square the standard Euclidean distance in order to give progressively greater weight to objects that are farther apart.

This distance is calculated as follows:

$$d(x, y) = \sum_{i}(x_i - y_i)^2$$

City-block distance (Manhattan distance). This distance is simply the sum of the absolute differences across the coordinates.

In most cases, this distance measure yields results similar to those of the ordinary Euclidean distance.

However, note that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is calculated as follows:

$$d(x, y) = \sum_{i}\lvert x_i - y_i\rvert$$

Chebyshev distance. This distance can be useful when one wishes to define two objects as "different" if they differ on any one coordinate (any one dimension). The Chebyshev distance is calculated as follows:

$$d(x, y) = \max_{i}\lvert x_i - y_i\rvert$$

Power distance. Sometimes one wishes to progressively increase or decrease the weight attached to a dimension on which the corresponding objects differ greatly.

This can be achieved with the power distance, calculated as follows:

$$d(x, y) = \left(\sum_{i}\lvert x_i - y_i\rvert^{p}\right)^{1/r}$$

where r and p are user-defined parameters. A few sample calculations show how this measure "works".

The parameter p controls the progressive weighting of differences along individual coordinates; the parameter r controls the progressive weighting of large distances between objects. If both parameters, r and p, are equal to two, this distance coincides with the Euclidean distance.

The percentage of disagreement. This measure is used when the data are categorical. The distance is calculated as follows:

$$d(x, y) = \frac{\text{number of coordinates with } x_i \neq y_i}{\text{total number of coordinates}}$$
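For illustration, the distance measures above can be computed with SciPy and NumPy; a minimal sketch on two hypothetical feature vectors:

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 4.0, 2.0])
y = np.array([2.0, 1.0, 3.0])

print(distance.euclidean(x, y))    # sqrt(1 + 9 + 1) ~ 3.317
print(distance.sqeuclidean(x, y))  # 11.0
print(distance.cityblock(x, y))    # |1| + |3| + |1| = 5 (Manhattan)
print(distance.chebyshev(x, y))    # max(1, 3, 1) = 3

# The general (p, r) power distance is one line of NumPy;
# with p = r it reduces to the Minkowski distance.
p, r = 3, 2
print(np.sum(np.abs(x - y) ** p) ** (1 / r))

# Percent disagreement for categorical vectors:
a = np.array(["red", "blue", "red"])
b = np.array(["red", "green", "red"])
print(np.mean(a != b))             # 1/3
```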

Amalgamation or linkage rules

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure.

However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

In other words, you need a join or link rule for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding link distance.

In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method.

This rule builds "fibrous" clusters, i.e. clusters "linked together" only by individual elements that happen to be closer to each other than the others.

Alternatively, you can define the distance between clusters by the pairs of objects in the two clusters that are farthest apart from each other. This method is called the complete-link method.

There are also many other methods for joining clusters, similar to those that have been discussed.

Single connection (nearest neighbor method). As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

This rule, in a sense, strings objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

Full connection (method of the most distant neighbors). In this method, the distances between clusters are defined as the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

Unweighted pairwise mean. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

The method is effective when the objects actually form distinct "clumps", but it works equally well in cases of elongated ("chain"-type) clusters.

Note that in their book Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pairwise mean. The method is identical to the unweighted pairwise mean method, except that the size of the respective clusters (i.e., the number of objects they contain) is used as a weighting factor in the calculations.

Therefore, the proposed method should be used (rather than the previous one) when unequal cluster sizes are assumed.

Sneath and Sokal (1973) introduce the abbreviation WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity.

Attention!

Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

Weighted centroid method (median). This method is identical to the previous one, except that weights are used in the calculations to take into account the difference between cluster sizes (i.e., the number of objects in them).

Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward's method. This method differs from all the other methods because it uses analysis-of-variance (ANOVA) methods to estimate the distances between clusters.

The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

Details can be found in Ward (1963). In general, the method seems to be very efficient, but it tends to create small clusters.
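A minimal sketch comparing these linkage rules side by side on the same (hypothetical) data; the method names follow scipy.cluster.hierarchy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))   # 20 hypothetical objects, 3 features

for method in ["single",    # nearest neighbor
               "complete",  # most distant neighbors
               "average",   # UPGMA
               "weighted",  # WPGMA
               "centroid",  # UPGMC
               "median",    # WPGMC
               "ward"]:     # Ward's minimum-variance method
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```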

Two-way joining

Earlier, this method was discussed in terms of the "objects" to be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables.

It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (observations) with heart disease.

The investigator may wish to cluster observations (of patients) to identify clusters of patients with similar symptoms.

At the same time, the researcher may want to cluster variables to identify clusters of variables that are associated with a similar physical state.

After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

The Cluster Analysis module contains an efficient two-way join procedure to do just that.

However, two-way pooling is used (relatively rarely) in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

So, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients that are similar in relation to certain clusters of physical condition characteristics.

The difficulty in interpreting the results obtained arises from the fact that the similarities between different clusters may come from (or be the cause of) some difference in the subsets of variables.

Therefore, the resulting clusters are inherently heterogeneous. Perhaps it seems a bit hazy at first; indeed, compared to other cluster analysis methods described, two-way pooling is probably the least commonly used method.

However, some researchers believe that it offers a powerful tool for exploratory data analysis (for more information, see Hartigan's description of this method (Hartigan, 1975)).

K-means method

This clustering method differs significantly from agglomerative methods such as Union (tree clustering) and Two-Way Union. Suppose you already have hypotheses about the number of clusters (by observation or by variable).

You can tell the system to form exactly three clusters so that they are as different as possible.

This is exactly the type of problem that the K-Means algorithm solves. In general, the K-means method builds exactly K distinct clusters spaced as far apart as possible.

In the physical condition example, a medical researcher may have a "hunch" from their clinical experience that their patients generally fall into three different categories.

Attention!

If so, then the means of the various measures of physical parameters for each cluster would provide a quantitative way of representing the investigator's hypotheses (e.g., patients in cluster 1 have a high value of parameter 1, a lower value of parameter 2, etc.).

From a computational point of view, you can think of this method as an analysis of variance "in reverse". The program starts with K randomly selected clusters, and then changes the belonging of objects to them in order to:

  1. minimize variability within clusters,
  2. maximize variability between clusters.

This method is similar to reverse analysis of variance (ANOVA) in that the significance test in ANOVA compares between-group versus within-group variability in testing the hypothesis that group means differ from each other.

In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when performing analysis of variance (ANOVA).

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the means for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most, if not all, of the measurements used in the analysis.
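A minimal sketch of this workflow with scikit-learn, on hypothetical data: run K-means, then inspect the per-cluster means for each dimension:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))   # 60 hypothetical observations, 4 measures

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Per-cluster means show how the clusters differ on each measurement
for k in range(3):
    members = X[km.labels_ == k]
    print(f"cluster {k}: n={len(members)}, "
          f"means={members.mean(axis=0).round(2)}")
```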

Source: http://www.biometrica.tomsk.ru/textbook/modules/stcluan.html

Classification of objects according to their characteristics

Cluster analysis is a set of multidimensional statistical methods for classifying objects according to their characteristics, dividing a set of objects into homogeneous groups that are close in terms of the defining criteria, and identifying the objects of a particular group.

A cluster is a group of objects identified as a result of cluster analysis based on a given measure of similarity or difference between objects.

Objects are the specific subjects of study that need to be classified. The objects in a classification are, as a rule, observations: for example, consumers of products, countries or regions, products, etc.

Although it is possible to carry out cluster analysis by variables. Classification of objects in multidimensional cluster analysis occurs according to several criteria simultaneously.

These can be both quantitative and categorical variables, depending on the method of cluster analysis. So, the main goal of cluster analysis is to find groups of similar objects in the sample.

The set of multivariate statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

However, there is no generally accepted classification of methods, and sometimes cluster analysis methods also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

The scope of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other areas.

Here are some examples of applying cluster analysis:

  • medicine - classification of diseases, their symptoms, methods of treatment, classification of patient groups;
  • marketing - the tasks of optimizing the company's product line, segmenting the market by groups of goods or consumers, identifying a potential consumer;
  • sociology - division of respondents into homogeneous groups;
  • psychiatry - correct diagnosis of symptom groups is crucial for successful therapy;
  • biology - classification of organisms by group;
  • economy - classification of subjects of the Russian Federation by investment attractiveness.

Source: http://www.statmethods.ru/konsalting/statistics-methody/121-klasternyj-analyz.html

General information about cluster analysis

Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

For example, biologists aim to break down animals into different species in order to meaningfully describe the differences between them.

The task of cluster analysis is to divide the initial set of objects into groups of similar, close objects. These groups are called clusters.

In other words, cluster analysis is one of the ways to classify objects according to their characteristics. It is desirable that the classification results have a meaningful interpretation.

The results obtained by cluster analysis methods are used in various fields. In marketing, it is the segmentation of competitors and consumers.

In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is crucial for successful therapy.

In management, the classification of suppliers is important, as is the identification of similar production situations in which defects occur. In sociology, the division of respondents into homogeneous groups. In portfolio investment, it is important to group securities by the similarity of their return trends in order to compile, based on the information obtained about the stock market, an optimal investment portfolio that maximizes the return on investment for a given degree of risk.

In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

Cluster analysis allows considering a fairly large amount of information and greatly compressing large arrays of socio-economic information, making them compact and visual.

Attention!

Cluster analysis is of great importance in relation to sets of time series characterizing economic development (for example, general economic and commodity conditions).

Here it is possible to single out the periods when the values of the corresponding indicators were quite close, as well as to determine the groups of time series whose dynamics are most similar.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Advantages and disadvantages

Cluster analysis allows for an objective classification of any objects that are characterized by a number of features. There are a number of benefits to be derived from this:

  1. The resulting clusters can be interpreted, that is, one can describe what kinds of groups actually exist.
  2. Individual clusters can be discarded. This is useful when certain errors were made in the data set, causing the indicator values for individual objects to deviate sharply. When cluster analysis is applied, such objects fall into a separate cluster.
  3. For further analysis, only those clusters that have the characteristics of interest can be selected.

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depends on the selected partitioning criteria.

When the initial data array is reduced to a more compact form, certain distortions may occur, and the individual features of particular objects may be lost through their replacement by the generalized values of the cluster parameters.

Methods

Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering.

The Statistica package implements the following clustering methods.

  • Hierarchical algorithms - tree clustering. Hierarchical algorithms are based on the idea of sequential clustering. At the initial step, each object is considered a separate cluster. At each subsequent step, some of the clusters closest to each other are combined into a single cluster.
  • K-means method. This method is the most commonly used. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is set by the user.
  • Two way association. When using this method, clustering is carried out simultaneously both by variables (columns) and by observation results (rows).

The two-way join procedure is performed when it can be expected that simultaneous clustering on variables and observations will provide meaningful results.

The results of the procedure are descriptive statistics on the variables and observations, as well as a two-dimensional color chart on which the data values are marked with color.

By the distribution of color, you can get an idea of the homogeneous groups.

Normalization of variables

The division of the initial set of objects into clusters is associated with the calculation of distances between objects and the choice of objects, the distance between which is the smallest of all possible.

The most commonly used is the Euclidean (geometric) distance familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

But for a given metric, the distance between objects can be strongly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change dramatically. This will lead to the fact that the results of cluster analysis may differ significantly from the previous ones.

If the variables are measured in different units of measurement, then their preliminary normalization is required, that is, the transformation of the initial data, which converts them into dimensionless quantities.

Normalization strongly distorts the geometry of the original space, which can change the results of clustering.

In the Statistica package, any variable x is normalized according to the formula:

$$z = \frac{x - \bar{x}}{\sigma},$$

where $\bar{x}$ is the mean and $\sigma$ is the standard deviation of the variable.

To do this, right-click on the variable name and select the following sequence of commands from the menu that opens: Fill / Standardize Block / Standardize Columns. The mean values of the normalized variables will become equal to zero, and the variances will become equal to one.
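Outside Statistica, the same standardization is straightforward; a minimal NumPy sketch (the data values are hypothetical):

```python
import numpy as np

X = np.array([[170.0, 65.0],
              [180.0, 80.0],
              [160.0, 55.0]])

# z = (x - mean) / std, column by column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(Z.mean(axis=0).round(6))   # ~0 for every column
print(Z.std(axis=0, ddof=1))     # 1 for every column
```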

K-means method in Statistica

The K-means method splits a set of objects into a given number K of different clusters located at the greatest possible distance from each other.

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the averages for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most of the measurements used in the analysis.

The F-statistic values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.

As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with career quality indicators. The table contains the answers to the questionnaire questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

The variable names correspond to the answers to the following questions:

  1. SLT - a combination of personal goals and the goals of the organization;
  2. OSO - a sense of fairness in wages;
  3. TBD - territorial proximity to the house;
  4. PEW - a sense of economic well-being;
  5. CR - career growth;
  6. ZhSR - the desire to change jobs;
  7. OSB - a sense of social well-being.

Using this data, it is necessary to divide the employees into groups and select the most effective control levers for each of them.

At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To date, most sociological surveys report only percentages of votes: either the number of positive answers or the percentage of the dissatisfied is considered, and the issue is not examined systematically.

Most often, the survey does not show trends in the situation. In some cases, it is necessary to count not the number of people who are “for” or “against”, but the distance, or the measure of similarity, that is, to determine groups of people who think about the same.

Cluster analysis procedures can be used to identify, on the basis of survey data, some really existing relationships of features and generate their typology on this basis.

Attention!

The presence of any a priori hypotheses of a sociologist when working with cluster analysis procedures is not a necessary condition.

In Statistica, cluster analysis is performed as follows.

When choosing the number of clusters, be guided by the following: the number of clusters, if possible, should not be too large.

The distance at which the objects of a given cluster were joined should, if possible, be much less than the distance at which something else joins this cluster.

When choosing the number of clusters, most often there are several correct solutions at the same time.

We are interested, for example, in how the answers to the questionnaire differ between ordinary employees and the management of the enterprise. Therefore, we choose K = 2. For further segmentation, you can increase the number of clusters. Next, the initial cluster centers must be chosen; the options are:

  1. select observations with the maximum distance between cluster centers;
  2. sort distances and select observations at regular intervals (default setting);
  3. take the first observation centers and attach the rest of the objects to them.

Option 1 is suitable for our purposes.

Many clustering algorithms often “impose” a structure that is not inherent in the data and disorient the researcher. Therefore, it is extremely necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.

The results of the analysis can be viewed in the dialog box that appears:

If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be plotted:


Each broken line on this graph corresponds to one of the clusters. Each division of the horizontal axis of the graph corresponds to one of the variables included in the analysis.

The vertical axis corresponds to the average values of the variables for the objects included in each of the clusters.

It can be noted that there are significant differences in the attitude of the two groups of people to a service career on almost all issues. Only in one issue is there complete unanimity - in the sense of social well-being (OSB), or rather, the lack of it (2.5 points out of 10).

It can be assumed that cluster 1 represents workers and cluster 2 management. Managers are more satisfied with career growth (CR) and with the combination of personal goals and the goals of the organization (SLT).

They have a stronger sense of economic well-being (PEW) and a stronger sense of fairness in wages (OSO).

They are less concerned than the workers about proximity to home, probably because of fewer transportation problems. Managers also have less desire to change jobs (ZhSR).

Although the employees are divided into two categories, they give roughly the same answers to most questions. In other words, if something does not suit the general group of employees, it does not suit senior management either, and vice versa.

The similar shape of the two graphs allows us to conclude that the well-being of one group is reflected in the well-being of the other.

Cluster 1 is not satisfied with the territorial proximity to the house. This group is the main part of the workers who mainly come to the enterprise from different parts of the city.

Therefore, it is possible to offer the top management to direct part of the profit to the construction of housing for the employees of the enterprise.

Significant differences are seen in the attitude of the two groups of people to a service career. Those employees who are satisfied with career growth, who have a high coincidence of personal goals and the goals of the organization, do not have a desire to change jobs and feel satisfaction with the results of their work.

Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the above indicators. Senior management should pay particular attention to the current situation.

The results of the analysis of variance for each feature are displayed by pressing the Analysis of variance button.

The sums of squared deviations of objects from the cluster centers (SS Within) and between the cluster centers (SS Between) are displayed, along with the F-statistic values and the p significance levels.

Attention!

For our example, the significance levels for the two variables are quite large, which is explained by the small number of observations. In the full version of the study, which can be found in the work, the hypotheses about the equality of the means for the cluster centers are rejected at significance levels less than 0.01.

The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

The table shows the case numbers (CASE_NO) that make up the clusters with CLUSTER numbers and the distances from the center of each cluster (DISTANCE).

Information about objects belonging to clusters can be written to a file and used in further analysis. In this example, a comparison of the results obtained with the questionnaires showed that cluster 1 consists mainly of ordinary workers, and cluster 2 - of managers.

Thus, it can be seen that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows drawing conclusions that cannot be reached by constructing a histogram of averages or by calculating the percentage of those satisfied with various indicators of the quality of working life.

Tree clustering is an example of a hierarchical algorithm, the principle of which is to sequentially cluster first the closest, and then more and more distant elements from each other into a cluster.

Most of these algorithms start from a matrix of similarity (distances), and each individual element is considered at first as a separate cluster.

After loading the cluster analysis module and selecting Joining (tree clustering), you can change the following parameters in the clustering parameters entry window:

  • Initial data (Input). They can be in the form of a matrix of the studied data (Raw data) and in the form of a matrix of distances (Distance matrix).
  • Clustering (Cluster) observations (Cases (raw)) or variables (Variable (columns)), describing the state of the object.
  • Distance measures. Here you can select: Euclidean distances, squared Euclidean distances, city-block (Manhattan) distance, Chebyshev distance metric, power distance, and percent disagreement.
  • Clustering method (Amalgamation (linkage) rule). The following options are available: single linkage, complete linkage, unweighted pair-group average, weighted pair-group average, unweighted pair-group centroid, weighted pair-group centroid (median), and Ward's method.

As a result of clustering, a horizontal or vertical dendrogram is built - a graph on which the distances between objects and clusters are determined when they are sequentially combined.

The tree structure of the graph allows you to define clusters depending on the selected threshold - a given distance between clusters.
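A minimal sketch of this threshold cut with SciPy (hypothetical data; the threshold value is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 5))   # 12 hypothetical objects, 5 features

Z = linkage(X, method="complete")
# Cut the tree at the chosen distance threshold to obtain flat clusters
labels = fcluster(Z, t=4.0, criterion="distance")
print(labels)   # cluster number for each of the 12 objects
```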

In addition, the matrix of distances between the original objects (Distance matrix) is displayed, along with the means and standard deviations for each original object (Descriptive statistics).

For the considered example, we will carry out a cluster analysis of variables with default settings. The resulting dendrogram is shown in the figure.


The vertical axis of the dendrogram plots the distances between objects, and between objects and clusters. Thus, the distance between the variables PEW and OSO is five; at the first step, these variables are combined into one cluster.

The horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distances selected for a given clustering step.

It can be seen from the graph that the question "desire to change jobs" (ZhSR) forms a separate cluster. In general, the desire to leave visits everyone equally. Next, the question of territorial proximity to home (TBD) forms a separate cluster.

In terms of importance, it is in second place, which confirms the conclusion about the need for housing construction, made according to the results of the study using the K-means method.

The feelings of economic well-being (PEW) and fairness in wages (OSO) are combined: this is a block of economic issues. Career growth (CR) and the combination of personal goals and the goals of the organization (SLT) are also combined.

Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

Results:

  1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
  2. The Statistica program implements both hierarchical and structural methods of cluster analysis. The advantages of this statistical package are due to their graphical capabilities. Two-dimensional and three-dimensional graphical representations of the obtained clusters in the space of the studied variables are provided, as well as the results of the hierarchical procedure for grouping objects.
  3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.
  4. Cluster analysis can be considered successful if it is performed in different ways, the results are compared and common patterns are found, and stable clusters are found regardless of the clustering method.
  5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Therefore, this method of non-parametric statistics can be considered as an integral part of system analysis.

10.1.1 Basic concepts.

Let there be a collection of n objects, each of which is characterized by k measured traits. It is required to partition this collection into groups that are homogeneous in some sense. At the same time, there is practically no a priori information about the nature of the distribution of the k-dimensional vector of traits inside the classes.
The resulting groups are usually called clusters (taxa, images), and the methods for finding them are called cluster analysis (numerical taxonomy, or self-learning pattern recognition).

The solution of the problem consists in determining the natural stratification of the results of observations into clearly defined clusters lying at a certain distance from each other. (It may turn out that the set of observations does not show a natural stratification into clusters, i.e. forms one cluster).

The usual form of representing the initial data in cluster analysis problems is the matrix

$$X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix},$$

each row of which represents the results of measuring the k features considered for one of the n objects.
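As a small illustration, such a data matrix might look as follows (a hypothetical example with n = 4 objects and k = 3 features):

```python
import numpy as np

# One row per object, one column per measured feature (values hypothetical)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 4.2]])

n_objects, n_features = X.shape
print(n_objects, n_features)   # 4 3
```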

Clustering is designed to divide a set of objects into homogeneous groups (clusters or classes). If the sample data are represented as points in a feature space, then the clustering problem reduces to finding "condensations of points".

The term "cluster" translates as "accumulation" or "bunch". Synonyms for the term "clustering" are "automatic classification", "unsupervised learning", and "taxonomy".

The purpose of clustering is to search for existing structures. Clustering is a descriptive procedure: it draws no statistical conclusions, but it provides an opportunity to conduct exploratory analysis and study the "structure of the data". Classes are not predetermined; the search is for the most similar, homogeneous groups. A cluster can be described as a group of objects that share common properties.

There are two characteristics of a cluster:

    internal homogeneity;

    external isolation.

Clusters can be non-overlapping (exclusive) or overlapping. A schematic representation of non-overlapping and overlapping clusters is given in Fig. 10.1.

Fig. 10.1. Non-overlapping and overlapping clusters

The term "cluster analysis", first introduced by Tryon in 1939, combines more than 100 different algorithms.

Unlike classification problems, cluster analysis does not require a priori assumptions about the data set, does not impose restrictions on the representation of the objects under study, and allows you to analyze indicators of various types of data (interval data, frequencies, binary data). It must be remembered that the variables must be measured on comparable scales.

10.1.2 Cluster characteristics

The cluster has the following mathematical characteristics: center, radius, standard deviation, cluster size.

Each object of the population in cluster analysis is considered as a point in a given feature space. The value of each of the attributes of a given unit serves as its coordinate in this space.

The cluster center is the mean point of the cluster in the space of variables (the centroid).

Cluster radius - the maximum distance of points from the center of the cluster.

If it is impossible to unambiguously assign an object to one of the two clusters using mathematical procedures, then such objects are called disputable, and an overlap of clusters is detected. A disputed object is an object that can be assigned to several clusters based on similarity.

The size of a cluster can be determined either by the radius of the cluster or by the standard deviation of the features for that cluster. An object belongs to a cluster if the distance from the object to the center of the cluster is less than the radius of the cluster. If this condition is met for two or more clusters, the object is disputable. The ambiguity of this problem can be eliminated by an expert or an analyst.
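A minimal sketch of this membership rule, with hypothetical centers and radii; an object that falls inside more than one radius comes out as disputable:

```python
import numpy as np

# Hypothetical cluster centers and radii
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
radii = np.array([2.5, 2.5])

def memberships(point: np.ndarray) -> list:
    """Clusters whose center is closer to the point than the cluster radius."""
    dists = np.linalg.norm(centers - point, axis=1)
    return [i for i, (d, r) in enumerate(zip(dists, radii)) if d < r]

print(memberships(np.array([0.5, 0.5])))  # [0]    - belongs to cluster 0
print(memberships(np.array([2.0, 0.0])))  # [0, 1] - a disputable object
```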

Cluster analysis methods fall into two broad groups, hierarchical and non-hierarchical; each group includes many approaches and algorithms.

Using various cluster analysis methods, an analyst can obtain different solutions for the same data. This is considered normal. Consider hierarchical and non-hierarchical methods in detail.

The essence of hierarchical clustering is the sequential merging of smaller clusters into larger clusters or the division of large clusters into smaller ones.

Hierarchical agglomerative methods (Agglomerative Nesting, AGNES). This group of methods is characterized by a successive merging of the original elements and a corresponding decrease in the number of clusters. At the beginning of the algorithm, all objects are separate clusters. At the first step, the most similar objects are combined into a cluster. In subsequent steps, the merging continues until all objects form one cluster.

Hierarchical divisive methods (DIvisive ANAlysis, DIANA). These methods are the logical opposite of agglomerative methods. At the beginning of the algorithm, all objects belong to one cluster, which at subsequent steps is divided into smaller clusters; as a result, a sequence of splitting groups is formed.

Non-hierarchical methods reveal higher resistance to noise and outliers, incorrect choice of metric, inclusion of insignificant variables in the set involved in clustering. The price to be paid for these advantages of the method is the word "a priori". The analyst must predetermine the number of clusters, the number of iterations, or the stopping rule, as well as some other clustering parameters. This is especially difficult for beginners.

If there are no assumptions about the number of clusters, it is recommended to use hierarchical algorithms. However, if the sample size does not allow this, a possible way is to conduct a series of experiments with a different number of clusters, for example, start splitting the data set from two groups and, gradually increasing their number, compare the results. Due to this "variation" of the results, a sufficiently large clustering flexibility is achieved.

Hierarchical methods, unlike non-hierarchical ones, refuse to determine the number of clusters, but build a complete tree of nested clusters.

Complexities of hierarchical clustering methods: limitation of the volume of the data set; choice of measure of proximity; inflexibility of the obtained classifications.

The advantage of this group of methods in comparison with non-hierarchical methods is their clarity and the ability to get a detailed idea of ​​the data structure.

When using hierarchical methods, it is possible to identify outliers in a data set quite easily and, as a result, improve data quality. This procedure underlies the two-step clustering algorithm. Such a data set can later be used for non-hierarchical clustering.

There is another aspect that has already been mentioned in this lecture: the question of clustering the entire population of data or a sample of it. This aspect is essential for both groups of methods considered, but it is more critical for hierarchical methods. Hierarchical methods cannot work with large data sets, so using a sample, i.e., part of the data, could allow these methods to be applied.

Clustering results may not have sufficient statistical justification. On the other hand, when solving clustering problems, a non-statistical interpretation of the results obtained is acceptable, as well as a fairly large variety of options for the concept of a cluster. Such a non-statistical interpretation enables the analyst to obtain satisfactory clustering results, which is often difficult when using other methods.

1) The complete-linkage method.

The essence of this method is that two objects belonging to the same group (cluster) have a similarity coefficient of no less than some threshold value S. In terms of the Euclidean distance d, this means that the distance between two points (objects) of a cluster should not exceed some threshold value h. Thus, h defines the maximum allowable diameter of a subset forming a cluster.

2) Method of maximum local distance.

Each object is considered as a one-point cluster. Objects are grouped according to the following rule: two clusters are combined if the maximum distance between the points of one cluster and the points of another is minimal. The procedure consists of n - 1 steps and results in partitions that match all possible partitions in the previous method for any threshold values.

3) Ward's method.

In this method, the intragroup sum of squared deviations is used as an objective function, which is nothing more than the sum of the squared distances between each point (object) and the average for the cluster containing this object. At each step, two clusters are combined that lead to the minimum increase in the objective function, i.e. intragroup sum of squares. This method is aimed at combining closely spaced clusters.

4) Centroid method.

In this method, the distance between two clusters is defined as the Euclidean distance between the centers (means) of these clusters:

$$d_{ij}^2 = (\bar{X} - \bar{Y})^{T}(\bar{X} - \bar{Y}),$$

where $\bar{X}$ and $\bar{Y}$ are the mean vectors of clusters i and j. Clustering proceeds step by step: at each of the n - 1 steps, the two clusters with the minimum value of $d_{ij}^2$ are united. If n1 is much greater than n2, the center of the union of the two clusters is close to the center of the first cluster, and the characteristics of the second cluster are practically ignored when the clusters are combined. This method is sometimes also called the method of weighted groups.
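The computation itself is a one-liner; a sketch with two hypothetical clusters:

```python
import numpy as np

# Two hypothetical clusters of points
G1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
G2 = np.array([[8.0, 8.0], [9.0, 10.0]])

# Squared Euclidean distance between the cluster centers:
# (Xbar - Ybar)^T (Xbar - Ybar)
diff = G1.mean(axis=0) - G2.mean(axis=0)
d2 = diff @ diff
print(d2)   # squared distance between the two centroids
```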

We know that the Earth is one of the 8 planets that revolve around the Sun. The Sun is just one star among roughly 200 billion in the Milky Way galaxy. That number is very hard to grasp. From it, one can estimate the number of stars in the universe: approximately 4 × 10^22. We can see about a million stars in the sky, although this is only a small fraction of the actual number of stars. So we have two questions:

  1. What is a galaxy?
  2. And what is the connection between galaxies and the topic of this article, cluster analysis?


A galaxy is a collection of stars, gas, dust, planets, and interstellar clouds. Galaxies usually resemble a spiral or an elliptical figure. In space, galaxies are separated from one another. Huge black holes most often sit at the centers of galaxies.

As we will discuss in the next section, there are many similarities between galaxies and cluster analysis. Galaxies exist in three-dimensional space, cluster analysis is a multidimensional analysis carried out in n-dimensional space.

Note: a black hole sits at the center of a galaxy. We will use a similar idea for centroids in cluster analysis.

Cluster analysis

Let's say you're the head of marketing and customer relations at a telecommunications company. You understand that all customers are different and that you need different strategies to reach different customers. You will appreciate the power of such a tool as customer segmentation to optimize costs. To brush up on your knowledge of cluster analysis, consider the following example, which illustrates 8 customers and their average conversation duration (local and international). Below is the data:

For better perception, let's draw a graph where the x-axis will be the average duration of international calls, and the y-axis - the average duration of local calls. Below is the chart:

Note: This is similar to analyzing the positions of the stars in the night sky (here the stars are replaced by consumers). In addition, instead of a 3D space, we have a 2D one, defined by the durations of local and international calls as the x and y axes.

Now, speaking in terms of galaxies, the problem is formulated as follows: find the positions of the black holes; in cluster analysis they are called centroids. To detect the centroids, we will start by taking arbitrary points as their positions.

Euclidean Distance for Finding Centroids for Clusters

In our case, we will randomly place two centroids (C1 and C2) at the points with coordinates (1, 1) and (3, 4). Why did we choose these two centroids? A visual inspection of the points on the graph suggests that there are two clusters to analyze. However, we will see later that the answer to this question is not so simple for a large dataset.

Next, we will measure the distance between the centroids (C1 and C2) and all points on the graph using the Euclidean formula for the distance between two points:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

Note: The distance can also be calculated using other formulas, for example:

  1. the squared Euclidean distance - to give more weight to objects that are farther apart;
  2. the Manhattan distance - to reduce the influence of outliers;
  3. the power distance - to increase or decrease the influence along specific coordinates;
  4. percent disagreement - for categorical data;
  5. etc.
Columns 3 and 4 (Distance from C1 and Distance from C2) contain the distances calculated by this formula. For example, for the first consumer the distance to C1 is 1.41 and the distance to C2 is 2.24.

Cluster membership (the last column) is assigned by proximity to the centroids (C1 and C2). The first consumer is closer to centroid 1 (1.41 versus 2.24) and therefore belongs to the cluster with centroid C1.
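As a quick sanity check, these numbers can be reproduced in a few lines of Python. The article's data table is not reproduced here, so the coordinates of the first consumer are an assumption: the point (2, 2) is simply one position consistent with the distances 1.41 and 2.24 quoted above.

```python
import math

# Centroids chosen in the text
c1, c2 = (1, 1), (3, 4)

# Assumed coordinates for the first consumer; (2, 2) reproduces
# the distances 1.41 and 2.24 given in the example.
consumer_1 = (2, 2)

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

print(round(euclidean(consumer_1, c1), 2))  # 1.41 -> closer, assign to C1
print(round(euclidean(consumer_1, c2), 2))  # 2.24
```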

Below is a graph illustrating the centroids C1 and C2 (depicted as blue and orange diamonds). Consumers are shown in the color of the centroid to which they were assigned.

Since we chose the centroids arbitrarily, the second step is to make this choice iterative. The new position of each centroid is the mean of the points of its cluster. For example, the first cluster contains consumers 1, 2 and 3, so the new x-coordinate of centroid C1 is the average of their x-coordinates: (2 + 1 + 1)/3 = 1.33. We get the new coordinates C1 (1.33, 2.33) and C2 (4.4, 4.2). The new plot is below:

Repeating this step, we finally place the centroids at the centers of their respective clusters. The chart is below:

The positions of our black holes (cluster centers) in our example are C1 (1.75, 2.25) and C2 (4.75, 4.75). The two clusters above are like two galaxies separated from each other in space.
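For readers who want to experiment, here is a minimal k-means sketch in Python. Only the starting centroids (1, 1) and (3, 4) come from the example above; the eight customer points are illustrative stand-ins, since the original data table is not reproduced here.

```python
import numpy as np

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(n_iter):
        # distance matrix: rows are points, columns are centroids
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(len(centroids)):
            if (labels == k).any():
                centroids[k] = points[labels == k].mean(axis=0)
    return labels, centroids

# Stand-in data: 8 customers as (international, local) average call durations
customers = [(2, 2), (1, 2), (1, 3), (2, 2.5),
             (4, 4), (5, 5), (4.5, 4.5), (5, 5.5)]
labels, centers = kmeans(customers, centroids=[(1, 1), (3, 4)])
print(labels)   # cluster index for every customer
print(centers)  # final centroid positions, our "black holes"
```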

So, let's take the examples further. Suppose we face the task of segmenting consumers by two parameters: age and income. Say we have 2 consumers aged 37 and 44 with incomes of $90,000 and $62,000 respectively. If we measure the Euclidean distance between the points (37, 90000) and (44, 62000), the income variable "dominates" the age variable, and its changes strongly affect the distance. We need some strategy for this problem, otherwise our analysis will give incorrect results. The solution is to bring our values to comparable scales: normalization.

Data normalization

There are many approaches to normalizing data, for example, min-max normalization, which uses the formula

\[ X^{*} = \frac{X - X_{min}}{X_{max} - X_{min}} \]

where X* is the normalized value, and X_min and X_max are the minimum and maximum over the entire set X. (Note that this formula places all coordinates on the segment [0, 1].)

Consider our example: let the maximum income be $130,000 and the minimum $45,000. The normalized income for consumer A is then (90,000 − 45,000) / (130,000 − 45,000) ≈ 0.53.

We repeat this exercise for every point and every variable (coordinate). The income of the second consumer ($62,000) becomes 0.2 after normalization. Additionally, let the minimum and maximum ages be 23 and 58 respectively; after normalization, the ages of our two consumers become 0.4 and 0.6.

It's easy to see that now all of our data is between 0 and 1. Therefore, we now have normalized datasets on comparable scales.
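The whole normalization exercise fits in a few lines of Python; the numbers below are exactly the ones from the example above.

```python
def min_max(x, lo, hi):
    """Min-max normalization: maps x linearly onto [0, 1]."""
    return (x - lo) / (hi - lo)

# Incomes: minimum 45,000, maximum 130,000
print(round(min_max(90000, 45000, 130000), 2))  # 0.53 -> first consumer
print(round(min_max(62000, 45000, 130000), 2))  # 0.2  -> second consumer

# Ages: minimum 23, maximum 58
print(round(min_max(37, 23, 58), 2))  # 0.4
print(round(min_max(44, 23, 58), 2))  # 0.6
```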

Remember, before the cluster analysis procedure, it is necessary to perform normalization.

Clustering tasks in Data Mining

Introduction to Cluster Analysis

From the vast field of applications of cluster analysis, take, for example, the problem of socio-economic forecasting.

When analyzing and forecasting socio-economic phenomena, the researcher often encounters the multidimensionality of their description. This happens when solving the problem of market segmentation, building a typology of countries according to a sufficiently large number of indicators, predicting the market situation for individual goods, studying and predicting economic depression, and many other problems.

Multivariate analysis methods are the most effective quantitative tool for studying socio-economic processes described by a large number of characteristics. These include cluster analysis, taxonomy, pattern recognition, and factor analysis.

Cluster analysis most clearly reflects the features of multivariate analysis in classification, while factor analysis does so in the study of relationships.

Sometimes the cluster analysis approach is referred to in the literature as numerical taxonomy, numerical classification, self-learning recognition, etc.

Cluster analysis found its first applications in sociology. The name comes from the English word cluster - bunch, accumulation. The subject of cluster analysis was first defined and described in 1939 by the researcher Tryon. The main purpose of cluster analysis is to divide the set of objects and features under study into groups, or clusters, that are homogeneous in the appropriate sense. This means solving the problem of classifying the data and identifying the corresponding structure in it. Cluster analysis methods can be applied in a wide variety of cases, even when we are talking about a simple grouping in which everything comes down to forming groups by quantitative similarity.

The great advantage of cluster analysis is that it splits objects not by one parameter but by a whole set of features. In addition, unlike most mathematical and statistical methods, cluster analysis does not impose any restrictions on the type of objects under consideration and allows a set of initial data of almost arbitrary nature. This matters greatly, for example, in forecasting market conditions, when the indicators take diverse forms that make it difficult to use traditional econometric approaches.

Cluster analysis makes it possible to take in a fairly large amount of information and to drastically reduce and compress large arrays of socio-economic information, making them compact and visual.

Cluster analysis is also of great importance for sets of time series characterizing economic development (for example, general economic and commodity market conditions). Here one can single out the periods when the values of the corresponding indicators were quite close, and determine the groups of time series whose dynamics are most similar.

Cluster analysis can be used cyclically. In this case, the study is carried out until the desired results are achieved. At the same time, each cycle here can provide information that can greatly change the direction and approaches of further application of cluster analysis. This process can be represented as a feedback system.

In the tasks of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depend on the chosen partitioning criteria. When the initial data array is reduced to a more compact form, certain distortions may occur, and the individual features of particular objects may be lost through their replacement by the generalized values of the cluster parameters. When classifying objects, the possibility that the considered set contains no clusters at all is very often ignored.

In cluster analysis, it is considered that:

a) the selected characteristics allow, in principle, the desired clustering;

b) units of measurement (scale) are chosen correctly.

The choice of scale plays a big role. Typically, data is normalized by subtracting the mean and dividing by the standard deviation so that the variance is equal to one.

1. The task of clustering

The task of clustering is, based on the data contained in the set X, to split the set of objects G into m (m is an integer) clusters (subsets) Q1, Q2, …, Qm, so that each object Gj belongs to one and only one subset of the partition, objects belonging to the same cluster are similar, and objects belonging to different clusters are heterogeneous.

For example, let G include n countries, each characterized by GNP per capita (F1), the number of cars per 1,000 people (F2), per capita electricity consumption (F3), per capita steel consumption (F4), etc. Then X1 (the measurement vector) is the set of these characteristics for the first country, X2 for the second, X3 for the third, and so on. The task is to partition the countries by level of development.

The solutions to the cluster analysis problem are partitions that satisfy a certain optimality criterion. This criterion can be some functional expressing the desirability of various partitions and groupings, called the objective function. For example, the intragroup sum of squared deviations can be taken as the objective function:

\[ E = \sum_{i=1}^{m} \sum_{x_j \in Q_i} \lVert x_j - \bar{x}^{(i)} \rVert^2 \]

where x j represents the measurements of the j-th object and x̄(i) is the mean of the cluster Q i.

To solve the problem of cluster analysis, it is necessary to define the concept of similarity and heterogeneity.

It is clear that objects i and j should fall into one cluster when the distance between the points X i and X j is small enough, and into different clusters when this distance is large enough. Thus, whether objects land in one cluster or in different clusters is determined by the notion of distance between X i and X j in E p, where E p is the p-dimensional Euclidean space. A non-negative function d(X i, X j) is called a distance function (metric) if:

a) d(X i, X j) ≥ 0 for all X i and X j in E p;

b) d(X i, X j) = 0 if and only if X i = X j;

c) d(X i, X j) = d(X j, X i);

d) d(X i, X j) ≤ d(X i, X k) + d(X k, X j), where X j, X i and X k are any three vectors in E p.

The value d(X i, X j) is called the distance between X i and X j and is equivalent to the distance between G i and G j according to the selected characteristics (F1, F2, F3, …, Fp).

The most commonly used distance functions are:

1. Euclidean distance:

\[ d_2(X_i, X_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \]

2. l1-norm:

\[ d_1(X_i, X_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}| \]

3. Supremum norm:

\[ d_\infty(X_i, X_j) = \sup_{k = 1, \dots, p} |x_{ik} - x_{jk}| \]

4. lp-norm:

\[ d_p(X_i, X_j) = \Big( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^p \Big)^{1/p} \]

The Euclidean metric is the most popular. The l1 metric is the easiest to compute. The supremum norm is easy to compute and involves an ordering procedure, while the lp-norm generalizes distance functions 1, 2 and 3.
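A short Python sketch of these four distance functions (the sample vectors are illustrative):

```python
import math

def d_euclidean(x, y):
    """Euclidean distance (function 1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_l1(x, y):
    """l1-norm, the sum of absolute coordinate differences (function 2)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def d_sup(x, y):
    """Supremum norm, the largest coordinate difference (function 3)."""
    return max(abs(a - b) for a, b in zip(x, y))

def d_lp(x, y, p):
    """lp-norm, which generalizes the previous three (function 4)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2, 3), (4, 6, 3)
print(d_euclidean(x, y))  # 5.0
print(d_l1(x, y))         # 7
print(d_sup(x, y))        # 4
print(d_lp(x, y, 2))      # 5.0, coincides with the Euclidean distance
```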

Let the n measurements X1, X2, …, Xn be presented in the form of a data matrix of size p × n:

\[ X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & \vdots & & \vdots \\ x_{p1} & x_{p2} & \cdots & x_{pn} \end{pmatrix} \]

Then the distances between the pairs of vectors d(X i, X j) can be represented as a symmetric distance matrix:

\[ D = \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix} \]

The concept opposite to distance is the concept of similarity between objects G i and G j. A non-negative real function S(X i, X j) = S ij is called a similarity measure if:

1) 0 ≤ S(X i, X j) < 1 for X i ≠ X j;

2) S(X i, X i) = 1;

3) S(X i, X j) = S(X j, X i).

Pairs of similarity values can be combined into a similarity matrix:

\[ S = \begin{pmatrix} 1 & S_{12} & \cdots & S_{1n} \\ S_{21} & 1 & \cdots & S_{2n} \\ \vdots & \vdots & & \vdots \\ S_{n1} & S_{n2} & \cdots & 1 \end{pmatrix} \]

The value S ij is called the coefficient of similarity.

2. Clustering methods

Today there are many methods of cluster analysis. Let us dwell on some of them (the methods given below are usually called the methods of minimum variance).

Let X be the observation matrix: X = (X1, X2, …, Xn), and let the square of the Euclidean distance between X i and X j be determined by the formula:

\[ d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \]

1) Complete linkage (full connection) method.

The essence of this method is that two objects belonging to the same group (cluster) have a similarity coefficient no smaller than a certain threshold value S. In terms of the Euclidean distance d, this means that the distance between two points (objects) of a cluster must not exceed a certain threshold value h. Thus h defines the maximum allowable diameter of a subset forming a cluster.

2) Method of maximum local distance.

Each object is considered a one-point cluster. Objects are grouped according to the following rule: two clusters are combined if the maximum distance between the points of one cluster and the points of the other is minimal. The procedure consists of n − 1 steps and results in partitions that coincide, for any threshold, with all possible partitions of the previous method.

3) Ward's method.

In this method, the intragroup sum of squared deviations is used as an objective function, which is nothing more than the sum of the squared distances between each point (object) and the average for the cluster containing this object. At each step, two clusters are combined that lead to the minimum increase in the objective function, i.e. intragroup sum of squares. This method is aimed at combining closely spaced clusters.

4) Centroid method.

The distance between two clusters is defined as the Euclidean distance between the centers (averages) of these clusters:

d²ij = (X̄ − Ȳ)ᵀ(X̄ − Ȳ). Clustering proceeds in stages: at each of the n − 1 steps, the two clusters G and p with the minimum value of d²ij are united. If n1 is much greater than n2, then the center of the merged cluster lies close to the center of the first cluster, and the characteristics of the second cluster are practically ignored when the clusters are merged. This method is sometimes also called the method of weighted groups.
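The methods listed above have standard implementations. A sketch using scipy, whose method names 'complete', 'ward' and 'centroid' correspond to the maximum-distance rule, Ward's method and the centroid method respectively (the observation matrix is an illustrative toy example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy observation matrix: 6 objects with 2 features each
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [4.0, 4.2], [4.1, 3.9], [3.8, 4.1]])

# 'complete' merges by the smallest maximum inter-point distance,
# 'ward' minimizes the increase in the intragroup sum of squares,
# 'centroid' uses the distance between cluster centers.
for method in ("complete", "ward", "centroid"):
    Z = linkage(X, method=method)                    # n-1 merge steps
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
```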

3. Sequential clustering algorithm

Consider Ι = (Ι1, Ι2, …, Ιn) as a set of n single-element clusters (Ι1), (Ι2), …, (Ιn). Let us choose two of them, say Ι i and Ι j, which are in some sense closer to each other, and combine them into one cluster. The new set of clusters, now consisting of n − 1 clusters, will be:

(Ι1), (Ι2), …, (Ι i, Ι j), …, (Ιn).

Repeating the process, we obtain successive sets of clusters consisting of n − 2, n − 3, n − 4, etc. clusters. At the end of the procedure we obtain a single cluster consisting of all n objects and coinciding with the original set Ι = (Ι1, Ι2, …, Ιn).

As a measure of distance we take the square of the Euclidean metric d ij². Calculate the matrix D = (d ij²), where d ij² is the square of the distance between Ι i and Ι j:

\[ D = \begin{pmatrix} 0 & d_{12}^2 & d_{13}^2 & \cdots & d_{1n}^2 \\ & 0 & d_{23}^2 & \cdots & d_{2n}^2 \\ & & 0 & \cdots & d_{3n}^2 \\ & & & \ddots & \vdots \\ & & & & 0 \end{pmatrix} \]

Let the distance between Ι i and Ι j be minimal:

d ij² = min(d ij², i ≠ j). From Ι i and Ι j we form a new cluster (Ι i, Ι j) and build a new ((n − 1) × (n − 1)) distance matrix:

\[ D' = \begin{pmatrix} 0 & d_{(ij)1}^2 & d_{(ij)2}^2 & \cdots & d_{(ij)n}^2 \\ & 0 & d_{12}^2 & \cdots & d_{1n}^2 \\ & & 0 & \cdots & d_{2n}^2 \\ & & & \ddots & \vdots \\ & & & & 0 \end{pmatrix} \]

Here the first row contains the distances from the new cluster (Ι i, Ι j) to all the remaining clusters.

The (n − 2) remaining rows of the new matrix are taken from the previous one, and the first row is recomputed. Computations can be kept to a minimum if the distances d²(ij)k, k = 1, 2, …, n (k ≠ i, k ≠ j), can be expressed through the elements of the original matrix.

Initially, the distance was determined only between single-element clusters, but it is also necessary to determine the distances between clusters containing more than one element. This can be done in various ways, and depending on the chosen method, we get cluster analysis algorithms with different properties. One can, for example, put the distance between the cluster i + j and some other cluster k, equal to the arithmetic mean of the distances between clusters i and k and clusters j and k:

d i+j,k = ½ (d i k + d j k).

But one can also define d i+j,k as the minimum of these two distances:

d i+j,k = min(d i k, d j k).

Thus, the first step of the agglomerative hierarchical algorithm operation is described. The next steps are the same.

A fairly wide class of algorithms can be obtained if the following general formula is used to recalculate distances:

d i+j,k = A(w) min(d ik, d jk) + B(w) max(d ik, d jk), where

A(w) = …, if d ik ≤ d jk

A(w) = …, if d ik > d jk

B(w) = …, if d ik ≤ d jk

B(w) = …, if d ik > d jk

Here n i and n j are the numbers of elements in clusters i and j, and w is a free parameter whose choice determines a particular algorithm. For example, for w = 1 we obtain the so-called "average linkage" algorithm, for which the distance recalculation formula takes the form:

\[ d_{i+j,k} = \frac{n_i\, d_{ik} + n_j\, d_{jk}}{n_i + n_j} \]

In this case, the distance between two clusters at each step of the algorithm turns out to be equal to the arithmetic mean of the distances between all pairs of elements such that one element of the pair belongs to one cluster, the other to another.

The intuitive meaning of the parameter w becomes clear if we let w → ∞. The distance recalculation formula then takes the form:

d i+j,k = min(d ik, d jk)

This is the so-called "nearest neighbor" algorithm, which makes it possible to select clusters of arbitrarily complex shape, provided that the different parts of such clusters are connected by chains of elements close to each other. In this case, the distance between two clusters at each step of the algorithm equals the distance between the two closest elements belonging to these two clusters.
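A minimal pure-Python sketch of this "nearest neighbor" (single linkage) procedure, using the min-rule distance recalculation from above; the distance matrix is illustrative:

```python
def single_linkage(dist, n_clusters):
    """Agglomerative "nearest neighbor" clustering on a distance matrix
    (list of lists). At every step the two closest clusters are merged,
    and d(i+j, k) = min(d(i, k), d(j, k)) for every other cluster k."""
    clusters = [[i] for i in range(len(dist))]
    d = [row[:] for row in dist]
    while len(clusters) > n_clusters:
        m = len(clusters)
        # find the pair (a, b), a < b, with the minimum distance
        a, b = min(((i, j) for i in range(m) for j in range(i + 1, m)),
                   key=lambda ij: d[ij[0]][ij[1]])
        # recalculate distances from the merged cluster with the "min" rule
        for k in range(m):
            d[a][k] = d[k][a] = min(d[a][k], d[b][k])
        d[a][a] = 0.0
        clusters[a].extend(clusters[b])
        # drop the row and column of cluster b
        del clusters[b], d[b]
        for row in d:
            del row[b]
    return clusters

# Illustrative squared-distance matrix for 4 objects
D = [[0, 1, 9, 16],
     [1, 0, 4, 9],
     [9, 4, 0, 1],
     [16, 9, 1, 0]]
print(single_linkage(D, 2))  # [[0, 1], [2, 3]]
```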

Quite often it is assumed that the initial distances (differences) between the grouped elements are given. In some cases this is true. However, often only the objects and their characteristics are specified, and the distance matrix is built from these data. Depending on whether distances between objects or between characteristics of objects are calculated, different methods are used.

In the case of cluster analysis of objects, the most common measure of difference is either the square of the Euclidean distance

\[ d_{ij}^2 = \sum_{h=1}^{m} (x_{ih} - x_{jh})^2 \]

(where x ih and x jh are the values of the h-th feature for the i-th and j-th objects, and m is the number of features), or the Euclidean distance itself. If the features are assigned different weights w h, these weights can be taken into account when calculating the distance:

\[ d_{ij}^2 = \sum_{h=1}^{m} w_h (x_{ih} - x_{jh})^2 \]

Sometimes the distance calculated by the formula

\[ d_{ij} = \sum_{h=1}^{m} |x_{ih} - x_{jh}| \]

is used as a measure of difference; it is called the "Hamming", "Manhattan" or "city-block" distance.

A natural measure of the similarity between the characteristics (features) of objects in many problems is the correlation coefficient between them:

\[ r_{ij} = \frac{\sum_{h=1}^{m} (x_{ih} - m_i)(x_{jh} - m_j)}{m\, d_i\, d_j} \]

where m i, m j, d i, d j are, respectively, the means and standard deviations of features i and j. A measure of the difference between features can be the value 1 − r. In some problems, the sign of the correlation coefficient is insignificant and depends only on the choice of the unit of measure; in this case 1 − |r ij| is used as the measure of difference between the features.

4. Number of clusters

A very important issue is the problem of choosing the required number of clusters. Sometimes the number of clusters m can be chosen a priori. However, in the general case this number is determined in the process of splitting the set into clusters.

Fortier and Solomon investigated this question and found that the number of clusters must be chosen so as to achieve a probability α of finding the best partition. Thus, the optimal number of partitions is a function of the given fraction β of the best, or in some sense admissible, partitions in the set of all possible ones. The total scattering will be the greater, the higher the fraction β of allowable partitions. Fortier and Solomon developed a table from which one can find the needed number of partitions S(α, β) depending on α and β (where α is the probability that the best partition is found, and β is the share of the best partitions in the total number of partitions). Moreover, not a scattering measure but the membership measure introduced by Holzinger and Harman is used as the measure of heterogeneity. The table of values of S(α, β) is given below.

Table of values of S(α, β)

| β \ α  | 0.20  | 0.10  | 0.05  | 0.01  | 0.001 | 0.0001 |
|--------|-------|-------|-------|-------|-------|--------|
| 0.20   | 8     | 11    | 14    | 21    | 31    | 42     |
| 0.10   | 16    | 22    | 29    | 44    | 66    | 88     |
| 0.05   | 32    | 45    | 59    | 90    | 135   | 180    |
| 0.01   | 161   | 230   | 299   | 459   | 689   | 918    |
| 0.001  | 1626  | 2326  | 3026  | 4652  | 6977  | 9303   |
| 0.0001 | 17475 | 25000 | 32526 | 55000 | 75000 | 100000 |

Quite often, the criterion for merging (and hence the number of clusters) is the change in an appropriate function, for example, the sum of squared deviations:

\[ E = \sum_{j=1}^{m} \sum_{x \in Q_j} \lVert x - \bar{x}^{(j)} \rVert^2 \]

The grouping process must correspond here to a sequentially minimal increase in the value of the criterion E. A sharp jump in the value of E can be interpreted as a characteristic of the number of clusters that objectively exist in the population under study.

So, the second way to determine the best number of clusters is to identify the jumps determined by the phase transition from a strongly coupled to a weakly coupled state of the objects.
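A sketch of this "jump" heuristic with scikit-learn, whose inertia_ attribute of a fitted KMeans model is exactly the within-cluster sum of squared deviations E; the data are synthetic blobs invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated groups in the plane
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in [(0, 0), (5, 0), (0, 5)]])

# E(k): within-cluster sum of squared deviations for k = 1..6
for k in range(1, 7):
    E = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(E, 1))
# E falls sharply up to k = 3 and barely decreases afterwards:
# the jump marks the number of clusters present in the data.
```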

5. Dendrograms

The best known method of representing a distance or similarity matrix is based on the idea of a dendrogram, or tree diagram. A dendrogram can be defined as a graphical representation of the results of a sequential clustering process carried out in terms of the distance matrix. With the help of a dendrogram, the clustering procedure can be depicted graphically or geometrically, provided that the procedure operates only with elements of the distance or similarity matrix.

There are many ways to construct dendrograms. In a dendrogram, the objects are located vertically on the left, and the clustering results on the right. Distance or similarity values corresponding to the structure of new clusters are plotted along a horizontal straight line above the dendrogram.

Fig. 1

Figure 1 shows one example of a dendrogram. It corresponds to the case of six objects (n = 6) and k characteristics (features). Objects A and C are the closest and are therefore combined into one cluster at a proximity level of 0.9. Objects D and E are combined at the level of 0.8. Now we have 4 clusters:

(A, C), (F), (D, E), (B).

Next, the clusters (A, C, F) and (E, D, B) are formed, corresponding to proximity levels of 0.7 and 0.6. Finally, all objects are grouped into one cluster at a level of 0.5.

The type of dendrogram depends on the choice of the similarity measure or the distance between an object and a cluster, and on the clustering method. The most important point is the choice of the measure of similarity or the measure of distance between an object and a cluster.

The number of cluster analysis algorithms is very large. All of them can be divided into hierarchical and non-hierarchical.

Hierarchical algorithms are associated with the construction of dendrograms (see the sketch below) and are divided into:

a) agglomerative, characterized by the successive merging of the initial elements and a corresponding decrease in the number of clusters;

b) divisive, in which the number of clusters increases starting from one, producing a sequence of splitting groups.
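For illustration, a dendrogram like the one in Figure 1 can be produced with scipy and matplotlib. The six points below are invented stand-ins labeled A to F, since the data behind Figure 1 is not given:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Six objects with two features each (illustrative values only)
X = np.array([[1.0, 1.0], [3.0, 3.0], [1.1, 0.9],
              [2.6, 4.0], [2.5, 4.1], [1.4, 1.2]])

Z = linkage(X, method="average")      # agglomerative merge history
dendrogram(Z, labels=list("ABCDEF"))  # objects on one axis, merge levels on the other
plt.show()
```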

Cluster analysis algorithms today have good software implementations, which make it possible to solve problems of very high dimensionality.

6. Data

Cluster analysis can be applied to interval data, frequencies, binary data. It is important that the variables change on comparable scales.

The heterogeneity of units of measurement, and the resulting impossibility of reasonably expressing the values of various indicators on the same scale, leads to the distance between points, which reflects the position of objects in the space of their properties, depending on an arbitrarily chosen scale. To eliminate this heterogeneity of the initial data, all their values are preliminarily normalized, i.e., expressed through the ratio of these values to some quantity reflecting certain properties of the given indicator. Normalization of initial data for cluster analysis is sometimes carried out by dividing the initial values by the standard deviation of the corresponding indicator. Another way is to calculate the so-called standardized score, also known as the Z-score.

The Z-score shows how many standard deviations a given observation lies from the mean:

\[ z_i = \frac{x_i - \bar{x}}{S} \]

where x i is the value of the observation, x̄ is the mean, and S is the standard deviation.

The mean of the Z-scores is zero and their standard deviation is 1.

Standardization allows comparison of observations from different distributions. If the distribution of a variable is normal (or close to normal) and the mean and variance are known or estimated from large samples, then the Z-score of an observation provides more specific information about its location.
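A minimal Z-score sketch in Python (the income values are illustrative):

```python
import numpy as np

def z_score(x):
    """Standardize: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

incomes = np.array([90000, 62000, 75000, 110000, 45000])
z = z_score(incomes)
print(z.round(2))
# Mean 0 and standard deviation 1, as stated above
print(round(float(z.mean()), 2), round(float(z.std()), 2))
```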

Note that normalization methods imply recognizing all features as equivalent from the point of view of elucidating the similarity of the objects under consideration. It has already been noted that, in relation to the economy, recognizing the equivalence of various indicators does not always seem justified. It would be desirable, along with normalization, to give each indicator a weight reflecting its significance in establishing similarities and differences between objects.

In this situation, one has to resort to a method of determining the weights of individual indicators - a survey of experts. For example, when solving the problem of classifying countries by level of economic development, the results of a survey of 40 leading Moscow experts on the problems of developed countries, on a ten-point scale, were used:

generalized indicators of socio-economic development - 9 points;

indicators of sectoral distribution of the employed population - 7 points;

indicators of the prevalence of hired labor - 6 points;

indicators characterizing the human element of the productive forces - 6 points;

indicators of the development of material productive forces - 8 points;

indicator of public spending - 4 points;

"military-economic" indicators - 3 points;

socio-demographic indicators - 4 points.

The experts' estimates were relatively stable.

Expert assessments provide a well-known basis for determining the importance of the indicators included in a particular group. Multiplying the normalized values of the indicators by coefficients corresponding to the average scores makes it possible to calculate the distances between points reflecting the positions of countries in multidimensional space, taking into account the unequal weights of their features.

Quite often, when solving such problems, not one but two calculations are used: the first, in which all features are considered equivalent, and the second, in which they are given different weights in accordance with the average values of the expert estimates.
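A sketch of the weighted variant in Python: each normalized indicator is multiplied by its average expert score before the Euclidean distance is taken. The indicator values are illustrative, and the weights 9, 7 and 6 echo the expert scores listed above:

```python
import numpy as np

def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each normalized
    indicator by its expert weight."""
    a, b, w = (np.asarray(v, dtype=float) for v in (a, b, weights))
    return float(np.sqrt(np.sum((w * a - w * b) ** 2)))

# Two countries described by 3 normalized indicators (illustrative)
country_a = [0.8, 0.4, 0.6]
country_b = [0.5, 0.7, 0.2]
weights = [9, 7, 6]  # average expert scores used as weights

print(weighted_distance(country_a, country_b, weights))
print(weighted_distance(country_a, country_b, [1, 1, 1]))  # unweighted variant
```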

7. Application of cluster analysis

Let's consider some applications of cluster analysis.

1. The division of countries into groups according to the level of development.

65 countries were studied according to 31 indicators (national income per capita, the share of the population employed in industry (%), savings per capita, the share of the population employed in agriculture (%), average life expectancy, the number of cars per 1,000 inhabitants, the number of armed forces per 1 million inhabitants, the share of GDP from industry (%), the share of GDP from agriculture (%), etc.).

Each of the countries acts in this consideration as an object characterized by certain values of 31 indicators. Accordingly, they can be represented as points in a 31-dimensional space. Such a space is usually called the property space of the objects under study. Comparison of the distances between these points reflects the degree of proximity of the countries under consideration, their similarity to each other. The socio-economic meaning of this understanding of similarity is that countries are considered the more similar, the smaller the differences between the corresponding indicators by which they are described.

The first step of such an analysis is to identify the pair of national economies included in the similarity matrix, the distance between which is the smallest. These will obviously be the most similar, similar economies. In the following consideration, both of these countries are considered a single group, a single cluster. Accordingly, the original matrix is ​​transformed so that its elements are the distances between all possible pairs of not 65, but 64 objects - 63 economies and a newly transformed cluster - a conditional union of the two most similar countries. Rows and columns corresponding to the distances from a pair of countries included in the union to all the others are discarded from the original similarity matrix, but a row and column containing the distance between the cluster obtained during the union and other countries are added.

The distance between the newly obtained cluster and the countries is assumed to be equal to the average of the distances between the latter and the two countries that make up the new cluster. In other words, the combined group of countries is considered as a whole with characteristics approximately equal to the average of the characteristics of its constituent countries.

The second step of the analysis is to consider the matrix transformed in this way, now with 64 rows and columns. Again, the pair of economies with the smallest distance between them is identified, and they are merged just as in the first case. Here, the smallest distance can occur either between a pair of countries or between a country and the union of countries obtained at the previous stage.

Further procedures are similar to those described above: at each stage, the matrix is transformed so that the two columns and two rows containing the distances to the objects (pairs of countries or clusters) brought together at the previous stage are excluded from it; the excluded rows and columns are replaced by a column and a row containing the distances from the new union to the remaining objects; then, in the modified matrix, the pair of closest objects is identified. The analysis continues until the matrix is completely exhausted (i.e., until all countries are brought together).

The generalized results of the analysis can be represented in the form of a similarity tree (dendrogram), similar to the one described above, with the only difference that the tree reflecting the relative proximity of all 65 countries under consideration is much more complicated than a scheme involving only five national economies. This tree includes 65 levels, matching the number of objects. The first (lowest) level contains the points corresponding to each country separately. The connection of two points at the second level shows the pair of countries that are closest in terms of the general type of their national economies. At the third level, the next most similar pair is noted (as already mentioned, this may be either a new pair of countries or a new country joining an already identified pair of similar countries). And so on up to the last level, at which all the studied countries act as a single set.

As a result of applying cluster analysis, the following five groups of countries were obtained:

the Afro-Asian group;

the Latin-Asian group;

the Latin-Mediterranean group;

the group of developed capitalist countries (without the USA);

the USA.

The introduction of new indicators beyond the 31 indicators used here, or their replacement by others, naturally leads to a change in the results of the country classification.

2. The division of countries according to the criterion of proximity of culture.

As you know, marketing must take into account the culture of countries (customs, traditions, etc.).

The following groups of countries were obtained through clustering:

· Arabic;

· Middle Eastern;

· Scandinavian;

· German-speaking;

· English-speaking;

· Romance European;

· Latin American;

· Far Eastern.

3. Development of a zinc market forecast.

Cluster analysis plays an important role at the stage of reduction of the economic-mathematical model of the commodity conjuncture, contributing to the facilitation and simplification of computational procedures, ensuring greater compactness of the results obtained while maintaining the required accuracy. The use of cluster analysis makes it possible to divide the entire initial set of market indicators into groups (clusters) according to the relevant criteria, thereby facilitating the selection of the most representative indicators.

Cluster analysis is widely used to model market conditions. In practice, the majority of forecasting tasks are based on the use of cluster analysis.

Consider, for example, the task of developing a forecast for the zinc market.

Initially, 30 key indicators of the global zinc market were selected:

X 1 - time

Production figures:

X 2 - in the world

X 4 - Europe

X 5 - Canada

X 6 - Japan

X 7 - Australia

Consumption indicators:

X 8 - in the world

X 10 - Europe

X 11 - Canada

X 12 - Japan

X 13 - Australia

Producer stocks of zinc:

X 14 - in the world

X 16 - Europe

X 17 - other countries

Consumer stocks of zinc:

X 18 - in the USA

X 19 - in England

X 20 - in Japan

Import of zinc ores and concentrates (thousand tons)

X 21 - in the USA

X 22 - in Japan

X 23 - in Germany

Export of zinc ores and concentrates (thousand tons)

X 24 - from Canada

X 25 - from Australia

Import of zinc (thousand tons)

X 26 - in the USA

X 27 - to England

X 28 - in Germany

Export of zinc (thousand tons)

X 29 - from Canada

X 30 - from Australia

To determine specific dependencies, the apparatus of correlation and regression analysis was used. Relationships were analyzed on the basis of a matrix of paired correlation coefficients, under the accepted hypothesis that the analyzed market indicators are normally distributed. Clearly, r ij is not the only possible measure of the relationship between the indicators. The need for cluster analysis in this problem stems from the fact that the number of indicators affecting the price of zinc is very large. There is a need to reduce them for a number of reasons:

a) lack of complete statistical data for all variables;

b) a sharp complication of computational procedures when a large number of variables are introduced into the model;

c) the optimal use of regression analysis methods requires that the number of observed values exceed the number of variables by at least 6 to 8 times;

d) the desire to use statistically independent variables in the model, etc.

It is very difficult to carry out such an analysis directly on a relatively bulky matrix of correlation coefficients. With the help of cluster analysis, the entire set of market variables can be divided into groups in such a way that the elements of each cluster are strongly correlated with each other, and representatives of different groups are characterized by a weak correlation.
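A sketch of this idea in Python, using 1 − |r| as the dissimilarity between indicators, so that strongly correlated variables land in the same group. The synthetic series below stand in for the market indicators, and scipy's average linkage is used here in place of the article's sum-of-squares criterion:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Stand-ins for the market indicators: 12 time series of 50 observations,
# constructed so that they form three strongly correlated groups.
base = rng.normal(size=(3, 50))
X = np.vstack([base[i % 3] + 0.3 * rng.normal(size=50) for i in range(12)])

# Dissimilarity between variables: 1 - |r|, so indicators with a strong
# positive or negative pair correlation end up close to each other.
R = np.corrcoef(X)
D = 1 - np.abs(R)
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=3, criterion="maxclust"))
# Indicators built on the same underlying series receive the same label.
```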

To solve this problem, one of the agglomerative hierarchical cluster analysis algorithms was applied. At each step, the number of clusters is reduced by one through the optimal, in a certain sense, merging of two groups. The criterion for merging is the change in the corresponding function. The values of the sums of squared deviations, calculated by the following formulas, were used as such a function:

(j = 1, 2, …, m),

where j is the cluster number, n is the number of elements in the cluster, and r ij is the pair correlation coefficient.

Thus, the grouping process must correspond to a sequential minimum increase in the value of the criterion E.

At the first stage, the initial data array is presented as a set of clusters each containing a single element. The grouping process begins with the merging of the pair of clusters that leads to the minimum increase in the sum of squared deviations; this requires estimating the sum of squared deviations for each possible merger. At the next stage, the sums of squared deviations are considered for the resulting clusters, and so on.

This process is stopped at some step. To find it, one monitors the value of the sum of squared deviations: considering the sequence of its increasing values, one can catch a jump (one or more) in its dynamics, which can be interpreted as a characteristic of the number of groups "objectively" existing in the studied population. In the example above, jumps occurred when the number of clusters was 7 and 5. The number of groups should not be reduced further, because that lowers the quality of the model.

After the clusters are obtained, the variables that are most important in the economic sense and most closely related to the chosen market criterion (in this case, the London Metal Exchange zinc quotations) are selected. This approach makes it possible to preserve a significant part of the information contained in the original set of initial market indicators.
