📚 Clustering sources

This new filter groups all the sources in the file (as currently filtered) into clusters of sources which are similar in terms of the maps they tend to produce.

Filtered clusters

The cluster sources filter clusters only the sources that remain in the map at the point where the filter is applied, that is, after any earlier filters have taken effect.

To use it, add a filter like this to the advanced editor:

cluster sources n_clusters=4


Here you can see the new field appearing in the Sources table.


You can also ask for cluster sets with 2, 3 and 4 clusters all at once:

cluster sources n_clusters=all

These are three attempts to find interesting sets of clusters - one set with only two clusters, one set with three and one set with four.

You should pick the set which you find most interesting and concentrate on that.

The clusters themselves get automatic names: a, b, c, etc. So the set #cluster_set_2 has just two clusters, called a and b. The names don’t mean anything in particular, but you should explore each cluster and perhaps find your own name for it, such as “mainly younger farmers” or “mainly older urban residents”, or even something more imaginative like “stars” or “stragglers”.

The most interesting output will be in the Differences - sample subtab of the Reports tab, where you should see differences in how members of the different clusters responded to different questions. That is the main point of the clusters.

After the initial tables for each important field (those beginning with #), the Sample subtab of the Reports tab now has another, longer section with tables of each important field cross-tabulated against the others. If you scroll down to the sections for #cluster_set_2, #cluster_set_3 etc., you will be able to see whether these sets of clusters are easy to interpret in terms of the other fields. Ideally you’ll be able to understand the clusters as, for example, “mainly younger farmers”.
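As a rough illustration of what such a cross-tabulation contains, here is a minimal Python sketch. It is not the app's own code: the source records and the field #age_group are invented for the example, and only the field name #cluster_set_2 comes from the text above.

```python
from collections import Counter

# Hypothetical source records: a cluster letter plus one other "#" field.
sources = [
    {"id": "s1", "#cluster_set_2": "a", "#age_group": "younger"},
    {"id": "s2", "#cluster_set_2": "a", "#age_group": "younger"},
    {"id": "s3", "#cluster_set_2": "a", "#age_group": "older"},
    {"id": "s4", "#cluster_set_2": "b", "#age_group": "older"},
    {"id": "s5", "#cluster_set_2": "b", "#age_group": "older"},
]

def crosstab(rows, row_field, col_field):
    """Count how many sources fall into each (row value, column value) cell."""
    return Counter((r[row_field], r[col_field]) for r in rows)

counts = crosstab(sources, "#cluster_set_2", "#age_group")

# Print one line per cluster, with a count for each value of the other field.
for cluster in sorted({c for c, _ in counts}):
    cells = {v: n for (c, v), n in counts.items() if c == cluster}
    print(cluster, cells)
```

In this made-up example, cluster a is mostly "younger" and cluster b is entirely "older", which is the kind of pattern that makes a cluster easy to name.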

It is always trivially possible to find at least one cluster set which differs in one link (i.e. in one simple link bundle) – though whether these differences are significant depends on the numbers involved. It is only interesting when we find common differences between clusters across several different links simultaneously, i.e. the people in cluster A mentioned this and that and this other thing.

In future we will provide output to automatically crosstabulate the clusters with other key fields like gender and district, and to automatically produce maps for each cluster, surprise maps, differences tables, etc., but you can already do this manually.

Here the Sources table has been set up to list the sources according to cluster:


The clustering algorithm

The algorithm is quite simple: it treats the links table for each source as a one-dimensional binary vector whose length is the square of the number of factors, i.e. one entry for each possible pair of cause and effect. Because the vector is binary, multiple mentions of the same bundle of links by the same source count the same as a single mention.
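The encoding described above can be sketched in Python as follows. This is an illustration under assumptions, not the app's actual implementation: the function name source_vector, the factor names and the example links are all invented.

```python
def source_vector(links, factors):
    """Flatten one source's links into a binary vector of length len(factors) ** 2.

    `links` is a list of (cause, effect) pairs mentioned by the source.
    """
    index = {name: i for i, name in enumerate(factors)}
    n = len(factors)
    vec = [0] * (n * n)
    for cause, effect in links:
        # One slot per ordered (cause, effect) pair; repeats still give 1.
        vec[index[cause] * n + index[effect]] = 1
    return vec

# Hypothetical factors and links for two sources.
factors = ["rain", "harvest", "income"]
links_by_source = {
    "s1": [("rain", "harvest"), ("rain", "harvest"), ("harvest", "income")],
    "s2": [("harvest", "income")],
}

vectors = {s: source_vector(links, factors) for s, links in links_by_source.items()}
print(vectors["s1"])  # [0, 1, 0, 0, 0, 1, 0, 0, 0]
print(vectors["s2"])  # [0, 0, 0, 0, 0, 1, 0, 0, 0]
```

Source s1 mentions the link from rain to harvest twice, but the corresponding entry is still just 1.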

Finding many solutions at once

You can ask for a specific number of clusters:

cluster sources n_clusters=5

But how are you going to know if you will get a more interesting set if you ask for five rather than four or three or seven clusters?

This filter with n_clusters=all finds three separate solutions, with two, three and four clusters, all at once. It’s your job to compare them and decide which solution (if any) is best and easiest to interpret.

There is also a shortcut button to add this line to your current filters.
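To give a feel for what producing solutions with two, three and four clusters involves, here is a small Python sketch. This page does not say which clustering method the app uses, so the method here (average-linkage agglomerative clustering on Hamming distance) is a stand-in chosen only for illustration, and the vectors are invented. Only the a, b, c naming and the #cluster_set_N labels follow the text above.

```python
from itertools import combinations
from string import ascii_lowercase

def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def cluster(vectors, n_clusters):
    """Stand-in method (an assumption): merge the two closest groups
    until n_clusters groups remain. Returns {source: cluster letter}."""
    groups = [[name] for name in sorted(vectors)]

    def avg_dist(pair):
        g, h = pair
        total = sum(hamming(vectors[a], vectors[b]) for a in g for b in h)
        return total / (len(g) * len(h))

    while len(groups) > n_clusters:
        g, h = min(combinations(groups, 2), key=avg_dist)
        groups.remove(g)
        groups.remove(h)
        groups.append(sorted(g + h))

    groups.sort()
    return {name: ascii_lowercase[i] for i, grp in enumerate(groups) for name in grp}

# Hypothetical binary link vectors for five sources.
vectors = {
    "s1": [1, 1, 0, 0],
    "s2": [1, 1, 0, 0],
    "s3": [0, 0, 1, 1],
    "s4": [0, 0, 1, 1],
    "s5": [1, 0, 0, 0],
}

# One solution per requested number of clusters, as with n_clusters=all.
solutions = {f"#cluster_set_{k}": cluster(vectors, k) for k in (2, 3, 4)}
for name, labels in solutions.items():
    print(name, labels)
```

In this toy data, the two-cluster solution puts s5 together with s1 and s2, while the three-cluster solution gives s5 a cluster of its own. Deciding which grouping is more useful is the judgment call described above.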