Cluster analysis and segmentation

Updated at July 25th, 2022

Here’s how to compute candidate market segmentations using R and profile them in Protobi.


The StackOverflow Developer Survey asked software developers what percent of time they spend on various tasks. Results are shown below:

The most frequent task (31%) is "Developing new features",  and the least frequent task is "Looking for a new job" (0.8%).  But these are averages, hiding a lot of individual diversity. Can we group developers into different camps based on how they spend their time?

K-means segmentation

There are many good ways to develop candidate segmentations. One common method is K-means clustering.

Here we instead use Latent Class Analysis (LCA), via the poLCA library in R, to derive candidate segmentations and attach predicted segment membership back to the original datafile for evaluation. (An analogous process can be done using LatentGold, QUICK CLUSTER in SPSS or FASTCLUS in SAS.)

The basic steps are:

  1. Import data
  2. Recode basis variables
  3. Segment respondents into various numbers of clusters
  4. Merge back predicted segment membership
  5. Export data

Step 1: Load the relevant packages and read the input dataset. You can find the data here in SAV format at 2013_StackOverflowRecoded.sav and 2013_StackOverflowRecoded.csv.

Step 2: A quirk of poLCA is that the variables used as the segmentation basis must be coded as a sequence of integers starting with 1 (i.e. 1, 2, 3, ...). Here the values are already coded as a sequence of integers, but starting at 0, so we increment them by 1 using the recode method in the car library.
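Because the original codes are consecutive integers starting at 0, adding 1 to each value gives the same result as the recode string. The following is a minimal sketch in base R, not the article's method. The data frame `toy` and its values are made up for illustration; in the real data the columns would be q14_1 through q14_9.

```r
# Hypothetical stand-in for the survey data: two basis variables coded 0..5
toy <- data.frame(q14_1 = c(0, 3, 5, NA), q14_2 = c(1, 0, 4, 2))

basis   <- c("q14_1", "q14_2")       # in the article: q14_1 ... q14_9
shifted <- sub("^q", "rs", basis)    # new column names: rs14_1, rs14_2

# Adding 1 maps 0..5 to 1..6; missing values stay NA
toy[shifted] <- lapply(toy[basis], function(x) as.integer(x) + 1L)

print(toy)
```

Looping over the column names avoids repeating the same line nine times, at the cost of being less explicit than one recode call per variable.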

Step 3: Run cluster analyses to create solutions with 2, 3, 4, 5 and 6 clusters, respectively. The poLCA algorithm treats all basis variables as categorical, not ordinal or continuous. There’s an inherent ordinality in our coding, which it thus can’t recognize, whereas the more sophisticated algorithms in LatentGold from Statistical Innovations can.

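Note that we set na.rm=FALSE, which means that respondents with missing values on some basis variables are retained in the estimation rather than dropped. (With na.rm=TRUE, the poLCA default, those respondents would be removed before the model is fitted.)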
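To gauge how much the missing-value setting matters, it can help to count how many respondents have at least one missing basis variable. The sketch below uses base R only; the data frame `toy` is a made-up stand-in for the recoded basis columns, not the survey data.

```r
# Hypothetical stand-in for the recoded basis variables (NA = question skipped)
toy <- data.frame(
  rs14_1 = c(1, 4, NA, 2, 6),
  rs14_2 = c(2, NA, NA, 3, 1),
  rs14_3 = c(5, 1, 2, 2, 3)
)

complete <- complete.cases(toy)   # TRUE where no basis variable is missing

n_total    <- nrow(toy)
n_complete <- sum(complete)       # rows that would survive listwise deletion
n_partial  <- n_total - n_complete

cat("Respondents:", n_total,
    " complete:", n_complete,
    " with missing values:", n_partial, "\n")
```

If the number of partially complete respondents is small, the choice of na.rm is unlikely to change the segments much; if it is large, the choice deserves more thought.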

Step 4: Finally, we merge the predicted class memberships from each solution back to the main data frame.

Step 5: Export as a new CSV file. That’s the data we’ll view in Protobi.

The complete R program is below:

# Step 1:  Load packages, libraries and data
# Install once per machine; scatterplot3d and MASS are required by poLCA
install.packages(c("scatterplot3d", "MASS", "car", "poLCA"))
library(car)    # for recoding (car::recode)
library(poLCA)  # for segmentation (latent class analysis)

so <- read.csv("2013_StackOverflowRecoded.csv", header=TRUE, sep=",")

# Step 2: Recode basis variables to positive integers starting at one
so$rs14_1 <- recode(so$q14_1,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_2 <- recode(so$q14_2,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_3 <- recode(so$q14_3,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_4 <- recode(so$q14_4,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_5 <- recode(so$q14_5,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_6 <- recode(so$q14_6,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_7 <- recode(so$q14_7,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_8 <- recode(so$q14_8,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_9 <- recode(so$q14_9,"5=6;4=5;3=4;2=3;1=2;0=1;")

# Step 3: Compute segmentation using above columns as the segmentation basis
q14rs <- cbind(rs14_1, rs14_2, rs14_3, rs14_4, rs14_5, rs14_6, rs14_7, rs14_8, rs14_9) ~ 1

q14clu2 <- poLCA(q14rs, so, nclass=2, na.rm=FALSE); # BIC(2): 163650.3
q14clu3 <- poLCA(q14rs, so, nclass=3, na.rm=FALSE); # BIC(3): 161445.0
q14clu4 <- poLCA(q14rs, so, nclass=4, na.rm=FALSE); # BIC(4): 160172.9
q14clu5 <- poLCA(q14rs, so, nclass=5, na.rm=FALSE); # BIC(5): 159428.3
q14clu6 <- poLCA(q14rs, so, nclass=6, na.rm=FALSE); # BIC(6): 159209.8

# Step 4: merge estimated segment membership back to main data frame
so$q14_clu2 <- q14clu2$predclass
so$q14_clu3 <- q14clu3$predclass
so$q14_clu4 <- q14clu4$predclass
so$q14_clu5 <- q14clu5$predclass
so$q14_clu6 <- q14clu6$predclass

# Step 5: export augmented data as a new CSV
write.table(so, file="2013_StackOverflowRecoded_lca.csv", sep=",", col.names=TRUE,qmethod="double", na="", row.names=FALSE)
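The BIC values noted in the Step 3 comments give one rough way to compare the five solutions. The sketch below simply copies those numbers into a vector and looks at how much BIC changes with each added class. It is an illustration only; in a live session the value could presumably be read from each fitted object (poLCA stores it as `bic`), and results will vary between runs because the estimation starts from random values.

```r
# BIC values copied from the comments in Step 3 (lower is better)
bic <- c("2" = 163650.3, "3" = 161445.0, "4" = 160172.9,
         "5" = 159428.3, "6" = 159209.8)

# Change in BIC when one more class is added
improvement <- diff(bic)
print(round(improvement, 1))

# Number of classes with the lowest BIC among those tried
best <- names(bic)[which.min(bic)]
cat("Lowest BIC at", best, "classes\n")
```

BIC keeps falling through six classes, but each added class buys less than the one before. Fit statistics are only one input; whether the segments are interpretable and useful matters just as much.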

Visualize segments in Protobi

We had first created a project based on the original dataset, and organized that view nicely. So we updated the project in-place with the new augmented dataset. This allows us to keep the same map but add/drop fields with possibly new records or field values.

There are several new fields, one for each cluster solution, including the 3-cluster solution, q14_clu3. At first its segments are unnamed, with just the values 1, 2 and 3. We can get a sense of their character by drilling into each value and looking for significant differences.

Candidate segment 1

For instance, below we press into value q14_clu3 = 1:


Here the values for respondents in this segment are shown in blue. The baseline distribution for all respondents is shown as a light grey shadow for comparison.

We can see that Segment 1 is significantly less likely (as indicated by the gray arrow icon) to spend a lot of time on new features or refactoring, and more likely to spend a lot of time on meetings, technical support, new skills and everything else. We might call these “All but dev”.

Candidate segment 2

Below is segment 2. These respondents are quite the opposite, focused almost exclusively on new features and code quality. We might call these “Primarily dev”:


Candidate segment 3

Finally, there is segment 3. These respondents are even more likely than segment 2 to spend a lot of time on new features and quality, yet even more likely than segment 1 to spend a lot of time in meetings, tech support and learning new skills. We might call this segment “Dev and growth” (to be literal) or perhaps “Entrepreneur” (to apply a bit of descriptive license).


Profile a segmentation

So now we can name the segments:

Pressing and contrasting is fun and informative for exploratory analysis. But to present it to the client, we might aim for a more concise crosstab (which we can copy to Excel and create a stylized custom chart):
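A comparable profile table can also be sketched directly in R, as a cross-check on what Protobi shows. The example below uses a made-up data frame `toy` with a segment column and one recoded basis variable; the values are hypothetical and do not come from the survey.

```r
# Hypothetical stand-in: segment membership and one recoded basis variable
toy <- data.frame(
  segment = c(1, 1, 2, 2, 2, 3, 3, 3),
  rs14_1  = c(1, 2, 5, 6, 6, 5, 6, 4)
)

# Counts of each answer level within each segment
counts <- table(segment = toy$segment, answer = toy$rs14_1)

# Row percentages: distribution of answers within each segment
profile <- round(100 * prop.table(counts, margin = 1), 1)
print(profile)
```

Each row sums to roughly 100 (up to rounding), so segments of different sizes can be compared side by side.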

Compare alternative segmentations

Wait … what about the four-cluster solution? How’s that different? Might that be better? Let’s take a look!

One thing we can do is crosstab two candidate solutions. That’s easy to do in Protobi by dragging the header of one to the header of the other.

For instance, we can compare the 4-cluster solution to the 3-cluster solution. Here we can see that

  • segment 4-1 corresponds to 3-3 (“Dev and growth”)
  • segment 4-4 corresponds to 3-2 (“Primarily dev”)
  • segments 4-2 and 4-3 split 3-1 (“All but dev”)
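The same kind of comparison can be sketched in R by cross-tabulating the two membership columns. The eight rows below are made up so that they mirror the correspondence described above; they are not real respondents.

```r
# Hypothetical memberships for eight respondents under two solutions
toy <- data.frame(
  q14_clu3 = c(3, 3, 2, 2, 1, 1, 1, 1),
  q14_clu4 = c(1, 1, 4, 4, 2, 2, 3, 3)
)

# Rows: 4-cluster solution, columns: 3-cluster solution
xt <- table(clu4 = toy$q14_clu4, clu3 = toy$q14_clu3)
print(xt)

# For each 4-cluster segment, the 3-cluster segment holding most of its members
dominant <- apply(xt, 1, function(row) names(row)[which.max(row)])
print(dominant)
```

When one solution mostly splits a segment of the other, as here, the crosstab has a single large cell in most rows. A crosstab with members scattered across many cells suggests the two solutions carve up respondents in genuinely different ways.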

This post provides a brief tutorial on how to estimate candidate segmentations in an external software package, and visualize the resulting segmentations in Protobi.


Try Protobi with your next segmentation project, and let our expert analysts show you how.
