Here’s how to compute candidate market segmentations using R and profile them in Protobi.

Example

The StackOverflow Developer Survey asked software developers what percentage of their time they spend on various tasks. Results are shown below:

Horizontal bar chart showing percentage of time developers spend on nine tasks: Developing new features (31%), Improving code quality (21%), Looking for documentation (17%), In meetings (10%), Technical support (7%), Learning new skills (7%), Communicating with team (6%), and Looking for a new job (0.8%).

The most frequent task (31%) is "Developing new features", and the least frequent task is "Looking for a new job" (0.8%). But these are averages, hiding a lot of individual diversity. Can we group developers into different camps based on how they spend their time?

K-means segmentation

There are many good ways to develop candidate segmentations. One common method is K-means clustering.
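
For readers who want to see what the K-means route looks like, here is a minimal sketch using base R's kmeans function. The data are synthetic stand-ins (three made-up task shares for 300 hypothetical respondents), not the survey file, and the choice of two clusters is an assumption for illustration only.

```r
# Synthetic stand-in data: 300 hypothetical respondents, three task shares (percent of time)
set.seed(42)
n <- 300
features <- c(rnorm(n / 2, mean = 45, sd = 5), rnorm(n / 2, mean = 20, sd = 5))
meetings <- c(rnorm(n / 2, mean = 5,  sd = 2), rnorm(n / 2, mean = 20, sd = 4))
support  <- c(rnorm(n / 2, mean = 5,  sd = 2), rnorm(n / 2, mean = 15, sd = 4))
time_use <- data.frame(features, meetings, support)

# Standardize so that no single task dominates the distance calculation
scaled <- scale(time_use)

# Fit k-means with two clusters; several random starts reduce the risk of a poor local optimum
km <- kmeans(scaled, centers = 2, nstart = 10)

# Attach the cluster label to each respondent and profile the clusters by mean time share
time_use$segment <- km$cluster
print(table(time_use$segment))
print(aggregate(. ~ segment, data = time_use, FUN = mean))
```

K-means treats the inputs as continuous and works on distances, which is why the sketch standardizes the columns first.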

Here we instead use Latent Class Analysis, via the poLCA library in R, to derive candidate segmentations and attach predicted segment membership back to the original data file for evaluation. (An analogous process can be done using LatentGold, QUICK CLUSTER in SPSS or FASTCLUS in SAS.)

The basic steps are:

Import data
Recode basis variables
Segment respondents into various numbers of clusters
Merge back predicted segment membership
Export data

Step 1: Load the relevant packages and read the input dataset. You can find the data in SAV format at 2013_StackOverflowRecoded.sav and in CSV format at 2013_StackOverflowRecoded.csv.

Step 2: A quirk of poLCA is that the variables used as the segmentation basis must be coded as a sequence of integers starting with 1 (i.e. 1, 2, 3, ...). Here the values are already coded as a sequence of integers, but starting at 0, so we increment them by 1 using the recode function in the car package.
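
Because this recode is just "add one", the same shift can be sketched in base R without car. The toy data frame below is a made-up stand-in for the survey file, with only two of the nine basis columns; the names src and dst are our own.

```r
# Toy data frame with two of the nine basis columns, coded 0..5 (NA = missing)
so <- data.frame(q14_1 = c(0, 3, 5, NA),
                 q14_2 = c(1, 0, 2, 4))

src <- paste0("q14_", 1:2)    # use 1:9 for the full set of basis variables
dst <- paste0("rs14_", 1:2)   # names of the shifted copies

# Shift every source column up by one; NA stays NA
so[dst] <- so[src] + 1

print(so)
```

The explicit recode mapping has one advantage over this shortcut: it documents exactly which codes are expected.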

Step 3: Run cluster analyses to create solutions with 2, 3, 4, 5 and 6 clusters, respectively. The poLCA algorithm treats all basis variables as categorical, not ordinal or continuous. There’s an inherent ordinality in our coding that poLCA therefore can’t recognize, whereas the more sophisticated algorithms in LatentGold from Statistical Innovations can.
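
The repeated model fits can also be written as a loop. The sketch below is one possible way to do that, under assumptions: it runs on synthetic random responses rather than the survey data, fits only 2 and 3 classes to stay quick, and skips the fitting step when poLCA is not installed. The names toy, basis and fits are ours.

```r
set.seed(1)
n <- 200
vars <- paste0("rs14_", 1:9)

# Synthetic responses coded 1..6 (hypothetical data, not the survey)
toy <- as.data.frame(matrix(sample(1:6, n * 9, replace = TRUE),
                            nrow = n, dimnames = list(NULL, vars)))

# Build the same kind of formula as in the full program: cbind(...) ~ 1
basis <- as.formula(paste0("cbind(", paste(vars, collapse = ", "), ") ~ 1"))

if (requireNamespace("poLCA", quietly = TRUE)) {
  library(poLCA)
  # One latent class model per candidate number of classes
  fits <- lapply(2:3, function(k) {
    poLCA(basis, toy, nclass = k, na.rm = FALSE, nrep = 3, verbose = FALSE)
  })
  names(fits) <- paste0("k", 2:3)

  # Collect the BIC of each solution for comparison
  print(sapply(fits, function(f) f$bic))
}
```

Setting nrep above 1 asks poLCA to re-estimate from several random starts and keep the best fit, which helps because the likelihood can have local maxima.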

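Note that we set na.rm=FALSE, which means that respondents with missing values on some basis variables are retained rather than dropped. (With na.rm=TRUE, the poLCA default, those respondents would be removed listwise before the model is estimated.)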

Step 4: Finally, we merge the predicted class memberships from each solution back to the main data frame.
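
A direct column assignment works here because no respondents are dropped, so there is one predicted class per row. If a model were instead fit only on complete cases, the predictions would be shorter than the data frame and would need to be aligned first. A minimal sketch of that alignment, using a toy data frame and a made-up pred vector standing in for the model's predicted classes:

```r
# Toy basis data with some missing values
toy <- data.frame(rs14_1 = c(1, 2, NA, 4, 6),
                  rs14_2 = c(2, 2, 3, NA, 1))

# Rows that would survive listwise deletion
keep <- complete.cases(toy)

# Stand-in for predicted classes from a model fit on the complete rows only
pred <- c(1L, 2L, 1L)
stopifnot(length(pred) == sum(keep))

# Start with NA everywhere, then fill in only the rows that were modeled
toy$segment <- NA_integer_
toy$segment[keep] <- pred

print(toy)
```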

Step 5: Export as a new CSV file. That’s the data we’ll view in Protobi.

The complete R program is below:

# Step 1: Load packages and data
# Install once if needed; installing poLCA also installs its dependencies (scatterplot3d, MASS)
# install.packages(c("car", "poLCA"))
library(car)    # for recoding
library(poLCA)  # for segmentation

so <- read.csv("2013_StackOverflowRecoded.csv", header=TRUE, sep=",")

# Step 2: Recode basis variables to positive integers starting at one
so$rs14_1 <- recode(so$q14_1,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_2 <- recode(so$q14_2,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_3 <- recode(so$q14_3,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_4 <- recode(so$q14_4,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_5 <- recode(so$q14_5,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_6 <- recode(so$q14_6,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_7 <- recode(so$q14_7,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_8 <- recode(so$q14_8,"5=6;4=5;3=4;2=3;1=2;0=1;")
so$rs14_9 <- recode(so$q14_9,"5=6;4=5;3=4;2=3;1=2;0=1;")

# Step 3: Compute segmentation using above columns as the segmentation basis
q14rs <- cbind(rs14_1, rs14_2, rs14_3, rs14_4, rs14_5, rs14_6, rs14_7, rs14_8, rs14_9) ~ 1

q14clu2 <- poLCA(q14rs, so, nclass=2, na.rm=FALSE); # BIC(2): 163650.3
q14clu3 <- poLCA(q14rs, so, nclass=3, na.rm=FALSE); # BIC(3): 161445.0
q14clu4 <- poLCA(q14rs, so, nclass=4, na.rm=FALSE); # BIC(4): 160172.9
q14clu5 <- poLCA(q14rs, so, nclass=5, na.rm=FALSE); # BIC(5): 159428.3
q14clu6 <- poLCA(q14rs, so, nclass=6, na.rm=FALSE); # BIC(6): 159209.8

# Step 4: merge estimated segment membership back to main data frame
so$q14_clu2 <- q14clu2$predclass
so$q14_clu3 <- q14clu3$predclass
so$q14_clu4 <- q14clu4$predclass
so$q14_clu5 <- q14clu5$predclass
so$q14_clu6 <- q14clu6$predclass

# Step 5: export augmented data as a new CSV
write.table(so, file="2013_StackOverflowRecoded_lca.csv", sep=",", col.names=TRUE,qmethod="double", na="", row.names=FALSE)
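
The BIC values noted in the Step 3 comments give a rough sense of how much each additional class helps. The sketch below only does arithmetic on those five reported numbers; the variable names are ours.

```r
# BIC values as recorded in the comments of the program above
bic <- c("2" = 163650.3, "3" = 161445.0, "4" = 160172.9,
         "5" = 159428.3, "6" = 159209.8)

# Improvement (drop in BIC) gained by each additional class
improvement <- -diff(bic)
names(improvement) <- paste0(names(bic)[-length(bic)], "->", names(bic)[-1])
print(round(improvement, 1))

# Number of classes with the lowest BIC among those tried
print(names(which.min(bic)))
```

BIC keeps falling through six classes, but each step gains less than the one before, so interpretability of the segments is a reasonable tie-breaker when choosing among solutions.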

Visualize segments in Protobi

We had already created a project based on the original dataset and organized its view, so we updated the project in place with the new augmented dataset. This lets us keep the same layout while adding or dropping fields and accommodating new records or field values.

There are several new fields, one for each cluster solution, including the 3-cluster solution, q14clu_3. At first its segments are unnamed, with just the values 1, 2 and 3. We can get a sense of their character by drilling into each value and looking for significant differences.

Candidate segment 1

For instance, below we press into value q14clu_3 = 1:

Segmentation variable view showing 'q14clu_3' with three segments. Segment 1 is selected (highlighted in blue) representing 46% of respondents (3,940 cases), Segment 2 shows 43% (3,697 cases), and Segment 3 shows 11% (904 cases).

Horizontal bar chart comparing Segment 1 (blue bars) against baseline (gray shadow) for nine developer tasks. Segment 1 shows lower percentages for developing features (23%) and code quality (15%), but higher percentages for meetings (13%), technical support (9%), and learning skills (9%).

Here the values for respondents in this segment are shown in blue. The baseline distribution for all respondents is shown as a light grey shadow for comparison.

We can see that Segment 1 is significantly less likely (as indicated by the gray arrow icon) to spend a lot of time on new features or refactoring, and a lot more time on meetings, technical support, new skills and everything else. We might call these “All but dev”.

Candidate segment 2

Below is segment 2. These respondents are quite the opposite, focused almost exclusively on new features and code quality:

Segmentation variable view showing 'q14clu_3' with Segment 2 selected (highlighted in blue) representing 43% of respondents (3,697 cases). Segment 1 shows 46% (3,940 cases) and Segment 3 shows 11% (904 cases).

Horizontal bar chart comparing Segment 2 (blue bars) against baseline (gray shadow). Segment 2 shows very high percentages for developing features (40%) and code quality (29%), with down arrows indicating significantly lower values for all other tasks including meetings (5%), documentation (13%), and technical support (3%).

Candidate segment 3

Finally, there is segment 3. These respondents are even more likely than segment 2 to spend a lot of time on new features and quality, yet even more likely than segment 1 to spend a lot of time in meetings, tech support and learning new skills. We might call this segment “Dev and growth” (to be literal) or perhaps “Entrepreneur” (to apply a bit of descriptive license).

Segmentation variable view showing 'q14clu_3' with Segment 3 selected (highlighted in blue) representing 11% of respondents (904 cases). Segment 1 shows 46% (3,940 cases) and Segment 2 shows 43% (3,697 cases).

Horizontal bar chart comparing Segment 3 (blue bars) against baseline (gray shadow). Shows high values for developing features (38%), code quality (25%), meetings (15%), technical support (9%), and learning skills (9%), with down arrows on documentation (10%) and everything else (1%).

Profile a segmentation

So now we can name the segments:

Segmentation variable view showing renamed segments: '3-1 All but dev' (46%, 3,940 cases), '3-2 Primarily dev' (43%, 3,697 cases), and '3-3 Dev and growth' (11%, 904 cases), with Segment 1 selected in blue.

Pressing and contrasting is fun and informative for exploratory analysis. But to present the results to a client, we might aim for a more concise crosstab (which we can copy to Excel and turn into a stylized custom chart):

Large crosstab showing all nine task questions crossed by the three named segments. Each cell contains percentages and horizontal bars. The table clearly shows segment differences: 'All but dev' has lower percentages for development tasks, 'Primarily dev' shows high percentages (40%, 29%) for features and quality, and 'Dev and growth' shows balanced high values across multiple activities.

Compare alternative segmentations

Wait … what about the four-cluster solution? How’s that different? Might that be better? Let’s take a look!

One thing we can do is crosstab two candidate solutions. That’s easy to do in Protobi by dragging the header of one to the header of the other.

For instance, we can compare the 4-cluster solution to the 3-cluster solution. Here we can see that:

segment 4-1 corresponds to 3-3 (“Dev and growth”)
segment 4-4 corresponds to 3-2 (“Primarily dev”)
segments 4-2 and 4-3 split 3-1 (“All but dev”)

Crosstab comparing 4-cluster solution (rows) against 3-cluster solution (columns). The table shows how segment '4-1' maps to '3-3' (904 cases, 100%), '4-4' maps to '3-2' (3,697 cases, 100%), while '4-2' and '4-3' split the '3-1' segment with 2,085 and 1,855 cases respectively.
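
The same comparison can be sketched in R with table before exporting. The two label vectors below are made up to mimic the pattern described above (they are not the survey results); with the real data they would be the predicted-class columns from the 3- and 4-cluster solutions.

```r
# Made-up segment labels for eight hypothetical respondents
clu3 <- c(3, 2, 1, 1, 2, 3, 1, 1)
clu4 <- c(1, 4, 2, 3, 4, 1, 2, 3)

# Cross-tabulate the 4-cluster solution (rows) against the 3-cluster solution (columns)
xt <- table(clu4, clu3, dnn = c("four_cluster", "three_cluster"))
print(xt)

# Row percentages: where does each 4-cluster segment fall in the 3-cluster solution?
print(round(prop.table(xt, margin = 1), 2))
```

A row whose count sits in a single column maps cleanly onto one segment of the other solution, while a column that receives two rows has been split.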

Summary

This post provides a brief tutorial on how to estimate candidate segmentations in an external software package and visualize the resulting segmentations in Protobi.

Try Protobi with your next segmentation project, and let our expert analysts show you how.
