Word clouds

Create word clouds from text verbatims

Word cloud visualization showing frequently occurring words from survey responses about a medical device. Prominent words include 'looks', 'easy', 'device', 'use', 'good', and 'patients' displayed in various sizes and colors, with larger words indicating higher frequency.

Let's say your study asked respondents to provide text answers to an open-ended question:

Protobi data table showing question Q7 asking 'What are your initial thoughts about the device you just reviewed?' with a filter box and Apply button at top. Below is a two-column table with 'Value' showing full text verbatim responses and 'Freq' showing each response appears only 1.0% of the time, indicating high response diversity.

The answers are too long to read in one line, and there are too many to show in one bar chart. Word clouds can be a simple and whimsical way to quickly convey the gist of the answers.

Create a word cloud

To turn the responses in an element into a word cloud, press the circle edit icon to bring up the context menu. Select "More properties...", then under "Chart type" select "Word cloud". This creates a simple cloud showing all verbatim answers:

Initial word cloud showing complete verbatim responses as individual entries, resulting in very small, difficult-to-read text scattered across the visualization. The entire sentences and phrases are displayed, making the cloud cluttered and ineffective since each unique response has equal tiny representation.

By default, the cloud represents the frequency of each answer with a font size equal to its percentage frequency. Here, the chart doesn't make sense: each answer is unique and represents a tiny percentage of the sample, so the fonts are too small to read and the result is not visually interesting.

Edit word case

Protobi is case-sensitive, so "easy" and "Easy" are counted as two different response values. If you don't want the same word with different casing to be counted as separate responses, select "Edit properties..." and next to "string" select a case to apply to all values.

Edit properties dialog showing the 'string' field with a dropdown menu expanded, displaying case conversion options: (none), lower case (currently selected in blue), UPPER CASE, Title Case, and camelCase.

Split phrases into words

Word clouds are more effective when the design clearly represents prevalence of key words. To exhibit the frequencies of individual words, rather than complete responses, select "Edit properties..." and under "split" type:

  • " " (i.e., just a space, without the quotes) or
  • "word" (i.e., just the word 'word' without the quotes)

The first splits sentences at each space, and only at spaces. The second is a little smarter: it splits strings at certain punctuation marks and word boundaries but avoids splitting at underscores and hyphens.

Either setting splits the responses into shorter strings and shows the frequency of each word:

Improved word cloud after splitting text into individual words, showing clear prominent terms like 'looks', 'easy', 'use', 'device', 'good', and 'patients' in larger fonts with better readability. The cloud now effectively highlights the most frequently mentioned concepts from the verbatim responses.
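For reference, here is a minimal sketch of how the same setting might look if you open the element with Edit JSON... instead of the Edit properties dialog. It assumes the property is stored on the element under the name split, as the dialog suggests; all other attributes of the element are omitted.

```json
{
  "split": "word"
}
```

Going by the description above, a response like "Looks easy, user-friendly." would split at spaces into "Looks", "easy," and "user-friendly." (punctuation stays attached), whereas "word" should drop the punctuation while leaving the hyphenated "user-friendly" intact.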

Exclude common words

By default, Protobi word clouds exclude the following set of words that occur frequently in English: "of,the,and,to,an,are,is,for,do,a,it,be,i,with,in,that,have,on,so".

These words are specified under the suppress property, which accepts a list of words separated by spaces, commas, hyphens, or any regex word boundary. You could therefore copy and paste the question text at the end of the list to also exclude any words that appear in the question.
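As a rough sketch, a suppress list extended with words from the example question ("What are your initial thoughts about the device you just reviewed?") might look like the fragment below. It assumes suppress sits inside the element's chartOptions block, as the options dialog suggests, and the added words are only an illustration; leave out any word, such as "device", that you still want to see in the cloud.

```json
{
  "chartOptions": {
    "suppress": "of,the,and,to,an,are,is,for,do,a,it,be,i,with,in,that,have,on,so,what,your,initial,thoughts,about,you,just,reviewed"
  }
}
```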

To add more words to this list via the user interface, right-click a word and press Ok when asked to confirm:

Dialog box with blue header 'Word cloud' asking 'Exclude "seems" from WordCloud?' with Cancel and Ok buttons. This shows how users can interactively remove common or undesired words from the word cloud visualization.

To remove words from the list, edit the suppress property in the Additional Options dialog, described under "Customize the chart" below.

Customize the chart

You can customize many aspects of the chart, including the maximum and minimum font sizes, cloud shape, etc. From the Chart Type... dialog, select "Additional options" to bring up a dialog with more options:

Edit chartOptions dialog showing advanced word cloud configuration settings including height (300), width (400), font options, maxFontSize (60), maxWords (60), minBasis, minFontSize (5), rotateRatio, shape (circle), and various checkbox options. The suppress field contains the default list of common English words to exclude.

Resize the chart

By default, Protobi scales the word cloud to match the element's inner size. To adjust the chart size, select the element so its header is highlighted. Dashed outlines show the outer and inner sizes. Resize the chart's outer size by moving the blue resize handle. Adjust the inner size by dragging the red margin handles.

Word cloud element Q7_cloud shown with element selection handles visible, including dashed border outlines indicating outer and inner sizing areas. The title 'Translate and recode text verbatims' appears above. Resize handles are shown: blue corner handle for outer size and red margin handles for adjusting inner content area.

If desired, Protobi can size words exactly, so that the most frequent word has the font size specified by the option maxFontSize and other words are sized relative to that. In the Additional Options dialog, select the checkbox for scaleToMaxFontSize and specify a maxFontSize value:

Word cloud with sizing annotations showing the concept of maxFontSize. The word cloud displays text verbatim responses with a label and arrows pointing to demonstrate that the largest word is sized according to the maxFontSize setting, with other words scaled proportionally.
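In JSON terms, the two settings would presumably end up in the element's chartOptions block along these lines. This is a sketch that assumes the option names are stored exactly as they appear in the dialog; the value 60 is just the default maxFontSize.

```json
{
  "chartOptions": {
    "scaleToMaxFontSize": true,
    "maxFontSize": 60
  }
}
```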

Combine similar words

After values are split into words, you can combine similar words into one code using the Recode feature. 

Open the Recode dialog from the context menu. Search for similar words, then select and drag them to a new or existing code on the left.

Recode dialog showing two-column interface with 'Codes' on left (displaying 'New code' in blue and 'easy' with count 0) and 'Uncoded values' on right showing 413 uncoded items. The search box contains 'eas' filtering the list to show related words like 'easy', 'easier', 'reasonable', 'pleased', and 'pleasing.' with green highlighting on the matching text portions.

Set word colors

It's possible to set up rules to color words. The simplest is to select Color... from the context menu and choose a color theme. You can add your company's primary and alternate color themes to your project, and these color schemes can define color sequences or specific colors for specific values; this is covered in a separate tutorial.

Colors dialog overlay showing color scheme options for word cloud. The dialog displays icon color choices (white, blue, green, dark blue, yellow, gray, orange, purple, maroon) and three color scheme options for chart values: 'default' (selected, showing 8 colors), 'ascending' (showing 2 colors), and 'descending' (showing 2 colors). The word cloud is visible in the background.

If you want to assign specific colors to specific words for one element only, you can specify this mapping in the element. Select Edit JSON... from the context menu and create an attribute colors that maps words to colors, as shown below:

JSON code editor showing lines 51-58 with a 'colors' object mapping specific words to colors: 'good' to 'green', 'older' to 'red', 'easy' to 'green', 'friendly' to 'green', 'difficult' to 'red', and 'complicated' to 'red'. This demonstrates custom color assignment for sentiment-based word coloring.
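Written out as text, the mapping in the screenshot looks roughly like this. Only the colors attribute is shown; the rest of the element's JSON is omitted, and the surrounding braces simply stand in for the element object.

```json
{
  "colors": {
    "good": "green",
    "older": "red",
    "easy": "green",
    "friendly": "green",
    "difficult": "red",
    "complicated": "red"
  }
}
```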

Matching words appear in their assigned colors, and all other words appear in grey:

Word cloud showing the result of custom color mapping where specific words are highlighted in color (green for positive words like 'good', 'easy', 'friendly' and red for negative words like 'difficult', 'complicated', 'older') while all other words appear in gray, creating a sentiment-focused visualization.

Keep certain phrases together

The split feature in Edit properties is useful, but what if there are words that have to be kept together, like "Staten Island" or "COVID 19"? The challenge is that such logic has to be applied before splitting values into words. Protobi is already smart enough not to split hyphenated words like "Ocasio-Cortez", and we're working to make this feature prettier and accessible via the user interface. For now, you can keep certain phrases together by replacing spaces with hyphens or underscores.

First, select Edit JSON... from the context menu to bring up the JSON editor for the element. Then create an attribute replace, which is an object mapping expressions on the left to alternate values on the right.

In the example below, the attribute is set to replace all instances of "staten island" with "staten_island" before words are split.

JSON editor showing Edit element properties dialog with a 'replace' attribute on line 35 that maps 'staten island' to 'staten_island', preserving this two-word phrase as a single unit by replacing the space with an underscore before word splitting occurs.
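As text, the attribute in the screenshot would look something like the fragment below. Other element attributes are omitted, and the outer braces stand in for the element object.

```json
{
  "replace": {
    "staten island": "staten_island"
  }
}
```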

Expressions on the left are "regular expressions" and can be super expressive. For example, the dot in "staten.island" is a wildcard that matches any one character, such as a space, a hyphen, or any other character.

JSON editor showing Edit element properties with the replace attribute using a regular expression pattern 'staten.island' (with dot wildcard) to match variations like 'staten island', 'staten-island', or 'staten_island', all replacing to 'staten_island' for consistent handling of this multi-word location name.

Expressions on the right are the replacements. When split is set to "word", Protobi doesn't consider underscores or hyphens to be word boundaries, so "staten_island" won't be split. Protobi word clouds also display underscores as spaces, so the underscore is a good character to represent a non-breaking space.
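Putting the pieces together, an element that keeps phrases intact might carry both attributes, as in the sketch below. The "covid.19" entry is a hypothetical addition modeled on the Staten Island example. Whether matching is case-sensitive isn't stated here, so the patterns may need to match the casing of your responses.

```json
{
  "split": "word",
  "replace": {
    "staten.island": "staten_island",
    "covid.19": "covid_19"
  }
}
```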

Limit the number of words

You can set an absolute limit on the number of words that appear by setting maxWords in the Additional Options dialog.  The default value is 60.  In this case no more than 60 words will appear.  This limit is applied after excluding common words.

You can also set a threshold for the minimum frequency for a word to appear by setting minBasis in the Additional Options dialog. The default is zero, so any word that occurs even once could be included in the word cloud (subject to the maxWords limit). If you set it to 5, then only words that occur 5 or more times will be considered for the word cloud.
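A sketch of both limits together, assuming they are stored in chartOptions under the names shown in the dialog: with these values, only words that occur at least 5 times are considered, and at most 60 of them are drawn.

```json
{
  "chartOptions": {
    "maxWords": 60,
    "minBasis": 5
  }
}
```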

Available options

Word clouds are powered by the wordcloud2.js engine by Timothy Chien. The chartOptions block is passed straight to the rendering engine, allowing you to set these options:

  • minFontSize: 5 (minimum font size, in pixels)
  • maxFontSize: 60 (maximum font size, in pixels)
  • scaleToMaxFontSize: true|false (whether to size the most frequent word at maxFontSize and scale other words relative to it)
  • limit: 60 (maximum number of values to draw)
  • fontFamily: font to use
  • fontWeight: font weight to use, e.g. "bold" or 600
  • color: color of the text; can be any CSS color
  • weightFactor: number by which to multiply the size of each word
  • backgroundColor: color of the background
  • drawOutOfBound: true allows words to extend outside the box
  • shape: the shape of the "cloud" to draw, e.g. "ellipse". Options include:
    • "circle" (default),
    • "cardioid",
    • "diamond",
    • "square",
    • "triangle", and
    • "star".
  • ellipticity: degree of "flatness" of the word cloud
  • shuffle: true (default) randomizes points
  • rotateRatio: probability for a word to rotate (1 = always, 0 = never)
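To give a feel for how several of these fit together, here is an illustrative chartOptions block. The option names come from the list above; the values, such as the "Arial" font and the white background, are arbitrary examples rather than recommendations.

```json
{
  "chartOptions": {
    "minFontSize": 5,
    "maxFontSize": 60,
    "fontFamily": "Arial",
    "fontWeight": "bold",
    "backgroundColor": "white",
    "shape": "circle",
    "rotateRatio": 0,
    "shuffle": true
  }
}
```

Setting rotateRatio to 0 keeps every word horizontal, which tends to be easier to read in reports.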

Word clouds aren't just for text verbatims

Many distributions, not just text, can be drawn as word clouds. Values appear as words, each with a font size proportional to its frequency. For instance, respondent state could be drawn as a word cloud rather than a map:

Word cloud showing U.S. state names sized by frequency, with largest states being 'New York', 'California', 'Florida', 'Texas', 'New Jersey', and 'Illinois'. The title reads 'State of primary practice' under element S2, demonstrating that word clouds can visualize categorical data beyond just text verbatims.

Video Tutorial

Let's make a word cloud