Numeric variables with long-tail distributions

A theoretical heavy-tailed distribution curve showing a steep decline from left to right, with the highest concentration at the left (shown in bright green) gradually tapering off to a long tail extending far to the right. The curve demonstrates the characteristic shape of distributions where most values cluster at lower ranges while a small number of extreme values extend far into the upper range.

By default, Protobi sets "Round by..." to auto, which bins values into equal-width ranges. For distributions with heavy tails, you can set "Round by..." to log instead. Learn more about the "Round by..." dialog in our "Bin numeric values" tutorial.

Protobi automatically bins numeric variables into ranges. Numeric variables come in a few varieties:

  • Constants (e.g. π = 3.141... )
  • Light-tailed distributions
  • Heavy-tailed distributions

Many variables we encounter in market research, such as percentages and preference ratings, have light-tailed or roughly even distributions.

Other variables, such as number of patients, income, book sales, and frequent flier miles, have heavy-tailed distributions. Benoit Mandelbrot coined the terms "mild" versus "wild" randomness to describe the difference.

Example

Below is an example where customers are asked for their purchase budget in dollars. The responses have a classic heavy-tailed distribution: a small number of individuals report very large values.

A frequency table titled 'BUDGET_LINEAR' showing purchase budget data with Round By set to 'auto' (linear binning). The table displays value ranges from [NA] (20.1%) through bins like '1 to 10,000' (45.1%), '10,001 to 20,000' (19.9%), with decreasing frequencies as budget ranges increase. Most responses cluster in the lower ranges, with very small percentages in higher ranges like '60,001 to 70,000' (0.3%). The mean is 11,946.

By default, Protobi sets Round By to auto, which chooses linear bin sizes for numeric variables based on the standard deviation. In the table above, nearly half of respondents fall into the lowest bin (1 to 10,000), and very few have budgets much over $30,000. What the linear bins cannot show is how budgets are spread within that lowest bin.
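To make the effect of equal-width bins concrete, here is a minimal Python sketch. It is not Protobi's implementation: the function name linear_bin, the fixed bin width of 10,000, and the synthetic lognormal sample are all assumptions chosen for illustration, and the rule Protobi uses to derive the width from the standard deviation is not reproduced here.

```python
import random
from collections import Counter

def linear_bin(value, width):
    """Return the (low, high) equal-width bin containing a positive integer value."""
    # Bins run 1..width, width+1..2*width, and so on,
    # mirroring labels such as "1 to 10,000" and "10,001 to 20,000".
    index = (value - 1) // width
    return index * width + 1, (index + 1) * width

# Synthetic heavy-tailed budgets (hypothetical data, not the survey shown above).
random.seed(0)
budgets = [max(1, round(random.lognormvariate(8.5, 1.2))) for _ in range(1000)]

# Count how many budgets land in each equal-width bin.
counts = Counter(linear_bin(b, 10_000) for b in budgets)
for (low, high), n in sorted(counts.items()):
    print(f"{low:>9,} to {high:>9,}: {n / len(budgets):6.1%}")
```

With a heavy-tailed sample like this one, most of the rows crowd into the first bin, and the remaining bins are sparsely populated.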

The second version sets Round By to log, which chooses logarithmic bin sizes. In the table below, we can see that quite a number of customers are willing to spend under $1,000, and a substantial number are willing to spend a lot more.

A frequency table titled 'BUDGET_LOG' showing the same purchase budget data with Round By set to 'log' (logarithmic binning). The table displays logarithmically-scaled value ranges from 0 (20.1%) through ranges like '100 to 249' (1.0%), '250 to 499' (1.7%), '2,500 to 4,999' (11.4%), '5,000 to 9,999' (18.3%), '10,000 to 24,999' (25.3%), with notable percentages distributed across all ranges. The mean is 9,546. This binning reveals a more nuanced distribution pattern than the linear version.
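The bin labels in the table above (100 to 249, 250 to 499, 2,500 to 4,999, and so on) are consistent with bin edges at 1, 2.5, and 5 times each power of ten. The Python sketch below assigns a value to such a bin. This is an assumption inferred from the labels, not Protobi's documented algorithm, and the function name log_bin is hypothetical.

```python
import math

def log_bin(value):
    """Return the (low, high) bin for a positive integer, with edges at 1, 2.5, 5 x 10^k."""
    if value < 1:
        raise ValueError("log bins are only defined here for values >= 1")
    # Largest power of ten that does not exceed the value,
    # found by counting digits to avoid floating-point log issues.
    decade = 10 ** (len(str(value)) - 1)
    # Assumed edges within one decade, rounded up to whole numbers.
    edges = [math.ceil(decade * m) for m in (1, 2.5, 5, 10)]
    for low, nxt in zip(edges, edges[1:]):
        if low <= value < nxt:
            return low, nxt - 1

for v in (120, 250, 3_000, 9_999, 10_000):
    low, high = log_bin(v)
    print(f"{v:>6,} falls in {low:,} to {high:,}")
```

Because each bin is roughly 2 to 2.5 times as wide as the one before it, small budgets get fine-grained bins while very large budgets are still covered by only a few rows.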

With this perspective, a product strategy might be radically different: selling one way to customers with $250 to spend and another way to those with $2,500, rather than lumping them all into an "Under $5,000" category.