Extreme values: Winsorize, trim or retain?

Normal distribution bell curve showing standard deviation ranges. The curve displays percentage distributions with 34.1% in each section one standard deviation from the mean, 13.6% in each section between one and two standard deviations, and 2.2% beyond two standard deviations on each tail. Orange vertical lines mark the boundaries at 2 sigma.

Sometimes your data has outliers. Trimming and Winsorizing are two ways to mitigate the effect of extreme values on your analysis.

"Trimming" data excludes the outlier values from your analysis. "Winsorizing" retains the responses in your basis, but caps numeric outliers so they fall at the edge of the main distribution.

A common request is to trim or Winsorize data to the range between the 5th and 95th percentiles. In practice, however, survey data is often highly asymmetric, so clipping only the high end may be reasonable.
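As a rough illustration of the difference between two-sided and high-end-only clipping, here is a minimal Python sketch. It is not Protobi code: the helper `percentile_bounds` and the sample values are made up for this example, and the "inclusive" percentile method is only one of several common conventions.

```python
import statistics

def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Return the (lower, upper) percentile cut points of the values."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Made-up monthly patient counts, for illustration only.
responses = [5, 8, 10, 12, 15, 20, 25, 30, 30, 35,
             40, 45, 50, 60, 70, 80, 90, 120, 180, 300]

low, high = percentile_bounds(responses)

# Two-sided: cap both tails at the 5th and 95th percentiles.
two_sided = [min(max(v, low), high) for v in responses]

# One-sided: leave the low end alone and cap only the high end.
high_only = [min(v, high) for v in responses]

print(f"bounds: [{low}, {high}]")
print("two-sided min/max:", min(two_sided), max(two_sided))
print("high-only min/max:", min(high_only), max(high_only))
```

With asymmetric data like this, the one-sided version leaves the smallest legitimate answers untouched while still pulling in the long upper tail.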

Example

In the example below, most physicians report under 100 patients per month. But a few (4%) report much higher numbers.

The screener termination criteria already bound the responses to be at least 5. We might clip answers above 100, as shown by the gold line.

Protobi data view showing question S8 asking physicians how many patients they manage per month with Condition Y. The distribution shows most responses concentrated in lower ranges (1-100 patients), with 96% of the 100 respondents reporting under 100 patients. Four extreme outlier values appear at 111-120, 171-180, 191-200, and 291-300, each representing 1%. Mean is 43.7, median is 30.0. An orange horizontal line marks the 100-patient threshold.

Winsorizing

We can cap those answers to within a defined range by setting the "ceiling" and "floor" attributes. Press the circle edit icon, choose "Edit JSON..." and set "ceiling": 100. The configuration shown here also sets "floor": 5, matching the screener minimum.

Protobi JSON editor dialog showing element properties for question S8v2. The JSON configuration includes key, title, type (number), sorting, and importantly shows the ceiling and floor attributes set to 100 and 5 respectively. The dialog demonstrates how to configure Winsorizing parameters through direct JSON editing.

The data is now bounded to the range [5, 100]. The outlying values are not dropped but are counted as if they were equal to 100, so they fall in the range "91 to 100", which has increased from 8% to 12%. The N size is still 100, but the mean is a bit lower, falling from 43.7 to 39.7.

Protobi data view showing question S8 after Winsorizing with ceiling set to 100. The distribution now shows all values capped at the "91 to 100" range, which has increased from 8% to 12% as the four outlier responses have been folded into this top bucket. The sample size remains N=100, but the mean has decreased from 43.7 to 39.7, while the median remains unchanged at 30.0.

Note that the median did not change at all. In all but the most extreme cases, the median is robust to outliers and unaffected by Winsorizing because the extreme values stay on their side of the median.
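The effect of a floor and ceiling can be sketched in a few lines of Python. This is an illustration of the idea only, not how Protobi implements it; the function name `winsorize` and the sample values are assumptions for this example.

```python
import statistics

def winsorize(values, floor=None, ceiling=None):
    """Cap each value to [floor, ceiling]; the number of values is unchanged."""
    capped = []
    for v in values:
        if floor is not None:
            v = max(v, floor)
        if ceiling is not None:
            v = min(v, ceiling)
        capped.append(v)
    return capped

# Made-up values for illustration only, not the survey data shown above.
patients = [5, 10, 15, 20, 25, 30, 30, 40, 50, 60, 80, 95, 115, 175, 195, 295]
capped = winsorize(patients, floor=5, ceiling=100)

print("N:     ", len(patients), "->", len(capped))
print("mean:  ", statistics.mean(patients), "->", statistics.mean(capped))
print("median:", statistics.median(patients), "->", statistics.median(capped))
```

In this made-up sample the count stays the same, the mean drops, and the median does not move, because the capped values remain above the median.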

Trimming

Another approach is to ignore responses outside the main range. To do this we can set a filter which includes only responses that fall within the range (5, 100].

Protobi data view showing question S8 after trimming to exclude outliers outside the range (5, 100]. The distribution shows the same percentage patterns but now with a reduced sample size of N=96 instead of 100, as the four extreme outlier responses have been completely excluded. The mean has decreased further to 37.2, while the median remains at 30.0.

Here the basis is lower, N=96, because the outliers are excluded from the distribution. The mean is lower still, at 37.2. The median happens to stay at 30, but trimming can change the median if more values are removed from one end than the other. To apply the filter in Protobi, select "Edit JSON..." from the context menu and enter the following:

"filter": {
        "S8v2": {
            "$gt": 5,
            "$lte": 100
        }
    }

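This is MongoDB query syntax: "$gt" means "greater than" and "$lte" means "less than or equal to". The filter includes only responses where data column S8v2 is greater than 5 and less than or equal to 100. Because "$gt" is strict, a response of exactly 5 is excluded; the MongoDB operator "$gte" ("greater than or equal to") would include it.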

Protobi JSON editor dialog showing element properties for question S8v2 with a filter configuration added. The JSON includes a "filter" object using MongoDB query syntax with operators "$gt": 5 and "$lte": 100 to include only responses where S8v2 is greater than 5 and less than or equal to 100, implementing the trimming approach.
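To make the filter's semantics concrete, here is a small Python sketch of the same rule. It only mimics the meaning of "$gt" and "$lte" on a plain list; the function `trim` and the sample values are hypothetical and are not part of Protobi or MongoDB.

```python
import statistics

def trim(values, gt=None, lte=None):
    """Keep only values v with gt < v <= lte, mirroring "$gt" and "$lte"."""
    kept = []
    for v in values:
        if gt is not None and not v > gt:
            continue  # fails the strict "greater than" test
        if lte is not None and not v <= lte:
            continue  # fails the "less than or equal" test
        kept.append(v)
    return kept

# Made-up values for illustration only, not the survey data shown above.
patients = [5, 10, 15, 20, 25, 30, 30, 40, 50, 60, 80, 95, 115, 175, 195, 295]
trimmed = trim(patients, gt=5, lte=100)

print("N:     ", len(patients), "->", len(trimmed))
print("mean:  ", statistics.mean(patients), "->", statistics.mean(trimmed))
print("median:", statistics.median(patients), "->", statistics.median(trimmed))
```

Unlike Winsorizing, the count shrinks here. In this sample the median also shifts, because more values are removed from the high end than from the low end, and the value of exactly 5 is dropped by the strict lower bound.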

Recode outliers

Sometimes responses are entered honestly but in error. For instance, a respondent may have written they purchased their Tesla in the year "2081".

We might prefer to believe they meant to write "2018" rather than that they traveled back from the future to complete the survey. In this case one could recode "2081" to "2018" using the "Recode..." dialog in Protobi.
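Conceptually, a recode is just a lookup that replaces specific known-bad values and leaves everything else alone. The Python sketch below illustrates that idea with made-up data; the `recode` helper is hypothetical and does not represent Protobi's "Recode..." dialog.

```python
def recode(values, mapping):
    """Replace any value found in mapping; pass all other values through."""
    return [mapping.get(v, v) for v in values]

# Made-up purchase years, for illustration only.
purchase_years = [2016, 2018, 2081, 2019]
fixed = recode(purchase_years, {2081: 2018})

print(fixed)
```

Because the mapping targets one specific value, the correction is explicit and easy to audit, unlike a blanket cap or filter.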

Retain outliers and use a log scale

Just because numbers are atypical doesn't mean they are unreasonable. Here, it's possible a few physicians really do treat many more patients with this condition than most doctors do.

Many phenomena yield "long-tail" distributions where a few outliers legitimately exist. For instance in economics most people have modest wealth but a few have very high net worth, and to exclude them from analysis would be misleading.

Many "long-tail" distributions look approximately normal when you plot the logarithm of the values. In Protobi you can set "Round By..." to log, which chooses narrower bin ranges for smaller numbers and wider bin ranges for larger numbers.

Protobi data view showing question S8 using logarithmic binning to accommodate the full range of values including outliers. The distribution displays six logarithmically-scaled ranges (5-9, 10-24, 25-49, 50-99, 100-249, 250-499), with most responses concentrated in the middle ranges. This approach retains all 100 responses including the extreme values, with mean=43.7 and median=30.0 matching the original unmodified data.
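One way to get bins that widen with the values is to place edges on a 1, 2.5, 5 series within each power of ten. The Python sketch below does this and happens to produce the same ranges as the chart above, but it is only an assumption about how such bins could be built, not Protobi's actual algorithm. The function `log_bin_edges` and the sample values are hypothetical.

```python
import bisect
from collections import Counter

def log_bin_edges(lo, hi, steps=(1, 2.5, 5)):
    """Bin edges on a 1-2.5-5 series per decade that cover [lo, hi]."""
    candidates = []
    decade = 1
    while decade <= hi * 10:
        candidates.extend(s * decade for s in steps)
        decade *= 10
    # Start at the last edge <= lo (or the first edge if lo is below 1).
    start = max(bisect.bisect_right(candidates, lo) - 1, 0)
    # Stop at the first edge strictly greater than hi.
    stop = bisect.bisect_right(candidates, hi)
    return candidates[start:stop + 1]

# Made-up values for illustration only, not the survey data shown above.
patients = [5, 10, 15, 20, 25, 30, 30, 40, 50, 60, 80, 95, 115, 175, 195, 295]

edges = log_bin_edges(min(patients), max(patients))

# A value v belongs to the bin whose lower edge is the last edge <= v.
counts = Counter(bisect.bisect_right(edges, v) - 1 for v in patients)

for i in range(len(edges) - 1):
    print(f"{edges[i]:g} to <{edges[i + 1]:g}: {counts[i]}")
```

Every response is kept, including the largest ones, and no single bin is dominated by the tail.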

Summary

This tutorial demonstrated three approaches to handling extreme values: trimming, Winsorizing, and retaining but plotting on a logarithmic scale.

Trimming makes a lot of sense when you simply don't believe the answers, e.g., a traveler who says he makes 999 commercial flights per year.

Retaining the data makes sense when there legitimately may be high values, e.g., a few business travelers may actually take 100+ flights per year. A log scale may be useful.

Winsorizing makes sense when we want to retain the high-value responses but not take them too literally, such as when weighting physicians by self-reported patient volumes.

See related

To remove certain respondents completely from the project view, see Filter or remove respondents.