5.8:

What Are Outliers?

Outliers are one or more values in a data set that stand out from the others.

For example, suppose the five best horses are determined by their average lap time. A lap time that is unusually fast or unusually slow compared with the rest is considered an outlier.

But how can one identify outliers in a large data set?

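One way is to use the interquartile range (IQR), the difference between the third quartile (Q3) and the first quartile (Q1). Values that lie more than 1.5 times the IQR below Q1 or above Q3 are considered outliers.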
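The IQR rule above can be sketched in a few lines of Python. This is only an illustration: the function name `iqr_outliers`, the lap times, and the choice of the "inclusive" quartile method are assumptions made for the example, and other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical lap times in seconds; 75.4 is an unusually slow lap.
lap_times = [61.2, 60.8, 62.1, 61.5, 60.9, 61.8, 75.4, 61.0]
print(iqr_outliers(lap_times))  # prints [75.4]
```

Here Q1 is 60.975 and Q3 is 61.875, so the fences fall at 59.625 and 63.225, and only the 75.4-second lap lies outside them.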

The second method uses z scores. Values with z scores between minus two and plus two are generally considered usual values; for approximately bell-shaped data, they cover about 95% of the data values. Anything outside this range is considered unusual, or an outlier.
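As a minimal sketch of the z-score rule, the snippet below converts each value to a z score and keeps those beyond the cutoff of two. The helper names and the lap times are hypothetical, and the sample standard deviation is an assumed choice; the population standard deviation would shift the scores slightly.

```python
import statistics

def z_scores(values):
    """Return the z score (distance from the mean in standard deviations) of each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
    return [(v - mean) / sd for v in values]

def unusual_values(values, cutoff=2.0):
    """Return the values whose z score lies below -cutoff or above +cutoff."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Hypothetical lap times in seconds.
lap_times = [61.2, 60.8, 62.1, 61.5, 60.9, 61.8, 75.4, 61.0]
print(unusual_values(lap_times))  # prints [75.4]
```

Note that the outlier itself inflates the mean and standard deviation used to compute the scores, so in very small samples an extreme value may not reach the cutoff.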

The third method uses box plots. Any data point that lies beyond the whiskers of a box plot is considered an outlier.

Outliers can affect the mean, standard deviation, and range of the data, but some outliers can be ignored without noticeably affecting the sample statistic. So, one must consider carefully whether to include outliers in calculations or trim them away.
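To see how strongly a single outlier can pull these statistics, the following sketch compares them with and without the unusual value. The data and the `summarize` helper are hypothetical and serve only to illustrate the point.

```python
import statistics

def summarize(values):
    """Return the mean, sample standard deviation, and range of the values."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }

# Hypothetical lap times in seconds; 75.4 is the suspected outlier.
lap_times = [61.2, 60.8, 62.1, 61.5, 60.9, 61.8, 75.4, 61.0]
trimmed = [v for v in lap_times if v != 75.4]

for label, data in (("with outlier", lap_times), ("trimmed", trimmed)):
    stats = summarize(data)
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

With this data, removing one lap lowers the mean by almost two seconds and shrinks both the standard deviation and the range considerably, which is why the decision to keep or trim an outlier deserves care.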

In regression, outliers are observed data points that lie far from the least-squares line. They have unusual values and need to be examined carefully. An outlier may result from erroneous data, but it may also hold valuable information about the population under study, in which case it should be kept in the data. Hence, it is crucial to examine what causes a data point to be an outlier.

The z score can be used to find outliers or unusual values. Values with z scores below -2 or above +2 are considered unusual values or outliers because they lie far from the other data values.

Identifying Outliers

We could guess at outliers by looking at a scatter plot with its best-fit line. However, we need a guideline for how far from the line a point must be to count as an outlier. As a rough rule of thumb, we can flag any point that lies more than two standard deviations above or below the best-fit line as an outlier. The standard deviation used is the standard deviation of the residuals, or errors.

We can do this visually in the scatter plot by drawing an extra pair of lines two standard deviations above and below the best-fit line. Any data points outside this extra pair of lines are flagged as potential outliers. Alternatively, we can identify outliers numerically by calculating each residual and comparing its absolute value to twice the standard deviation of the residuals.
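The numerical check can be sketched as follows. This is a rough illustration under stated assumptions: the data points and the function names `fit_line` and `flag_outliers` are hypothetical, and the standard deviation of the residuals is taken to be s = sqrt(SSE / (n - 2)), where SSE is the sum of squared residuals and n is the number of data points.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def flag_outliers(xs, ys, k=2.0):
    """Return the points whose residual exceeds k residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    # Residual = observed y minus the y predicted by the line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Assumed definition: s = sqrt(SSE / (n - 2)).
    sse = sum(r ** 2 for r in residuals)
    s = math.sqrt(sse / (len(xs) - 2))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * s]

# Hypothetical data: y is roughly 2x, except for the point at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 20.0, 14.1, 16.0, 17.9, 20.2]
print(flag_outliers(xs, ys))
```

A flagged point is only a potential outlier; as noted earlier, it still needs to be examined before deciding whether it reflects an error or genuine information.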

This text is adapted from OpenStax, Introductory Statistics, Section 12.5 Outliers.