Chapter 11: Correlation and Regression

11.5: Outliers and Influential Points

Consider the scatter plot of annual income versus years of schooling, fitted with a regression line.

One person with only a few years of schooling has an exceptionally high income compared to others.

A data point like this, which does not follow the trend and lies far from the regression line in the vertical direction, is called an outlier.

Quantitatively, outliers can be identified using residuals. A residual is the difference between the observed y-value of a data point and the y-value predicted by the regression equation.

Next, the standard deviation of the residuals, s, is calculated from the sum of squared residuals (SSE) and the number of data points n: s = sqrt(SSE / (n - 2)).

As a rule of thumb, data points that lie at least two residual standard deviations above or below the regression line are flagged as potential outliers.
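
The rule of thumb above can be sketched in Python. This is a minimal illustration, not the method shown in the video: the data values and the function names fit_line and flag_potential_outliers are made up for the example, and the residual standard deviation is assumed to be s = sqrt(SSE / (n - 2)), a common textbook definition.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def flag_potential_outliers(xs, ys, threshold=2.0):
    """Return indices of points whose residual is at least `threshold` * s in size."""
    slope, intercept = fit_line(xs, ys)
    # Residual = observed y minus the y predicted by the regression line.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Assumed definition of the residual standard deviation: sqrt(SSE / (n - 2)).
    sse = sum(r ** 2 for r in residuals)
    s = math.sqrt(sse / (len(xs) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) >= threshold * s]


# Hypothetical data: years of schooling and annual income (in thousands).
# The last person has little schooling but a very high income.
schooling = [8, 10, 12, 12, 14, 16, 16, 18, 20, 6]
income = [25, 30, 38, 36, 45, 52, 55, 60, 68, 150]

flagged = flag_potential_outliers(schooling, income)
print("Indices flagged as potential outliers:", flagged)
```

With this made-up data, only the last point (six years of schooling, income of 150) is flagged. A flagged point is a candidate for closer inspection, not automatically an error.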

In addition, data sets may contain influential points. These points lie far from the rest of the data in the horizontal direction. Adding or removing an influential point significantly changes the regression line.

An outlier is an observation that does not fit the rest of the data. It is sometimes called an extreme value. When you graph an outlier, it will appear not to fit the pattern of the graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500), while others may indicate that something unusual is happening. Outliers lie far from the least-squares line in the vertical direction. They have large "errors," where the "error" or residual is the vertical distance from the line to the point.

Outliers need to be examined closely. Sometimes they should be excluded from the data analysis, for example when they result from erroneous data. Other times, an outlier may hold valuable information about the population under study and should remain in the data. The key is to examine carefully what causes a data point to be an outlier.

Besides outliers, a sample may contain one or a few influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a significant effect on the slope of the regression line. To identify an influential point, you can remove it from the data set and see whether the slope of the regression line changes significantly.
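
The remove-and-refit check described above can be sketched as follows. The data set and the helper names least_squares_slope and slopes_with_and_without are hypothetical, chosen only to illustrate the idea. How large a change counts as "significant" is left to the analyst.

```python
def least_squares_slope(xs, ys):
    """Return the slope of the least-squares regression line of ys on xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_with_and_without(xs, ys, index):
    """Return (slope using all points, slope after dropping the point at `index`)."""
    full_slope = least_squares_slope(xs, ys)
    reduced_xs = xs[:index] + xs[index + 1:]
    reduced_ys = ys[:index] + ys[index + 1:]
    reduced_slope = least_squares_slope(reduced_xs, reduced_ys)
    return full_slope, reduced_slope


# Hypothetical data: five points on the line y = 2x, plus one point
# far to the right (x = 30) that does not follow that line.
x_values = [1, 2, 3, 4, 5, 30]
y_values = [2, 4, 6, 8, 10, 20]

full_slope, reduced_slope = slopes_with_and_without(x_values, y_values, index=5)
print(f"Slope with all points: {full_slope:.3f}")
print(f"Slope without the far-right point: {reduced_slope:.3f}")
```

In this made-up example, the slope is about 0.54 with all six points and 2 once the far-right point is removed, so that single point largely determines the fitted line.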

Computers and many calculators can be used to identify outliers from the data. Computer output for regression analysis will often identify both outliers and influential points so that you can examine them.

This text is adapted from OpenStax, Introductory Statistics, Section 12.6 Outliers.
