11.8:

Variation

JoVE Core
Statistics

In an uncorrelated data set, for any given value of x, the best predicted value of y is the sample mean of the y-values, y-bar.

If the variables have a linear correlation, a y-value can be predicted by substituting the x-value in the regression equation.

The vertical distance between the predicted y-value and the sample mean, y-bar, is known as the explained deviation. The relationship between the two variables can explain this deviation.

The vertical distance between the data point and the predicted y-value is known as the unexplained deviation or the residual. The relationship between the variables cannot explain this deviation; it may be due to chance alone or the involvement of other variables.

The sum of the unexplained and explained deviations gives the total deviation.
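
Written for a single data point, the statement above can be sketched as follows. The symbols are assumptions introduced here for illustration: y is the observed value, \hat{y} is the value predicted by the regression equation, and \bar{y} is the sample mean (y-bar).

```latex
\underbrace{(y - \bar{y})}_{\text{total deviation}}
  = \underbrace{(\hat{y} - \bar{y})}_{\text{explained deviation}}
  + \underbrace{(y - \hat{y})}_{\text{unexplained deviation}}
```

For instance, with hypothetical values y = 5, \hat{y} = 4, and \bar{y} = 3, the total deviation is 2, which splits into an explained deviation of 1 and an unexplained deviation of 1.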

Squaring the deviations and summing them for all data points yields the amount of unexplained, explained, and total variation.

The ratio of the explained variation to the total variation is the r-square value, also known as the coefficient of determination. It indicates the proportion of the variation in the y-value that the regression line can explain.
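
In symbols, the three variations and their ratio might be written as below. This is a sketch that assumes the notation y for an observed value, \hat{y} for the predicted value, and \bar{y} for the sample mean, with each sum taken over all data points.

```latex
\text{explained variation}   = \sum (\hat{y} - \bar{y})^2 \\
\text{unexplained variation} = \sum (y - \hat{y})^2 \\
\text{total variation}       = \sum (y - \bar{y})^2 \\[4pt]
r^2 = \frac{\text{explained variation}}{\text{total variation}}
    = \frac{\sum (\hat{y} - \bar{y})^2}{\sum (y - \bar{y})^2}
```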

An important characteristic of any data set is its variation. In some data sets, the values are concentrated near the mean; in others, the values are spread more widely around the mean. The most common measure of variation, or spread, is the standard deviation, which is the square root of the variance.
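
As a small illustration of spread, the sketch below compares two samples that share the same mean but differ in how far the values lie from it. The data are hypothetical, and the sample (n − 1) forms of the variance and standard deviation are assumed, as provided by Python's standard `statistics` module.

```python
import statistics

# Two hypothetical samples with the same mean but different spread.
tight = [9, 10, 10, 10, 11]
wide = [2, 6, 10, 14, 18]

for name, data in (("tight", tight), ("wide", wide)):
    mean = statistics.mean(data)
    variance = statistics.variance(data)  # sample variance (divides by n - 1)
    std_dev = statistics.stdev(data)      # square root of the sample variance
    print(f"{name}: mean={mean}, variance={variance}, std_dev={std_dev:.3f}")
```

Both samples have a mean of 10, but the second has a much larger standard deviation because its values lie farther from the mean.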

When the independent and dependent variables are plotted on a scatter plot, the slope of a line through the points describes the rate of change between the two variables. The slope tells us how much the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). The y-intercept is the value of the dependent variable when the independent variable equals zero. A regression line, or line of best fit, can be drawn on a scatter plot and used to predict the value of y for a given value of x in the sample data.
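
A minimal sketch of how the slope and y-intercept might be computed, assuming the usual least-squares criterion for the line of best fit. The function name `fit_line` and the data are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = intercept + slope * 6  # predicted y when x = 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

For this data the slope is about 0.6, so y rises by roughly 0.6 on average for each one-unit increase in x, and the intercept of about 2.2 is the predicted y at x = 0.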

The difference between the observed sample value, y, and the predicted value, ŷ, from the regression equation is known as the unexplained deviation, or residual. The difference between the predicted value, ŷ, and the sample mean, y̅, is called the explained deviation. The difference between the observed value, y, and the sample mean, y̅, is the total deviation.

If we add the squares of the explained deviations for all data points, we get the explained variation. In the same way, adding the squares of the unexplained deviations gives the unexplained variation, and adding the squares of the total deviations gives the total variation. Dividing the explained variation by the total variation gives the coefficient of determination, r², which represents the proportion of the variation in the dependent variable y that can be explained by variation in the independent variable x using the regression line.
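
The calculation described above can be sketched in a few lines of Python. The function name `variation_summary` and the data are hypothetical, and the regression line is assumed to be the least-squares line.

```python
def variation_summary(xs, ys):
    """Return explained, unexplained, and total variation plus r-squared."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Least-squares regression line: y_hat = intercept + slope * x.
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * x for x in xs]

    # Sums of squared deviations over all data points.
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    total = sum((y - y_bar) ** 2 for y in ys)

    return {
        "explained": explained,
        "unexplained": unexplained,
        "total": total,
        "r_squared": explained / total,
    }


# Hypothetical sample data.
result = variation_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

For this data the explained variation is about 3.6 and the total variation is 6, so r² is about 0.6: roughly 60% of the variation in y is accounted for by the regression line. With a least-squares line, the explained and unexplained variations also add up to the total variation.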

This text is adapted from OpenStax, Introductory Statistics, Section 12, Linear Regression and Correlation.