An important characteristic of any set of data is the variation in the data. In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation, which is the square root of variance.
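As a minimal sketch of this idea, the Python snippet below computes the standard deviation as the square root of the variance for two small made-up data sets that share the same mean. The function name `sample_std` and the data values are illustrative assumptions, and dividing by n - 1 (the sample form) is also an assumption, since the text does not say whether the data are a sample or a population.

```python
import math

def sample_std(values):
    """Return the sample standard deviation (square root of the sample variance)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sample variance, so the sum of squared deviations is divided by n - 1.
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance)

# Two hypothetical data sets, both with a mean of 10.
tight = [9, 10, 10, 11]    # values concentrated near the mean
spread = [2, 6, 14, 18]    # values spread widely from the mean

print("tight: ", sample_std(tight))
print("spread:", sample_std(spread))
```

The second data set should report a much larger standard deviation than the first, even though the means are equal.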
When independent and dependent variables are plotted on a scatter plot, the slope of a line fitted to the points describes the rate of change between the two variables. The slope tells us how much the dependent variable (y) changes, on average, for every one-unit increase in the independent variable (x). The y-intercept is the predicted value of the dependent variable when the independent variable equals zero. A regression line, or line of best fit, can be drawn on a scatter plot and used to predict values of y from values of x for a given data set or sample data.
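The slope and y-intercept of a line of best fit can be sketched in a few lines of Python. This example assumes the usual least-squares formulas (slope equals the sum of the products of the x and y deviations divided by the sum of the squared x deviations); the text does not name a fitting method, so that choice, the function name `fit_line`, and the data are all illustrative assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
print("slope:", slope)          # average change in y per one-unit increase in x
print("intercept:", intercept)  # predicted y when x equals zero

# Use the line to predict y for a chosen x.
x_new = 6
print("predicted y at x = 6:", intercept + slope * x_new)
```

For these data the slope comes out to about 0.6 and the intercept to about 2.2, so each one-unit increase in x is associated with an average increase of roughly 0.6 in y.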
The difference between the observed sample value, y, and the predicted value from the regression equation, ŷ, is known as the unexplained deviation. The difference between the predicted value, ŷ, and the sample mean, y̅, is called the explained deviation. The difference between the observed value, y, and the sample mean, y̅, is the total deviation.
If we add the squares of the explained deviations for all data points, we get the explained variation. In the same way, if we add the squares of the unexplained deviations for all data points, we get the unexplained variation, and if we add the squares of the total deviations for all data points, we get the total variation. Dividing the explained variation by the total variation gives the coefficient of determination, r², which represents the proportion of the variation in the dependent variable y that can be explained by variation in the independent variable x using the regression line. It is often expressed as a percentage.
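The three kinds of deviation and the resulting coefficient of determination can be illustrated with a short Python sketch. It assumes a least-squares regression line and uses made-up data; the variable names are hypothetical and chosen only to mirror the terms in the text.

```python
# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Assumption: the regression line is fitted by ordinary least squares.
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean
predicted = [intercept + slope * x for x in xs]

# Explained deviation: predicted value minus the sample mean.
explained = sum((p - y_mean) ** 2 for p in predicted)
# Unexplained deviation: observed value minus the predicted value.
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
# Total deviation: observed value minus the sample mean.
total = sum((y - y_mean) ** 2 for y in ys)

# Coefficient of determination: explained variation over total variation.
r_squared = explained / total

print("explained variation:  ", explained)
print("unexplained variation:", unexplained)
print("total variation:      ", total)
print("r squared:            ", r_squared)
```

With these data the explained and unexplained variations are about 3.6 and 2.4, which add up to the total variation of 6, so r² is about 0.6: roughly 60% of the variation in y is explained by the regression line. The additive relationship holds for a least-squares line and is not guaranteed for an arbitrary line.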
This text is adapted from OpenStax, Introductory Statistics, Section 12, Linear Regression and Correlation.