Simple Linear Regression
The quantity r, called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. A linear relationship is a statistical term for a directly proportional relationship between two variables. The relationship between x and y is called linear because, on a graph with an X and a Y axis, the plotted points lie along a straight line; for positive values of both variables, that line appears in the top right quadrant. If every point of the scatter plot fell exactly on the line, the line would explain all of the variation in the dependent variable. The further the points lie from the line, the less of the variation it explains. As an example, the independent variable might be the size of a house.
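As a rough illustration of how r captures both strength and direction, the sketch below computes the sample correlation coefficient from paired observations. The function name `pearson_r` and the data values are assumptions made up for this example; they do not come from any dataset discussed here.

```python
import math


def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases exactly in step with x
falling = [10, 8, 6, 4, 2]   # y decreases exactly in step with x
noisy = [2, 4, 5, 4, 5]      # y tends to increase with x, with scatter

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_noisy = pearson_r(x, noisy)
print(r_rising, r_falling, r_noisy)
```

The sign of r gives the direction of the association, and its distance from 0 gives the strength: the two exact lines sit at the extremes, while the scattered data falls in between.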
Correlation and Linear Regression
This analysis assumes that there is a linear association between the two variables. If a different relationship is hypothesized, such as a curvilinear or exponential relationship, alternative regression analyses are performed.
The figure below is a scatter diagram illustrating the relationship between BMI and total cholesterol. Each point represents the observed x, y pair, in this case, BMI and the corresponding total cholesterol measured in each participant.
Note that the independent variable (BMI) is on the horizontal axis and the dependent variable (Total Serum Cholesterol) on the vertical axis.
Figure: BMI and Total Cholesterol
The graph shows that there is a positive or direct association between BMI and total cholesterol; participants with lower BMI are more likely to have lower total cholesterol levels and participants with higher BMI are more likely to have higher total cholesterol levels.
For a relationship such as this one, we could use simple linear regression analysis to estimate the equation of the line that best describes the association between the independent variable and the dependent variable. The simple linear regression equation is as follows:
$\hat{Y} = b_0 + b_1 X$
where $\hat{Y}$ is the predicted value of the outcome, $X$ is the independent variable, $b_0$ is the estimated Y-intercept, and $b_1$ is the estimated slope. The Y-intercept and slope are estimated from the sample data, and they are the values that minimize the sum of the squared differences between the observed and the predicted values of the outcome, i.e., they minimize $\sum (Y - \hat{Y})^2$.
These differences between observed and predicted values of the outcome are called residuals.
The estimates of the Y-intercept and slope minimize the sum of the squared residuals, and are called the least squares estimates. Conceptually, if the values of X provided a perfect prediction of Y, every residual, and therefore the sum of the squared residuals, would be 0. That would mean that variability in Y could be completely explained by differences in X. However, if the differences between observed and predicted values are not 0, then we are unable to entirely account for differences in Y based on X, and there are residual errors in the prediction.
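The least squares estimates described above have a well-known closed form: the slope is the sum of cross-products of the deviations of X and Y from their means, divided by the sum of squared deviations of X, and the intercept is the mean of Y minus the slope times the mean of X. The following is a minimal sketch of that calculation. The names `least_squares_fit`, `residuals`, `b0`, and `b1`, and the data values, are assumptions chosen for illustration.

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1): the intercept and slope minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                # estimated slope
    b0 = mean_y - b1 * mean_x     # estimated Y-intercept
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed minus predicted value of the outcome for each point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    return sum(v ** 2 for v in values)


# Hypothetical (x, y) observations, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_fit(xs, ys)
res = residuals(xs, ys, b0, b1)
sse = sum_of_squares(res)

# Any other line should give a larger sum of squared residuals
sse_other = sum_of_squares(residuals(xs, ys, b0 + 0.1, b1 - 0.05))

print("intercept:", b0, "slope:", b1)
print("sum of residuals:", sum(res))
print("sum of squared residuals:", sse, "vs. a nearby line:", sse_other)
```

In this sketch the plain sum of the residuals comes out at essentially 0 because positive and negative residuals offset one another, which is why the squared residuals, not the raw ones, are what the fit minimizes.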
The residual error could result from inaccurate measurements of X or Y, or there could be other variables besides X that affect the value of Y. The least squares method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0).
Because the deviations are first squared, then summed, there are no cancellations between positive and negative values.
Example
The dataset "Televisions, Physicians, and Life Expectancy" contains, among other variables, the number of people per television set and the number of people per physician for 40 countries. Since both variables probably reflect the level of wealth in each country, it is reasonable to assume that there is some positive association between them.
After removing 8 countries with missing values from the dataset, the remaining 32 countries have a correlation coefficient of 0. Suppose we choose to consider number of people per television set as the explanatory variable, and number of people per physician as the dependent variable. The regression equation is People. To view the fit of the model to the observed data, one may plot the computed regression line over the actual data points to evaluate the results.
For this example, the plot appears to the right, with number of individuals per television set (the explanatory variable) on the x-axis and number of individuals per physician (the dependent variable) on the y-axis. While most of the data points are clustered towards the lower left corner of the plot (indicating relatively few individuals per television set and per physician), there are a few points which lie far away from the main cluster of the data.
These points are known as outliers, and depending on their location may have a major impact on the regression line (see below).
Outliers and Influential Observations
After a regression line has been computed for a group of data, a point which lies far from the line (and thus has a large residual value) is known as an outlier.
Such points may represent erroneous data, or may indicate a poorly fitting regression line. If a point lies far from the other data in the horizontal direction, it is known as an influential observation.
The reason for this distinction is that these points may have a significant impact on the slope of the regression line.
Notice, in the above example, the effect of removing the observation in the upper right corner of the plot: With this influential observation removed, the regression equation is now People. The correlation between the two variables has dropped to 0.
Influential observations are also visible in the new model, and their impact should also be investigated.
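To make the effect of an influential observation concrete, the sketch below fits a least squares line twice: once to a small cluster plus one point lying far away in the horizontal direction, and once to the cluster alone. The data are hypothetical values invented for this illustration, not the television and physician figures, and `fit_line` is an assumed helper name.

```python
def fit_line(points):
    """Least squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: a tight cluster in the lower left of the plot...
cluster = [(1, 2.0), (2, 1.5), (3, 2.5), (4, 2.0), (5, 2.5)]
# ...and one observation far away in the horizontal direction
far_point = (20, 12.0)

intercept_all, slope_all = fit_line(cluster + [far_point])
intercept_cluster, slope_cluster = fit_line(cluster)

print("with the far point:    slope =", round(slope_all, 3))
print("without the far point: slope =", round(slope_cluster, 3))
```

In this made-up data the single distant point pulls the fitted slope well above the slope of the cluster alone, which is the kind of impact that makes such observations worth checking before relying on a fitted line.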