Regression is a versatile statistical technique. Correlation analysis determines whether two variables are related and gives a measure of the strength and direction of that relationship.

Once a correlation is found to be significant, the natural next step is regression. Regression analysis provides methods to describe the relationship and to use it in forecasting. It measures and uses the predictive power of one or more independent variables in predicting the values of the dependent variable.

Regression analysis broadly involves three steps.
  1. Finding the regression equation describing the relationship between the predictor and response variables.
  2. Testing the goodness of fit of the regression equation.
  3. Understanding the trend and making predictions and forecasts using the regression equation.

The predictor variable is commonly known as the independent variable and the response variable is called the dependent variable. Often, changes in more than one predictor variable cause the change in the response variable.

Regression analysis can be classified as Simple Regression and Multiple Regression.

Simple Regression:
In simple regression, the relationship between the dependent variable and one independent variable is found. Simple regression can be further divided into two types: linear and nonlinear.
In simple linear regression, the relationship is estimated as a linear function. There are several types of nonlinear regression, such as exponential and polynomial models. The choice of model is guided by the scatter plot and by the strength and significance of the linear correlation.

Multiple Regression:
In multiple or multivariate regression, two or more predictor variables are involved in finding a fit for predicting the dependent variable. Because of the complexity involved in this type of analysis, the regression equation is generally based on a linear model.

Simple linear regression analysis is based on the linear relation between the response variable Y and a single predictor variable X. The model investigated is Y = α + βX + ε, where ε is the error term, which accounts for factors other than X that contribute to the value of Y. A linear regression line is used when the linear correlation coefficient calculated for the sample data is high enough and its significance is accepted by a hypothesis test. The first step in the analysis is to find the equation of the line of best fit. There are many methods for finding such a line, but the least squares regression line is the one generally accepted as a reliable tool for prediction and forecasting.

The least squares regression line is obtained by minimizing the sum of the squared vertical deviations of the data points from the fitted line.
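Written out for n data points $(x_i, y_i)$, the quantity being minimized and the two conditions that follow from setting its partial derivatives to zero can be sketched as follows. The label SSE is introduced here for illustration and is not used elsewhere in this article.

$SSE(a, b) = \sum_{i=1}^{n} \left(y_i - (a + b x_i)\right)^{2}$

$\frac{\partial SSE}{\partial a} = 0 \;\Rightarrow\; \sum y = n a + b \sum x$

$\frac{\partial SSE}{\partial b} = 0 \;\Rightarrow\; \sum xy = a \sum x + b \sum x^{2}$

Solving these two linear equations for a and b gives closed-form expressions for the intercept and the slope in terms of the sums $\sum x$, $\sum y$, $\sum xy$ and $\sum x^{2}$.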

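In statistics the line of best fit is written in the form Y' = a + bX, which is equivalent to the familiar linear equation Y = mX + b. Note the change of letters: in Y' = a + bX the slope is b (playing the role of m) and the intercept is a.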

The values of a (the Y' intercept) and b (the slope) are found using formulas similar to the one used for finding the correlation coefficient r of a sample.

a = $\frac{(\sum y)(\sum x^{2})-(\sum x)(\sum xy)}{n(\sum x^{2})-(\sum x)^{2}}$

b = $\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^{2})-(\sum x)^{2}}$
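As a minimal sketch, the two formulas above translate directly into code. The function name `fit_line_from_sums` and the three-point data set are illustrative assumptions, chosen so that the points lie exactly on y = 1 + 2x.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for the least squares line y' = a + b*x."""
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        # Happens when every x value is the same: no unique slope exists.
        raise ValueError("all x values are identical; the slope is undefined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
    b = (n * sum_xy - sum_x * sum_y) / denom       # slope
    return a, b


# Made-up data lying exactly on y = 1 + 2x
xs = [1, 2, 3]
ys = [3, 5, 7]
a, b = fit_line_from_sums(
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * y for x, y in zip(xs, ys)),
    sum(x * x for x in xs),
)
print(a, b)  # 1.0 2.0
```

Both formulas share the same denominator, so it is computed once; it is zero only when all x values coincide, in which case no line of the form y' = a + bx is determined.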

The slope b can also be defined as b = $\frac{S_{xy}}{S_{x}^{2}}$ where $S_{xy}$ is the sample covariance between x and y and $S_{x}^{2}$ is the sample variance of x.
The Y intercept a can also be defined as a = $\bar{y} - b\bar{x}$ where $\bar{x}$ and $\bar{y}$ are the sample means of the x values and the y values respectively.
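The covariance and mean forms give the same line as the sum formulas. The sketch below computes the slope as sample covariance over sample variance and the intercept from the means; the function name `fit_line_from_means` and the four data points are made up for illustration.

```python
from statistics import mean


def fit_line_from_means(xs, ys):
    """Slope = sample covariance / sample variance; intercept from the means."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance and sample variance, both with the n - 1 divisor
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b = s_xy / s_x2
    a = y_bar - b * x_bar
    return a, b


# Illustrative data, not taken from the article
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]
a, b = fit_line_from_means(xs, ys)
print(round(a, 4), round(b, 4))  # -0.5 2.2
```

The n - 1 divisors cancel in the ratio, so the slope is the same whether sample or population versions of the covariance and variance are used, as long as both use the same divisor.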
Here the estimate of the response variable is denoted by Y' to distinguish it from the actual value Y. The regression line is used for estimation purposes after testing the significance of its slope and intercept.
In multivariate regression analysis, the regression model relates one response variable to two or more predictor variables. The general model considered is $Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \dots + \beta_{p}X_{p} + \varepsilon$. The task is to estimate the weights $\beta_{i}$ and the constant $\beta_{0}$; $\varepsilon$ is the residual (error) term.
The regression equation used for the purpose is of the form
$Y' = b_{0} + b_{1}X_{1} + b_{2}X_{2} + \dots + b_{p}X_{p}$.
Often a multivariate regression situation can be reduced to a simple regression model with one predictor variable, by treating the variation caused by the other predictor variables as part of the residual.
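The article does not say how the weights are computed. One common approach, shown here only as a sketch, is to solve the normal equations $(X^{T}X)b = X^{T}y$. The function name `fit_multiple` and the data set (generated exactly from y = 1 + 2x1 + 3x2) are assumptions for illustration, and only the standard library is used.

```python
def fit_multiple(rows, ys):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    # Design matrix with a leading 1 in each row for the constant term b0
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(X, ys))]
         for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution
    coeffs = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (A[i][k] - tail) / A[i][i]
    return coeffs


# Made-up data generated exactly by y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
b0, b1, b2 = fit_multiple(rows, ys)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Because the data were generated without noise, the recovered coefficients should match 1, 2 and 3 up to floating-point rounding. In practice a numerical library would normally be used instead of hand-written elimination.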

Solved Example

Question: The selling price of an item and the sales volume, in thousands of items, are given in the table below.

| Selling price (dollars) | 60 | 80 | 100 | 120 | 140 | 160 | 180 | 200 | 220 | 240 |
|---|---|---|---|---|---|---|---|---|---|---|
| Sales (thousands of items) | 400 | 350 | 300 | 275 | 250 | 210 | 190 | 150 | 100 | 50 |

a) Find the equation of the line of best fit, using least squares regression.

b) Also estimate the sales volume when the selling price is 175 dollars.

Solution:
 
We need to find the line of best fit of the model y' = a + bx. That is, the task is to find the values of the intercept 'a' and the slope 'b'.
Let us rewrite the table to find the sums that are substituted into the formulas for the slope 'b' and the Y' intercept 'a'.

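| Price x (dollars) | Sales y (thousands) | xy | $x^{2}$ | $y^{2}$ |
|---|---|---|---|---|
| 60 | 400 | 24,000 | 3,600 | 160,000 |
| 80 | 350 | 28,000 | 6,400 | 122,500 |
| 100 | 300 | 30,000 | 10,000 | 90,000 |
| 120 | 275 | 33,000 | 14,400 | 75,625 |
| 140 | 250 | 35,000 | 19,600 | 62,500 |
| 160 | 210 | 33,600 | 25,600 | 44,100 |
| 180 | 190 | 34,200 | 32,400 | 36,100 |
| 200 | 150 | 30,000 | 40,000 | 22,500 |
| 220 | 100 | 22,000 | 48,400 | 10,000 |
| 240 | 50 | 12,000 | 57,600 | 2,500 |
| $\sum x$ = 1,500 | $\sum y$ = 2,275 | $\sum xy$ = 281,800 | $\sum x^{2}$ = 258,000 | $\sum y^{2}$ = 625,825 |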
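The $\sum y^{2}$ total is not needed for the slope or the intercept. It enters the sample correlation coefficient r, which is used to judge whether a linear fit is appropriate. The sketch below assumes the standard Pearson formula for r, which this article alludes to but does not state.

```python
from math import sqrt

# Totals taken from the table above
n = 10
sum_x, sum_y = 1500, 2275
sum_xy, sum_x2, sum_y2 = 281_800, 258_000, 625_825

# Pearson sample correlation coefficient (assumed formula, see lead-in)
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

The result is very close to -1 (about -0.995), which indicates a strong negative linear relationship between price and sales and supports fitting a straight line.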

We now have the sums to be substituted in the formulas, with n = 10 data points.

a = $\frac{(\sum y)(\sum x^{2})-(\sum x)(\sum xy)}{n(\sum x^{2})-(\sum x)^{2}}$

  = $\frac{(2275)(258000)-(1500)(281800)}{10(258000)-(1500)^{2}}$  = 497.73

b = $\frac{n(\sum xy)-(\sum x)(\sum y)}{n(\sum x^{2})-(\sum x)^{2}}$

   = $\frac{10(281800)-(1500)(2275)}{10(258000)-(1500)^{2}}$   = -1.80

Hence the required regression line is y' = 497.73 - 1.80x.
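The hand calculation can be checked with a short script that rebuilds the sums from the raw data and applies the same two formulas. The variable names are illustrative.

```python
# Data from the question: price in dollars, sales in thousands of items
prices = [60, 80, 100, 120, 140, 160, 180, 200, 220, 240]
sales = [400, 350, 300, 275, 250, 210, 190, 150, 100, 50]

n = len(prices)
sum_x = sum(prices)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(prices, sales))
sum_x2 = sum(x * x for x in prices)

denom = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom  # intercept
b = (n * sum_xy - sum_x * sum_y) / denom       # slope
print(f"y' = {a:.2f} + ({b:.2f})x")  # y' = 497.73 + (-1.80)x
```

The negative slope means that, within the observed price range, each additional dollar in price is associated with a drop of about 1.8 thousand items sold.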

b) To estimate the sales when the price is x = 175 dollars, substitute x = 175 in the regression equation and calculate y'.
    y' = 497.73 - 1.80(175) = 182.73 
    Since sales are measured in thousands, about 182,730 items are expected to be sold at a price of 175 dollars.
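The estimate above uses coefficients rounded to two decimals. As a hedged illustration of how much that rounding matters, the sketch below compares it with the estimate obtained from the unrounded fractions; the helper `predict` is a hypothetical name.

```python
# Unrounded coefficients from the worked example
a = 164_250_000 / 330_000   # intercept, about 497.73
b = -594_500 / 330_000      # slope, about -1.80


def predict(price):
    """Estimated sales (in thousands of items) at the given price in dollars."""
    return a + b * price


exact = predict(175)
rounded = 497.73 - 1.80 * 175
print(f"unrounded: {exact:.2f}, rounded: {rounded:.2f}")
# unrounded: 182.46, rounded: 182.73
```

The two estimates differ by roughly 0.27 thousand items, so it is safer to carry extra decimal places in a and b and round only the final answer. The price of 175 dollars lies inside the observed range of 60 to 240 dollars, so this is interpolation rather than extrapolation.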