Regression is a statistical technique that estimates the strength of the relationship between a dependent variable and one or more other changing (independent) variables. Simple regression examines the relationship between one independent variable and one dependent variable. Regression is used to predict the dependent variable when the independent variable is known.
A simple regression equation is of the form Y = a + bx + c, where,
Y: Variable you wish to predict. (Dependent variable)
x: Variable you're using to predict 'Y'. (Independent variable)
a: Y-intercept of the line.
b: Slope.
c: Regression residual, the difference between the observed value of Y and the value predicted by the line.
Regressions are of several types, namely linear regression, logistic regression, stepwise regression, robust regression, nonlinear regression, nonparametric regression and multiple regression. Regression is used on an intuitive level every day. In business, for example, a well-dressed man is often assumed to be financially successful. Quantitative regression adds precision by developing a mathematical formula that can be used for prediction. The purpose of regression is to find a formula that fits the relationship between the variables.

## Regression Line

A regression line is a plot of the expected value of the dependent variable for all values of the independent variable. It is the line that minimizes the sum of the squared residuals, so it is the best fit to the data on a scatterplot. It is most useful when there is plenty of data on two variables that have a reasonable degree of connection between them.

A regression line can be of two types:
• Y on X - Estimate Y values from known X values.
• X on Y - Estimate X values from known Y values.
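
The two kinds of line generally differ, because each minimizes squared residuals in a different direction. As a minimal sketch, assuming made-up sample data and a hypothetical helper `fit_line`, both lines can be fitted by least squares:

```python
def fit_line(u, v):
    """Least squares line v = a + b*u; returns (a, b)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    b = s_uv / s_uu
    a = mean_v - b * mean_u
    return a, b

# Illustrative data only (not taken from any real study).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_yx, b_yx = fit_line(x, y)   # Y on X: estimate Y from known X
a_xy, b_xy = fit_line(y, x)   # X on Y: estimate X from known Y

print(f"Y on X: Y = {a_yx:.3f} + {b_yx:.3f} X")
print(f"X on Y: X = {a_xy:.3f} + {b_xy:.3f} Y")
```

The product of the two slopes equals the squared correlation between X and Y, so the two lines coincide only when the correlation is perfect.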

## Regression Coefficient

The regression coefficient is the slope ('b') of the regression line and is obtained by least squares fitting, in which the sum of the squared differences between the observed values of the dependent variable and their estimates is minimized. If the regression is linear, the regression coefficient is the constant that represents the rate of change of one variable ('Y') as a function of changes in the other ('X').
The equation of the regression line is
Y = a + $b_{1}X_{1}$
where,
Y: Value of the dependent variable
a: Intercept
$b_{1}$: Slope for $X_{1}$
$X_{1}$: First independent variable explaining the variance in Y.

The formula for regression coefficient is

$b_{1}$ = r $\frac{s_{y}}{s_{x}}$

$b_{1}$: Regression coefficient
r: Correlation between the x variable and the y variable.
$s_{y}$: Standard deviation of the variable 'y'.
$s_{x}$: Standard deviation of the variable 'x'.
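
As a rough illustration of the formula, the sketch below computes r, $s_{x}$ and $s_{y}$ from a small made-up data set and then forms $b_{1}$. The function name `regression_coefficient` and the data are assumptions for the example; the intercept is taken as the value that makes the line pass through the two means.

```python
import math

def regression_coefficient(x, y):
    """Slope b1 = r * (s_y / s_x), using sample standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    r = cov_xy / (s_x * s_y)      # correlation between x and y
    b1 = r * s_y / s_x            # regression coefficient
    a = mean_y - b1 * mean_x      # intercept: the line passes through the means
    return a, b1, r

# Illustrative data only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b1, r = regression_coefficient(x, y)
print(f"r = {r:.4f}, b1 = {b1:.4f}, a = {a:.4f}")
```

For this data the sketch gives r ≈ 0.7746, $b_{1}$ = 0.6 and a = 2.2.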

## Regression Assumptions

Given below are the assumptions of regression:
• The relationship between the dependent and independent variables is linear.
• The dependent variable is continuous on the real line.
• Error terms are independent and identically distributed, following a normal distribution.
• Error terms are uncorrelated, and the error is assumed to be a random variable with a mean of zero conditional on the explanatory variables.
• Independent variables are measured without error. Otherwise, the modeling should be done using an errors-in-variables model.
• The sample is selected at random from the population of interest.
• The variation around the regression line is constant for all values of X.

## Multiple Regression

Multiple regression is a statistical technique that predicts the value of one variable based on two or more other variables. It considers multiple factors simultaneously to assess how, and to what extent, they affect a certain outcome. It is unreliable when there is a high chance that the outcome is affected by unmeasurable factors.

A multiple regression equation with three independent variables is of the form

Y = a + $b_{1}X_{1}$ + $b_{2}X_{2}$ +  $b_{3}X_{3}$ + $\epsilon$

Y: Value of the dependent variable
a: Intercept
$b_{1}$: Slope for $X_{1}$
$X_{1}$: First independent variable explaining the variance in Y.
$b_{2}$: Slope for $X_{2}$
$X_{2}$: Second independent variable explaining the variance in Y.
$b_{3}$: Slope for $X_{3}$
$X_{3}$: Third independent variable explaining the variance in Y.
$\epsilon$: Error term

For example, suppose we want to predict how much individuals enjoy their jobs. Variables such as type of occupation, salary, age, sex, number of years in full-time employment and academic qualification may all contribute to job satisfaction. When data on these variables is collected through a survey, we might find that job satisfaction is most accurately predicted by type of occupation, salary and years in full-time employment, while the other variables do not help to predict it.
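
A multiple regression equation of the form above can be fitted by least squares. The following is a minimal sketch, assuming NumPy is available and using synthetic, noise-free data generated from known coefficients, so that the fitted values can be compared with the ones used to build the data. The variable names are illustrative.

```python
import numpy as np

# Synthetic, noise-free data generated from known coefficients (illustrative only).
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 2.0],
    [5.0, 6.0, 1.0],
    [6.0, 5.0, 3.0],
])
true_a = 1.0
true_b = np.array([2.0, -1.0, 0.5])
Y = true_a + X @ true_b

# Prepend a column of ones so the intercept a is estimated along with the slopes.
design = np.column_stack([np.ones(len(Y)), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b = coef[0], coef[1:]

print("intercept a:", round(float(a), 6))
print("slopes b1, b2, b3:", np.round(b, 6))
```

With real survey data the error term would not be zero, and the fitted coefficients would only approximate the underlying relationship.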

## Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data; the fitted line is known as the line of best fit. The most common method of fitting is least squares. The method of least squares calculates the best-fitting line for the observed data by minimizing the sum of the squares of the deviations from each data point to the line: each deviation is first squared, and the squares are then summed. The resulting straight line makes the sum of the squared residuals of the model as small as possible and depicts a linear trend in the data. The slope of the fitted line is equal to the correlation between x and y multiplied by the ratio of the standard deviation of y to the standard deviation of x.

A linear regression is of the form
Y = a + $b_{1}X_{1}$
where,
Y: Value of the dependent variable
a: Intercept
$b_{1}$: Slope for $X_{1}$
$X_{1}$: First independent variable explaining the variance in Y.
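
The least squares estimates of a and $b_{1}$ can be written in closed form. As a short worked derivation (the observation index i, the sample size n and the sample means $\bar{X}_{1}$ and $\bar{Y}$ are notation introduced here for illustration), the quantity to be minimized is

$S(a, b_{1}) = \sum_{i=1}^{n} (Y_{i} - a - b_{1}X_{1i})^{2}$

Setting both partial derivatives to zero gives

$\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (Y_{i} - a - b_{1}X_{1i}) = 0$

$\frac{\partial S}{\partial b_{1}} = -2 \sum_{i=1}^{n} X_{1i} (Y_{i} - a - b_{1}X_{1i}) = 0$

and solving the two equations yields

$b_{1} = \frac{\sum_{i=1}^{n} (X_{1i} - \bar{X}_{1})(Y_{i} - \bar{Y})}{\sum_{i=1}^{n} (X_{1i} - \bar{X}_{1})^{2}}$

$a = \bar{Y} - b_{1}\bar{X}_{1}$

Dividing the numerator and denominator of $b_{1}$ by n - 1 shows that the slope is the sample covariance divided by the sample variance of $X_{1}$, which is the same as $b_{1}$ = r $\frac{s_{y}}{s_{x}}$.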

## Nonlinear Regression

In a nonlinear regression model, the regression function is not a linear function of the unknown parameters: at least one of the parameters appears nonlinearly. The regression function depends on one or more independent variables. The data is fitted by a method of successive approximations.

A non linear regression model is of the form

$Y_{i}$ = f($x_{i}$, $\theta$) + $\epsilon_{i}$  ; i = 1, 2, 3, ....., n
where,
$Y_{i}$: Responses
f: known function of the covariate vector $x_{i}$ = ($x_{i1}, x_{i2},...., x_{ik}$)
$\theta$: Parameter vector, $\theta$ = ($\theta_{1}$, $\theta_{2}$,...., $\theta_{p}$)
$\epsilon_{i}$: Random errors uncorrelated with mean zero and constant variance.

In some cases, an ordinary least squares approach can still be used to fit a curved relationship by adding new variables to the data. When the new variables are constructed properly, the curved function of the original data can be expressed as a linear function of the new variables.
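
As a minimal sketch of this idea, assuming NumPy and a synthetic quadratic relationship chosen purely for illustration, a new variable z = x² is constructed so that a curve in x becomes a linear model in (x, z):

```python
import numpy as np

# Synthetic, noise-free data following a curve: y = 1 + 2x + 3x^2 (illustrative only).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Construct the new variable z = x^2. The model y = a + b1*x + b2*z is curved
# in x but linear in the variables (x, z), so ordinary least squares applies.
z = x ** 2
design = np.column_stack([np.ones_like(x), x, z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

a, b1, b2 = coef
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the data is noise-free, the fit recovers the generating coefficients up to floating point error. A model whose parameters cannot be separated out in this way would still need an iterative method.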

## Logistic Regression

Logistic regression is used to predict the outcome of a categorical dependent variable based on one or more predictor variables. It is most commonly used when the dependent variable has only two possible outcomes, for example: did the student pass? (The answer is yes or no.) There can be one or several independent variables. A chi-square test is used to indicate how well the logistic regression model fits the data. Logistic regression does not assume a linear relationship between the dependent and independent variables. Instead, the quantity that is modeled is the logit, the natural log of the odds.

log(odds) = logit(p) = ln $\frac{p}{1 - p}$
The logit is the log of the odds, and the odds are a function of p, the probability that the outcome equals 1.

Through logistic regression,
logit(p) = a + bx
log odds is assumed to be linearly related to x.
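
The relationship logit(p) = a + bx can be made concrete with a small sketch. The code below is one possible way to fit a and b, using Newton's method on the log-likelihood. The function names and the hours-studied data are hypothetical and chosen only to echo the pass/fail example.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = a + b*x by Newton's method on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [inverse_logit(a + b * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b).
        g_a = sum(yi - pi for yi, pi in zip(y, p))
        g_b = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the negated Hessian, a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h_aa = sum(w)
        h_ab = sum(wi * xi for wi, xi in zip(w, x))
        h_bb = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * step = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data: hours studied and whether the student passed (1) or not (0).
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
passed = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(hours, passed)
print(f"logit(p) = {a:.3f} + {b:.3f} * hours")
print(f"P(pass | 5 hours) = {inverse_logit(a + b * 5.0):.3f}")
```

Each additional unit of x changes the log odds by b, which multiplies the odds by exp(b).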

## Nonparametric Regression

Nonparametric regression is a form of regression analysis in which the predictor does not take a predetermined form but is constructed from the given data.

Nonparametric regression is written in the form

$y_{i}$ = f($x_{i1},x_{i2},....., x_{ik}$) + $\epsilon_{i}$
where,
$y_{i}$: Responses
f: Unspecified smooth, continuous function
($x_{i1},x_{i2},....., x_{ik}$): Vector of predictors
$\epsilon_{i}$: Random errors uncorrelated with mean zero and constant variance.
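
One common way to estimate f without assuming a formula for it is kernel smoothing. The sketch below uses a Nadaraya-Watson estimator with a Gaussian kernel for a single predictor (k = 1). The choice of estimator, the bandwidth and the data are assumptions made for illustration; the text above does not prescribe a particular method.

```python
import math

def kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Nadaraya-Watson estimate of f(x_query) using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x_query - xi) / bandwidth) ** 2) for xi in x_train]
    total = sum(weights)
    # Weighted average of the responses: nearby points count the most.
    return sum(w * yi for w, yi in zip(weights, y_train)) / total

# Illustrative data only: a curved relationship with no formula assumed for f.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [0.1, 0.9, 1.8, 2.2, 1.9, 1.1, 0.2]

for xq in (1.5, 3.0, 4.5):
    estimate = kernel_regression(x_train, y_train, xq, bandwidth=0.8)
    print(xq, round(estimate, 3))
```

A smaller bandwidth follows the data more closely, while a larger one gives a smoother estimate.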

## Robust Regression

Robust regression is a form of regression analysis in which the regression methods are less sensitive to outliers. Robust regression methods are designed not to be unduly affected by violations of the assumptions about the underlying data-generating process. Robust regression is an alternative to least squares regression when the data is affected by outliers, and it is also used for detecting influential observations.

Regression coefficients found using M-estimators are close to the least squares estimates if the errors are normal. M-estimators are more robust than least squares if the error distribution has heavy tails.
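
As a hedged illustration of an M-estimator, the sketch below fits a line with Huber weights using iteratively reweighted least squares. The tuning constant 1.345, the scale estimate based on the median absolute residual, the fixed iteration count and the synthetic data are all assumptions of this example rather than part of the description above.

```python
import math
from statistics import median

def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    s_xy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    b = s_xy / s_xx
    return mean_y - b * mean_x, b

def huber_line(x, y, k=1.345, iterations=50):
    """Huber M-estimate of a line via iteratively reweighted least squares."""
    a, b = weighted_line(x, y, [1.0] * len(x))   # start from ordinary least squares
    for _ in range(iterations):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale from the median absolute residual (guarded against zero).
        scale = max(median(abs(r) for r in resid) / 0.6745, 1e-12)
        # Huber weights: 1 for small residuals, shrinking for large ones.
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
        a, b = weighted_line(x, y, w)
    return a, b

# Synthetic data: y is roughly 1 + 2x, with one gross outlier at the last point.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + 0.3 * math.sin(xi) for xi in x]
y[-1] = 100.0

ols_a, ols_b = weighted_line(x, y, [1.0] * len(x))
rob_a, rob_b = huber_line(x, y)
print(f"least squares slope: {ols_b:.3f}")
print(f"Huber slope:         {rob_b:.3f}")
```

The single outlier pulls the least squares slope well away from 2, while the Huber fit stays much closer to it because the outlier receives a small weight.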

## Stepwise Regression

Stepwise regression is a semi-automated process of building a model by successively adding or removing variables based on t-tests, F-tests or adjusted R-squared. Stepwise regression is useful when the number of explanatory variables in the maximum model is large. There are two main approaches in stepwise regression:
• Forward selection
• Backward elimination

Forward selection: Starts with an empty model with no explanatory variables. Variables are added one by one until the model cannot be improved significantly by adding another variable. Each time a new variable is added to the model, the significance of the variables already present in the model is re-examined. If any of them is no longer significant, the variable with the highest p-value is removed and the model is re-fitted without it before going to the next step. The procedure continues until no more variables can be added or removed.

Backward elimination: Backward elimination is a sequence of tests for the significance of explanatory variables. It starts with the maximum model.

Y = $\beta_{0}$ + $\beta_{1}x_{1}$ + ....... + $\beta_{k}x_{k}$ + $\epsilon$
where,
Y: Response
$\beta_{0}$: Intercept
$\beta_{1}$, ....., $\beta_{k}$: Regression coefficients of the explanatory variables $x_{1}$, ....., $x_{k}$
$\epsilon$: Random error with mean zero and constant variance.

At each step, find the variable with the highest p-value in the test of significance. If that p-value exceeds the chosen level (p > 0.10), eliminate the variable and fit the reduced model. Stop the procedure when no variable can be removed from the model at the 10% significance level.
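
The backward elimination loop can be sketched as follows, assuming NumPy. The helper names `ols_t_stats` and `backward_eliminate` are hypothetical, the data is synthetic, and p-values are computed with a normal approximation to the t distribution so that only the standard library is needed; an exact t-test would give slightly different p-values for small samples.

```python
import math
import numpy as np

def ols_t_stats(design, y):
    """Least squares coefficients and their t statistics."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)                 # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)  # covariance of the estimates
    return beta, beta / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, alpha=0.10):
    """Drop the least significant variable until all p-values are <= alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        _, t = ols_t_stats(design, y)
        # Two-sided p-values (normal approximation); the intercept is skipped.
        p = [math.erfc(abs(ti) / math.sqrt(2.0)) for ti in t[1:]]
        worst = max(range(len(p)), key=lambda j: p[j])
        if p[worst] <= alpha:
            break                # every remaining variable is significant
        keep.pop(worst)          # remove the variable and refit the reduced model
    return [names[j] for j in keep]

# Synthetic data: y depends on x1 and x2 only; x3 is unrelated to y.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

selected = backward_eliminate(X, y, ["x1", "x2", "x3"])
print("variables kept:", selected)
```

Since x3 is pure noise, it is eliminated in most samples, although any single run can retain it by chance at the 10% level.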

## Anova and Regression

Analysis of variance (ANOVA) is an analysis of the variation present in an experiment. It is used to see whether there is any difference between groups on some variable, and it forms a basis for tests of significance. Both parametric and nonparametric versions are available. Regression is the process of determining the relationship between variables. ANOVA and regression do many of the same things, and regression is the more general of the two. Both are applicable only if there is a continuous outcome variable. While ANOVA focuses on categorical independent variables (groups), regression focuses mainly on continuous independent variables. The purpose of ANOVA is to determine whether data from various groups have a common mean. Regression is used for forecasting and prediction.

## Poisson Regression Model

Poisson regression is used to model count data and contingency tables. A Poisson regression model is also known as a log-linear model when it is used to model contingency tables. Counts are non-negative integers, so the Poisson distribution, whose mean is always greater than 0, is an appropriate choice. The Poisson regression model expresses the log of the outcome rate as a linear function of a set of predictors.

Assumptions in Poisson regression:
1. Observations are independent.
2. Changes in the rate from combined effects of risk factors are multiplicative.
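
The multiplicative assumption follows from modeling the log of the rate as a linear function. A minimal sketch, with hypothetical coefficients for two binary risk factors that are not fitted to any real data:

```python
import math

def expected_count(intercept, coefs, xs):
    """Poisson regression mean: log(rate) = intercept + sum(b_j * x_j)."""
    return math.exp(intercept + sum(b * x for b, x in zip(coefs, xs)))

# Hypothetical coefficients (illustrative only).
intercept = math.log(2.0)                 # baseline rate of 2 events per unit
coefs = [math.log(1.5), math.log(3.0)]    # rate ratios of 1.5 and 3.0

baseline = expected_count(intercept, coefs, [0, 0])
only_first = expected_count(intercept, coefs, [1, 0])
only_second = expected_count(intercept, coefs, [0, 1])
both = expected_count(intercept, coefs, [1, 1])

# On the rate scale the effects multiply: 1.5 * 3.0 = 4.5 times the baseline.
print(baseline, only_first, only_second, both)
```

Adding the coefficients on the log scale corresponds to multiplying the rate ratios, so the combined rate is 4.5 times the baseline rather than the sum of the two separate increases.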