The Chi Square distribution is a family of curves indexed by the degrees of freedom. The Greek letter $\chi$ (read as chi) is used to denote a Chi Square variable, whose values are found using the formula $\chi^{2}$ = $(n-1)$$\frac{s^{2}}{\sigma ^{2}}$ by taking random samples of size $n$ from a normally distributed population with variance $\sigma^{2}$. Here $s^{2}$ is the sample variance.
The shape of the curve is determined by the value of $n-1$, which is known as the degrees of freedom for the distribution.
The Chi Square distribution has the following properties:

  1. It is a continuous distribution.
  2. It is not symmetrical but skewed to the right.
  3. The shape of the distribution depends on the degrees of freedom df = n - 1, where n is the sample size.
  4. The value of a $\chi^{2}$ random variable is never negative.
  5. There are infinitely many distributions, each being uniquely defined by its degrees of freedom. As the value of $n$ increases, the shape of the distribution becomes more and more symmetrical, meaning the distribution approaches a normal distribution.
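
The properties listed above can be checked empirically. The following is a minimal simulation sketch, not part of the original article: the function name `simulate_chi_square`, the sample size, the population standard deviation and the seed are all illustrative assumptions. It relies on the known fact that the mean of a Chi Square distribution equals its degrees of freedom.

```python
import random
import statistics


def simulate_chi_square(n, sigma, trials, seed=0):
    """Return (n - 1) * s^2 / sigma^2 for `trials` normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # sample variance s^2, with n - 1 in the denominator
        s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
        values.append((n - 1) * s2 / sigma ** 2)
    return values


# Hypothetical settings: samples of size 5, so df = n - 1 = 4.
values = simulate_chi_square(n=5, sigma=2.0, trials=10000)

print(min(values) >= 0)            # the statistic is never negative
print(statistics.mean(values))     # should land close to df = 4
print(statistics.median(values))   # below the mean, reflecting the right skew
```

Repeating the run with a larger `n` should bring the mean and median closer together in relative terms, which is the growing symmetry described in property 5.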

Let us see how Chi Square distributions are used in Hypothesis testing.

The fact that $\chi^{2}$ random variables are related to the sample and population variances ($\chi^{2}$ = $(n-1)$$\frac{s^{2}}{\sigma ^{2}}$) makes it possible to test claims about a single variance using $\chi^{2}$ distributions. The assumptions made for the Chi Square test on a single variance are:
  1. The sample is randomly selected from the population.
  2. The population for the variable under study is normally distributed.
  3. The observations are independent of one another.

For the above test, the test statistic is calculated using the formula $\chi^{2}$ = $(n-1)$$\frac{s^{2}}{\sigma ^{2}}$. The critical values are found from $\chi^{2}$ tables, using the degrees of freedom and the type of test: left-tailed, right-tailed or two-tailed.
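
As an illustration of the single-variance statistic, here is a small sketch. The sample values and the claimed variance are hypothetical, and the helper name `variance_test_statistic` is an assumption made for this example. The critical value 9.48773 is the table value for α = 0.05 with 4 degrees of freedom that this article quotes in its worked example.

```python
import statistics


def variance_test_statistic(sample, sigma2_claimed):
    """Compute (n - 1) * s^2 / sigma^2 for a claimed population variance."""
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance (n - 1 denominator)
    return (n - 1) * s2 / sigma2_claimed


# Hypothetical data: five observations, claimed population variance of 8.
sample = [4, 6, 8, 10, 12]          # mean 8, s^2 = 40 / 4 = 10
stat = variance_test_statistic(sample, sigma2_claimed=8.0)

# Right-tailed test at alpha = 0.05 with df = n - 1 = 4.
critical = 9.48773                  # table value, not computed here
reject_null = stat > critical

print(stat)          # 4 * 10 / 8 = 5.0
print(reject_null)   # False: 5.0 does not exceed the critical value
```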

Chi square test is also used for tests related to frequency distributions or data displayed in a contingency table. The three common types of tests are

  1. Test for goodness of fit
  2. Test for independence
  3. Test for homogeneity of proportions

Test for Goodness of fit:
This test is conducted to test whether a frequency distribution follows a specific pattern. The null hypothesis states that the data follow the specified pattern. The null hypothesis, and hence the goodness of fit of the data, is rejected when the calculated test value is greater than the critical value found from the $\chi^{2}$ table, that is the value $\chi _{\alpha ,n-1}^{2}$, where $n$ here denotes the number of categories. The test statistic is calculated using the formula
$\chi^{2}$ = $\sum $$\frac{(obs-exp)^{2}}{exp}$, where obs and exp represent the observed and expected frequencies.

The assumptions made for Chi Square Goodness of fit test are

  1. The data are obtained from a random sample.
  2. The expected frequency of each category must be 5 or more.
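
A goodness of fit calculation can be sketched as follows. The observed counts are hypothetical, and the null hypothesis assumed here is "no preference among the categories", so the total is split equally. The helper name `chi_square_statistic` is an assumption for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (obs - exp)^2 / exp over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for four categories (total 100).
observed = [30, 20, 25, 25]
total = sum(observed)

# Assumed null hypothesis: all categories equally likely, so 25 each.
expected = [total / len(observed)] * len(observed)

# Assumption 2 of the test: every expected frequency is 5 or more.
assumption_ok = all(e >= 5 for e in expected)

stat = chi_square_statistic(observed, expected)  # (25 + 25 + 0 + 0) / 25
df = len(observed) - 1                           # categories minus one

print(assumption_ok)  # True
print(stat)           # 2.0
print(df)             # 3
```

The statistic would then be compared with the table value for the chosen α and `df` degrees of freedom.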

Test for Independence:
The Chi Square test for independence is used to test the independence of two variables. The null hypothesis for this test assumes independence of the two variables. The null hypothesis is rejected when the calculated test statistic is greater than the $\chi^{2}$ critical value found from the table against the degrees of freedom and significance level α. The test statistic is calculated using the same formula given under the Goodness of fit test.

Test for Homogeneity of Proportions:
This Chi Square test also makes use of a contingency table. Here the samples are selected from different populations, and the question is whether the proportion having a common characteristic is the same in all populations under study. While the null hypothesis assumes the equality of proportions, the alternate hypothesis states that at least one proportion is different from the others. The null hypothesis is rejected when the calculated test statistic is greater than the critical value found from the $\chi^{2}$ table for significance level α and (rows - 1)(columns - 1) degrees of freedom. The test statistic is calculated in the same manner as for the Chi Square test for independence.

In addition to the above tests, Chi Square distributions are also used to test the normality of a variable.

The formula used to calculate the test statistic in the test of single variance is
$\chi^{2}$ = $(n-1)$$\frac{s^{2}}{\sigma ^{2}}$
The formula used to calculate the test statistic in Chi Square test of Goodness of fit, test for independence or Homogeneity of Proportion is
$\chi^{2}$ = $\sum $$\frac{(obs-exp)^{2}}{exp}$
where 'obs' represents the observed frequency and 'exp' the expected frequency. The observed frequency is given in the frequency or contingency table for the categories. The expected frequencies for the Goodness of fit test are calculated from the pattern stated in the null hypothesis. When the null hypothesis states that there is no preference among the categories, the total frequency is divided equally among them.
When the data is given in a contingency table, the expected frequency for each cell is calculated for test of independence or Homogeneity of proportion using the formula
Expected frequency = $\frac{row\ sum \times column\ sum}{grand\ total}$.

Let us see how the contingency table is made for expected values when the observed values are also given in a tabular format.

The following table shows alcohol consumption, categorized as Low, Moderate and High, from a survey of 50 men and 50 women.

Alcohol Consumption (Observed Values)

| Gender | Low | Moderate | High | Total |
|---|---|---|---|---|
| Male | 22 | 18 | 10 | 50 |
| Female | 10 | 25 | 15 | 50 |
| Total | 32 | 43 | 25 | 100 |

You may notice that the row and column totals tally for each category and the grand total tallies with row and column totals.

The observed frequency corresponding to the row category Male and the column category Low is 22. We can indicate this as
$Obs_{ML}$ = 22, and the corresponding expected frequency can be written as $Exp_{ML}$, under the assumption that alcohol consumption is independent of gender.

$Exp_{ML}$ = $\frac{row\ sum \times column\ sum}{grand\ total}$, where row sum is the total for the row Male and column sum is the total for the column Low.

$Exp_{ML}$ = $\frac{50\times 32}{100}$ = 16
In a similar manner the other expected values are calculated.

$Exp_{MM}$ = $\frac{50\times 43}{100}$ = 21.5

$Exp_{MH}$ = $\frac{50\times 25}{100}$ = 12.5

$Exp_{FL}$ = $\frac{50\times 32}{100}$ = 16

$Exp_{FM}$ = $\frac{50\times 43}{100}$ = 21.5

$Exp_{FH}$ = $\frac{50\times 25}{100}$ = 12.5

Now the contingency table for expected frequencies is as follows:

Alcohol Consumption (Expected Values)

| Gender | Low | Moderate | High | Total |
|---|---|---|---|---|
| Male | 16 | 21.5 | 12.5 | 50 |
| Female | 16 | 21.5 | 12.5 | 50 |
| Total | 32 | 43 | 25 | 100 |

The totals for rows and columns tally for each category and the grand total tallies with row and column totals, here as well.
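
The expected values for the alcohol consumption data can be reproduced with a short sketch. The function name `expected_frequencies` is an assumption for illustration; the observed counts are the ones given in the article.

```python
def expected_frequencies(observed):
    """Expected count per cell: row sum * column sum / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_sums)
    return [[r * c / grand_total for c in col_sums] for r in row_sums]


# Observed alcohol consumption: rows Male, Female; columns Low, Moderate, High.
observed = [
    [22, 18, 10],
    [10, 25, 15],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)  # each row is [16.0, 21.5, 12.5]

# Row totals of the expected table match the observed row totals.
print([sum(row) for row in expected])  # [50.0, 50.0]
```
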
The following table shows the brand preference for three fruit beverages among young people in three different age groups. At α = 0.05, test the claim that brand preference is independent of age group.

Brand Preference for Fruit Beverages (Observed Frequencies)

| Brand | Under 15 | Age 15 - 25 | Age 25 - 35 | Total |
|---|---|---|---|---|
| Brand A | 150 | 100 | 200 | 450 |
| Brand B | 200 | 175 | 125 | 500 |
| Brand C | 250 | 125 | 175 | 550 |
| Total | 600 | 400 | 500 | 1500 |

The expected value for each cell is calculated using the formula,
Expected value = $\frac{row\ sum \times column\ sum}{grand\ total}$
The expected values are again shown in a contingency table as follows:

Brand Preference for Fruit Beverages (Expected Frequencies)

| Brand | Under 15 | Age 15 - 25 | Age 25 - 35 | Total |
|---|---|---|---|---|
| Brand A | 180 | 120 | 150 | 450 |
| Brand B | 200 | 133.3 | 166.7 | 500 |
| Brand C | 220 | 146.7 | 183.3 | 550 |
| Total | 600 | 400 | 500 | 1500 |

Step 1: State the null and alternate hypothesis.
H0: Brand Preference for fruit beverages is independent of the age group. (claim)
H1: Brand Preference for fruit beverages is dependent on the age group.

Step 2: Find the critical value.
There are three brands and three age groups. Hence the degrees of freedom = (3 -1)(3 - 1) = 4.
The α level for the test is given as 0.05.
The critical value for the test is found from the table as $\chi _{0.05,4}^{2}$ = 9.48773.
Step 3: Calculate the test value using the formula
$\chi _{test}^{2}$ = $\sum $$\frac{(obs-exp)^{2}}{exp}$
= $\frac{(150-180)^{2}}{180}$ + $\frac{(100-120)^{2}}{120}$ + $\frac{(200-150)^{2}}{150}$ + $\frac{(200-200)^{2}}{200}$ + $\frac{(175-133.3)^{2}}{133.3}$ + $\frac{(125-166.7)^{2}}{166.7}$ + $\frac{(250-220)^{2}}{220}$ + $\frac{(125-146.7)^{2}}{146.7}$ + $\frac{(175-183.3)^{2}}{183.3}$
= 56.15
Step 4: Make Decision:
The test statistic 56.15 > the critical value 9.48773. Hence the null hypothesis is rejected.
Step 5: Summarize the results:
At significance level α = 0.05, there is sufficient evidence to reject the claim that the brand preference for fruit beverages is independent of age group.
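
The five steps of this example can be retraced with a short sketch. The helper names are assumptions for illustration, and the critical value is taken from the table as in Step 2 rather than computed. Because the code uses unrounded expected frequencies, its test value comes out near 56.11 rather than the 56.15 obtained above from expected values rounded to one decimal; the decision is unaffected. The p-value helper uses the closed form of the Chi Square upper-tail probability that holds only for even degrees of freedom.

```python
import math


def expected_frequencies(observed):
    """Expected count per cell: row sum * column sum / grand total."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_sums)
    return [[r * c / grand_total for c in col_sums] for r in row_sums]


def chi_square_statistic(observed, expected):
    """Sum of (obs - exp)^2 / exp over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def upper_tail_even_df(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{j<df/2} (x/2)^j / j!."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


# Observed frequencies: rows Brand A, B, C; columns are the three age groups.
observed = [
    [150, 100, 200],
    [200, 175, 125],
    [250, 125, 175],
]

expected = expected_frequencies(observed)
stat = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (3 - 1)(3 - 1) = 4
critical = 9.48773                                 # table value for alpha = 0.05, df = 4
p_value = upper_tail_even_df(stat, df)

print(round(stat, 2))     # about 56.11
print(stat > critical)    # True, so the null hypothesis is rejected
print(p_value < 0.05)     # True, consistent with the critical-value decision
```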