A univariate data set consists of only one variable, such as the incomes of individual families, the heights of children in a given age group, test scores, or the ages of employees in an organization. In many situations, however, a study requires observing two variables, often to find out whether the variable of interest is related to another one. The study of bivariate data provides the tools, techniques, and methods for analyzing such data and drawing inferences from it.

A contingency table is used to display bivariate data when both variables are categorical.

## Bivariate Data Definition

Bivariate data consists of two variables, whose relationship is to be analyzed.

The variables in a bivariate data set can both be numerical, both be categorical, or one numerical and one categorical. If the analysis shows that one variable is influenced by the other, the influenced variable is known as the dependent variable and the other as the independent variable.

## Bivariate Data Analysis

The techniques applied in the analysis of Bivariate Data depend on the types of data involved in the distribution.

### Scatter Plot and Regression Line

When both variables in a bivariate data set are quantitative (numerical), a scatter plot is used to study the relationship between them. Each pair of values is treated as an ordered pair and plotted as a point on a graph. The independent variable is measured along the horizontal X-axis and the dependent variable along the vertical Y-axis. From the pattern of the plotted points, we can analyze the correlation between the two variables.

The scatter plot above shows the relationship between the average number of hours studied per week and the final score.
The pattern of the points indicates a positive correlation. From the data set, a regression line (or trend line) can be fitted using various methods. The equation of the regression line is useful in forecasting future behavior.

### Numerical Variable and a Categorical Variable

A back-to-back stem plot or a histogram is used to display bivariate data consisting of a numerical variable and a categorical variable with two categories.
The following table shows the weights of newborn babies in a hypothetical hospital during the course of a month.

| | Weights in kg |
|---|---|
| Boys | 3.5, 4.3, 5.0, 3.6, 4.9, 3.5, 3.8, 4.8, 3.6, 4.2 |
| Girls | 3.0, 2.8, 3.8, 3.2, 4.1, 3.1, 2.7, 3.3, 3.6, 3.2 |

The back-to-back stem plot shown above can be used for further analysis, such as finding the median and the quartiles of each group.
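As a rough illustration, a back-to-back stem plot for the weights in the table can be built in plain Python. This is only a sketch: the helper names `split`, `leaves_by_stem` and `back_to_back` are made up for this example, and the stems are assumed to be whole kilograms with tenths of a kilogram as leaves.

```python
from statistics import median

# Birth weights in kg, copied from the table above.
boys = [3.5, 4.3, 5.0, 3.6, 4.9, 3.5, 3.8, 4.8, 3.6, 4.2]
girls = [3.0, 2.8, 3.8, 3.2, 4.1, 3.1, 2.7, 3.3, 3.6, 3.2]


def split(value):
    """Split a weight into (stem, leaf): whole kilograms and tenths."""
    tenths = round(value * 10)
    return divmod(tenths, 10)


def leaves_by_stem(values):
    """Map each stem to its list of leaves in ascending order."""
    table = {}
    for stem, leaf in sorted(split(v) for v in values):
        table.setdefault(stem, []).append(leaf)
    return table


def back_to_back(left, right):
    """Return the lines of a text back-to-back stem plot."""
    left_table, right_table = leaves_by_stem(left), leaves_by_stem(right)
    lowest = min(min(left_table), min(right_table))
    highest = max(max(left_table), max(right_table))
    lines = []
    for stem in range(lowest, highest + 1):
        # Left-hand leaves are reversed so that they grow away from the stem.
        left_leaves = "".join(str(d) for d in reversed(left_table.get(stem, [])))
        right_leaves = "".join(str(d) for d in right_table.get(stem, []))
        lines.append(f"{left_leaves:>8} | {stem} | {right_leaves}")
    return lines


print("    Boys |   | Girls")
print("\n".join(back_to_back(boys, girls)))
print("Median weight, boys: ", median(boys))
print("Median weight, girls:", median(girls))
```

Reading the plot row by row gives the sorted values of each group, which is why the median and the quartiles can be read off it directly.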

When the categorical variable has more than two categories, parallel box plots can be constructed, each displaying the five-number summary of one category.

Example:

The following contingency table shows the ice cream flavor preferences of male and female students.

| Flavor | Male | Female | Total |
|---|---|---|---|
| Vanilla | 9 | 5 | 14 |
| Chocolate | 12 | 20 | 32 |
| Strawberry | 12 | 15 | 27 |
| Caramel | 15 | 12 | 27 |
| Banana Split | 12 | 8 | 20 |
| Total | 60 | 60 | 120 |

This contingency table can be used to analyze the bivariate data with different techniques. The frequencies can be expressed as percentages and compared, or the table can be used to test a claim about population behavior using more advanced techniques such as hypothesis testing.
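Both ideas can be sketched in a few lines of Python using the counts from the table. The choice of a chi-square test of independence is an assumption made here for illustration, since the text does not say which hypothesis test is meant, and the variable names are likewise illustrative.

```python
# Counts copied from the contingency table above: flavor -> (male, female).
counts = {
    "Vanilla": (9, 5),
    "Chocolate": (12, 20),
    "Strawberry": (12, 15),
    "Caramel": (15, 12),
    "Banana Split": (12, 8),
}

male_total = sum(m for m, _ in counts.values())
female_total = sum(f for _, f in counts.values())
grand_total = male_total + female_total

# Column percentages: the share of each gender that prefers each flavor.
for flavor, (m, f) in counts.items():
    male_pct = 100 * m / male_total
    female_pct = 100 * f / female_total
    print(f"{flavor:<13} male {male_pct:5.1f}%   female {female_pct:5.1f}%")

# Chi-square statistic for independence (assumed test, see the note above):
# the sum over all cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi_square = 0.0
for m, f in counts.values():
    row_total = m + f
    for observed, column_total in ((m, male_total), (f, female_total)):
        expected = row_total * column_total / grand_total
        chi_square += (observed - expected) ** 2 / expected

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
degrees_of_freedom = (len(counts) - 1) * (2 - 1)
print(f"chi-square = {chi_square:.3f}, degrees of freedom = {degrees_of_freedom}")
```

The statistic is then compared with a critical value of the chi-square distribution for the same degrees of freedom (about 9.49 at the 5% level for 4 degrees of freedom). A smaller statistic gives no evidence at that level that flavor preference depends on gender.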

## Bivariate Data Examples

Some examples are given below.

### Solved Examples

Question 1: The table below shows the heights of players and the average number of points each scored in a single basketball match.

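| Height in cm (x) | Average points scored (y) | Height in cm (x) | Average points scored (y) | Height in cm (x) | Average points scored (y) |
|---|---|---|---|---|---|
| 184 | 12 | 200 | 20 | 199 | 18 |
| 194 | 22 | 188 | 18 | 177 | 6 |
| 185 | 6 | 184 | 14 | 184 | 16 |
| 174 | 5 | 188 | 12 | 178 | 8 |
| 186 | 14 | 182 | 14 | 190 | 20 |
| 183 | 10 | 185 | 10 | 193 | 24 |
| 175 | 8 | 183 | 18 | 204 | 24 |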

Use technology to draw a scatter plot of the given data and discuss the correlation between height and average points scored. Also use technology to find the line of best fit for the plotted data.
Solution:

From the pattern of the scatter plot, a positive correlation between the height of a player and the points scored can be inferred. The equation of the regression line is y = 0.611x - 99.699. The correlation coefficient is r = 0.82, which indicates a fairly strong positive correlation between the two variables.
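The line of best fit and the correlation coefficient can be checked with a short calculation. The sketch below assumes the line is the ordinary least-squares line (the solution does not name the method) and uses only the Python standard library. The variable names are illustrative.

```python
from math import sqrt

# (height in cm, average points scored) pairs from the table in Question 1.
data = [
    (184, 12), (200, 20), (199, 18),
    (194, 22), (188, 18), (177, 6),
    (185, 6), (184, 14), (184, 16),
    (174, 5), (188, 12), (178, 8),
    (186, 14), (182, 14), (190, 20),
    (183, 10), (185, 10), (193, 24),
    (175, 8), (183, 18), (204, 24),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Sums of squared deviations and of cross products about the means.
s_xx = sum((x - mean_x) ** 2 for x, _ in data)
s_yy = sum((y - mean_y) ** 2 for _, y in data)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)

# Least-squares slope and intercept, and Pearson's correlation coefficient.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r = s_xy / sqrt(s_xx * s_yy)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.2f}")
```

The slope and r agree with the values quoted in the solution. The intercept can differ in the last decimal place, depending on whether the slope is rounded before the intercept is computed.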

Question 2: The heights (in cm) of students in three grades of a high school are given below. Find the five-number summary for each group and plot the summaries as parallel box plots.

| Grade | Heights in cm |
|---|---|
| Grade 10 | 120, 126, 131, 138, 140, 143, 146, 147, 150, 156, 157, 158, 158, 160, 162, 164, 168, 170 |
| Grade 11 | 140, 143, 146, 147, 149, 151, 153, 156, 162, 164, 165, 167, 168, 170, 173, 177, 178, 180 |
| Grade 12 | 151, 153, 154, 158, 160, 163, 164, 166, 167, 169, 169, 172, 175, 180, 187, 189, 193, 195 |

Solution:
The five-number summary for each group is as follows:

| Grade | Minimum | Maximum | Median | Q1 | Q3 |
|---|---|---|---|---|---|
| 10 | 120 | 170 | 153 | 140 | 160 |
| 11 | 140 | 180 | 163 | 149 | 170 |
| 12 | 151 | 195 | 168 | 160 | 180 |

The parallel box plots shown above can be used to compare the distributions of heights across the three grades.
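The summary values in the table can be reproduced with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is the convention that matches the table; other quartile conventions can give slightly different values. The function name `five_number_summary` is chosen for this example.

```python
from statistics import median

# Heights in cm, copied from the table in Question 2.
heights = {
    "Grade 10": [120, 126, 131, 138, 140, 143, 146, 147, 150,
                 156, 157, 158, 158, 160, 162, 164, 168, 170],
    "Grade 11": [140, 143, 146, 147, 149, 151, 153, 156, 162,
                 164, 165, 167, 168, 170, 173, 177, 178, 180],
    "Grade 12": [151, 153, 154, 158, 160, 163, 164, 166, 167,
                 169, 169, 172, 175, 180, 187, 189, 193, 195],
}


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves of the sorted
    data; the middle value is left out of both halves when the count is odd.
    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


for grade, values in heights.items():
    minimum, q1, med, q3, maximum = five_number_summary(values)
    print(f"{grade}: min={minimum}, Q1={q1}, median={med}, Q3={q3}, max={maximum}")
```

Each tuple holds the five values that define one box plot: the box spans Q1 to Q3 with a line at the median, and the whiskers reach the minimum and maximum.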