- Introduction
- Scatter Diagram: Graphic method
- Correlation
- Types of Correlation coefficient
- Interpretation of correlation coefficients
- Coefficient of determination
- Properties of coefficient of Correlation
- Pearson's product moment correlation
- Spearman’s Rank Correlation
- Regression equations of two variables
- The Method of Least Square
- Coefficient of Regression equation
- Properties of Regression Coefficients
- Exercise
Introduction
The data below give the marks obtained by 10 students in a Mathematics test and a Computer test.
| Students | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| X: Marks (Math) | 15 | 18 | 3 | 24 | 9 | 6 | 6 | 15 | 12 | 12 |
| Y: Marks (Computer) | 10 | 15 | 1 | 13 | 13 | 6 | 2 | 11 | 13 | 16 |
Is there a connection between the marks obtained by these 10 students in the Math and Computer tests? A natural starting point is to plot the marks of both subjects in a scatter diagram.
Now calculating the means, we get
\(\bar{X}=\frac{120}{10}=12\)
\(\bar{Y}=\frac{100}{10}=10\)
Drawing a vertical line at \(\bar{X}\) and a horizontal line at \(\bar{Y}\) divides the scatter diagram into four quadrants. It clearly shows that the bottom-right and top-left quadrants are largely vacant, so there is a tendency for the points to run from bottom left to top right.
In this example, most of the points (lying in the 1st and 3rd quadrants) give a positive value of
\((x-\bar{X})(y-\bar{Y})\)
The problem is to find a way to measure how strong this tendency is. To answer this question, we proceed further.
Here
\(\frac{x-\bar{x}}{s_x}\)
gives the standardized (unit-free) distance of each \(x\) from \(\bar{x}\).
Also
\(\frac{y-\bar{y}}{s_y}\)
gives the standardized (unit-free) distance of each \(y\) from \(\bar{y}\).
So
\(r=\frac{1}{n} \displaystyle \sum \left (\frac{x-\bar{x}}{s_x} \right ) \left ( \frac{y-\bar{y}}{s_y} \right )\)
gives the normalized product moment, which is the correlation coefficient.
The value of correlation (r) gives a measure of how close the points are to lying on a straight line.
- \(r=1\) indicates that all the points lie exactly on a straight line with positive gradient
- \(r=-1\) gives the same information for a line with negative gradient
- \(r=0\) tells us that there is no linear relationship between the two sets of data
The illustration shows that quantifying the relationship between variables, called correlation, is essential in order to benefit from studying that relationship. There are two basic methods of measuring correlation: the graphical method and the algebraic method.
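To make this concrete, here is a minimal sketch (plain Python, no external libraries) that computes \(r\) for the Math and Computer marks given above; the variable names are illustrative.

```python
# A minimal sketch: correlation of the Math and Computer marks above,
# computed as the normalized product moment (1/n) * sum(z_x * z_y).
math_marks     = [15, 18, 3, 24, 9, 6, 6, 15, 12, 12]   # X
computer_marks = [10, 15, 1, 13, 13, 6, 2, 11, 13, 16]  # Y

n = len(math_marks)
x_bar = sum(math_marks) / n        # 12
y_bar = sum(computer_marks) / n    # 10

# population standard deviations s_x and s_y
s_x = (sum((x - x_bar) ** 2 for x in math_marks) / n) ** 0.5
s_y = (sum((y - y_bar) ** 2 for y in computer_marks) / n) ** 0.5

# r = (1/n) * sum of ((x - x_bar)/s_x) * ((y - y_bar)/s_y)
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(math_marks, computer_marks)) / (n * s_x * s_y)

print(round(r, 2))   # 0.71 -> a fairly strong positive correlation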
Scatter Diagram: Graphic method
A scatter diagram is the graphic method of measuring correlation. It is a diagrammatic representation of bivariate data used to ascertain the relationship between two variables. Under this method the given data are plotted on graph paper, and once the values are plotted the diagram reveals the type of correlation between variables X and Y. Note, however, that the correlation is affected by every individual point.
[Scatter diagram panels: Strong Positive Correlation, Low Positive Correlation, No Correlation, Low Negative Correlation, Strong Negative Correlation]
Correlation
Correlation is a technique to measure the strength of association between two variables, say X and Y. The intensity of the correlation is expressed by a number, called the coefficient of correlation, which is denoted by r. The value of the correlation lies between -1 and 1 (inclusive).
- The coefficient of correlation was first introduced by Galton (1886)
- Formalized by Karl Pearson (1896)
- Developed and extended by Fisher (1935)
- The main idea is to compute an index (number) that reflects how closely two variables are related to each other.
- If two variables are related such that both increase or both decrease together, the correlation is positive
- If an increase in one variable is associated with a decrease in the other variable, the correlation is negative
Types of Correlation coefficient
There are two main types of correlation coefficients: Pearson's product moment correlation coefficient and Spearman's rank correlation coefficient. The correct choice of coefficient depends on the types of the variables being correlated, as summarized in the table below.

|  | Quantitative | Ordinal | Nominal |
|---|---|---|---|
| Quantitative | Pearson's | Biserial | Point Biserial |
| Ordinal | Biserial | Spearman rho | Rank Biserial |
| Nominal | Point Biserial | Rank Biserial | Phi |
Interpretation of correlation coefficients
Generally, the coefficient of correlation is positive, negative, or zero. If the correlation is positive, the variables are related such that both increase or both decrease together. If the correlation is negative, an increase in one variable is associated with a decrease in the other, and vice versa. If the correlation is zero, the variables are not (linearly) related. In addition, the magnitude of the correlation coefficient is interpreted as follows.

| Magnitude of \(r\) | Interpretation |
|---|---|
| between 0.8 and 1.0 | very high (nearly perfect) correlation |
| between 0.6 and 0.8 | high correlation |
| between 0.4 and 0.6 | moderate correlation |
| between 0.2 and 0.4 | low correlation |
| between 0.0 and 0.2 | very low (negligible) correlation |
Coefficient of determination
The correlation coefficient, measuring the linear relationship between two variables, indicates the amount of variation in one variable accounted for by the other variable. A better measure for this purpose is provided by the square of the correlation coefficient, known as the "coefficient of determination". It can be interpreted as the ratio of explained variance to total variance:
\(r^2 =\frac{\text{explained variance}}{\text{total variance}}\)
Similarly, the coefficient of non-determination is
\(1-r^2 \)
Thus
The square of the correlation coefficient is called the coefficient of determination. If \(r\) is obtained between two variables \(X\) and \(Y\), then \(r^2\) is the fraction of the variation in \(Y\) that is explained by \(X\).
For example, if the correlation between "Math score" and "Anxiety" is \(r=-0.4\), then \(r^2=0.16\), meaning that 16% of the variability in Math score and Anxiety "overlaps", in opposite directions.
Based on this example, a coefficient of determination of 0.16 is obtained. It can be interpreted that the variation in Math score explains 16% of the variation in Anxiety score. The remaining 84% of the variation in Anxiety score is explained by other variables not included in the model.
Properties of coefficient of Correlation
As correlation measures the strength of association between two variables, the major properties of the correlation coefficient can be summarized as follows:
- The correlation coefficient lies between \(-1\) and \(+1\)
- Independent from unit of measurement
- Independent of origin and scale
- Symmetrical i.e., \(r_{xy} = r_{yx}\)
Limitation of Correlation
A key thing to remember is that correlation does not mean that a change in one variable causes a change in the other. Sales of personal computers and athletic shoes have both risen strongly over the years and there is a high correlation between them, but it cannot be assumed that buying computers causes people to buy athletic shoes (or vice versa).
The second caution is that the Pearson correlation technique works best with linear relationships: as one variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with curvilinear relationships (in which the relationship does not follow a straight line). An example of a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a straight line. Young children and older people both tend to use much more health care than teenagers or young adults.
- r is a measure of linear relationship only. There may be an exact connection between X and Y, but if it is not a straight line, r is of no help.
- Correlation does not imply causality. A survey may find a strong correlation between, say, left-footedness and mental mathematics ability, without one causing the other.
- A single unusual (freak) result may have a strong effect on the value of r.
Pearson's product moment correlation: Algebraic Method
Karl Pearson's method of calculating the coefficient of correlation is based on the covariance of the two variables in a series. This method is widely used in practice and the coefficient of correlation is denoted by the symbol \(r\). It is used when both variables being studied are normally distributed and quantitative in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Pearson's correlation coefficient can be written in any of the following forms.
- Covariance (variance) method: \( r=\frac{Cov(X,Y)}{\sigma_x \sigma_y}\), where \(Cov(X,Y) =\frac{1}{n} \sum (x-\bar{x})(y-\bar{y})\)
- Deviation method: \( r=\frac{ \sum xy}{\sqrt{\sum x^2} \sqrt{\sum y^2}}\), where \(x=X-\bar{X}\) and \(y=Y-\bar{Y}\)
- Raw score method: \( r=\frac{n \sum XY-\sum X \sum Y}{\sqrt{n\sum X^2-(\sum X)^2} \sqrt{n\sum Y^2-(\sum Y)^2}}\)
Example 1
Calculate the correlation coefficient of the marks in Mathematics and Statistics for eight students given below.

| Marks in Math (X) | 67 | 68 | 65 | 68 | 72 | 72 | 69 | 71 |
|---|---|---|---|---|---|---|---|---|
| Marks in Stat (Y) | 65 | 66 | 67 | 67 | 68 | 69 | 70 | 72 |
Based on the data given above, we can simplify the calculation by subtracting \(65\) from each value of both series (the correlation is unchanged by a change of origin). The table of calculations is given below.
| Math marks | Stat marks | X = Math − 65 | Y = Stat − 65 | \(X^2\) | \(Y^2\) | \(XY\) |
|---|---|---|---|---|---|---|
| 67 | 65 | 2 | 0 | 4 | 0 | 0 |
| 68 | 66 | 3 | 1 | 9 | 1 | 3 |
| 65 | 67 | 0 | 2 | 0 | 4 | 0 |
| 68 | 67 | 3 | 2 | 9 | 4 | 6 |
| 72 | 68 | 7 | 3 | 49 | 9 | 21 |
| 72 | 69 | 7 | 4 | 49 | 16 | 28 |
| 69 | 70 | 4 | 5 | 16 | 25 | 20 |
| 71 | 72 | 6 | 7 | 36 | 49 | 42 |
|  |  | \(\sum X=32\) | \(\sum Y=24\) | \(\sum X^2=172\) | \(\sum Y^2=108\) | \(\sum XY=120\) |
Now, using the formula, the correlation coefficient is
\( r=\frac{n \sum XY-\sum X \sum Y}{\sqrt{n\sum X^2-(\sum X)^2} \sqrt{n\sum Y^2-(\sum Y)^2}}\)
or \( r=\frac{8 \times 120-32 \times 24}{\sqrt{8 \times 172-(32)^2} \sqrt{8 \times 108-(24)^2}}=0.60\)
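As a cross-check, a short sketch (plain Python) applies the same raw-score formula directly to the original marks; since \(r\) is independent of origin, subtracting 65 or not gives the same value.

```python
# A small check of Example 1 using the raw-score formula on the original marks.
X = [67, 68, 65, 68, 72, 72, 69, 71]   # Marks in Math
Y = [65, 66, 67, 67, 68, 69, 70, 72]   # Marks in Stat
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)

r = (n * sum_xy - sum_x * sum_y) / (
    (n * sum_x2 - sum_x ** 2) ** 0.5 * (n * sum_y2 - sum_y ** 2) ** 0.5)
print(round(r, 2))   # 0.60, matching the hand calculation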
Spearman’s Rank Correlation
When quantification of variables is difficult, such as beauty, leadership ability, or the knowledge of a person, the method of rank correlation is useful. It was developed by the British psychologist Charles Edward Spearman in 1904. In this method ranks are allotted to each element in either ascending or descending order. The correlation coefficient between these two series of ranks is popularly called "Spearman's Rank Correlation" and is denoted by \(\rho\). It is appropriate when one or both variables are skewed or ordinal in scale. For a correlation between variables \(X\) and \(Y\), the formula for calculating Spearman's rho is given by
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
where
\(d_i\)= the difference between the ranks of corresponding variables
\(n=\) number of observations
NOTE
If there are tied ranks, each tied value is assigned the mean of the ranks it would have received had there been no tie. In this case, we use the corrected formula below.
\(\rho=1-\frac{6 \left [\displaystyle \sum_{i=1}^n d_i^2+ \sum_k \frac{m_k(m_k^2-1)}{12}\right ]}{n(n^2-1)}\)
where
\(m_k=\) the number of observations tied at the \(k\)-th repeated value
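For data with ties, one option (a hedged sketch, assuming SciPy is available) is scipy.stats.spearmanr, which assigns mean ranks to tied values and correlates the rank series; its result may differ slightly from the correction-factor formula above.

```python
# Hedged sketch: Spearman's rho with tied values via SciPy, which assigns
# the mean of the ranks to tied observations before correlating the ranks.
from scipy.stats import spearmanr

x = [10, 20, 20, 30, 40]   # illustrative data: a tie in x (two 20s)
y = [12, 18, 25, 25, 40]   # and a tie in y (two 25s)

rho, p_value = spearmanr(x, y)
print(round(rho, 2))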
Proof of \(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
Consider a bivariate sample \((x_i,y_i)\) for \(i=1,2, \cdots,n\) and replace the observations by their ranks; then each of the two series of ranks is a permutation of the same sequence of numbers \(1,2,3,\cdots,n\). (Below, \(x_i\) and \(y_i\) denote the ranks.)
Thus
\(\bar{x}=\frac{\displaystyle \sum_{i=1}^n i}{n}\)
or \(\bar{x}=\frac{1+2+\cdots+n}{n}\)
or \(\bar{x}=\frac{n+1}{2}\)
Similarly,
\(s_x^2=\frac{1}{n} \displaystyle \sum_{i=1}^n i^2-(\bar{x})^2\)
or \(s_x^2=\frac{1}{n}\cdot \frac{n(n+1)(2n+1)}{6}-\left(\frac{n+1}{2}\right)^2\)
or \(s_x^2=\frac{(n+1)(2n+1)}{6}-\left(\frac{n+1}{2}\right)^2\)
or \(s_x^2=\left(\frac{n+1}{2}\right) \left[\frac{2n+1}{3}-\frac{n+1}{2} \right]\)
or \(s_x^2=\left(\frac{n+1}{2}\right)\left(\frac{n-1}{6}\right)\)
or \(s_x^2=\frac{n^2-1}{12}\)
So, we have
\(\bar{x}=\bar{y}=\frac{n+1}{2}\)
\(s_x^2=s_y^2=\frac{n^2-1}{12}\)
Next, consider the rank differences
\(d_i= (x_i-\bar{x})-(y_i-\bar{y})=x_i-y_i\) (since \(\bar{x}=\bar{y}\))
Therefore
\(\displaystyle \frac{1}{n} \sum_{i=1}^n d_i^2= \frac{1}{n} \sum_{i=1}^n [(x_i-\bar{x})-(y_i-\bar{y})]^2\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= s_x^2+s_y^2-2\,r\, s_x s_y\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2-2\,r\, s_x^2\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2s_x^2(1-r)\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= 2\cdot \frac{n^2-1}{12} (1-r)\)
or \(\displaystyle \frac{1}{n}\sum_{i=1}^n d_i^2= \frac{n^2-1}{6} (1-r)\)
or \( \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}= 1-r\)
or \(r=1- \frac{ \displaystyle 6\sum_{i=1}^n d_i^2}{n(n^2-1)}\)
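A quick numerical check of this identity (plain Python, with an illustrative pair of rank permutations): Pearson's \(r\) computed on the ranks agrees with \(1-\frac{6\sum d_i^2}{n(n^2-1)}\) when there are no ties.

```python
# Numerical check: Pearson's r on two rank permutations equals the Spearman
# formula 1 - 6*sum(d^2)/(n*(n^2-1)) when there are no ties.
rx = [1, 2, 3, 4, 5]   # ranks of x (illustrative permutation of 1..n)
ry = [2, 1, 4, 3, 5]   # ranks of y
n = len(rx)

mean = (n + 1) / 2           # common mean of both rank series
var = (n * n - 1) / 12       # common variance of both rank series
r_on_ranks = sum((a - mean) * (b - mean) for a, b in zip(rx, ry)) / (n * var)

d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))

print(round(r_on_ranks, 6), round(rho, 6))   # both 0.8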
Example 2
Calculate the rank correlation of the scores in Mathematics and IQ for seven students given below.

| Score in Math (X) | 52 | 51 | 53 | 55 | 54 | 56 | 57 |
|---|---|---|---|---|---|---|---|
| Score in IQ (Y) | 61 | 63 | 62 | 64 | 67 | 65 | 66 |
Based on the data given above, we may simplify the calculation by subtracting \(50\) from each value of \(X\) and \(60\) from each value of \(Y\); this does not change the ranks.
Now, the table of calculation is given below.
| X | Y | X − 50 | Y − 60 | Rank of X | Rank of Y | \(d = R_X - R_Y\) | \(d^2\) |
|---|---|---|---|---|---|---|---|
| 52 | 61 | 2 | 1 | 6 | 7 | -1 | 1 |
| 51 | 63 | 1 | 3 | 7 | 5 | 2 | 4 |
| 53 | 62 | 3 | 2 | 5 | 6 | -1 | 1 |
| 55 | 64 | 5 | 4 | 3 | 4 | -1 | 1 |
| 54 | 67 | 4 | 7 | 4 | 1 | 3 | 9 |
| 56 | 65 | 6 | 5 | 2 | 3 | -1 | 1 |
| 57 | 66 | 7 | 6 | 1 | 2 | -1 | 1 |
|  |  |  |  |  |  |  | \(\sum d^2=18\) |
Now, using the formula, the rank correlation coefficient is
\(\rho=1-\frac{6\displaystyle \sum_{i=1}^n d_i^2}{n(n^2-1)}\)
or \(\rho=1-\frac{6 \times 18}{7(7^2-1)}=0.68\)
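A minimal sketch verifying Example 2 in Python: rank both series (rank 1 = highest score, as in the table), take the rank differences, and apply the formula.

```python
# Verify Example 2: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)) on the ranked scores.
X = [52, 51, 53, 55, 54, 56, 57]   # Score in Math
Y = [61, 63, 62, 64, 67, 65, 66]   # Score in IQ
n = len(X)

def ranks(values):
    # rank 1 = largest value, as in the table above (no ties in this data)
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(X), ranks(Y)))   # 18
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 2))   # 0.68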
Regression equations of two variables
Regression analysis is a statistical tool to estimate (or predict) the
unknown values of dependent variable from the known values of independent
variable.
The variable that forms the basis for predicting another variable is known as
the Independent Variable and the variable that is predicted is known as dependent
variable.
For example,
in \(Y=a+bX\)
one can obtain the value of \(Y\) by putting in the value of \(X\).
So,
X is called the independent variable
Y is called the dependent variable
Therefore,
regression is a technique to measure the "dependence of one variable upon another variable".
Regression Equation
Let \(X\) be an independent variable and \(Y\) a dependent variable; then the regression equation of Y on X is
\(Y=a +b X\) (1)
where ‘a’ and ‘b’ are constants:
\(a\) represents the y-intercept
\(b\) represents the slope of the line
To understand this better, compare \(Y=b X+a\) with \(Y=m X+c\); then it can be said that
\(a=c\) represents the y-intercept
\(b=m\) represents the slope of the line
To compute the values of the constants ‘a’ and ‘b’, the corresponding normal equations are given below.
Normal equation for ‘a’ [obtained by taking the sum of both sides of (1)]:
\( \sum Y=na+b \sum X \) (2)
Normal equation for ‘b’ [obtained by multiplying equation (1) by X and taking the sum of both sides]:
\( \sum XY=a \sum X+ b \sum X^2\) (3)
Solving (2) and (3) for ‘a’ and ‘b’, we get
\( b =\frac{\sum XY -\frac{\sum X \sum Y}{n}}{\sum X^2-\frac{(\sum X)^2}{n}}\)
or \( b =\frac{\sum XY -n \bar{X}\bar{Y}}{\sum X^2-n \bar{X}^2} \)
And
\( a =\bar{Y}-b \bar{X}\)
The Method of Least Square
Let us consider a data set given as:

| X | 4 | 8 | 12 |
|---|---|---|---|
| Y | 6 | 1 | 6 |
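A minimal sketch (plain Python) applying the least-squares estimates from the previous section, \(b=\frac{\sum XY-n\bar{X}\bar{Y}}{\sum X^2-n\bar{X}^2}\) and \(a=\bar{Y}-b\bar{X}\), to this data set.

```python
# Least-squares fit of Y = a + bX to the data set above.
X = [4, 8, 12]
Y = [6, 1, 6]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
b = (sum(x * y for x, y in zip(X, Y)) - n * x_bar * y_bar) / (
    sum(x * x for x in X) - n * x_bar ** 2)
a = y_bar - b * x_bar

print(round(a, 2), round(b, 2))   # a = 4.33, b = 0.0 for this data set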