Chi square: Two-way Classification
In this type of problem, the observations are classified by two characteristics, and we wish to examine the null hypothesis that the two characteristics are independent of each other. That is, the distribution of one characteristic should be the same regardless of the value of the other characteristic.
For example, if size class and number of flowers produced are independent (that is, the two factors have nothing to do with each other), then the proportion of large plants having 10 flowers should be the same as the proportion of small plants having 10 flowers, and so on. (These theoretical distributions or proportions are for the population, not for the sample.) Another way of stating this is that if size class and number of flowers are independent, then the probability of a plant being large and producing >10 flowers should be equal to the probability of having a large plant times the probability of having >10 flowers. Our procedure is to examine the sample and to decide whether the proportions of observations in the various categories are significantly different from the values expected if the two characteristics were independent. Again, we use chi-square to test goodness of fit. However, in this case the expected frequencies must be calculated from the data (they are not given, as in the previous problem). To do this, we fill in the following table.
| Number of Flowers | Size Class: Large | Size Class: Small | Total |
|---|---|---|---|
| >10 | 32 | 14 | 46 |
| <10 | 12 | 22 | 34 |
| Total | 44 | 36 | 80 |
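As a rough illustration of the independence idea, the sketch below rebuilds the marginal totals from the four observed cell counts and compares the observed joint proportion of large plants with >10 flowers against the product of the two marginal proportions. The variable names are chosen here for illustration and are not part of the original problem.

```python
# Observed counts from the table above: (flower class, size class) -> number of plants.
observed = {
    (">10", "Large"): 32,
    (">10", "Small"): 14,
    ("<10", "Large"): 12,
    ("<10", "Small"): 22,
}

n = sum(observed.values())  # grand total of plants

# Marginal totals for each characteristic.
flower_totals = {}
size_totals = {}
for (flowers, size), count in observed.items():
    flower_totals[flowers] = flower_totals.get(flowers, 0) + count
    size_totals[size] = size_totals.get(size, 0) + count

# Under independence, P(large and >10) = P(large) * P(>10).
p_large = size_totals["Large"] / n
p_many = flower_totals[">10"] / n
p_joint_if_independent = p_large * p_many
p_joint_observed = observed[(">10", "Large")] / n

print("Totals:", flower_totals, size_totals, n)
print("Joint proportion if independent:", round(p_joint_if_independent, 3))
print("Joint proportion observed:", round(p_joint_observed, 3))
```

The observed proportion (0.4) is noticeably larger than the product of the marginal proportions (about 0.316); whether that gap is statistically significant is what the chi-square test decides.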
Examine the totals for each characteristic. We note that 46 out of 80 plants had >10 flowers. If plant size and flower production are independent, we would expect to find the same proportion of large plants with >10 flowers as small plants with >10 flowers. Since we observe 44 large plants, we expect to find (46/80) × 44 = 25.3 large plants with >10 flowers. By the same reasoning, (34/80) × 44 = 18.7 large plants should have <10 flowers. Continuing with this reasoning, we can prepare a table of the observed (O) and expected (E) frequencies in each category.
| Number of Flowers | Observed (O) / Expected (E) | Size Class: Large | Size Class: Small | Total |
|---|---|---|---|---|
| >10 | (O) | 32 | 14 | 46 |
| >10 | (E) | 25.3 | 20.7 | 46 |
| <10 | (O) | 12 | 22 | 34 |
| <10 | (E) | 18.7 | 15.3 | 34 |
| Total | | 44 | 36 | 80 |
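The expected frequencies can also be computed programmatically. The following is a minimal sketch, assuming the observed counts are stored as a list of rows; the function name `expected_frequencies` is hypothetical and introduced only for this example. Each expected count is the row total times the column total divided by the grand total, which is the same arithmetic as (46/80) × 44.

```python
# Observed counts: rows are flower classes (>10, <10),
# columns are size classes (Large, Small).
observed = [
    [32, 14],
    [12, 22],
]

row_totals = [sum(row) for row in observed]         # totals for >10 and <10
col_totals = [sum(col) for col in zip(*observed)]   # totals for Large and Small
n = sum(row_totals)                                 # grand total


def expected_frequencies(rows, cols, total):
    """Expected count per cell under independence: row total * column total / grand total."""
    return [[r * c / total for c in cols] for r in rows]


expected = expected_frequencies(row_totals, col_totals, n)
for row in expected:
    print([round(e, 1) for e in row])
# prints [25.3, 20.7] and then [18.7, 15.3]
```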
The χ² value is computed as before: (6.7)²/25.3 + (-6.7)²/18.7 + (-6.7)²/20.7 + (6.7)²/15.3 = 1.77 + 2.40 + 2.17 + 2.93 = 9.27. Degrees of freedom are calculated in a slightly different manner than before. If r is the number of rows in the table and c is the number of columns, then the degrees of freedom are given by (r-1) × (c-1). For our problem, the degrees of freedom = (2-1) × (2-1) = 1. Again we need to decide the level of significance and refer to the table. The decision rule is to reject the null hypothesis of independence if our calculated value is greater than the table value for the appropriate degrees of freedom and significance level. The table value for 1 degree of freedom at the .95 level (a significance level of 0.05) is 3.84. Our calculated value was 9.27. Thus, we reject the null hypothesis that the two characteristics are independent. From a casual glance at the data, it appears that there are more large plants with >10 flowers and also more small plants with <10 flowers than would be expected if there were no association between plant size and flower production.
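The whole calculation can be sketched in a few lines of code. This is an illustrative sketch rather than a definitive implementation: the function name `chi_square_independence` is hypothetical, the critical value 3.84 is simply the table value quoted in the text, and no continuity correction is applied. Because the code does not round the individual terms, it gives about 9.28 rather than the 9.27 obtained by summing rounded terms; the conclusion is the same.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # expected count under independence
            statistic += (obs - exp) ** 2 / exp       # (O - E)^2 / E for this cell

    dof = (len(table) - 1) * (len(table[0]) - 1)      # (r - 1) * (c - 1)
    return statistic, dof


# Rows: >10 and <10 flowers; columns: Large and Small plants.
observed = [[32, 14], [12, 22]]

statistic, dof = chi_square_independence(observed)
CRITICAL_VALUE = 3.84  # table value for 1 degree of freedom, as quoted in the text

reject_independence = statistic > CRITICAL_VALUE
print(round(statistic, 2), dof, reject_independence)
```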
Two-way classifications of this sort are often called contingency tables. Contingency tables, goodness of fit, and t-tests are only a few of the many statistical tests that can be used to analyze data. One of the most difficult steps in research is to determine which statistical test to use. The choice of test depends on what hypothesis you wish to test, what statistical assumptions need to be made about the data, and several other factors such as the types of variables.
Variables may be classified as measurement variables (continuous and discontinuous), ranked variables, or nominal variables.
- Continuous variables can (theoretically) assume an infinite number of values between any two fixed points. Many variables we measure in science are continuous, such as lengths, areas, volumes, weights, temperatures, and periods of time.
- Discontinuous variables can only have certain numerical values, with no intermediate values. Examples of discontinuous variables include counts of a certain structure, such as the number of legs, teeth, or leaves, the number of offspring, or the number of cells in a certain area.
- Ranked variables cannot be measured but can be ordered or ranked. Thus the order of emergence of chicks from eggs might be recorded, without specifying the exact time when each chick emerged. In ranked variables, the difference between two ranks is not identical or even proportional to differences between other ranks. For example, you could classify the ages of trees as seed, seedling, sapling, and mature. Clearly seedling is older than seed and younger than sapling, but there is no way to compare the difference between seed and seedling with the difference between seedling and sapling, or to compute a mean age.
- Variables that cannot be expressed quantitatively are called nominal variables. A nominal measurement puts observations in categories, which have no quantitative relationship with each other. An example is color. You can record the color of one flower as red and of another as blue, but it makes no sense to say that red is more colored than blue, or that the mean color is (red + blue)/2. Other examples are male and female, and dead or alive.
It should be obvious that some data can be expressed in more than one way. Height may be a continuous variable if you measure very accurately or you may choose to make height a nominal variable by classifying individuals simply as tall or short.
Thus, given certain assumptions, and keeping in mind your hypothesis, the t-test is appropriate if you have a continuous variable such as height and wish to compare the mean heights of two groups. (For more than two groups you need a different statistical test, such as a one-way analysis of variance.) Depending on your hypothesis, contingency tables may be the best choice for nominal variables.
Many of these statistical tests can be done by hand, on calculators, with statistical software, or even with statistical packages found on the Internet.