Understanding and Using Statistical Research Methodologies in
Medical Education Programs:
A Primer for Medical Students and Residents
MODULES 1, 2, & 3
Danford L. Wilson, Ph.D.
Director for Graduate Medical Education
University of Kansas School of Medicine
3901 Rainbow Boulevard, Kansas City, KS 66160-7300
Introduction
This statistical module is designed for residents and students who are not epidemiologists and who do not wish to use extensive epidemiological statistics in their research. It will allow the user to understand the basic statistics used in medical research, and to understand their use and underlying concepts when reading medical and scientific journals.
This module establishes a framework for understanding the null hypothesis and for using various statistics (the mean, median, and mode; the chi square test; the Pearson correlation coefficient, or Pearson's r; the t test; one-way analysis of variance; two-way analysis of variance; and multiple regression), with emphasis placed on understanding the role of "qualitative" and "quantitative" evaluation in medical education research.
Overall Objectives
Residents and Students will:
C. Logistical Considerations:
PGY 2 and PGY 3
There are three main statistical modules (each containing from 2 to 4 sections) with a "seat time" of approximately 45 minutes per main module and 30 minutes for the "Foundation of Program Evaluation". The total time for all modules will be approximately 3 hours and 30 minutes (allowing for a 15-minute question and answer period for each main module).
Module Summaries and Objectives
Module 1: Basic Statistical Methods and Their Use
Module One, Basic Statistical Methods and Their Use, gives residents a simple overview of the use of the hypothesis; the mean, median, and mode; the chi square test; and Pearson's r in research, and it defines common concepts in their use.
At the end of this module, the resident and students will be able to:
Module 2: Advanced Statistical Methods: The t Test and ANOVA
Module Two, Advanced Statistical Methods: The t Test and ANOVA, provides residents and students with a simple overview of how to use and interpret the advanced statistical methods found in medical and scientific journals. Residents and students will learn tools and techniques for using these statistics in preparation for writing their own research papers in their residency and clinical programs, without having to use heavy epidemiological statistics.
At the end of this module, the resident will be able to:
Module 3: Advanced Statistical Methods: Multiple Regression and ANCOVA (Analysis of Covariance)
Module Three, Advanced Statistical Methods: Multiple Regression and ANCOVA (Analysis of Covariance), provides students and residents with a simple overview of how to use and interpret the statistical methods found in medical and scientific journals. Residents and students will learn tools and techniques for using and solving statistics in preparation for writing their own research papers during their residency year, without having to learn epidemiological statistics.
At the end of this module, the resident and student will be able to:
Module #1
Section 1
Topic: Introduction to the Null Hypothesis
Suppose we drew a random sample of medical students and a random sample of residents, administered a self-report measure of medical knowledge, and computed the mean (the most commonly used average) for each group. Furthermore, suppose the mean for the medical students is 63.00 and the mean for the residents is 68.00. Where did the five-point difference come from? There are three possible explanations:
The third explanation, that the difference was created by random sampling errors, has a name: the Null Hypothesis. The general form in which it is stated varies from researcher to researcher. Here are three versions, all of which are consistent with each other:
Significance tests determine the probability that a difference at least as large as the one observed would be produced by random sampling errors alone; in this primer, that probability is described informally as the probability that the null hypothesis is true. The researcher sets the probability level. Suppose that, for our example, we use a significance test and find that the probability that the null hypothesis is true is less than 5 in 100. This would be stated as p < .05, where p stands for probability. Researchers should always state the probability level used in their research findings. The level can be set at any small value greater than 0 (such as .001); however, we play it safe by setting it at .05, which is the level most widely accepted by convention. Of course, if the chance that something is true is less than 5 in 100, it's a good bet that it's not true. If it's probably not true, we reject the null hypothesis, leaving us with only the first two explanations that we started with as viable explanations for the difference.
There is no rule of nature that dictates at what probability level the null hypothesis should be rejected. However, conventional wisdom suggests that a level of .05 or less (such as .01 or .001) is reasonable.
When we fail to reject the null hypothesis because the probability is greater than .05, we do just that: We "fail to reject" the null hypothesis and it stays on our list of possible explanations; we never "accept" the null hypothesis as the only explanation. Remember, there are three possible explanations (see above) and failing to reject one of them does not mean that you are accepting it as the only explanation.
An alternative way to say that we have rejected the null hypothesis is to state that the difference is statistically significant. Thus, if we state that a difference is statistically significant at the .05 level (meaning .05 or less), it is equivalent to stating that the null hypothesis has been rejected at that level.
When you read research reported in academic journals, or research papers, you will find that the null hypothesis is seldom stated by researchers, who assume that you know that the sole purpose of a significance test is to test a null hypothesis. Instead, researchers tell you which differences were tested for significance, which significance test they used, and which differences were found to be statistically significant. It is more common to find null hypotheses stated in theses and dissertations since committee members may wish to make sure that the students they are supervising understand the reason they have conducted a significance test.
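The idea that random sampling errors alone can create a difference between two means can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population (normally distributed scores with a mean of 65 and a standard deviation of 10), the group size of 30, and the number of trials are all hypothetical and do not come from the example above.

```python
import random
import statistics


def simulate_differences(n_per_group=30, trials=2000, seed=1):
    """Draw two random samples from the SAME population many times and
    return the difference between the two sample means for each trial."""
    rng = random.Random(seed)
    differences = []
    for _ in range(trials):
        # Hypothetical population: scores with mean 65 and SD 10.
        group_1 = [rng.gauss(65, 10) for _ in range(n_per_group)]
        group_2 = [rng.gauss(65, 10) for _ in range(n_per_group)]
        differences.append(statistics.mean(group_2) - statistics.mean(group_1))
    return differences


differences = simulate_differences()

# Share of trials in which chance alone produced a gap of 5 points or more.
share_large = sum(abs(d) >= 5 for d in differences) / len(differences)

print(f"Largest chance difference: {max(abs(d) for d in differences):.2f}")
print(f"Share of trials with a difference of 5+ points: {share_large:.3f}")
```

Because both samples come from the same population, every nonzero difference in this simulation is a sampling error, which is exactly what the null hypothesis proposes as an explanation.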
Exercise
Question for Discussion
Module #1
Section 2
Topic: The Mean, Median, and Mode
The most frequently used average is the Mean, which is the balance point in a distribution. Its computation is simple: just sum (add up) the scores and divide by the number of scores. The most common symbol for the mean in academic journals is M (for the mean of a population) or m (for the mean of a sample). The symbol preferred by statisticians is X̄, which is pronounced "X-bar".
Because the mean is very frequently used as the average, let's consider its formal definition, which is the value around which the deviations sum to zero. You can see what this means by considering the scores in Table 1. When we subtract the mean of the scores (which is 4.0) from each of the scores, we get the deviations (whose symbol is x). If we sum the deviations, we get zero, as shown in Table 1.
Table 1 Scores and deviation scores
| X | Minus | M | Equals | x |
| 1 | - | 4.0 | = | -3.0 |
| 1 | - | 4.0 | = | -3.0 |
| 1 | - | 4.0 | = | -3.0 |
| 2 | - | 4.0 | = | -2.0 |
| 2 | - | 4.0 | = | -2.0 |
| 4 | - | 4.0 | = | 0.0 |
| 6 | - | 4.0 | = | 2.0 |
| 7 | - | 4.0 | = | 3.0 |
| 8 | - | 4.0 | = | 4.0 |
| 8 | - | 4.0 | = | 4.0 |
Sum of the deviations (x) = 0.0
Note: If you take any set of scores, compute their mean, and follow the steps in Table 1, the sum of the deviations will always equal zero (it might be slightly off from zero if you use a rounded mean such as using 20.33 as the mean when its precise value is 20.33333333).
Considering the formal definition, you can see why we also informally define the mean as the balance point in a distribution. The positive and negative deviations balance each other out.
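The steps in Table 1 can be reproduced in a few lines. This is a minimal sketch that uses the ten scores from Table 1 and Python's standard statistics module; the variable names are illustrative only.

```python
import statistics

# The X column of Table 1.
scores = [1, 1, 1, 2, 2, 4, 6, 7, 8, 8]

# The mean: sum the scores and divide by the number of scores.
mean = statistics.mean(scores)

# Each deviation (x) is a score minus the mean.
deviations = [score - mean for score in scores]

print("Mean:", mean)
print("Deviations:", deviations)
print("Sum of the deviations:", sum(deviations))
```

Replacing the list with any other set of scores should leave the sum of the deviations at zero, apart from tiny floating-point rounding.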
A major drawback of the mean is that it is drawn in the direction of extreme scores. Consider the following two sets of scores and their means:
Scores for Set A: 1, 1, 1, 2, 3, 6, 7, 8, 8
M = 4.11
Scores for Set B: 1, 2, 2, 3, 4, 7, 9, 25, 32
M = 9.44
Notice in both sets there are nine scores and the two distributions are very similar except for the scores of 25 and 32 in Set B, which are much higher than the others and, thus, create a skewed distribution. Notice that the two very high scores have greatly pulled up the mean for Set B; in fact, the mean for Set B is more than twice as high as the mean for Set A because of the two high scores.
When a distribution is highly skewed, we use a different average, the Median, which is defined as the middle score. To get an approximate median, put the scores in order from low to high as they are for Sets A and B (above), and then count to the middle. Since there are nine scores in Set A, the median (middle score) is 3 (which is five scores up from the bottom). For Set B, the median (middle score) is 4 (which is five scores up from the bottom), which is more representative of the center of this skewed distribution than the mean, which we noted was 9.44. Thus, one use of the median is to describe the average of skewed distributions. Another use is to describe the average of ordinal data, one of the four levels of measurement in the NOIR scheme (Nominal, Ordinal, Interval, and Ratio).
A third average, the Mode, is simply the most frequently occurring score. For Set B, there are more scores of 2 than any other score; thus, 2 is the mode. The mode is sometimes used in informal reporting but is very seldom used in formal reports of research.
Because there is more than one type of average, it is vague to make a statement such as, "The average is 4.11." Rather, we should indicate the specific type of average being reported with statements such as, "The mean is 4.11."
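The three averages for the two sets of scores can be checked with Python's standard statistics module. This is a minimal sketch; the helper name describe is an assumption made for illustration.

```python
import statistics

set_a = [1, 1, 1, 2, 3, 6, 7, 8, 8]
set_b = [1, 2, 2, 3, 4, 7, 9, 25, 32]


def describe(scores):
    """Return the mean (rounded to two decimals), median, and mode."""
    return (
        round(statistics.mean(scores), 2),
        statistics.median(scores),
        statistics.mode(scores),
    )


for label, scores in (("Set A", set_a), ("Set B", set_b)):
    mean, median, mode = describe(scores)
    print(f"{label}: mean = {mean}, median = {median}, mode = {mode}")
```

Comparing the two lines of output shows how the two extreme scores in Set B pull the mean upward while barely moving the median.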
A synonym for the term averages is measures of central tendency. Although the latter is seldom used in reports of scientific research, you may encounter it in other research and statistics.
Exercise
Question for Discussion
Module #1
Section 3
Topic: Introduction to the Chi Square Test
Suppose we drew at random a sample of 200 members of the American Medical Association and asked them whether they were in favor of a proposed change to their bylaws. The results are shown in Table 1. But do these observed results reflect the true results that we would have obtained if we had questioned the entire population? (Note: we are using the term true results here to stand for the results of a census of the entire population. The results of a census are true in the sense that they are free of sampling errors. Of course, there may also be measurement errors, which we are not considering here.)
Table 1 Members' approval of a change in bylaws
| Response | Percentage |
| Yes | 60.0% (n = 120) |
| No | 40.0% (n = 80) |
| Total | 100.0% |
Remember that the null hypothesis says that the observed difference was created by random sampling errors; that is, in the population, the true difference is zero. Put another way, the observed difference (n = 120 vs. n = 80) is an illusion created by chance error.
The usual test of the null hypothesis when we are considering frequencies (that is, number of cases or n) is Chi Square, whose symbol is χ².
It turns out that after doing some computations, which are beyond the scope of this paper, for the data in Table 1 (above), the results are:
χ² = 8.00, df = 1, p < .05
What does this mean for a user of research who sees this in a report? The values of chi square and degrees of freedom (df) were calculated solely to obtain the probability that the null hypothesis is correct. That is, chi square and degrees of freedom are not descriptive statistics that you should attempt to interpret. Rather, think of them as sub-steps in the mathematical procedure for obtaining the value of p. Thus, the user of research should concentrate on the fact that p is less than .05. As you probably remember, when the probability (p) that the null hypothesis is correct is .05 or less, we reject the null hypothesis. (Remember, when the probability that something is true is less than 5 in 100, a low probability, conventional wisdom suggests that we should reject it as being true.) Thus, the difference we observe in Table 1 was probably not created by random sampling errors; thus, we can say that the difference is statistically significant at the .05 level.
Up to this point, we have concluded that the difference we observed in the sample was probably not created by sampling errors. So where did the difference come from? Two possibilities remain:
Now let's consider some results from a survey in which the null hypothesis was not rejected. Table 2 shows the numbers and percentages of subjects in a random sample from a population of Resident Program Directors who prefer each of three methods for teaching statistics.
Table 2 Resident Program Directors' preferences for method
| Method A | Method B | Method C |
| n = 30 (37.97%) | n = 27 (34.18%) | n = 22 (27.85%) |
In Table 2, there are three differences (30 for A versus 27 for B, 30 for A versus 22 for C, and 27 for B versus 22 for C). The null hypothesis says that this set of differences was created by random sampling errors; in other words, it says that there is no true difference in the population; we have observed a difference only because of sampling errors. The results of the Chi Square test for the data in Table 2 are:
χ² = 1.241, df = 2, p > .05
Using the decision rule that p must be equal to or less than .05 to reject the null hypothesis, we fail to reject the null hypothesis, which is called a statistically insignificant result. In other words, the null hypothesis must remain on our list as a viable explanation for the set of differences we observed by studying a sample.
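For readers who want to see the computations that the text leaves out, the following is a minimal sketch of a one-way chi square calculation. It assumes that the null hypothesis expects equal frequencies in every category, and it takes the .05 critical values for 1 and 2 degrees of freedom (3.841 and 5.991) from a standard chi square table; the function and variable names are illustrative only.

```python
def chi_square_equal_expected(observed):
    """One-way chi square in which the null hypothesis expects the
    same frequency in every category. Returns (chi_square, df)."""
    expected = sum(observed) / len(observed)
    chi_square = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1
    return chi_square, df


# Critical values at the .05 level, from a standard chi square table.
CRITICAL_05 = {1: 3.841, 2: 5.991}

bylaws = [120, 80]        # Table 1: yes vs. no
methods = [30, 27, 22]    # Table 2: Methods A, B, and C

for label, counts in (("Table 1", bylaws), ("Table 2", methods)):
    chi_square, df = chi_square_equal_expected(counts)
    reject = chi_square >= CRITICAL_05[df]
    decision = "reject" if reject else "fail to reject"
    print(f"{label}: chi square = {chi_square:.3f}, df = {df}, "
          f"{decision} the null hypothesis at .05")
```

Comparing the computed chi square with the tabled critical value is equivalent to asking whether p is .05 or less.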
In this topic, we have considered the use of chi square in a univariate analysis, in which we classify each subject in only one way (such as which teaching method each prefers). Chi square can also be useful in bivariate analysis, in which we classify each subject in two ways (such as which teaching method each prefers and the gender of each) in order to examine a relationship between the two.
Exercise
Question for Discussion
Module #1
Section 4
When we wish to examine the relationship between two quantitative sets of scores (at the interval or ratio levels), we compute a correlation coefficient. The most widely used coefficient is the Pearson Product-Moment Correlation Coefficient, whose symbol is r. It is usually called simply Pearson's r.
Consider the scores in Table 1 (below). The residents' clinical test scores place the residents in roughly the same order as the ratings by supervisors. In other words, those who had high clinical test scores (such as Joe and Jane) tended to have high supervisors' ratings and those who had low test scores (such as Jake and John) tended to have low supervisors' ratings. This illustrates what we mean by a direct relationship (also called a positive relationship).
Table 1 Direct relationship, r = .89
| Resident | Clinical Test Scores | Supervisors' Ratings |
| Joe | 35 | 9 |
| Jane | 32 | 10 |
| Bob | 29 | 8 |
| June | 27 | 8 |
| Leslie | 25 | 7 |
| Homer | 22 | 8 |
| Milly | 21 | 6 |
| Jake | 18 | 4 |
| John | 15 | 5 |
Note the relationship in Table 1. It is not perfect. For example, although Joe has a higher clinical test score than Jane, Jane has a higher supervisors' rating than Joe. If the relationship were perfect, the value of the Pearson r would be 1.00. Being less than perfect, its actual value is .89. As you can see in Figure 1 (below), this value indicates a strong, direct relationship. Note: the value of the Pearson r is always between -1.00 and +1.00.
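Figure 1 Values of the Pearson r
[Figure: a scale running from -1.00 to 1.00. Values between -1.00 and 0.00 indicate an inverse relationship (perfect, strong, moderate, weak); values between 0.00 and 1.00 indicate a direct relationship (weak, moderate, strong, perfect).]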
In an inverse relationship (also called a negative relationship; see Figure 1 above), those who are high on one variable are low on the other. Such a relationship exists between the scores in Table 2. Those who are high on self-concept (such as Joe and Jane) are low on depression, while those who are low on self-concept (such as John and Jake) are high on depression. Again, the relationship is not perfect. The value of the Pearson r for the relationship in Table 2 is -.86.
Table 2 Inverse relationship, r = -.86
| Resident | Self-Concept Scores | Depression Scores |
| Joe | 10 | 2 |
| Jane | 8 | 1 |
| Bob | 9 | 0 |
| June | 7 | 5 |
| Leslie | 7 | 6 |
| Homer | 6 | 8 |
| Milly | 4 | 8 |
| Jake | 1 | 9 |
| John | 0 | 9 |
The relationships in Tables 1 and 2 are strong but, in each case, there are exceptions, which make the values of the Pearson r fall short of 1.00 and -1.00. As the number and size of the exceptions increase, the values of the Pearson r become closer to 0.00. Therefore, a value of 0.00 indicates the complete absence of a relationship (see Figure 1).
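The value of the Pearson r for the direct relationship in Table 1 can be computed by hand from the standard product-moment formula. The following is a minimal sketch of that formula using only Python's math module; the function name pearson_r is an assumption made for illustration.

```python
import math


def pearson_r(x_scores, y_scores):
    """Pearson product-moment correlation: the sum of the products of
    the paired deviations divided by the square root of the product of
    the two sums of squared deviations."""
    n = len(x_scores)
    mean_x = sum(x_scores) / n
    mean_y = sum(y_scores) / n
    sum_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(x_scores, y_scores))
    sum_xx = sum((x - mean_x) ** 2 for x in x_scores)
    sum_yy = sum((y - mean_y) ** 2 for y in y_scores)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Table 1: clinical test scores and supervisors' ratings, Joe through John.
clinical_test_scores = [35, 32, 29, 27, 25, 22, 21, 18, 15]
supervisors_ratings = [9, 10, 8, 8, 7, 8, 6, 4, 5]

r = pearson_r(clinical_test_scores, supervisors_ratings)
print(f"r = {r:.2f}")
```

Reversing the order of one of the two lists turns the direct relationship into an inverse one and changes the sign of r.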
It is important to note that a Pearson r is not a proportion and cannot be multiplied by 100 to get a percentage. For instance, a Pearson r of .50 does not correspond to 50% of anything (see Figure 1). To think about correlation in terms of percentages, we must convert the Pearson r to another statistic, the coefficient of determination, whose symbol is r², which indicates how to compute it: simply square r. Thus, for an r of .50, r² equals .25. If we multiply .25 by 100, we get 25%. What does this mean? Simply this: A Pearson r of .50 is 25% better than a Pearson r of 0.00; that is, 25% of the variance in one variable is accounted for by the other. Table 3 shows selected values of r, r², and the percentages you should think about when interpreting an r.
Table 3 Selected values of r and r²
| r | r² | Percentage better than zero |
| .90 | .81 | 81% |
| .63 | .40 | 40% |
| .50 | .25 | 25% |
| .25 | .06 | 6% |
| -.25 | .06 | 6% |
| -.50 | .25 | 25% |
| -.63 | .40 | 40% |
| -.90 | .81 | 81% |
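The conversion from r to the coefficient of determination is a single squaring step, as this short sketch shows for a few of the values in Table 3. The function name is an assumption made for illustration.

```python
def coefficient_of_determination(r):
    """Square the Pearson r; the sign of r disappears."""
    return r ** 2


for r in (0.90, 0.50, 0.25, -0.25, -0.50, -0.90):
    r_squared = coefficient_of_determination(r)
    print(f"r = {r:+.2f}   r squared = {r_squared:.2f}   "
          f"percentage = {r_squared * 100:.0f}%")
```

Note that a direct and an inverse relationship of the same strength give the same percentage, because squaring removes the sign.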
Exercise
Question for Discussion
Module #2
Section 1
Suppose we have a research hypothesis that says medical "research investigators who take a short course on the causes of HIV will be less fearful of the disease than research investigators who have not taken the course," and we test it by conducting an experiment in which a random sample of research investigators is assigned to take the course and another random sample is designated as the control group. (Note: random sampling is preferred because it precludes any bias in the assignment of subjects to the groups and because we can test for the effect of random errors with significance tests; we cannot test for the effects of bias.)
Let's suppose that at the end of the experiment the experimental group gets a mean of 16.61 on a fear of HIV scale and the control group gets a mean of 29.67 (where the higher the score, the greater the fear of HIV). These means support our research hypothesis. But can we be certain that our research hypothesis is correct? If you've been reading various topics on statistics, you already know that the answer is "no" because of the Null Hypothesis, which says that there is no true difference between the means; that is, the difference was created merely by the chance errors created by random sampling (these errors are known as sampling errors). Put more simply, unrepresentative groups may have been assigned to the two conditions purely at random.
The t test is often used to test the null hypothesis regarding the observed difference between two means (to test the null hypothesis regarding two medians, the median test is used; it is a specialized form of the chi square test). For the example we are considering, a series of computations (which are beyond the scope of this paper) would be performed to obtain a value of t (which, in this case, is 5.38) and a value of degrees of freedom (which, in this case, is df = 179). These values are not of any special interest to us except that they are used to get the probability (p) that the null hypothesis is true. In this particular case, p is less than .05. Thus, in a research report, you may read a statement such as this:
"The difference between the means is statistically significant (t = 5.38, df = 179, p < .05)."
The term statistically significant indicates that the null hypothesis has been rejected. You will recall that when the probability that the null hypothesis is true is .05 or less (such as .01 or .001), we reject the null hypothesis. When something is unlikely to be true, because it has a low probability of being true, we reject it.
Having rejected the null hypothesis, we are in a position to assert that our research hypothesis probably is true (assuming no procedural bias was allowed to affect the results, such as testing the control group immediately after a major news story on a celebrity with AIDS, while testing the experimental group at an earlier time).
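The computations behind a t test for two means can be sketched as follows. This is an illustration only: it uses the common pooled-variance form of the independent-samples t test, the two small sets of scores are hypothetical (they are not the data behind t = 5.38), and the critical value of 2.306 for df = 8 at the .05 level (two-tailed) is taken from a standard t table.

```python
import math
import statistics


def pooled_t(group_1, group_2):
    """Independent-samples t test with a pooled variance estimate.
    Returns (t, df)."""
    n1, n2 = len(group_1), len(group_2)
    df = n1 + n2 - 2
    pooled_variance = (
        (n1 - 1) * statistics.variance(group_1)
        + (n2 - 1) * statistics.variance(group_2)
    ) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    t = (statistics.mean(group_1) - statistics.mean(group_2)) / standard_error
    return t, df


# Hypothetical fear-of-HIV scores, invented for this sketch.
experimental = [15, 17, 16, 18, 14]
control = [28, 30, 29, 31, 27]

t, df = pooled_t(experimental, control)
CRITICAL_T_05 = 2.306  # df = 8, two-tailed, from a standard t table

print(f"t = {t:.2f}, df = {df}")
print("Statistically significant at .05:", abs(t) >= CRITICAL_T_05)
```

The sign of t only reflects which group was entered first; the decision depends on its absolute size relative to the critical value.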
What leads a t test to give us a low probability? Three things:
A special type of t test is also applied to correlation coefficients. Suppose we drew a random sample of 50 medical students and correlated their hand size with their GPAs and got an r of .19. The null hypothesis says that the true correlation in the population is 0.00 - that we got .19 merely as the result of sampling errors. For this example the t test indicates that p > .05. Since the probability that the null hypothesis is true is greater than 5 in 100, we do not reject the null hypothesis; we have a statistically insignificant correlation coefficient. In other words, for n = 50, an r of .19 is not significantly different from an r of 0.00. When reporting the results of the t test for the significance of a correlation coefficient, it is better not to mention the value of t. Rather, it is better to indicate only whether or not the correlation is significant at a given probability level.
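The t test for a correlation coefficient can be sketched with the standard formula t = r√(n − 2) / √(1 − r²), with df = n − 2. The sketch below applies it to the hand size example; the critical value of 2.011 for df = 48 at the .05 level (two-tailed) is taken from a standard t table, and the function name is an assumption made for illustration.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson r differs from 0.00.
    Returns (t, df)."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


t, df = t_for_correlation(r=0.19, n=50)
CRITICAL_T_05 = 2.011  # df = 48, two-tailed, from a standard t table

print(f"t = {t:.2f}, df = {df}")
print("Statistically significant at .05:", abs(t) >= CRITICAL_T_05)
```

Because the formula includes n, the same r of .19 could reach significance in a much larger sample.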
Exercise
Question for Discussion
Module #2
Section 2
Topic: One-Way Analysis of Variance (ANOVA)
In the topic on the t test, you learned that it is used to test the null hypothesis for the observed difference between two sample means. An alternative test for this problem is the analysis of variance (often called ANOVA), sometimes called the F test.
Instead of t, it yields a statistic called F, as well as degrees of freedom (df), sum of squares, mean square, and a p value, which indicates the probability that the null hypothesis is correct. As with the t test, the only value of interest to the typical user of research is the value of p. By convention, when p equals .05 or less (such as .01 or .001), we reject the null hypothesis and declare the result to be statistically significant.
Because the t test and ANOVA are based on the same theory and assumptions, when we compare two means, both tests yield exactly the same value of p and, hence, lead to the same conclusion regarding significance. So for two means, both tests are equivalent and either test can be used to get the same results. However, that is where the comparison stops! Note that a single t test can compare only two means, but a single ANOVA can compare a number of means, which is a great advantage.
Suppose, for example, that as a medical researcher you tested three drugs for treating depression in an experiment and obtained the following means and standard deviations.
Table 1 Posttest means and standard deviations of depression scores for three drugs
| Drug A | Drug B | Drug C |
| M = 6.00 | M = 5.50 | M = 2.33 |
Since the higher the score the greater the depression, inspection of the means shows that there are three observed differences:
The null hypothesis says that the entire set of three differences was created by sampling error. Through a series of computations that are beyond the scope of this paper, an ANOVA for these data yields this result: F = 10.837, df = 2, 15, p < .05. This result might be stated in a sentence or presented in a table such as Table 2, which is known as an ANOVA table. While it contains many values, which were used to arrive at the probability, we are only interested in the end result: the value of p.
As users of statistics know, when the probability is .05 or less, as it is here, we reject the null hypothesis. This means that the entire set of differences is statistically significant at the .05 level (See Table 2).
Note: the ANOVA does not tell us which of the three differences we listed are significant; it could be that only one, only two, or all three are significant. This needs to be explored with additional tests, known as multiple comparison tests.
Table 2 ANOVA for data in Table 1
| Source of Variation | df | Sum of Squares | Mean Squares | F |
| Between Groups | 2 | 47.445 | 23.722 | 10.837* |
| Within Groups | 15 | 32.833 | 2.189 | |
| Total | 17 | 80.278 | | |
*p < .05
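The way the values in an ANOVA table fit together can be sketched in a few lines: each mean square is a sum of squares divided by its degrees of freedom, and F is the between-groups mean square divided by the within-groups mean square. The sketch below starts from the sums of squares and degrees of freedom in Table 2; the critical value of 3.68 for df = 2, 15 at the .05 level is taken from a standard F table. Because the tabled sums of squares are rounded, the computed values agree with the table only to within rounding.

```python
# Sums of squares and degrees of freedom from Table 2.
ss_between, df_between = 47.445, 2
ss_within, df_within = 32.833, 15

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F = mean square between groups / mean square within groups.
f_ratio = ms_between / ms_within

CRITICAL_F_05 = 3.68  # df = 2, 15, from a standard F table

print(f"Mean squares: between = {ms_between:.3f}, within = {ms_within:.3f}")
print(f"F = {f_ratio:.2f}, df = {df_between}, {df_within}")
print("Statistically significant at .05:", f_ratio >= CRITICAL_F_05)
print(f"Total: SS = {ss_between + ss_within:.3f}, "
      f"df = {df_between + df_within}")
```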
As previously mentioned, there are a number of multiple comparison tests (such as Dunnett's test, Scheffé's test, and Tukey's test). Each one is based on different assumptions; they usually yield similar results, but not always. For the data we are considering, application of a popular multiple comparison test, Scheffé's test, yields these probabilities:
(1) for Drug C vs. A, p < .05
(2) for Drug C vs. B, p < .05
(3) for Drug B vs. A, p > .05
Thus, we have found that Drug C is significantly better than Drugs A and B, but that Drugs B and A are not significantly different from each other.
In review, an ANOVA tells us whether a set of differences, overall, is significant. If so, we can use an appropriate multiple comparison test (Dunnett's, Scheffé's, or Tukey's test) to determine which pairs of means are significantly different from each other.
In this topic, we have been considering a One-Way ANOVA (also known as a single-factor ANOVA). It is called this because we have classified the subjects in only one way: in terms of which drug they took. There is another type of ANOVA in which subjects are classified in two ways, appropriately called Two-Way ANOVA, which is discussed in a different topic.
Exercise
Question for Discussion
Module #2
Section 3
Topic: Two-Way Analysis of Variance (ANOVA)
In the one-way ANOVA, we saw how ANOVA can be used to test for the overall significance of a set of means when subjects have been classified in one way. Often, however, it is desirable to look at a two-way classification such as (1) which drug was taken and (2) how long subjects have been depressed. Table 1 (below) shows the means for such a study (note: for instructional purposes, only two drugs are shown. However, we may use ANOVA when there are more than two). Since higher depression scores indicate more depression, a low mean is desirable.
Table 1 Means for a study of depression: Drugs and length of depression comparisons
| | Drug A | Drug B | Row Total |
| Long-term | M = 8.11 | M = 8.32 | M = 8.22 |
| Short-term | M = 4.67 | M = 8.45 | M = 6.56 |
| Column Total | M = 6.39 | M = 8.38 | |
Although the subjects are classified in two ways, analysis of the table answers three questions. First, by comparing the column totals of 6.39 and 8.38, we can see that, overall, those who took Drug A are less depressed. It's important to notice that the mean of 6.39 for Drug A is based on both those who have long-term and those who have short-term depression; the same is true of the mean of 8.38 for Drug B. Thus, by comparing the column total means, we are answering the question of which drug is more effective in general without regard to how long subjects have been depressed. In analysis of variance, this is known as a main effect.
Each way in which subjects are classified yields a main effect in analysis of variance. Thus, since subjects were also classified in terms of their length of depression, there is a main effect for short-term vs. long-term, which can be seen by examining the row total means of 8.22 and 6.56. This main effect indicates that, overall, those with short-term depression are less depressed than those with long-term depression.
In this example, the most interesting question is the question of an interaction. The question is this: Is the effectiveness of the drugs dependent, in part, on the length of depression? By examining the individual cell means (the four means in Table 1 that are not row or column totals), we can see that the answer is "yes". Drug A is more effective for short-term than long-term depression (4.67 vs. 8.11) while Drug B is about equally effective for both types of depression (8.32 vs. 8.45). What is the practical implication of this interaction? The overall effectiveness of Drug A is almost entirely attributable to its effectiveness for short-term depression. That is, if a person has short-term depression, Drug A is indicated, but if a person has long-term depression, either drug is likely to be about equally effective.
For the data in Table 1, it turns out that p < .05 for both main effects and the interaction. Thus, we can reject the null hypotheses that say that the differences we are considering are the result of random errors. Of course, it does not always turn out this way. It's possible for one or two of the main effects to be significant but the interaction to be not significant; it's also possible for neither main effect to be significant while the interaction is significant, which is the case for the data in Table 2.
Table 2 Means for a study of depression: Drugs and gender comparisons
| | Drug A | Drug B | Row Total |
| Females | M = 8.00 | M = 5.00 | M = 6.50 |
| Males | M = 5.00 | M = 8.00 | M = 6.50 |
| Column Total | M = 6.50 | M = 6.50 | |
Notice that the column totals (6.50 vs. 6.50) in Table 2 indicate no main effect for Drug A vs. Drug B (in this case they are equal to one another). Likewise, the row totals (6.50 vs. 6.50) indicate no main effect for gender. But there is a very interesting finding: the interaction of drug type and gender, which indicates that for females, Drug B is superior, but for males, Drug A is superior. Note that if we had compared the two drugs in a One-Way ANOVA without also classifying the subjects according to gender (as we did here in the Two-Way ANOVA), we would have missed this important interaction.
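The main effects and the interaction in the two tables of this topic can be described numerically from the four cell means. The following is a descriptive sketch only, not a significance test: it assumes equal numbers of subjects in every cell (so that row and column totals are simple averages of the cell means), and it summarizes the interaction as a "difference of differences", which is zero when there is no interaction. The function name is an assumption made for illustration.

```python
def summarize_two_way(cell_means):
    """cell_means is ((row 1 Drug A, row 1 Drug B),
                      (row 2 Drug A, row 2 Drug B)).
    Returns (column means, row means, interaction contrast),
    assuming equal cell sizes."""
    (a_1, b_1), (a_2, b_2) = cell_means
    column_means = ((a_1 + a_2) / 2, (b_1 + b_2) / 2)
    row_means = ((a_1 + b_1) / 2, (a_2 + b_2) / 2)
    # Drug difference in row 1 minus drug difference in row 2.
    interaction = (a_1 - b_1) - (a_2 - b_2)
    return column_means, row_means, interaction


length_of_depression = ((8.11, 8.32), (4.67, 8.45))  # Table 1 cell means
gender = ((8.00, 5.00), (5.00, 8.00))                # Table 2 cell means

for label, cells in (("Table 1", length_of_depression), ("Table 2", gender)):
    columns, rows, interaction = summarize_two_way(cells)
    print(f"{label}: column means = {columns[0]:.2f}, {columns[1]:.2f}; "
          f"row means = {rows[0]:.2f}, {rows[1]:.2f}; "
          f"interaction contrast = {interaction:.2f}")
```

For Table 2 the column means and the row means are identical, yet the interaction contrast is large, which mirrors the pattern described above.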
Exercise
| | Praise Reward | Monetary Rewards | Row Totals |
| Young Adults | M = 50.00 | M = 60.00 | M = 55.00 |
| Older Adults | M = 60.00 | M = 50.00 | M = 55.00 |
| Column Total | M = 55.00 | M = 55.00 | |

| | Method A | Method B | Row Totals |
| High Aptitude | M = 100.00 | M = 85.00 | M = 92.50 |
| Low Aptitude | M = 100.00 | M = 85.00 | M = 92.50 |
| Column Total | M = 100.00 | M = 85.00 | |
Question for Discussion
Module #3
Section 1
Topic: Multiple Regression Correlation (MRC)
Multiple Regression Correlation (MRC), like the Analysis of Variance (ANOVA), is a statistical method from the "General Linear Model". According to Kerlinger & Pedhazur, multiple regression analysis can do anything ANOVA does. ANOVA is a special case of MRC. Both are algebraically the same and will produce statistically the same outcome (see Table 1).
Table 1 ANOVA vs. MRC outcomes
| | ANOVA | MRC |
| F ratio | 20.38 | 20.38 |
| df | 1, 18 | 1, 18 |
| p value | < .01 | < .01 |
| SS Between (ANOVA) / SS Regression (MRC) | 64.80 | 64.80 |
| SS Within (ANOVA) / SS Residual (MRC) | 57.20 | 57.20 |
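The equivalence of the two methods can be checked on a small example. The sketch below is an illustration under stated assumptions: the two groups of scores are hypothetical (they are not the data behind Table 1), the ANOVA is a one-way analysis with two groups, and the regression predicts the scores from a single 0/1 code for group membership.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical scores for two groups, invented for this sketch.
group_a = [4, 5, 6, 5, 4]
group_b = [7, 8, 9, 8, 7]
groups = (group_a, group_b)
scores = group_a + group_b
grand_mean = mean(scores)

# ANOVA: partition the variation into between and within groups.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)

# Regression: predict each score from a 0/1 code for group membership.
codes = [0] * len(group_a) + [1] * len(group_b)
code_mean = mean(codes)
slope = (
    sum((x - code_mean) * (y - grand_mean) for x, y in zip(codes, scores))
    / sum((x - code_mean) ** 2 for x in codes)
)
intercept = grand_mean - slope * code_mean
predicted = [intercept + slope * x for x in codes]
ss_regression = sum((p - grand_mean) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(scores, predicted))

# Both F ratios use df = 1 and N - 2.
df_error = len(scores) - 2
f_anova = (ss_between / 1) / (ss_within / df_error)
f_regression = (ss_regression / 1) / (ss_residual / df_error)

print(f"ANOVA: SS between = {ss_between:.2f}, SS within = {ss_within:.2f}, "
      f"F = {f_anova:.2f}")
print(f"MRC:   SS regression = {ss_regression:.2f}, "
      f"SS residual = {ss_residual:.2f}, F = {f_regression:.2f}")
```

With a 0/1 code, the regression's predicted score for each subject is simply that subject's group mean, which is why the two partitions of the variation coincide.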
ANOVA is used primarily in scientific experiments, while MRC is used primarily in quasi-experimental designs. Both have null hypotheses (as described under the Null Hypothesis topic); however, they look at different elements of the model. ANOVA looks to see if population mean 1 equals population mean 2, while MRC looks at the correlation of the Independent Variable (IV) to the Dependent Variable (DV), as follows:
MRC ANOVA
Both MRC and ANOVA have the same underlying assumptions:
Several key components of MRC and ANOVA differ from one another:
| MRC Key Components | ANOVA Key Components |
| 1. Impractical to hold subjects for | 1. Causation |
| 2. Not appropriate to randomize | 2. Experimental control |
| 3. May be too expensive to randomize | 3. Randomization of subjects |
| 4. May be logistically impossible to | 4. Ability to isolate variables |
As noted earlier, MRC is used in quasi-experimental design experiments. Quasi-experiments have treatments, outcome measures and experimental units, but do not use random assignment to create the comparisons from which treatment-caused change is inferred. Instead, the comparisons depend on nonequivalent groups that differ from each other in many ways other than the presence of a treatment whose effects are being tested.
In the absence of randomization, the researcher is faced with the task of identifying and separating the effects of the treatments from the effects of all other factors affecting the dependent variable (DV). Campbell & Stanley (1963) warned researchers about "a feeling of hopelessness with regard to achieving experimental control which leads to the abandonment of such efforts in favor of more informal methods of investigation". Their defense of quasi-experimental design was that it is "deemed worthy of use when better designs are not feasible". In other words, support for the use of quasi-experimental design is not based on its intrinsic worth; rather, it is positioned as an approach for when better designs are not possible (Sawilowsky, 1997).
Campbell & Stanley (1963) suggest the use of quasi-experimental design for the many "social science settings" in which there is no way to randomly assign participants.
In regression analysis one is trying either to predict or to explain phenomena. In predictive research the main emphasis is on practical applications, whereas in explanatory research the main emphasis is on understanding phenomena. This is not to say that the two research activities are unrelated, or that they have no bearing on each other. Predictive research may, for example, serve as a source of hunches and insights that might lead to theoretical considerations. Yet the importance of distinguishing between the two types of research activities cannot be overemphasized.
MRC uses IVs and DVs in the context of explanatory research, whereas ANOVA uses predictor and criterion in the context of predictive research. Prediction is really a special case of explanation; it can be subsumed under theory and explanation as noted in Table 2. This is explanation:
Table 2 Explanation of a variable
________________________________
If p, then q, under conditions r, s and t
________________________________
The explanation in Table 2 is also a prediction: a prediction from p to q, as shown in Table 3.
Table 3 Prediction of a variable
______________________________________
Prediction from p (and r, s, and t) to q
________________________________
Kerlinger & Pedhazur (1973) indicate that, while MRC is well-suited to predictive analysis, it is more fundamentally oriented to "explanatory analysis". We do not simply throw variables into a regression equation; we enter them, whenever possible, at the dictates of theory and reasonable interpretation of empirical research findings.
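To make the idea of entering variables into a regression equation concrete, the following sketch fits a least-squares equation with two IVs and reports the proportion of DV variance explained (R squared). Everything in it is an assumption made for illustration: the variable names (hours_studied, prior_score, exam_score), the data values, and the helper functions are hypothetical, and the fit uses the ordinary normal equations rather than any particular statistical package.

```python
# Hypothetical data: two IVs assumed to be chosen on theoretical grounds, and one DV.
hours_studied = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
prior_score = [55.0, 60.0, 50.0, 70.0, 65.0, 80.0]
exam_score = [58.0, 66.0, 64.0, 80.0, 79.0, 93.0]

def solve(matrix, vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit(ivs, dv):
    """Least-squares fit of the DV on the IVs: returns (coefficients, R squared)."""
    # Each row is [1.0, iv_1, iv_2, ...]; the leading 1.0 carries the intercept.
    rows = [[1.0] + list(values) for values in zip(*ivs)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dv)) for i in range(k)]
    coefficients = solve(xtx, xty)
    predicted = [sum(b * x for b, x in zip(coefficients, r)) for r in rows]
    mean_dv = sum(dv) / len(dv)
    ss_residual = sum((y - p) ** 2 for y, p in zip(dv, predicted))
    ss_total = sum((y - mean_dv) ** 2 for y in dv)
    return coefficients, 1.0 - ss_residual / ss_total

coefficients, r_squared = fit([hours_studied, prior_score], exam_score)
print("Intercept and slopes:", coefficients)
print("R squared:", r_squared)
```

The code only estimates the equation. Deciding which IVs belong in it, and in what order, remains a matter of theory, as the passage above stresses.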
Exercise
1. Is MRC used primarily in experimental design research? If so, why? If not, why not?
Question for Discussion
Foundations of Program Evaluation
Standards of Evaluation - Education Programs
Standards for Evaluations of Education Programs, Projects, and Materials (Congressional Education Joint Committee, 1981):
What is the difference between evaluation and research?
What is the role of evaluation (Heath, 1969):
Review (Education Programs):
______________________________________________________
Donald T. Campbell
Many call him the "father" of scientific evaluation.
Campbell and Stanley (1963) wrote the seminal text on research design titled: Experimental and Quasi-experimental Designs for Research.
Experiment: a study that has randomization as its basis and can be repeated over and over again with the same results.
Quasi-experiment: an experiment that lacks randomization as its basis.
Campbell was in favor of using both quantitative and qualitative procedures. He wanted qualitative methods to complement quantitative methods rather than to replace them.
All evaluations should be open for criticism and accountability.
Campbell does not recommend evaluations:
problems.
______________________________________________________
(Shadish, Cook and Leviton, 1991)
Modern social program evaluation emerged in the 1960s because of massive federal involvement in social welfare spending under the Great Society movement during the Johnson and Nixon presidencies (e.g., the War on Poverty, Medicare/Medicaid, etc.).
What percent of the U.S. annual budget is earmarked for research and evaluation?
The first federal program to require "evaluation" was the juvenile delinquency program enacted by Congress in 1962 (Weiss, 1987).
Stakeholder: Those who have a "stake" in the program or its "evaluation".
There is a psychological phenomenon in which an evaluation demonstrates that a program is not effective, yet the person continues to support the program. Why?
Then how does change occur?
Program evaluation assumes that education problem solving can be improved through incremental improvements in existing programs, better designs of new ones, and the termination of bad programs and their replacement with better ones.
Most education programs get replaced or die out for political or economic reasons, not because of evaluation results.
Internal validity is the sine qua non (Campbell and Stanley, 1963)
Lincoln and Guba (1985, 1986) argued that there is no reality beyond what we each construct, so causality, generalizability, and truth have little useful meaning.
Value Component: Early evaluators thought that they could be value-free with respect to justice, equality, liberty, and human rights.
Theory of Valuing (Beauchamp, 1982)
Evaluation is about determining value, merit or worth.
Most evaluators use descriptive valuing. They describe the values held by the stakeholders; however, the evaluator makes no claim that this is the best program.
Prescriptive theories have a heavy burden of justifying why?
(Michael S. Scriven, 1983)
Without evaluation, we have no means of allocating resources; waste, fraud, and incompetence would go undetected (Scriven, 1983).
The fiscal benefits of the evaluation should always exceed the cost of conducting the evaluation.
Scriven coined formative evaluation and summative evaluation:
How biases are introduced into evaluations:
How do we correct bias in education programs (goal-free evaluations):
Why should the evaluator be goal-free in evaluating a program:
Cost-benefit Analysis: "Translates all inputs and outputs into monetary units, yielding a single cost-benefit ratio of net fiscal gains or losses."
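The arithmetic behind a cost-benefit analysis is simple once every input and output has been expressed in dollars. The sketch below is only an illustration: the cost and benefit categories and all dollar amounts are hypothetical, and expressing the ratio as total benefits divided by total costs is an assumption about how the ratio is reported.

```python
# Hypothetical dollar figures for one education program (illustration only).
costs = {"faculty time": 12000.0, "materials": 3000.0, "evaluation": 5000.0}
benefits = {"reduced remediation": 18000.0, "grant funding retained": 12000.0}

def cost_benefit(costs, benefits):
    """Return (benefit-to-cost ratio, net fiscal gain or loss) in the same monetary unit."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    ratio = total_benefit / total_cost   # above 1.0 means benefits exceed costs
    net = total_benefit - total_cost     # positive is a net gain, negative a net loss
    return ratio, net

ratio, net = cost_benefit(costs, benefits)
print("Benefit-to-cost ratio:", ratio)
print("Net fiscal gain or loss:", net)
```

The hard part in practice is not this calculation but placing a defensible dollar value on each input and output.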
Qualitative vs. Quantitative Research: An Overview
- Typically reflective of normal or everyday life
- Individuals, groups, societies, organizations, etc.
- "From the inside"
Qualitative Analysis
- Review with informants
- One main task: Explain ways people in part come to understand, account for, take action and manage their day-to-day situations
Advantages and Disadvantages
Advantages                                  Disadvantages
1. Helps establish interpretive framework   1. Concepts not always clearly defined
2. Helps reduce bias                        2. Field notes guarded
3. Processual                               3. Territoriality
4. Adds "punch" to research reports         4. Criticism taboo
5. Generally high in validity               5. Lots of data
                                            6. Analysis procedures poorly defined
                                            7. Generally low reliability
Relationship Between Methods
Qualitative Research                Quantitative Research
1. Towards discovery                1. Towards testing a hypothesis
2. Induction                        2. Deduction
3. Specific to general              3. General to specific
4. Fieldwork                        4. Office/desk/lab
5. Small n                          5. Large N
6. Non-statistical                  6. Statistical
7. Describe, interpret, explain     7. Predict, control

Steps of Qualitative Method
Qualitative Data Sources
- Structured
- Semi-structured
- Unstructured
Early Field Reactions
Late Field Reactions
Computers & Quantitative Analysis