Understanding and Using Statistical Research Methodologies in Medical Education Programs:

A Primer for Medical Students and Residents

 

MODULES 1, 2, & 3

 

Danford L. Wilson, Ph.D.

Director for Graduate Medical Education

University of Kansas – School of Medicine

3901 Rainbow Boulevard, Kansas City, KS 66160-7300

©


Introduction

  A. Statement of Purpose:

    This statistical module is designed for residents and students who are not epidemiologists and who do not wish to use extensive epidemiological statistics in their research. It will allow the user to understand the basic statistics used in medical research and to recognize their use and meaning when reading medical and scientific journals.

    This module establishes a framework for understanding the null hypothesis and for using various statistics {Mean, Median & Mode; the Chi Square Test; the Pearson Correlation Coefficient (Pearson’s r); the t Test; One-Way Analysis of Variance; Two-Way Analysis of Variance; and Multiple Regression}, with emphasis placed on understanding the role of "qualitative" and "quantitative" evaluation in medical education research.

  B. Goal:

Overall Objectives

Residents and Students will:

  • Understand what the null hypothesis is and what it means when you reject the null hypothesis or when you fail to reject the null hypothesis.
  • Understand what statistically significant means.
  • Understand and discern the difference between qualitative and quantitative research using various statistical methods {Mean, Median & Mode, Chi Square, Pearson r, t test, ANOVA one-way and two-way analysis of variance, Multiple Regression and ANCOVA}.
  • Understand the history of qualitative and quantitative research in medicine.
  • Identify and understand the strengths and limitations of qualitative and quantitative data.
  • Analyze a real qualitative data set.
  • Identify qualitative assessments that are being used today in medical research.
  • Identify and discuss the seminal bodies of qualitative work.
  • Identify political limitations of research results (politically correct).
  • Understand the function of focus groups and their applications in medical research.
  • Understand and be aware of the available software that can be used in providing data results (SAS, SPSS, etc…).

C. Logistical Considerations:

  1. Residency year: PGY 2 and PGY 3

  2. Length of program and number of modules:

There are three main statistical modules (each containing 2 to 4 sections) with a "seat time" of approximately 45 minutes per main module and 30 minutes for the "Foundation of Program Evaluation". The total time for all modules will be approximately 3 hours and 30 minutes (allowing for a 15-minute question & answer period for each main module):

  • Understanding and using the Mean, Median & Mode of a data set; Introduction to the Chi Square Test; Calculating the Pearson Correlation Coefficient (Pearson’s r) and its meaning.
  • Understanding when and where to use and how to interpret the t Test; ANOVA – One-Way Analysis of Variance; ANOVA – Two-Way Analysis of Variance in medical education research.
  • Understanding Multiple Regression in medical research: when to use it, how to interpret the results, and what it means in your research paper.
D. Materials:
      Handouts
      Evaluation forms

Module Summaries and Objectives

Module 1: Basic Statistical Methods and Their Use

Module One, Basic Statistical Methods and Their Use, gives residents a simple overview of the use of the null hypothesis; the mean, median, and mode; the chi square test; and Pearson’s r in research, and it defines common concepts in their use.

At the end of this module, residents and students will be able to:

  • Describe and understand the way data are used in medical and scientific research papers.
  • Compare and understand the results of one study against those of similar studies. Understand chi square and Pearson’s r.
  • Define qualitative vs. quantitative research.
  • Define "reliability" and "validity" in the use of data and statistical methods.

 

Module 2: Advanced Statistical Methods: The t Test and ANOVA

Module Two, Advanced Statistical Methods: The t Test and ANOVA, provides residents and students with a simple overview of how to use and interpret the advanced statistical methods that appear in medical and scientific journals. Residents and students will learn tools and techniques for using these statistics in preparation for writing their own research papers in their residency and clinical programs, without having to use heavy epidemiological statistics.

At the end of this module, residents and students will be able to:

  • Describe the purpose and use of the t Test and ANOVA (One-Way and Two-Way Analysis of Variance).
  • Describe the type of data that must be used with these types of tests.
  • Understand why the user must set alpha (α) at a level (preferably .05 or less) to test a hypothesis.
  • Understand what "statistically significant" means using these statistical methods (t Test, and ANOVA).
  • Discuss strategies in the development of their research paper.
  • Read and understand research papers that use the t Test and/or ANOVA in analyzing data.

 

Module 3: Advanced Statistical Methods: Multiple Regression and ANCOVA (Analysis of Covariance)

Module Three, Advanced Statistical Methods: Multiple Regression and ANCOVA (Analysis of Covariance), provides students and residents with a simple overview of how to use and interpret statistical methods that appear in medical and scientific journals. Residents and students will learn tools and techniques for using and solving statistics in preparation for writing their own research papers during their residency year, without having to learn epidemiological statistics.

At the end of this module, residents and students will be able to:

  • Describe the purpose and use of Multiple Regression and ANCOVA (Analysis of Covariance).
  • Describe the type of data that must be used with these types of tests.
  • Understand what "statistically significant" means using these statistical methods (Multiple Regression & ANCOVA).
  • Discuss strategies in using these statistics in the development of their research paper.
  • Read and understand research papers that use Multiple Regression and/or ANCOVA in analyzing data.

Module #1

Section 1

Topic: Introduction to the Null Hypothesis

Suppose we drew a random sample of medical students and a random sample of residents, administered a self-report measure of medical knowledge, and computed the mean (the most commonly used average) for each group. Furthermore, suppose the mean for the medical students is 63.00 and the mean for residents is 68.00. Where did the five-point difference come from? There are three possible explanations:

  1. Perhaps the population of residents is truly more knowledgeable about medicine than the population of medical students, and our samples correctly identified the difference. (In fact, our research hypothesis may have been that residents are more knowledgeable about medicine than medical students – which now appears to be supported by the data.)
  2. Perhaps there was a bias in procedures. By using random sampling, we have ruled out sampling bias, but other procedures such as measurement may be biased. For example, maybe the residents were contacted during September, when many clinical events (conferences, lectures, etc…) take place and the medical students were contacted during the gloomy month of December when no clinical events (conferences, lectures, etc…) took place. The only way to rule out bias as an explanation is to take physical steps to prevent it. In this case, we would want to make sure that the medical knowledge for both groups was measured in the same way at the same time.
  3. Perhaps the populations of residents and medical students are the same but the samples are unrepresentative of their populations because of random sampling errors. For instance, the random draw may have given us a sample of residents who are more knowledgeable, on the average, than their population.

This third explanation has a name – it is the Null Hypothesis. The general form in which it is stated varies from researcher to researcher. Here are three versions, all of which are consistent with each other:

  A. Version "A" of the Null Hypothesis:
  • The observed difference was created by sampling errors. (Note: The term sampling error refers only to random errors, not errors created by a bias.)
  B. Version "B" of the Null Hypothesis:
  • There is no true difference between the two groups. (Note: The term true difference refers to the difference we would find in a census of the population, that is, the difference we would find if there were no sampling errors.)
  C. Version "C" of the Null Hypothesis:
  • The true difference between the two groups is zero.

Significance tests determine the probability that the null hypothesis is true. (More precisely, it is the probability of obtaining a difference at least as large as the one observed if the null hypothesis were true; the simpler wording is used here for convenience.) The researcher sets the probability level. Suppose, for our example, we use a significance test and find that the probability that the null hypothesis is true is less than 5 in 100. This would be stated as p < .05, where p stands for probability. Researchers should always state the probability level used in their research findings. The level can in principle be set to any value between 0 and 1; however, we play it safe by setting it to .05, which is the most widely accepted conventional standard. Of course, if the chance that something is true is less than 5 in 100, it’s a good bet that it’s not true. If it’s probably not true, we reject the null hypothesis, leaving us with only the first two explanations that we started with as viable explanations for the difference.

There is no rule of nature that dictates at what probability level the null hypothesis should be rejected. However, conventional wisdom suggests that .05 or less (such as .01 or .001) are reasonable.

When we fail to reject the null hypothesis because the probability is greater than .05, we do just that: We "fail to reject" the null hypothesis and it stays on our list of possible explanations; we never "accept" the null hypothesis as the only explanation. Remember, there are three possible explanations (see above) and failing to reject one of them does not mean that you are accepting it as the only explanation.

An alternative way to say that we have rejected the null hypothesis is to state that the difference is statistically significant. Thus, if we state that a difference is statistically significant at the .05 level (meaning .05 or less), it is equivalent to stating that the null hypothesis has been rejected at that level.
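To make the third explanation (sampling error) more concrete, here is a small simulation sketch. It draws two random samples from one and the same population many times and counts how often the two sample means differ by five points or more. The population mean of 65, the standard deviation of 15, and the group sizes are hypothetical values chosen only for illustration; they do not come from the example in this topic.

```python
import random

def share_of_large_differences(n_per_group, threshold=5.0, trials=2000, seed=0):
    """Draw two random samples from the SAME population many times and
    return the share of trials in which the two sample means differ by at
    least `threshold` points purely because of sampling error."""
    rng = random.Random(seed)        # fixed seed so the run is repeatable
    pop_mean, pop_sd = 65.0, 15.0    # hypothetical population values
    hits = 0
    for _ in range(trials):
        students = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        residents = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        diff = sum(residents) / n_per_group - sum(students) / n_per_group
        if abs(diff) >= threshold:
            hits += 1
    return hits / trials

small = share_of_large_differences(n_per_group=10)
large = share_of_large_differences(n_per_group=100)
print(f"n = 10 per group:  {small:.3f}")
print(f"n = 100 per group: {large:.3f}")
```

Under these assumptions, a five-point difference turns up by chance far more often with 10 subjects per group than with 100, which is why a significance test has to take sample size into account before the null hypothesis can be rejected.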

When you read research reported in academic journals, or research papers, you will find that the null hypothesis is seldom stated by researchers, who assume that you know that the sole purpose of a significance test is to test a null hypothesis. Instead, researchers tell you which differences were tested for significance, which significance test they used, and which differences were found to be statistically significant. It is more common to find null hypotheses stated in theses and dissertations since committee members may wish to make sure that the students they are supervising understand the reason they have conducted a significance test.

Exercise

  1. How many explanations are there for the differences in medical knowledge between residents and medical students in the example in this topic?
  2. What does the null hypothesis say about sampling errors?
  3. Does the term sampling error refer to random errors or to bias?
  4. The null hypothesis says that the true difference equals what value?
  5. What is used to determine the probability that a null hypothesis is true?
  6. What does p < .05 stand for?
  7. Do we reject the null hypothesis when the probability of truth is high or when it is low?
  8. What do we do if the probability is greater than .05?
  9. What is an alternative way of saying that we have rejected the null hypothesis?
  10. Are you more likely to find a null hypothesis stated in a journal article or in a thesis?

Question for Discussion

  1. We all use probabilities in everyday activities to make decisions. For example, before we cross a busy street, we estimate the odds that we will get across the street safely. Briefly describe one other specific use of probability in everyday decision-making.

Module #1

Section 2

Topic: The Mean, Median, and Mode

The most frequently used average is the Mean, which is the balance point in a distribution. Its computation is simple – just sum (add up) the scores and divide by the number of scores. The most common symbol for the mean in academic journals is M (for the mean of a population) or m (for the mean of a sample). The symbol preferred by statisticians is X̄, which is pronounced "X-bar".

Because the mean is very frequently used as the average, let’s consider its formal definition, which is the value around which the deviations sum to zero. You can see what this means by considering the scores in Table 1. When we subtract the mean of the scores (which is 4.0) from each of the scores, we get the deviations (whose symbol is x). If we sum the deviations, we get zero, as shown in Table 1.

Table 1 Scores and deviation scores

  X     minus     M     equals      x

  1       -      4.0      =       -3.0
  1       -      4.0      =       -3.0
  1       -      4.0      =       -3.0
  2       -      4.0      =       -2.0
  2       -      4.0      =       -2.0
  4       -      4.0      =        0.0
  6       -      4.0      =        2.0
  7       -      4.0      =        3.0
  8       -      4.0      =        4.0
  8       -      4.0      =        4.0

Sum of the deviations (x) = 0.0

Note: If you take any set of scores, compute their mean, and follow the steps in Table 1, the sum of the deviations will always equal zero (it might be slightly off from zero if you use a rounded mean such as using 20.33 as the mean when its precise value is 20.33333333).

Considering the formal definition, you can see why we also informally define the mean as the balance point in a distribution. The positive and negative deviations balance each other out.
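The arithmetic behind Table 1 (scores and deviation scores) can be checked with a few lines of code. This is only an illustrative sketch; the variable names are ours.

```python
scores = [1, 1, 1, 2, 2, 4, 6, 7, 8, 8]   # the X column of Table 1

mean = sum(scores) / len(scores)           # sum the scores, divide by how many there are
deviations = [x - mean for x in scores]    # each score minus the mean

print(mean)             # 4.0
print(deviations)       # [-3.0, -3.0, -3.0, -2.0, -2.0, 0.0, 2.0, 3.0, 4.0, 4.0]
print(sum(deviations))  # 0.0
```

The negative deviations add up to -13.0 and the positive deviations add up to 13.0, so the two sides balance exactly.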

A major drawback of the mean is that it is drawn in the direction of extreme scores. Consider the following two sets of scores and their means:

Scores for Set A: 1, 1, 1, 2, 3, 6, 7, 8, 8

                        M = 4.11

Scores for Set B: 1, 2, 2, 3, 4, 7, 9, 25, 32

                        M = 9.44

Notice in both sets there are nine scores and the two distributions are very similar except for the scores of 25 and 32 in Set B, which are much higher than the others and, thus, create a skewed distribution. Notice that the two very high scores have greatly pulled up the mean for Set B; in fact, the mean for Set B is more than twice as high as the mean for Set A because of the two high scores.

When a distribution is highly skewed, we use a different average, the Median, which is defined as the middle score. To get an approximate median, put the scores in order from low to high as they are for Sets A and B (above), and then count to the middle. Since there are nine scores in Set A, the median (middle score) is 3 (which is five scores up from the bottom). For Set B, the median (middle score) is 4 (which is five scores up from the bottom), which is more representative of the center of this skewed distribution than the mean, which we noted was 9.44. Thus, one use of the median is to describe the average of skewed distributions. Another use is to describe the average of ordinal data, one of the four levels of measurement summarized by the acronym NOIR (Nominal, Ordinal, Interval, and Ratio).

A third average, the Mode, is simply the most frequently occurring score. For Set B, there are more scores of 2 than any other score; thus, 2 is the mode. The mode is sometimes used in informal reporting but is very seldom used in formal reports of research.
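The three averages for Sets A and B can be verified with Python’s standard `statistics` module. This is a minimal sketch; the names `set_a` and `set_b` are ours.

```python
import statistics

set_a = [1, 1, 1, 2, 3, 6, 7, 8, 8]
set_b = [1, 2, 2, 3, 4, 7, 9, 25, 32]

for name, scores in (("Set A", set_a), ("Set B", set_b)):
    mean = round(statistics.mean(scores), 2)   # balance point
    median = statistics.median(scores)         # middle score
    mode = statistics.mode(scores)             # most frequent score
    print(name, "mean =", mean, "median =", median, "mode =", mode)

# Set A mean = 4.11 median = 3 mode = 1
# Set B mean = 9.44 median = 4 mode = 2
```

Notice that the two extreme scores in Set B more than double the mean, while the median moves only from 3 to 4.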

Because there is more than one type of average, it is vague to make a statement such as, "The average is 4.11." Rather, we should indicate the specific type of average being reported with statements such as, "The mean is 4.11."

A synonym for the term averages is measures of central tendency. Although the latter is seldom used in reports of scientific research, you may encounter it in other research and statistics.

 

Exercise

  1. Which average is defined as the most frequently occurring score?
  2. Which average is defined as the balance point in a distribution?
  3. Which average is defined as the middle score?
  4. What is the formal definition of the mean?
  5. How is the mean calculated?
  6. Should the mean be used for highly skewed distributions? Why and/or why not?
  7. Should the median be used for highly skewed distributions? Why and/or why not?
  8. What is a synonym for the term averages?

Question for Discussion

  1. Suppose a fellow student gave a report in class and said, "the average was 25.88." For what additional information should you ask? Why?

Module #1

Section 3

Topic: Introduction to the Chi Square Test

Suppose we drew at random a sample of 200 members of the American Medical Association and asked them whether they were in favor of a proposed change to their bylaws. The results are shown in Table 1. But do these observed results reflect the true results that we would have obtained if we had questioned the entire population? (Note: We are using the term true results here to stand for the results of a census of the entire population. The results of a census are true in the sense that they are free of sampling errors. Of course, there may also be measurement errors, which we are not considering here.)

Table 1 Members’ approval of a change in bylaws

Response        Percentage      Number
______________________________________
Yes             60.0%           n = 120
No              40.0%           n = 80
______________________________________
Total           100.0%          n = 200
______________________________________

Remember that the null hypothesis says that the observed difference was created by random sampling errors; that is, in the population, the true difference is zero. Put another way, the observed difference (n = 120 vs. n = 80) is an illusion created by chance error.

 

The usual test of the null hypothesis when we are considering frequencies (that is, number of cases or n) is Chi Square, whose symbol is:

χ²

It turns out that after doing some computations, which are beyond the scope of this paper, for the data in Table 1 (above), the results are:

χ² = 8.00, df = 1, p < .05

What does this mean for a user of research who sees this in a report? The values of chi square and degrees of freedom (df) were calculated solely to obtain the probability that the null hypothesis is correct. That is, Chi Square and degrees of freedom are not descriptive statistics that you should attempt to interpret. Rather, think of them as sub-steps in the mathematical procedure for obtaining the value of p. Thus, the user of research should concentrate on the fact that p is less than .05. As you probably remember, when the probability (p) that the null hypothesis is correct is .05 or less, we reject the null hypothesis. (Remember, when the probability that something is true is less than 5 in 100 – a low probability – conventional wisdom suggests that we should reject it as being true.) Thus, the difference we observe in Table 1 was probably not created by random sampling errors, and we can say that the difference is statistically significant at the .05 level.

Up to this point, we have concluded that the difference we observed in the sample was probably not created by sampling errors. So where did the difference come from? Two possibilities remain:

  1. Perhaps there was a bias in procedures such as the person asking the question in the survey leading the respondents by talking enthusiastically about the proposed change in the bylaws. If we are convinced that adequate measures were taken to prevent procedural bias, we are left with only the next possibility as a viable explanation.
  2. Perhaps the population of physicians is, in fact, in favor of the proposed change, and this fact is correctly identified by studying the random sample.

Now let’s consider some results from a survey in which the null hypothesis was not rejected. Table 2 shows the numbers and percentages of subjects in a random sample from a population of Resident Program Directors who prefer each of three methods for teaching statistics.

 

Table 2 Resident Program Directors’ preferences for method

Method A                Method B                Method C

n = 30 (37.97%)         n = 27 (34.18%)         n = 22 (27.85%)

 

 

In Table 2, there are three differences (30 for A versus 27 for B, 30 for A versus 22 for C, and 27 for B versus 22 for C). The null hypothesis says that this set of differences was created by random sampling errors; in other words, it says that there is no true difference in the population; we have observed a difference only because of sampling errors. The results of the Chi Square test for the data in Table 2 are:

χ² = 1.241, df = 2, p > .05

 

Using the decision rule that p must be equal to or less than .05 to reject the null hypothesis, we fail to reject the null hypothesis; such an outcome is called a statistically insignificant result. In other words, the null hypothesis must remain on our list as a viable explanation for the set of differences we observed by studying a sample.
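Although the computations are beyond the scope of this primer, curious readers can reproduce them with the short sketch below. It assumes that the null hypothesis expects the same number of cases in every category, and it uses closed-form formulas for p that hold only for 1 or 2 degrees of freedom; the function names are ours. Computed from the counts in Table 1 (120 and 80), chi square comes to 8.00; entering the percentages (60 and 40) instead of the counts would give 4.00. For Table 2 the value is about 1.24. The decisions about the null hypothesis are the ones described in the text.

```python
import math

def chi_square_equal_expected(observed):
    """Chi square for one classification variable, assuming the null
    hypothesis expects the same frequency in every category."""
    expected = sum(observed) / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1
    return statistic, df

def p_value(statistic, df):
    """Upper-tail probability of chi square (closed forms for df = 1 and 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        return math.exp(-statistic / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")

for label, counts in (("Table 1", [120, 80]), ("Table 2", [30, 27, 22])):
    statistic, df = chi_square_equal_expected(counts)
    p = p_value(statistic, df)
    decision = "reject" if p <= 0.05 else "fail to reject"
    print(f"{label}: chi square = {statistic:.3f}, df = {df}, p = {p:.3f} -> {decision}")
```

In practice, a statistics package such as SAS or SPSS would carry out these steps for any number of categories.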

In this topic, we have considered the use of chi square in a univariate analysis, in which we classify each subject in only one way (such as which teaching method each prefers). Chi square can also be useful in bivariate analysis, in which we classify each subject in two ways (such as which teaching method each prefers and the gender of each) in order to examine a relationship between the two.

 

Exercise

  1. When we study a sample, are the results called the true results or the observed results?
  2. According to the null hypothesis, what created the difference in Table 1 in this topic?
  3. What is the name of the test of the null hypothesis used when we are analyzing frequencies?
  4. As a consumer (user) of research, should you try to interpret the value of df?
  5. What is the symbol for probability?
  6. If you read that a chi square test of a difference yielded a p of less than 5 in 100, what should you conclude about the null hypothesis on the basis of conventional wisdom?
  7. Does p < .05 or p > .05 usually lead a researcher to declare a difference to be statistically significant?
  8. If we fail to reject a null hypothesis, is the difference in question statistically significant?
  9. If we have a statistically insignificant result, does the null hypothesis remain on our list of viable hypotheses?

Question for Discussion

  1. Briefly describe a hypothetical study in which it would be appropriate to conduct a chi square test for univariate data.

Module #1

Section 4
Topic: The Pearson Correlation Coefficient (Pearson’s r)

When we wish to examine the relationship between two quantitative sets of scores (at the interval or ratio levels), we compute a correlation coefficient. The most widely used coefficient is the Pearson Product-Moment Correlation Coefficient, whose symbol is r. It is usually called simply Pearson’s r.

Consider the scores in Table 1 (below). The clinical test scores place the residents in roughly the same order as the ratings by supervisors. In other words, those who had high clinical test scores (such as Joe and Jane) tended to have high supervisors’ ratings and those who had low test scores (such as Jake and John) tended to have low supervisors’ ratings. This illustrates what we mean by a direct relationship (also called a positive relationship).

Table 1 Direct relationship, r = .89

Resident      Clinical Test Scores      Supervisors' Ratings

Joe                   35                         9
Jane                  32                        10
Bob                   29                         8
June                  27                         8
Leslie                25                         7
Homer                 22                         8
Milly                 21                         6
Jake                  18                         4
John                  15                         5

Note the relationship in Table 1. It is not perfect. For example, although Joe has a higher clinical test score than Jane, Jane has a higher supervisor’s rating than Joe. If the relationship were perfect, the value of the Pearson r would be 1.00. Being less than perfect, its actual value is .89. As you can see in Figure 1 (below), this value indicates a strong, direct relationship. Note: The value of the Pearson r is always between –1.00 and +1.00.

 

Figure 1 Values of the Pearson r

-1.00        inverse relationship         0.00         direct relationship        1.00

   ↑         ↑         ↑         ↑         ↑         ↑         ↑         ↑         ↑

Perfect   Strong   Moderate    Weak               Weak    Moderate   Strong    Perfect
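For readers who want to see where the value of .89 comes from, the sketch below computes the Pearson r from the nine pairs of scores in Table 1 of this topic. It assumes the table pairs each resident’s clinical test score with the supervisor’s rating in the order listed (Joe 35 and 9, Jane 32 and 10, and so on); the function name is ours.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

test_scores = [35, 32, 29, 27, 25, 22, 21, 18, 15]   # Joe ... John
ratings = [9, 10, 8, 8, 7, 8, 6, 4, 5]

r = pearson_r(test_scores, ratings)
print(round(r, 2))   # 0.89
```

The exceptions in the table (such as Joe and Jane trading places on the two measures) are what keep the result below 1.00.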

In an inverse relationship (also called a negative relationship – see Figure 1 above) those who are high on one variable are low on the other. Such a relationship exists between the scores in Table 2. Those who are high on self-concept (such as Joe and Jane) are low on depression, while those who are low on self-concept (such as John and Jake) are high on depression. Again, the relationship is not perfect. The value of the Pearson r for the relationship in Table 2 is -.86.

Table 2 Inverse relationship, r = -.86

Resident      Self-Concept Scores      Depression Score

Joe                   10                       2
Jane                   8                       1
Bob                    9                       0
June                   7                       5
Leslie                 7                       6
Homer                  6                       8
Milly                  4                       8
Jake                   1                       9
John                   0                       9

 

The relationships in Tables 1 and 2 are strong but, in each case, there are exceptions, which make the Pearson rs less than 1.00 and –1.00. As the number and size of the exceptions increase, the values of the Pearson r become closer to 0.00. Therefore, a value of 0.00 indicates the complete absence of a relationship (see Figure 1).

It is important to note that a Pearson r is not a proportion and cannot be multiplied by 100 to get a percentage. For instance, a Pearson r of .50 does not correspond to 50% of anything (see Figure 1). To think about correlation in terms of percentages, we must convert Pearson rs to another statistic, the coefficient of determination, whose symbol is r². The symbol indicates how to compute it: simply square r. Thus, for an r of .50, r² equals .25. If we multiply .25 by 100, we get 25%. What does this mean? Simply this: A Pearson r of .50 is 25% better than a Pearson r of 0.00; more formally, 25% of the variation in one variable is accounted for by the other. Table 3 shows selected values of r, r², and the percentages you should think about when interpreting an r.

 

Table 3 Selected values of r and r²

r        r²       Percentage better than zero
.90      .81      81%
.63      .40      40%
.50      .25      25%
.25      .06      6%
-.25     .06      6%
-.50     .25      25%
-.63     .40      40%
-.90     .81      81%
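The conversion from r to a percentage takes only one step, as this sketch shows. The function name is ours. Note that .63 squared is .3969, which rounds to .40, and that squaring removes the sign, so a negative r gives the same percentage as the positive r of the same size.

```python
def coefficient_of_determination(r):
    """Square the Pearson r to obtain r squared."""
    return r ** 2

for r in (0.90, 0.63, 0.50, 0.25, -0.25, -0.50, -0.63, -0.90):
    r_squared = coefficient_of_determination(r)
    print(f"r = {r:+.2f}   r squared = {r_squared:.2f}   {r_squared * 100:.0f}%")
```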

Exercise

  1. "Pearson r" stands for what words?
  2. When the relationship between two variables is perfect and inverse, what is the value of r?
  3. Is it possible for a negative relationship to be strong?
  4. Is an r of -.90 stronger than an r of .50?
  5. Is a relationship direct or inverse when those with high scores on one variable have high scores on the other and those with low scores on one variable have low scores on the other?
  6. What does an r of 1.00 indicate?
  7. For a Pearson r of .60, what is the value of the coefficient of determination?
  8. What do we do to a coefficient of determination to get a percentage?
  9. A Pearson r of .70 is what percentage better than a Pearson r of 0.00?

Question for Discussion

  1. Name two variables between which you would expect to get a strong, positive value of r.
  2. Name two variables between which you would expect to get a strong, negative value of r.

Module #2

Section 1
Topic: The t Test

Suppose we have a research hypothesis that says medical "research investigators who take a short course on the causes of HIV will be less fearful of the disease than research investigators who have not taken the course," and we test it by conducting an experiment in which one random sample of research investigators is assigned to take the course and another random sample is designated as the control group. (Note: Random sampling is preferred because it precludes any bias in the assignment of subjects to the groups and because we can test for the effect of random errors with a significance test; we cannot test for the effects of bias.)

Let’s suppose that at the end of the experiment the experimental group gets a mean of 16.61 on a fear of HIV scale and the control group gets a mean of 29.67 (where the higher the score, the greater the fear of HIV). These means support our research hypothesis. But can we be certain that our research hypothesis is correct? If you’ve been reading various topics on statistics, you already know that the answer is "no" because of the Null Hypothesis, which says that there is no true difference between the means; that is, the difference was created merely by the chance errors created by random sampling (these errors are known as sampling errors). Put another, simpler way, unrepresentative groups may have been assigned to the two conditions quite at random.

The t test is often used to test the null hypothesis regarding the observed difference between two means (to test the null hypothesis between two medians, the median test is used; it is a specialized form of the chi square test). For the example we are considering, a series of computations (which are beyond the scope of this paper) would be performed to obtain a value of t (which, in this case, is 5.38) and a value of degrees of freedom (which, in this case, is df = 179). These values are not of any special interest to us except that they are used to get the probability (p) that the null hypothesis is true. In this particular case, p is less than .05. Thus, in a research report, you may read a statement such as this:

"The difference between the means is statistically significant (t = 5.38, df = 179, p< .05)".

The term statistically significant indicates that the null hypothesis has been rejected. You will recall that when the probability that the null hypothesis is true is .05 or less (such as .01 or .001), we reject the null hypothesis. When something is unlikely to be true, because it has a low probability of being true, we reject it.

Having rejected the null hypothesis, we are in a position to assert that our research hypothesis probably is true (assuming no procedural bias was allowed to affect the results, such as testing the control group immediately after a major news story on a celebrity with AIDS, while testing the experimental group at an earlier time).

What leads a t test to give us a low probability? Three things:

  1. Sample size. The larger the sample, the less likely that an observed difference is due to sampling errors. Large samples provide more precise information. Thus, when the sample is large, we are more likely to reject the null hypothesis than when the sample is small.
  2. The size of the difference between means. The larger the difference, the less likely that the difference is due to sampling errors. Thus, when the difference between the means is large, we are more likely to reject the null hypothesis than when the difference is small.
  3. The amount of variation in the population. When a population is very heterogeneous (has much variability) there is more potential for sampling error. Thus, when there is little variation (as indicated by the standard deviations of the sample), we are more likely to reject the null hypothesis than when there is much variation.
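The three influences listed above can be seen by computing t from summary statistics. The sketch below assumes two groups of equal size that share one standard deviation, a simplification made only for illustration. All of the means, standard deviations, and group sizes are hypothetical and are not taken from the HIV example. A larger t (in absolute value) corresponds to a lower p.

```python
import math

def t_from_summary(mean_1, mean_2, sd, n_per_group):
    """Independent-samples t for two equal-sized groups that are assumed
    to share one standard deviation (a simplification for this sketch)."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return (mean_1 - mean_2) / standard_error

baseline = t_from_summary(30.0, 20.0, sd=10.0, n_per_group=25)
larger_sample = t_from_summary(30.0, 20.0, sd=10.0, n_per_group=100)
larger_difference = t_from_summary(35.0, 20.0, sd=10.0, n_per_group=25)
more_variation = t_from_summary(30.0, 20.0, sd=20.0, n_per_group=25)

print(f"baseline:          t = {baseline:.2f}")
print(f"larger sample:     t = {larger_sample:.2f}")
print(f"larger difference: t = {larger_difference:.2f}")
print(f"more variation:    t = {more_variation:.2f}")
```

With these made-up numbers, quadrupling the sample size doubles t, a larger difference between the means raises t, and doubling the variation cuts t in half.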

A special type of t test is also applied to correlation coefficients. Suppose we drew a random sample of 50 medical students and correlated their hand size with their GPAs and got an r of .19. The null hypothesis says that the true correlation in the population is 0.00, that is, that we got .19 merely as the result of sampling errors. For this example, the t test indicates that p > .05. Since the probability that the null hypothesis is true is greater than 5 in 100, we do not reject the null hypothesis; we have a statistically insignificant correlation coefficient. In other words, for n = 50, an r of .19 is not significantly different from an r of 0.00. When reporting the results of the t test for the significance of a correlation coefficient, it is better not to mention the value of t. Rather, it is better to indicate only whether or not the correlation is significant at a given probability level.
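As a rough check on the hand-size example, the sketch below applies the standard formula for testing a correlation coefficient, t = r × √(n − 2) / √(1 − r²), with n − 2 degrees of freedom. The critical value of 2.01 is an approximate figure read from a standard t table (two-tailed, .05 level, df = 48); it is supplied here as an assumption rather than computed.

```python
import math

def t_for_correlation(r, n):
    """t statistic and degrees of freedom for testing whether r differs from 0.00."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

CRITICAL_T_DF48 = 2.01   # approximate two-tailed .05 value from a t table (assumption)

t, df = t_for_correlation(0.19, 50)
print(round(t, 2), df)   # 1.34 48

if abs(t) >= CRITICAL_T_DF48:
    print("reject the null hypothesis (p < .05)")
else:
    print("fail to reject the null hypothesis (p > .05)")
```

Because 1.34 falls short of the critical value, the sketch reaches the same decision as the text: the correlation of .19 is not statistically significant for n = 50.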

 

Exercise

  1. What does the null hypothesis say about the difference between two sample means?
  2. Is the value of t usually of any special interest to consumers of research?
  3. Suppose you read that for the difference between two means, t = 2.00, df = 20, p>.05. Using conventional standards, should you conclude that the null hypothesis should be rejected?
  4. Suppose you read that for the difference between two means, t = 2.859, df = 40, p<.05. Using conventional standards, should you conclude that the null hypothesis should be rejected?
  5. Based on the information in question 4, should you conclude that the difference between the means is statistically significant?
  6. When we use a large sample are we more or less likely to reject the null hypothesis than when we use a small sample?
  7. When the size of the difference between means is large are we more or less likely to reject the null hypothesis than when the size of the difference is small?
  8. If we read that for a sample of 92 subjects, r = .41, p<.001, should we reject the null hypothesis?
  9. Is the value of r in question 8 statistically significant?

Question for Discussion

  1. Of the three things that lead to a low probability, which one is most directly under the control of a researcher?

Module #2

Section 2

Topic: One-Way Analysis of Variance (ANOVA)

In the topic on the t test, you learned that it is used to test the null hypothesis for the observed difference between two sample means. An alternative test for this problem is the analysis of variance (often called ANOVA), sometimes called the F test.

Instead of t, it yields a statistic called F, as well as degrees of freedom (df), sums of squares, mean squares, and a p value, which indicates the probability of obtaining differences at least this large through sampling error alone if the null hypothesis is correct. As with the t test, the only value of interest to the typical user of research is the value of p. By convention, when p equals .05 or less (such as .01 or .001), we reject the null hypothesis and declare the result to be statistically significant.

Because the t test and ANOVA are based on the same theory and assumptions, when we compare two means, both tests yield exactly the same value of p and, hence, lead to the same conclusion regarding significance. So for two means, the tests are equivalent and either can be used to get the same results. However, that is where the similarity ends. Note that a single t test can compare only two means, but a single ANOVA can compare any number of means, which is a great advantage.
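The equivalence for two means can be checked numerically: with a pooled-variance t test, F equals t squared. The sketch below is illustrative only; the scores are hypothetical and the helper names `pooled_t` and `one_way_f` are not from the original module.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pooled_t(group1, group2):
    """Independent-samples t statistic using the pooled variance."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss1 = sum((x - m1) ** 2 for x in group1)
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))

def one_way_f(groups):
    """One-way ANOVA F ratio for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores for two groups.
group_a = [6, 7, 5, 8, 6]
group_b = [4, 5, 3, 5, 4]

t = pooled_t(group_a, group_b)
f = one_way_f([group_a, group_b])
print(f"t squared = {t ** 2:.3f}, F = {f:.3f}")
```

Because F and t squared are the same number, and the degrees of freedom correspond, the two tests lead to the same p.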

Suppose, for example, that as a medical researcher you tested three drugs for treating depression in an experiment and obtained the following means and standard deviations.

Table 1 Posttest means and standard deviations of depression scores for three drugs

_________________________________________________________

Drug A                      Drug B                   Drug C

M = 6.00                  M = 5.50                M = 2.33

_________________________________________________________

Since the higher the score the greater the depression, inspection of the means shows that there are three observed differences:

    1. Drug C is superior to Drug A.
    2. Drug C is superior to Drug B.
    3. Drug B is superior to Drug A.

The null hypothesis says that the entire set of three differences was created by sampling error. Through a series of computations that are beyond the scope of this paper, an ANOVA for these data yields this result: F = 10.837, df = 2, 15, p < .05. This result might be stated in a sentence or presented in a table such as Table 2, which is known as an ANOVA table. While it contains many values, which were used to arrive at the probability, we are only interested in the end result: the value of p.

As users of statistics know, when the probability is .05 or less, as it is here, we reject the null hypothesis. This means that the entire set of differences is statistically significant at the .05 level (See Table 2).

Note: the ANOVA does not tell us which of the three differences we listed are significant; it could be that only one, only two, or all three are significant. This needs to be explored with additional tests, known as multiple comparison tests.

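Table 2 ANOVA for data in Table 1

_________________________________________________________________

Source of                     Sum of        Mean
Variation            df       Squares       Squares       F

Between Groups        2       47.445        23.722        10.837*
Within Groups        15       32.833         2.189
Total                17       80.278

_________________________________________________________________

*p < .05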
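Although the typical reader needs only the value of p, the entries in an ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the between-groups mean square divided by the within-groups mean square. The short sketch below recomputes these values from the sums of squares and degrees of freedom reported in Table 2; the function name `anova_summary` is hypothetical.

```python
def anova_summary(ss_between, df_between, ss_within, df_within):
    """Derive mean squares, F, and totals from sums of squares and df."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
        "ss_total": ss_between + ss_within,
        "df_total": df_between + df_within,
    }

# Values taken from Table 2.
table2 = anova_summary(ss_between=47.445, df_between=2,
                       ss_within=32.833, df_within=15)

for name, value in table2.items():
    print(f"{name}: {value:.3f}")
```

Small differences in the last decimal place compared with the table can arise from rounding at intermediate steps.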

 

As noted above, there are a number of multiple comparison tests (Dunnett’s test, Scheffé’s test, and Tukey’s test). Each is based on different assumptions; they usually yield similar results, but not always. For the data we are considering, application of a popular multiple comparison test, Scheffé’s test, yields these probabilities:

(1) for Drug C vs. A, p < .05

(2) for Drug C vs. B, p < .05

(3) for Drug B vs. A, p > .05

Thus, we have found that Drug C is significantly better than Drugs A and B, but that Drugs B and A are not significantly different from each other.

In review, an ANOVA tells us whether a set of differences, overall, is significant. If so, we can use an appropriate multiple comparison test (Dunnett’s, Scheffé’s, or Tukey’s test) to determine which pairs of means are significantly different from each other.
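As a small illustration of what "pairs of means" refers to, the sketch below lists every pair that a multiple comparison test would examine for the three drug means in Table 1. It only enumerates the pairs and their raw differences; it does not carry out a multiple comparison test or compute any probabilities. The variable names are hypothetical.

```python
from itertools import combinations

# Means from Table 1 (higher scores mean more depression).
drug_means = {"Drug A": 6.00, "Drug B": 5.50, "Drug C": 2.33}

# Every distinct pair of groups, in the order the groups are listed.
pairs = list(combinations(drug_means, 2))

for first, second in pairs:
    difference = drug_means[first] - drug_means[second]
    print(f"{first} vs. {second}: difference in means = {difference:.2f}")

print(f"Number of pairs to compare: {len(pairs)}")
```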

In this topic, we have been considering a One-Way ANOVA (also known as a single-factor ANOVA). It is called this because we have classified the subjects in only one way: in terms of which drug they took. Another type of ANOVA, in which subjects are classified in two ways, is appropriately called a Two-Way ANOVA and is discussed in a separate topic.

Exercise

  1. ANOVA stands for what words?
  2. If we compare two means for significance, will ANOVA and the t test yield the same probability?
  3. If an ANOVA yields p<.001, should the null hypothesis be rejected?
  4. If an ANOVA yields p>.05, should the null hypothesis be rejected?
  5. If we have four means on an achievement test for samples of students in four states, can we determine whether the set of differences, overall, is statistically significant by using ANOVA?
  6. For the information in question 5, could we use a t test for the same purpose?
  7. Should the typical consumer of research be concerned with the values of the sum of squares?
  8. In an ANOVA table, which statistic is of the greatest interest to the typical user of research?
  9. If an overall ANOVA for three or more means is significant, it can be followed up with what type of test to determine the significance of the differences among the individual pairs of means?

Question for Discussion

  1. Briefly describe a hypothetical study in which it would be appropriate to conduct a one-way ANOVA but it would not be appropriate to conduct a t test.
  2. If you have means for four groups, you have how many individual pairs of means to be compared with a multiple comparison test?

Module #2

Section 3

Topic: Two-Way Analysis of Variance (ANOVA)

In the one-way ANOVA, we saw how ANOVA can be used to test for the overall significance of a set of means when subjects have been classified in one way. Often, however, it is desirable to look at a two-way classification such as (1) which drug was taken and (2) how long subjects have been depressed. Table 1 (below) shows the means for such a study (note: for instructional purposes, only two drugs are shown. However, we may use ANOVA when there are more than two). Since higher depression scores indicate more depression, a low mean is desirable.

 

Table 1 Means for a study of depression: Drugs and length of depression comparisons

_________________________________________________________

                                            Drug A        Drug B             Row Total

                     Long-term       M = 8.11     M = 8.32         M = 8.22

                    Short-term       M = 4.67    M = 8.45        M = 6.56

                    Column Total M = 6.39 M = 8.38

Although the subjects are classified in two ways, analysis of the table answers three questions. First, by comparing the column totals of 6.39 and 8.38, we can see that, overall, those who took Drug A are less depressed. It’s important to notice that the mean of 6.39 for Drug A is based on both those who have long-term and those who have short-term depression; the same is true of the mean of 8.38 for Drug B. Thus, by comparing the column total means, we are answering the question of which drug is more effective in general without regard to how long subjects have been depressed. In analysis of variance, this is known as a main effect.

Each way in which subjects are classified yields a main effect in analysis of variance. Thus, since subjects were also classified in terms of their length of depression, there is a main effect for short-term vs. long-term, which can be seen by examining the row total means of 8.22 and 6.56. This main effect indicates that, overall, those with short-term depression are less depressed than those with long-term depression.

In this example, the most interesting question is the question of an interaction. The question is this: Is the effectiveness of the drugs dependent, in part, on the length of depression? By examining the individual cell means (those in Table 1 that are not row or column totals), we can see that the answer is "yes". Drug A is more effective for short-term than long-term depression (4.67 vs. 8.11) while Drug B is about equally effective for both types of depression (8.32 vs. 8.45). What is the practical implication of this interaction? The overall effectiveness of Drug A is almost entirely attributable to its effectiveness for short-term depression. That is, if a person has short-term depression, Drug A is indicated, but if a person has long-term depression, either drug is likely to be about equally effective.
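The main effects and the interaction described above can be read off the cell means with a little arithmetic. The sketch below is illustrative only: it assumes equal numbers of subjects in each cell (so that row and column totals are simple averages of the cell means), and the function names are hypothetical. It describes the observed means; it does not test them for significance.

```python
# Cell means from Table 1 (rows: length of depression, columns: drug).
cell_means = {
    "Long-term": {"Drug A": 8.11, "Drug B": 8.32},
    "Short-term": {"Drug A": 4.67, "Drug B": 8.45},
}

def row_means(cells):
    """Average across columns within each row (assumes equal cell sizes)."""
    return {row: sum(cols.values()) / len(cols) for row, cols in cells.items()}

def column_means(cells):
    """Average across rows within each column (assumes equal cell sizes)."""
    columns = next(iter(cells.values())).keys()
    return {c: sum(cells[r][c] for r in cells) / len(cells) for c in columns}

def interaction_contrast(cells, row1, row2, col1, col2):
    """Difference between the col1-minus-col2 gap in row1 and in row2.

    A value of zero would mean the gap between the columns is the same
    in both rows, that is, no interaction in the observed means.
    """
    gap_row1 = cells[row1][col1] - cells[row1][col2]
    gap_row2 = cells[row2][col1] - cells[row2][col2]
    return gap_row1 - gap_row2

rows = row_means(cell_means)
cols = column_means(cell_means)
contrast = interaction_contrast(cell_means, "Long-term", "Short-term",
                                "Drug A", "Drug B")

print("Row means:", {k: round(v, 3) for k, v in rows.items()})
print("Column means:", {k: round(v, 3) for k, v in cols.items()})
print(f"Interaction contrast: {contrast:.2f}")
```

The contrast is far from zero because the gap between the drugs is small for long-term depression and large for short-term depression.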

For the data in Table 1, it turns out that p < .05 for both main effects and the interaction. Thus, we can reject the null hypotheses that say that the differences we are considering are the result of random errors. Of course, it does not always turn out this way. It’s possible for one or both of the main effects to be significant while the interaction is not significant; it’s also possible for neither main effect to be significant while the interaction is significant, which is the case for the data in Table 2.

 

Table 2 Means for a study of depression: Drugs and gender comparisons

_________________________________________________________

                                                      Drug A         Drug B         Row Total

                            Females              M = 8.00      M = 5.00        M = 6.50

                            Males                M = 5.00     M = 8.00        M = 6.50

                           Column Total   M = 6.50     M = 6.50

Notice that the column totals (6.50 vs. 6.50) in Table 2 indicate no main effect for Drug A vs. Drug B (in this case they are equal to one another). Likewise, the row totals (6.50 vs. 6.50) indicate no main effect for gender. But there is a very interesting finding: the interaction of drug type and gender, which indicates that for females, Drug B is superior, but for males, Drug A is superior. Note that if we had compared the two drugs in a One-Way ANOVA without also classifying the subjects according to gender (as we did here in the Two-Way ANOVA), we would have missed this important interaction.
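The way an overall comparison can hide an interaction is easy to demonstrate with the cell means in Table 2. The following sketch is illustrative only; it assumes equal cell sizes and simply contrasts the overall drug comparison with the drug comparison made separately within each gender. The variable names are hypothetical.

```python
# Cell means from Table 2 (lower scores mean less depression).
cell_means = {
    "Females": {"Drug A": 8.00, "Drug B": 5.00},
    "Males": {"Drug A": 5.00, "Drug B": 8.00},
}

# Overall drug comparison, ignoring gender (assumes equal cell sizes).
overall_a = sum(row["Drug A"] for row in cell_means.values()) / len(cell_means)
overall_b = sum(row["Drug B"] for row in cell_means.values()) / len(cell_means)
print(f"Overall: Drug A = {overall_a:.2f}, Drug B = {overall_b:.2f}")

# Drug comparison made separately within each gender.
within_gender = {
    gender: row["Drug A"] - row["Drug B"] for gender, row in cell_means.items()
}
for gender, diff in within_gender.items():
    # A positive difference means Drug A has the higher (worse) mean.
    better = "Drug B" if diff > 0 else "Drug A"
    print(f"{gender}: A minus B = {diff:+.2f}, lower mean with {better}")
```

The overall means are identical, yet the within-gender differences are equal in size and opposite in direction, which is the pattern of the interaction described above.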

Exercise

  1. Suppose we drew random samples of urban, suburban, and rural children and tested them for creativity, and obtained three means. Should we use a one-way or a two-way ANOVA to test significance?
  2. Do the following means on a performance test indicate an interaction between type of reward and age?

     __________________________________________________________

                              Praise          Monetary        Row
                              Reward          Reward          Totals

     Young Adults             M = 50.00       M = 60.00       M = 55.00

     Older Adults             M = 60.00       M = 50.00       M = 55.00

     Column Total             M = 55.00       M = 55.00

  3. Do the means for question 2 indicate a main effect for type of reward?
  4. Do the following means on an achievement test indicate an interaction between the method of instruction (A vs. B) and the aptitude of the students (high vs. low)?

     __________________________________________________________

                              Method A        Method B        Row Totals

     High Aptitude            M = 100.00      M = 85.00       M = 92.50

     Low Aptitude             M = 100.00      M = 85.00       M = 92.50

     Column Total             M = 100.00      M = 85.00

  5. Do the means for question 4 indicate a main effect for method of instruction?
  6. Do the means for question 4 indicate a main effect for aptitude?
  7. If p>.05 for an interaction in an analysis of variance, should we reject the null hypothesis?
  8. If p<.05 for a main effect in an analysis of variance, should we reject the null hypothesis?
  9. If both main effects are statistically significant in an analysis of variance, will the interaction necessarily be significant?

Question for Discussion

  1. Briefly describe a hypothetical study in which it would be appropriate to conduct a two-way ANOVA but it would not be appropriate to conduct a one-way ANOVA.

Module #3

Section 1

Topic: Multiple Regression Correlation (MRC)

Multiple Regression Correlation (MRC), like the Analysis of Variance (ANOVA), is a statistical method from the "General Linear Model". According to Kerlinger & Pedhazur (1973), multiple regression analysis can do anything ANOVA does; ANOVA is a special case of MRC. Both are algebraically the same and will produce statistically the same outcome (see Table 1).

                             Table 1 ANOVA vs. MRC outcomes

                             ________________________________________________

                                                  ANOVA                       MRC

                             F ratio              20.38                       20.38

                             df                   1, 18                       1, 18

                             p value              < .01                       < .01

                             SS
                                     Between      64.80        Regression     64.80
                                     Within       57.20        Residual       57.20

                             ________________________________________________
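The claim that ANOVA and MRC are algebraically the same can be checked by hand on a small example. The sketch below is illustrative only: the scores are hypothetical (they are not the data behind Table 1), and the regression uses a single 0/1 code for group membership as the independent variable. The between-groups and regression sums of squares match, as do the within-groups and residual sums of squares, so the F ratios are identical.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical scores for two groups; not the data behind Table 1.
group_1 = [3, 4, 5, 4, 4]
group_2 = [6, 7, 8, 7, 7]

# ANOVA route: between-groups and within-groups sums of squares.
scores = group_1 + group_2
grand_mean = mean(scores)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in (group_1, group_2))
ss_within = sum((y - mean(g)) ** 2 for g in (group_1, group_2) for y in g)

# Regression route: regress the score on a 0/1 code for group membership.
x = [0] * len(group_1) + [1] * len(group_2)
x_mean = mean(x)
slope = (sum((xi - x_mean) * (yi - grand_mean) for xi, yi in zip(x, scores))
         / sum((xi - x_mean) ** 2 for xi in x))
intercept = grand_mean - slope * x_mean
predicted = [intercept + slope * xi for xi in x]
ss_regression = sum((p - grand_mean) ** 2 for p in predicted)
ss_residual = sum((yi - p) ** 2 for yi, p in zip(scores, predicted))

# Same degrees of freedom on both routes: 1 and N - 2.
df_effect, df_error = 1, len(scores) - 2
f_anova = (ss_between / df_effect) / (ss_within / df_error)
f_regression = (ss_regression / df_effect) / (ss_residual / df_error)

print(f"ANOVA:      SS between    = {ss_between:.2f}, SS within   = {ss_within:.2f}, F = {f_anova:.2f}")
print(f"Regression: SS regression = {ss_regression:.2f}, SS residual = {ss_residual:.2f}, F = {f_regression:.2f}")
```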

 

ANOVA is used primarily in scientific experiments, while MRC is used primarily in quasi-experimental designs. Both have null hypotheses (as described under the Null Hypothesis topic); however, they look at different elements of the model. ANOVA tests whether population mean 1 equals population mean 2, while MRC tests whether the correlation of the Independent Variable (IV) with the Dependent Variable (DV) is zero, as follows:

                           MRC                 ANOVA
                           H0: ρ = 0           H0: μ1 = μ2

Both MRC and ANOVA have the same underlying assumptions:

  1. That the groups have equal variances,
  2. That the scores are normally distributed, and
  3. That the observations are independent of one another.

There are several key components of MRC and ANOVA that differ from one another:

       MRC Key Components                                  ANOVA Key Components

       1. Impractical to hold subjects for an              1. Causation
          extended period of time.

       2. Not appropriate to randomize subjects.           2. Experimental control

       3. May be too expensive to randomize                3. Randomization of subjects
          subjects.

       4. May be logistically impossible to                4. Ability to isolate variables
          randomize the subjects.

As noted earlier, MRC is used in quasi-experimental design experiments. Quasi-experiments have treatments, outcome measures and experimental units, but do not use random assignment to create the comparisons from which treatment-caused change is inferred. Instead, the comparisons depend on nonequivalent groups that differ from each other in many ways other than the presence of a treatment whose effects are being tested.

In the absence of randomization, the researcher is faced with the task of identifying and separating the effects of the treatments from the effects of all other factors affecting the dependent variable (DV). Campbell & Stanley (1963) warned researchers about "a feeling of hopelessness with regard to achieving experimental control which leads to the abandonment of such efforts in favor of more informal methods of investigation". Their defense of quasi-experimental design was that it is "deemed worthy of use when better designs are not feasible". In other words, support for the use of quasi-experimental design is not based on its intrinsic worth; rather, it is positioned as an approach for when better designs are not possible (Sawilowsky, 1997).

Campbell & Stanley (1963) suggest the use of quasi-experimental design for the many "social science settings" in which there is no way to randomly assign participants.

In regression analysis one is trying to either predict or explain phenomena. In predictive research the main emphasis is on practical applications, whereas in explanatory research the main emphasis is on understanding phenomena. This is not to say that the two research activities are unrelated, or that they have no bearing on each other. Predictive research may, for example, serve as a source of hunches and insights that might lead to theoretical considerations. Yet the importance of distinguishing between the two types of research activities cannot be overemphasized.

MRC uses IVs and DVs in the context of explanatory research, whereas ANOVA uses predictor and criterion in the context of predictive research. Prediction is really a special case of explanation; it can be subsumed under theory and explanation as noted in Table 2. This is explanation:

Table 2 Explanation of a variable
________________________________

If p, then q, under conditions r, s and t
________________________________

The above explanation in Table 2 is also prediction, prediction from p to q as follows in Table 3.

Table 3 Prediction of a variable
______________________________________

Prediction from p (and r, s, and t) to q
________________________________

Kerlinger & Pedhazur (1973) indicate that, while MRC is well-suited to predictive analysis, it is more fundamentally oriented to "explanatory analysis". We do not simply throw variables into a regression equation; we enter them, whenever possible, at the dictates of theory and reasonable interpretation of empirical research findings.

Exercise

  1. Is MRC used primarily in experimental design research? If so, why? If not, why not?
  2. Why is ANOVA a special case of MRC?
  3. What are the underlying assumptions of MRC?
  4. What are the three components that MRC and ANOVA share, and what is the one component that ANOVA has but MRC lacks?
  5. Campbell & Stanley (1963) suggested that the use of quasi-experimental design and MRC is more centered in the social science setting. Why?
  6. In predictive research, the main emphasis is on what?
  7. In explanatory research, the main emphasis is on what?
  8. What does MRC use in the context of explanatory research?
  9. What does ANOVA use in the context of predictive research?

Question for Discussion

  1. Briefly describe a hypothetical study in which it would be appropriate to use MRC.

Foundations of Program Evaluation

Standards of Evaluation - Education Programs

Standards for Evaluations of Educational Programs, Projects, and Materials (Joint Committee on Standards for Educational Evaluation, 1981):

  • A set of 30 criteria subsumed into four (4) categories: utility, feasibility, propriety and accuracy (Payne, 1981).
    1. Utility: Audience identification, evaluator credibility, dissemination, report clarity and timeliness.
    2. Feasibility: Practicality (data collection), political viability and cost-effectiveness.
    3. Propriety: The legal and ethical issues associated with conducting the evaluation.
    4. Accuracy Standards: The reliability, validity, data control and analysis.

What is the difference between evaluation and research?

  • Evaluation takes place in a naturalistic setting.
  • Evaluation focuses on the entire program.
  • Evaluation has more complex outcomes.
  • The objective of evaluation involves a greater range of phenomena, and
  • The objectives of evaluation tend to be oriented more to process and behavior.

What is the role of evaluation (Heath, 1969):

  • Contributes to the general body of knowledge about some item.
  • Facilitation of some rational comparison or competing program.
  • Improvement of the program during the development phase.

Review (Education Programs):

  • Scale of Measurement
    1. Nominal & Ordinal (qualitative).
    2. Interval & Ratio (quantitative).

______________________________________________________

Donald T. Campbell

Many call him the "father" of scientific evaluation.

Campbell and Stanley (1963) wrote the seminal text on research design, titled Experimental and Quasi-Experimental Designs for Research.

Experiment: a study that has randomization as its basis and can be repeated over and over again with the same results.

Quasi-experiment: a study that lacks randomization as its basis.

Campbell was in favor of using both quantitative and qualitative procedures. He wanted qualitative methods to complement quantitative methods rather than replace them.

All evaluations should be open for criticism and accountability.

Campbell does not recommend evaluating a program that:

  • Is "puny".
  • Has already been approved so officials can say they are addressing problems.
  • Is still being implemented despite ongoing mistakes.
  • Involves officials who are not proud of their work.

______________________________________________________

 

(Shadish, Cook and Leviton-1991)

Modern social program evaluation emerged in the 1960s as a result of massive federal involvement in social welfare spending under the Great Society movement during the Johnson and Nixon presidencies (e.g., the War on Poverty and Medicare/Medicaid).

What percent of the U.S. annual budget is earmarked for research and evaluation?

The first federal program to require "evaluation" was the juvenile delinquency program enacted by Congress in 1962 (Weiss, 1987).

Stakeholder: Those who have a "stake" in the program or its "evaluation".

There is a psychological phenomenon in which an evaluation demonstrates that a program is not effective, yet a person continues to support the program. Why?

  • The administrator’s main concern is maintaining his/her budget.
  • The administrator is protecting his/her employees’ job security.
  • The politician is concerned about funds to his/her home district.
  • The politician will exploit the program in order to get reelected.
  • A tremendous amount of energy went into the development of the program.

Then how does change occur?

  • By incremental steps – hundreds of accumulated small inputs.
  • No single authority can radically change a program e.g. social security.

Program evaluation assumes that education problem solving can be improved by incremental improvements in existing programs, better designs of new ones, terminating bad programs and replacing them with better ones.

Most education programs get replaced or die out for political or economic reasons, not because of evaluation results.

Internal validity is the sine qua non (Campbell and Stanley, 1963)

Lincoln and Guba (1985, 1986) argued that there is no reality beyond what we each construct, so causality, generalizability, and truth have little useful meaning.

Value Component: Early evaluators thought that they could be value-free from justice, equality, liberty and human rights.

Theory of Valuing (Beauchamp, 1982)

  • Metatheory: study of the nature of and justification for valuing
  • Prescriptive theory: advocates the primacy of particular values
  • Descriptive theory: describes values without advocating one as best.

Evaluation is about determining value, merit or worth.

Most evaluators use descriptive valuing. They describe values held by the stakeholders, however, no claim by the evaluator is made that this is the best program.

Prescriptive theories carry a heavy burden of justification. Why?


(Michael S. Scriven – 1983)

Without evaluation, we have no means of allocating resources; waste, fraud, and incompetence would go undetected (Scriven, 1983).

The fiscal benefits of the evaluation should always exceed the cost of conducting the evaluation.

Scriven coined formative evaluation and summative evaluation:

  • Formative evaluation – aimed at improving an educational experience or product during its developmental phases (Scriven, 1967)
  • Summative evaluation – an end of course assessment (Scriven, 1967)

How bias is introduced into evaluations:

  • When one is an internal evaluator.
  • Divides loyalties among management and employees.
  • Uses goals established by the "stakeholder".
  • Uses the program’s goals.

How do we correct bias in education programs (goal-free evaluations):

  • The evaluator must be "totally blind" to the program, similar to blind justice and double-blind medicine trials.
  • The evaluator must be able to identify both positive and negative effects.

Why should the evaluator be goal-free in evaluating a program:

  • Because their reputation as a "quality" evaluator is on the line.
  • Auditing the final report is a special case of "meta-evaluation".
  • Because the evaluator gets evaluated.
  • Because the evaluator creates standards of acceptable performance on criteria of merit.

Cost-benefit Analysis: "Translates all inputs and outputs into monetary units, yielding a single cost-benefit ratio of net fiscal gains or losses."


Qualitative vs. Quantitative Research: An Overview

Elements of "Naturalistic" Research

    1. Intense and/or Prolonged Contact with a "Field" or life situation
      - Typically reflective of normal or everyday life
      - Individuals, groups, societies, organizations, etc.
    2. Researcher Role to Gain a "Holistic" Overview of the Context Under Study
    3. Researcher Attempts to Capture Data on the Perceptions of Local Actors
    • "From the inside"

Qualitative Analysis…

    1. Few Standardized Instruments Used
    2. Most Analysis performed with Words Assembled into Similar "Meaning Units"
    3. Analysis Proceeds by Isolating Themes & Expressions
    • Review with informants
    • One main task: Explain ways people in part come to understand, account for, take action and manage their day-to-day situations

Advantages and Disadvantages…

Advantages                                          Disadvantages

1. Helps establish interpretive framework           1. Concepts not always clearly defined

2. Helps reduce bias                                2. Field notes guarded

3. Processual                                       3. Territoriality

4. Adds "punch" to research reports                 4. Criticism taboo

5. Generally high in validity                       5. Lots of data

                                                    6. Analysis procedures poorly defined

                                                    7. Generally low reliability

 

 

 

 


Relationship Between Methods…

Qualitative Research                 Quantitative Research

1. Towards discovery                 1. Towards testing a hypothesis

2. Induction                                 2. Deduction

3. Specific to General                  3. General to specific

4. Fieldwork                                4. Office/desk/lab

5. Small n                                    5. Large N

6. Non-statistical                         6. Statistical

7. Describe, interpret, explain    7. Predict, control


 


 

Steps of Qualitative Method…

 

    1. Record raw data
    2. Conceptualize
    3. Develop Propositions
    4. Develop Hypotheses
    5. Construct Theory
    6. Develop a Model

Qualitative Data Sources…

    1. Interviews
      - Structured
      - Semi-structured
      - Unstructured
    2. Oral Histories & Archival Materials
    3. Direct Observation/Participant-Observation
    4. Personal Documents
    5. Visual Documents

Early Field Reactions…

    1. Overwhelmed by data
    2. Uncomfortable, "cultural shock"
    3. Continually explain purpose of research
    4. Feelings of inadequacy
    5. Concern about how informants see you
    6. Guilt about not doing enough work

Midpoint Field Reactions…

    1. Almost too rapid internalization & acceptance of cultural norms/values/behavior
    2. Inability to verbalize the meaning related to experiences/events/interactions
    3. Recognizing need for physical distance
    4. Worry about asking right questions

Late Field Reactions…

    1. Concern about fulfilling research design
    2. Fear of collecting insufficient data
    3. Feelings of taking without giving back
    4. Personal difficulties associated with leaving the field site

Computers & Quantitative Analysis

    1. SPSS statistical package
    2. SAS statistical package
    3. Database – data properly coded
    4. Data manipulation