Understanding and Using Statistical Research Methodologies in
Medical Education Programs:
A Primer for Medical Students and Residents
MODULES 1, 2, & 3
Danford L. Wilson, Ph.D.
Director for Graduate Medical
Education
University of Kansas School of
Medicine
3901 Rainbow Boulevard, Kansas City,
KS 66160-7300
Introduction
- Statement of Purpose:
This Statistical Module is designed for residents and students who are not
epidemiologists and who do not wish to use extensive epidemiological
statistics in their research. It will allow the user to understand the basic
statistics used in medical research, and to understand their use and
underlying concepts when reading medical and scientific journals.
This module establishes a framework for understanding the essential
principles and practices of testing the null hypothesis and using
various statistics {Mean, Median & Mode; the Chi Square Test; the Pearson
Correlation Coefficient (Pearson's r); the t Test; One-Way
Analysis of Variance; Two-Way Analysis of Variance; and Multiple Regression},
with emphasis placed on understanding the role of "qualitative" and
"quantitative" evaluation in medical education research.
- Goal:
Overall Objectives
Residents and Students will:
- Understand what the null hypothesis is and what it means when you reject
the null hypothesis or when you fail to reject the null hypothesis.
- Understand what statistically significant means.
- Understand and discern the difference between qualitative and quantitative
research using various statistical methods {Mean, Median & Mode; Chi
Square; Pearson's r; t Test; One-Way and Two-Way Analysis of Variance
(ANOVA); Multiple Regression; and ANCOVA}.
- Understand the history of qualitative and quantitative research in
medicine.
- Identify and understand the strengths and limitations of qualitative and
quantitative data.
- Analyze a real qualitative data set.
- Identify qualitative assessments that are being used today in medical
research.
- Identify and discuss the seminal bodies of qualitative work.
- Identify political limitations of research results (politically correct).
- Understand the function of focus groups and their applications in medical
research.
- Understand and be aware of the available software that can be used in
providing data results (SAS, SPSS, etc.).
C. Logistical Considerations:
- Residency year:
PGY 2 and PGY 3
- Length of program and number of modules:
There are three statistical main modules (each containing from 2 to 4
sections) with a "seat time" of approximately 45 minutes per main
module and 30 minutes for the "Foundation of Program Evaluation".
The total time for all modules will be approximately 3 hours and 30 minutes
(allowing for a 15 minute question & answer period for each main
module):
- Understanding and using the Mean, Median & Mode of a data set;
Introduction to the Chi Square Test; Calculating the Pearson Correlation
Coefficient (Pearson's r) and its meaning.
- Understanding when and where to use and interpret the t Test; ANOVA
One-Way Analysis of Variance; ANOVA Two-Way Analysis of Variance in
medical education research.
- Understanding Multiple Regression in medical research: when to use it, how
to interpret the results, and what it means in your research paper.
D. Materials:
Handouts
Evaluation forms
Module Summaries and Objectives
Module 1: Basic Statistical Methods and Their Use
Module One, Basic Statistical Methods and Their Use, gives residents a
simplified overview of the use of the hypothesis; the mean, median & mode;
the chi square test; and Pearson's r in research, and it defines
common concepts in their use.
At the end of this module, residents and students will be able to:
- Describe and understand the way data is used in research papers in medical
and scientific research.
- Compare and understand the results of one study against those of similar
studies. Understand chi square and Pearson's r.
- Define qualitative vs. quantitative research.
- Define "reliability" and "validity" in the use of data
and statistical methods.
Module 2: Advanced Statistical Methods: The t Test, and
ANOVA
Module Two, Advanced Statistical Methods: The t Test and ANOVA,
provides residents and students with a simplified overview and understanding of
using and interpreting the advanced statistical methods that are used in medical
and scientific journals. Residents and students will learn tools and techniques
for using these statistics, in preparation for writing their own research papers
in their residency and clinical program, without having to use heavy
epidemiological statistics.
At the end of this module, the resident will be able to:
- Describe the purpose and use of the t Test, and ANOVA (One-Way and
Two-Way Analysis of Variance).
- Describe the type of data that must be used with these types of tests.
- Understand why the user must set alpha (α) at
a level (preferably .05 or less) to test a hypothesis.
- Understand what "statistically significant" means using
these statistical methods (t Test, and ANOVA).
- Discuss strategies in the development of their research paper.
- Read and understand research papers that use the t Test and/or
ANOVA in analyzing data.
Module 3: Advanced Statistical Methods: Multiple Regression, and
ANCOVA (Analysis of Covariance)
Module Three, Advanced Statistical Methods: Multiple Regression and ANCOVA
(Analysis of Covariance), provides students and residents with a simplified
overview and understanding of using and interpreting the statistical methods that
are used in medical and scientific journals. Residents and students will learn
tools and techniques for using and solving statistics, in preparation for writing
their own research papers during their residency year, without having to learn
epidemiological statistics.
At the end of this module, the resident and student will be able to:
- Describe the purpose and use of Multiple Regression and ANCOVA (Analysis
of Covariance).
- Describe the type of data that must be used with these types of tests.
- Understand what "statistically significant" means using
these statistical methods (Multiple Regression & ANCOVA).
- Discuss strategies in using these statistics in the development of their
research paper.
- Read and understand research papers that use Multiple Regression
and/or ANCOVA in analyzing data.
Module #1
Section 1
Topic: Introduction to the Null Hypothesis
Suppose we drew random samples of medical students and of residents,
administered a self-report measure of medical knowledge, and computed the mean
(the most commonly used average) for each group. Furthermore, suppose the mean
for the medical students is 63.00 and the mean for residents is 68.00. Where did
the five-point difference come from? There are three possible explanations:
- Perhaps the population of residents is truly more knowledgeable about
medicine than the population of medical students, and our samples correctly
identified the difference. (In fact, our research hypothesis may have
been that residents are more knowledgeable about medicine than medical
students, which now appears to be supported by the data.)
- Perhaps there was a bias in procedures. By using random sampling, we have
ruled out sampling bias, but other procedures such as measurement may be
biased. For example, maybe the residents were contacted during September,
when many clinical events (conferences, lectures, etc.) take place, and the
medical students were contacted during the gloomy month of December, when no
clinical events (conferences, lectures, etc.) took place. The only way to
rule out bias as an explanation is to take physical steps to prevent
it. In this case, we would want to make sure that the medical knowledge for
both groups was measured in the same way at the same time.
- Perhaps the populations of residents and medical students are the same but
the samples are unrepresentative of their populations because of random
sampling errors. For instance, the random draw may have given us a sample of
residents who are more knowledgeable, on the average, than their population.
This third explanation has a name: it is the Null Hypothesis. The
general form in which it is stated varies from researcher to researcher. Here
are three versions, all of which are consistent with each other:
A. Version "A" of the Null Hypothesis:
- The observed difference was created by sampling errors. (Note: The term
sampling error refers only to random errors, not errors created
by a bias.)
B. Version "B" of the Null Hypothesis:
- There is no true difference between the two groups. (Note: The term true
difference refers to the difference we would find in a census of the
population, that is, the difference we would find if there were no sampling
errors.)
C. Version "C" of the Null Hypothesis:
- The true difference between the two groups is zero.
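The idea that sampling errors alone can create a difference can be illustrated with a small simulation. The sketch below is only an illustration: the population mean (65), the standard deviation (10), and the sample size (30) are hypothetical values that do not come from the example above. Both groups are drawn from the same population, so any difference between their means is due to random sampling error.

```python
import random
import statistics


def simulate_sample_differences(pop_mean=65.0, pop_sd=10.0, n=30,
                                trials=1000, seed=1):
    """Draw two random samples from the SAME population many times and
    return the difference between the two sample means for each trial."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(trials):
        group_a = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        group_b = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        diffs.append(statistics.mean(group_b) - statistics.mean(group_a))
    return diffs


diffs = simulate_sample_differences()
# Share of trials in which chance alone produced a gap of 5 points or more.
share_large = sum(abs(d) >= 5.0 for d in diffs) / len(diffs)
print(f"Largest chance difference: {max(diffs, key=abs):.2f}")
print(f"Share of trials with a difference of 5+ points: {share_large:.3f}")
```

Even though the true difference is zero in every trial, the observed differences are almost never exactly zero, which is what the null hypothesis describes.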
Significance tests determine a probability, p, that is commonly described
as the probability that the null hypothesis is true. (Strictly speaking, p is
the probability of obtaining a difference at least as large as the one
observed if the null hypothesis were true.) The researcher sets the probability
level at which the null hypothesis will be rejected. Suppose for our
example, we use a significance test and find that the probability that the null
hypothesis is true is less than 5 in 100. This would be stated as p <
.05, where p stands for probability. Researchers
should always state the probability level used in their research findings. The
level can be set at any value greater than 0 (such as .001 or .01); however, we
play it safe by setting it to .05, the level most widely accepted by convention.
Of course, if the chances that something is true are less than 5 in 100, it's a
good bet that it's not true. If it's probably not true, we
reject the null hypothesis, leaving us with only the first two
explanations that we started with as viable explanations for the difference.
There is no rule of nature that dictates at what probability level the null
hypothesis should be rejected. However, conventional wisdom suggests that .05 or
less (such as .01 or .001) is reasonable.
When we fail to reject the null hypothesis because the probability is greater
than .05, we do just that: We "fail to reject" the null
hypothesis and it stays on our list of possible explanations; we never "accept"
the null hypothesis as the only explanation. Remember, there are three possible
explanations (see above) and failing to reject one of them does not mean that
you are accepting it as the only explanation.
An alternative way to say that we have rejected the null hypothesis is to
state that the difference is statistically significant. Thus, if
we state that a difference is statistically significant at the .05 level
(meaning .05 or less), it is equivalent to stating that the null hypothesis has
been rejected at that level.
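The decision rule described above can be written down in a few lines. This is a minimal sketch; the function name `decide` and its wording are illustrative choices, and the default level of .05 is the conventional value discussed in this topic.

```python
def decide(p_value, alpha=0.05):
    """Apply the conventional decision rule to a probability (p) value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if p_value <= alpha:
        # Low probability: sampling error is an unlikely explanation.
        return "reject the null hypothesis (statistically significant)"
    # We never "accept" the null hypothesis; it simply stays on the list.
    return "fail to reject the null hypothesis (not statistically significant)"


for p in (0.001, 0.05, 0.20):
    print(p, "->", decide(p))
```

Note that the second outcome is worded as "fail to reject," in keeping with the point that the null hypothesis is never accepted as the only explanation.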
When you read research reported in academic journals, or research papers, you
will find that the null hypothesis is seldom stated by researchers, who assume
that you know that the sole purpose of a significance test is to test a null
hypothesis. Instead, researchers tell you which differences were tested for
significance, which significance test they used, and which differences were
found to be statistically significant. It is more common to find null hypotheses
stated in theses and dissertations since committee members may wish to make sure
that the students they are supervising understand the reason they have conducted
a significance test.
Exercise
- How many explanations are there for the differences in medical knowledge
between residents and medical students in the example in this topic?
- What does the null hypothesis say about sampling errors?
- Does the term sampling error refer to random errors or to bias?
- The null hypothesis says that the true difference equals what value?
- What is used to determine the probability that a null hypothesis is true?
- What does p < .05 stand for?
- Do we reject the null hypothesis when the probability of truth is high or
when it is low?
- What do we do if the probability is greater than .05?
- What is an alternative way of saying that we have rejected the null
hypothesis?
- Are you more likely to find a null hypothesis stated in a journal article
or in a thesis?
Question for Discussion
- We all use probabilities in everyday activities to make decisions. For
example, before we cross a busy street, we estimate the odds that we will
get across the street safely. Briefly describe one other specific use of
probability in everyday decision-making.
Module #1
Section 2
Topic: The Mean, Median, and Mode
The most frequently used average is the Mean, which is the balance
point in a distribution. Its computation is simple: just sum (add up) the
scores and divide by the number of scores. The most common symbol for the mean
in academic journals is M (for the mean of a population) or m
(for the mean of a sample). The symbol preferred by statisticians is
X̄, which is pronounced "X-bar."
Because the mean is very frequently used as the average, let's consider its
formal definition, which is the value around which
the deviations sum to zero. You can see what this means by considering
the scores in Table 1. When we subtract the mean of the scores (which is 4.0)
from each of the scores, we get the deviations (whose symbol is x).
If we sum the deviations, we get zero, as shown in Table 1.
Table 1 Scores and deviation scores
| X | Minus | M | Equals | x |
| 1 | - | 4.0 | = | -3.0 |
| 1 | - | 4.0 | = | -3.0 |
| 1 | - | 4.0 | = | -3.0 |
| 2 | - | 4.0 | = | -2.0 |
| 2 | - | 4.0 | = | -2.0 |
| 4 | - | 4.0 | = | 0.0 |
| 6 | - | 4.0 | = | 2.0 |
| 7 | - | 4.0 | = | 3.0 |
| 8 | - | 4.0 | = | 4.0 |
| 8 | - | 4.0 | = | 4.0 |
Sum of the deviations (x) = 0.0
Note: If you take any set of scores, compute their mean, and follow
the steps in Table 1, the sum of the deviations will always equal zero (it might
be slightly off from zero if you use a rounded mean such as using 20.33 as the
mean when its precise value is 20.33333333).
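The steps in Table 1 can be checked with a few lines of code. The sketch below uses the ten scores from Table 1; the variable names are illustrative.

```python
# Scores from Table 1.
scores = [1, 1, 1, 2, 2, 4, 6, 7, 8, 8]

# The mean: sum the scores and divide by the number of scores.
mean = sum(scores) / len(scores)

# A deviation is a score minus the mean.
deviations = [x - mean for x in scores]

print("Mean:", mean)                            # 4.0
print("Deviations:", deviations)
print("Sum of deviations:", sum(deviations))    # 0.0
```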
Considering the formal definition, you can see why we also informally define
the mean as the balance point in a distribution. The positive and
negative deviations balance each other out.
A major drawback of the mean is that it is drawn in the direction of extreme
scores. Consider the following two sets of scores and their means:
Scores for Set A: 1, 1, 1, 2, 3, 6, 7, 8, 8
M = 4.11
Scores for Set B: 1, 2, 2, 3, 4, 7, 9, 25, 32
M = 9.44
Notice in both sets there are nine scores and the two distributions are very
similar except for the scores of 25 and 32 in Set B, which are much higher than
the others and, thus, create a skewed distribution. Notice that the two
very high scores have greatly pulled up the mean for Set B; in fact, the mean
for Set B is more than twice as high as the mean for Set A because of the two
high scores.
When a distribution is highly skewed, we use a different average, the Median,
which is defined as the middle score. To get an approximate median,
put the scores in order from low to high as they are for Sets A and B (above),
and then count to the middle. Since there are nine scores in Set A, the median
(middle score) is 3 (which is five scores up from the bottom). For Set B, the
median (middle score) is 4 (which is five scores up from the bottom), which is
more representative of the center of this skewed distribution than the
mean, which we noted was 9.44. Thus, one use of the median is to describe the
average of skewed distributions. Another use is to describe the average of
ordinal data; ordinal is one of the four NOIR (Nominal, Ordinal, Interval,
and Ratio) levels of measurement.
A third average, the Mode, is simply the most frequently occurring
score. For Set B, there are more scores of 2 than any other score; thus, 2
is the mode. The mode is sometimes used in informal reporting but is very seldom
used in formal reports of research.
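The three averages for the two sets of scores above can be reproduced with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

set_a = [1, 1, 1, 2, 3, 6, 7, 8, 8]
set_b = [1, 2, 2, 3, 4, 7, 9, 25, 32]

for name, scores in (("Set A", set_a), ("Set B", set_b)):
    mean = round(statistics.mean(scores), 2)   # balance point
    median = statistics.median(scores)         # middle score
    mode = statistics.mode(scores)             # most frequent score
    print(f"{name}: mean = {mean}, median = {median}, mode = {mode}")

# Set A: mean = 4.11, median = 3, mode = 1
# Set B: mean = 9.44, median = 4, mode = 2
```

The two extreme scores in Set B more than double the mean but move the median by only one point, which is why the median is preferred for skewed distributions.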
Because there is more than one type of average, it is vague to make a
statement such as, "The average is 4.11." Rather, we should
indicate the specific type of average being reported with statements such as,
"The mean is 4.11."
A synonym for the term averages is measures of central tendency.
Although the latter is seldom used in reports of scientific research, you may
encounter it in other research and statistics.
Exercise
- Which average is defined as the most frequently occurring score?
- Which average is defined as the balance point in a distribution?
- Which average is defined as the middle score?
- What is the formal definition of the mean?
- How is the mean calculated?
- Should the mean be used for highly skewed distributions? Why and/or why
not?
- Should the median be used for highly skewed distributions? Why and/or why
not?
- What is a synonym for the term averages?
Question for Discussion
- Suppose a fellow student gave a report in class and said, "the
average was 25.88." For what additional information should you ask?
Why?
Module #1
Section 3
Topic: Introduction to the Chi Square Test
Suppose we drew at random a sample of 200 members of the American Medical
Association and asked them whether they were in favor of a proposed change to
their bylaws. The results are shown in Table 1. But do these observed results
reflect the true results that we would have obtained if we had questioned
the entire population? (Note: We are using the term true results here to
stand for the results of a census of the entire population. The results of a
census are true in the sense that they are free of sampling errors.
Of course, there may also be measurement errors, which we are not considering
here.)
Table 1 Members' approval of a change in bylaws
| Response | Percentage | n |
| Yes | 60.0% | 120 |
| No | 40.0% | 80 |
| Total | 100.0% | 200 |
Remember that the null hypothesis says that the observed difference was
created by random sampling errors; that is, in the population, the true
difference is zero. Put another way, the observed difference (n = 120 vs.
n = 80) is an illusion created by chance error.
The usual test of the null hypothesis when we are considering frequencies
(that is, number of cases or n) is Chi Square, whose symbol is:
χ²
It turns out that after doing some computations, which are beyond the scope
of this paper, for the data in Table 1 (above), the results (taking the
expected frequency under the null hypothesis to be 100 for each response) are:
χ² = 8.00, df = 1, p < .05
What does this mean for a user of research who sees this in a report? The
values of chi square and degrees of freedom (df) were calculated solely
to obtain the probability that the null hypothesis is correct. That is, Chi
Square and degrees of freedom are not descriptive statistics that you
should attempt to interpret. Rather, think of them as sub-steps in the
mathematical procedure for obtaining the value of p. Thus, the user of
research should concentrate on the fact that p is less than .05.
As you probably remember, when the probability (p) that the null
hypothesis is correct is .05 or less, we reject the null hypothesis.
(Remember, when the probability that something is true is less than 5 in 100,
which is a low probability, conventional wisdom suggests that we should reject
it as being true.) Thus, the difference we observe in Table 1 was probably not
created by random sampling errors, and we can say that the difference is
statistically significant at the .05 level.
Up to this point, we have concluded that the difference we observed in the
sample was probably not created by sampling errors. So where did the
difference come from? Two possibilities remain:
- Perhaps there was a bias in procedures such as the person asking the
question in the survey leading the respondents by talking enthusiastically
about the proposed change in the bylaws. If we are convinced that adequate
measures were taken to prevent procedural bias, we are left with only the
next possibility as a viable explanation.
- Perhaps the population of physicians is, in fact, in favor of the
proposed change, and this fact is correctly identified by studying the
random sample.
Now let's consider some results from a survey in which the null hypothesis
was not rejected. Table 2 shows the numbers and percentages of subjects
in a random sample from a population of Resident Program Directors who prefer
each of three methods for teaching statistics.
Table 2 Resident Program Director preference for method
| Method A | Method B | Method C |
| n = 30 (37.97%) | n = 27 (34.18%) | n = 22 (27.85%) |
In Table 2, there are three differences (30 for A versus 27 for B, 30 for A
versus 22 for C, and 27 for B versus 22 for C). The null hypothesis says that
this set of differences was created by random sampling errors; in other
words, it says that there is no true difference in the population; we have
observed a difference only because of sampling errors. The results of the Chi
Square test for the data in Table 2 are:
χ² = 1.241, df = 2, p > .05
Using the decision rule that p must be equal to or less than .05 to
reject the null hypothesis, we fail to reject the null hypothesis, which
is called a statistically insignificant result. In other words, the null
hypothesis must remain on our list as a viable explanation for the set of
differences we observed by studying a sample.
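Although the computations are beyond the scope of this paper, readers who want to see them can follow the short sketch below. It rests on two assumptions that the text does not spell out: that the null hypothesis implies equal expected frequencies in every category, and that the result is judged against the critical values of chi square at the .05 level taken from a standard table (3.841 for df = 1, 5.991 for df = 2). The function name is illustrative.

```python
# Critical values of chi square at the .05 level (standard table values).
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square_equal_expected(observed):
    """Chi square for one-way frequencies, assuming the null hypothesis
    expects the same number of cases in every category."""
    total = sum(observed)
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1
    return statistic, df


for label, counts in (("Table 1", [120, 80]), ("Table 2", [30, 27, 22])):
    statistic, df = chi_square_equal_expected(counts)
    significant = statistic >= CRITICAL_05[df]
    verdict = "p < .05, reject" if significant else "p > .05, fail to reject"
    print(f"{label}: chi square = {statistic:.2f}, df = {df}, {verdict}")

# Table 1: chi square = 8.00, df = 1, p < .05, reject
# Table 2: chi square = 1.24, df = 2, p > .05, fail to reject
```

As in the text, the statistic and the degrees of freedom serve only as sub-steps toward the decision about p.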
In this topic, we have considered the use of chi square in a univariate
analysis, in which we classify each subject in only one way (such as which
teaching method each prefers). Chi square can also be useful in bivariate
analysis, in which we classify each subject in two ways (such as which method
each prefers and the gender of each) in order to examine a relationship between
the two.
Exercise
- When we study a sample, are the results called the true results or
the observed results?
- According to the null hypothesis, what created the difference in Table 1
in this topic?
- What is the name of the test of the null hypothesis used when we are
analyzing frequencies?
- As a consumer (user) of research, should you try to interpret the value of
df?
- What is the symbol for probability?
- If you read that a chi square test of a difference yielded a p of
less than 5 in 100, what should you conclude about the null hypothesis on
the basis of conventional wisdom?
- Does p < .05 or p > .05 usually lead a researcher to
declare a difference to be statistically significant?
- If we fail to reject a null hypothesis, is the difference in question
statistically significant?
- If we have a statistically insignificant result, does the null hypothesis
remain on our list of viable hypotheses?
Question for Discussion
- Briefly describe a hypothetical study in which it would be appropriate to
conduct a chi square test for univariate data.
Module #1
Section 4
Topic: The Pearson Correlation Coefficient (Pearson's r)
When we wish to examine the relationship between two quantitative sets of
scores (at the interval or ratio levels), we compute a correlation coefficient.
The most widely used coefficient is the Pearson Product-Moment Correlation
Coefficient, whose symbol is r. It is usually called simply Pearson's
r.
Consider the scores in Table 1 (below). The clinical test scores place the
residents in roughly the same order as the ratings by supervisors. In other
words, those who had high clinical test scores (such as Joe and Jane) tended to
have high supervisors' ratings and those who had low test scores (such as Jake
and John) tended to have low supervisors' ratings. This illustrates what we
mean by a direct relationship (also called a positive relationship).
Table 1 Direct relationship, r = .89
| Resident | Clinical Test Scores | Supervisors' Ratings |
| Joe | 35 | 9 |
| Jane | 32 | 10 |
| Bob | 29 | 8 |
| June | 27 | 8 |
| Leslie | 25 | 7 |
| Homer | 22 | 8 |
| Milly | 21 | 6 |
| Jake | 18 | 4 |
| John | 15 | 5 |
Note the relationships in Table 1. They are not perfect. For example,
although Joe has a higher clinical test score than Jane, Jane has a higher
supervisor's rating than Joe. If the relationship were perfect, the value of
the Pearson r would be 1.00. Being less than perfect, its actual value
is .89. As you can see in Figure 1 (below), this value indicates a strong,
direct relationship. Note: The value of the Pearson r is always
between -1.00 and +1.00.
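For readers who want to see where the value of .89 comes from, the sketch below computes a Pearson r for the Table 1 scores using the usual product-moment formula (the sum of the cross-products of the deviations divided by the square root of the product of the two sums of squared deviations). The function and variable names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Scores from Table 1, listed in the same resident order (Joe ... John).
clinical_test_scores = [35, 32, 29, 27, 25, 22, 21, 18, 15]
supervisors_ratings = [9, 10, 8, 8, 7, 8, 6, 4, 5]

r = pearson_r(clinical_test_scores, supervisors_ratings)
print(round(r, 2))  # 0.89
```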
Figure 1 Values of the Pearson r
Inverse relationship (-1.00 to 0.00): Perfect, Strong, Moderate, Weak
Direct relationship (0.00 to 1.00): Weak, Moderate, Strong, Perfect
In an inverse relationship (also called a negative relationship;
see Figure 1 above), those who are high on one variable are low on the other.
Such a relationship exists between the scores in Table 2. Those who are high on
self-concept (such as Joe and Jane) are low on depression, while those who are
low on self-concept (such as John and Jake) are high on depression. Again, the
relationship is not perfect. The value of the Pearson r for the
relationship in Table 2 is -.86.
Table 2 Inverse relationship, r = -.86
| Resident | Self-Concept Scores | Depression Score |
| Joe | 10 | 2 |
| Jane | 8 | 1 |
| Bob | 9 | 0 |
| June | 7 | 5 |
| Leslie | 7 | 6 |
| Homer | 6 | 8 |
| Milly | 4 | 8 |
| Jake | 1 | 9 |
| John | 0 | 9 |
The relationships in Tables 1 and 2 are strong but, in each case, there are
exceptions, which keep the Pearson r's from reaching 1.00 and -1.00. As the
number and size of the exceptions increase, the values of the Pearson r
become closer to 0.00. Therefore, a value of 0.00 indicates the complete
absence of a relationship (see Figure 1).
It is important to note that a Pearson r is not a proportion and
cannot be multiplied by 100 to get a percentage. For instance, a Pearson r
of .50 does not correspond to 50% of anything (see Figure 1). To think about
correlation in terms of percentages, we must convert Pearson r's to
another statistic, the coefficient of determination, whose symbol is r²,
which indicates how to compute it: simply square r. Thus, for an r
of .50, r² equals .25. If we multiply .25 by 100, we get 25%.
What does this mean? Simply this: A Pearson r of .50 is 25% better than a
Pearson r of 0.00; put another way, 25% of the variation in one variable is
accounted for by its relationship with the other. Table 3 shows selected values
of r, r², and the percentages you should think about when interpreting an r.
Table 3 Selected values of r and r²
| r | r² | Percentage better than zero |
| .90 | .81 | 81% |
| .50 | .25 | 25% |
| .63 | .40 | 40% |
| .25 | .06 | 6% |
| -.25 | .06 | 6% |
| -.63 | .40 | 40% |
| -.50 | .25 | 25% |
| -.90 | .81 | 81% |
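The conversion from r to a percentage can be sketched in a few lines. The function name is illustrative; the r values are those listed in Table 3, with .70 added as one more example.

```python
def percent_better_than_zero(r):
    """Square r to get the coefficient of determination, then express
    it as a whole-number percentage."""
    return round(r * r * 100)


for r in (0.90, 0.50, 0.63, 0.25, -0.25, -0.63, -0.50, -0.90, 0.70):
    print(f"r = {r:+.2f}  ->  {percent_better_than_zero(r)}%")
```

Because r is squared, a negative r yields the same percentage as a positive r of the same size; the sign tells us only the direction of the relationship.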
Exercise
- "Pearson r" stands for what words?
- When the relationship between two variables is perfect and inverse, what
is the value of r?
- Is it possible for a negative relationship to be strong?
- Is an r of -.90 stronger than an r of .50?
- Is a relationship direct or inverse when those with high scores on one
variable have high scores on the other and those with low scores on
one variable have low scores on the other?
- What does an r of 1.00 indicate?
- For a Pearson r of .60, what is the value of the coefficient of
determination?
- What do we do to a coefficient of determination to get a percentage?
- A Pearson r of .70 is what percentage better than a Pearson r of
0.00?
Question for Discussion
- Name two variables between which you would expect to get a strong,
positive value of r.
- Name two variables between which you would expect to get a strong,
negative value of r.
Module #2
Section 1
Topic: The t Test
Suppose we have a research hypothesis that says "medical research
investigators who take a short course on the causes of HIV will be less fearful
of the disease than research investigators who have not taken the course,"
and test it by conducting an experiment in which a random sample of research
investigators is assigned to take the course and another random sample is
designated as the control group. (Note: Random assignment is preferred because
it precludes any bias in the assignment of subjects to the groups and because we
can test for the effect of random errors with a significance test; we cannot
test for the effects of bias.)
Let's suppose that at the end of the experiment the experimental group gets
a mean of 16.61 on a fear of HIV scale and the control group gets a mean of
29.67 (where the higher the score, the greater the fear of HIV).
These means support our research hypothesis. But can we be certain that our
research hypothesis is correct? If you've been reading various topics on
statistics, you already know that the answer is "no" because of
the Null Hypothesis, which says that there is no true difference
between the means; that is, the difference was created merely by the chance
errors created by random sampling (these errors are known as sampling errors).
Put more simply, unrepresentative groups may have been assigned
to the two conditions quite at random.
The t test is often used to test the null hypothesis regarding the
observed difference between two means (to test the null hypothesis between two medians,
the median test is used; it is a specialized form of chi square test).
For the example we are considering, a series of computations (which are beyond
the scope of this paper) would be performed to obtain a value of t
(which, in this case, is 5.38) and a value of degrees of freedom (which, in this
case, is df = 179). These values are not of any special interest to us
except that they are used to get the probability (p) that the null
hypothesis is true. In this particular case, p is less than .05. Thus, in
a research report, you may read a statement such as this:
"The difference between the means is statistically
significant (t = 5.38, df = 179, p< .05)".
The term statistically significant indicates that the null
hypothesis has been rejected. You will recall that when the probability
that the null hypothesis is true is .05 or less (such as .01 or .001), we reject
the null hypothesis. When something is unlikely to be true, because it has a low
probability of being true, we reject it.
Having rejected the null hypothesis, we are in a position to assert that our
research hypothesis probably is true (assuming no procedural bias was allowed to
affect the results, such as testing the control group immediately after a major
news story on a celebrity person with AIDS, while testing the experimental group
at an earlier time).
What leads a t test to give us a low probability? Three things:
- Sample size. The larger the sample, the less likely that an observed
difference is due to sampling errors. Large samples provide more precise
information. Thus, when the sample is large, we are more likely to reject
the null hypothesis than when the sample is small.
- The size of the difference between means. The larger the difference,
the less likely that the difference is due to sampling errors. Thus, when
the difference between the means is large, we are more likely to reject the
null hypothesis than when the difference is small.
- The amount of variation in the population. When a population is very
heterogeneous (has much variability) there is more potential for sampling
error. Thus, when there is little variation (as indicated by the standard
deviations of the sample), we are more likely to reject the null hypothesis
than when there is much variation.
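The effect of each of these three things can be seen by computing t from summary statistics. The sketch below uses the common pooled-variance form of the t test for two independent groups, which may or may not be the exact form used for the example in this topic. The two means (29.67 and 16.61) come from the text; the standard deviations, the group sizes, and the smaller comparison mean are hypothetical values chosen only for illustration.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and degrees of freedom for two independent groups,
    using the pooled-variance formula."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


# Baseline: means from the text; SD = 15 and n = 50 per group are hypothetical.
baseline, _ = pooled_t(29.67, 15.0, 50, 16.61, 15.0, 50)
# 1. Larger samples (n = 200 per group), everything else unchanged.
larger_sample, _ = pooled_t(29.67, 15.0, 200, 16.61, 15.0, 200)
# 2. Smaller difference between the means (hypothetical mean of 20.00).
smaller_difference, _ = pooled_t(20.00, 15.0, 50, 16.61, 15.0, 50)
# 3. More variation (hypothetical SD of 40).
more_variation, _ = pooled_t(29.67, 40.0, 50, 16.61, 40.0, 50)

print(f"baseline t           = {baseline:.2f}")
print(f"larger sample t      = {larger_sample:.2f}")
print(f"smaller difference t = {smaller_difference:.2f}")
print(f"more variation t     = {more_variation:.2f}")
```

A larger value of t corresponds to a lower probability, so the larger sample raises t while the smaller difference and the greater variation both lower it.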
A special type of t test is also applied to correlation coefficients.
Suppose we drew a random sample of 50 medical students and correlated their hand
size with their GPAs and got an r of .19. The null hypothesis says that
the true correlation in the population is 0.00 - that we got .19 merely
as the result of sampling errors. For this example the t test indicates
that p > .05. Since the probability that the null hypothesis is true
is greater than 5 in 100, we do not reject the null hypothesis; we have a
statistically insignificant correlation coefficient. In other words, for n
= 50, an r of .19 is not significantly different from an r of
0.00. When reporting the results of the t test for the significance of a
correlation coefficient, it is better not to mention the value of t.
Rather, it is better to indicate only whether or not the correlation is
significant at a given probability level.
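The hand-size example can be checked with the standard formula for testing a correlation coefficient, t = r × √(n − 2) / √(1 − r²), with df = n − 2. In the sketch below the function name is illustrative, and the critical value of about 2.01 (two-tailed, .05 level, df = 48) is taken from a standard t table rather than from the text.

```python
import math


def t_for_r(r, n):
    """t statistic and degrees of freedom for testing whether a
    Pearson r differs from 0.00."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df


CRITICAL_T_DF48 = 2.011  # two-tailed .05 critical value for df = 48 (table value)

t, df = t_for_r(0.19, 50)
significant = abs(t) >= CRITICAL_T_DF48
print(f"t = {t:.2f}, df = {df}, significant at .05: {significant}")
# t = 1.34, df = 48, significant at .05: False
```

Because t falls short of the critical value, p is greater than .05 and the null hypothesis is not rejected, which agrees with the conclusion above.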
Exercise
- What does the null hypothesis say about the difference between two sample
means?
- Is the value of t usually of any special interest to consumers of
research?
- Suppose you read that for the difference between two means, t =
2.00, df = 20, p>.05. Using conventional standards, should
you conclude that the null hypothesis should be rejected?
- Suppose you read that for the difference between two means, t =
2.859, df = 40, p<.05. Using conventional standards, should
you conclude that the null hypothesis should be rejected?
- Based on the information in question 4, should you conclude that the
difference between the means is statistically significant?
- When we use a large sample, are we more or less likely to reject the
null hypothesis than when we use a small sample?
- When the size of the difference between means is large are we more or
less likely to reject the null hypothesis than when the size of the
difference is small?
- If we read that for a sample of 92 subjects, r = .41, p<.001,
should we reject the null hypothesis?
- Is the value of r in question 8 statistically significant?
Question for Discussion
- Of the three things that lead to a low probability, which one is most
directly under the control of a researcher?
Module #2
Section 2
Topic: One-Way Analysis of Variance (ANOVA)
In using the t test, you learned it was used to test the null
hypothesis for the observed difference between two sample means. An
alternative test for this problem is the analysis of variance (often
called ANOVA), sometimes called the F test.
Instead of t, it yields a statistic called F, as well as
degrees of freedom (df), sum of squares, mean square, and a p
value, which indicates the probability that the null hypothesis is correct.
As with the t test, the only value of interest to the typical user of
research is the value of p. By convention, when p equals .05
or less (such as .01 or .001), we reject the null hypothesis and declare the
result to be statistically significant.
Because the t test and ANOVA are based on the
same theory and assumptions, when we compare two means, both tests yield exactly
the same value of p and, hence, lead to the same conclusion regarding
significance. So for two means, both tests are equivalent and either test can
be used to get the same results. However, that is where the similarity
stops! Note that a single t test can compare only two means, but a
single ANOVA can compare a number of means, which is a great advantage.
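The equivalence for two means can be checked numerically: for two groups, the F from a one-way ANOVA equals the square of the pooled-variance t, which is why both tests give the same p. The sketch below uses two small sets of made-up scores (hypothetical data, not from this primer) and plain Python arithmetic.

```python
import math
from statistics import mean

def pooled_t(a, b):
    """Pooled-variance t statistic for two independent groups of raw scores."""
    mean_a, mean_b = mean(a), mean(b)
    ss_within = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    pooled_var = ss_within / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))

def one_way_f(groups):
    """F ratio for a one-way ANOVA on any number of groups."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical depression scores for two groups of six subjects each.
group_a = [7, 5, 6, 8, 4, 6]
group_b = [3, 2, 4, 1, 3, 2]

t = pooled_t(group_a, group_b)
f = one_way_f([group_a, group_b])
print(f"t = {t:.3f}, t squared = {t**2:.3f}, F = {f:.3f}")
```

The same `one_way_f` function accepts three or more groups, which a single t test cannot do.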
Suppose, for example, that as a medical researcher you tested three drugs
for treating depression in an experiment and obtained the following
means and standard deviations.
Table 1 Posttest means and standard deviations
of depression scores for three drugs
_________________________________________________________
Drug A            Drug B            Drug C
M = 6.00          M = 5.50          M = 2.33
_________________________________________________________
Since the higher the score the greater the depression, inspection of the
means shows that there are three observed differences:
Drug C is superior to Drug A.
Drug C is superior to Drug B.
Drug B is superior to Drug A.
The null hypothesis says that the entire set of three differences was
created by sampling error. Through a series of computations that are beyond the
scope of this paper, an ANOVA for these data yields this result: F =
10.837, df = 2, 15, p < .05. This result might be stated in a
sentence or presented in a table such as Table 2, which is known as an ANOVA table.
While it contains many values, which were used to arrive at the probability, we
are only interested in the end result: the value of p.
As users of statistics know, when the probability is .05 or less, as it is
here, we reject the null hypothesis. This means that the entire set of
differences is statistically significant at the .05 level (see Table 2).
Note: the ANOVA does not tell us which of the three differences we listed are
significant; it could be that only one, only two, or all three
are significant. This needs to be explored with additional tests, known as
multiple comparison tests.
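Table 2 ANOVA for data in Table 1
_________________________________________________________
Source of                    Sum of       Mean
Variation            df      Squares      Squares      F
Between Groups        2      47.445       23.722       10.837*
Within Groups        15      32.833        2.189
_________________________________________________________
* p < .05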
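Although the text treats the full computation as out of scope, the last step of the ANOVA table is simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below only reproduces that step from the values printed in Table 2; the variable names are illustrative.

```python
# Values copied from Table 2.
ss_between, df_between = 47.445, 2
ss_within, df_within = 32.833, 15

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F = mean square between groups / mean square within groups.
f_ratio = ms_between / ms_within

print(f"MS between = {ms_between:.3f}")
print(f"MS within  = {ms_within:.3f}")
print(f"F          = {f_ratio:.3f}")
```

The result agrees with the tabled F of 10.837 to within rounding. The degrees of freedom of 2 and 15 correspond to three groups and 18 subjects in all.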
As previously mentioned, there are a number of multiple comparison tests (Dunnett's
test, Scheffé's test, and Tukey's test). Each one is based on different
assumptions; they usually yield similar results, but not always. For the
data we are considering, application of a popular multiple comparison test,
Scheffé's test, yields these probabilities:
(1) for Drug C vs. A, p < .05
(2) for Drug C vs. B, p < .05
(3) for Drug B vs. A, p > .05
Thus, we have found that Drug C is significantly better than Drugs A and B,
but that Drugs B and A are not significantly different from each other.
In review, an ANOVA tells us whether a set of differences, overall, is
significant. If so, we can use the appropriate multiple comparison test
(Dunnett's, Scheffé's, or Tukey's test) to determine which pairs of means are
significantly different from each other.
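To make "pairs of means" concrete, the short sketch below lists every pair that a multiple comparison test would examine for the three drugs, along with the observed difference in means. It only enumerates the pairs; it does not carry out Scheffé's test or any other multiple comparison test, and the dictionary layout is an illustrative choice.

```python
from itertools import combinations

# Posttest means for the three drugs in the one-way example.
drug_means = {"Drug A": 6.00, "Drug B": 5.50, "Drug C": 2.33}

# Every unordered pair of groups is one comparison.
pairs = list(combinations(drug_means, 2))

for first, second in pairs:
    difference = drug_means[first] - drug_means[second]
    print(f"{first} vs. {second}: difference in means = {difference:.2f}")

print(f"Number of pairwise comparisons: {len(pairs)}")
```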
In this topic, we have been considering a One-Way ANOVA (also known as a
single-factor ANOVA). It is called this because we have classified the subjects
in only one way: in terms of which drug they took. There is another type of
ANOVA in which subjects are classified in two ways, appropriately called
Two-Way ANOVA, which will be discussed in a different topic.
Exercise
- ANOVA stands for what words?
- If we compare two means for significance, will ANOVA and the t test
yield the same probability?
- If an ANOVA yields p<.001, should the null hypothesis be rejected?
- If an ANOVA yields p>.05, should the null hypothesis be rejected?
- If we have four means on an achievement test for samples of students in
four states, can we determine whether the set of differences, overall, is
statistically significant by using ANOVA?
- For the information in question 5, could we use a t test for the
same purpose?
- Should the typical consumer of research be concerned with the values of
the sum of squares?
- In an ANOVA table, which statistic is of the greatest interest to the
typical user of research?
- If an overall ANOVA for three or more means is significant, it can be
followed up with what type of test to determine the significance of the
differences among the individual pairs of means?
Question for Discussion
- Briefly describe a hypothetical study in which it would be appropriate to
conduct a one-way ANOVA but it would not be appropriate to conduct a t
test.
- If you have means for four groups, you have how many individual pairs of
means to be compared with a multiple comparison test?
Module #2
Section 3
Topic: Two-Way Analysis of Variance (ANOVA)
In the one-way ANOVA, we saw how ANOVA can be used to test for the overall
significance of a set of means when subjects have been classified in one way.
Often, however, it is desirable to look at a two-way classification such as (1)
which drug was taken and (2) how long subjects have been depressed. Table 1
(below) shows the means for such a study (note: for instructional purposes, only
two drugs are shown. However, we may use ANOVA when there are more than two).
Since higher depression scores indicate more depression, a low mean is
desirable.
Table 1 Means for a study of depression: Drugs and length of
depression comparisons
_________________________________________________________
                  Drug A        Drug B        Row Total
Long-term         M = 8.11      M = 8.32      M = 8.22
Short-term        M = 4.67      M = 8.45      M = 6.56
Column Total      M = 6.39      M = 8.38
_________________________________________________________
Although the subjects are classified in two ways, analysis of the table
answers three questions. First, by comparing the column totals of 6.39 and
8.38, we can see that, overall, those who took Drug A are less depressed. It's
important to notice that the mean of 6.39 for Drug A is based on both those who
have long-term and those who have short-term depression; the same is true
of the mean of 8.38 for Drug B. Thus, by comparing the column total means, we
are answering the question of which drug is more effective in general without
regard to how long subjects have been depressed. In analysis of variance,
this is known as a main effect.
Each way in which subjects are classified yields a main effect in analysis of
variance. Thus, since subjects were also classified in terms of their length of
depression, there is a main effect for short-term vs. long-term, which can be
seen by examining the row total means of 8.22 and 6.56. This main effect
indicates that, overall, those with short-term depression are less depressed
than those with long-term depression.
In this example, the most interesting question is the question of an
interaction. The question is this: Is the effectiveness of the drugs
dependent, in part, on the length of depression? By examining the individual
cell means (the four means in Table 1 that are not row or column totals), we
can see that the answer is
"yes". Drug A is more effective for short-term than long-term
depression (4.67 vs. 8.11) while Drug B is about equally effective for both
types of depression (8.32 vs. 8.45). What is the practical implication of this
interaction? The overall effectiveness of Drug A is almost entirely attributable
to its effectiveness for short-term depression. That is, if a person has
short-term depression, Drug A is indicated, but if a person has long-term
depression, either drug is likely to be about equally effective.
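The row and column totals used for the two main effects can be recovered from the four cell means. The sketch below assumes an equal number of subjects in every cell, so that each total is the simple average of its cell means; the primer does not state the cell sizes, so this is an assumption. The function name and dictionary layout are illustrative.

```python
from statistics import mean

# Cell means from Table 1, keyed by (length of depression, drug).
cell_means = {
    ("Long-term", "Drug A"): 8.11,
    ("Long-term", "Drug B"): 8.32,
    ("Short-term", "Drug A"): 4.67,
    ("Short-term", "Drug B"): 8.45,
}

def marginal_mean(cells, position, level):
    """Average the cell means that share one level of one factor.

    position 0 selects by row (length of depression), 1 by column (drug).
    Assumes equal numbers of subjects in every cell.
    """
    return mean(value for key, value in cells.items() if key[position] == level)

drug_a = marginal_mean(cell_means, 1, "Drug A")
drug_b = marginal_mean(cell_means, 1, "Drug B")
long_term = marginal_mean(cell_means, 0, "Long-term")
short_term = marginal_mean(cell_means, 0, "Short-term")

print(f"Main effect for drug:   Drug A = {drug_a:.3f}, Drug B = {drug_b:.3f}")
print(f"Main effect for length: Long-term = {long_term:.3f}, Short-term = {short_term:.3f}")
```

The averages agree with the totals in Table 1 to within rounding.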
For the data in Table 1, it turns out that p < .05 for both main effects
and the interaction. Thus, we can reject the null hypotheses that say that the
differences we are considering are the result of random errors. Of course, it
does not always turn out this way. It's possible for one or two of the main
effects to be significant but the interaction to be not significant; it's also
possible for neither main effect to be significant while the interaction is
significant, which is the case for the data in Table 2.
Table 2 Means for a study of depression: Drugs and gender
comparisons
_________________________________________________________
                  Drug A        Drug B        Row Total
Females           M = 8.00      M = 5.00      M = 6.50
Males             M = 5.00      M = 8.00      M = 6.50
Column Total      M = 6.50      M = 6.50
_________________________________________________________
Notice that the column totals (6.50 vs. 6.50) in Table 2 indicate no main
effect for Drug A vs. Drug B (in this case they are equal to one another).
Likewise, the row totals (6.50 vs. 6.50) indicate no main effect for gender.
But there is a very interesting finding: the interaction of drug type
and gender, which indicates that for females, Drug B is superior, but for males,
Drug A is superior. Note that if we had compared the two drugs in a One-Way
ANOVA without also classifying the subjects according to gender (as we did here
in the Two-Way ANOVA), we would have missed this important interaction.
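One informal way to look for an interaction in a 2 x 2 table of means is a "difference of differences": compute the Drug B minus Drug A difference in each row and subtract one from the other. A value of zero means the drug difference is the same in both rows (no interaction in the sample means). The sketch below applies this to the cell means of Tables 1 and 2. The function name is illustrative, and the contrast only describes the sample means; whether an interaction is statistically significant still depends on the ANOVA.

```python
def interaction_contrast(a_row1, b_row1, a_row2, b_row2):
    """Difference of differences: (B - A) in row 1 minus (B - A) in row 2."""
    return (b_row1 - a_row1) - (b_row2 - a_row2)

# Table 1: rows are long-term and short-term depression.
length_contrast = interaction_contrast(8.11, 8.32, 4.67, 8.45)

# Table 2: rows are females and males.
gender_contrast = interaction_contrast(8.00, 5.00, 5.00, 8.00)

# Hypothetical means in which Drug B scores 2 points higher in both rows.
parallel_contrast = interaction_contrast(10.0, 12.0, 20.0, 22.0)

print(f"Table 1 contrast: {length_contrast:.2f}")
print(f"Table 2 contrast: {gender_contrast:.2f}")
print(f"Parallel pattern: {parallel_contrast:.2f}")
```

In Table 2 the drug difference reverses sign between the rows, which is why the column totals are identical even though the contrast is far from zero.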
Exercise
- Suppose we drew random samples of urban, suburban, and rural children and
tested them for creativity, and obtained three means. Should we use a
one-way or a two-way ANOVA to test significance?
- Do the following means on a performance test indicate an interaction
between type of reward and age?
__________________________________________________________
                  Praise        Monetary      Row
                  Reward        Rewards       Totals
Young Adults      M = 50.00     M = 60.00     M = 55.00
Older Adults      M = 60.00     M = 50.00     M = 55.00
Column Total      M = 55.00     M = 55.00
__________________________________________________________
- Do the means for question 2 indicate a main effect for type of reward?
- Do the following means on an achievement test indicate an interaction
between the method of instruction (A vs. B) and the aptitude of the
students (high vs. low)?
_______________________________________________________
                  Method A      Method B      Row Totals
High Aptitude     M = 100.00    M = 85.00     M = 92.50
Low Aptitude      M = 100.00    M = 85.00     M = 92.50
Column Total      M = 100.00    M = 85.00
_______________________________________________________
- Do the means for question 4 indicate a main effect for method of
instruction?
- Do the means for question 4 indicate a main effect for aptitude?
- If p>.05 for an interaction in an analysis of variance, should
we reject the null hypothesis?
- If p<.05 for a main effect in an analysis of variance, should we
reject the null hypothesis?
- If both main effects are statistically significant in an analysis of
variance, will the interaction necessarily be significant?
Question for Discussion
- Briefly describe a hypothetical study in which it would be appropriate to
conduct a two-way ANOVA but it would not be appropriate to conduct a
one-way ANOVA.
Module #3
Section 1
Topic: Multiple Regression Correlation (MRC)
Multiple Regression Correlation (MRC), like the Analysis of Variance (ANOVA),
is a statistical method from the "General Linear Model". According to
Kerlinger & Pedhazur, "multiple regression analysis can do anything
ANOVA does." ANOVA is a special case of MRC. Both are algebraically the same and
will produce statistically the same outcome (see Table 1).
Table 1 ANOVA vs. MRC outcomes
________________________________________________
              ANOVA               MRC
F ratio       20.38               20.38
df            1, 18               1, 18
p value       < .01               < .01
SS            Between 64.80       Regression 64.80
              Within 57.20        Residual 57.20
________________________________________________
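The agreement shown in Table 1 can be reproduced on a small scale. The sketch below runs a one-way ANOVA on two groups and then a simple regression of the scores on a 0/1 group code, and shows that the between-groups and regression sums of squares match, as do the within-groups and residual sums of squares and the F ratios. The scores are hypothetical (they are not the data behind Table 1), and the 0/1 "dummy" coding is one common way to represent group membership in a regression.

```python
from statistics import mean

# Hypothetical scores for two groups of five subjects each.
group_0 = [3, 5, 4, 6, 2]
group_1 = [7, 9, 8, 6, 10]
groups = (group_0, group_1)
scores = group_0 + group_1
grand_mean = mean(scores)
df_error = len(scores) - 2

# One-way ANOVA: partition variation into between and within groups.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)
f_anova = (ss_between / 1) / (ss_within / df_error)

# Regression: predict each score from a 0/1 code for group membership.
x = [0] * len(group_0) + [1] * len(group_1)
x_bar = mean(x)
slope = (sum((xi - x_bar) * (yi - grand_mean) for xi, yi in zip(x, scores))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = grand_mean - slope * x_bar
predicted = [intercept + slope * xi for xi in x]
ss_regression = sum((p - grand_mean) ** 2 for p in predicted)
ss_residual = sum((y - p) ** 2 for y, p in zip(scores, predicted))
f_regression = (ss_regression / 1) / (ss_residual / df_error)

print(f"ANOVA:      SS between = {ss_between:.2f}, SS within = {ss_within:.2f}, F = {f_anova:.2f}")
print(f"Regression: SS regression = {ss_regression:.2f}, SS residual = {ss_residual:.2f}, F = {f_regression:.2f}")
```

With this coding, the regression slope is simply the difference between the two group means.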
ANOVA is used primarily in scientific experiments, while MRC is used
primarily in quasi-experimental designs. Both have null
hypotheses (as described under the Null Hypothesis topic); however, they look at
different elements of the model. ANOVA looks to see if population mean 1
equals population mean 2, while MRC looks at the correlation of the
Independent Variable (IV) to the Dependent Variable (DV), as follows:
MRC:    H0: ρ = 0
ANOVA:  H0: μ1 = μ2
Both MRC and ANOVA have the same underlying assumptions:
- That the groups being compared have equal variances,
- That the scores are normally distributed, and
- That the observations are independent of one another.
There are several key components of MRC and ANOVA that differ
from one another:
MRC Key Components
1. Impractical to hold subjects for an extended period of time.
2. Not appropriate to randomize subjects.
3. May be too expensive to randomize subjects.
4. May be logistically impossible to randomize the subjects.
ANOVA Key Components
1. Causation
2. Experimental control
3. Randomization of subjects
4. Ability to isolate variables
As noted earlier, MRC is used in quasi-experimental design experiments.
Quasi-experiments have treatments, outcome measures and experimental units,
but do not use random assignment to create the comparisons from which
treatment-caused change is inferred. Instead, the comparisons depend on
nonequivalent groups that differ from each other in many ways other than the
presence of a treatment whose effects are being tested.
In the absence of randomization, the researcher is faced with the task of
identifying and separating the effects of the treatments from the effects of all
other factors affecting the dependent variable (DV). Campbell & Stanley
(1963) warned researchers about "a feeling of hopelessness with
regard to achieving experimental control which leads to the abandonment of such
efforts in favor of more informal methods of investigation". Their
defense of quasi-experimental design was that it is "deemed worthy of
use when better designs are not feasible". In other words, support
for the use of quasi-experimental design is not based on its intrinsic worth;
rather, it was positioned as an approach for when better designs are not
possible (Sawilowsky, 1997).
Campbell & Stanley (1963) suggest the use of quasi-experimental design
for the many "social science settings" in which there is no way
to randomly assign participants.
In regression analysis one is trying to either predict or explain
phenomena. In predictive research the main emphasis is on practical
applications, whereas in explanatory research the main emphasis is on understanding
phenomena. This is not to say that the two research activities are
unrelated, or that they have no bearing on each other. Predictive research may,
for example, serve as a source of hunches and insights that might lead to
theoretical considerations. Yet the importance of distinguishing between the two
types of research activities cannot be overemphasized.
MRC uses IVs and DVs in the context of explanatory
research, whereas ANOVA uses predictor and criterion in the context of
predictive research. Prediction is really a special case of explanation; it can
be subsumed under theory and explanation as noted in Table 2. This is
explanation:
Table 2 Explanation of a variable
________________________________
If p, then q, under conditions r, s and t
________________________________
The above explanation in Table 2 is also prediction, prediction from p to q
as follows in Table 3.
Table 3 Prediction of a variable
______________________________________
Prediction from p (and r, s, and t) to q
________________________________
Kerlinger & Pedhazur (1973) indicate that, while MRC is well-suited to
predictive analysis, it is more fundamentally oriented to "explanatory
analysis". We do not simply throw variables into a regression
equation; we enter them, whenever possible, at the dictates of theory and
reasonable interpretation of empirical research findings.
Exercise
1. Is MRC used primarily in experimental design research? If so, why? If
not, why not?
2. Why is ANOVA a special case of MRC?
3. What are the underlying assumptions of MRC?
4. What are the three components that both MRC and ANOVA have, and what is
the one component ANOVA has that MRC lacks?
5. Campbell & Stanley (1963) suggested that the use of
quasi-experimental design and MRC is more centered in the social science
setting. Why?
6. In predictive research, the main emphasis is on what?
7. In explanatory research, the main emphasis is on what?
8. What does MRC use in the context of explanatory research?
9. What does ANOVA use in the context of predictive research?
Question for Discussion
Briefly describe a hypothetical study in which it would be appropriate to
use MRC.
Foundations of Program Evaluation
Standards of Evaluation - Education Programs
Standards for Evaluations of Education Programs, Projects, and Materials
(Joint Committee on Standards for Educational Evaluation, 1981):
- A set of 30 criteria subsumed into four (4) categories: utility,
feasibility, propriety and accuracy (Payne, 1981).
- Utility: Audience identification, evaluator credibility, dissemination,
report clarity and timeliness.
- Feasibility: Practicality (data collection), political viability and
cost-effectiveness.
- Propriety: The legal and ethical issues associated with conducting the
evaluation.
- Accuracy Standards: The reliability, validity, data control and
analysis.
What is the difference between evaluation and research?
- Evaluation takes place in a naturalistic setting.
- Evaluation focuses on the entire program.
- Evaluation has more complex outcomes.
- The objective of evaluation involves a greater range of phenomena, and
- The objectives of evaluation tend to be oriented more to process and
behavior.
What is the role of evaluation (Heath, 1969):
- Contributes to the general body of knowledge about some item.
- Facilitation of some rational comparison or competing program.
- Improvement of the program during the development phase.
Review (Education Programs):
- Nominal & Ordinal (qualitative).
- Interval & Ratio (quantitative).
______________________________________________________
Donald T. Campbell
Many call him the "father" of scientific evaluation.
Campbell and Stanley (1963) wrote the seminal text on research design titled:
Experimental and Quasi-experimental Designs for Research.
Experiment: an experiment that has randomization as its basis, and
can be repeated over and over again and the same results will occur.
Quasi-experiment: an experiment, which lacks randomization as its basis.
Campbell was in favor of using both quantitative and qualitative
procedures. He wanted qualitative methods to complement quantitative methods
rather than replace them.
All evaluations should be open for criticism and accountability.
Campbell does not recommend evaluating a program if:
- The program to be evaluated is "puny".
- It has already been approved so officials can say they are addressing
problems.
- It is still being implemented despite ongoing mistakes.
- It involves officials who are not proud of their work.
______________________________________________________
(Shadish, Cook and Leviton, 1991)
Modern social program evaluation emerged in the 1960s due to massive
Federal involvement in social welfare spending under the Great Society movement
during the administrations of Presidents Johnson and Nixon (e.g., the War on
Poverty, Medicare/Medicaid, etc.).
What percent of the U.S. annual budget is earmarked for research and
evaluation?
The first federal program to require "evaluation" was
the juvenile delinquency program enacted by Congress in 1962 (Weiss, 1987).
Stakeholder: Those who have a "stake" in the program or its
"evaluation".
There is a psychological phenomenon in which an evaluation demonstrates that
the program is not effective, yet the person continues to support the
program. Why?
- The administrator's main concern is maintaining his/her budget.
- The administrator is protecting his/her employees' job security.
- The politician is concerned about funds to his/her home district.
- The politician will exploit the program in order to get reelected.
- A tremendous amount of energy went into the development of the program.
Then how does change occur?
- By incremental steps: hundreds of accumulated small inputs.
- No single authority can radically change a program (e.g., Social Security).
Program evaluation assumes that education problem solving can be improved by
incremental improvements in existing programs, better designs of new ones,
terminating bad programs and replacing them with better ones.
Most education programs get replaced or die out for political or economic
reasons, not because of evaluation results.
Internal validity is the sine qua non (Campbell and Stanley, 1963)
Lincoln and Guba (1985, 1986) argued that there is no reality beyond what we
each construct, so causality, generalizability and truth
have little useful meaning.
Value Component: Early evaluators thought that they could be value-free from
justice, equality, liberty and human rights.
Theory of Valuing (Beauchamp, 1982)
- Metatheory: study of the nature of and justification for valuing
- Prescriptive theory: advocates the primacy of particular values
- Descriptive theory: describes values without advocating one as best.
Evaluation is about determining value, merit or worth.
Most evaluators use descriptive valuing. They describe values held by the
stakeholders, however, no claim by the evaluator is made that this is the best
program.
Prescriptive theories have a heavy burden of justifying why?
(Michael S. Scriven, 1983)
Without evaluation, we have no means of allocating resources; waste, fraud,
and incompetence would go undetected (Scriven, 1983).
The fiscal benefits of the evaluation should always exceed the cost of
conducting the evaluation.
Scriven coined the terms formative evaluation and summative
evaluation:
- Formative evaluation: aimed at improving an educational experience or
product during its developmental phases (Scriven, 1967).
- Summative evaluation: an end-of-course assessment (Scriven, 1967).
How bias is introduced into evaluations:
- When the evaluator is internal.
- When loyalties are divided between management and employees.
- When the evaluator uses goals established by the "stakeholder".
- When the evaluator uses the program's goals.
How do we correct bias in education programs (goal-free evaluations):
- The evaluator must be "totally blind" to the program, similar to
blind justice and double-blind medicine trials.
- The evaluator must be able to identify both positive and negative effects.
Why should the evaluator be goal-free in evaluating a program:
- Because their reputation as a "quality" evaluator is on the
line.
- Auditing the final report is a special case of
"meta-evaluation".
- Because the evaluator gets evaluated.
- Because the evaluator creates standards of acceptable performance on
criteria of merit.
Cost-benefit Analysis: "Translates all inputs and outputs into monetary
units, yielding a single cost-benefit ratio of net fiscal gains or losses."
Qualitative vs. Quantitative Research: An Overview
Elements of "Naturalistic" Research
Intense and/or Prolonged Contact with a "Field" or life
situation
- Typically reflective of normal or everyday life
- Individuals, groups, societies, organizations, etc.
Researcher Role to Gain a "Holistic" Overview of the Context
Under Study
Researcher Attempts to Capture Data on the Perceptions of Local Actors
Qualitative Analysis
Few Standardized Instruments Used
Most Analysis performed with Words Assembled into Similar
"Meaning Units"
Analysis Proceeds by Isolating Themes & Expressions
- Review with informants
- One main task: Explain ways people in part come to understand,
account for, take action and manage their day-to-day situations
Advantages and Disadvantages
Advantages
1. Helps establish interpretive framework
2. Helps reduce bias
3. Processual
4. Adds "punch" to research reports
5. Generally high in validity
Disadvantages
1. Concepts not always clearly defined
2. Field notes guarded
3. Territoriality
4. Criticism taboo
5. Lots of data
6. Analysis procedures poorly defined
7. Generally low reliability
Relationship Between Methods
Qualitative Research                  Quantitative Research
1. Towards discovery                  1. Towards testing a hypothesis
2. Induction                          2. Deduction
3. Specific to general                3. General to specific
4. Fieldwork                          4. Office/desk/lab
5. Small n                            5. Large N
6. Non-statistical                    6. Statistical
7. Describe, interpret, explain       7. Predict, control

Steps of Qualitative Method
Record raw data
Conceptualize
Develop Propositions
Develop Hypotheses
Construct Theory
Develop a Model
Qualitative Data Sources
- Interviews
- Structured
- Semi-structured
- Unstructured
- Oral Histories & Archival Materials
- Direct Observation/Participant-Observation
- Personal Documents
- Visual Documents
Early Field Reactions
Overwhelmed by data
Uncomfortable, "cultural shock"
Continually explain purpose of research
Feelings of inadequacy
Concern about how informants see you
Guilt about not doing enough work
Midpoint Field Reactions
Almost too rapid internalization & acceptance
of cultural norms/
values/behavior
Inability to verbalize the meaning related to
experiences/events/interactions
Recognizing need for physical distance
Worry about asking right questions
Late Field Reactions
- Concern about fulfilling research design
- Fear of collecting insufficient data
- Feelings of taking without giving back
- Personal difficulties associated with leaving the field site
Computers & Quantitative Analysis
- SPSS statistical package
- SAS statistical package
- Database: data properly coded
- Data manipulation