These
relate to some specific technical matters I am sometimes asked about but which
are not covered in detail in my courses… or maybe passed you by unnoticed.
Note, I am not trying to present all my course material here (you have
to take the courses for that), just deal with some ‘frequently asked questions’
and ‘things people frequently get confused over/get wrong’. Also, not all of
these are readily understandable unless you have already taken stats courses!
·
How do I round figures
to make them shorter, e.g. 3.852? And
how many decimal places should I report?
·
How do I generate random numbers to help when
sampling from a list, or when dividing subjects randomly into groups? Use the facility
at http://www.randomizer.org/form.htm
·
I have the proficiency scores (or the like) for
30 subjects, and want to divide the cases into groups
based on this. Or I need categories of word stimuli of three different
frequencies. How do I do it?
·
Can I get phonetic symbols
like [ð] shown on the scales of SPSS graphs?
·
How do I combine columns of
figures I have entered in SPSS, when I want averages for each person of the
figures in the columns (e.g. the scores for separate items in a test)?
·
What is item analysis? And
what does it mean if the F in an ANOVA result is labelled F1 or F2, where there
has been an analysis by items as well as by subjects?
·
How do I eliminate extreme
response times in psycholinguistic data?… or response times where the response
was wrong?
·
What does the standard deviation
really mean?
·
When I do a histogram of some scores (interval
scale data) I am supposed to look at the distribution
shape… the pattern of the heaps on the graph… but how do I interpret the
shape I see?
·
How should one treat rating
scale responses? As ordered categories or interval scores?
·
If my data is not normally distributed, so not
suited to t tests and ANOVA, what can I do? What are the transformations I can use?
·
What really are Likert and Guttman
scales, and how should they be constructed? They both are ways of measuring
things via a set of agree-disagree items. Often we use sets of items of this
type that other researchers made… but I wonder if anyone actually selected and
rated the items in the approved way in the first place?
·
What does it mean when SPSS gives you a figure
with an E on the end? e.g. 7.012E-02
·
What are degrees of freedom (df) and how do I report them, if
needed?
·
What are residuals and
what do they tell me?
·
If in a pilot trial of a few subjects I don’t
get the significant result I want, how can I estimate how
many subjects I would need to probably get a sig result?
·
How do I do follow-up post
hoc paired comparisons and planned comparison tests for any kind of main
effect or interaction in ANOVA where more than two groups or conditions were
initially compared? SPSS doesn’t do all the possibilities, or hides some away…
·
How do I do post hoc paired comparisons after a
Kruskal-Wallis test?
·
What is Bonferroni
adjustment and how can I do it?
·
What is eta squared and how
does SPSS calculate it?
·
Esp. for ACQUISITION people and SOCIOLINGUISTS.
Twenty people in two groups are each measured for the number of times they use
the third person –s out of all the occasions when they had an
opportunity to in compositions, recorded speech etc. (often called potential
occurrences or loci). How do you summarise % scores
like this? Group % scores for frequency of use of things, or individual %
scores?
·
Esp. for PSYCHOLINGUISTS and people doing
repeated measures EXPERIMENTS. What on earth is a Latin
square and how do I use it or some other method of organising conditions,
different types of stimuli etc. in an experimental design?
·
What are those tests of prerequisites
for ANOVA/GLM such as those of Levene, Mauchly etc. all in aid of?
·
If I have a lot of missing
scores, can I fill them in somehow?
·
Can I check on whether people are responding by
random guessing or with bias, and adjust scores to take
account of that?
·
My subjects all gave several responses to a set
of different stimuli, and I have entered the data in SPSS with each response as
a row. So there are several rows for each subject. How do I turn that into the more usable SPSS layout with ‘one
row per subject’?
·
Subjects have been categorised in a parallel
way in several different columns. E.g. they answered a set of questions each of
which had the possible response: me, my teacher, my classmates (i.e. although
coded for SPSS as 1, 2, 3, the responses
cannot be considered as degrees of anything on an interval number scale). How
do I get SPSS to add up for each person across
the items totals of how many times each category was chosen?
·
If you are into word association tests, there
are a few descriptive stats that one can use there that one does not find used
anywhere else much: The Group overlap coefficient, Within groups overlap coefficient, and Index of commonality.
Sometimes
journals expect you to report degrees of freedom (df) figures along with other
statistics. They are the figures you see quoted in brackets, or often as a
subscript, after t, F, chi squared etc. E.g. instead of
t = 2.34 one sees perhaps t(28) = 2.34.
They can
usually easily be got from SPSS output where they are not obvious. Look for df.
Broadly they reflect the number of categories in any category variables in the design,
and the number of cases in each group. The exception is designs where only
category variables are involved (e.g. where you would use chi squared): in that
instance the df just reflects the number of categories.
Since you
will have told the reader the numbers of categories and cases involved anyway,
I don't personally see the point of mentioning df. But in case you need to,
they mainly turn out to be one less than the numbers you started with…. though
it can get more complicated.
The df
numbers are written subscript, or in brackets, after the statistic t, F or
whatever (not the p).
So in a t
test comparing two groups, 108 subjects altogether, the df will be 106. One
might write t(106) = .....
This is the number of cases less one for each group involved
(N-2=108-2=106). A t test has only this one df figure. The equivalent ANOVA F
has two, here 1,106: the first figure is one less than the number of EV
categories (2-1=1).
In an ANOVA
comparing four groups with 108 subjects altogether, df would be 3,104.
In a t test
comparing the same group in two conditions, the df for 108 cases will be 107.
The df can
be more tricky for more complicated designs and interactions. In the output of
ANOVA you will generally see the first df figure you need in line with the main
effect or interaction of interest, and the second one listed as within
groups or error below it.
In a chi
squared test with three categories on each scale, the df is 4 because (3-1) x
(3-1) = 4. In a chi squared test with two categories on each scale, the df is 1
because (2-1) x (2-1) = 1.
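As a rough check on the arithmetic above, here is a minimal Python sketch of the df rules for these simple designs. The function names are just my own labels for illustration, not anything SPSS provides, and more complicated designs are not covered.

```python
def df_between_groups(n_cases, n_groups):
    """df pair for comparing independent groups: (groups - 1, cases - groups).

    The pair is what is quoted after F; after t only the second figure is
    conventionally quoted.
    """
    return (n_groups - 1, n_cases - n_groups)


def df_repeated_two_conditions(n_cases):
    """Error df when the same cases are measured in two conditions."""
    return n_cases - 1


def df_chi_squared(n_rows, n_columns):
    """df for a chi squared test on a two-way table of frequencies."""
    return (n_rows - 1) * (n_columns - 1)


print(df_between_groups(108, 2))        # two groups, 108 subjects
print(df_between_groups(108, 4))        # four groups, 108 subjects
print(df_repeated_two_conditions(108))  # same 108 people, two conditions
print(df_chi_squared(3, 3), df_chi_squared(2, 2))
```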
Why are
these figures called 'degrees of freedom', and why are they important? It is
basically because what is important in statistics is not so much the numbers of
anything but the numbers of choices or separate pieces of information
involved. Typically there is one fewer choice than there are people etc. If I
have ten assignments to hand back to my class of ten students, I have to make a
choice who to give each one to for the first nine, but for the tenth one there
is no choice, as there is only one assignment left and one person left to give
it to. I have no 'freedom' left on the last one.
Here's the
statistician's analog of that. 100 people answer a yes-no question and 38 say
'yes' and 62 say 'no'. We want to know if that differs significantly from
50:50. I.e. are they showing a real preference? There are two categories (yes
and no), so we use the binomial test. It might seem that we have two
figures to handle in the test and two comparisons to make. We have to check if
the observed figure of 38 differs from the E of 50 and if O of 62
differs from E of 50. But in fact, of course, the test need only do one of
those. The data has only one degree of freedom. Once the test establishes if 38
differs significantly from 50 for one category, the answer for the other
category, whether 62 does so as well, is fixed. Hence if one calculates
statistics by hand one always finds that in the formulae one has to use the df
figures rather than full numbers of cases or categories.
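The point that the 'yes' and 'no' figures carry only one piece of information can be seen in a small Python sketch of an exact two-sided binomial test against a 50:50 split. This is an illustration under my own assumptions (the function name is invented, and SPSS may report its p in a slightly different way), not a description of what SPSS does internally.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p for k 'yes' answers out of n against a 50:50 split."""
    # With a 50:50 expectation the distribution is symmetric, so the
    # two-sided p is twice the probability of the smaller tail.
    smaller = min(k, n - k)
    tail = sum(comb(n, i) for i in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_from_yes = binomial_two_sided_p(38, 100)  # test the 38 'yes' answers
p_from_no = binomial_two_sided_p(62, 100)   # test the 62 'no' answers
print(p_from_yes, p_from_no)                # the two p values are identical
```

Whichever of the two categories you feed in, the answer is the same, which is the one degree of freedom at work.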
Residuals are
simply the differences between observed figures (O) and some kind of
predicted/expected figure (E). But they mean different things in different
analyses.
Category data: for significant differences/relationships we
want them big, because the E figures represent what is expected under the null
hypothesis of NO difference/relationship. In analyses where just frequencies in
categories are involved (e.g. analysed using chi squared or the binomial test),
the residuals are the differences between O and E frequencies. The bigger they
are, the more likely that there is a significant difference involved. In the
Labov analysis in class we looked at the table of O and E values to see where
the biggest O-E differences were (for which r use in which store). In fact chi
squared itself is calculated by essentially adding up the residuals for each
cell in the table (after squaring each one and dividing it by its E). In the binomial test where,
say, 20 people are divided 4 saying 'yes' and 16 saying 'no' to a question, we
want to know if that differs significantly from a 50-50 split, which would be
10 'yes' and 10 'no' in this instance. So we are concerned with the size of the
residual... in this instance 6. The bigger the better, if we want to show a
clear preference.
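To make the idea of category residuals concrete, here is a short Python sketch using the 4 'yes' versus 16 'no' figures from the example. The chi squared calculation is included only to show how the residuals feed into a test statistic; for two categories like this the binomial test is the one mentioned above, and the function names are my own.

```python
def residuals(observed, expected):
    """O - E for each category."""
    return [o - e for o, e in zip(observed, expected)]


def chi_squared(observed, expected):
    """Sum over categories of (O - E) squared divided by E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [4, 16]   # 'yes', 'no'
expected = [10, 10]  # a 50-50 split of 20 people

print(residuals(observed, expected))    # the size of each residual is 6
print(chi_squared(observed, expected))  # grows as the residuals grow
```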
Interval data: for significant relationships we want them
small, because the E figures represent what is expected under the hypothesis of
a perfect linear relationship. This is the other place where you often find
residuals being talked about - in data where all the variables are (treated as)
interval (analysed using Pearson r, or regression). Here they are the
differences between the observed scores and the scores predicted by the best
fitting line on a scatterplot, showing the EV-DV relationship. Here obviously
the smaller the residuals, the more likely the relationship is significant.
Obviously one can find a best fitting line to any data where cases are
scored on two or more interval variables.... but if most of the observations
fall miles away from the line, that does
not show a real relationship. Pearson r
and regression statistics in effect reflect whether the residuals are generally
large or small; examining scatterplots, when we look at cases (subjects) that
are way off line, we are looking at cases with exceptionally large residuals.
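For interval data, the residuals can be sketched in Python as the gaps between each observed score and the best fitting (least squares) line. The five pairs of scores below are invented purely for illustration, and the function names are my own.

```python
from statistics import mean


def best_fitting_line(x, y):
    """Slope and intercept of the least squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def line_residuals(x, y):
    """Observed y minus the y predicted by the best fitting line."""
    slope, intercept = best_fitting_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Invented scores: EV on x, DV on y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

for case, r in enumerate(line_residuals(x, y), start=1):
    print(case, round(r, 2))  # cases far off the line have large residuals
```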
Eta squared is the
measure of relationship that you can get in ANOVAs and the like. A bit like a
correlation coefficient it tells you on a scale 0-1 how much EV-DV
relationship there is. Really it is more analogous to r squared and can be
thought of as a % on a scale 0-100. It is a useful addition to just being told
if a relationship or difference is significant. Many significant
differences/relationships in fact are quite small in terms of the SIZE of the
difference/relationship.
SPSS does
not calculate eta quite how the books suggest, or even how SPSS help itself
seems to suggest.
In fact every eta sq is calculated so that it is a proportion out of a
different total and some of the variance that goes into the calculation of one
of them may also go into the calculation of another, so none of them can be
added sensibly to each other.
So every effect (main or interaction) is out of its own 100%,
representing the maximum variance that it could account for, but not all the
variation in DV scores. This applies even where the effects are of the same
type and a sensible calculation could be made of the % of variance of the same
type accounted for (e.g. two between subjects main effects - in principle one
could calculate what % of the WS variance they account for together). In
fact this is not done.
So the SPSS etas can be compared with each other (This one is accounting for
more of the total it could account for than that one is...) but not
really added. Or if you like, the total % if there are three factors with three
main effects, 3 two-way interactions, and one three way, is not 700% but less
than that... but hard to calculate exactly what. (In fact you can see how SPSS
calculates the etas: in the sum of squares column it is simply the sum of
squares for the effect of interest divided by the SS of that effect plus the
relevant error SS for that effect. Clearly then it is not calculating the
proportion of all the SS in the entire analysis accounted for by that effect,
just the proportion of the SS relevant to that effect. And also the error SS
get re-used in different calculations.)
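The calculation just described can be sketched in Python as follows. The sums of squares are invented for illustration, and I assume a design with a single shared error term. The 'book' version shown for contrast divides by the total SS instead.

```python
def spss_style_eta_squared(ss_effect, ss_error):
    """SS for the effect divided by (SS for the effect + its error SS)."""
    return ss_effect / (ss_effect + ss_error)


def book_style_eta_squared(ss_effect, ss_total):
    """SS for the effect as a proportion of all the SS in the analysis."""
    return ss_effect / ss_total


# Invented sums of squares for two main effects and their interaction
ss_effects = {"A": 40.0, "B": 10.0, "A x B": 30.0}
ss_error = 120.0
ss_total = sum(ss_effects.values()) + ss_error

for name, ss in ss_effects.items():
    print(name,
          round(spss_style_eta_squared(ss, ss_error), 3),
          round(book_style_eta_squared(ss, ss_total), 3))
```

Each figure in the first column is out of its own total, so those figures cannot sensibly be added up, whereas the second column shares one total.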
Wherever a
main effect or interaction involves a comparison of more than two means, ‘post
hoc’ tests can be relevant, as the basic significance value given by the ANOVA
does not say which pair or pairs is/are sig different. If the main or
interaction effect from ANOVA comes out significant that just means that there
is a sig difference SOMEWHERE among the means… but not between every pair
necessarily. Especially this arises where one or more of the EVs has three or
more levels (i.e. groups or conditions), though it can also arise, say, where
you have two two-value EVs and the interaction is significant. You need a post
hoc test to identify where the differences are exactly… or just judge it by eye
from a graph or table of means. This situation arises in various ways in
ANOVAs, some of which SPSS deals with straightforwardly, others not.
One might
think the solution is just to do loads of familiar t tests comparing the means
in pairs as required, to see which pairs are sig different. Indeed one sees
this done in some published work, and in moderation probably you can get away
with this… However, statisticians don’t like that. The statistical issue underlying
all this is that, when you do paired comparisons like this, the same means are
getting reused several times in different comparisons. If you have three groups and compare them in
pairs then the mean for group 1 gets used in the comparison both with group 2
and group 3. Now the more times a mean gets compared with others in repeated
statistical tests, the more chances it has to come out as significantly
different just by chance, not reflecting a real population difference. Remember
that if a difference between two means is ‘significant (at the .05 level)’ that
actually MEANS that one would not get a result this different more than 5% of
the time … or one in twenty times… by chance, due to the vagaries of random
sampling, in similar sized samples from a population where there really was no
difference. But another way of looking at that is to say that if you use
the same data in twenty comparisons, then one of the results might be that
one-in-twenty result that looks significant but is actually from a
population where there is no difference. The more tests you do, the more chance
of getting a result that looks sig but is not really.
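How quickly the risk builds up can be sketched in Python. The calculation below assumes the tests are independent of each other, which is not strictly true when the same means are reused, so treat the figures as a rough guide only.

```python
def chance_of_spurious_result(n_tests, alpha=0.05):
    """Chance that at least one of n_tests comes out 'significant' at the
    alpha level when there is really no difference anywhere, assuming the
    tests are independent."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 6, 10, 20):
    print(n_tests, round(chance_of_spurious_result(n_tests), 3))
```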
Some
adjustment has to be made to compensate for this. Like other activities in life
involving pairs, your tests for multiple paired comparisons should not be
‘unprotected’! Post hoc tests and the like cope with this better than t tests.
It is not appropriate to do multiple t tests… at least not without a Bonferroni adjustment of the sig level (though that
is a solution that is seen as rather overcompensating for the problem). Better
is to use a post hoc test designed for such comparisons (e.g. Tukey, Scheffe, …
etc.). However, as the SPSS dialog box for post hoc shows, there is a
myriad of options: nobody is certain which is the best, and none are perfect.
As a consequence sometimes you can get an anomalous result that the ANOVA says
there is a sig difference somewhere, but the paired post hoc test does not find
any pair significantly different.
The term ‘post hoc’ is used for where you just
want to consider all pairs of means that are possible to compare, following an
overall analysis including all the means, which is the appropriate starting
point. SPSS however limits this term to comparisons between cases in different groups,
though statisticians use the term generally for follow up comparisons of pairs
of repeated measures conditions as well. The term ‘planned comparison’
(=contrasts in SPSS) is used where you planned specific paired comparisons,
not all the possible ones, such as the comparison of three groups of learners
with an NS group, but not with each other.
The general rule is that for k means there are
k(k-1)/2 paired comparisons possible. E.g. if four groups… then 4 x 3 / 2
comparisons, i.e. 6. However, SPSS output usually gives you the pairs twice
over so it looks even more.
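The k(k-1)/2 rule can be checked with a short Python sketch that simply lists every pair once. The group labels are placeholders.

```python
from itertools import combinations


def paired_comparisons(labels):
    """Every pair of means that could be compared, each pair listed once."""
    return list(combinations(labels, 2))


pairs = paired_comparisons(["group 1", "group 2", "group 3", "group 4"])
for pair in pairs:
    print(pair)
print(len(pairs))  # k(k-1)/2 with k = 4
```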
E.g. the %
correct scores for third singular –s of three groups of learners are
compared. The basic ANOVA result says whether there is a significant
relationship between the EV and the DV… a difference somewhere among the
groups… but not exactly where. If the overall result is sig, then to see which
pairs of groups are sig different… you need to do post hoc tests. Whether you
do the ANOVA via Compare means… Oneway ANOVA or via General Linear
Model… Univariate, you get many many ways of doing the post hoc test
offered under the Post Hoc option. Tukey HSD is a common safe bet.
Basic post
hoc tests compare every pair of means. But suppose your groups were two of
learners and one of native speakers and you plan to compare the two learner
groups with the NS group (which may be thought of as a control group) but not
with each other. These are often called ‘planned comparisons’ and you would do
better not to use the post hoc tests which compare every pair, and so are
weaker (less likely to identify sig differences). You get this sort of limited
comparison in Analyze.. General Linear Model... Univariate... enter your
DV as usual and the three languages variable as a fixed factor. This does a
oneway ANOVA exactly like you get with Compare Means... Oneway.. except
that it gives you some extra options. If you click Contrasts and click
the contrast option to get Simple and then click first or last
depending on whether the control group is numbered 1 or 3... then (don't
forget) click Change... then Continue then OK... you get
an output that just does those limited paired comparisons.
E.g. you compare the same people’s fluency speaking to the teacher, to peers and to
parents. You want to compare each pair of those conditions afterwards. In General
Linear Model… Repeated Measures you have to use not what is labelled Post
Hoc but rather Options… click the variables into Display means
and tick Compare main effects and below that choose Bonferroni.
This in effect uses t tests with a simple Bonferroni
adjustment for multiple comparisons to compare the pairs of means. Not ideal…
because overcautious: i.e. likely to lead to you missing a difference that is
actually sig. SPSS should really make Tukey etc. available in repeated measures
as well as independent groups comparisons…. Alternatively you can do your own
Tukey test as described below.
Once again
you can alternatively choose limited planned comparisons via the Contrasts
option as above.
Where there
are two EVs that are groupings, the interaction always involves at least 4
subgroups. Even if both variables are just two groups, like male-female and
upper class-middle class, the interaction has four groups involved and, if the
interaction is sig, you might want to know which pairs of those are producing
that result, beyond just guessing from a suitable graph.
SPSS does
not deal with post hoc for interactions, but in some instances you can do it
yourself fairly simply with a calculator. For instance you can do a Tukey test to test for pairwise differences
when you get a sig interaction in a two way ANOVA with two independent
gps factors, where all groups have the same number of subjects in.
Calculate T = q x √(error mean square / number of people in each group)
Error mean
square or error variance is in the original ANOVA table in output.
q is found
from the table of the Tukey statistic (ask me for it or see a serious stats textbook
which has it in the back. I can’t include it here for copyright reasons). Read
off the column for the number of means being compared pairwise, and the row for
the df of the error variance/mean square (from ANOVA table).
Then
calculate T and any pair of means differing by more than T is sig different.
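The hand calculation of T can be sketched in Python like this. All the figures are invented: in particular the q of 3.79 is only a placeholder for whatever value you read off the Tukey table for your own number of means and error df, and the function names are my own.

```python
from itertools import combinations
from math import sqrt


def tukey_T(q, error_mean_square, n_per_group):
    """T = q x square root of (error mean square / people in each group)."""
    return q * sqrt(error_mean_square / n_per_group)


def pairs_differing_by_more_than(means, T):
    """Pairs of group means whose difference exceeds T."""
    return [(a, b) for a, b in combinations(means, 2)
            if abs(means[a] - means[b]) > T]


# Invented example: four equal-sized subgroups of 10 people each
q = 3.79                  # placeholder value standing in for the table lookup
error_mean_square = 16.0  # would come from the ANOVA table in the output
means = {"male upper": 20.0, "male middle": 22.0,
         "female upper": 26.0, "female middle": 31.0}

T = tukey_T(q, error_mean_square, n_per_group=10)
print(round(T, 2))
print(pairs_differing_by_more_than(means, T))
```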
If the
groups are different sizes, or you wish to save effort, do t tests with Bonferroni adjustment.
As for 3.
OR Treat it as a oneway repeated measures situation. Enter all the repeated
measures columns as if there were just one factor not two, and follow 2 above.
That in effect does the post hoc for the interaction.
As usual,
if the result in ANOVA is significant, and more than two means are being
compared, one needs follow-up tests to see which pairs of means are
significantly different (or be happy just to judge it visually from a graph).
Each main effect involving 3 or more levels can be dealt with as above, but the
interactions are more of a problem.
Take five
repeated measures conditions and two groups.
One can get the main effect multiple comparisons done by SPSS with suitable
adjustments as described in (2) above (i.e. comparing results on the five
conditions with each other in pairs, for the whole sample of subjects lumping
both gps together). In fact if one wants all of them there are 10
comparisons.... because there are five conditions, so (5 x 4) / 2 paired
comparisons.
In the interaction, since there are 10 means involved for all 5 conditions and
two groups, there are (10 x 9) / 2 comparisons potentially, which makes 45.
One can do some of the interaction paired comparisons, by splitting the file
and getting SPSS to use the Bonferroni option again. Those are the comparisons
of each condition with each other condition within each group separately. 10
comparisons in each group = 20 in all.
That leaves 25 comparisons that you could not do with any post hoc procedure in
SPSS as far as I know... the comparisons between each of the 5 means for one gp
and the five for the other. Ordinary t tests do not have any required reduction
for multiple comparisons like post hoc tests do. However a simple adjustment by
hand is to use the t test but require stricter sig levels. In fact this is
really making the Bonferroni adjustment oneself.
The account
immediately above assumed that there was no a priori reason to be interested in
any of those 25 pairs more than any other... It was a DIY post hoc solution.
However, it
could be that, for theoretical reasons or whatever, you were not interested in
comparing every pair of means, only certain ones. In particular:
- the comparisons
of all 5 conditions within each group, done OK with split file and Bonferroni
adjustment..... 20 comparisons
- the
comparison of each group with the other on each condition separately. That is
in fact only 5 comparisons out of the 25 possible other ones. (I.e. you have no
interest in comparisons like that between the lower group on condition A and
the higher group on condition C, only in those between lower and higher on A etc.). You
want to claim, in this instance, that these were what are called 'planned comparisons'
not the usual post hoc 'try everything' type. Then you could reduce the
required sig value of the t test for this part by dividing by 5 not 25 in the Bonferroni adjustment....
In general,
then, where there is no post hoc test available in SPSS, the simple but crude
solution is to use ordinary pair comparison statistical tests, but divide the
target sig level by the number of potential comparisons you COULD make, or
PLANNED to make, to compensate for making multiple comparisons. However, this
is cruder than using post hoc tests, which take care of this better. You are
more likely to miss sig differences (a so-called Type II error).
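The DIY Bonferroni adjustment amounts to a single division, as this minimal Python sketch shows. The p values in the example are invented, and the function names are my own.

```python
def bonferroni_level(n_comparisons, alpha=0.05):
    """Stricter sig level to require of each individual paired test."""
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """True for each p value that passes the adjusted level."""
    level = bonferroni_level(len(p_values), alpha)
    return [p <= level for p in p_values]


print(bonferroni_level(25))  # all 25 between-group pairs treated as post hoc
print(bonferroni_level(5))   # only the 5 planned comparisons

# Invented p values from five planned t tests
print(significant_after_bonferroni([0.004, 0.03, 0.2, 0.009, 0.011]))
```

Note how the p of 0.03, which would pass as significant on its own, no longer counts once the level is divided by 5.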
You don’t get a sig result and you want to know how big a sample you
would need to get one
If you have gathered data, especially in a pilot study, and
not got a significant result, you may want to know how big a sample you would
need to make the result significant. Remember, if you choose a big enough
sample, even a very small difference or relationship may be significant. So if
you have the possibility available to increase the size of the sample (i.e.
there are more subjects or cases available), and are desperate to get a
significant result, it would be useful to know how many subjects would be
ideal.
Some books give formulae to calculate how big a sample you
need, but they don’t necessarily straightforwardly fit the situations you have.
The following is my best suggestion for an easy way to get an estimate of
required sample size using SPSS facilities.
Basically you create imaginary larger samples simply by
using your subjects more than once. Suppose you have 20 subjects and p=.231 for
whatever test you are interested in. You get SPSS to think that you have three
times as many subjects, simply by getting each subject counted three times, and
run the test again. Say then p=.09. Then you get SPSS to think you have four
times as many subjects, including each of your twenty four times, and see again
what happens. By trial and error you get to the point where p=.05, and that
gives an estimate of the minimum number of subjects you need to get a
significant result.
To get SPSS to count a subject more than once you weight
the data, similar to how you are familiar with doing elsewhere. At transform..compute you nominate a
new target variable which you might
call incr (since it will tell SPSS how many times to increase your sample
size). You then enter in the numerical expression space whatever you want the weighting to be. You
could start with a weighting like 2. Click OK and you will find a new column called incr with 2
repeated all the way down. If you now go to data…weight cases and weight the data by that
column, then SPSS sees your data as having twice as many cases… counting each
one twice.
Now do your analysis again and see if it is significant. Go
on altering the weighting figure in the incr column via transform…compute repeatedly and
redoing the analysis until you get a sig difference or relationship. Note that
you can enter partial weightings like 3.5 as well.
When by trial and error you achieve a weighting that gives
a significant result, multiply it by your original sample size to see how many
subjects you would need. E.g. if your sample from two groups was 20 in all but
you only get a sig difference with a weighting of 3.8, then you need at least
20 x 3.8 subjects (= 76), in similar proportions in the two groups as before to
have a chance of getting a sig difference…
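The same trial and error can be sketched outside SPSS. The Python below is only a rough illustration under several assumptions of mine: the scores are invented, the test is an independent groups t test with pooled variance, each case is counted 'weight' times in the way described above, and p is taken from the normal curve as a stand-in for the t distribution, so SPSS figures will differ a little, especially for small samples.

```python
from math import sqrt
from statistics import NormalDist, mean


def weighted_t(group1, group2, weight):
    """Pooled-variance t when every case is counted `weight` times."""
    n1, n2 = weight * len(group1), weight * len(group2)
    m1, m2 = mean(group1), mean(group2)
    ss1 = weight * sum((x - m1) ** 2 for x in group1)
    ss2 = weight * sum((x - m2) ** 2 for x in group2)
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_variance * (1 / n1 + 1 / n2))


def approximate_p(t):
    """Two-sided p from the normal curve (an approximation to the t table)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def smallest_weight(group1, group2, alpha=0.05, step=0.1, limit=100.0):
    """Raise the weighting step by step until p first drops to alpha."""
    weight = 1.0
    while weight <= limit:
        if approximate_p(weighted_t(group1, group2, weight)) <= alpha:
            return weight
        weight = round(weight + step, 10)
    return None  # no weighting up to the limit gives a sig result


# Invented pilot scores for two groups of 10
group1 = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
group2 = [11, 14, 10, 13, 12, 15, 11, 14, 13, 12]

weight = smallest_weight(group1, group2)
print(weight, weight * (len(group1) + len(group2)))  # weighting, subjects needed
```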
Cautions. You have to make sure the new bigger sample IS
from the same population as the old one. In the case of comparisons of groups
of course several populations may be involved. Even then, any method of
estimating the required sample size is only approximate, because even truly
random samples can vary a lot. Also, with an increase in sample size the actual
difference or relationship you are interested in may not actually get any
bigger. It is just more likely to be significant. I.e. you may end up showing
that there is indeed a non-zero difference or relationship in the population
(which is what ‘significant’ means), but not that it is a very large one.
Twenty people in two groups are each measured
for the number of times they use the third person –s out of all the
occasions or loci when they had an opportunity to (often called ‘potential
occurrences’)… Very many linguistic features are measured this way in
acquisition and sociolinguistic research. In the former it is often a matter of
how often the correct form (in NS terms) is used, as against some erroneous
form or omission, on occasions where there was an opportunity to use it; in the
latter it is often a matter of how often one variant out of two or more that
make up a sociolinguistic variable is used.
In all these situations there are two ways of
summarising and graphing the data – 1) the group way and 2) the individual way.
Either 1) you add up all the potential
occurrences for each group, and all the occurrences of the form of interest,
and express the second as a percent of the first for each group.
Or 2) you calculate a % score for each person
using their individual frequency of the form of interest and their individual
number of potential occurrences. Then for each group you can calculate the
average (mean) % score for that group from the individual scores of its
members. However, you have to be aware that this can be a bit misleading for
cases whose number of potential occurrences is small: getting one out of one
right is 100% as much as getting 20 right out of 20 possible occasions! It is
common to require at least 5 potential occurrences, otherwise treat a case as
‘missing’ data.
It is easy to show that the group figures may
not come out the same! Here we imagine figures for a group of two people and
see what happens:
Method 1
| | Frequency of form of interest | Number of potential occurrences | % occurrence of form of interest |
| Person 1 | 4 | 16 | 25% |
| Person 2 | 8 | 10 | 80% |
| Total | 12 | 26 | |
| Group % | | | (12/26)x100 = 46.2% |
Method 2
| | Frequency of form of interest | Number of potential occurrences | % occurrence of form of interest |
| Person 1 | 4 | 16 | 25% |
| Person 2 | 8 | 10 | 80% |
| Mean % for group | | | (25+80)/2 = 52.5% |
In fact the two methods will come out the same only when all subjects
had the same number of potential occurrences (e.g. in a test or list reading
task).
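The two ways of summarising can be put side by side in a minimal Python sketch, using the figures for the two imaginary people above. The minimum_potential argument is my own addition, reflecting the common practice of treating people with too few potential occurrences as missing.

```python
def group_percent(frequencies, potentials):
    """Method 1: pool all occurrences, then take one % for the group."""
    return 100 * sum(frequencies) / sum(potentials)


def mean_individual_percent(frequencies, potentials, minimum_potential=1):
    """Method 2: a % score per person, then the mean of those scores.

    People with fewer than minimum_potential potential occurrences are
    left out, i.e. treated as missing.
    """
    scores = [100 * f / p for f, p in zip(frequencies, potentials)
              if p >= minimum_potential]
    return sum(scores) / len(scores)


frequencies = [4, 8]   # occurrences of the form of interest per person
potentials = [16, 10]  # potential occurrences per person

print(round(group_percent(frequencies, potentials), 1))
print(mean_individual_percent(frequencies, potentials))
```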
Many
BUT for
any inferential statistics you should use method 2, entering the data in
SPSS in the form of one row per person, with a % score for each person. Then,
to compare two groups, for example, you use the independent groups t test on
the two sets of scores.
If you
were to attempt inferential statistics on the total figures of method 1, you
would have to use the numbers of individual occurrences regardless of people.
I.e. if the example above were for one group, you would represent that group
with the proportions 12 and 14 (i.e. 12 occurrences of the form of interest,
versus 14 non-occurrences, making up the total of 26 potential occurrences) and
compare those with the overall proportions for the other group being compared
with. The test for that is chi squared, and you do see this used even in some
published work for data like this. However, there are at least two major
problems with this which would lead statisticians mostly to regard this as a misuse
of chi squared.
- As for all significance tests, the basic
observations (cases) which enter into the test have to be independent of each
other. Now in method 2 the cases are the people, and there is no problem in
seeing scores from different people as being independent of each other.
However, in method 1 the 26 occurrences in the example are the cases, and
clearly while some of those are independent of each other (being from different
people) some are likely not (being from the same person).
- There is also an expectation that populations
sampled are homogeneous. From what we have just said that is clearly not the
case in method 1: the 26 observations representing one group in the example are
a mixture. It cannot be said that each observation is from one population – it
is from a mixture of a population of people and the populations of occurrences
of each separate person.
The only
instances where chi squared and method 1 might be defensible would be where the
numbers of potential occurrences are very small… amounting to little more than
one or two per person included. OR where all the potential and actual
occurrences come from just one person per group…. though that still does not
deal with the independence problem. OR where you feel able to argue that
responses from the same person are as independent as if they were from
different people… There is a tradition of phoneticians making this tacit
assumption for things like VOT, on the belief that such things are beyond the person’s ability to control.
Just
checking.... do we know how to round figures on interval scales? The mean of a
set of scores may come out as 6.3597, but often we want to express this in
shorter form, such as 6.36 or 6.4. Quoting long strings of numbers after the
decimal point can look as if you are just trying to impress with loads of
numbers. Or it may be you are trying to make up for sloppy METHOD by being
super-detailed in the figures quoted in RESULTS.... Best not to do that, since
one's measurement is unlikely to be so accurate that more than two decimal
places are relevant (except perhaps where a computer has measured something for
you like response time...). Generally three or two decimal places for sig/p
values, and two or one for everything else. Keep it intelligible and round
numbers where necessary. But where do you round up, and where down?
Just round the following figures to
two decimal places:
3.852 0.679 18.505
1.006 7.597 20.955 0.602
SPSS often
rounds figures on screen (e.g. in the data grid) even though it is holding
longer versions in its memory. You can select for each column how many decimal
places it shows on the Data View window.
Answer to
above… 3.85 0.68 18.51
1.01 7.60 20.96 0.60
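If you want to check rounding like this by computer rather than by hand, here is a minimal Python sketch (not an SPSS feature). It assumes the usual 'a final 5 rounds up' convention used in the answers above, and the helper name round_half_up is made up for illustration. The figures are given as text because Python's built-in round() works on binary fractions and can disagree with hand rounding for figures ending in 5.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(figure: str, places: int = 2) -> str:
    """Round a figure (given as text) to `places` decimal places,
    rounding a final 5 upwards as in the exercise above."""
    step = Decimal(1).scaleb(-places)  # 0.01 for two decimal places
    return str(Decimal(figure).quantize(step, rounding=ROUND_HALF_UP))

figures = ["3.852", "0.679", "18.505", "1.006", "7.597", "20.955", "0.602"]
rounded = [round_half_up(f) for f in figures]
print(rounded)
```

Note that the trailing zero in 7.60 and 0.60 is kept, which is what you want when reporting to a fixed number of decimal places.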
Decoding interval scores expressed
in E notation in SPSS output
Sometimes
SPSS produces numbers like 7.012E-02
This is not
7.012.... It is 0.07012
The E with
a minus sign signals the number of places the decimal point has to be moved to
the left.
So
1.369E-03 = 0.001369
Etc. The E
is a shorthand so as not to write a load of noughts.
Always convert
any such figures into the familiar form if you report them in your work.
Correspondingly
7.012E+02 would indicate 701.2.
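The conversion can also be done mechanically. The following is a small Python sketch, assuming you have copied the figure from the SPSS output as text; the function name expand_e_notation is just an illustrative choice.

```python
from decimal import Decimal

def expand_e_notation(text: str) -> str:
    """Rewrite a figure such as '7.012E-02' in the familiar form,
    without the E shorthand."""
    return format(Decimal(text), "f")

for spss_figure in ["7.012E-02", "1.369E-03", "7.012E+02"]:
    print(spss_figure, "=", expand_e_notation(spss_figure))
```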
Where a test
or other instrument produces scores for separate items which then need to be
added up to give a total score for a variable, one could of course add them up
off computer and just enter the totals. However, to check on internal
reliability or to do an analysis by items in addition, or
to filter response times and exclude some, you will need
the scores for every item in a separate column, so will have to enter the data
in full.
To then add
columns use Transform…Compute in SPSS to create a new column that
totals the separate ones. You enter the title of the new summary column top
left in the dialog box, and click the column names to be added into the top
right space, with + between them. That creates a new column of totals.
However, anyone with a score missing
in any column will be missed out and their total will come out as missing.
If there are missing values in some
columns, marked in SPSS by a . , where subjects failed to respond or have
unanalysable data, you will probably want each person’s total really to be the
average score over all the items they answered, not the total (unless you have
some reason to count ‘missing’ as the same as ‘wrong’ and so score it 0). You
can get this by, in Transform…Compute, inserting in the Numeric Expression
box the function MEAN(numexpr, numexpr,…) from the functions
list, and putting the relevant column labels in the brackets separated by
commas. I.e. if you have a set of three items whose scores are in columns
item1, item2, item3, then you would enter
MEAN(item1, item2, item3) in the Numeric Expression box. SPSS
then generates a new column with the average score of each case on the three
items or, if they answered fewer, over the ones they answered.
Similarly, if you want to just add,
not average, a set of columns, using whatever scores are available, then to
avoid the people with missing values getting a missing total
use SUM(numexpr, numexpr,…) in
the same way as described for means above.
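To make the difference between the three ways of combining columns concrete, here is a rough Python imitation. It is only a sketch of the behaviour described above, not SPSS code: None stands in for the SPSS missing value marker, the function names are invented, and I assume that a case with every item missing ends up with a missing result.

```python
from typing import Optional, Sequence

Score = Optional[float]  # None plays the role of the "." missing marker

def plain_total(scores: Sequence[Score]) -> Score:
    """Like item1 + item2 + item3: missing if any item is missing."""
    if any(s is None for s in scores):
        return None
    return sum(scores)

def sum_available(scores: Sequence[Score]) -> Score:
    """Like SUM(item1, item2, item3): adds whatever scores are there."""
    answered = [s for s in scores if s is not None]
    return sum(answered) if answered else None

def mean_available(scores: Sequence[Score]) -> Score:
    """Like MEAN(item1, item2, item3): averages over the items answered."""
    answered = [s for s in scores if s is not None]
    return sum(answered) / len(answered) if answered else None

# Two hypothetical cases scored right (1) or wrong (0) on three items
for label, items in {"complete": [1, 0, 1], "one missing": [1, None, 1]}.items():
    print(label, plain_total(items), sum_available(items), mean_available(items))
```

The case with one item missing gets no total from the plain addition, a total of 2 from the SUM-style version, and an average of 1 from the MEAN-style version.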
Cutting an interval scale into ordered categories
A common
example is deriving a grouping of subjects from something you measured about
them originally on a numerical scale: an explanatory variable such as their
ages, English proficiency, extraversion, etc. This is often done
casually without due thought, and often in peculiar idiosyncratic ways by
novice researchers, but above all it needs careful thought about why it is
done, and how…
Before you
do this, you need to ask if it is necessary at all. Just because some
other researcher had a high prof group and a low prof one does not mean you
necessarily have to have groups. When you derive such groupings from scores
originally recorded on a continuous interval scale, obviously you lose some
information. One person may be a bit better than another on the original
scores, but once you decide they both belong in the high prof group, or
whatever, they are treated as identical in any further tests. This may or may
not help produce the result you want… Certainly how you divide subjects
into groups, if you do, can drastically affect the result!
There are a
number of reasons for grouping… some statistical, some related more to research
methods, design and hypotheses.
i.
nonlinear relations – e.g. where high and low
proficiency subjects perform similarly on some other variable of interest,
compared with intermediate subjects
ii.
interactions between different EVs – e.g. where
you want to see the combined effects of gender and prof on something: do high
prof females differ from high prof males in the same way as low prof females
differ from low prof males?
iii.
designs involving repeated measures.
OK… so you
still want to make groups… there are many ways of doing it. To some extent they
match the reasons above. The principles apply to any interval-scored variable
that is to be turned into a grouping. The issue is where to cut the
original interval scale so as to obtain two or more groups of cases…

With all
the above methods, but especially the third, researchers may choose to use
extreme groups only. Often where a researcher wants to get clear differences
between groups later he/she will help this along a bit by, say, using the top
third and the bottom third of subjects and missing out the middle third in any
later comparisons. Reason 5,6 above.
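As an illustration of the 'extreme groups' idea only, here is a short Python sketch that keeps the top third and the bottom third of cases and drops the middle. The subject labels and scores are hypothetical, the function name is invented, and ties at the cut points are simply split by sort order, which you would need to think about with real data.

```python
def extreme_thirds(scores):
    """Return (bottom_third, top_third) as lists of case labels,
    ranking cases by score and leaving out the middle third."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    third = len(ranked) // 3
    if third == 0:  # too few cases to form thirds
        return [], []
    low = [case for case, _ in ranked[:third]]
    high = [case for case, _ in ranked[-third:]]
    return low, high

# Hypothetical proficiency scores for nine subjects
prof = {"s1": 41, "s2": 55, "s3": 38, "s4": 72, "s5": 60,
        "s6": 49, "s7": 80, "s8": 66, "s9": 52}
low_group, high_group = extreme_thirds(prof)
print("low:", low_group, "high:", high_group)
```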
However you
cut, you have to be careful how you speak. Very often you will call the groups
you make ‘the high proficiency group’ and ‘the low proficiency group’, or the
like. But unless your original test that produced the scores was a criterion
referenced one, deciding some absolute level of prof for each taker,
with international equivalence, then this can be misleading. Very often the
proficiency test used was a cloze test you knocked up yourself,
or the like. It may well distinguish students with higher proficiency from
those with lower, in the sample of students you are using. But that does not
mean there is any equivalence with what were called ‘high prof’ students by
some other researcher who used a different test with a different sample in
another country. It could be that all his students, high or low prof, are no
better than the worst of your low prof group, and so on. Only if some standard published test such as
FCE or TOEFL was used by all could you match up across studies and see if there
was any real comparability between so-called ‘high prof’ students in different
studies. In fact close examination shows that many variables used in research
have no absolute definitions of scale points, and most of the above ways of
dividing cases into groups only distinguish in a relative way between who/what
has more of something or less, not exactly how much.
One is
quite used to having SPSS calculate the SD along with the mean (=average) of a
set of scores (i.e. for any interval scale).
We are also
used to the idea that the SD measures spread of scores around the mean. If all
cases scored the same, the SD would be 0. The bigger the SD, the more spread
the scores of different cases… the more subjects are ‘disagreeing with’ each
other in their scores. And the more that happens within groups, the harder it
is usually to show any convincing differences between groups. Similar concepts
to SD are what statisticians call ‘variance’ and ‘error’. These measures are
slightly different but all, roughly, are averages of the differences
between each case’s score and the mean. If all cases score the same, which will
be also the mean score, then their differences from the mean are 0, so SD = 0.
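The idea that variance and SD are, roughly, averaged differences from the mean can be seen in a few lines of Python. This is a sketch with made-up scores; it divides by n - 1, which is an assumption about the version of the SD you want (it is the usual one for a sample).

```python
import statistics

def describe(scores):
    """Return (mean, variance, SD), built up from the differences
    between each score and the mean."""
    mean = statistics.mean(scores)
    squared_diffs = [(s - mean) ** 2 for s in scores]
    variance = sum(squared_diffs) / (len(scores) - 1)  # n - 1 divisor
    return mean, variance, variance ** 0.5

all_same = [12, 12, 12, 12, 12]       # everyone scores the same
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]  # scores differ between cases

print(describe(all_same))    # differences from the mean are all 0
print(describe(spread_out))
```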
Sometimes
SPSS fails to perform a procedure because of a problem of ‘zero variance’. That
means it found that one of your groups on one of the variables measured had an SD
of 0. All cases scored the same. This makes certain statistical procedures
impossible: they involve variables and cannot work if everyone scores the same,
as then you have not a variable but a constant. You cannot answer the question
‘what is the relationship between age and reading ability?’ if you have
obtained data from a sample who are actually all of the same age!
So we know
what an SD of 0 means, but what about big SDs? There is often no simple maximum
value that the SD can have. But there are some guides to help assess the size
of an SD:
An old
problem is how to handle responses to items recorded on scales such as
strongly agree – agree –
neutral – disagree – strongly disagree
always – often –
sometimes – never
They are
rating scales (not usually called multiple choice). They are clearly ordered
choices and there is uncertainty whether they are really best thought of, and
treated statistically, as
·
Ordered categories: so you present the results
in bar charts, report the % of people who responded in each
category on the scale, and use ordered category statistics to analyse
relationships with other variables.
OR
·
Interval scores: so you assign a score number
to each point on the scale and present
the results as a histogram, report the mean and SD of the
scores of a group, and use t tests, Pearson correlation or
whatever when comparing groups or looking for relationships. The numbering
could be e.g. strongly disagree = 0, disagree = 1, and so on; or if you prefer
strongly disagree = -2, disagree = -1, neutral = 0 etc.
Generally
it is far easier for any statistical handling to treat the data the interval
score way as the stats for interval scores are better known and more versatile in
what they can do. The results are usually easier to absorb as well. Suppose two
groups are asked how far they agree that a CALL activity is easy to understand;
Group B is of a higher English level than A. Is it easier to derive some
meaning from being told:
·
In group A the response was: strongly agree
43.3% – agree 20% – neutral 13.3% – disagree 13.3% – strongly disagree 10%. In
group B it was: strongly agree 30% – agree 30% – neutral 10% – disagree 30% –
strongly disagree 0%. The difference between the two groups is not significant
(Kolmogorov-Smirnov Z =0.365, p=0.999).
OR from
·
The mean agreement response (on a scale –2 for
strong disagreement to +2 for strong agreement) was in group A 0.73 and in
group B 0.6. Variation was similar in the two groups, and moderately high (SDs
1.41, 1.26). The difference between the groups is not significant (t=0.265, p=0.793).
I know
which I find easier to follow!
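For anyone who wants to see how the second version is arrived at, here is a Python sketch of the interval-score treatment of the same responses. The group sizes are not stated above: the counts below (30 people in group A, 10 in group B) are my assumption, chosen because they match the percentages quoted. The t value is computed with the ordinary pooled-variance formula for two independent groups.

```python
import math
import statistics

# Scoring of the scale points, -2 to +2
CODES = {"strongly agree": 2, "agree": 1, "neutral": 0,
         "disagree": -1, "strongly disagree": -2}

# Assumed numbers of people choosing each point (not given in the text)
counts_a = {"strongly agree": 13, "agree": 6, "neutral": 4,
            "disagree": 4, "strongly disagree": 3}
counts_b = {"strongly agree": 3, "agree": 3, "neutral": 1,
            "disagree": 3, "strongly disagree": 0}

def to_scores(counts):
    """Expand category counts into one numeric score per person."""
    return [CODES[label] for label, n in counts.items() for _ in range(n)]

def pooled_t(x, y):
    """Independent groups t, assuming equal variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    diff = statistics.mean(x) - statistics.mean(y)
    return diff / math.sqrt(pooled_var * (1 / nx + 1 / ny))

scores_a, scores_b = to_scores(counts_a), to_scores(counts_b)
mean_a, sd_a = statistics.mean(scores_a), statistics.stdev(scores_a)
mean_b, sd_b = statistics.mean(scores_b), statistics.stdev(scores_b)
t_value = pooled_t(scores_a, scores_b)
print(round(mean_a, 2), round(sd_a, 2), round(mean_b, 2), round(sd_b, 2),
      round(t_value, 3))
```

With these assumed counts the means, SDs and t come out in line with the figures quoted in the second report above.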
So I advise
going for the second interpretation wherever possible, but making sure that
when you use such scales the way they are used in the data gathering itself justifies
this interpretation. In particular:
These tests
of prerequisites are only of interest to check if the data is suitable for using
some OTHER test that you are REALLY interested in, because it relates to your
actual research questions or hypotheses. Tests of prerequisites generally apply
where ANOVA/GLM is used, though researchers rarely report having made these
checks… and we cannot tell if the checks were performed or not! You generally
want them all to be nonsignificant, as that is what shows the data is
straightforwardly suitable for parametric significance tests like ANOVA/GLM.
If the prerequisite test is failed then there may be alternatives within the
parametric tests you can use to compensate, or weaker nonparametric tests you
can use instead of straightforward ANOVA etc., or possible transformations of
the data one could do... but often one has to just admit the data is not
perfect for the procedure but carry on and use ANOVA anyway....
Their functions are as follows:
Missing
values are where cases have scores or categorisations completely missing for
some reason, where most cases did provide data. E.g. they gave no response,
were uncooperative, or their response was unanalysable, etc. (Where subjects
have taken a multi-item test or the like to produce their scores, then they may
miss some items but still get a score for the test as a whole. That is a
different issue… You have to decide there whether a missed out item counts as
wrong, or whether you allow people to miss items and as overall test
score give them the average score for the set of items they did answer)
They are
usually entered in SPSS by a . in the space where a figure should be, unless
you have assigned an actual number that you enter as indicating missing values,
and declared it in Variable View…Missing.
If you have
missing values there may be problems:
-
You may have very few cases left that you can
use in the required statistical analyses: especially in repeated measures and
multivariate designs if a case has data missing on one variable/condition
included in an analysis, it gets left out totally (i.e. ‘listwise’).
-
The missing values may not be random, but
certain kinds of subject may be more prone to produce them … so using the data
without them, or with too few of them, will lead to a biased result. E.g. young
versus older testees; lower versus middle class informants.
If you
leave missing values in place, SPSS usually gives the choice (in Options
for a given test) for you to treat them listwise or pairwise/test-by-test.
This really applies to multiple analyses of the same data, as within one
analysis it usually has to be listwise, meaning that the number of cases used
is the maximum number that has a complete set of data across all the relevant
columns: e.g. if in Correlation you want correlations done between every
pair of variables in 5 columns: ten pairs, so ten analyses. Listwise
option would get you correlations using just the cases with full data across
all 5 columns, so the same number of cases would be used in each analysis. Pairwise
would, for each analysis, use the maximum cases with data on both the relevant
columns, so use more of the data, but different numbers of cases might well be
used to calculate different correlations.
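The difference in the number of cases used can be seen in a small Python sketch. The five columns of data are invented, with None marking a missing value, and the code only counts the usable cases under each option; it does not calculate the correlations themselves.

```python
from itertools import combinations

# Hypothetical data: five variables, six cases, None = missing
data = {
    "v1": [3, 5, None, 4, 6, 2],
    "v2": [5, 7, 9, None, 8, 4],
    "v3": [1, 2, 3, 4, 5, 6],
    "v4": [2, None, 4, 5, 6, 7],
    "v5": [9, 8, 7, 6, 5, 4],
}

def complete_cases(columns):
    """Count the cases that have data on every one of the given columns."""
    rows = zip(*(data[c] for c in columns))
    return sum(all(value is not None for value in row) for row in rows)

pairs = list(combinations(data, 2))    # every pair of variables
listwise_n = complete_cases(list(data))  # same N for all ten analyses
pairwise_n = {pair: complete_cases(pair) for pair in pairs}

print("pairs:", len(pairs), "listwise N:", listwise_n)
for pair, n in pairwise_n.items():
    print(pair, "pairwise N:", n)
```

Here listwise leaves only three of the six cases for every correlation, while pairwise uses between four and six depending on the pair.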
If you want
to fill in missing values, the main principle is that it should not be done in
some way that will clearly directly influence the result you are interested in.
I.e. you should not fill in the missing values following a principle that will
obviously make the difference or relationship which is the focus of your actual
research more marked.
Broadly
there are two ways of filling in missings in any column in SPSS (where a column
represents a variable, or a condition in repeated measures data).
A)
You fill in with the mean of the scores in the
column itself (or if it is in categories, the mode, which is the most popular
category in that column).
B)
You fill in by predicting a score from the
general correlation of that column with others in the data: the EM and regression
methods.
Imagine
data as follows:
C1 C2
3 5
5 7
7 9
4 .
6 8
If the
research question concerns whether there is a relationship between two
variables, in C1 and C2 (correlational design), then you do NOT use method B,
which would use the correlation that exists already in the data to fill in
missing values. I.e. here, given the perfect positive correlation between the
two sets of scores, method B would fill in the missing as 6, predicting it from
C1. But that will obviously enhance the perfection of the correlation which it
is your aim to discover! So the mean of the second column (method A) would be
a better fill-in value: 7.25.
If on the
other hand this was data from the same subjects on the same DV scored in two
conditions in C1 and C2 (repeated measures design), and the research interest
is in the difference between the means of the scores in each column (Do they
score significantly higher on condition 2?), the better way to fill in the
missing values would be by method B. Method A would simply enhance the level of
the mean of C2, and strengthen its distance from the mean of C1.
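The two fill-in values for the little C1/C2 dataset can be worked out as follows. This Python sketch uses a plain least-squares regression of C2 on C1 as a simplified stand-in for method B; I am not claiming it is exactly what the SPSS EM or regression options do.

```python
# The imagined data from above; None marks the missing C2 score
c1 = [3, 5, 7, 4, 6]
c2 = [5, 7, 9, None, 8]

# Cases with both scores present
complete = [(x, y) for x, y in zip(c1, c2) if y is not None]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

# Method A: fill in with the mean of the column itself
fill_a = sum(ys) / len(ys)

# Method B (simplified): predict C2 from C1 with a straight line
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in complete)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
missing_case_c1 = c1[c2.index(None)]  # C1 score of the case lacking C2
fill_b = intercept + slope * missing_case_c1

print("method A fill-in:", fill_a)
print("method B fill-in:", fill_b)
```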
For these
reasons, when you run correlation-type statistics like Regression and Factor
analysis, SPSS under Options offers you the choice to fill in
missing values with the means (method A) as it operates. The data in the Data
view does not get visibly altered: just you find all the cases have been
used instead of those with missings left out. Similarly in Regression with
optimal scaling, which works on associations between categories rather than
interval scores, there is the choice to use Mode imputation, which fills
in the missings with the most popular category in the relevant column.
In
situations where method B is suitable, you have to use Analyze…Missing Value
Analysis to actually fill in the missing values in the data in Data view
beforehand. Basic instructions: at the first box, enter all the columns
relevant to the analysis you will be doing, either as quantitative (i.e.
interval) or categorical (categories/nominal). Only the former are
actually used in the estimation of missing scores, though (SPSS does not seem
to provide a way of filling in missing category data by Method B). Tick EM and if there are some
quantitative columns that you don’t want used as a basis for predicting values
of missings, then click the Variables button and make your selection.
Otherwise all the quantitative columns you declared in the first box are used
to predict any missings in each other. Click the EM button and tick Save
completed data; then under File name a file for it to be stored in.
Then Save…Continue…OK… The procedure will produce various output, but
mainly you are interested in the new stored file of data. If you call it up,
you will find the missings all filled in.
In data for
independent groups analysis (e.g. t tests, ANOVA), with missings in the DV
column, if you have other columns of dependent variable data not being used in
the same analysis, you could use them to fill in the missings by method B.
Otherwise you can only use method A – i.e. use the mean for the DV column (NOT
the mean of each group) to fill them in.
First ensure you have the fonts of your choice (e.g.
SILManuscriptIPA etc...) installed in
Windows in the usual way. If they are available to you in Word
via Insert… Symbol, then they will be available in SPSS. If not, get a copy of the font file (ending .ttf) and put it in
the Fonts subdirectory of the Windows folder on your PC.
Now, having made a graph in SPSS, click the graph you
have created to make it appear in the Chart editing window. Then click
the part you want to put special symbols in, such as the bottom scale, so it
comes up outlined. Next click Format...Text and select the required
font from the menu and the size you want and click Apply, Close.
Now when you click the scale of the graph and choose
to change the Labels, you can type the symbols you want. However, you
don't initially see them when you type them in the dialog box. You have to know
that in the SIL font shift-t gets you the θ symbol for the th sound of thick,
though it will look as if you have just got T. Anyway, you have to type
all the labels in the new font; you cannot mix symbols from different fonts, I
think. So retype the labels using Change, and Continue. The
symbols you want will appear on the graph itself.
I have not found a way to get symbols that are coded
outside the range of the font that is covered by the keyboard keys, with and
without shift. To know what symbols you can get from which key with and without
shift, you may have to study the table of symbols for your font in advance
through a program such as Word which displays it through the Insert..Symbol
option.
This term is found
used in two distinct senses. Both involve data where variables or experimental
conditions are measured using sets of items for each in some way.
A)
The usual traditional
sense found especially in the pedagogical ‘testing’ literature. Here it applies
in the situation where a set of items is used to measure what is regarded as
one single variable/construct. The set of items is usually thought of as a
multi-item test of one thing (e.g. reading ability, or vocabulary size).
However, item analysis may also be applied to, say, a set of Gardner-type
statements for respondents to agree or not with, where a distinct attitude or orientation
is measured by an inventory of five such statements, rather than just one. It can also apply separately to each set of
items designed collectively to measure a single condition in an experiment.
Item analysis in all these instances is the activity of checking whether there
are some items in the set that in some way do not seem to belong there,
illuminating how and, if possible, why they are ‘odd’, and maybe removing them
or replacing them with better items when the test is used again. It is closely
tied to internal reliability checking, often done these days with the use of
the Cronbach alpha coefficient or Rasch analysis. Removing items that are odd
improves reliability. This sort of item analysis is often done in pilot
studies, as it represents a way of refining the quality of instruments for use
in a main study. There are several statistical criteria for deciding what items
are ‘odd’ in a set that is supposed to be all measuring one thing. See further
my Reliability handouts. Where items are supposed to attract similar levels of
response (e.g. be of similar difficulty) then the classical IA approach
involving alpha is appropriate; where items are supposed to be graded, and form
an implicational scale, then an approach using IRT/Rasch is better. Where response
times are involved, other criteria may be used to exclude
responses for specific people on specific items rather than whole
items.
B) The sense in
which it is found used in some psycholinguistic literature. Here it denotes a
second kind of analysis of data, beyond the usual default one. In an item
analysis, instead of the subjects (usually people) being treated as the cases,
the items are treated as the cases. Hence it is really ‘analysis with items as
cases’, rather than ‘item analysis’, and is typically part of the analysis of
the results of a main study. This applies only when a study has several
conditions, each represented by a set of items, but this is very common in
psycholinguistic studies, where subjects’ performance in different conditions
is often measured by their responses to sets of stimuli in a repeated measures
design. For example a repeated measures variable ‘word frequency’ might be
constituted as three sets of ten words, of three different frequency levels,
making 30 items for people to respond to in some way; a variable ‘early vs late attachment’ could be
instantiated as two sets of sentences, of two structure types, one in which a
relative clause has to be parsed with an early noun phrase, the other with a late
occurring one. Often such data arises also in areas such as SLA, applied
linguistics and even sociolinguistic research as well as psycholinguistics, but
‘item analysis’ in this sense is only routine in the latter, where it is
regarded as a further confirmation of results obtained by the usual ‘subject
analysis’, i.e. ‘analysis with subjects as cases’. Where, as often, ANOVA (see
my handouts) is used to analyse the results, then the F values for the
‘subjects as cases’ analysis are reported as F1, and those for the ‘items as
cases’ analysis as F2. Statisticians generally regard analysis with subjects as
cases as the sounder basis, due especially to the ‘independence’ requirement.
Cases have to be regarded as providing independent observations if the
assumptions of inferential statistical tests (e.g. ANOVA) are to be met. While
it is generally not difficult to assume that responses from different people
are independent of each other, it is not so certain if responses to different
items are so independent, when the same people respond to all of them. One has
to assume that in psycholinguistic experiments people are unable to make their
responses to one item reflect their response to another. This is often assumed
by phoneticians and psycholinguists.
Imaginary
dataset to illuminate both the above. Suppose we have two groups of ten people (G1 and G2), and each respond in
two conditions (C1 and C2), where 5 items are used to obtain responses for each
condition. As laid out for a customary ‘subjects as cases’ analysis in SPSS
this would appear as 11 columns and 20 rows thus. Of course, the items would
often not have been presented to subjects in an experiment in sets, but
intermixed with each other and maybe with additional distracter/filler items
that are not scored at all.
| Group | C1 item1 | C1 item2 | C1 item3 | C1 item4 | C1 item5 | C2 item1 | C2 item2 | C2 item3 | C2 item4 | C2 item5 |
| 10 rows labelled 1, to mark each G1 subject | Scores for each G1 person on C1 item 1 | Scores for each G1 person on C1 item 2 | Etc. | | | | | | | |
| 10 rows labelled 2, to mark each G2 subject | Scores for each G2 person on C1 item 1 | Etc. | | | | | | | | |
To
do item analysis (A) above in SPSS, you would split the file by Group
and use Analyze… Scale… Reliability analysis… Alpha on each set of five
items separately (or for Rasch analysis, you need other software). Four
analyses. That means that the internal consistency is always assessed within a
collection of scores which is from a set of items that supposedly measures one
thing, and which comes from a homogeneous group of subjects. After any
adjustment of the data to improve reliability based on the above, you then
typically move on to the actual analysis of results with subjects as cases.
You first produce two extra columns which contain the averages of each five
item set of scores for each person. Use Transform… Compute. These Mean
C1 and Mean C2 columns each now summarise the performance of subjects in one condition.
Those two columns, together with the Group column, are then used in a mixed two
way ANOVA to see if there is a sig difference between groups or between
conditions, or a significant interaction effect. That is your ‘subjects as
cases’ F1 ANOVA.
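For those curious what the Alpha figure from the reliability analysis is based on, here is a Python sketch of the standard Cronbach alpha formula, applied to one set of items for one group. The data are invented (three items and four subjects, to keep it short), and the SPSS output contains much more than this single figure, e.g. what alpha would be if each item were deleted.

```python
import statistics

def cronbach_alpha(item_columns):
    """item_columns: one list of scores per item, each with one
    entry per subject, in the same subject order."""
    k = len(item_columns)
    totals = [sum(scores) for scores in zip(*item_columns)]
    sum_item_variances = sum(statistics.variance(col) for col in item_columns)
    return (k / (k - 1)) * (1 - sum_item_variances / statistics.variance(totals))

# Hypothetical scores of four subjects on three items of one condition
items = [
    [1, 2, 3, 4],  # item 1
    [2, 2, 4, 4],  # item 2
    [1, 3, 3, 5],  # item 3
]
alpha = cronbach_alpha(items)
print(round(alpha, 3))
```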
For
item analysis (B), you need to make the items into the rows. You can do this
with Data… Transpose in SPSS. If you start from the data as displayed
above and include all the columns you end up with 11 rows, which were
previously the columns. There are columns now for each of the 20 subjects. You
can now use Transform… Compute to get two new columns calculated which
represent the mean scores for each group of subjects on each item. Then delete
the row that contains the grouping numbers. Add a column of 5 1s and 5 2s to
record which items (now rows) relate to condition C1 and which to C2. So the
data should end up much as below. Finally use the column that records whether
an item belongs to C1 or C2, and the two columns of group mean scores for each
item. Again do a mixed two way ANOVA to see if there is a sig difference
between groups or between conditions, or a significant interaction effect. That
is your ‘items as cases’ F2 ANOVA. Note that what was a repeated measures
factor in the F1 ‘subject analysis’, condition, becomes a between groups factor
in the F2 ‘item analysis’. The grouping of subjects, which was a between groups
factor in F1, becomes a repeated measures factor in F2.
| G1 subj1 | G1 subj2 | G1 subj3 | Etc. to G1 subj10 | G2 subj1 | G2 subj2 | G2 subj3 | Etc. to G2 subj10 | Condition | Group 1 | Group 2 |
| 5 rows with scores for G1 subj1 on each C1 item | Scores for G1 subj2 on each C1 item | Etc. | | | | | | 5 rows labelled 1, to mark each C1 item | Mean scores of 10 G1 subjects on each C1 item | Mean scores of 10 G2 subjects on each C1 item |
| 5 rows with scores for G1 subj1 on each C2 item | Etc. | | | | | | | 5 rows labelled 2, to mark each C2 item | Mean scores of 10 G1 subjects on each C2 item | Mean scores of 10 G2 subjects on each C2 item |
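The two ways of summarising the same scores, ready for the F1 and F2 analyses, can be sketched in Python as follows. The scores are invented and, to keep the example short, there are only two subjects per group rather than ten. The ANOVAs themselves are not run here; the sketch only produces the two layouts of means.

```python
from statistics import mean

# One row per subject: (group, scores on C1 items 1-5 then C2 items 1-5)
subjects = [
    (1, [4, 5, 3, 4, 4, 6, 7, 5, 6, 6]),
    (1, [3, 4, 4, 3, 5, 5, 6, 6, 5, 7]),
    (2, [5, 6, 5, 6, 5, 6, 6, 7, 6, 7]),
    (2, [6, 5, 6, 6, 6, 7, 7, 6, 7, 8]),
]
C1, C2 = slice(0, 5), slice(5, 10)

# Subjects as cases (for F1): group, mean on C1, mean on C2 per subject
by_subject = [(group, mean(scores[C1]), mean(scores[C2]))
              for group, scores in subjects]

# Items as cases (for F2): condition, G1 mean, G2 mean per item
by_item = []
for i in range(10):
    condition = 1 if i < 5 else 2
    g1_mean = mean(scores[i] for group, scores in subjects if group == 1)
    g2_mean = mean(scores[i] for group, scores in subjects if group == 2)
    by_item.append((condition, g1_mean, g2_mean))

print(by_subject)
print(by_item)
```

In by_subject, condition is spread across two columns (repeated measures) and group is a label; in by_item it is the other way round, which is the switch of factor types noted above.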
Note,
the above account of items-as-cases analysis assumed that the sets of items
used to represent the two conditions were not themselves matched or repeated in
any way. I.e. C1 items 1-5 might have been five nouns as stimuli in some
response time experiment, and C2 items 1-5 five verbs, with no special
connection between individual verbs in one set and individual nouns in the
other. If however the items are themselves matched in pairs or repeated in
different forms etc. across conditions, the items as cases analysis should be
different. E.g. if C1 items were five verbs in the past tense and C2 five verbs
in the bare infinitive form, the researcher might choose to use the same
five verbs in both conditions (randomised with suitable distracters
interspersed when they are actually presented to subjects). Then the items are
individually matched and the items-as-cases analysis should be done with the
items as repeated measures. I.e. in the data grid above for SPSS, the 5 rows
for C2 responses would need to be not below the 5 rows for C1 but side by side,
with the matched items in the same row, to allow repeated measures comparison
of items as well as subjects.
Checking for guessing or response bias when using certain data-gathering
instruments with closed responses
Any instrument where the subjects are given
choices to pick from for an answer is potentially open to guessing, in the
sense of ‘picking one option at random, without thought’.
For example, the respondent may randomly
pick one of the choices because
·
they can’t be bothered to think about the question/item… just want to
finish quickly
·
they don’t actually have any relevant knowledge to make a correct choice
·
they can’t understand the question (language too hard, too long,
pragmatically odd etc.)…
·
etc.
Clearly the results will not then be a true
measure of whatever the researcher
intended to measure, and could even vary if the subjects responded to
the same items again on another occasion. I.e. not valid or even reliable.
This affects multiple choice items, yes/no or
agree/disagree items in questionnaires and tests, rating scales and so forth.
Clearly it cannot affect instruments which have open response in some form,
i.e. with no alternatives supplied.
One cannot statistically tell definitely if
guessing has taken place or not, but one can check if the responses are like
those one would get from someone who was guessing, or not. Obviously it is
quite possible to get a real result, where people have paid attention and
answered sensibly, which happens to be similar to the guessing one. Only the
researcher can judge the interpretation.
You need to calculate what the result would
be, on average, for someone who was randomly guessing, and use the appropriate
one sample test (see my LG475-SP handout) to check if the observed result
differs significantly from the one you would get by random guessing.
For example:
1) 30 subjects have to answer yes or no to a
question about whether they use the keyword method of vocab learning or not.
Random guessing would yield an expected frequency of 30/2 = 15 yes
responses. Use the binomial test with a test proportion of 50%.
2) 30 subjects have to pick one of four
reasons they are offered for why they are learning English. Random guess
frequency of each choice being picked would be 30/4 = 7.5. Use chi squared one
sample fit test.
3) 30 subjects have to judge 20 words for
whether they exist or not in English. Thus each person gets a score out of 20
for how many they say exist. The average random guess score would be 20/2 = 10.
Use the one sample t test.
4) 30 subjects listen to a short talk and are
offered 5 test items afterwards. Each item consists of four sentences, one of
which occurred in the talk, while the others are similar but did not. In each
item subjects have to pick the sentence that they had heard. Thus they can get
a score of max 5 correct. Average random guess score would be 5/4 = 1.25. Use
the one sample t test.
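As a rough illustration of examples 1 and 3, here is a minimal sketch using only the Python standard library. The observed figures (21 yes responses, the 30 scores) are invented, and the function names are hypothetical; in practice SPSS or another statistics package would give you the p values directly.

```python
from math import comb, sqrt
from statistics import mean, stdev

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p value: add up the probabilities of all
    outcomes that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that equally likely outcomes are always included.
    return sum(pr for pr in probs if pr <= observed * (1 + 1e-9))

def one_sample_t(scores, guess_mean):
    """t statistic comparing the sample mean with the random-guess mean."""
    n = len(scores)
    return (mean(scores) - guess_mean) / (stdev(scores) / sqrt(n))

# Example 1 (invented result): 21 of 30 subjects say yes; guess rate is 15.
p_value = binomial_two_sided_p(21, 30)

# Example 3 (invented scores out of 20 for 30 subjects); guess rate is 20/2 = 10.
scores = [12, 14, 9, 15, 11, 13, 16, 10, 12, 14,
          13, 11, 15, 12, 10, 14, 13, 12, 16, 11,
          9, 13, 14, 12, 15, 11, 13, 12, 14, 10]
t_value = one_sample_t(scores, 20 / 2)
```

With 30 subjects the t statistic has 29 degrees of freedom, for which the two-tailed 5% critical value is 2.045; a t value beyond that would suggest the scores differ from the guess rate.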
One protection against guessing is to include
a ‘don’t know’ option and encourage respondents to use it. However, often in
tests you do not want to allow this: you want to force a response.
If blind guessing has been encouraged, or
appears to have been used a lot, then some researchers adjust all cases’ scores
for guessing (relevant in multi-item test examples 3 and 4 above) as follows:
Adjusted score = raw score − (maximum possible score − raw score) / (number of alternatives offered on each item − 1)
On the new scale someone who scores full marks
still gets the same full mark/max possible score but someone who scores the
guess rate score gets 0. So in example 4 someone with a raw score of 3 in fact
receives an adjusted score of 2.33. In example 3 someone with a raw score of 10 scores 0.
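The adjustment can be checked with a few lines of Python. This is only a sketch of the formula as described here; the function name is hypothetical.

```python
def adjust_for_guessing(raw, max_score, n_alternatives):
    """Raw score minus (max possible score - raw score) / (alternatives per item - 1)."""
    return raw - (max_score - raw) / (n_alternatives - 1)

example4 = adjust_for_guessing(3, 5, 4)    # about 2.33, as in example 4
example3 = adjust_for_guessing(10, 20, 2)  # 0.0, the guess-rate score in example 3
full = adjust_for_guessing(5, 5, 4)        # 5.0, full marks are unchanged
```

Note that a raw score below the guess rate comes out negative on the adjusted scale.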
In those same multiple choice instruments (and
indeed others) people may answer with ‘bias’. That is, although they are not
randomly picking options, they still do not always answer truthfully (whether
consciously or not). So again the measurement is not valid, though it may be
reliable, in that subjects may choose the same response to the same item on any
occasion.
Response bias may be affected by a number of
things, associated either with the subjects or the measurer or the instrument
itself, including
·
Researcher effect. The researcher may without realizing it convey the
idea that he expects attitudes to be favorable, answers to be ‘yes’ etc. and
the subjects may respond to this.
·
Subject confidence. For individual personality reasons, or maybe due to
cultural factors, subjects may be cautious and choose the midpoint on bipolar
rating scales (e.g. ‘neither agree nor disagree’) even when they have an
opinion. Or they may be overconfident and characteristically say ‘yes’.
·
Subject wish to be cooperative. For individual personality reasons, or
maybe due to cultural factors or young age, subjects may interpret being
cooperative as saying ‘yes’.
·
Instrument factors. If an instrument presents a lot of items with the
same response choices (e.g. all yes/no, or all an agree-disagree scale) and if
the ones responded to first elicit a similar response choice then this can form
the basis for a ‘set’ and other items may be automatically answered by
selecting the same option.
·
Cost or benefit perceived by subject. In a vocab test where each word in a list
has to be marked as known or not known, the testees may see it as a
benefit to themselves to get as high a ‘known’ score as possible, so will tend
to overdo that choice, and the tester will want to check for this. When deciding
if learners pass or fail an English test for air pilots, the examiners may feel
that there is a big risk in passing someone who is not really up to the
standard, so a benefit in erring on the side of failing too many candidates.
A way to check for bias is to include
additional items where you know in advance what the answer should be for
subjects like these. Then if you don’t get the expected answer on those, you can
see that subjects may be exhibiting bias in general. This is a form of control, or
construct validation.
In a special instance, this evidence may be
used to adjust scores for bias. Take the case of the vocab test mentioned above
where subjects have to say if words they see exist or are known. Several of the
factors mentioned above might favour ‘yes bias’. One way to counter this is to
focus the testee’s attention on ‘no’ rather than ‘yes’ by making the task to
mark which words they do not know / do not exist, rather than to
mark those that do. However, we can also check and adjust for yes bias
as follows.
Though the test is of claimed knowledge of
real words, it is possible to intermingle randomly in the test items some
non-existent words to be judged. We know that the subjects cannot know them,
because they could never have met them. Hence their response should be ‘no’ for
all these. If we get ‘yes’ responses for some of the unreal words, we have
evidence of ‘yes bias’ and can quantify
it. One could do a similar trick in grammaticality judgment tests, by including
sentences with structures impossible in the languages under consideration,
along with those of interest to us.
| Stimuli: | Real words / True items (focus of test) | Non-existing words / False items (used as controls / yes bias checks) |
| Response: yes, known / exists | True positive (hit) | False positive (false alarm) |
| Response: no, not known / does not exist | False negative (miss) | True negative (correct rejection) |
Some researchers simply exclude any cases who give two or more false positive responses. Adjusting the scores for yes bias, rather than just excluding people, is more complex (see me).
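The simple exclusion rule might be sketched as follows in Python. The data and names are invented for illustration: each subject has a list of yes (True) / no (False) responses, and a parallel list marks which items are real words.

```python
def count_false_alarms(responses, is_real):
    """Count yes responses (True) given to non-existent control items."""
    return sum(1 for said_yes, real in zip(responses, is_real)
               if said_yes and not real)

def keep_case(responses, is_real, max_false_alarms=1):
    """Keep a subject only if they gave fewer than two false positives."""
    return count_false_alarms(responses, is_real) <= max_false_alarms

# Invented data: six items, the last three are non-existent control words.
is_real = [True, True, True, False, False, False]
subjects = {
    "s1": [True, True, False, False, False, False],  # 0 false alarms
    "s2": [True, True, True, True, False, False],    # 1 false alarm
    "s3": [True, True, True, True, True, False],     # 2 false alarms
}
kept = [name for name, resp in subjects.items() if keep_case(resp, is_real)]
```

Here only s3 is excluded, having said yes to two of the non-existent words.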
In response time experiments it is common to filter the data by eliminating (a) extreme response times, and/or (b) response times where the response was in fact wrong.
a) Suppose subjects respond to 50 stimuli, representing three conditions (i.e. ten stimuli of each of three types of interest, with 20 distractors). Maybe they have to judge the existence or not of the word they see as fast as possible. Within each set of ten, for each person, it is common practice to eliminate responses where the value is far above or below the mean response time for that person in that condition. The argument is that if the time is excessively long, the subjects were not giving the spontaneous intuitive responses the psycholinguist wants, but referring to other types of knowledge such as explicitly learnt rules (i.e. ‘thinking too hard’); if the times are very short, maybe they were not thinking at all but just pressing a key at random to get on with the task as fast as possible.
Commonly a distance of two standard deviations above and below the mean is taken. Anything outside that for a person on an item within any condition is regarded as inadmissible and treated as missing. The mean score for a condition for a person is then calculated using the remaining responses.
To get SPSS to do this, we first assume that the data is entered as usual with a column for the response times to each stimulus and a row for each person. Imagine columns labeled st1, st2 etc. with the first ten columns representing response times for one condition / stimulus type.
Suppose you are working on filtering st1, turning any extreme values into missings.
The result can be achieved by getting SPSS to create new columns in turn for each stimulus, via the Transform… Compute facility. At the dialog box enter the name of the new column top left as target variable – e.g. st1f. Next enter the original column, st1, as the numeric expression. Next click on If and opt for Include if case satisfies condition. Then write the condition so that the scores you want to keep pass the condition. E.g.
st1 < (MEAN(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10) + (2*(SD(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10))))
AND
st1 > (MEAN(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10) - (2*(SD(st1,st2,st3,st4,st5,st6,st7,st8,st9,st10))))
The new column st1f will have
missing values where the data was extreme. Alter the statement to do st2, and
so on in turn, in the same way.
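For anyone working outside SPSS, the same filtering logic can be sketched in Python. The function name and the response times are invented for illustration. As in the SPSS condition, the mean and SD are calculated over all ten of the person's times in the condition, including the one being checked, and the SD is the sample SD.

```python
from statistics import mean, stdev

def trim_extremes(times, n_sd=2):
    """Replace times outside mean +/- n_sd sample SDs with None (missing).
    The mean and SD use all of the person's times in the condition."""
    m, sd = mean(times), stdev(times)
    low, high = m - n_sd * sd, m + n_sd * sd
    return [t if low < t < high else None for t in times]

# Invented response times (ms) for one person on st1 to st10.
person = [520, 540, 510, 530, 525, 515, 535, 545, 505, 1400]
filtered = trim_extremes(person)
```

In this invented case only the last time (1400 ms) falls outside two SDs of the person's mean and becomes missing.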
b) The same
sort of thing can be done to turn response times into missing values
where the responses were wrong. If there is a separate set of columns sta1,
sta2… recording accuracy of response as 1 or 0 for each stimulus, then
write the If condition for st1 simply as sta1 = 1.
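The equivalent step in Python might look like this sketch, where the function name and the two lists (standing in for st1… and sta1… for one person) are invented:

```python
def keep_correct_only(times, accuracy):
    """Return the times with None (missing) wherever the response was scored 0."""
    return [t if acc == 1 else None for t, acc in zip(times, accuracy)]

times = [520, 640, 580, 700]   # invented st1 to st4 for one person
accuracy = [1, 0, 1, 1]        # invented sta1 to sta4 (1 = correct, 0 = wrong)
correct_times = keep_correct_only(times, accuracy)
```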
After doing either of the above you will need to
combine the columns for the relevant sets of items to create summary scores for
cases for each condition (e.g. st1f through st10f). This is
usually done by taking the mean for each person over the
non-missing items that they have scores for.
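As a sketch of that last step in Python (the function name and values are invented), the mean is simply taken over whatever values are left for the person in that condition:

```python
from statistics import mean

def condition_mean(filtered_times):
    """Mean of the non-missing values; None if every value is missing."""
    present = [t for t in filtered_times if t is not None]
    return mean(present) if present else None

# Invented filtered times for one person in one condition (None = missing).
summary = condition_mean([520, None, 580, 700, None])
```

Returning None when nothing is left is just one choice; a person with no usable responses in a condition has no summary score for it.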