Distribution shapes in histograms

 

A histogram displays numerical scores for a group or condition. Along the bottom is the score scale (interval scale), and up the side is plotted the frequency, i.e. the number of cases that obtained each score. The linked bars that make up the histogram rise high or low, reflecting how many people obtained each score or range of scores on the bottom scale. Often the score scale on the bottom is grouped: e.g. instead of showing a bar for people who scored 5 and another for people who scored 6, etc., the histogram shows one bar for those who scored 4.001-6, another for 6.001-8, etc. See examples below. You can tell SPSS the interval size you want, and what end points of the scale you want displayed. These choices can often make the graph look quite different!
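To see how much the choice of interval size matters, here is a minimal Python sketch (not SPSS output) that groups the same scores two ways and prints a rough text histogram for each. The scores are invented for illustration, and the function name `bin_counts` and the convention that each interval excludes its lower end and includes its upper end are assumptions of this sketch.

```python
import math
from collections import Counter

def bin_counts(scores, width, start):
    """Count scores in intervals (start, start+width], (start+width, start+2*width], ..."""
    counts = Counter()
    for s in scores:
        k = math.ceil((s - start) / width) - 1  # index of the interval containing s
        k = max(k, 0)  # a score exactly equal to `start` is put in the first interval
        counts[k] += 1
    return {(start + k * width, start + (k + 1) * width): n
            for k, n in sorted(counts.items())}

# Invented scores, purely for illustration.
scores = [5, 6, 6, 7, 7, 7, 8, 8, 9, 10]

for width in (1, 2):
    print(f"interval width {width}")
    for (lo, hi), n in bin_counts(scores, width, start=4).items():
        print(f"  {lo:>2} < score <= {hi:<2} {'#' * n}")
```

With width 1 the heap shows a clear single peak at 7; with width 2 the same data collapses into three bars, which is the kind of change in appearance referred to above.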

 

Looking at a histogram one can see visually

 

  • what score or range of scores was most often obtained by the cases: this is where the heap is highest, i.e. the mode. This may or may not be close to the average (= mean) score, depending on how symmetrical the shape is
  • how spread out the scores are along the score scale, reflected also in the SD
  • details of the shape of the distribution, which we consider below
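The first two of these can also be put as numbers. The sketch below uses Python's standard `statistics` module on a small invented set of scores; it is only an illustration, not the data from any of the studies described here.

```python
import statistics

# Invented scores, purely for illustration.
scores = [28, 30, 31, 33, 33, 33, 34, 35, 36, 40]

mode = statistics.mode(scores)   # the most frequently obtained score (top of the heap)
mean = statistics.mean(scores)   # the average score
sd = statistics.stdev(scores)    # sample SD, a measure of spread

print(f"mode = {mode}, mean = {mean:.1f}, SD = {sd:.2f}")
```

Here the mode and the mean are close together, as one would expect from a roughly symmetrical heap.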

 

REAL DISTRIBUTION SHAPES

 

More or less symmetrical distribution

 

A single heap with more or less the same sized ‘tails’ on each side. This arises typically where the cases measured are a more or less homogeneous group exhibiting some degree of variation, as is to be expected, and most cases are scoring some way from the fixed ends of the score scale, if it has any.

 

Here is an example, which is about as symmetrical as you often get in real data. 77 middle class learners of English in Colleges in Pakistan were asked to show their degree of agreement, on a scale 1-5, to each of 10 statements which collectively measure Interest in Foreign Languages (e.g. ‘I would really like to learn a lot of foreign languages’). The totals for this variable therefore potentially range between 10 and 50, but in fact most people scored well in between, average 33. The middle of the scale is of course 30.

 

 

The spread (reflected by the size of the SD) can be greater or smaller. Here it is quite high, approaching half the maximum it could be: that max here is half the scale length, 40/2=20 (see article on SD).
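The claim that the maximum SD is half the scale length can be checked with a small Python sketch. It assumes the most spread-out case possible, with half the cases at the bottom of the 10-50 scale and half at the top, and uses the population SD (dividing by n); the sample SD (dividing by n-1) would come out very slightly larger.

```python
import statistics

low, high = 10, 50                      # ends of the scale in the example above
extreme = [low] * 50 + [high] * 50      # hypothetical: half the cases at each end

max_sd = statistics.pstdev(extreme)     # population SD of the most spread-out case
half_scale_length = (high - low) / 2

print(max_sd, half_scale_length)
```

Every case here lies exactly 20 points from the mean of 30, so no other arrangement of scores on this scale can give a larger SD.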

 

 

Moderately skewed distribution

 

A distribution heaped on the left with a longer tail on the right is said to be positively skewed. One heaped on the right with a longer tail on the left is said to be negatively skewed. Often the heap is on the side nearest a fixed end of the scale.
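The direction of skew can also be read off a number rather than the picture. The Python sketch below uses one common measure, the average cubed standardised deviation from the mean, which is an assumption of this sketch and not necessarily the exact formula SPSS reports. The scores are invented.

```python
import statistics

def skewness(scores):
    """Average cubed z-score: positive = longer right tail, negative = longer left tail."""
    m = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((s - m) / sd) ** 3 for s in scores) / len(scores)

# Invented scores on a 10-50 scale, heaped near the bottom end.
heaped_left = [10, 11, 11, 12, 12, 12, 13, 15, 20, 30]
# The mirror image of the same scores, heaped near the top end.
heaped_right = [60 - s for s in heaped_left]

print(f"heaped left:  {skewness(heaped_left):+.2f}")
print(f"heaped right: {skewness(heaped_right):+.2f}")
```

The first figure comes out positive and the second negative, matching the naming above.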

 

Example. 72 lower class learners of English in Colleges in Pakistan were asked to show their degree of agreement, on a scale 1-5, to each of 10 statements which collectively measure the Parental Encouragement they receive (e.g. ‘My parents/guardians want me to learn English’). The totals for this variable therefore potentially range between 10 and 50: in fact most people scored fairly near the bottom of the scale, mean 21, thus creating a positive skew in the results.

 

 

 

J shaped distribution

 

The J is an extreme case of negative skew. The reverse J or ‘ski-jump’ shape is an extreme case of positive skew. In both of these typically the cases are scoring tight up against a fixed end limit of a scale. The J is like a heap which has one half missing, because the scale comes to an end so there is nowhere for the expected other half of the distribution to fall.

 

Here is an example of a J shaped distribution. 40 teachers of English composition in Saudi Arabia were asked to say how often they wrote words of praise on student compositions, on a scale 4=always to 0=never. The results graph as follows and we can see that most claimed to be full of praise, thus creating this shape. If the results had been for student test scores, for example, one would have said that the test was rather easy for the subjects and therefore ‘ceiling effect’ is manifested.

 

 

Below is a reverse J distribution, an extreme example of positive skew (compare the milder version in the last section). Arabic learners of English were required to write a first draft and a final draft of a composition. By comparing drafts, the numbers of revisions they made were counted. One type, shown here, is the number of revisions of units of phrase size (as against single words or whole sentences) made by each person. Since many writers made no revisions of this sort at all, we get ‘floor effect’ with a heap bunched against the bottom end of the scale which is 0 (the top end of this scale has no fixed limit, of course).

 

 

U shape, or any of the above with gaps/marked low points indicating more than one heap

 

Distributions like this are indications that the cases you have graphed may not be a homogeneous group but really two or more groups from more than one population. Hence you have, as it were, more than one symmetric or skewed heap displayed, one for each group that scores rather differently. You can find an example of this sort of distribution shape where I discuss ways of dividing subjects into distinct groups when initially they all have obtained scores on a continuous scale. One way of doing that is by a simple form of ‘cluster analysis’ - visually using low points or spaces in the distribution to decide where to cut the scale and say ‘all above this score are group A and all below group B’.

 

Here is another example. 217 learners of English in Colleges in Pakistan were asked to show their degree of agreement, on a scale 1-5, to each of 5 statements which collectively measure their perception of how far their Pakistani identity is threatened by English (e.g. ‘When I use English, I don’t feel that I am Pakistani any more’). The totals for this variable therefore potentially range between 5 and 25. As we see, the subjects have polarised into two types of respondent. The majority see quite a high threat, forming a negatively skewed heap on the right, while a smaller group, forming a more symmetrical heap on the left, sees a relatively low threat. If we were to cut the scale at 16 and treat the groups as separate, one has 68 members, mean 9.5, the other 149, mean 21.3. In short, we have probably got two quite different kinds of people portrayed here: on closer investigation it emerged that in fact the low group is nearly all upper class, the high group almost exclusively middle and working class.
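Cutting the scale and describing each group separately can be sketched in Python as follows. The scores are invented (they are not the 217 real responses), the function name `split_at` is hypothetical, and putting a score exactly at the cut point into the upper group is an assumption, since the text does not say which side it belongs to.

```python
import statistics

def split_at(scores, cut):
    """Divide scores into those below the cut point and those at or above it."""
    low = [s for s in scores if s < cut]
    high = [s for s in scores if s >= cut]   # assumption: the cut score itself goes high
    return low, high

# Invented totals on the 5-25 scale, with a gap in the middle.
scores = [7, 8, 9, 10, 11, 12, 18, 20, 21, 22, 22, 23, 23, 24, 25]

low, high = split_at(scores, cut=16)
print(f"low group:  n = {len(low)}, mean = {statistics.mean(low)}")
print(f"high group: n = {len(high)}, mean = {statistics.mean(high)}")
```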

 

 

WHY LOOK AT DISTRIBUTION SHAPES?

 

OK, so we can look at the shapes we get, give them names, and say a bit about what they might indicate about our results… but why else are they important?

 

The answer is that it may be relevant to the further statistical procedures you want to use. Many popular significance tests such as t tests and ANOVAs are of the ‘parametric’ type, meaning they require the data to have certain properties, one of which is ‘normality of distribution of the population’. Usually we don’t know the distribution shapes of the populations which we claim to have sampled for each group or condition involved, so have to assess them from the samples… I.e. if we want to do things properly (and not just skip over thinking about the distributional prerequisites of sig tests altogether) we have to

EITHER do tests to see if the shape of the sample could readily be from a population with the normal distribution (e.g. the Kolmogorov-Smirnov or KS one sample test under nonparametric in SPSS)

OR assess the population shape from the sample distribution's shape just by eye. This is what we pursue below…
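For the first of these options, the quantity that the KS one sample test rests on can be sketched in Python. This computes only the test statistic D, the largest gap between the cumulative proportions in the sample and the cumulative normal curve with the same mean and SD; it does not produce a significance level, which is best left to SPSS or similar software. The function name `ks_statistic` and both sets of scores are invented for illustration.

```python
from statistics import NormalDist, fmean, stdev

def ks_statistic(scores):
    """Largest gap between the sample's cumulative proportions and a fitted normal CDF."""
    xs = sorted(scores)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))  # normal curve with the sample's mean and SD
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The sample's cumulative proportion steps from i/n to (i+1)/n at x;
        # the gap is checked on both sides of the step.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Invented data: one roughly symmetrical set, one heaped against zero.
symmetric = [26, 28, 30, 31, 32, 33, 33, 34, 35, 36, 38, 40]
skewed = [0, 0, 0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

print(f"D for symmetric scores: {ks_statistic(symmetric):.3f}")
print(f"D for skewed scores:    {ks_statistic(skewed):.3f}")
```

The larger D is, the worse the fit to a normal shape, so the skewed set comes out with the larger value.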

 

By and large, if the shape we see is skewed or discontinuous it is less likely to be from a normal distribution… but

  • with small samples the sample could have quite a non-normal shape and still possibly be from a population which has the normal distribution
  • some non-normal shapes can be converted to normal shape.
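As an illustration of the second point, one widely used conversion is to replace each score by its logarithm, which pulls in a long right-hand tail. The text above does not say which conversion is meant, so taking logs is an assumption of this sketch, as are the invented scores and the skew measure (the average cubed standardised deviation). Note that logs can only be taken of scores above zero.

```python
import math
import statistics

def skewness(scores):
    """Average cubed z-score: positive values indicate a longer right-hand tail."""
    m = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((s - m) / sd) ** 3 for s in scores) / len(scores)

# Invented positively skewed scores (all above zero, so that logs can be taken).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(s) for s in raw]

print(f"skew before: {skewness(raw):.2f}")
print(f"skew after:  {skewness(logged):.2f}")
```

The skew figure drops a good deal after the conversion, though in this small example it does not vanish entirely.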

 

In general we need to be familiar with some ideal shapes, especially the so-called normal distribution shape.

 

 

Three families of ideal distribution shapes

 

Ideal distribution shapes are always smooth and perfect, unlike the ones you get from actual data. That is because they are defined by mathematical formulae not the actual scores, responses etc. of real people, which tend to be irregular.

 

The formulae are of the form y = some function of x. Here x is the score scale on the bottom of the histogram, and y is the frequency scale up the side. A simple example of a histogram defined by a formula would be y=3x, which defines the histogram below. I.e. it says the number of people scoring any score is three times that score. It defines a skewed triangular distribution shape. However, this is not a very useful ideal distribution shape, since real data does not usually pattern anything like that.
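The way a formula generates a histogram can be sketched in a few lines of Python. The formula y=3x is the one from the text; the choice of scores 1 to 5 is an assumption made just to have something to print.

```python
def frequency(score):
    """The ideal frequency for a score under the formula y = 3x."""
    return 3 * score

# Scores 1 to 5 are chosen arbitrarily for illustration.
for score in range(1, 6):
    print(f"score {score}: {'#' * frequency(score)}")
```

The printed bars lengthen steadily as the score rises, which gives the triangular outline: perfectly regular, as ideal shapes always are.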

 

Obviously there is an endless number of ideal distribution shapes possible, as the mathematical formulae one could use are limitless. We are interested just in distribution shapes that seem to be close to real data: if you like, the shapes we think the data would have, were it not for the sort of random variation that real people are subject to. Three stand out as specially useful, though their formulae are much more complicated than that of the example above.

 

FROM HERE ON, THIS IS INCOMPLETE!

 

The Normal Distribution

 

This shape, also known as the Gaussian curve, has a fearsome formula which crucially includes the mean and SD of the distribution on the right hand side. That is, y is a function of the mean and SD of the distribution as well as of x, so you get different normal curves depending on what the mean is and how spread out the scores are. They are always symmetrical, but some can be quite flat, some very tall and thin.
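For reference, the formula in its usual textbook form is given below in LaTeX notation. The symbols are the conventional ones and are not taken from the text above: \mu stands for the mean and \sigma for the SD. Note that in this form y is a relative frequency (density); to get expected numbers of cases for the bars of a histogram it would have to be multiplied by the number of cases and the interval width.

```latex
y = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \right)
```

A larger \sigma makes the curve flatter and wider, a smaller one taller and thinner, while \mu simply slides the whole curve along the score scale.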

 

One simple way to judge the normality of an actual sample is to get SPSS to superimpose a normal curve on the histogram of the actual data one has gathered. SPSS provides the version of the normal curve that fits the mean and SD of your data. Here we see it done for the histogram we looked at above. Intuitively the fit is not too bad… The highest point of the actual data is in the right place, and although there is an outsize bar on the left hand side (interval 10-25), it is compensated by a low one next door (25-30).
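What the superimposed curve amounts to can be sketched in Python: for each interval, work out how many cases a normal distribution with the sample's mean and SD would put there, and set that beside the number actually observed. This is an illustration of the idea only, not a description of how SPSS draws its curve. The scores, the interval end points and the function names are all invented.

```python
from statistics import NormalDist, fmean, stdev

def observed_counts(scores, edges):
    """Number of scores falling in each interval (lower end excluded, upper included)."""
    return [sum(lo < s <= hi for s in scores) for lo, hi in zip(edges, edges[1:])]

def expected_counts(scores, edges):
    """Number of cases a normal curve with the sample's mean and SD puts in each interval."""
    fitted = NormalDist(fmean(scores), stdev(scores))
    n = len(scores)
    return [n * (fitted.cdf(hi) - fitted.cdf(lo)) for lo, hi in zip(edges, edges[1:])]

# Invented scores and interval end points, purely for illustration.
scores = [26, 28, 30, 31, 32, 33, 33, 34, 35, 36, 38, 40]
edges = [20, 25, 30, 35, 40, 45]

observed = observed_counts(scores, edges)
expected = expected_counts(scores, edges)
for (lo, hi), o, e in zip(zip(edges, edges[1:]), observed, expected):
    print(f"{lo}-{hi}: observed {o}, expected under normal curve {e:.1f}")
```

Where the observed and expected figures are close in every interval, the bars will sit close to the curve.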

 

 

The Log-Normal Distribution

 

 

 

The Poisson Distribution