Some of the most interesting investigations compare numerical data across groups. No new methods are required: we simply make a numerical plot for each group and place the plots on a common scale. Two convenient methods are introduced here: side-by-side box plots and hollow histograms.
We will take a look again at the county data set and compare the median household income for counties that gained population from 2000 to 2010 versus counties that had no gain. While we might like to make a causal connection here, remember that these are observational data and so such an interpretation would be unjustified.
There were 2,041 counties where the population increased from 2000 to 2010, and there were 1,099 counties with no gain (all but one were a loss). A random sample of 100 counties from the first group and 50 from the second group are shown in Table 1.14 to give a better sense of some of the raw data. The values are median household incomes in thousands of dollars.
| population gain | | | | | | no gain | | |
|---|---|---|---|---|---|---|---|---|
| 41.2 | 33.1 | 30.4 | 37.3 | 79.1 | 34.5 | 40.3 | 33.5 | 34.8 | 
| 22.9 | 39.9 | 31.4 | 45.1 | 50.6 | 59.4 | 29.5 | 31.8 | 41.3 | 
| 47.9 | 36.4 | 42.2 | 43.2 | 31.8 | 36.9 | 28 | 39.1 | 42.8 | 
| 50.1 | 27.3 | 37.5 | 53.5 | 26.1 | 57.2 | 38.1 | 39.5 | 22.3 | 
| 57.4 | 42.6 | 40.6 | 48.8 | 28.1 | 29.4 | 43.3 | 37.5 | 47.1 | 
| 43.8 | 26 | 33.8 | 35.7 | 38.5 | 42.3 | 43.7 | 36.7 | 36 | 
| 41.3 | 40.5 | 68.3 | 31 | 46.7 | 30.5 | 35.8 | 38.7 | 39.8 | 
| 68.3 | 48.3 | 38.7 | 62 | 37.6 | 32.2 | 46 | 42.3 | 48.2 | 
| 42.6 | 53.6 | 50.7 | 35.1 | 30.6 | 56.8 | 38.6 | 31.9 | 31.1 | 
| 66.4 | 41.4 | 34.3 | 38.9 | 37.3 | 41.7 | 37.6 | 29.3 | 30.1 | 
| 51.9 | 83.3 | 46.3 | 48.4 | 40.8 | 42.6 | 57.5 | 32.6 | 31.1 | 
| 44.5 | 34 | 48.7 | 45.2 | 34.7 | 32.2 | 46.2 | 26.5 | 40.1 | 
| 39.4 | 38.6 | 40 | 57.3 | 45.2 | 33.1 | 38.4 | 46.7 | 25.9 | 
| 43.8 | 71.7 | 45.1 | 32.2 | 63.3 | 54.7 | 36.4 | 41.5 | 45.7 | 
| 71.3 | 36.3 | 36.4 | 41 | 37 | 66.7 | 39.7 | 37 | 37.7 | 
| 50.2 | 45.8 | 45.7 | 60.2 | 53.1 | | 21.4 | 29.3 | 50.1 |
| 35.8 | 40.4 | 51.5 | 66.4 | 36.1 | | 43.6 | 39.8 | |
The side-by-side box plot is a traditional tool for comparing across groups. An example is shown in the left panel of Figure LABEL:countyIncomeSplitByPopGain, where there are two box plots, one for each group, placed into one plotting window and drawn on the same scale.
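R> # TRUE for counties where column 4 exceeds column 3, i.e. the population grew
R> popgain <- (county[,4] - county[,3]) > 0
R> # list the TRUE (gain) level first so the boxes match the order of the names
R> boxplot(county[,10] ~ factor(popgain, levels=c(TRUE,FALSE)), names=c('gain','no gain'))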
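For readers without the county data at hand, the same idea can be sketched on a small subset. The block below is a minimal illustration, not the code behind the figure: it uses only the first two rows of Table 1.14 (12 gain counties, 6 no-gain counties), and the names `gain`, `no_gain`, `income_by_group`, and `bp` are chosen here for illustration. It builds a grouping factor with an explicit level order and inspects the values that `boxplot` returns.

```r
# Median household income in $1000s, taken from the first two rows of
# Table 1.14 only, so the plot differs from the one based on all counties.
gain    <- c(41.2, 33.1, 30.4, 37.3, 79.1, 34.5,
             22.9, 39.9, 31.4, 45.1, 50.6, 59.4)
no_gain <- c(40.3, 33.5, 34.8, 29.5, 31.8, 41.3)

# One row per county. The level order of the factor fixes the
# left-to-right order of the boxes.
income_by_group <- data.frame(
  income = c(gain, no_gain),
  group  = factor(rep(c("gain", "no gain"),
                      times = c(length(gain), length(no_gain))),
                  levels = c("gain", "no gain"))
)

# Draw both boxes in one plotting window on a shared vertical scale.
bp <- boxplot(income ~ group, data = income_by_group,
              ylab = "median household income ($1000s)")

# boxplot() invisibly returns what it drew: group names, group sizes,
# a five-row matrix (whisker ends, hinges, median), and any points
# plotted beyond the whiskers.
bp$names
bp$n
bp$stats
bp$out
```

Setting the factor levels explicitly avoids relying on the default ordering of the grouping variable, which is an easy way to end up with mislabelled boxes.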
Another useful plotting method uses hollow histograms to compare numerical data across groups. These are just the outlines of histograms of each group put on the same plot, as shown in the right panel of Figure LABEL:countyIncomeSplitByPopGain.
R> library(openintro)  # provides histPlot
R> histPlot(county[which(!popgain),10],breaks=50,hollow=TRUE)
R> histPlot(county[which(popgain),10],breaks=50,hollow=TRUE,add=TRUE)
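A hollow histogram can also be approximated in base R, which may help clarify what the outlines represent. The sketch below is one possible approach under stated assumptions, not the method used for the figure: it again uses only the first two rows of Table 1.14, the bin edges in `breaks` are chosen by hand, the helper `draw_outline` is a hypothetical name, and densities are plotted instead of counts so that groups of different sizes remain comparable.

```r
# Median household income in $1000s, first two rows of Table 1.14 only.
gain    <- c(41.2, 33.1, 30.4, 37.3, 79.1, 34.5,
             22.9, 39.9, 31.4, 45.1, 50.6, 59.4)
no_gain <- c(40.3, 33.5, 34.8, 29.5, 31.8, 41.3)

# Shared bin edges so the two outlines can be compared directly.
breaks <- seq(20, 80, by = 10)

# Compute the histograms without drawing them.
h_gain    <- hist(gain,    breaks = breaks, plot = FALSE)
h_no_gain <- hist(no_gain, breaks = breaks, plot = FALSE)

# Hypothetical helper: trace the outline of a histogram as a step line
# that starts and ends on the horizontal axis.
draw_outline <- function(h, ...) {
  lines(c(h$breaks[1], h$breaks), c(0, h$density, 0), type = "s", ...)
}

# Empty plotting region large enough for both outlines.
y_max <- max(h_gain$density, h_no_gain$density)
plot(range(breaks), c(0, y_max), type = "n",
     xlab = "median household income ($1000s)", ylab = "density")

draw_outline(h_gain, lty = 1)     # solid line: gain
draw_outline(h_no_gain, lty = 2)  # dashed line: no gain
legend("topright", legend = c("gain", "no gain"), lty = c(1, 2))
```

Using different line types for the two groups keeps the outlines distinguishable where they overlap.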
Use the plots in Figure LABEL:countyIncomeSplitByPopGain to compare the incomes for counties across the two groups. What do you notice about the approximate centre of each group? What do you notice about the variability between groups? Is the shape relatively consistent between groups? How many prominent modes are there for each group?
Answer. Answers may vary a little. The counties with population gains tend to have higher income (median of about $45,000) versus counties without a gain (median of about $40,000). The variability is also slightly larger for the population gain group. This is evident in the IQR, which is about 50% bigger in the gain group. Both distributions show slight to moderate right skew and are unimodal. There is a secondary small bump at about $60,000 for the no gain group, visible in the hollow histogram plot, that seems out of place. (Looking into the data set, we would find that 8 of these 15 counties are in Alaska and Texas.) The box plots indicate there are many observations far above the median in each group, though we should anticipate that many observations will fall beyond the whiskers when using such a large data set.
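The comparison of centres and spreads in this answer can also be checked numerically. The following is a small sketch on the first two rows of Table 1.14 only, so its medians and IQRs will not match the values quoted above for all counties; the names `summarise_group`, `group_summary`, and `iqr_ratio` are illustrative.

```r
# Median household income in $1000s, first two rows of Table 1.14 only.
gain    <- c(41.2, 33.1, 30.4, 37.3, 79.1, 34.5,
             22.9, 39.9, 31.4, 45.1, 50.6, 59.4)
no_gain <- c(40.3, 33.5, 34.8, 29.5, 31.8, 41.3)

# Centre (median) and spread (interquartile range) of one group.
summarise_group <- function(x) c(median = median(x), IQR = IQR(x))

# One row per group, one column per statistic.
group_summary <- rbind("gain"    = summarise_group(gain),
                       "no gain" = summarise_group(no_gain))
group_summary

# How many times larger the IQR of the gain group is.
iqr_ratio <- group_summary["gain", "IQR"] / group_summary["no gain", "IQR"]
iqr_ratio
```

With only 18 counties the ratio is far less stable than one computed from all 3,140, so it should be read as an illustration of the calculation and not as an estimate.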
What components of each plot in Figure LABEL:countyIncomeSplitByPopGain do you find most useful?
Answer. Answers will vary. The side-by-side box plots are especially useful for comparing centres and spreads, while the hollow histograms are more useful for seeing distribution shape, skew, and groups of anomalies.