Bluebell flower production

These data are from an undergraduate project in environmental science. The data are counts of numbers of flowers from individual bluebell (Hyacinthoides non-scripta) plants at two woods near Ashford in Southern England (2007). There are 527 observations in the Kings Wood sample and 365 in the Stubbs Cross Wood sample. The following table shows a small sample of the data.

Table 1. Number of flowers from bluebell plants at two woods.

Kings Wood    Stubbs Cross Wood
5             4
5             6
10            5
5             6
7             4
9             4
9             5
11            5
4             6
7             7

Download

You can download the dataset as a TXT file using this link: <bluebell-flowers.txt>. The file is tab-separated and will open in a text editor or spreadsheet. It has two columns: Site and Flowers.
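As a minimal sketch of loading the file, the Python below groups the flower counts by site. It assumes the file has a header row naming the Site and Flowers columns and that it has been saved locally as bluebell-flowers.txt; the function name read_flowers is made up for this illustration.

```python
import csv
from collections import defaultdict


def read_flowers(stream):
    """Group flower counts by site from a tab-separated text stream.

    Assumes a header row with the columns Site and Flowers.
    """
    counts = defaultdict(list)
    for row in csv.DictReader(stream, delimiter="\t"):
        counts[row["Site"]].append(int(row["Flowers"]))
    return dict(counts)


# Usage (assumed local file name; adjust the path as needed):
# with open("bluebell-flowers.txt", newline="") as f:
#     flowers = read_flowers(f)
# print({site: len(values) for site, values in flowers.items()})
```

The result is a dictionary with one list of counts per site, which is a convenient shape for summaries and tests.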

Usage

You can use these data to practise or illustrate various topics:

  • Sampling (e.g. getting smaller samples from the main dataset).
  • Data distribution (e.g. are the data normally distributed?).
  • Summary statistics.
  • Differences hypothesis test (e.g. Wilcoxon rank sum test).
  • Graphics (e.g. box-whisker plot).

Keywords:

Plant, Bluebell, flower, data distribution, non-parametric, differences, U-test, graphics, boxplot, sampling, Poisson distribution, permutation test, Shapiro-Wilk, Kolmogorov-Smirnov.

Examples

The following examples will give you a few ideas about how you might explore or use these data.

Data distribution

There are two samples. Since these are count data they may well not be normally distributed. Counts are often modelled with a Poisson distribution, so that is a sensible first candidate to check. You can try histograms to visualise the distribution.

Figure. Distribution of the number of flowers per bluebell plant.
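A quick way to look at the shape without any plotting library is a text histogram. The sketch below uses only the ten Kings Wood values from Table 1, so it is an illustration rather than an analysis of the full sample; the function name text_histogram is an assumption of this example. It also prints the mean and variance, which are equal for a Poisson distribution.

```python
from collections import Counter
from statistics import mean, variance

# First ten Kings Wood counts from Table 1 (a small illustration only).
kings = [5, 5, 10, 5, 7, 9, 9, 11, 4, 7]


def text_histogram(counts):
    """Return one line per flower count, with a bar of '#' for its frequency."""
    freq = Counter(counts)
    return [f"{k:>3} | {'#' * freq[k]}" for k in range(min(counts), max(counts) + 1)]


print("\n".join(text_histogram(kings)))

# For a Poisson distribution the mean and variance are equal,
# so a ratio far from 1 hints at a poor fit.
print("mean:", mean(kings), "variance:", round(variance(kings), 2))
```

With the full dataset you would pass in all the counts for one site at a time.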

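You might also use a hypothesis test, such as Shapiro-Wilk, to look at the normality (or otherwise) of the samples. A Kolmogorov-Smirnov test could also be used to look for goodness of fit to a Poisson “shape”, although that test is designed for continuous distributions, so with tied counts the result is only approximate.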
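To give a feel for what a Kolmogorov-Smirnov comparison measures, the sketch below computes only the test statistic: the largest gap between the empirical cumulative distribution and a Poisson cumulative distribution. It assumes the counts are non-negative integers and estimates the Poisson rate from the sample mean. It does not compute a p-value, because standard tables do not apply directly when the data are discrete and the rate is estimated from the same data. The function name poisson_ks_statistic is made up for this example.

```python
import math
from collections import Counter


def poisson_ks_statistic(counts):
    """Largest gap between the empirical CDF and a Poisson CDF.

    The Poisson rate is estimated by the sample mean (an assumption of this sketch).
    """
    n = len(counts)
    lam = sum(counts) / n
    freq = Counter(counts)
    pmf = math.exp(-lam)  # P(X = 0)
    emp_cdf = pois_cdf = d = 0.0
    for k in range(max(counts) + 1):
        if k > 0:
            pmf *= lam / k  # P(X = k) from P(X = k - 1)
        pois_cdf += pmf
        emp_cdf += freq[k] / n
        d = max(d, abs(emp_cdf - pois_cdf))
    return d
```

A value near 0 means the two curves are close everywhere; larger values mean a poorer match to the Poisson shape.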

Sampling and subsets

Each sample has a large number of observations (527 and 365) so you could use the dataset as a database and use random sampling to pick smaller sets. Look at a running mean (or median) for increasing numbers of observations: how many observations does it take for the average to “settle down”?
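The running mean idea can be sketched as follows: take the observations in a random order and record the mean after each new observation. The function names and the fixed seed are choices made for this illustration only.

```python
import random
from itertools import accumulate


def running_mean(values):
    """Mean of the first 1, 2, 3, ... values."""
    return [total / i for i, total in enumerate(accumulate(values), start=1)]


def shuffled_running_mean(values, seed=1):
    """Running mean of the values taken in a random order (no replacement)."""
    rng = random.Random(seed)
    return running_mean(rng.sample(values, len(values)))
```

Plotting the result against the number of observations, for several different seeds, shows roughly where the line flattens out.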

Data summary

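The data are in two separate samples and you can summarise each using various statistics. Since the data are probably not normally distributed, the median and inter-quartile range may be more appropriate than the mean and standard deviation. It might be interesting to look at smaller subsets of the data. You might also want to look at standard error by taking the means of repeated random samples: the standard deviation of those means is an estimate of the standard error.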
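As a rough sketch, the summaries and the repeated-sampling view of standard error could look like this in Python. The function names, the number of repeats and the seed are assumptions of this example. Note that the subsamples are drawn without replacement, so when the subsample size approaches the size of the dataset the spread of the means will shrink towards zero.

```python
import random
from statistics import mean, median, quantiles, stdev


def summarise(values):
    """Basic summary statistics for one sample."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": q3 - q1,
    }


def se_from_resampling(values, size, repeats=1000, seed=1):
    """Standard deviation of the means of repeated random subsamples."""
    rng = random.Random(seed)
    means = [mean(rng.sample(values, size)) for _ in range(repeats)]
    return stdev(means)
```

You could compare the resampling estimate with the usual formula, the standard deviation divided by the square root of the subsample size.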

Differences test

Since the data are counts (of flowers) we would not necessarily expect them to be normally distributed. You could check this by making a histogram or carrying out a test of normality (such as a Shapiro-Wilk test). With such large sample sizes, we might argue that the central limit theorem allows us to try a t-test. A non-parametric alternative might be the Mann-Whitney U-test (also known as the Wilcoxon rank sum test).
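To show what the U-test does with the data, here is a minimal sketch that ranks the pooled values (ties share the average rank), computes U for the first sample, and gets a two-sided p-value from the normal approximation with the usual tie correction. It omits the continuity correction and will fail if every value is identical, so treat it as an illustration rather than a replacement for a statistics package. The function name mann_whitney_u is chosen for this example.

```python
import math
from collections import Counter
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value from a normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ties = Counter(list(x) + list(y))
    # Midrank for each distinct value: the average of the ranks it occupies.
    midrank, start = {}, 0
    for value in sorted(ties):
        t = ties[value]
        midrank[value] = start + (t + 1) / 2
        start += t
    u1 = sum(midrank[v] for v in x) - n1 * (n1 + 1) / 2
    # Standard deviation of U, corrected for tied values.
    tie_term = sum(t**3 - t for t in ties.values()) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    z = (u1 - n1 * n2 / 2) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p
```

The normal approximation is reasonable for samples as large as these, and flower counts contain many ties, which is why the tie correction matters here.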

You could also try a generalized linear model with a Poisson distribution, or carry out a permutation test.
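The permutation approach can be sketched in a few lines: pool the two samples, shuffle the pooled values repeatedly, and count how often a random split gives a difference in means at least as large as the one observed. The choice of the difference in means as the test statistic, the number of repeats and the seed are all assumptions of this example.

```python
import random
from statistics import mean


def permutation_test(x, y, repeats=2000, seed=1):
    """Two-sided permutation p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(x) - mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(repeats):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(x)]) - mean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (repeats + 1)
```

You could swap the mean for the median to get a test that is less sensitive to the few plants with very high counts.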

Graphics

There are various graphical methods that could be used to visualise these data; the box-whisker plot is probably the most informative. The boxplot shows median values as a stripe, inter-quartile ranges as a box, and the range as the whiskers, with any outliers drawn as separate points. In the plot of the “complete” dataset we can see a number of outlying points, representing a few individual plants with very high flower counts.

Figure. Number of flowers for bluebell plants at two woods in Southern UK.
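The numbers behind a box-whisker plot can be worked out directly. The sketch below assumes the common convention that points more than 1.5 times the inter-quartile range beyond the box are drawn as outliers, and that the whiskers stop at the most extreme points inside those limits; the plot described above may have used a different rule. The function name boxplot_stats is made up for this example.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Median, quartiles, whisker ends and outliers for one sample."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: fences at 1.5 x IQR beyond the quartiles.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }
```

Running this once per site gives the values you would need to draw the two boxes side by side.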

References

Constantinou, J. (2007) Undergraduate research project. S206: Environmental Science, Open University.

Links

General data science articles:

  • DataAnalytics Knowledge Base. For general topics and articles about data science, including Learning R: the statistical programming language.
  • DataAnalytics Tips and Tricks. For articles covering a range of topics in data science, including Using R, Using Excel, quantitative data analysis, predictive data analysis and a lot more besides.

See our Publications Page for an overview of our book on Ecology, Environmental Science and R: the statistical programming language.