Dr. Mark Gardener


On this page...

Making Data
Combine command
Types of Data
Entering data with scan()
Multiple variables
More types of data
Variables within data
Transposing data
Making text columns
Missing values
Stacking data
Selecting columns
Naming columns
Unstacking data

Using R for statistical analyses - More on data

This page is intended to be a help in getting to grips with the powerful statistical program called R. It is not intended as a course in statistics (see here for details about those). If you have an analysis to perform I hope that you will be able to find the commands you need here and copy/paste them into R to get going.

I run training courses in data management, visualisation and analysis using Excel and R: The Statistical Programming Environment. From 2013 courses will be held at The Field Studies Council Field Centre at Slapton Ley in Devon. Alternatively I can come to you and provide the training at your workplace. See details on my Courses Page.

On this page learn how to create and manipulate data without using a spreadsheet. Learn more about reading data files.






R is Open Source

R is Free

Get R at the R Project Page

What is R?

R is an open-source (GPL) statistical environment modeled after S and S-Plus. The S language was developed at AT&T's Bell Laboratories from the mid-1970s. The R project was started by Robert Gentleman and Ross Ihaka (hence the name, R) of the Statistics Department of the University of Auckland in 1995. It has quickly gained a widespread audience. It is currently maintained by the R core-development team, a hard-working, international team of volunteer developers. The R project web page is the main site for information on R. At this site are directions for obtaining the software, accompanying packages and other sources of documentation.

R is a powerful statistical program but it is first and foremost a programming language. Many routines have been written for R by people all over the world and made freely available from the R project website as "packages". However, the basic installation (for Linux, Windows or Mac) contains a powerful set of tools for most purposes.

Because R is a programming language it can seem a bit daunting; you have to type in commands to get it to work. However, it does have a Graphical User Interface (GUI) to make things easier. You can also copy and paste text from other applications into it (e.g. word processors). So, if you have a library of these commands it is easy to pop in the ones you need for the task at hand. That is the purpose of this web page; to provide a library of basic commands that the user can copy and paste into R to perform a variety of statistical analyses.



read.csv() is the most useful command for entering large and complex data sets into R.

Creating data

With larger data sets the most useful method of creating and storing your information remains the use of a spreadsheet. R can read spreadsheet files in .XLS format but it is probably better to use .CSV. This format is readily opened by text editors and can be easily modified. Your original data set can be kept in native spreadsheet format and you can use 'save as' to create a .CSV file for the analysis you want to run. To remind yourself about creating and reading CSV files see the introduction page.

The most useful function to read data into R is the read.csv() command. Here is a recap:

variable = read.csv(file.choose(), header=TRUE, row.names=1)

file.choose() opens an explorer-type window allowing you to select your file.
header=TRUE reads the 1st row as a list of column names (you can set this to FALSE).
row.names=1 tells R which column contains the row names, here the 1st; leave this parameter out if there are none.
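
To try the recap without a file on disk, here is a minimal sketch. It assumes the CSV content is supplied through the text= parameter of read.csv() in place of file.choose(); the names csv.text and bugs, and the data values, are made up for illustration.

```r
# Hypothetical CSV content held in a character string (made-up data)
csv.text = "site,count,length
A,12,3.4
B,7,2.9
C,15,4.1"

# header=TRUE takes the column names from the 1st row;
# row.names=1 uses the 1st column (site) as the row names
bugs = read.csv(text=csv.text, header=TRUE, row.names=1)

bugs             # display the data frame
rownames(bugs)   # the site labels
names(bugs)      # the remaining column names
```

With a real file you would swap text=csv.text for file.choose() or a file name in quotes.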

This is not the only way to get data into R as we shall find out now.



The c() command is used extensively in R, especially as a parameter within other functions. It is also a quick way to enter small amounts of data.

Combine values command

If you wish to enter a small vector of data it may not be worthwhile creating a spreadsheet, saving it as a CSV file and then reading it into R. It would be much easier to type the data in directly. There are several ways to do this. The first one is using the c() command (c is short for combine). An example will demonstrate its use:

data1 = c(2, 4, 5, 2, 3, 7, 8, 4)

Here we have created a variable called data1 and assigned the values in the brackets to it. We may now use the variable we created like any other.

We can use the c() command to append data to an existing vector e.g.

data1 = c(data1, 12, 14, 11, 9)

Now we have added 4 values to our existing variable. The c() command is also used as part of other functions in R. For example, in graphing it is possible to set the limits of the x and y axes; c() is called from within the plot() function like so:

plot(data, xlim= c(lower, upper), ylim= c(lower, upper), ...other commands)

See the section on scatter plots for more information on this command.
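
The steps above can be put together in a short sketch. The names x.limits and y.limits are made up here, and taking the axis limits from the data itself is just one possible choice.

```r
# Create a small numeric vector, then append to it with c()
data1 = c(2, 4, 5, 2, 3, 7, 8, 4)
data1 = c(data1, 12, 14, 11, 9)
data1
length(data1)    # how many values we now have

# c() also builds the two-element vectors that xlim and ylim expect;
# here the limits are taken from the data itself
x.limits = c(0, length(data1))
y.limits = c(min(data1), max(data1))
x.limits
y.limits
```

The two limit vectors could then be passed to plot() as xlim= x.limits and ylim= y.limits.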



Numeric values can be entered 'as is' but text values must be in "quotes" when using the c() command.

Types of data

The values we entered using our c() command were obviously numeric. We can enter text values merely by enclosing them in (double) quotes like so:

dates = c("Jan", "Feb", "Mar", "Apr", "May")

We now have a variable called dates which contains five text values.

What if we were to type in the months without quotes? Let's try and see:

month = c(Jan, Feb, Mar, Apr, May)
Error: object 'Jan' not found

Oh dear. So, it appears that we either have to have numbers or text values in quotes. It is possible to get one other data type but we will cover that when we get to it later on.
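
If you are unsure which type a vector holds, the class() command will tell you. A short sketch follows; class() is not used elsewhere on this page, and the variable names numbers and mixed are made up for illustration.

```r
numbers = c(2, 4, 5)
dates = c("Jan", "Feb", "Mar")
class(numbers)   # a numeric vector
class(dates)     # a character (text) vector

# Mixing the two in one c() command turns every item into text
mixed = c("Jan", 2, "Mar", 4)
class(mixed)
mixed
```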



scan() is a useful command for adding larger amounts of data. The basic command accepts numeric values only. To read in text values we must use scan(what="char")

Typing in values using scan()

Typing in values using the c() command is fine but when you have a substantial sample size you don't necessarily want to type all the commas! R provides another way of entering data using the scan() command. In its basic form the scan() command works like this:

more.data = scan()

1:

The 1: indicates that R is waiting for you to type in the first element of your data. What we need to do now is to type in some values; this time we separate them with spaces and don't bother with the commas. You can press the enter key to spread over several lines. Data entry will stop when you enter a blank line e.g.

1: 2 5 6.2 33 25 1.3 8
8: 111
9:
Read 8 items
>

To see what we entered type the name of the variable e.g.

more.data
[1] 2.0 5.0 6.2 33.0 25.0 1.3 8.0 111.0
>

We can see that R has appended decimals to our data so that the precision matches for all items in the vector.

If we try the same thing but with text labels what happens?

more.months = scan()
1: jan feb mar apr
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'jan'
>

It looks like we might need to enter the values in quotes again. It is a real pain to enter lots of quotes so let's find a way around that. Try this:

more.months = scan(what="char")
1: jan feb mar apr may jun
7: jul
8:
Read 7 items
> more.months
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul"
>

That's better; now we don't have to type quotes around each item we merely type what="char" to tell the function to expect text values. In fact we cannot read text values into the scan() command in any other way. In addition we cannot mix text and numeric values.
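
The interactive sessions above cannot be copied and pasted as they stand, so here is a sketch that does the same thing in one go. It assumes the values are handed to scan() through its text= parameter instead of being typed at the prompt; quiet=TRUE merely suppresses the "Read n items" message.

```r
# Numeric values: the default, no quotes or commas needed
more.data = scan(text="2 5 6.2 33 25 1.3 8 111", quiet=TRUE)
more.data

# Text values: what="char" tells scan() to expect text
more.months = scan(text="jan feb mar apr may jun jul", what="char", quiet=TRUE)
more.months
```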



Multiple variables

When you only have 1-2 variables to input and these are of moderate length, it may be worthwhile entering them using scan() or c() commands. However, when you have more data it is usually better to enter the data into a spreadsheet first and then save as a CSV file for input to R. This subject was introduced earlier (see data files) but here we'll add a bit more detail.


Types of data (again)

So far we have looked at two types of data item, numeric and text. Let's get a data file to illustrate:

twoway = read.csv(file.choose())
twoway

 
   height    plant water
1       9 vulgaris    lo
2      11 vulgaris    lo
3       6 vulgaris    lo
4      14 vulgaris   mid
5      17 vulgaris   mid
6      19 vulgaris   mid
7      28 vulgaris    hi
8      31 vulgaris    hi
9      32 vulgaris    hi
10      7   sativa    lo
11      6   sativa    lo
12      5   sativa    lo
13     14   sativa   mid
14     17   sativa   mid
15     15   sativa   mid
16     44   sativa    hi
17     38   sativa    hi
18     37   sativa    hi

We have three variables, height, plant and water. This is the sort of thing you would expect to form the basis for a two-way analysis of variance. In order for R to read the variables from this data file we would attach() the main variable e.g. attach(twoway). However, it is possible to read the variables without doing this.



To access variables from within larger data sets we can use one of several methods:

attach(data.frame) allows the variables to be accessed by typing the name.

data.frame$variable reads a variable directly.

data.frame[row, col] allows you to access a specific row, column or element.

Variables inside data sets

To see the height variable we type the following:

twoway$height

[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

We see a vector of numbers; it is obviously a numeric variable. Notice how we type the name of the original variable then append a dollar sign and the name of the variable within it that we wish to see.

If we look at the water variable next:

twoway$water

[1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
[17] hi hi
Levels: hi lo mid

This is something new; the variable doesn't appear to be text (the items are not enclosed in quotes). The first couple of lines show us the data items in the order they are in the table and then we see a line starting with "Levels:". This line shows us that there are three 'things' in the water variable: lo, mid and hi. This type of variable is a factor (as opposed to character or numeric). R assumes that all text values in your CSV file are either headings or factors unless you specifically tell it otherwise. (This was the default before R version 4.0.0; from that version on, text columns are read as character unless you add stringsAsFactors=TRUE to the read.csv() command.) We will cover this later.

A single variable is termed a vector. When we read in a larger data file (e.g. a CSV file) the resulting variable (e.g. twoway above) is called a data frame. We can display the individual variables from the data frame by using the $ symbol as we have just seen. However, there is another way. The data frame is composed of rows and columns; we can pull out individual items using the following syntax:

data.frame[row, col]

So, to see the height variable we type:

twoway[,1]

[1] 9 11 6 14 17 19 28 31 32 7 6 5 14 17 15 44 38 37

Since we left the row blank all rows are displayed.

If we wish to see the water variable we type:

twoway[,3]

[1] lo lo lo mid mid mid hi hi hi lo lo lo mid mid mid hi
[17] hi hi
Levels: hi lo mid

We can display a single row of course:

twoway[4,]

 
  height    plant water
4     14 vulgaris   mid
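
The access methods in this section can be tried on a cut-down copy of the twoway data. This is only a sketch: the data frame small is built by hand from six of the rows shown earlier, stringsAsFactors=TRUE is added so the text columns become factors as described above, and with() is an extra base R command (not covered on this page) that gives access by name without attach().

```r
small = data.frame(
  height = c(9, 11, 14, 17, 28, 31),
  plant  = rep("vulgaris", 6),
  water  = c("lo", "lo", "mid", "mid", "hi", "hi"),
  stringsAsFactors = TRUE   # make the text columns factors
)

small$height          # a variable by name
small[, 3]            # a variable by column number (water)
small[4, ]            # a single row
small[4, 1]           # a single element: row 4, column 1
levels(small$water)   # the levels of the factor

# with() evaluates an expression using the variables inside the data frame
with(small, mean(height))
```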


The transpose command t() is a fast way to re-arrange a data frame by switching rows and columns.

Transposing data frames

Once you read a CSV file of data into R you have a data frame. Here is a simple example showing monthly mean temperatures for an Antarctic research station:

vostok

 
   month  temp
1    Jan -32.0
2    Feb -47.3
3    Mar -57.2
4    Apr -62.9
5    May -61.0
6    Jun -70.6
7    Jul -65.5
8    Aug -68.2
9    Sep -63.2
10   Oct -58.0
11   Nov -42.0
12   Dec -30.4

Apart from the fact that it is decidedly chilly we can see that we have two variables, month and temp arranged in two columns. If we wished to create a bar chart of these data it may be more useful to have the data arranged in 12 columns, one for each month, rather than the two. We can switch around a data frame using the transpose command t(). To do that we merely type t(dataname) e.g.

t(vostok)

  1 2 3 4 5 6 7 8 9 10 11 12
month "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
temp "-32.0" "-47.3" "-57.2" "-62.9" "-61.0" "-70.6" "-65.5" "-68.2" "-63.2" "-58.0" "-42.0" "-30.4"

The data frame has now been switched around. Also we can see that all the data are enclosed in quotes as if they were text. What has happened is that R has taken the data from the data.frame and made it into a matrix. This is a separate type of data item that I won't cover here.

The t() function is useful for producing barplots that may contain both row and column headings as it allows you to display (and therefore graph) the data sorted by row or column.

To see an individual row or column in a matrix we cannot use the $ notation but we can use the [row, col] method e.g.

t(vostok)[2,]

  1 2 3 4 5 6 7 8 9 10 11 12
temp "-32.0" "-47.3" "-57.2" "-62.9" "-61.0" "-70.6" "-65.5" "-68.2" "-63.2" "-58.0" "-42.0" "-30.4"

This displays the second row only (the temperatures). To see the 2nd column only we type:

t(vostok)[,2]

month temp
"Feb" "-47.3"

Interestingly it does not display as we might expect (although it is the 2nd column). We can replace the single number in the square brackets with an expression. So if, for example, we wanted to see the 2nd, 3rd and 4th columns we could type:

t(vostok)[,2:4]

  2 3 4
month "Feb" "Mar" "Apr"
temp "-47.3" "-57.2" "-62.9"

The expression now reads, columns 2 to 4. For a more complex arrangement we can use the c() function that we have come across before (see creating data above and the section on scatter plots) e.g.

t(vostok)[, c(1, 2, 6, 7)]

  1 2 6 7
month "Jan" "Feb" "Jun" "Jul"
temp "-32.0" "-47.3" "-70.6" "-65.5"
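
Since t() turns a mixed data frame into a matrix of text, one possible way to keep the temperatures numeric is to move the month labels into the row names before transposing. The following is a sketch under that assumption, using the first four months of the vostok data; the names vostok.small, temps and temps.t are made up here.

```r
vostok.small = data.frame(
  month = c("Jan", "Feb", "Mar", "Apr"),
  temp  = c(-32.0, -47.3, -57.2, -62.9),
  stringsAsFactors = FALSE
)

# Transposing the mixed data frame gives a matrix of text
is.character(t(vostok.small))

# Use the months as row names and transpose the numeric column only
temps = data.frame(temp = vostok.small$temp, row.names = vostok.small$month)
temps.t = t(temps)
temps.t               # one row (temp), one column per month
is.numeric(temps.t)
```

A numeric matrix such as temps.t can be passed straight to barplot().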


Making text columns in data frames

If we create a data frame in our spreadsheet and save the result as a CSV file for reading into R we get a selection of numeric and factor variables. However, we may wish to have R regard some of the variables as text (i.e. character variables). To do this we append a separate command to the read.csv function.

In the example above we only had 2 columns; the file was read into R using a basic command:

vostok = read.csv(file.choose())

Since the CSV file already contained the column headings no other parameters were required. However, if we wish to alter the 1st column (month) from a factor to a character we need to use the as.is parameter, giving it the column number, like so:

vostok = read.csv(file.choose(), as.is=1)

Now the 1st column of data will be read as character rather than as a factor. If you wish to include several columns you can use syntax similar to above e.g. x:y or c(x, y, z)
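
Here is a sketch of the difference, using a few rows of the vostok data typed in as text so that no file is needed. The text= parameter and the names csv.text, vostok.f and vostok.c are assumptions made for this example; stringsAsFactors=TRUE is included so that the factor behaviour described on this page is reproduced whichever version of R you have.

```r
csv.text = "month,temp
Jan,-32.0
Feb,-47.3
Mar,-57.2"

# Without as.is the month column is read as a factor
vostok.f = read.csv(text=csv.text, stringsAsFactors=TRUE)
class(vostok.f$month)

# as.is=1 keeps the 1st column as plain text (character)
vostok.c = read.csv(text=csv.text, stringsAsFactors=TRUE, as.is=1)
class(vostok.c$month)
class(vostok.c$temp)   # the numeric column is unaffected
```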



Missing values

A data frame is a rectangular table made up of a number of columns, each containing a series of data as numbers or text. If one column is shorter than the others it will be padded out with NA values. These are ignored by most stats tests, but routines such as mean() and median() return NA when any are present. In most cases you can tell a routine to skip the NA values by including the parameter na.rm= TRUE (see the section on basic stats).
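
A short sketch shows the effect of na.rm= TRUE. The values are those of the test column in the sugars data used further down this page, typed in by hand with its three NA items.

```r
test = c(63, 64, 66, 65, 67, 68, 64, NA, NA, NA)

mean(test)                 # NA, because the missing values are included
mean(test, na.rm=TRUE)     # mean of the 7 real values
median(test, na.rm=TRUE)   # median of the 7 real values
sum(is.na(test))           # how many values are missing
```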



Stacking data

The data frame you are working with may contain several columns, each containing a sample of numeric data. Here is a sample data file (called sugars). Each column shows the growth of an insect fed on a particular diet. These data were used in the demonstration of one-way ANOVA:

sugars

 
    C  G  F F.G  S test
1  75 57 58  58 62   63
2  67 58 61  59 66   64
3  70 60 56  58 65   66
4  75 59 58  61 63   65
5  65 62 57  57 64   67
6  71 60 56  56 62   68
7  67 60 61  58 65   64
8  67 57 60  57 65   NA
9  76 59 57  57 62   NA
10 68 61 58  59 67   NA

We can see the data are in 6 columns, each representing a sample. These are the sort of data that would likely be analysed using ANOVA. However, the aov() routine in R requires the data to be organized in a slightly different manner. What is required are two columns only, one for the growth data (i.e. the numbers) and one for the factors (i.e. the types of treatment, the sugars). Ideally you would have entered the data into your spreadsheet in the appropriate manner right at the start but, if for some reason this was not done then all is not lost.

R provides a routine to take the individual columns and stack them together to form a new data frame in the correct fashion for our ANOVA. The command is stack(data.frame) and if we perform this on our sugar data we see something like the following:

stack(sugars)

 
   values ind
1      75   C
2      67   C
3      70   C
4      75   C
5      65   C
6      71   C
7      67   C
8      67   C
9      76   C
10     68   C
11     57   G
...

The function creates two columns, the numbers are placed in a column entitled values whilst the factors are entitled ind. We can now perform our analysis on the stacked data, either by assigning it to a new variable name (easiest option) or replacing the variables in the aov() expression with the stack() variables e.g.

carbs = stack(sugars)
aov(values ~ ind, data= carbs)

or...

aov(stack(sugars)$values ~ stack(sugars)$ind)
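
To try this without the sugars file, here is a minimal sketch. It assumes a cut-down data frame typed in by hand (the first four values of the C, G and F columns); the names sugars.small and carbs.small are made up for the example.

```r
sugars.small = data.frame(
  C = c(75, 67, 70, 75),
  G = c(57, 58, 60, 59),
  F = c(58, 61, 56, 58)
)

# Stack the three columns into one column of numbers and one of labels
carbs.small = stack(sugars.small)
carbs.small          # two columns: values and ind
str(carbs.small)     # values is numeric, ind is a factor

# The stacked form is what aov() expects
summary(aov(values ~ ind, data= carbs.small))
```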


Selecting columns

It is possible that you may want to extract only some of the columns from a data frame. The stack() command allows you to select which columns to make into the new stacked variable.

In general terms the command is:

stack(data, select= c(var1, var2))

Notice how the list of variables we wish to extract is in the c(item1, item2) format that we have come across before (see also the examples in the section on scatter plots). For the example above, if we wished to extract only "pure" sugars we might use the following command:

sugar.st = stack(sugars, select= c(C, F, G, S))

The new data frame now contains two columns entitled values and ind as before, but we have left out the samples for F.G and test.
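
A small sketch of the select= parameter follows, again on a hand-typed data frame (the first three rows of four of the sugars columns). The name sugars.small is made up for the example.

```r
sugars.small = data.frame(
  C = c(75, 67, 70), G = c(57, 58, 60),
  F = c(58, 61, 56), F.G = c(58, 59, 58)
)

# Stack only the C and G columns; F and F.G are left out
sugar.st = stack(sugars.small, select= c(C, G))
sugar.st
unique(as.character(sugar.st$ind))   # which samples were kept
```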


Naming the stacked columns

It is possible to give more meaningful names to the two columns of your new stacked data frame. To do this we use the names() command. In this instance we would type:

names(carbs) = c("growth", "sugar")

You will notice how the names are assigned using the c() function that we came across earlier (see also the examples in the section on scatter plots).
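
Once the columns are renamed, the new names are the ones to use in any model formula. A brief sketch, assuming a tiny hand-typed data set called carbs.small (a made-up name):

```r
carbs.small = stack(data.frame(C = c(75, 67, 70), G = c(57, 58, 60)))
names(carbs.small)     # the default names given by stack()

names(carbs.small) = c("growth", "sugar")
names(carbs.small)     # the new names

# The new names are then used in the model formula
aov(growth ~ sugar, data= carbs.small)
```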



Unstack

The opposite of stacking is unstacking! Using the example above, we have our stacked sugar/growth data and wish to extract the various samples into individual variables. We use the unstack(data.frame) command so:

unstack(carbs)

$C
[1] 75 67 70 75 65 71 67 67 76 68

$F
[1] 58 61 56 58 57 56 61 60 57 58

$F.G
[1] 58 59 58 61 57 56 58 57 57 59

$G
[1] 57 58 60 59 62 60 60 57 59 61

$S
[1] 62 66 65 63 64 62 65 65 62 67

$test
[1] 63 64 66 65 67 68 64

Now we have a list of six vectors, one for each sample (i.e. sugar). To see a single sample we use the $ notation e.g.

unstack(carbs)$F

[1] 58 61 56 58 57 56 61 60 57 58

This can be useful to extract a single sample for some other analysis.
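
Here is a self-contained sketch of unstacking. It assumes a small stacked data frame typed in by hand, with samples of unequal size; the names stacked and samples are made up. The formula values ~ ind is given explicitly to say which column holds the numbers and which holds the labels.

```r
# A stacked data frame with samples of unequal size
stacked = data.frame(
  values = c(75, 67, 70, 57, 58),
  ind    = factor(c("C", "C", "C", "G", "G"))
)

# Split values according to ind
samples = unstack(stacked, values ~ ind)
samples           # a list, because the samples differ in length
samples$G         # a single sample
mean(samples$C)   # the sample can now be used in other commands
```

When all the samples are the same length unstack() returns a data frame rather than a list, but the $ notation works in both cases.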

 
