Before you can do any statistical analysis, you have to be able to manipulate your data. Formatting, sub-setting and correcting or removing data errors are crucial skills.

# load useful packages
library(tidyverse)

The data to wrangle

  • ID: measurement/row identifier.
  • DAY.IN.EXPT: day on which measurements were taken.
  • LARVAE: number of competing larvae in jar.
  • JAR: experimental jar identifier.
  • WEIGHT: mean weight of larvae (g).
  • AREA: mean area covered by larvae (mm2).
# read in the data
hoverflies <- read.csv("data/BleraGrowth4Quiz.csv")
# check the structure
str(hoverflies)
## 'data.frame':    1162 obs. of  6 variables:
##  $ ID         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DAY.IN.EXPT: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ LARVAE     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ JAR        : Factor w/ 240 levels "S40W140L1.1",..: 61 72 74 75 76 77 78 79 80 62 ...
##  $ WEIGHT     : num  0.0056 0.0252 0.033 0.0317 0.0178 0.0177 0.0341 0.0341 0.016 0.0385 ...
##  $ AREA       : num  0.066 0.218 0.321 0.248 0.147 0.207 0.281 0.303 0.18 0.32 ...
# get a feel of the first few rows
head(hoverflies)
ID DAY.IN.EXPT LARVAE JAR WEIGHT AREA
1 1 1 S40W70L1.1 0.0056 0.066
2 1 1 S40W70L1.2 0.0252 0.218
3 1 1 S40W70L1.3 0.0330 0.321
4 1 1 S40W70L1.4 0.0317 0.248
5 1 1 S40W70L1.5 0.0178 0.147
6 1 1 S40W70L1.6 0.0177 0.207

Reformatting vector types

In these data our experimental condition LARVAE is a number. We may want to have it as a factor instead.

# overwrite column with re-formatted version of data
hoverflies$LARVAE <- as.factor(hoverflies$LARVAE)
str(hoverflies$LARVAE)
##  Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...

If we need it as a number, it’s easily re-formatted with as.numeric().

# overwrite column with re-formatted version of data
hoverflies$LARVAE <- as.numeric(hoverflies$LARVAE)
str(hoverflies$LARVAE)
##  num [1:1162] 1 1 1 1 1 1 1 1 1 1 ...

There are lots of as.something() commands in R, but these are the most useful for now.


Subsetting

There are a variety of ways you can access parts of your data. It’s a good idea to try and become familiar with all of them.

Horizontally

Each row of data can be accessed with [rownumber,]. The comma is important here, without it you will get the first column instead (that would be the totally wrong data!).

hoverflies[1,] # The first row
ID DAY.IN.EXPT LARVAE JAR WEIGHT AREA
1 1 1 S40W70L1.1 0.0056 0.066

We can also match rows by asking if a particular condition is TRUE. Operators, like == are used in these expressions.

  • == Equal to
  • != Not equal to
  • > Greater than
  • >= Greater than or equal to
  • < Less than
  • <= Less than or equal to
# subset larvae that had no competition
hoverflies %>% filter(LARVAE == 1)
# subset records taken on any day except 27
hoverflies %>% filter(DAY.IN.EXPT != 27)

There are operators which combine conditional statements too.

  • & AND, where both conditions must return TRUE.
  • | OR, where at least one condition must return TRUE.
# subset records taken before day 27 which had competing larvae.
hoverflies %>% filter(DAY.IN.EXPT < 27 & LARVAE > 1)
# subset records from day 14 or 108
hoverflies %>% filter(DAY.IN.EXPT == 14 | DAY.IN.EXPT == 108)

Vertically

Each vector stored in a column can be accessed with $columnname (this is why column names never have spaces), pull(), or with [,columnnumber]. The comma is important here, without it you will get the entire column still formatted as a data-frame, much like select(). You’ll have the same data, but the difference in structure will affect how you can use it.

hoverflies$ID # the contents of ID as a vector
hoverflies[,1] # also the contents of ID (column 1) as a vector
hoverflies %>% pull(ID) # also the contents of ID as a vector

hoverflies[1] # a dataframe containing ID (column 1)
hoverflies %>% select(ID) # also a dataframe cotaining ID (column 1)

Values inside columns can be accessed with $columnname[rownumber].

hoverflies$JAR[1] # the first value in JAR

Multiple columns can be subset with select().

# subset by selecting comma separated columns
hoverflies %>% select(WEIGHT, AREA)

By both dimensions

If you want to filter by a condition you usually want to do this first, as vertical selection of columns removes the other data you might be filtering against.

# subset weights of larvae that competed (retaining dataframe structure and column WEIGHT)
hoverflies %>% filter(LARVAE > 1) %>% select(WEIGHT)
# same as above but with the contents as a simple vector
hoverflies %>% filter(LARVAE > 1) %>% pull(WEIGHT)

Individual values can also be accessed with [rownumber, columnnumber].

hoverflies[1,4] # the first value in JAR (column 4)

Fixing errors

There’s an extra heavy larvae here that must be a mistake.

hist(hoverflies$WEIGHT)

Correcting errors

If you know the correct weight you could update the value in the data frame.

# overwrite dataframe with corrected version
hoverflies <- hoverflies %>% mutate(
  WEIGHT = replace(
    WEIGHT, # replace with the same column data 
    which(WEIGHT > 0.6), # but where this is true
    0.3 # change the value to this
  )
)
# check correction
hist(hoverflies$WEIGHT)

Removing errors

If you don’t know the correction for certain, you should remove the entire row.

# overwrite the datafame with subset that removes error
hoverflies <- hoverflies %>% filter(WEIGHT < 0.3)
# check removal
hist(hoverflies$WEIGHT)


Making wide data long

Data recorded in spreadsheets is often in wide format, where many values that represent the same thing can be spread across multiple columns. Converting this kind of data to long format will make it much easier to work with, so it’s important to get comfortable with–you’ll do it a lot.

# get some wide data
exercise_wide <- read.csv("data/exercise.csv")
# check the structure
str(exercise_wide)
## 'data.frame':    32 obs. of  31 variables:
##  $ Before: int  80 88 70 79 68 59 56 84 56 56 ...
##  $ X1    : int  132 160 160 160 136 140 120 140 88 114 ...
##  $ X2    : int  124 104 120 120 116 72 108 136 80 54 ...
##  $ X3    : int  108 112 120 112 108 68 76 104 70 NA ...
##  $ X4    : int  108 96 108 104 92 56 72 104 60 NA ...
##  $ X5    : int  84 100 100 84 96 NA 68 100 56 NA ...
##  $ X6    : int  76 96 96 100 96 NA 63 88 NA NA ...
##  $ X7    : int  NA 80 84 92 92 NA 61 100 NA NA ...
##  $ X8    : int  NA NA 64 92 88 NA 56 96 NA NA ...
##  $ X9    : int  NA NA NA 92 76 NA NA 92 NA NA ...
##  $ X10   : int  NA NA NA 80 NA NA NA 84 NA NA ...
##  $ X11   : int  NA NA NA 92 NA NA NA 88 NA NA ...
##  $ X12   : int  NA NA NA 90 NA NA NA 80 NA NA ...
##  $ X13   : int  NA NA NA 92 NA NA NA 80 NA NA ...
##  $ X14   : int  NA NA NA 91 NA NA NA 84 NA NA ...
##  $ X15   : int  NA NA NA 95 NA NA NA 80 NA NA ...
##  $ X16   : int  NA NA NA 88 NA NA NA 76 NA NA ...
##  $ X17   : int  NA NA NA 81 NA NA NA NA NA NA ...
##  $ X18   : int  NA NA NA 89 NA NA NA NA NA NA ...
##  $ X19   : int  NA NA NA 79 NA NA NA NA NA NA ...
##  $ X20   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X21   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X22   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X23   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X24   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X25   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X26   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X27   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X28   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X29   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ X30   : logi  NA NA NA NA NA NA ...
# get a feel of the data
head(exercise_wide)
Before X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30
80 132 124 108 108 84 76 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
88 160 104 112 96 100 96 80 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
70 160 120 120 108 100 96 84 64 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
79 160 120 112 104 84 100 92 92 92 80 92 90 92 91 95 88 81 89 79 NA NA NA NA NA NA NA NA NA NA NA
68 136 116 108 92 96 96 92 88 76 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
59 140 72 68 56 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

In this dataset each row represents pulse rates of one student, before exercise and every minute thereafter until their pulse returned to it’s resting rate. We have many values which represent the same thing (pulse rate) spread across different columns. For this data to be long we need one column which contains all the pulse rates. That means we’ll need another for identifying each student, and one for the time pulse rate was taken too.

# find out how many rows (ie: people) we have
n_people <- length(exercise_wide$Before)
# add a new column for the person identifier
exercise_wide$person <- as.factor(1:n_people)
# make wide data long...
exercise_long <- gather(
  exercise_wide, # data to gather
  time, # new key column (ie: the gathered column headings)
  pulse.rate, # new val column (the contents of those columns)
  Before:X30, # columns to gather
  factor_key = T # keys as factor
)
# check the new structure
str(exercise_long)
## 'data.frame':    992 obs. of  3 variables:
##  $ person    : Factor w/ 32 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ time      : Factor w/ 31 levels "Before","X1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pulse.rate: int  80 88 70 79 68 59 56 84 56 56 ...
# see the difference
head(exercise_long)
person time pulse.rate
1 Before 80
2 Before 88
3 Before 70
4 Before 79
5 Before 68
6 Before 59

Getting rid of NAs or NULLs

This data file had lot’s of empty cells which convert to the NAs you can see in the wide data head above. Now the columns are all gathered, getting rid of these dud rows is easy.

There are special functions to check for NA and NULL values which can be used within conditions; is.na() and is.null(). These return TRUE if the object passed to the function is NA or NULL respectively. To filter those which do NOT match, just use the exclamation mark operator ! infront of the function.

# subset rows that contain actual pulse rates 
# (ie: those that are not NA)
exercise_long %>% filter(!is.na(pulse.rate))

Fixing gathered keys that have lost numerical format

In this case, we had numerical data as column headings for all the minutes after exercise when pulse rate was recorded. Because columns are objects in R, they get prefixed with X automatically (you can’t have objects that begin with numbers).

# let's take a look
str(exercise_long$time)
##  Factor w/ 31 levels "Before","X1",..: 1 1 1 1 1 1 1 1 1 1 ...

To plot anything useful with our time column these need converting back to real numbers.

# first let's subset all the times after exercise
after_exercise <- exercise_long %>% filter(time != "Before")
# now we can replace the time column with new formatted times
after_exercise$time <- as.numeric( # makes them numeric
  # deletes Xs from factors by replacing them with an empty string
  gsub("X","", after_exercise$time) 
)
# check the structure
str(after_exercise$time)
##  num [1:960] 1 1 1 1 1 1 1 1 1 1 ...

If you wanted you could assign the before times as a minus number and merge these subsets back together. Earlier we used mutate to fix a single value within a dataframe, but here we can just replace the whole column with a repeated sequence using rep(x, times).

# subset all the times before exercise
before_exercise <- exercise_long %>% filter(time == "Before")
# update all before times to -5
before_exercise$time <- rep(-5, length(before_exercise$time))

Combining data

So far, we’ve been through many ways to pull data apart, getting individual rows and columns as well as groups of each, but what if we need to put data together?

Vertically

To bind sets of rows together they need to have the same columns. Our before and after exercise subsets are both sets of rows with the same column headings. They can be combined vertically by binding the rows together.

# row bind subsets into a new dataframe
exercise_long <- rbind(before_exercise, after_exercise)
# check structure
str(exercise_long)
## 'data.frame':    992 obs. of  3 variables:
##  $ person    : Factor w/ 32 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ time      : num  -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 ...
##  $ pulse.rate: int  80 88 70 79 68 59 56 84 56 56 ...

Horizontally

To bind columns together they need to contain vectors of the exact same length. Use cbind() just as rbind() was used above. Or, for finer control of column placement, use add_column() from the {tidyverse}.

# overwrite data frame with version that has an extra column
exercise_long <- add_column(
  exercise_long, # data to add a column to
  "exercise.type" = as.factor( # new column name = contents
    rep("cycling", length(exercise_long$person))
  ),
  .after = 1 # after column 1
)
# check the structure
str(exercise_long)
## 'data.frame':    992 obs. of  4 variables:
##  $ person       : Factor w/ 32 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ exercise.type: Factor w/ 1 level "cycling": 1 1 1 1 1 1 1 1 1 1 ...
##  $ time         : num  -5 -5 -5 -5 -5 -5 -5 -5 -5 -5 ...
##  $ pulse.rate   : int  80 88 70 79 68 59 56 84 56 56 ...