3 Description

3.1 Introduction

  • Author starts the chapter by introducing linear-model analysis in which the goal is to compare the effects of different treatments
    • Each treatment is applied to a group of experimental units (biological replicates)
    • If the treatments produce a change in the replicates, then there will be a quantifiable difference between groups
  • This chapter focuses on treatment effects and hypohthesis testing is saved for later

3.2 Darwin’s maize pollination data

  • Darwin’s book The Effects of Cross and Self-Fertilization in the Vegetable Kingdom (1876) describes he produced maize seeds by pollinating the flowers of the parent from the same individual or with pollen from another plant
  • Pairs of seeds taken from selfed or out-crossed plants were then grown in pots
  • The height of the young seedlings was recorded as a measure for fitness
  • Darwin wanted to know whether inbreeding reduced fitness of the selfed plants
  • Darwin’s data is in the package ‘SMPracticals’
  • The dataframe that contains the data is ‘darwin’
  • R terminology
    • single values = scalars
    • columns of numerical data = vectors
head(darwin)
#>   pot pair  type height
#> 1   I    1 Cross 23.500
#> 2   I    1  Self 17.375
#> 3   I    2 Cross 12.000
#> 4   I    2  Self 20.375
#> 5   I    3 Cross 21.000
#> 6   I    3  Self 20.000

3.2.1 Known your data

Understand structure of the dataframe:

str(darwin)
#> 'data.frame':    30 obs. of  4 variables:
#>  $ pot   : Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 2 2 2 2 ...
#>  $ pair  : Factor w/ 15 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
#>  $ type  : Factor w/ 2 levels "Cross","Self": 1 2 1 2 1 2 1 2 1 2 ...
#>  $ height: num  23.5 17.4 12 20.4 21 ...
  • Structure:
    • The first three variables are Factors
      • These are categorical varaibles that divide data into discrete groups called levels
    • The last variable is a numerical one. It is a continuous measure of heigth

Generate summary statistics for the dataframe:

summary(darwin)
#>   pot          pair       type        height     
#>  I  : 6   1      : 2   Cross:15   Min.   :12.00  
#>  II : 6   2      : 2   Self :15   1st Qu.:17.53  
#>  III:10   3      : 2              Median :18.88  
#>  IV : 8   4      : 2              Mean   :18.88  
#>           5      : 2              3rd Qu.:21.38  
#>           6      : 2              Max.   :23.50  
#>           (Other):18

3.2.2 Summarizing and describing data

  • Plant height is the response variable
  • The treatments involved are two types of hand-pollination (selfing and crossing). This is the explanatory variable
  • The goal is to see if we can explain the differences in plant height as a function of pollination treatment
  • There are 15 replicates in each treatment group, planted in pairs
  • This dataframe is in the long or tidy format, which is typically ideal
  • Darwin data in the wide format can be found in the package, Sleuth3, and specifcally in the dataframe, exo0428
head(ex0428)
#>   Cross  Self
#> 1 23.50 17.38
#> 2 12.00 20.38
#> 3 21.00 20.00
#> 4 22.00 20.00
#> 5 19.13 18.38
#> 6 21.50 18.63

Visualize the data in a boxplot:

fig3_1 <- ggplot(darwin, aes(x=type, y=height)) + geom_boxplot() + 
  ylab("height (in)") + theme_bw() + scale_y_continuous(limits = c(0,25),  breaks=c(0,5,10,15,20,25))
fig3_1
  • Figure takeaways:
    • Bold central horizontal line is the median
    • Top horizontal part of box is the third quartile
    • Bottom horizontal part of the box is the first quartile
      • These contain the middle of the data
    • The whiskers contain 95% of the data
    • The datapoints are outliers

Violin plot can show the higher and lower ‘centers of gravity’ of the distributions of the data:

fig3_2 <- ggplot(darwin, aes(x=type, y=height)) +
  geom_violin() + 
  ylab("height (in)") + theme_bw() + scale_y_continuous(limits = c(0,25),  breaks=c(0,5,10,15,20,25)) + geom_jitter()
fig3_2
  • The author sums up the goal of the analysis by quoting a statistician, Nate Silver, who says that the aim is to find if there is any systematic pattern (signal) in the data that stands out above the background variability (noise)
    • Signal is often quantified through finding the average (also median and mode)

Determine the mean for all data (ignoring treatment groups):

with(data = darwin, mean(height)) #average height for all 30 plants 
#> [1] 18.88333
  • Now the problem is how to quantify variability.
  • Variance is quantified through a process called ‘least squares’, more on this later
  • Variance is also known as mean squares and is referred to as
    • \(S^{2}\) when a sample of data is being discussed (as here) and
    • \(\sigma\)\(^{2}\) (sigma) when the population from which the sample is drawn is being discussed

Calculate the variance in the data as a whole:

var(darwin$height) 
#> [1] 10.11846
  • The catch about this estimate of variability is not on the same scale as our data
  • This is on a squared scale, where as the data at hand is unsquared
  • This means that the measure of signal (plant height) is in inches and the varibility of the plant height is in squared inches, which doesn’t make sense
  • So you have to take the square root of variance to get back to the same scale as the original data
    • Effectively, this means that you have to find the standard devation (SD):
      \(SD = \sqrt{s^2}\)
  • Again, \(\sigma\) is used when referring to the whole population standard deviation
  • The latin \(s\) is used for the sample SD
  • SD is the average difference between an individual measurement (i.e. a single plant height measurement) and the mean value

Calculate the SD in R:

a <- sd(darwin$height)
a
#> [1] 3.180953

Double check derivation of SD:

b = sd(darwin$height)
isTRUE(a==b)
#> [1] TRUE

3.2.3 Comparing groups

Calculate means of groups (this approach does not scale well):

mean(ex0428$Cross)
#> [1] 20.19333
mean(ex0428$Self)
#> [1] 17.57667

Calculate summary statistics more efficiently:

with(data = darwin, tapply(height, type, mean))
#>    Cross     Self 
#> 20.19167 17.57500
with(data = darwin, tapply(height, type, sd))
#>    Cross     Self 
#> 3.616945 2.051676

Assign calculated values to objects:

means <- tapply(darwin$height, darwin$type,mean)
sds <- tapply(darwin$height, darwin$type,sd)

Create a new column to contain text to soon merge into the dataframe:

pollination <- c("Crossed","Selfed")
dar_sum_stats <- data.frame(pollination, means, sds) 
dar_sum_stats
#>       pollination    means      sds
#> Cross     Crossed 20.19167 3.616945
#> Self       Selfed 17.57500 2.051676

Plot these summary statistics:

fig3_3 <- ggplot(dar_sum_stats, aes(x=pollination, y=means)) +
  geom_pointrange(aes(ymin = means - sds, max = means + sds)) + 
  ylab("height (in)") + theme_bw() 
fig3_3