Saturday, May 28, 2011

Some Statistics with R examples

These are notes I made while working through the book "Statistics Explained". They're rather terse but include R examples so I thought they might possibly be useful. If the maths doesn't look right it's because your browser doesn't support MathML. Currently only Firefox does.

Statistics Summary

Author:Ben Reynwar
Copyright:Whatever wikipedia uses since there are probably bits cut and pasted.
Created:2011 April 1
Last Edited:2011 May 28

Type I error - finding an effect that isn't real.

Type II error - not finding a real effect.

Some Useful Distributions

Normal Distribution

Chi-square Distribution

The chi-square distribution with k degrees of freedom is the distribution of the sum of squares of k independent standard normal variables.

It is a special case of the gamma distribution.

Useful for estimating variance of a population. Suppose we have n observations from a normal population N(μ,σ2) and S is their standard deviation then (n-1)S2σ2 has a chi squared distribution with n-1 degrees of freedom. Since σ is the only unknown once can get confidence intervals for the population variance.

> # Number of observations in sample.
> n <- 10
> # Set alpha for 95% confidence interval (two-sided)
> alpha <- 0.05
> # Get a sample.  Pretend we don't know mean or standard deviation.
> sample = rnorm(n, 25, 8)
> top = var(sample)*(n-1)
> # Get bounds for confidence interval for the standard deviation.
> lower = sqrt(top/qchisq(1-alpha/2, n-1))
> upper = sqrt(top/qchisq(alpha/2, n-1))
> c(lower, upper)
[1] 3.531557 9.373243


The ratio of two independent random variables both of which have chi-squared distributions has an F-distribution.

> # We take two samples from a normal population and take the ratio of variances.
> n1 <- 3; n2 <- 4
> # A 95% confidence interval for our result is:
> alpha <- 0.05
> c(qf(alpha/2, n1, n2), qf(1-alpha/2, n1, n2))
[1] 0.06622087 9.97919853
> mu <- 10; sigma <- 5
> sample1 <- rnorm(n1, mu, sigma)
> sample2 <- rnorm(n2, mu, sigma)
> ratio <- var(sample1)/var(sample2)
> # Calculated ratio is:
> ratio
[1] 0.2974744

Students's t-distribution

  • An estimated distibution of sample means. Different from a normal distribution since it takes into account the uncertainty in the standard deviation.
  • Depends on number of degrees of freedom of the standard deviation used.
> # Calculate the value of x for which the cumulative distribution of the
> # t-distribution is 0.05, for degrees of freedom = 1000.
> qt(0.05, 1000)
[1] -1.646379
> # Now for degrees for freedom = 3.
> qt(0.05, 3)
[1] -2.353363
> # Get the density of a t-distribution.
> dt(0.05, 1000)
[1] 0.3983438
> # Get the cumultive distribution
> pt(1, 1000)
[1] 0.8412238

Student's t-test

  • Assesses the statistical significance of the difference between two sample means.
  • Can have paired or unpaired samples (related or independent).
  • Assumes the two samples have equal variances. Often used as long as one is not more than three times as big as the other.
  • Welch's t-test is an extension that does not assume equal variances.

Independent Example

> # Generate a random vector containing 10 values from a normal distibution
> # with mean 10 and standard deviation 2.
> x = rnorm(10, 10, 2)
> x
 [1]  7.288907 10.679290  7.021853 12.356291 11.973576 13.452286 10.765847
 [8] 12.815315 10.776181 11.478132
> # Generate another similar random vector.
> y = rnorm(10, 10, 2)
> y
 [1] 10.228506  9.307230  9.007056 11.699393 12.524014 15.061729  9.449236
 [8] 14.378682 10.251995  9.862443
> # Perform the t-test on them.
> # The p-value is the chance that the difference between the means would be
> # this large if the null hypothesis were true.
> t.test(x, y, var.equal=TRUE)

    Two Sample t-test

data:  x and y
t = -0.3273, df = 18, p-value = 0.7472
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.346419  1.713897
sample estimates:
mean of x mean of y
 10.86077  11.17703

Analysis of Variance

Variance Ratio(F) = (Between conditions variance)/(Error variance)

Assume samples come from normally distributed populations with equal variances.

F statistic depends on the dof for the two kinds of variances. These should always be given with the F value:

F(dfbt.conds,dferror) = calculated value

The p-value is then found from a table or calculation.

One factor independent measures ANOVA

  • also called completely randomised design ANOVA
  • We have K conditions, with ni samples in the ith condition, and N samples overall.
  • Y_i denotes the sample mean in the ith condition.
  • Y_ denotes the overall mean of the data.
  • Yij is the jth observation in the ith condition.

Between conditions variance = ini(Y_i-Y_)2/(K-1) you Error variance = ij(Yij-Y_i)2/(N-K)

For two conditions it is mathematically identical to the t-test.

> # Number of points in each data set.
> ni <- 4
> # Generate four sets of data with different means.
> a <- rnorm(ni, 10, 2)
> b <- rnorm(ni, 12, 2)
> c <- rnorm(ni, 14, 2)
> d <- rnorm(ni, 16, 2)
> # Merge them all together into a data frame.
> values <- c(a, b, c, d)
> letters <- c(rep('a', ni), rep('b', ni), rep('c', ni), rep('d', ni))
> df <- data.frame(letter=letters, value=values)
> # Perform an anova analysis.
> fit <- aov(value ~ letter, data=df)
> summary(fit)
            Df Sum Sq Mean Sq F value    Pr(>F)
letter       3 91.902 30.6342  24.829 1.964e-05 ***
Residuals   12 14.805  1.2338
Signif. codes:  0***0.001**0.01* ‘ ’ 1

Post-hoc Tests

If we find an effect with the F-test then we need to work out where it is coming from. We do this with post-hoc tests. The risk is the increased chance of a type I error.

Least Significant Difference Test
  • takes not account of the number of comparisons being made.
  • increased risk of Type I error is simply accepted.
Neuman-Keuls Test
  • ??
Duncan Test
  • ??
Tukey Test
  • We have K conditions, with n samples in each. N=nK.
  • Studentized Range is the range of samples divided by an estimate of their standard deviation.
  • We find the value of the Studentized Range for which their is some defined chance that the condition results will be under.
  • If any two conditions deviate by more than this amount we can say they are significant.
  • Depends on number of conditions and dof in standard deviation calculation.
> th <- TukeyHSD(fit)
> th
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = value ~ letter, data = df)

         diff       lwr      upr     p adj
b-a 0.7224116 -1.609443 3.054266 0.7949923
c-a 4.8536226  2.521768 7.185477 0.0002394
d-a 5.3724864  3.040632 7.704341 0.0000919
c-b 4.1312110  1.799356 6.463066 0.0009963
d-b 4.6500748  2.318220 6.981929 0.0003539
d-c 0.5188638 -1.812991 2.850718 0.9097766

> # Now try to do the same thing but more manually for (b-a) comparison.
> # Variance of residual errors.
> vare <- mean(c(var(a), var(b), var(c), var(d)))
> # Normalize the difference between the means the estimate of the standard deviation
> # of the means.
> st_range <- (mean(d)-mean(a))/(sqrt(vare/ni))
> # ptukey: ptukey(q, nmeans, df, nranges = 1, lower.tail = TRUE, log.p = FALSE)
> #  q - a given studentized range
> #  nmeans - the number of samples (i.e. number of means in this case)
> #  df - degrees of freedom in the calculation of the stdev.
> # It returns the probability that the studentized range of the sample of means is
> # less than the given value of q.
> # For our example nmeans is clearly 4 (a,b,c and d)
> # And df is 4*(ni-1) because we calculated the variance from 4 sets of samples each of
> # which had (ni-1) degrees of freedom.
> manual <- 1 - ptukey(st_range, 4, 4*(ni-1))
> manual
[1] 9.19243e-05
> th$letter["d-a", "p adj"] - manual < 10e-6
[1] TRUE
Scheffé Test
  • Very similar to Turkey method except we do not limit to pairwise comparisons.

  • Let μ1, ..., μr be the means of some variable in r disjoint populations.

  • Begin cut and paste from wikipedia: An arbitrary contrast is defined by




    If μ1, ..., μr are all equal to each other, then all contrasts among them are 0. Otherwise, some contrasts differ from 0.

    Technically there are infinitely many contrasts. The simultaneous confidence coefficient is exactly 1 − α, whether the factor level sample sizes are equal or unequal. (Usually only a finite number of comparisons are of interest. In this case, Scheffé's method is typically quite conservative, and the experimental error rate will generally be much smaller than α.)

    We estimate C by


    for which the estimated variance is


    It can be shown that the probability is 1 − α that all confidence limits of the type


> # Perform a Scheffe test to see if the average of samples a and b is significantly
> # different from the average of c and d.
> cs <- c(-0.5, -0.5, 0.5, 0.5)
> means <- tapply(df$value, df$letter, mean)
> c.hat <- sum(cs*means)
> s2 <- vare/ni * sum(cs^2)
> fval = c.hat^2/s2/(4*(ni-1))
> fval
[1] 6.100458
> # The chance of any linear combination deviating by this much if all samples
> # were from the same normal population.
> 1 - pf(fval, 3, 4*(ni-1))
[1] 0.009188318

Analyzing Frequency Data

A population can be divided in c categories. The chance of an observation being of category i is pi. We make N observations and find ni in category i.

i(ni-Npi)2Npi can be approximated by a chi-squared distribution with c-1 degrees of freedom as long as Npi is greater than 5 for all categories.

Possible Uses:

  • A goodness of fit test. To see if a sample seems to match a normal distribution bin the observations and compare the observed frequencies to those expected.
  • A test of independence. If we have two types of categories and each observation is a member of one of each types of categories then we can check if the categories are independent. We just see how the frequencies deviate from what would be expected if they were independent.

Linear Correlation

We have a sample of (x, y) pairs. Pearson correllation coefficient is:


r=1 is perfect positive correlation, r=0 is uncorrelated, r=-1 is perfect negative correlation. It is the slope of the line of best-fit through the reduced variables.

The distribution of r is approximately a Studentized t-distribution. To be exact the variable t (t=rn-21-r2) has this distribution.

> # Create some correlated data.
> n <- 10
> a <- rnorm(n, 12, 1)
> b <- 3*a + 6 + rnorm(n, 0, 3)
> r <- cor(a, b)
> r
[1] 0.5985606
> t <- r * sqrt((n-2)/(1-r^2))
> t
[1] 2.113384
> # The probability of a correlation being this far from 0 by chance is:
> 2 * (1 - pt(t, n-2))
[1] 0.06751678

Conversion of reStructuredText to html.

I've recently decided to use reStructuredText for making notes and needed a method to convert them into html. The reStructuredText contains code snippets as well as mathematical notation so the conversion process needed to be able to handle that.

docutils is the obvious candidate to do the conversion, however it doesn't do syntax highlighting or MathML out of the box, so I needed to find extensions that could.

I decided to use Pygments for the syntax highlighting of the code snippets. The Pygments package comes with a file that creates a directive called 'sourcecode' that can then be used to define code snippets.

For the maths I found rst2mathml which adds a directive converting tex math notation in a reStructuredText file to MathML in the html.

So the list of steps to get this working was:

  • Install some stuff needed for rst.

    sudo apt-get install python-docutils

  • Download

  • Found in the Pygments constellation and made a copy renamed to in the same directory as

  • Add the following line to the top of so that the sourcecode directive can be used.

    import rst_directive

  • Create a stylesheet for syntax highlighting.

    pygmentize -S default -f html -a .highlight > style.css

  • Create script to do conversion.

#!/usr/bin/env bash

echo input file name is $1
python /home/ben/Documents/Notes/ --stylesheet-path=/home/ben/Documents/Notes/style.css $1 > $stem.xhtml