4.2 Summary statistics

Below are some other useful functions for working with a vector of numbers:

Function Description
min(x) returns the minimum value in a vector
max(x) returns the maximum value in a vector
sum(x) returns the sum of a numeric vector
mean(x) returns the mean of a vector
sd(x) returns the standard deviation of a vector
summary(x) returns a basic statistical summary of a vector
which.min(x) returns the position of the (first) minimum value in a vector
which.max(x) returns the position of the (first) maximum value in a vector
range(x) returns the minimum and maximum values of a vector
duplicated(x) returns a logical vector indicating which elements have already been seen earlier in the vector. In other words, the first time a value is seen it returns FALSE, and if it occurs again it returns TRUE
unique(x) returns the distinct values in a vector; any duplicated value is output only once
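
To see how a few of these functions behave, here is a minimal sketch on a small made-up vector. The name `x` and its values are illustrative only and are not part of the temperature data.

```r
# A small made-up vector with repeated values (illustrative only)
x <- c(3, 7, 3, 9, 7, 3)

duplicated(x)   # FALSE FALSE TRUE FALSE TRUE TRUE: TRUE marks a repeat
unique(x)       # 3 7 9: each distinct value, in order of first appearance
which.min(x)    # 1: position of the first minimum
which.max(x)    # 4: position of the first maximum
range(x)        # 3 9: minimum and maximum as a length-2 vector
```

Note that which.min() and which.max() report only the first position when the minimum or maximum occurs more than once.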

We will use some of these functions to explore our temperature data. Let’s look at the temperatures from the year 1996:

Examples:

temp <- minTemp$X1996

# Find various statistical summaries of the 1996 temperatures.
min(temp)
max(temp)
mean(temp)
sd(temp)
range(temp)
## [1] 1.9
## [1] 25.9
## [1] 15.40874
## [1] 5.157272
## [1]  1.9 25.9

There is a very handy function, summary(), that returns descriptive summary statistics for a vector of numbers:

summary(temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.90   11.70   16.25   15.41   19.10   25.90

# Find the position of the minimum and maximum values
which.min(temp)
which.max(temp)
## [1] 200
## [1] 32
head(duplicated(temp))   # only show the first 6 elements
length(unique(temp))     # number of unique values in `temp`
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] 171

4.2.1 Two-dimensional datasets

Functions such as min(), max() and sum() can also be used on matrices and all-numeric data frames, but they pool all the values together, so the answer is always a single number. There are times when you have a table of numbers and want to operate on each row or each column separately. This can be done using the following functions:

Function Description
colSums(x) sum of column values
rowSums(x) sum of row values
colMeans(x) mean of column values
rowMeans(x) mean of row values
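
The difference between pooling all values and working by row or by column can be sketched on a tiny made-up matrix. The name `m` and its contents are illustrative only.

```r
# A made-up 2 x 3 matrix (illustrative only); matrix() fills column by column
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

sum(m)        # 21: pools every value into a single number
colSums(m)    # 3 7 11: one total per column
rowSums(m)    # 9 12: one total per row
colMeans(m)   # 1.5 3.5 5.5: one mean per column
rowMeans(m)   # 3 4: one mean per row
```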

This time, instead of working on only one year, let’s work on all 69 years’ worth of data:

  • What is the average temperature for each year?
  • What is the average temperature for each day?
head(colMeans(minTemp))
head(rowMeans(minTemp))
##    X1949    X1950    X1951    X1952    X1953    X1954 
##       NA       NA       NA 15.34098       NA       NA 
## Jan-1 Jan-2 Jan-3 Jan-4 Jan-5 Jan-6 
##    NA    NA    NA    NA    NA    NA

Notice how a lot of NAs are returned. This is because the data contain some missing values, and by default R returns NA for the mean of anything that includes a missing value. You can specify the argument na.rm=T (short for na.rm=TRUE) to remove the missing values (NA) before calculating the mean:

head(colMeans(minTemp, na.rm=T))
##    X1949    X1950    X1951    X1952    X1953    X1954 
## 12.12783 15.35397 13.35623 15.34098 14.72247 15.82493
head(rowMeans(minTemp, na.rm=T))
##    Jan-1    Jan-2    Jan-3    Jan-4    Jan-5    Jan-6 
## 20.53676 20.65441 20.89118 20.79265 20.82941 20.72941
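
The same behaviour can be seen on a single vector. The sketch below uses a made-up vector `v` (illustrative only) to show how one missing value affects mean(), and how to count missing values with is.na().

```r
# A made-up vector with one missing value (illustrative only)
v <- c(12.5, NA, 14.0, 13.5)

mean(v)                 # NA: the missing value propagates to the result
mean(v, na.rm = TRUE)   # mean of the three observed values only
sum(is.na(v))           # 1: how many values are missing
```

The functions min(), max(), sum(), sd() and range() accept the same na.rm argument.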

Unfortunately, there is no built-in colSd() or rowSd() function for finding the standard deviation of the columns or rows of a matrix. We will see how to do this later in the workshop.
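
As a brief preview, one common base R approach is apply(), which calls a function on every row (margin 1) or every column (margin 2) of a matrix. This is only a sketch on a made-up matrix `m`, and it may differ from the approach covered later in the workshop.

```r
# A made-up 2 x 3 matrix (illustrative only)
m <- matrix(c(1, 2, 3, 4, 5, 9), nrow = 2)

col_sd <- apply(m, 2, sd)   # standard deviation of each column
row_sd <- apply(m, 1, sd)   # standard deviation of each row

col_sd
row_sd
```

Extra arguments are passed on to the function, so apply(m, 2, sd, na.rm = TRUE) would ignore missing values in each column.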

Summary statistics

  1. Try the code examples above.

  2. Try using summary() on the variable minTemp. What do you get?

Challenge: if you are ahead, try the following:

  1. What is the lowest ever temperature recorded by this station? Hint: Remember to remove NA values.

  2. Using the which() function, can you find the row and column position of the lowest recorded temperature? (Hint: Look at the help documentation for the which() function.)

    1. Using the row and column, what is the year, month and day of this lowest temperature?