4.2 Summary statistics
Below are some other useful functions for working with a vector of numbers:
Function | Description |
---|---|
min(x) | returns the minimum value in a vector |
max(x) | returns the maximum value in a vector |
sum(x) | returns the total sum of a numeric vector |
mean(x) | returns the mean value of a vector |
sd(x) | returns the standard deviation of a vector |
summary(x) | returns a basic statistical summary of a vector |
which.min(x) | returns the position of the minimum value in a vector |
which.max(x) | returns the position of the maximum value in a vector |
range(x) | returns the minimum and maximum values of a vector |
duplicated(x) | returns a vector of logicals indicating which elements have already been seen earlier in the vector: the first time a value occurs it returns FALSE, and each time it occurs again it returns TRUE |
unique(x) | returns the non-redundant values in a vector; any duplicated value is output only once |
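The difference between duplicated() and unique() is easiest to see on a small vector. The sketch below uses a made-up vector `x` (an assumption for illustration, not part of the workshop data):

```r
# Hypothetical toy vector (not part of the workshop data)
x <- c(3, 7, 3, 9, 7, 3)

dup  <- duplicated(x)  # TRUE marks a value that already appeared earlier in x
uniq <- unique(x)      # each distinct value, in order of first appearance

print(dup)     # FALSE FALSE TRUE FALSE TRUE TRUE
print(uniq)    # 3 7 9
print(x[!dup]) # dropping the duplicated elements gives the same values as unique(x)
```

Because duplicated() returns logicals, `sum(duplicated(x))` counts how many elements are repeats.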
We will use some of these functions to explore our temperature data. Let’s look at the temperatures from the year 1996:
Examples:
temp <- minTemp$X1996
# Find various statistical summaries of the 1996 data.
min(temp)
max(temp)
mean(temp)
sd(temp)
range(temp)
## [1] 1.9
## [1] 25.9
## [1] 15.40874
## [1] 5.157272
## [1] 1.9 25.9
There is a very handy function, summary(), that returns descriptive summary statistics for a vector of numbers:
summary(temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.90 11.70 16.25 15.41 19.10 25.90
# Find the position of the minimum and maximum values
which.min(temp)
which.max(temp)
## [1] 200
## [1] 32
head(duplicated(temp)) # only show the first 6 elements
length(unique(temp)) # number of unique values in `temp`
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## [1] 171
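The positions returned by which.min() and which.max() can be used to index back into the vector. The sketch below assumes a small named vector `toy` (hypothetical values and names, chosen only to resemble the day labels in the data); note that which.min() reports only the first position when the minimum occurs more than once:

```r
# Hypothetical named vector standing in for a few daily readings
toy <- c("Jan-1" = 12.5, "Jan-2" = 9.8, "Jan-3" = 14.1, "Jan-4" = 9.8)

pos <- which.min(toy)            # position of the first minimum
print(pos)                       # element "Jan-2", position 2
print(toy[pos])                  # the value at that position, equal to min(toy)

all_pos <- which(toy == min(toy)) # every position tied for the minimum
print(all_pos)                   # positions 2 and 4
```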
4.2.1 Two-dimensional datasets
While functions such as min(), max() and sum() can be used on matrices and data frames, they treat the whole table as one set of numbers and return a single value. There are times when you have a table of numbers and want to operate on each row or each column separately. This can be done using the following functions:
Function | Description |
---|---|
colSums(x) | sum of column values |
rowSums(x) | sum of row values |
colMeans(x) | mean of column values |
rowMeans(x) | mean of row values |
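To make the contrast concrete, here is a minimal sketch on a made-up 3 x 2 matrix `m` (hypothetical values; the row and column names only imitate the layout of the temperature table):

```r
# Hypothetical 3 x 2 matrix: rows are days, columns are years.
# matrix() fills column by column, so X2000 = 10, 12, 14 and X2001 = 20, 22, 24.
m <- matrix(c(10, 12, 14,
              20, 22, 24),
            nrow = 3,
            dimnames = list(c("Jan-1", "Jan-2", "Jan-3"), c("X2000", "X2001")))

total     <- sum(m)       # one number for the whole table: 102
col_means <- colMeans(m)  # one mean per column (year): 12, 22
row_means <- rowMeans(m)  # one mean per row (day): 15, 17, 19

print(total)
print(col_means)
print(row_means)
```

The results of colMeans() and rowMeans() keep the column and row names, which makes it easy to see which year or day each value belongs to.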
This time, instead of working on only one year, let’s work on all 69 years’ worth of data:
- What is the average temperature for each year?
- What is the average temperature for each day?
head(colMeans(minTemp))
head(rowMeans(minTemp))
## X1949 X1950 X1951 X1952 X1953 X1954
## NA NA NA 15.34098 NA NA
## Jan-1 Jan-2 Jan-3 Jan-4 Jan-5 Jan-6
## NA NA NA NA NA NA
Notice how a lot of NAs are being returned. This is because the data contain some missing values, and R cannot calculate the mean of a row or column that includes a missing value. You can specify the parameter na.rm=TRUE (often abbreviated to na.rm=T) to remove the missing values (NA) before calculating the mean:
head(colMeans(minTemp, na.rm=TRUE))
## X1949 X1950 X1951 X1952 X1953 X1954
## 12.12783 15.35397 13.35623 15.34098 14.72247 15.82493
head(rowMeans(minTemp, na.rm=TRUE))
## Jan-1 Jan-2 Jan-3 Jan-4 Jan-5 Jan-6
## 20.53676 20.65441 20.89118 20.79265 20.82941 20.72941
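The effect of na.rm can be checked on a small example. The sketch below uses a made-up matrix `m` with one missing value (hypothetical data, not the workshop table) and also counts the missing values per column with is.na():

```r
# Hypothetical matrix with a single missing value in the first column
m <- matrix(c(10, NA, 14,
              20, 22, 24),
            nrow = 3,
            dimnames = list(NULL, c("X2000", "X2001")))

with_na    <- colMeans(m)                # X2000 is NA, X2001 is 22
without_na <- colMeans(m, na.rm = TRUE)  # X2000 is the mean of 10 and 14, i.e. 12
n_missing  <- colSums(is.na(m))          # number of NAs per column: 1 and 0

print(with_na)
print(without_na)
print(n_missing)
```

Counting the missing values first is a useful habit, since a mean computed from only a handful of remaining values may not be representative.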
Unfortunately, there is no colSd() or rowSd() equivalent for finding the standard deviation of the columns or rows of a matrix. We will see how to do this later in the workshop.
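As a preview, one common way to get per-column or per-row standard deviations in base R is apply(), which calls a function on each row (MARGIN = 1) or each column (MARGIN = 2). This is only a sketch on a made-up matrix `m`, and it is not necessarily the approach taken later in the workshop:

```r
# Hypothetical 3 x 2 matrix; columns are A = 1, 2, 3 and B = 2, 4, 6
m <- matrix(c(1, 2, 3,
              2, 4, 6),
            nrow = 3,
            dimnames = list(NULL, c("A", "B")))

col_sd <- apply(m, 2, sd)  # standard deviation of each column: 1 and 2
row_sd <- apply(m, 1, sd)  # standard deviation of each row

print(col_sd)
print(row_sd)
```

Extra arguments are passed on to the function, so `apply(m, 2, sd, na.rm = TRUE)` would ignore missing values in the same way as colMeans(m, na.rm = TRUE).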
Summary statistics
- Try the example code above.
- Try using summary() on the variable minTemp. What do you get?
Challenge: if you are ahead, try the following:
- What is the lowest ever temperature recorded by this station? Hint: Remember to remove NA values.
- Using the which() function, can you find the row and column position of the lowest recorded temperature? (Hint: Look at the help documentation for the which() function.)
- Using the row and column, what is the year, month and date of this lowest temperature?