4.1 Simple data cleaning
In the previous chapter we read in our temperature data from the Brisbane Airport station. Before doing any statistical analysis in this chapter, we first need to check the data we loaded. Data cleaning is an essential step before any data analysis, but a full treatment is beyond the scope of this workshop. (Have a look at our Data preparation, processing and reporting with R workshop.)
One quick way to check our temperature data is to look at the range of the dataset:
range(minTemp, na.rm = TRUE)
## [1] -2.2 99999.9
This tells us we have temperatures ranging from -2.2 degrees to 99999.9 degrees. Alarm bells should be going off in your head about the maximum value. Could this be an error in the reading, or something else? Checking the source the data was obtained from may provide more information.
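One way to follow up on a suspicious maximum is to check how often it occurs. A single stray value points towards a one-off error, whereas the same extreme value repeated many times suggests a placeholder code. The sketch below uses a small made-up vector, `temps`, rather than the workshop data, so the numbers are illustrative only.

```r
# Hypothetical toy vector standing in for the temperature data;
# the values are made up for illustration only.
temps <- c(12.4, 15.1, 99999.9, 8.7, 99999.9, -2.2, 20.3)

# The three largest distinct values, biggest first
top_values <- head(sort(unique(temps), decreasing = TRUE), 3)
top_values

# How many readings are equal to the maximum?
n_max <- sum(temps == max(temps, na.rm = TRUE), na.rm = TRUE)
n_max
```

In this toy vector the maximum appears twice and sits far above the next largest value, which is the pattern to look for in the real data.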
The Bureau of Meteorology site tells us that missing data is encoded as 99999.9. Given this, let's replace that value with NA, R's marker for missing data, because 99999.9 will interfere with simple summary statistics such as the average temperature.
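To see how strongly a placeholder value can distort a summary statistic, here is a minimal sketch on a made-up vector. The name `temps` and its values are assumptions for illustration and are not part of the workshop data.

```r
# Made-up readings; 99999.9 plays the role of the missing-data code
temps <- c(10, 12, 99999.9, 14)

# Mean with the placeholder left in: roughly 25009, which is meaningless
mean_raw <- mean(temps)
mean_raw

# Mean of the genuine readings only
mean_valid <- mean(temps[temps != 99999.9])
mean_valid
```

A single placeholder among four values is enough to push the mean into the tens of thousands. The same problem applies to the real data, which we fix next: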
# create a copy of the original dataset just in case
minTemp.ORI <- minTemp
minTemp[minTemp == 99999.9] <- NA
range(minTemp, na.rm = TRUE)
## [1] -2.2 28.1
Always recheck the range after modifying the data. The new range is much more realistic. Now we can apply some basic summary functions to the dataset.
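Keeping a copy of the original data also makes it possible to confirm that the replacement changed only the values it was meant to. The following is a minimal sketch on a made-up vector. The names `original` and `cleaned` are hypothetical stand-ins for `minTemp.ORI` and `minTemp`.

```r
# Hypothetical data with two placeholder values and one genuine NA
original <- c(12.4, 99999.9, 8.7, NA, 99999.9, 20.3)

# Same replacement step as used in this section
cleaned <- original
cleaned[cleaned == 99999.9] <- NA

# Number of placeholder values in the original copy
n_placeholder <- sum(original == 99999.9, na.rm = TRUE)

# Number of NAs introduced by the replacement
n_new_na <- sum(is.na(cleaned)) - sum(is.na(original))

# Both counts should match, and no placeholder should remain
stopifnot(n_new_na == n_placeholder)
stopifnot(!any(cleaned == 99999.9, na.rm = TRUE))
```

If either `stopifnot()` call fails, the replacement touched more or fewer values than expected and the cleaning step should be reviewed.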