Saturday, July 21, 2012

The magic of the year 1901

The year 1901 is rather magical. Well, it is for R, provided you run it under Linux. Let me show you why. I have four data points: one from 1900, two from 1901, and one from 1902.

dates  <- c("11/11/1900", "01/01/1901", "30/05/1901", "01/01/1902")
values <- c(      1,         2,           0.7,              0.1 )

I convert them in two different ways: as a Date and as a POSIXct. For both conversions the same format string is used.

date1 <- as.Date(    dates, format="%d/%m/%Y" )
date2 <- as.POSIXct( dates, format="%d/%m/%Y" )

I plot them like this:

plot( date1, values )
plot( date2, values )

Now you try and spot the difference.

Both graphs have the same shape, but different breaks. In the first graph the maximum appears to be in 1901, in the second graph it appears to be in 1900. This is caused by a bug in the conversion from a string to R's POSIXct class.

> as.POSIXct( "1901-01-01", format="%Y-%m-%d" ) 
[1] "1900-12-31 23:59:28 AMT"

Two things are wrong here.

  1. We somehow shifted 32 seconds into the past (thereby moving from 1901 to 1900, which causes the difference between the two graphs).
  2. We also moved from the CET time zone, where I live, to the Amazonian time zone (AMT).

The conversion works fine for dates in the more recent past.

> as.POSIXct("2012-01-01", format="%Y-%m-%d" )
[1] "2012-01-01 CET"

It even works properly for dates before the Unix epoch 1970-01-01.

> as.POSIXct("1957-01-01", format="%Y-%m-%d" )
[1] "1957-01-01 CET"

But around 1940 strange things happen to the time zone, and in December 1901 the 32-second time shift happens.

> as.POSIXct("1940-01-01", format="%Y-%m-%d" )
[1] "1940-01-01 NET"

How to fix this

Be explicit; don't leave R guessing what time zone to use. Set the environment variable TZ to a time zone of your liking before you start R.

$ export TZ=CET
$ R
> as.POSIXct( "1901-01-01", format="%Y-%m-%d" ) 
[1] "1901-01-01 CET"
> as.POSIXct( "1940-01-01", format="%Y-%m-%d" ) 
[1] "1940-01-01 CET"
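
If you cannot control the environment R is started from, another way to be explicit is to pass the time zone to as.POSIXct through its tz argument. The sketch below assumes "UTC" purely for illustration; it is not the zone used in the examples above.

```r
# Pass the time zone explicitly instead of relying on the system default.
# "UTC" is an illustrative assumption, not the zone used elsewhere in this post.
d <- as.POSIXct("1901-01-01", format = "%Y-%m-%d", tz = "UTC")

# Print the value together with its zone abbreviation.
print(format(d, "%Y-%m-%d %H:%M:%S", usetz = TRUE))

# The year component stays 1901 because no historical local offset is applied.
year_label <- format(d, "%Y")
print(year_label)
```

This fixes the zone per call, whereas TZ changes the default for the whole session.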

Thursday, July 19, 2012

Time zones

Say we have the following raw data. It consists of timestamps and corresponding values. There is a peak at exactly midnight (00:00:00). Each timestamp is fully specified: it contains a date, a time of day, and a time zone offset. In this case the offset is +0000, meaning the timestamps are 0 hours away from UTC.

"timestamp","value"
"25-04-2012 22:00:00 +0000",0
"25-04-2012 22:15:00 +0000",0
"25-04-2012 22:30:00 +0000",1
"25-04-2012 22:45:00 +0000",2
"25-04-2012 23:00:00 +0000",5
"25-04-2012 23:15:00 +0000",11
"25-04-2012 23:30:00 +0000",17
"25-04-2012 23:45:00 +0000",19
"26-04-2012 00:00:00 +0000",20
"26-04-2012 00:15:00 +0000",19
"26-04-2012 00:30:00 +0000",17
"26-04-2012 00:45:00 +0000",11
"26-04-2012 01:00:00 +0000",5
"26-04-2012 01:15:00 +0000",2
"26-04-2012 01:30:00 +0000",1
"26-04-2012 01:45:00 +0000",0
"26-04-2012 02:00:00 +0000",0

This data is stored in a file called peak2.dat and we read it as follows:

dataset <- read.csv( file="peak2.dat",as.is=TRUE)

Then we convert the timestamps to date-time objects with the aid of strptime (which returns objects of class POSIXlt). Here we use the %z field to also read the time zone offset:

# Convert timestamps
dataset$timestamp2 <- strptime( dataset$timestamp,
                                format="%d-%m-%Y %H:%M:%S %z",
                                tz="UTC" )
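
To check the conversion without peak2.dat at hand, a few rows can be read from an inline string. This is only a sketch: the names csv_text, sample_data and peak_time are made up for illustration, and the result is stored as POSIXct (an assumption, for easier comparison) rather than the POSIXlt that strptime returns.

```r
# Three rows from the data above, read from an inline string (hypothetical
# stand-in for peak2.dat).
csv_text <- '"timestamp","value"
"25-04-2012 23:45:00 +0000",19
"26-04-2012 00:00:00 +0000",20
"26-04-2012 00:15:00 +0000",19'

sample_data <- read.csv(text = csv_text, as.is = TRUE)

# Parse the timestamps including the %z offset, then convert to POSIXct.
sample_data$time <- as.POSIXct(strptime(sample_data$timestamp,
                                        format = "%d-%m-%Y %H:%M:%S %z",
                                        tz = "UTC"))

# The row with the largest value should sit at midnight UTC.
peak_time <- sample_data$time[which.max(sample_data$value)]
print(format(peak_time, "%Y-%m-%d %H:%M", tz = "UTC"))
```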

Then we use ggplot to make a nice plot of the data.

library(ggplot2)   # provides ggplot(), aes(), geom_point(), scale_x_datetime()
p1 <- ggplot( dataset, aes( timestamp2, value ) ) +
      geom_point() + scale_x_datetime()
print( p1 )

Something odd happened. The peak that was at 00:00 is now at 02:00 hours.

The reason for this is that the timestamps in the graph are displayed in the time zone of the machine R runs on. In my case this was CET, which is two hours ahead of UTC (during summer time).
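
The shift is purely a matter of display, which can be seen by formatting one instant in two zones. The sketch below assumes "Europe/Berlin" as a stand-in for a machine set to Central European time; the variable names are illustrative.

```r
# One instant: the peak at midnight UTC.
peak <- as.POSIXct("2012-04-26 00:00:00", tz = "UTC")

# format() only changes how the instant is printed, not the instant itself.
# "Europe/Berlin" is an assumed example of a zone on Central European time.
utc_label    <- format(peak, "%H:%M", tz = "UTC")
berlin_label <- format(peak, "%H:%M", tz = "Europe/Berlin")

print(c(UTC = utc_label, Berlin = berlin_label))
```

In late April daylight saving time is in effect in that zone, so the second label is two hours later than the first.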

Notice that if you make the same plot with plot() instead of ggplot(), the result is different.

plot( dataset$timestamp2, dataset$value, main="plot()", 
      xlab='timestamp2',
      ylab='value' )

The peak now shows at 00:00 hours instead of 02:00.

So which one is correct? It depends; sometimes you want 02:00, sometimes you want 00:00. Let me give you an example. Say you live in Germany. You have a colleague living in Iceland. She did an interesting experiment and needs your help analyzing the data. She sends you the data with timestamps in the UTC time zone, which is the time zone of Iceland (GMT). Also, unlike Germany, Iceland does not have daylight saving time.

You analyse the data and make a nice plot with ggplot. You call her and say, "Well, I see a strange spike at 2 o'clock in your data". Then you had better tell her it was 2 o'clock your time, or she might go on a wild goose chase trying to figure out what happened during her experiment at 2 o'clock her time, which is 3 o'clock or 4 o'clock your time, depending on whether you have winter or summer time. In such a case it is much easier to view a graph of the data in the time zone the data comes from, so you avoid constantly converting back and forth between the time zones.

On the other hand, if you have some measuring device that records all timestamps in UTC, and it is located in the same time zone as you, you probably want all timestamps shown in your time zone.

So sometimes you want the data to be shown in your time zone, sometimes in its original time zone. How can this be accomplished? There are a number of variables that influence how data is shown.

  • The time zone offset of the data (%z field),
  • The parameter tz of the strptime function,
  • The timezone set in your operating system's clock,
  • The environment variable TZ.

The most important variable is the time zone offset. It indicates the offset of your timestamp from UTC. It does not indicate the exact time zone, as several time zones can have the same offset from UTC. However, if your data includes a time zone offset, use it. With this offset the timestamp defines a single point in time. Without this offset timestamps are ambiguous, and the time zone your data is in depends on other variables.
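
That a timestamp with an offset pins down a single point in time can be checked by writing the same instant with two different offsets. This is a small sketch; the second timestamp with +0200 is made up for the comparison and does not appear in the data set.

```r
fmt <- "%d-%m-%Y %H:%M:%S %z"

# The same instant written with two different offsets.
# The "+0200" variant is a hypothetical example, not part of peak2.dat.
a <- as.POSIXct(strptime("26-04-2012 00:00:00 +0000", format = fmt, tz = "UTC"))
b <- as.POSIXct(strptime("26-04-2012 02:00:00 +0200", format = fmt, tz = "UTC"))

# Both parse to the same number of seconds since the epoch.
same_instant <- as.numeric(a) == as.numeric(b)
print(same_instant)
```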

One of these is the tz parameter of the strptime function. It lets you specify the name of a timezone. This parameter does several things. If your timestamp does not include a time zone offset, tz is used to interpret your timestamp.

> x <- strptime( "25-03-2012 02:23:00", 
               format="%d-%m-%Y %H:%M:%S", tz="EST" )
> x
[1] "2012-03-25 02:23:00 EST"

It is also possible to use both a time zone offset and tz. In this case tz is used when displaying your data.

> x <- strptime( "25-03-2012 02:23:00 +0000", 
                 format="%d-%m-%Y %H:%M:%S %z", tz="EST" )
> x
[1] "2012-03-24 21:23:00 EST"

The value of tz is stored together with the converted timestamp.

> dput(x)
structure(list(sec = 0, min = 23L, hour = 21L, mday = 24L, mon = 2L, 
    year = 112L, wday = 6L, yday = 83L, isdst = 0L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = c("EST", "EST", "EST"
))

This information is used by some functions that display data.
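
The stored zone can be inspected and changed through the tzone attribute. The sketch below uses a POSIXct value rather than the POSIXlt shown above (an assumption made for simplicity, since POSIXct stores just one number plus the attribute), and shows that changing the attribute alters the printed clock time but not the instant.

```r
# A POSIXct value with an explicit zone (illustrative, not from the data set).
x <- as.POSIXct("2012-03-25 02:23:00", tz = "UTC")
print(attr(x, "tzone"))        # the stored zone name

# Changing the attribute changes the printed clock time only.
y <- x
attr(y, "tzone") <- "EST"
est_label <- format(y, "%Y-%m-%d %H:%M")
print(est_label)

# The underlying instant (seconds since the epoch) is unchanged.
print(as.numeric(x) == as.numeric(y))
```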

If you specify neither a time zone offset nor tz, R uses the time zone of your OS, but does not store the time zone information.

> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00"
> dput(x)
structure(list(sec = 0, min = 23L, hour = 2L, mday = 25L, mon = 2L, 
    year = 112L, wday = 0L, yday = 84L, isdst = 1L), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"))

On Unix-like systems you can override this with the environment variable TZ. It puts R temporarily in a different time zone. You would use it as follows:

$ export TZ=EST
$ R 
> x <- strptime( "25-03-2012 02:23:00", format="%d-%m-%Y %H:%M:%S")
> x
[1] "2012-03-25 02:23:00 EST"
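
TZ can also be changed from inside a running session with Sys.setenv. The following is a minimal sketch, assuming "UTC" as the target zone and restoring the previous setting afterwards; old_tz and zone_used are illustrative names.

```r
# Remember the current setting so it can be restored afterwards.
old_tz <- Sys.getenv("TZ", unset = NA)

Sys.setenv(TZ = "UTC")   # "UTC" is an illustrative choice
x <- as.POSIXct("2012-03-25 02:23:00")   # no tz given: the session default is used
zone_used <- format(x, "%Z")
print(zone_used)

# Restore the previous state of the environment variable.
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```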

The table below shows the effect of these variables on how plot() and ggplot() show the data from the example above. The table shows where the peak of the graph is located for various combinations of the variables. The timezone of the operating system's clock is fixed to CET. The other variables are varied.

TZ    Plot kind   time zone offset:  -      -      -      UTC    UTC    UTC
                  tz strptime:       -      UTC    EST    -      UTC    EST
-     ggplot                         00:00  02:00  07:00  02:00  02:00  02:00
-     plot                           00:00  00:00  00:00  00:00  00:00  19:00
UTC   ggplot                         00:00  00:00  05:00  00:00  00:00  00:00
UTC   plot                           00:00  00:00  00:00  00:00  00:00  19:00
CET   ggplot                         00:00  02:00  07:00  02:00  02:00  02:00
CET   plot                           00:00  00:00  00:00  02:00  00:00  19:00

When we look at the table it is clear that plot() and ggplot() behave quite differently. When the timestamps have a time zone offset, the tz parameter does not have any influence on where ggplot() shows the maximum. For plot() it is the opposite. If no time zone offset is specified, plot() always shows the peak at 00:00.

Conclusion

  • When you make a graph with a time axis, be aware of the time zone in which the breaks on the axis are displayed. Otherwise points of interest might not be where you think they are.
  • Use timestamps with a time zone offset indication.
  • If you want plot() and ggplot() to behave the same, do not use tz but do set TZ.
  • The only(*) way to make ggplot display your data in a different time zone than your OS's is to set TZ.

(*) ggplot2 used to have a tz parameter for its scale_x_datetime but that seems to be gone in the current release.