This vignette explains how to set up data consisting of observations and forecasts, such that it can be used for onlineforecast models. A generic introduction and description is available in onlineforecasting. The code is available here. More information is available on onlineforecasting.org.

First load the package:

```
# Load the package
library(onlineforecast)
```

Different data sets are included in the package. The `Dbuilding` data set holds the data used for the example of heat load forecasting in the building-heat-load-forecasting vignette.

When the package is loaded the data is also loaded, so we can access it directly. Let’s start out by:

```
# Keep it in D to simplify notation
D <- Dbuilding
```

The class is ‘data.list’:

```
# The class of D
class(D)
## [1] "data.list" "list"
```

Actually, a ‘data.list’ is simply a ‘list’, but we made the class ‘data.list’ in order to have functions for the particular format of data - the format is explained in this document.

It consists of vectors of time, vectors of observations (model output) and data.frames of forecasts (model input):

```
# Print the names to see the variables in the data
names(D)
## [1] "t" "heatload" "heatloadtotal" "Taobs"
## [5] "Iobs" "Ta" "I"
```

An overview of the content can be generated by:

```
summary.default(D)
## Length Class Mode
## t 1824 POSIXct numeric
## heatload 1824 -none- numeric
## heatloadtotal 1824 -none- numeric
## Taobs 1824 -none- numeric
## Iobs 1824 -none- numeric
## Ta 36 data.frame list
## I 36 data.frame list
```

where it can be seen that `t` is a time vector, `heatload` is a vector, and `Ta` and `I` are data.frames.

A function giving a summary, including checks of the format of the ‘data.list’ is:

```
summary(D)
##
## Length of time vector 't': 1824
##
## NAs length class
## $heatload 0% ok numeric
## $heatloadtotal 0% ok numeric
## $Taobs 0% ok numeric
## $Iobs 0% ok numeric
##
## maxHorizonNAs NAs nrow colnames sameclass class
## $Ta 0% 0% ok ok ok numeric
## $I 0% 0% ok ok ok numeric
```

The ‘NAs’ columns indicate the proportion of NAs. If there is an `ok` in a column, then the check of the variable's format is passed. See the help with `?summary.data.list` to learn which checks are performed.

First, let's have a look at `D$t`, which is the vector of time points:

```
# The time
class(D$t)
## [1] "POSIXct" "POSIXt"
head(D$t)
## [1] "2010-12-15 01:00:00 UTC" "2010-12-15 02:00:00 UTC"
## [3] "2010-12-15 03:00:00 UTC" "2010-12-15 04:00:00 UTC"
## [5] "2010-12-15 05:00:00 UTC" "2010-12-15 06:00:00 UTC"
tail(D$t)
## [1] "2011-02-28 19:00:00 UTC" "2011-02-28 20:00:00 UTC"
## [3] "2011-02-28 21:00:00 UTC" "2011-02-28 22:00:00 UTC"
## [5] "2011-02-28 23:00:00 UTC" "2011-03-01 00:00:00 UTC"
```

Hence, the vector is of the class `POSIXct`. This is not a necessity: `t` can also simply be a numeric, but for plotting and many operations it's very useful to use the ‘POSIXct’ class (see `?POSIXt`).

Rules for the time vector:

- It must be named `t`.
- There must be no gaps or NA values in `t`, since only equidistant time series can be used in the models (the other variables can have NAs).
- It's best to keep the time zone in `UTC` or `GMT` (not providing any time zone `tz` can give rise to problems).
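The rules above can be checked with base R before the data is used. The following is only a minimal sketch: the vector `tvec` is a made-up example, not data from the package, and the checks are written by hand rather than taken from the package.

```r
# Made-up hourly time vector in UTC (not from the package)
tvec <- seq(as.POSIXct("2010-12-15 01:00", tz = "UTC"), by = "hour", length.out = 6)

# No NA values are allowed in the time vector
no_na <- !anyNA(tvec)

# Equidistant: all steps (in seconds) must be identical
steps <- diff(as.numeric(tvec))
equidistant <- length(unique(steps)) == 1

# The time zone is stored in the "tzone" attribute
tz_ok <- attr(tvec, "tzone") %in% c("UTC", "GMT")
```

If any of the three flags is `FALSE`, the time vector should be fixed before it is put into a `data.list`.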

Use the basic R functions for handling the time class. Most needed operations can be done with:

```
?as.POSIXct
?strftime
```

The package also provides the helper function `ct` for these conversions (see the help with `?ct`). See the example below:

```
# Convert from a time stamp (tz="GMT" per default)
ct("2019-01-01 11:00")
## [1] "2019-01-01 11:00:00 GMT"
# Convert from unix time
ct(3840928387)
## [1] "2091-09-18 04:33:07 GMT"
```

Note that for all functions where a time value is given as a character, the time zone is always “GMT” (or “UTC”, which can result in warnings that can be ignored). For some operations the package `lubridate` can be very helpful.
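For comparison, the same kind of conversions can be sketched with the basic R functions mentioned above. The variable names `t1` and `t2` are only illustrative, and the time zone is given explicitly in every call as an assumption about good practice, not something the package requires.

```r
# Character to POSIXct with an explicit time zone
t1 <- as.POSIXct("2019-01-01 11:00", tz = "GMT")

# Unix time (seconds since 1970-01-01) to POSIXct
t2 <- as.POSIXct(3840928387, origin = "1970-01-01", tz = "GMT")

# POSIXct back to a character string
s1 <- strftime(t1, "%Y-%m-%d %H:%M", tz = "GMT")
```

Stating `tz` explicitly avoids depending on the time zone of the local machine.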

Note the rules for observations:

- In a `data.list` observations must be vectors.
- The vectors must have the same length as the time vector `t`.
- Observations as numerical vectors can be used directly as model output (if observations are to be used as model inputs, they must be set up in a data.frame as explained below in Section Forecasts).

In the current data, a time series of hourly heat load observations is included:

```
str(D$heatload)
## num [1:1824] 5.92 5.85 5.85 5.88 5.85 ...
```

It must have the same length as the time vector:

```
# Same length as time
length(D$t)
## [1] 1824
length(D$heatload)
## [1] 1824
```

A simple plot can be generated by:

`plot(D$t, D$heatload, type="l", xlab="Time", ylab="Heatload (kW)")`

The convention used in all examples is that the time points are always set to the time interval end point, e.g.:

```
# The observation
D$heatload[2]
## [1] 5.85
# Represents the average load between
D$t[1]
## [1] "2010-12-15 01:00:00 UTC"
# and
D$t[2]
## [1] "2010-12-15 02:00:00 UTC"
```

The main idea behind setting the time point at the end of the interval is that values averaged over a time interval are only available at the end of that interval, not before. This convention is especially useful in real-time applications.
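As an illustration of this convention, the sketch below averages made-up 10-minute readings into hourly values stamped at the end of each hour. The data and the variable names (`tt`, `v`, `end_num`, `avg`, `t_end`) are invented for the example, and the rounding approach is one possible way to do it in base R, not a function from the package.

```r
# Made-up 10-minute readings: 00:10, 00:20, ..., 02:00 (UTC)
tt <- seq(as.POSIXct("2010-12-15 00:10", tz = "UTC"), by = "10 min", length.out = 12)
v <- 1:12

# Map each reading to the end of the hour it falls in (seconds since epoch)
end_num <- ceiling(as.numeric(tt) / 3600) * 3600

# Average within each hour; the names of the result hold the interval end points
avg <- tapply(v, end_num, mean)
t_end <- as.POSIXct(as.numeric(names(avg)), origin = "1970-01-01", tz = "UTC")
```

Here the reading at 01:00 belongs to the interval ending at 01:00, so the first hourly value is the mean of the readings from 00:10 to 01:00.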

As described in onlineforecasting, the setup of forecasts for model inputs always follows the same format, which is presented in the following. This is also the format of the forecasts generated by functions in the package. Hence all forecasts must follow this format.

The rules are:

- All values at row `i` are available at the `i`’th value in time `t`.
- All columns must be named with `k` followed by an integer indicating the horizon in steps (e.g. the column named `k8` holds the 8-step forecasts).
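To make the format concrete, the following sketch builds a tiny object with the same structure by hand, using base R only. All values and the names `tvec`, `y`, `x` and `Dtoy` are made up; the class vector is copied from the output of `class(D)` shown earlier. This only mimics the structure, so consult the package help for the supported way of constructing a `data.list`.

```r
# Made-up example data (not from the package)
tvec <- seq(as.POSIXct("2010-12-15 01:00", tz = "UTC"), by = "hour", length.out = 4)
y <- c(5.9, 5.8, 5.8, 5.9)  # observations, one per time point

# Forecasts: row i holds what is available at tvec[i],
# column kh holds the h-step ahead forecast
x <- data.frame(k1 = c(1.0, 1.5, 2.0, 2.5),
                k2 = c(1.4, 2.1, 2.4, 3.0))

# Collect in a list and set the class as printed by class(D)
Dtoy <- list(t = tvec, y = y, x = x)
class(Dtoy) <- c("data.list", "list")

# Check the rules by hand
same_length <- length(Dtoy$y) == length(Dtoy$t)
same_nrow   <- nrow(Dtoy$x) == length(Dtoy$t)
names_ok    <- all(grepl("^k[0-9]+$", names(Dtoy$x)))
```

For example, `Dtoy$x$k2[1]` is the forecast available at `tvec[1]` for the time point two steps later, i.e. `tvec[3]`.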

Have a look at the forecasts of the global radiation:

```
# Global radiation forecasts
head(D$I)
## k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13 k14
## 1 0 0 0.0 0.0 0.0 0.0 0.0 46.9 119.52 168.41 181.49 158.5 97.6 19.4
## 2 0 0 0.0 0.0 0.0 0.0 46.9 119.5 168.41 181.49 158.52 97.6 19.4 0.0
## 3 0 0 0.0 0.0 0.0 46.9 119.5 168.4 181.49 158.52 97.64 19.4 0.0 0.0
## 4 0 0 0.0 0.0 49.9 125.6 175.0 190.6 165.10 99.86 9.94 0.0 0.0 0.0
## 5 0 0 0.0 49.9 125.6 175.0 190.6 165.1 99.86 9.94 0.00 0.0 0.0 0.0
## 6 0 0 49.9 125.6 175.0 190.6 165.1 99.9 9.94 0.00 0.00 0.0 0.0 0.0
## k15 k16 k17 k18 k19 k20 k21 k22 k23 k24 k25 k26 k27 k28 k29 k30 k31
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 12.2
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 12.2 11.8
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 42.4 42.8 58.1
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0.0 42.4 42.8 58.1 44.8
## 6 0 0 0 0 0 0 0 0 0 0 0 0 42.4 42.8 58.1 44.8 254.1
## k32 k33 k34 k35 k36
## 1 12.2 11.8 15.5 12.3 24.1
## 2 11.8 15.5 12.3 24.1 38.7
## 3 15.5 12.3 24.1 38.7 31.4
## 4 44.8 254.1 20.6 30.3 0.0
## 5 254.1 169.4 17.2 0.0 0.0
## 6 168.5 40.4 0.0 0.0 0.0
```

At the first time point:

```
# First time point
D$t[1]
## [1] "2010-12-15 01:00:00 UTC"
```

the available forecast ahead in time is at the first row:

```
# The forecast available ahead in time is in the first row
D$I[1, ]
## k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13 k14 k15 k16 k17 k18 k19 k20
## 1 0 0 0 0 0 0 0 46.9 120 168 181 159 97.6 19.4 0 0 0 0 0 0
## k21 k22 k23 k24 k25 k26 k27 k28 k29 k30 k31 k32 k33 k34 k35 k36
## 1 0 0 0 0 0 0 0 0 0 0 0 12.2 11.8 15.5 12.3 24.1
```

We can plot that by:

```
i <- 1:ncol(D$I)
plot(D$t[i], D$I[1, ], type="l", xlab="Time", ylab="Global radiation forecast (I in W/m²)")
```

So this is the forecast available ahead in time at 2010-12-15 01:00:00.

The column in `I` named `k8` holds the 8-step horizon forecasts, which, since the steps are hourly, form an equidistant time series. Picking out the entire series can be done by `D$I$k8`, hence a plot (together with the observations) can be generated by:

```
# Just pick some points by
i <- 200:296
plot(D$t[i], D$I$k8[i], type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
# Add the observations
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts","Observations"), bg="white", lty=1, col=2:1)
```

Notice how they are not aligned, since the forecasts are 8 hours ahead. To align them, the forecasts must be lagged 8 steps by:

```
plot(D$t[i], lagvec(D$I$k8[i], 8), type="l", col=2, xlab="Time", ylab="Global radiation (W/m²)")
lines(D$t[i], D$Iobs[i])
legend("topright", c("8-step forecasts lagged","Observations"), bg="white", lty=1, col=2:1)
```
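What lagging does to a vector can be sketched in base R. The helper `lag_by` below is hypothetical and written only for this illustration (it is not the package's `lagvec`); it shifts the values `k` steps forward and pads the start with `NA`, under the assumption that `k` is a positive integer smaller than the vector length.

```r
# Hypothetical helper: shift x forward by k steps, padding with NA
lag_by <- function(x, k) {
  c(rep(NA, k), x[seq_len(length(x) - k)])
}

# Made-up 2-step forecasts: element i was issued at time i and concerns time i + 2
fc <- c(10, 20, 30, 40, 50)

# After lagging, element i + 2 holds the forecast that concerns time i + 2
fc_lagged <- lag_by(fc, 2)
fc_lagged
## [1] NA NA 10 20 30
```

After the shift, each forecast sits at the time point it predicts, so it can be compared directly with the observation at that time point.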

A few simple plotting functions are included in the package.

The plot function provided with the package actually does this lagging when plotting forecasts:

`plot_ts(D, patterns=c("^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))`

The argument `patterns` is a vector of regular expressions (see `?regex`), which is used to match the variables to include in the plot. See the help with `?plot_ts` for more details.

An interactive plot can be generated using (first install the package `plotly`):

`plotly_ts(D, patterns=c("heatload$","^I"), c("2010-12-15","2010-12-18"), kseq=c(1,8,24,36))`

Note that the `patterns` argument is a vector of regular expressions, which determines which variables from `D` to plot.

When modelling with the objective of forecasting, it’s always a good start to have a look at scatter plots between the model inputs and the model output. For example the heatload vs. ambient temperature 8-step forecast:

```
par(mfrow=c(1,2))
plot(D$Ta$k8, D$heatload)
plot(lagvec(D$Ta$k8, 8), D$heatload)
```

So lagging (thus aligning in time) results in slightly less scatter.

A wrapper for the `pairs` function is provided for a `data.list`, which can generate very useful explorative plots:

`pairs(D, nms=c("heatload","Taobs","Ta","t"), kseq=c(1,8,24))`

Note how the sequence of included horizons is specified in the `kseq` argument, and note that the forecasts are lagged to be aligned in time. See `?pairs.data.list` for more details.

Just as a quick side note: this is the principle used for fitting onlineforecast models, namely to shift the forecasts so that they are aligned with the observations:

```
# Lag the 8-step forecasts to be aligned with the observations
x <- lagvec(D$I$k8, 8)
# Take a smaller range
x <- x[i]
# Take the observations
y <- D$Iobs[i]
# Fit a linear regression model
fit <- lm(y ~ x)
# Plot the result
plot(x, y, xlab="8-step forecasts (W/m²)", ylab="Observations (W/m²)", main="Global radiation")
abline(fit)
```

Seen over time the 8-step forecasts are:

```
plot(D$t[i], predict.lm(fit, newdata=data.frame(x)), type="l", ylim=c(0,max(y)), xlab="Time", ylab="Global radiation (W/m^2)", col=2)
lines(D$t[i], y)
legend("topright", c("8-step forecasts lagged","Observations"), lty=1, col=2:1)
```

Of course that model was very simple. See how to make a better model in [building-heat-load-forecasting] and find more information on the [website].

Taking a subset of a `data.list` is very useful and it can easily be done in different ways using the `subset` function (it’s really the `subset.data.list` function that is called):

```
# Take the 1 to 4 values of each variable in D
Dsub <- subset(D, 1:4)
summary(Dsub)
##
## Length of time vector 't': 4
##
## NAs length class
## $heatload 0% ok numeric
## $heatloadtotal 0% ok numeric
## $Taobs 0% ok numeric
## $Iobs 0% ok numeric
##
## maxHorizonNAs NAs nrow colnames sameclass class
## $Ta 0% 0% ok ok ok numeric
## $I 0% 0% ok ok ok numeric
```

Another useful function for taking data in a time range is:

```
which(in_range("2010-12-20",D$t,"2010-12-21"))
## [1] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139
## [20] 140 141 142 143 144
```

Always check the help of a function for more details (e.g. `?in_range`).
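Judging from the indexes printed above (121 to 144 for an hourly series starting at 2010-12-15 01:00), the selected range appears to exclude the start time and include the end time, which matches the convention of stamping values at the interval end. The base R sketch below reproduces those indexes under that assumption; it rebuilds the time vector by hand and does not call the package, so check `?in_range` for the actual definition.

```r
# Rebuild an hourly time vector like D$t (start and length as shown earlier)
tvec <- seq(as.POSIXct("2010-12-15 01:00", tz = "UTC"), by = "hour", length.out = 1824)

# Range limits in the same time zone
tstart <- as.POSIXct("2010-12-20", tz = "UTC")
tend   <- as.POSIXct("2010-12-21", tz = "UTC")

# Assumed rule: start excluded, end included
idx <- which(tvec > tstart & tvec <= tend)
range(idx)
## [1] 121 144
```

With this rule, the 24 selected values are exactly the hourly averages covering the day 2010-12-20.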

Actually, it’s easy to take a subset from a period by:

```
Dsub <- subset(D, c("2010-12-20","2010-12-21"))
summary(Dsub)
##
## Length of time vector 't': 24
##
## NAs length class
## $heatload 0% ok numeric
## $heatloadtotal 0% ok numeric
## $Taobs 0% ok numeric
## $Iobs 0% ok numeric
##
## maxHorizonNAs NAs nrow colnames sameclass class
## $Ta 0% 0% ok ok ok numeric
## $I 0% 0% ok ok ok numeric
Dsub$t
## [1] "2010-12-20 01:00:00 UTC" "2010-12-20 02:00:00 UTC"
## [3] "2010-12-20 03:00:00 UTC" "2010-12-20 04:00:00 UTC"
## [5] "2010-12-20 05:00:00 UTC" "2010-12-20 06:00:00 UTC"
## [7] "2010-12-20 07:00:00 UTC" "2010-12-20 08:00:00 UTC"
## [9] "2010-12-20 09:00:00 UTC" "2010-12-20 10:00:00 UTC"
## [11] "2010-12-20 11:00:00 UTC" "2010-12-20 12:00:00 UTC"
## [13] "2010-12-20 13:00:00 UTC" "2010-12-20 14:00:00 UTC"
## [15] "2010-12-20 15:00:00 UTC" "2010-12-20 16:00:00 UTC"
## [17] "2010-12-20 17:00:00 UTC" "2010-12-20 18:00:00 UTC"
## [19] "2010-12-20 19:00:00 UTC" "2010-12-20 20:00:00 UTC"
## [21] "2010-12-20 21:00:00 UTC" "2010-12-20 22:00:00 UTC"
## [23] "2010-12-20 23:00:00 UTC" "2010-12-21 00:00:00 UTC"
```

It can be really useful to bring the data.list into the format of a `data.frame` or, equivalently, a `data.table` for processing.

Converting to a `data.frame` can easily be done by:

```
Df <- as.data.frame(Dsub)
names(Df)
## [1] "t" "heatload" "heatloadtotal" "Taobs"
## [5] "Iobs" "Ta.k1" "Ta.k2" "Ta.k3"
## [9] "Ta.k4" "Ta.k5" "Ta.k6" "Ta.k7"
## [13] "Ta.k8" "Ta.k9" "Ta.k10" "Ta.k11"
## [17] "Ta.k12" "Ta.k13" "Ta.k14" "Ta.k15"
## [21] "Ta.k16" "Ta.k17" "Ta.k18" "Ta.k19"
## [25] "Ta.k20" "Ta.k21" "Ta.k22" "Ta.k23"
## [29] "Ta.k24" "Ta.k25" "Ta.k26" "Ta.k27"
## [33] "Ta.k28" "Ta.k29" "Ta.k30" "Ta.k31"
## [37] "Ta.k32" "Ta.k33" "Ta.k34" "Ta.k35"
## [41] "Ta.k36" "I.k1" "I.k2" "I.k3"
## [45] "I.k4" "I.k5" "I.k6" "I.k7"
## [49] "I.k8" "I.k9" "I.k10" "I.k11"
## [53] "I.k12" "I.k13" "I.k14" "I.k15"
## [57] "I.k16" "I.k17" "I.k18" "I.k19"
## [61] "I.k20" "I.k21" "I.k22" "I.k23"
## [65] "I.k24" "I.k25" "I.k26" "I.k27"
## [69] "I.k28" "I.k29" "I.k30" "I.k31"
## [73] "I.k32" "I.k33" "I.k34" "I.k35"
## [77] "I.k36"
```

So the forecasts are simply bound together with the time and observations, and `.kxx` is added to the column names.
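The naming scheme can be illustrated with a small base R sketch. The data and the names `tvec`, `y`, `x`, `flat` and `x_back` are made up, and the code only mimics the column naming seen in the output above; it is not how the package implements the conversion.

```r
# Made-up time vector, observations and forecasts
tvec <- seq(as.POSIXct("2010-12-20 01:00", tz = "UTC"), by = "hour", length.out = 3)
y <- c(5.9, 5.8, 5.8)
x <- data.frame(k1 = c(1.0, 1.5, 2.0), k2 = c(1.4, 2.1, 2.4))

# Flatten: prefix the forecast columns with the variable name and a dot
flat <- cbind(data.frame(t = tvec, y = y),
              setNames(x, paste0("x.", names(x))))
names(flat)
## [1] "t"    "y"    "x.k1" "x.k2"

# Going back: pick the columns by prefix and strip it again
x_back <- flat[grepl("^x\\.k[0-9]+$", names(flat))]
names(x_back) <- sub("^x\\.", "", names(x_back))
```

Because the variable name and the horizon are both kept in the column name, the flat table holds enough information to rebuild the forecast data.frames afterwards.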

It can be converted to a `data.table` by:

```
library(data.table)
setDT(Df)
class(Df)
## [1] "data.table" "data.frame"
```

After processing, it is easily converted back to a `data.list` by:

```
# Set back to data.frame
setDF(Df)
# Convert to a data.list
Dsub2 <- as.data.list(Df)
# Compare it with the original Dsub
summary(Dsub2)
##
## Length of time vector 't': 24
##
## NAs length class
## $heatload 0% ok numeric
## $heatloadtotal 0% ok numeric
## $Taobs 0% ok numeric
## $Iobs 0% ok numeric
##
## maxHorizonNAs NAs nrow colnames sameclass class
## $Ta 0% 0% ok ok ok numeric
## $I 0% 0% ok ok ok numeric
summary(Dsub)
##
## Length of time vector 't': 24
##
## NAs length class
## $heatload 0% ok numeric
## $heatloadtotal 0% ok numeric
## $Taobs 0% ok numeric
## $Iobs 0% ok numeric
##
## maxHorizonNAs NAs nrow colnames sameclass class
## $Ta 0% 0% ok ok ok numeric
## $I 0% 0% ok ok ok numeric
```