
Chapter 6. Graphs and plots

You may ask, why bother changing the size in the gnuplot command file? After all, PDF and EPS are
both vector formats, so the graphs can be scaled at will. True, but a uniform scaling will also affect
the font size, which may end up looking wrong. You can get optimal results by experimenting with
the font and size options to gnuplot’s set term command. Here are some examples (comments
follow below).

# pdfcairo, regular size, slightly amended
set term pdfcairo font "Sans,6" size 5in,3.5in
# or small size
set term pdfcairo font "Sans,5" size 3in,2in

# pdf, regular size, slightly amended
set term pdf font "Helvetica,8" size 5in,3.5in
# or small
set term pdf font "Helvetica,6" size 3in,2in

# postscript, regular
set term post eps solid font "Helvetica,16"
# or small
set term post eps solid font "Helvetica,12" size 3in,2in

On the first line we set a sans serif font for pdfcairo at a suitable size for a 5 × 3.5 inch plot
(which you may find looks better than the rather “letterboxy” default of 5 × 3). And on the second
we illustrate what you might do to get a smaller 3 × 2 inch plot. You can specify the plot size in
centimeters if you prefer, as in

set term pdfcairo font "Sans,6" size 6cm,4cm

We then repeat the exercise for the pdf terminal. Notice that here we’re specifying one of the 35
standard PostScript fonts, namely Helvetica. Unlike pdfcairo, the plain pdf driver is unlikely to
be able to find fonts other than these.
In the third pair of lines we illustrate options for the postscript driver (which, as you see, can
be abbreviated as post). Note that here we have added the option solid. Unlike most other
drivers, this one uses dashed lines unless you specify the solid option. Also note that we’ve
(apparently) specified a much larger font in this case. That’s because the eps option in effect tells
the postscript driver to work at half-size (among other things), so we need to double the font
size.
Table 6.1 summarizes the basics for the three drivers we have mentioned.

Terminal      default size (inches)    suggested font

pdfcairo      5 × 3                    Sans,6
pdf           5 × 3                    Helvetica,8
post eps      5 × 3.5                  Helvetica,16

Table 6.1: Drivers for publication-quality graphics

To find out more about gnuplot visit www.gnuplot.info. This site has documentation for the current
version of the program in various formats.

Additional tips
To be written. Line widths, enhanced text. Show a “before and after” example.

6.2 Plotting graphs from scripts


When working with scripts, you may want to have a graph shown on your display or saved into a
file. In fact, if in your usual workflow you find yourself creating similar graphs over and over again,
you might want to consider the option of writing a script which automates this process for you.
gretl gives you two main tools for doing this: one is a command called gnuplot, whose main use
is to create standard plots quickly. The other one is the plot command block, which has a more
elaborate syntax but offers you more control over the output.

The gnuplot command


The gnuplot command is described at length in the Gretl Command Reference and the online help
system. Here, we just summarize its main features: basically, it consists of the gnuplot keyword,
followed by a list of items, telling the command what you want plotted and a list of options, telling
it how you want it plotted.
For example, the line

gnuplot y1 y2 x

will give you a basic XY plot of the two series y1 and y2 on the vertical axis versus the series x on
the horizontal axis. In general, the arguments to the gnuplot command are a list of series, the last
of which goes on the x-axis, while all the other ones go on the y-axis. By default, the gnuplot
command gives you a scatterplot. If you just have one variable on the y-axis, then gretl will also
draw the OLS interpolation, if the fit is good enough.2
Several aspects of the behavior described above can be modified. You do this by appending options
to the command. Most options can be broadly grouped into three categories:

1. Plot styles: we support points (the default choice), lines, lines and points together, and im-
pulses (vertical lines).

2. Algorithm for the fitted line: here you can choose between linear, quadratic and cubic inter-
polation, but also more exotic choices, such as semi-log, inverse or loess (non-parametric). Of
course, you can also turn this feature off.

3. Input and output: you can choose whether you want your graph on your computer screen
(and possibly use the in-built graphical widget to further customize it — see above, page 37),
or rather save it to a file. We support several graphical formats, including PNG and PDF,
to make it easy to incorporate your plots into text documents.
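
For instance, one option from each of the three groups can be combined on a single command
line. In the sketch below the series names and the output filename are purely illustrative:

gnuplot y x --with-lines --fit=none --output=consfun.pdf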

The following script uses the AWM dataset to exemplify some traditional plots in macroeconomics:

open AWM.gdt --quiet

# --- consumption and income, different styles ------------
gnuplot PCR YER
gnuplot PCR YER --output=display
gnuplot PCR YER --output=display --time-series
gnuplot PCR YER --output=display --time-series --with-lines

# --- Phillips’ curve, different fitted lines -------------
gnuplot INFQ URX --output=display
gnuplot INFQ URX --fit=none --output=display
gnuplot INFQ URX --fit=inverse --output=display
gnuplot INFQ URX --fit=loess --output=display

2 The technical condition for this is that the two-tailed p-value for the slope coefficient should be under 10%.

These examples use variables from the “area-wide model” dataset by the European Central Bank
(ECB) which is shipped with gretl in the AWM.gdt file. PCR is aggregate private real consumption
and YER is real GDP. The first command line above thus plots consumption against income as a
kind of Keynesian consumption function. More precisely, it produces a simple scatter plot with
a linear fitted line added automatically. If this is executed in the gretl console the plot will be directly
shown in a new window, but if this line is contained in a script then instead a file with the plot
commands will be saved for later execution. The second example line changes this behavior for a
script command and forces the plot to be shown directly.
The third line instead asks for a plot of the two variables as two separate curves against time on
the x-axis. Each observation point is drawn separately with a certain symbol determined by gnuplot
defaults. If you add the option --with-lines the points will be connected with a continuous line
and the symbols omitted.
The second set of example lines above demonstrates how the fitted line in the scatter plot can be
controlled from gretl’s side. The option --fit=none overrides gnuplot’s default to draw a line if it
deems the fit to be “good enough”. The effect of --fit=inverse is to consider the variable on the
y-axis as a function of 1/X instead of X and draw the corresponding hyperbolic branch. For the
workings of a Loess fit (locally-weighted polynomial regression) please refer to the documentation
of the loess function.
For more detail, consult the Gretl Command Reference.

The plot command block


The plot environment is a way to pass information to Gnuplot in a more structured way, so that
customization of basic plots becomes easier. It has the following characteristics:
The block starts with the plot keyword, followed by a required parameter: the name of a list, a
single series or a matrix. This parameter specifies the data to be plotted. The starting line may be
prefixed with the savename <- apparatus to save a plot as an icon in the GUI program. The block
ends with end plot.
Inside the block you have zero or more lines of these types, identified by an initial keyword:

option: specify a single option (details below)

options: specify multiple options on a single line; if more than one option is given on a line, the
options should be separated by spaces.

literal: a command to be passed to gnuplot literally

printf: a printf statement whose result will be passed to gnuplot literally; this allows the use of
string variables without having to resort to @-style string substitution.

The options available are basically those of the current gnuplot command, but with a few dif-
ferences. For one thing you don’t need the leading double-dash in an "option" (or "options") line.
Besides that,

• You can’t use the option --matrix=whatever with plot: that possibility is handled by pro-
viding the name of a matrix on the initial plot line.

• The --input=filename option is not supported: use gnuplot for the case where you’re
supplying the entire plot specification yourself.

• The several options pertaining to the presence and type of a fitted line are replaced in plot
by a single option fit which requires a parameter. Supported values for the parameter are:
none, linear, quadratic, cubic, inverse, semilog and loess. Example:
option fit=quadratic

As with gnuplot, the default is to show a linear fit in an X-Y scatter if it’s significant at the 10
percent level.
Here’s a simple example, the plot specification from the “bandplot” package, which shows how
to achieve the same result via the gnuplot command and a plot block, respectively — the latter
occupies a few more lines but is clearer:

gnuplot 1 2 3 4 --with-lines --matrix=plotmat \
  --fit=none --output=display \
  { set linetype 3 lc rgb "#0000ff"; set title "@title"; \
  set nokey; set xlabel "@xname"; }

plot plotmat
options with-lines fit=none
literal set linetype 3 lc rgb "#0000ff"
literal set nokey
printf "set title \"%s\"", title
printf "set xlabel \"%s\"", xname
end plot --output=display

Note that --output=display is appended to end plot; also note that if you give a matrix to plot
it’s assumed you want to plot all the columns. In addition, if you give a single series and the dataset
is time series, it’s assumed you want a time-series plot.
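
As a minimal illustration of the last point, a block like the following sketch (assuming an open
time-series dataset containing a series named y) produces a time-series plot drawn with lines:

plot y
options with-lines
end plot --output=display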

Example: Plotting a histogram together with a density


Listing 6.1 contains a slightly more elaborate example: here we load the Mroz example dataset and
calculate the log of the individual’s wage. Then we plot the histogram of a discretized version
of the same variable (obtained via the aggregate() function) against the theoretical density the
data would follow if they were Gaussian.
There are a few points to note:

• The data for the plot are passed through a matrix in which we set column names via the
cnameset function; those names are then automatically used by the plot environment.

• In this example, we make extensive use of the literal construct for refining the plot by
passing set instructions to gnuplot; the power of gnuplot is impossible to overstate. We encourage
you to visit the “demos” version of gnuplot’s website (http://gnuplot.sourceforge.net/)
and revel in amazement.

• In the plot environment you can use all the quantities you have in your script. This is the
way we calibrate the histogram width (try setting the scalar k in the script to different values).
Note that the printf command has a special meaning inside a plot environment.

• The script displays the plot on your screen. If you want to save it to a file instead, replace
--output=display at the end with --output=filename.

• It’s OK to insert comments in the plot environment; actually, it’s a rather good idea to com-
ment as much as possible (as always)!

The output from the script is shown in Figure 6.3.



Listing 6.1: Plotting the log wage from the Mroz example dataset

set verbose off

open mroz87.gdt --quiet

series lWW = log(WW)
scalar m = mean(lWW)
scalar s = sd(lWW)

###
### prepare matrix with data for plot
###

# number of valid observations
scalar n = nobs(lWW)
# discretize log wage
scalar k = 4
series disc_lWW = round(lWW*k)/k
# get frequencies
matrix f = aggregate(null, disc_lWW)
# add density
phi = dnorm((f[,1] - m)/s) / (s*k)
# put columns together and add labels
plotmat = f[,2]./n ~ phi ~ f[,1]
strings cnames = defarray("frequency", "density", "log wage")
cnameset(plotmat, cnames)

###
### create plot
###

plot plotmat
# move legend
literal set key outside rmargin
# set line style
literal set linetype 2 dashtype 2 linewidth 2
# set histogram color
literal set linetype 1 lc rgb "#777777"
# set histogram style
literal set style fill solid 0.25 border
# set histogram width
printf "set boxwidth %4.2f\n", 0.5/k
options with-lines=2 with-boxes=1
end plot --output=display

Figure 6.3: Output from listing 6.1 (histogram of the log wage, keyed “frequency”, with the Gaussian “density” curve overlaid; x-axis: log wage)

Listing 6.2: Plotting t densities for varying degrees of freedom

set verbose off

function string tplot(scalar m)
return sprintf("stud(x,%d) title \"t(%d)\"", m, m)
end function

matrix dfs = {2, 4, 16}

plot
literal set xrange [-4.5:4.5]
literal set yrange [0:0.45]
literal Binv(p,q) = exp(lgamma(p+q)-lgamma(p)-lgamma(q))
literal stud(x,m) = Binv(0.5*m,0.5)/sqrt(m)*(1.0+(x*x)/m)**(-0.5*(m+1.0))
printf "plot %s, %s, %s", tplot(dfs[1]), tplot(dfs[2]), tplot(dfs[3])
end plot --output=display

Example: Plotting Student’s t densities


The power of the printf statement in a plot block becomes apparent when used jointly with
user-defined functions, as exemplified in Listing 6.2, in which we create a plot showing the den-
sity functions of Student’s t distribution for three different settings of the “degrees of freedom”
parameter (note that plotting a t density is very easy to do from the GUI: just go to the Tools >
Distribution graphs menu).
First we define a user function called tplot, which returns a string with the ingredients to pass
to the gnuplot plot statement, as a function of a scalar parameter (the degrees of freedom in our
case). Next, this function is used within the plot block to plot the appropriate density. Note that
most of the statements to mathematically define the function to plot are outsourced to gnuplot via
the literal command.
The output from the script is shown in Figure 6.4.

Figure 6.4: Output from listing 6.2 (Student’s t densities, keyed t(2), t(4) and t(16))



6.3 Boxplots
These plots (after Tukey and Chambers) display the distribution of a variable. The shape of the plot
depends on a few quantities, defined as follows:

xmin            sample minimum
Q1              first quartile
m               median
x̄               mean
Q3              third quartile
xmax            sample maximum
R = Q3 − Q1     interquartile range

The central box encloses the middle 50 percent of the data, i.e. goes from Q1 to Q3; therefore, its
height equals R. A line is drawn across the box at the median m and a “+” sign identifies the mean
x̄.
The length of the “whiskers” depends on the presence of outliers. The top whisker extends from
the top of the box up to a maximum of 1.5 times the interquartile range, but can be shorter if the
sample maximum is lower than that value; that is, it reaches min[xmax , Q3 + 1.5R]. Observations
larger than Q3 + 1.5R, if any, are considered outliers and represented individually via dots.3 The
bottom whisker obeys the same logic, with obvious adjustments. Figure 6.5 provides an example
of all this by using the variable FAMINC from the sample dataset mroz87.
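
A plot like the one in Figure 6.5 can be produced with a couple of lines (a sketch; mroz87.gdt is
one of the sample datasets shipped with gretl):

open mroz87.gdt --quiet
boxplot FAMINC --output=display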

Figure 6.5: Sample boxplot of FAMINC (annotations mark xmin, Q1, m, x̄, Q3, xmax and the outliers)

In the case of boxplots with confidence intervals, dotted lines show the limits of an approximate 90
percent confidence interval for the median. This is obtained by the bootstrap method, which can
take a while if the data series is very long. For details on constructing boxplots, see the entry for
boxplot in the Gretl Command Reference or use the Help button that appears when you select one
of the boxplot items under the menu item “View, Graph specified vars” in the main gretl window.

3 To give you an intuitive idea, if a variable is normally distributed, the chances of picking an outlier by this definition
are slightly below 0.7%.

Factorized boxplots
A nice feature which is quite useful for data visualization is the conditional, or factorized boxplot.
This type of plot allows you to examine the distribution of a variable conditional on the value of
some discrete factor.
As an example, we’ll use one of the datasets supplied with gretl, that is rac3d, which contains an
example taken from Cameron and Trivedi (2013) on the health conditions of 5190 people. The
script below compares the unconditional (marginal) distribution of the number of illnesses in the
past 2 weeks with the distribution of the same variable, conditional on age classes.

open rac3d.gdt
# unconditional boxplot
boxplot ILLNESS --output=display
# create a discrete variable for age class:
# 0 = below 20, 1 = between 20 and 39, etc
series age_class = floor(AGE/0.2)
# conditional boxplot
boxplot ILLNESS age_class --factorized --output=display

After running the code above, you should see two graphs similar to Figure 6.6. By comparing the
marginal plot to the factorized one, the effect of age on the mean number of illnesses is quite
evident: by joining the green crosses you get what is technically known as the conditional mean
function, or regression function if you prefer.

Figure 6.6: Conditional and unconditional distribution of illnesses (left panel: ILLNESS; right panel: distribution of ILLNESS by age_class)


Chapter 7

Joining data sources

7.1 Introduction
Gretl provides two commands for adding data from a file to an existing dataset in the program’s
workspace, namely append and join. The append command, which has been available for a long
time, is relatively simple and is described in the Gretl Command Reference. Here we focus on the
join command, which is much more flexible and sophisticated. This chapter gives an overview
of the functionality of join along with a detailed account of its syntax and options. We provide
several toy examples and discuss one real-world case at length.
First, a note on terminology: in the following we use the terms “left-hand” and “inner” to refer
to the dataset that is already in memory, and the terms “right-hand” and “outer” to refer to the
dataset in the file from which additional data are to be drawn.
Two main features of join are worth emphasizing at the outset:

• “Key” variables can be used to match specific observations (rows) in the inner and outer
datasets, and this match need not be 1 to 1.

• A row filter may be applied to screen out unwanted observations in the outer dataset.

As will be explained below, these features support rather complex concatenation and manipulation
of data from different sources.
A further aspect of join should be noted — one that makes this command particularly useful when
dealing with very large data files. That is, when gretl executes a join operation it does not, in gen-
eral, read into memory the entire content of the right-hand side dataset. Only those columns that
are actually needed for the operation are read in full. This makes join faster and less demanding
of computer memory than the methods available in most other software. On the other hand, gretl’s
asymmetrical treatment of the “inner” and “outer” datasets in join may require some getting used
to, for users of other packages.

7.2 Basic syntax


The minimal invocation of join is
join filename varname
where filename is the name of a data file and varname is the name of a series to be imported.
Only two sorts of data file are supported at present: delimited text files (where the delimiter may
be comma, space, tab or semicolon) and “native” gretl data files (gdt or gdtb). A series named
varname may already be present in the left-hand dataset, but that is not required. The series to be
imported may be numerical or string-valued. For most of the discussion below we assume that just
a single series is imported by each join command, but see section 7.7 for an account of multiple
imports.
The effect of the minimal version of join is this: gretl looks for a data column labeled varname in
the specified file; if such a column is found and the number of observations on the right matches
the number of observations in the current sample range on the left, then the values from the right
are copied into the relevant range of observations on the left. If varname does not already exist
on the left, any observations outside of the current sample are set to NA; if it exists already then
observations outside of the current sample are left unchanged.
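
A minimal sketch (the file and series names here are purely hypothetical):

open base.gdt
join extra.csv x   # look for a column named x in extra.csv
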
The case where you want to rename a series on import is handled by the --data option. This option
has one required argument, the name by which the series is known on the right. At this point we
need to explain something about right-hand variable names (column headings).

Right-hand names
We accept on input arbitrary column heading strings, but if these strings do not qualify as valid
gretl identifiers they are automatically converted, and in the context of join you must use the
converted names. A gretl identifier must start with a letter, contain nothing but (ASCII) letters,
digits and the underscore character, and must not exceed 31 characters. The rules used in name
conversion are:

1. Skip any leading non-letters.

2. Until the 31-character limit is reached or the input is exhausted: transcribe “legal” characters; skip
“illegal” characters apart from spaces; and replace one or more consecutive spaces with an
underscore, unless the last character transcribed is an underscore in which case space is
skipped.

In the unlikely event that this policy yields an empty string, we replace the original with coln,
where n is replaced by the 1-based index of the column in question among those used in the
join operation. If you are in doubt regarding the converted name of a given column, the function
fixname() can be used as a check: it takes the original string as an argument and returns the
converted name. Examples:

? eval fixname("valid_identifier")
valid_identifier
? eval fixname("12. Some name")
Some_name

Returning to the use of the --data option, suppose we have a column headed "12. Some name"
on the right and wish to import it as x. After figuring out how the right-hand name converts, we can do

join foo.csv x --data="Some_name"

No right-hand names?
Some data files have no column headings; they jump straight into the data (and you need to deter-
mine from accompanying documentation what the columns represent). Since gretl expects column
headings, you have to take steps to get the importation right. It is generally a good idea to insert a
suitable header row into the data file. However, if for some reason that’s not practical, you should
give the --no-header option, in which case gretl will name the columns on the right as col1, col2
and so on. If you do not do either of these things you will likely lose the first row of data, since
gretl will attempt to make variable names out of it, as described above.
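
For example, if a (hypothetical) headerless file rawvals.csv holds the series we want in its second
column, we could import it and then give it a meaningful name:

join rawvals.csv col2 --no-header
rename col2 income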

7.3 Filtering
Rows from the outer dataset can be filtered using the --filter option. The required parameter
for this option is a Boolean condition, that is, an expression which evaluates to non-zero (true,
include the row) or zero (false, skip the row) for each of the outer rows. The filter expression may
include any of the following terms: up to three “right-hand” series (under their converted names as
explained above); scalar or string variables defined “on the left”; any of the operators and functions
available in gretl (including user-defined functions); and numeric or string constants.
Here are a few simple examples of potentially valid filter options (assuming that the specified right-
hand side columns are found):

# 1. relationship between two right-hand variables
--filter="x15<=x17"

# 2. comparison of right-hand variable with constant
--filter="nkids>2"

# 3. comparison of string-valued right-hand variable with string constant
--filter="SEX==\"F\""

# 4. filter on valid values of a right-hand variable
--filter=!missing(income)

# 5. compound condition
--filter="x < 100 && (x > 0 || y > 0)"

Note that if you are comparing against a string constant (as in example 3 above) it is necessary
to put the string in “escaped” double-quotes (each double-quote preceded by a backslash) so the
interpreter knows that F is not supposed to be the name of a variable.
It is safest to enclose the whole filter expression in double quotes; however, this is not strictly
required unless the expression contains spaces or the equals sign.
In general, an error is flagged if a missing value is encountered in a series referenced in a filter
expression. This is because the condition then becomes indeterminate; taking example 2 above, if
the nkids value is NA on any given row we are not in a position to evaluate the condition nkids>2.
However, you can use the missing() function — or ok(), which is a shorthand for !missing()— if
you need a filter that keys off the missing or non-missing status of a variable.

7.4 Matching with keys


Things get interesting when we come to key-matching. The purpose of this facility is perhaps best
introduced by example. Suppose that (as with many survey and census-based datasets) we have a
dataset that is composed of two or more related files, each having a different unit of observation;
for example we have a “persons” data file and a “households” data file. Table 7.1 shows a simple,
artificial case. The file people.csv contains a unique identifier for the individuals, pid. The
households file, hholds.csv, contains the unique household identifier hid, which is also present
in the persons file.
As a first example of join with keys, let’s add the household-level variable xh to the persons
dataset:

open people.csv --quiet
join hholds.csv xh --ikey=hid
print --byobs

The basic key option is named ikey; this indicates “inner key”, that is, the key variable found in the
left-hand or inner dataset. By default it is assumed that the right-hand dataset contains a column of
the same name, though as we’ll see below that assumption can be overridden. The join command
above says, find a series named xh in the right-hand dataset and add it to the left-hand one, using
the values of hid to match rows. Looking at the data in Table 7.1 we can see how this should
work. Persons 1 and 2 are both members of household 1, so they should both get values of 1 for
xh; persons 3 and 4 are members of household 2, so that xh = 4; and so on. Note that the order
in which the key values occur on the right-hand side does not matter. The gretl output from the
print command is shown in the lower panel of Table 7.1.

people.csv                  hholds.csv

pid,hid,gender,age,xp       hid,country,xh
1,1,M,50,1                  1,US,1
2,1,F,40,2                  6,IT,12
3,2,M,30,3                  3,UK,6
4,2,F,25,2                  4,IT,8
5,3,M,40,3                  2,US,4
6,4,F,35,4                  5,IT,10
7,4,M,70,3
8,4,F,60,3
9,5,F,20,4
10,6,M,40,4

pid   hid    xh

  1     1     1
  2     1     1
  3     2     4
  4     2     4
  5     3     6
  6     4     8
  7     4     8
  8     4     8
  9     5    10
 10     6    12

Table 7.1: Two linked CSV data files, and the effect of a join

Note that key variables are treated conceptually as integers. If a specified key contains fractional
values these are truncated.
Two extensions of the basic key mechanism are available.

• If the outer dataset contains a relevant key variable but it goes under a different name from
the inner key, you can use the --okey option to specify the outer key. (As with other right-
hand names, this does not have to be a valid gretl identifier.) So, for example, if hholds.csv
contained the hid information, but under the name HHOLD, the join command above could
be modified as
join hholds.csv xh --ikey=hid --okey=HHOLD

• If a single key is not sufficient to generate the matches you want, you can specify a double key
in the form of two series names separated by a comma; in this case the importation of data is
restricted to those rows on which both keys match. The syntax here is, for example

join foo.csv x --ikey=key1,key2

Again, the --okey option may be used if the corresponding right-hand columns are named
differently. The same number of keys must be given on the left and the right, but when a
double key is used and only one of the key names differs on the right, the name that is in
common may be omitted (although the comma separator must be retained). For example, the
second of the following lines is acceptable shorthand for the first:
join foo.csv x --ikey=key1,Lkey2 --okey=key1,Rkey2
join foo.csv x --ikey=key1,Lkey2 --okey=,Rkey2

The number of key-matches


The example shown in Table 7.1 is an instance of a 1 to 1 match: applying the matching criterion
produces exactly one value of the variable xh corresponding to each row of the inner dataset. Three
other possibilities arise:

• Some rows on the left have multiple matches on the right (“1 to n matching”).

• Some rows on the right have multiple matches on the left (“n to 1 matching”).

• Some rows in the inner dataset have no match on the right.

The first case is addressed in detail in the next section; here we discuss the others.
The n to 1 case is straightforward. If a particular key value (or combination of key values) occurs at
each of n > 1 observations on the left but at a single observation on the right, then the right-hand
value is entered at each of the matching slots on the left.
The handling of the case where there’s no match on the right depends on whether the join operation
is adding a new series to the inner dataset or modifying an existing one. If it’s a new series, then
unmatched rows automatically get NA for the imported data. However, if join is pulling in values
for a series already present on the left only matched rows will be updated. In other words we do
not overwrite an existing value on the left with NA when there’s no match on the right.
These defaults may not produce the desired results in every case but gretl provides the means to
modify the effect if need be. We will illustrate with two scenarios.
First consider adding a new series recording “number of hours worked” when the inner dataset
contains individuals and the outer file contains data on jobs. If an individual does not appear in
the jobs file, we may want to take her hours worked as implicitly zero rather than NA. In this case
gretl’s misszero() function can be used to turn NA into 0 in the imported series.
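
A sketch of this first scenario, with hypothetical file and column names (each individual is assumed
to appear at most once in the jobs file):

join jobs.csv hours --ikey=pid
series hours = misszero(hours)   # treat "no match" as zero hours worked
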
Second, consider updating a series via join when the outer file is presumed to contain all available
updated values, such that “no match” should be taken as an implicit NA. In that case we want the
(presumably out-of-date) values on any unmatched rows to be overwritten with NA. Let the series
in question be called x (both on the left and the right) and let the common key be called pid. The
solution is then

join update.csv tmpvar --data=x --ikey=pid
x = tmpvar

As a new variable, tmpvar will get NA for all unmatched rows; we then transcribe its values into
x. In a more complicated case one might use the smpl command to limit the sample range before
assigning tmpvar to x, or use the conditional assignment operator ?:.
One further point: given some missing values in an imported series you may want to know whether
(a) the NAs were explicitly represented in the outer data file or (b) they arose due to “no match”. You
can find this out by using a method described in the following section, namely the count variant of
the aggregation option: this will give you a series with 0 values for all and only unmatched rows.
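
For instance (a sketch with hypothetical names):

join outer.csv nmatch --ikey=pid --aggr=count
series unmatched = (nmatch == 0)   # 1 on rows that had no match on the right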

7.5 Aggregation
In the case of 1 to n matching of rows (n > 1) the user must specify an “aggregation method”; that
is, a method for mapping from n rows down to one. This is handled by the --aggr option which
requires a single argument from the following list:

Code        Value returned

count       count of matches
avg         mean of matching values
sum         sum of matching values
min         minimum of matching values
max         maximum of matching values
seq:i       the ith matching value (e.g. seq:2)
min(aux)    minimum of matching values of auxiliary variable
max(aux)    maximum of matching values of auxiliary variable

Note that the count aggregation method is special, in that there is no need for a “data series” on
the right; the imported series is simply a function of the specified key(s). All the other methods
require that “actual data” are found on the right. Also note that when count is used, the value
returned when no match is found is (as one might expect) zero rather than NA.
The basic use of the seq method is shown above: following the colon you give a positive integer rep-
resenting the (1-based) position of the observation in the sequence of matched rows. Alternatively,
a negative integer can be used to count down from the last match (seq:-1 selects the last match,
seq:-2 the second-last match, and so on). If the specified sequence number is out of bounds for a
given observation this method returns NA.
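
For example (again with hypothetical names), to pick up the last-listed value per key one could do

join jobs.csv last_hrs --ikey=pid --data=hrs --aggr=seq:-1
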
Referring again to the data in Table 7.1, suppose we want to import data from the persons file into
a dataset established at household level. Here’s an example where we use the individual age data
from people.csv to add the average and minimum age of household members.

open hholds.csv --quiet
join people.csv avgage --ikey=hid --data=age --aggr=avg
join people.csv minage --ikey=hid --data=age --aggr=min

Here’s a further example where we add to the household data the sum of the personal data xp, with
the twist that we apply filters to get the sum specifically for household members under the age of
40, and for women.

open hholds.csv --quiet
join people.csv young_xp --ikey=hid --filter="age<40" --data=xp --aggr=sum
join people.csv female_xp --ikey=hid --filter="gender==\"F\"" --data=xp --aggr=sum

The possibility of using an auxiliary variable with the min and max modes of aggregation gives extra
flexibility. For example, suppose we want for each household the income of its oldest member:

open hholds.csv --quiet
join people.csv oldest_xp --ikey=hid --data=xp --aggr=max(age)

7.6 String-valued key variables


The examples above use numerical variables (household and individual ID numbers) in the match-
ing process. It is also possible to use string-valued variables, in which case a match means that the
string values of the key variables compare equal (with case sensitivity). When using double keys,
you can mix numerical and string keys, but naturally you cannot mix a string variable on the left
(via ikey) with a numerical one on the right (via okey), or vice versa.
Here’s a simple example. Suppose that alongside hholds.csv we have a file countries.csv with
the following content:

country,GDP
UK,100
US,500
IT,150
FR,180

The variable country, which is also found in hholds.csv, is string-valued. We can pull the GDP of
the country in which the household resides into our households dataset with

open hholds.csv -q
join countries.csv GDP --ikey=country

which gives

hid country GDP

1 1 1 500
2 6 2 150
3 3 3 100
4 4 2 150
5 2 1 500
6 5 2 150

7.7 Importing multiple series


The examples given so far have been limited in one respect. While several columns in the outer data
file may be referenced (as keys, or in filtering or aggregation) only one column has actually provided
data—and correspondingly only one series in the inner dataset has been created or modified — per
invocation of join. However, join can handle the importation of several series at once. This
section gives an account of the required syntax along with certain restrictions that apply to the
multiple-import case.
There are two ways to specify more than one series for importation:

1. The varname field in the command can take the form of a space-separated list of names rather
than a single name.

2. Alternatively, you can give the name of an array of strings in place of varname: the elements
of this array should be the names of the series to import.

Here are the limitations:

1. The --data option, which permits the renaming of a series on import, is not available. When
importing multiple series you are obliged to accept their “outer” names, fixed up as described
in section 7.2.

2. While the other join options are available, they necessarily apply uniformly to all the series
imported via a given command. This means that if you want to import several series but using
different keys, filters or aggregation methods you must use a sequence of commands.

Here are a couple of examples of multiple imports.



# open base datafile containing keys
open PUMSdata.gdt

# join using a list of import names
join ss13pnc.csv SCHL WAGP WKHP --ikey=SERIALNO,SPORDER

# using a strings array: may be worthwhile if the array
# will be used for more than one purpose
strings S = defarray("SCHL", "WAGP", "WKHP")
join ss13pnc.csv S --ikey=SERIALNO,SPORDER

7.8 A real-world case


For a real use-case for join with cross-sectional data, we turn to the Bank of Italy’s Survey on House-
hold Income and Wealth (SHIW).1 In ASCII form the 2010 survey results comprise 47 MB of data in
29 files. In this exercise we will draw on five of the SHIW files to construct a replica of the dataset
used in Thomas Mroz’s famous paper (Mroz, 1987) on women’s labor force participation, which
contains data on married women between the ages of 30 and 60 along with certain characteristics
of their households and husbands.

1 Details of the survey can be found at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro. The ASCII (CSV)
data files for the 2010 survey are available at http://www.bancaditalia.it/statistiche/indcamp/bilfait/dismicro/annuale/ascii/ind10_ascii.zip.
Our general strategy is as follows: we create a “core” dataset by opening the file carcom10.csv,
which contains basic data on the individuals. After dropping unwanted individuals (all but married
women), we use the resulting dataset as a base for pulling in further data via the join command.
The complete script to do the job is given in the Appendix to this chapter; here we walk through
the script with comments interspersed. We assume that all the relevant files from the Bank of Italy
survey are contained in a subdirectory called SHIW.
Starting with carcom10.csv, we use the --cols option to the open command to import specific
series, namely NQUEST (household ID number), NORD (sequence number for individuals within each
household), SEX (male = 1, female = 2), PARENT (status in household: 1 = head of household, 2 =
spouse of head, etc.), STACIV (marital status: married = 1), STUDIO (educational level, coded from
1 to 8), ETA (age in years) and ACOM4C (size of town).

open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

We then restrict the sample to married women from 30 to 60 years of age, and additionally restrict
the sample of women to those who are either heads of households or spouses of the head.

smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict
smpl PARENT<3 --restrict

For compatibility with the Mroz dataset as presented in the gretl data file mroz87.gdt, we rename
the age and education variables as WA and WE respectively, we compute the CIT dummy and finally
we store the reduced base dataset in gretl format.

rename ETA WA
rename STUDIO WE
series CIT = (ACOM4C > 2)

store mroz_rep.gdt

The next step will be to get data on working hours from the jobs file allb1.csv. There’s a com-
plication here. We need the total hours worked over the course of the year (for both the women

and their husbands). This is not available as such, but the variables ORETOT and MESILAV give,
respectively, average hours worked per week and the number of months worked in 2010, each on a
per-job basis. If each person held at most one job over the year we could compute his or her annual
hours as

HRS = ORETOT * 52 * MESILAV/12

However, some people had more than one job, and in this case what we want is the sum of annual
hours across their jobs. We could use join with the seq aggregation method to construct this sum,
but it is probably more straightforward to read the allb1 data, compute the HRS values per job as
shown above, and save the results to a temporary CSV file.

open SHIW/allb1.csv --cols=1,2,8,11 --quiet
series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
store HRS.csv NQUEST NORD HRS

Now we can reopen the base dataset and join the hours variable from HRS.csv. Note that we need
a double key here: the women are uniquely identified by the combination of NQUEST and NORD. We
don’t need an okey specification since these keys go under the same names in the right-hand file.
We define labor force participation, LFP, based on hours.

open mroz_rep.gdt
join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
WHRS = misszero(WHRS)
LFP = WHRS > 0

For reference, here’s how we could have used seq to avoid writing a temporary file:

join SHIW/allb1.csv njobs --ikey=NQUEST,NORD --data=ORETOT --aggr=count
series WHRS = 0
loop i=1..max(njobs)
join SHIW/allb1.csv htmp --ikey=NQUEST,NORD --data=ORETOT --aggr="seq:$i"
join SHIW/allb1.csv mtmp --ikey=NQUEST,NORD --data=MESILAV --aggr="seq:$i"
WHRS += misszero(htmp) * 52 * misszero(mtmp)/12
endloop

To generate the work experience variable, AX, we use the file lavoro.csv: this contains a variable
named ETALAV which records the age at which the person first started work.

join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
series AX = misszero(WA - ETALAV)

We compute the woman’s hourly wage, WW, as the ratio of total employment income to annual
working hours. This requires drawing the series YL (payroll income) and YM (net self-employment
income) from the persons file rper10.csv.

join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
series WW = LFP ? (YL + YM)/WHRS : 0

The family’s net disposable income is available as Y in the file rfam10.csv; we import this as
FAMINC.

join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

Data on number of children are now obtained by applying the count method. For the Mroz repli-
cation we want the number of children under the age of 6, and also the number aged 6 to 18.

join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter=ETA<6
series K618 = KIDS - KL6

We want to add data on the women’s husbands, but how do we find them? To do this we create an
additional inner key which we’ll call H_ID (husband ID), by sub-sampling in turn on the observations
falling into each of two classes: (a) those where the woman is recorded as head of household and
(b) those where the husband has that status. In each case we want the individual ID (NORD) of the
household member whose status is complementary to that of the woman in question. So for case
(a) we subsample using PARENT==1 (head of household) and filter the join using PARENT==2 (spouse
of head); in case (b) we do the converse. We thus construct H_ID piece-wise.

# for women who are household heads
smpl PARENT==1 --restrict --replace
join SHIW/carcom10.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==2"
# for women who are not household heads
smpl PARENT==2 --restrict --replace
join SHIW/carcom10.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==1"
smpl full

Now we can use our new inner key to retrieve the husbands’ data, matching H_ID on the left with
NORD on the right within each household.

join SHIW/carcom10.csv HA --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=ETA
join SHIW/carcom10.csv HE --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=STUDIO
join HRS.csv HHRS --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=HRS --aggr=sum
HHRS = misszero(HHRS)

The remainder of the script is straightforward and does not require discussion here: we recode
the education variables for compatibility; delete some intermediate series that are not needed any
more; add informative labels; and save the final product. See the Appendix for details.
To compare the results from this dataset with those from the earlier US data used by Mroz, one can
copy the input file heckit.inp (supplied with the gretl package) and substitute mroz_rep.gdt for
mroz87.gdt. It turns out that the results are qualitatively very similar.

7.9 The representation of dates


Up to this point all the data we have considered have been cross-sectional. In the following sections
we discuss data that have a time dimension, and before proceeding it may be useful to say some-
thing about the representation of dates. Gretl takes the ISO 8601 standard as its reference point
but provides means of converting dates provided in other formats; it also offers a set of calendrical
functions for manipulating dates (isodate, isoconv, epochday and others).
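
For instance, the following console lines (merely an illustration of the functions named above)
convert a date to its epoch-day serial number and back:

scalar ed = epochday(2013,10,21)
eval isodate(ed)               # should give 20131021
eval ed - epochday(2013,1,1)   # days elapsed since January 1, 2013
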
ISO 8601 recognizes two formats for daily dates, “extended” and “basic”. In both formats dates
are given as 4-digit year, 2-digit month and 2-digit day, in that order. In extended format a dash
is inserted between the fields—as in 2013-10-21 or more generally YYYY-MM-DD— while in basic
format the fields are run together (YYYYMMDD). Extended format is more easily parsed by human
readers while basic format is more suitable for computer processing, since one can apply ordinary
arithmetic to compare dates as equal, earlier or later. The standard also recognizes YYYY-MM as
representing year and month, e.g. 2010-11 for November 2010,2 as well as a plain four-digit number
for year alone.

2 The form YYYYMM is not recognized for year and month.

One problem for economists is that the “quarter” is not a period covered by ISO 8601. This could
be represented by YYYY-Q (with only one digit following the dash) but in gretl output we in fact use a
colon, as in 2013:2 for the second quarter of 2013. (For printed output of months gretl also uses
a colon, as in 2013:06. A difficulty with following ISO here is that in a statistical context a string
such as 1980-10 may look more like a subtraction than a date.) Anyway, at present we are more
interested in the parsing of dates on input rather than in what gretl prints. And in that context note
that “excess precision” is acceptable: a month may be represented by its first day (e.g. 2005-05-01
for May, 2005), and a quarter may be represented by its first month and day (2005-07-01 for the
third quarter of 2005).
Some additional points regarding dates will be taken up as they become relevant in practical cases
of joining data.

7.10 Time-series data


Suppose our left-hand dataset is recognized by gretl as time series with a supported frequency
(annual, quarterly, monthly, weekly, daily or hourly). This will be the case if the original data were
read from a file that contained suitable time or date information, or if a time-series interpretation
has been imposed using either the setobs command or its GUI equivalent. Then — apart, perhaps,
from some very special cases—joining additional data is bound to involve matching observations
by time-period. In this case, contrary to the cross-sectional case, the inner dataset has a natural
ordering of which gretl is aware; hence, no “inner key” is required.
If, in addition, the file from which data are to be joined is in native gretl format and contains time-
series information, keys are not needed at all. Three cases can arise: the frequency of the outer
dataset may be the same, lower or higher than that of the inner dataset. In the first two cases
join should work without any special apparatus; lower-frequency values will be repeated for each
high-frequency period. In the third case, however, an aggregation method must be specified: gretl
needs to know how to map higher-frequency data into the existing dataset (by averaging, summing,
or whatever).
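
For example, if the dataset in memory is quarterly and the outer file holds monthly data in gretl
format, a sketch (with a hypothetical filename) might look like

join monthly.gdt x --aggr=avg   # each quarter gets the average of its three months
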
If the outer data file is not in native gretl format we need a means of identifying the period of each
observation on the right, an outer key which we’ll call a “time key”. The join command provides
a simple (but limited) default for extracting period information from the outer data file, plus an
option that can be used if the default is not applicable, as follows.

• The default assumptions are: (1) the time key appears in the first column; (2) the heading
of this column is either left blank or is one of obs, date, year, period, observation, or
observation_date (on a case-insensitive comparison); and (3) the time format conforms
to ISO 8601 where applicable (“extended” daily date format YYYY-MM-DD, monthly format
YYYY-MM, or annual format YYYY).

• If dates do not appear in the first column of the outer file, or if the column heading or format
is not as just described, the --tkey option can be used to indicate which column should be
used and/or what format should be assumed.

Setting the time-key column and/or format


The --tkey option requires a parameter holding the name of the column in which the time key
is located and/or a string specifying the format in which dates/times are written in the time-key
column. This parameter should be enclosed in double-quotes. If both elements are present they
should be separated by a comma; if only a format is given it should be preceded by a comma. Some
examples:

--tkey="Period,%m/%d/%Y"
--tkey="Period"
--tkey="obsperiod"
--tkey=",%Ym%m"

The first of these applies if Period is not the first column on the right, and dates are given in the
US format of month, day, year, separated by slashes. The second implies that although Period is
not the first column, the date format is ISO 8601. The third again implies that the date format is
OK; here the name is required even if obsperiod is the first column since this heading is not one
recognized by gretl’s heuristic. The last example implies that dates are in the first column (with
one of the recognized headings), but are given in the non-standard format year, “m”, month.
The date format string should be composed using the codes employed by the POSIX function
strptime; Table 7.2 contains a list of the most relevant codes.3

Code Meaning
%% The % character.
%b The month name according to the current locale, either abbreviated
or in full.
%C The century number (0–99).
%d The day of month (1–31).
%D Equivalent to %m/%d/%y. (This is the American style date, very con-
fusing to non-Americans, especially since %d/%m/%y is widely used in
Europe. The ISO 8601 standard format is %Y-%m-%d.)
%H The hour (0–23).
%j The day number in the year (1–366).
%m The month number (1–12).
%n Arbitrary whitespace.
%q The quarter (1–4).
%w The weekday number (0–6) with Sunday = 0.
%y The year within century (0–99). When a century is not otherwise spec-
ified, values in the range 69–99 refer to years in the twentieth century
(1969–1999); values in the range 00–68 refer to years in the twenty-
first century (2000–2068).
%Y The year, including century (for example, 1991).

Table 7.2: Date format codes

Example: daily stock prices


We show below the first few lines of a file named IBM.csv containing stock-price data for IBM
corporation.

Date,Open,High,Low,Close,Volume,Adj Close
2013-08-02,195.50,195.50,193.22,195.16,3861000,195.16
2013-08-01,196.65,197.17,195.41,195.81,2856900,195.81
2013-07-31,194.49,196.91,194.49,195.04,3810000,195.04

Note that the data are in reverse time-series order — that won’t matter to join; the data can appear
in any order. Also note that the first column is headed Date and holds daily dates as ISO 8601
extended. That means we can pull the data into gretl very easily. In the following fragment we
create a suitably dimensioned empty daily dataset then rely on the default behavior of join with
time-series data to import the closing stock price.

nulldata 500
setobs 5 2012-01-01
join IBM.csv Close
3 The %q code for quarter is not present in strptime; it is added for use with join since quarterly data are common
in macroeconomics.

To make explicit what we’re doing, we could accomplish exactly the same using the --tkey option:

join IBM.csv Close --tkey="Date,%Y-%m-%d"

Example: OECD quarterly data


Table 7.3 shows an excerpt from a CSV file provided by the OECD statistical site (stat.oecd.org)
in response to a request for GDP at constant prices for several countries.4

Frequency,Period,Country,Value,Flags
"Quarterly","Q1-1960","France",463876.148126845,E
"Quarterly","Q1-1960","Germany",768802.119278467,E
"Quarterly","Q1-1960","Italy",414629.791450547,E
"Quarterly","Q1-1960","United Kingdom",578437.090291889,E
"Quarterly","Q2-1960","France",465618.977328614,E
"Quarterly","Q2-1960","Germany",782484.138122549,E
"Quarterly","Q2-1960","Italy",420714.910290157,E
"Quarterly","Q2-1960","United Kingdom",572853.474696578,E
"Quarterly","Q3-1960","France",469104.41925852,E
"Quarterly","Q3-1960","Germany",809532.161494483,E
"Quarterly","Q3-1960","Italy",426893.675840156,E
"Quarterly","Q3-1960","United Kingdom",581252.066618986,E
"Quarterly","Q4-1960","France",474664.327992619,E
"Quarterly","Q4-1960","Germany",817806.132384948,E
"Quarterly","Q4-1960","Italy",427221.338414114,E
...

Table 7.3: Example of CSV file as provided by the OECD statistical website

4 Retrieved 2013-08-05. The OECD files in fact contain two leading columns with very long labels; these are irrelevant
to the present example and can be omitted without altering the sample script.

This is an instance of data in what we call atomic format, that is, a format in which each line of the
outer file contains a single data-point and extracting data mainly requires filtering the appropriate
lines. The outer time key is under the Period heading, and has the format Q<quarter>-<year>.
Assuming that the file in Table 7.3 has the name oecd.csv, the following script reconstructs the
time series of Gross Domestic Product for several countries:

nulldata 220
setobs 4 1960:1

join oecd.csv FRA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"France\""
join oecd.csv GER --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Germany\""
join oecd.csv ITA --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"Italy\""
join oecd.csv UK --tkey="Period,Q%q-%Y" --data=Value --filter="Country==\"United Kingdom\""

Note the use of the format codes %q for the quarter and %Y for the 4-digit year. A touch of elegance
could have been added by storing the invariant options to join using the setopt command, as in

setopt join persist --tkey="Period,Q%q-%Y" --data=Value
join oecd.csv FRA --filter="Country==\"France\""
join oecd.csv GER --filter="Country==\"Germany\""
join oecd.csv ITA --filter="Country==\"Italy\""
join oecd.csv UK --filter="Country==\"United Kingdom\""
setopt join clear

If one were importing a large number of such series it might be worth rewriting the sequence of
joins as a loop, as in


strings countries = defarray("France", "Germany", "Italy", "United Kingdom")
strings vnames = defarray("FRA", "GER", "ITA", "UK")
setopt join persist --tkey="Period,Q%q-%Y" --data=Value

loop foreach i countries
vname = vnames[i]
join oecd.csv @vname --filter="Country==\"$i\""
endloop
setopt join clear

7.11 Special handling of time columns


When dealing with straight time series data the tkey mechanism described above should suffice
in almost all cases. In some contexts, however, time enters the picture in a more complex way;
examples include panel data (see section 7.12) and so-called realtime data (see chapter 8). To handle
such cases join provides the --tconvert option. This can be used to select certain columns in
the right-hand data file for special treatment: strings representing dates in these columns will be
converted to numerical values: 8-digit numbers on the pattern YYYYMMDD (ISO basic daily format).
Once dates are in this form it is easy to use them in key-matching or filtering.
By default it is assumed that the strings in the selected columns are in ISO extended format,
YYYY-MM-DD. If that is not the case you can supply a time-format string using the --tconv-fmt
option. The format string should be written using the codes shown in Table 7.2.
Here are some examples:

# select one column for treatment
--tconvert=start_date

# select two columns for treatment
--tconvert="start_date,end_date"

# specify US-style daily date format
--tconv-fmt="%m/%d/%Y"

# specify quarterly date-strings (as in 2004q1)
--tconv-fmt="%Yq%q"

Some points to note:

• If a specified column is not selected for a substantive role in the join operation (as data to be
imported, as a key, or as an auxiliary variable for use in aggregation) the column in question
is not read and so no conversion is carried out.

• If a specified column contains numerical rather than string values, no conversion is carried
out.

• If a string value in a selected column fails parsing using the relevant time format (user-
specified or default), the converted value is NA.

• On successful conversion, the output is always in daily-date form as stated above. If you
specify a monthly or quarterly time format, the converted date is the first day of the month
or quarter.
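To make these points concrete, here is a minimal sketch (the file events.csv and its columns id, xvar and start_date are hypothetical): once start_date has been converted to 8-digit form its values can be used in a numerical filter comparison.

# start_date strings such as 6/30/2010 become 20100630, so a
# numerical comparison in the filter expression is legitimate
join events.csv xvar --ikey=id --tconvert=start_date \
  --tconv-fmt="%m/%d/%Y" --filter="start_date>=20100101"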

7.12 Panel data


In section 7.10 we gave an example of reading quarterly GDP data for several countries from an
OECD file. In that context we imported each country’s data as a distinct time-series variable. Now

suppose we want the GDP data in panel format instead (stacked time series). How can we do this
with join?
As a reminder, here’s what the OECD data look like:

Frequency,Period,Country,Value,Flags
"Quarterly","Q1-1960","France",463876.148126845,E
"Quarterly","Q1-1960","Germany",768802.119278467,E
"Quarterly","Q1-1960","Italy",414629.791450547,E
"Quarterly","Q1-1960","United Kingdom",578437.090291889,E
"Quarterly","Q2-1960","France",465618.977328614,E

and so on. If we have four countries and quarterly observations running from 1960:1 to 2013:2 (T
= 214 quarters) we might set up our panel workspace like this:

scalar N = 4
scalar T = 214
scalar NT = N*T
nulldata NT --preserve
setobs T 1.1 --stacked-time-series

The relevant outer keys are obvious: Country for the country and Period for the time period. Our
task is now to construct matching keys in the inner dataset. This can be done via two panel-specific
options to the setobs command. Let’s work on the time dimension first:

setobs 4 1960:1 --panel-time
series quarter = $obsdate

This variant of setobs allows us to tell gretl that time in our panel is quarterly, starting in the
first quarter of 1960. Having set that, the accessor $obsdate will give us a series of 8-digit dates
representing the first day of each quarter — 19600101, 19600401, 19600701, and so on, repeating
for each country. As we explained in section 7.11, we can use the --tconvert option on the outer
series Period to get exactly matching values (in this case using a format of Q%q-%Y for parsing the
Period values).
Now for the country names:

string cstrs = sprintf("France Germany Italy \"United Kingdom\"")
setobs country cstrs --panel-groups

Here we write into the string cstrs the names of the countries, using escaped double-quotes to
handle the space in “United Kingdom”, then pass this string to setobs with the --panel-groups
option, preceded by the identifier country. This asks gretl to construct a string-valued series
named country, in which each name will repeat T times.
We’re now ready to join. Assuming the OECD file is named oecd.csv we do

join oecd.csv GDP --data=Value \
  --ikey=country,quarter --okey=Country,Period \
  --tconvert=Period --tconv-fmt="Q%q-%Y"

Other input formats


The OECD file discussed above is in the most convenient format for join, with one data-point per
line. But sometimes we may want to make a panel from a data file structured like this:

# Real GDP
Period,France,Germany,Italy,"United Kingdom"

"Q1-1960",463863,768757,414630,578437
"Q2-1960",465605,782438,420715,572853
"Q3-1960",469091,809484,426894,581252
"Q4-1960",474651,817758,427221,584779
"Q1-1961",482285,826031,442528,594684
...

Call this file side_by_side.csv. Assuming the same initial set-up as above, we can panelize the
data by setting the sample to each country’s time series in turn and importing the relevant column.
The only point to watch here is that the string “United Kingdom”, being a column heading, will
become United_Kingdom on importing (see section 7.2) so we’ll need a slightly different set of
country strings.

strings cstrs = defarray("France", "Germany", "Italy", "United_Kingdom")
setobs country cstrs --panel-groups

loop foreach i cstrs
    smpl country=="$i" --restrict --replace
    join side_by_side.csv GDP --data=$i \
      --ikey=quarter --okey=Period \
      --tconvert=Period --tconv-fmt="Q%q-%Y"
endloop
smpl full

If our working dataset and the outer data file are dimensioned such that there are just as many
time-series observations on the right as there are time slots on the left — and the observations
on the right are contiguous, in chronological order, and start on the same date as the working
dataset—we could dispense with the key apparatus and just use the first line of the join command
shown above. However, in general it is safer to use keys to ensure that the data end up in correct
registration.
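For the record, here is a sketch of the keyless variant for a single country; it is safe only under the alignment conditions just stated.

# keyless import: relies entirely on the rows of side_by_side.csv
# lining up one-for-one with the restricted sample
smpl country=="France" --restrict --replace
join side_by_side.csv GDP --data=France
smpl full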

7.13 Memo: join options


Basic syntax: join filename varname(s) [ options ]

flag effect

--data Give the name of the data column on the right, in case it differs from
varname (7.2); single import only
--filter Specify a condition for filtering data rows (7.3)
--ikey Specify up to two keys for matching data rows (7.4)
--okey Specify outer key name(s) in case they differ from the inner ones (7.4)
--aggr Select an aggregation method for 1 to n joins (7.5)
--tkey Specify right-hand time key (7.10)
--tconvert Select outer date columns for conversion to numeric form (7.11)
--tconv-fmt Specify a format for use with tconvert (7.11)
--no-header Treat the first row on the right as data (7.2)
--verbose Report on progress in reading the outer data

Appendix: the full Mroz data script


# start with everybody; get gender, age and a few other variables
# directly while we’re at it
open SHIW/carcom10.csv --cols=1,2,3,4,9,10,29,41

# subsample on married women between the ages of 30 and 60
smpl SEX==2 && ETA>=30 && ETA<=60 && STACIV==1 --restrict
# for simplicity, restrict to heads of households and their spouses
smpl PARENT<3 --restrict

# rename the age and education variables for compatibility; compute
# the "city" dummy and finally save the reduced base dataset
rename ETA WA
rename STUDIO WE
series CIT = (ACOM4C>2)
store mroz_rep.gdt

# make a temp file holding annual hours worked per job
open SHIW/allb1.csv --cols=1,2,8,11 --quiet
series HRS = misszero(ORETOT) * 52 * misszero(MESILAV)/12
store HRS.csv NQUEST NORD HRS

# reopen the base dataset and begin drawing assorted data in
open mroz_rep.gdt

# women’s annual hours (summed across jobs)
join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
WHRS = misszero(WHRS)

# labor force participation
LFP = WHRS > 0

# work experience: ETALAV = age when started first job
join SHIW/lavoro.csv ETALAV --ikey=NQUEST,NORD
series AX = misszero(WA - ETALAV)

# women’s hourly wages
join SHIW/rper10.csv YL YM --ikey=NQUEST,NORD --aggr=sum
series WW = LFP ? (YL + YM)/WHRS : 0

# family income (Y = net disposable income)
join SHIW/rfam10.csv FAMINC --ikey=NQUEST --data=Y

# get data on children using the "count" method
join SHIW/carcom10.csv KIDS --ikey=NQUEST --aggr=count --filter="ETA<=18"
join SHIW/carcom10.csv KL6 --ikey=NQUEST --aggr=count --filter=ETA<6
series K618 = KIDS - KL6

# data on husbands: we first construct an auxiliary inner key for
# husbands, using the little trick of subsampling the inner dataset
#
# for women who are household heads
smpl PARENT==1 --restrict --replace
join SHIW/carcom10.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==2"
# for women who are not household heads
smpl PARENT==2 --restrict --replace
join SHIW/carcom10.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==1"
smpl full

# add husbands’ data via the newly-added secondary inner key
join SHIW/carcom10.csv HA --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=ETA
join SHIW/carcom10.csv HE --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=STUDIO
join HRS.csv HHRS --ikey=NQUEST,H_ID --okey=NQUEST,NORD --data=HRS --aggr=sum
HHRS = misszero(HHRS)

# final cleanup begins

# recode educational attainment as years of education
matrix eduyrs = {0, 5, 8, 11, 13, 16, 18, 21}
series WE = replace(WE, seq(1,8), eduyrs)
series HE = replace(HE, seq(1,8), eduyrs)

# cut some cruft
delete SEX STACIV KIDS YL YM PARENT H_ID ETALAV

# add some labels for the series
setinfo LFP -d "1 if woman worked in 2010"
setinfo WHRS -d "Wife’s hours of work in 2010"
setinfo KL6 -d "Number of children less than 6 years old in household"
setinfo K618 -d "Number of children between ages 6 and 18 in household"
setinfo WA -d "Wife’s age"
setinfo WE -d "Wife’s educational attainment, in years"
setinfo WW -d "Wife’s average hourly earnings, in 2010 euros"
setinfo HHRS -d "Husband’s hours worked in 2010"
setinfo HA -d "Husband’s age"
setinfo HE -d "Husband’s educational attainment, in years"
setinfo FAMINC -d "Family income, in 2010 euros"
setinfo AX -d "Actual years of wife’s previous labor market experience"
setinfo CIT -d "1 if live in large city"

# save the final product
store mroz_rep.gdt
Chapter 8

Realtime data

8.1 Introduction
As of gretl version 1.9.13 the join command (see chapter 7) has been enhanced to deal with so-
called realtime datasets in a straightforward manner. Such datasets contain information on when
the observations in a time series were actually published by the relevant statistical agency and how
they have been revised over time. Probably the most popular sources of such data are the “Alfred”
online database at the St. Louis Fed (http://alfred.stlouisfed.org/) and the OECD’s StatEx-
tracts site, http://stats.oecd.org/. The examples in this chapter deal with files downloaded
from these sources, but should be easy to adapt to files with a slightly different format.
As already stated, join requires a column-oriented plain text file, where the columns may be sepa-
rated by commas, tabs, spaces or semicolons. Alfred and the OECD provide the option to download
realtime data in this format (tab-delimited files from Alfred, comma-delimited from the OECD). If
you have a realtime dataset in a spreadsheet file you must export it to a delimited text file before
using it with join.
Representing revision histories is more complex than just storing a standard time series, because
for each observation period you have in general more than one published value over time, along
with the information on when each of these values was valid or current. Sometimes this is repre-
sented in spreadsheets with two time axes, one for the observation period and another one for the
publication date or “vintage”. The filled cells then form an upper triangle (or a “guillotine blade”
shape, if the publication dates do not reach back far enough to complete the triangle). This format
can be useful for giving a human reader an overview of realtime data, but it is not optimal for
automatic processing; for that purpose “atomic” format is best.

8.2 Atomic format for realtime data


What we are calling atomic format is exactly the format used by Alfred if you choose the option
“Observations by Real-Time Period”, and by the OECD if you select all editions of a series for
download as plain text (CSV).1 A file in this format contains one actual data-point per line, together
with associated metadata. This is illustrated in Table 8.1, where we show the first three lines from
an Alfred file and an OECD file (slightly modified).2
Consider the first data line in the Alfred file: in the observation_date column we find 1960-01-01,
indicating that the data-point on this line, namely 112.0, is an observation or measurement (in this
case, of the US index of industrial production) that refers to the period starting on January 1st
1960. The realtime_start_date value of 1960-02-16 tells us that this value was published on
February 16th 1960, and the realtime_end_date value says that this vintage remained current
through March 15th 1960. On the next day (as we can see from the following line) this data-point
was revised slightly downward to 111.0.
Daily dates in Alfred files are given in ISO extended format, YYYY-MM-DD, but below we describe
how to deal with differently formatted dates. Note that daily dates are appropriate for the last

1 If you choose to download in Excel format from OECD you get a file in the triangular or guillotine format mentioned above.
2 In the Alfred file we have used commas rather than tabs as the column delimiter; in the OECD example we have shortened the name in the Variable column.


Alfred: monthly US industrial production


observation_date,INDPRO,realtime_start_date,realtime_end_date
1960-01-01,112.0000,1960-02-16,1960-03-15
1960-01-01,111.0000,1960-03-16,1961-10-15

OECD: monthly UK industrial production


Country,Variable,Frequency,Time,Edition,Value,Flags
"United Kingdom","INDPRO","Monthly","Jan-1990","February 1999",100,
"United Kingdom","INDPRO","Monthly","Feb-1990","February 1999",99.3,

Table 8.1: Variant atomic formats for realtime data

two columns, which jointly record the interval over which a given data vintage was current. Daily
dates might, however, be considered overly precise for the first column, since the data period may
well be the year, quarter or month (as it is in fact here). However, following Alfred’s practice it is
acceptable to specify a daily date, indicating the first day of the period, even for non-daily data.3
Compare the first data line of the OECD example. There’s a greater amount of leading metadata,
which is left implicit in the Alfred file. Here Time is the equivalent of Alfred’s observation_date,
and Edition the equivalent of Alfred’s realtime_start_date. So we read that in February 1999
a value of 100 was current for the UK index of industrial production for January 1990, and from
the next line we see that in the same vintage month a value of 99.3 was current for industrial
production in February 1990.
Besides the different names and ordering of the columns, there are a few more substantive differ-
ences between Alfred and OECD files, most of which are irrelevant for join but some of which are
(possibly) relevant.
The first (irrelevant) difference is the ordering of the lines. It appears (though we’re not sure how
consistent this is) that in Alfred files the lines are sorted by observation date first and then by
publication date—so that all revisions of a given observation are grouped together —while OECD
files are sorted first by revision date (Edition) and then by observation date (Time). If we want the
next revision of UK industrial production for January 1990 in the OECD file we have to scan down
several lines until we find

"United Kingdom","INDPRO","Monthly","Jan-1990","March 1999",100,

This difference is basically irrelevant because join can handle the case where the lines appear in
random order, although some operations can be coded more conveniently if we’re able to assume
chronological ordering (either on the Alfred or the OECD pattern, it doesn’t matter).
The second (also irrelevant) difference is that the OECD seems to include periodic “Edition” lines
even when there is no change from the previous value (as illustrated above, where the UK industrial
production index for January 1990 is reported as 100 as of March 1999, the same value that we saw
to be current in February 1999), while Alfred reports a new value only when it differs from what
was previously current.
A third difference lies in the dating of the revisions or editions. As we have seen, Alfred gives
a specific daily date while (in the UK industrial production file at any rate), the OECD just dates
each edition to a month. This is not necessarily relevant for join, but it does raise the question of
whether the OECD might date revisions to a finer granularity in some of their files, in which case
one would have to be on the lookout for a different date format.
The final difference is that Alfred supplies an “end date” for each data vintage while the OECD
3 Notice that this implies that in the Alfred example it is not clear without further information whether the observation period is the first quarter of 1960, the month January 1960, or the day January 1st 1960. However, we assume that this information is always available in context.

supplies only a starting date. But there is less to this difference than meets the eye: according to
the Alfred webmaster, “by design, a new vintage must start immediately following (the day after)
the lapse of the old vintage”—so the end date conveys no independent information.4

8.3 More on time-related options


Before we get properly started it is worth saying a little more about the --tkey and --tconvert
options to join (first introduced in section 7.11), as they apply in the case of realtime data.
When you’re working with regular time series data tkey is likely to be useful while tconvert is
unlikely to be applicable (see section 7.10). On the other hand, when you’re working with panel data
tkey is definitely not applicable but tconvert may well be helpful (section 7.12). When working
with realtime data, however, depending on the task in hand both options may be useful. You will
likely need tkey; you may well wish to select at least one column for tconvert treatment; and in
fact you may want to name a given column in both contexts — that is, include the tkey variable
among the tconvert columns.
Why might this make sense? Well, think of the --tconvert option as a “preprocessing” directive:
it asks gretl to convert date strings to numerical values (8-digit ISO basic dates) “at source”, as they
are read from the outer datafile. The --tkey option, on the other hand, singles out a column as
the one to use for matching rows with the inner dataset. So you would want to name a column in
both roles if (a) it should be used for matching periods and also (b) it is desirable to have the values
from this column in numerical form, most likely for use in filtering.
As we have seen, you can supply specific formats in connection with both tkey and tconvert (in
the latter case via the companion option --tconv-fmt) to handle the case where the date strings on
the right are not ISO-friendly at source. This raises the question of how the format specifications
work if a given column is named under both options. Here are the rules that gretl applies:

1. If a format is given with the --tkey option it always applies to the tkey column alone; and
for that column it overrides any format given via the --tconv-fmt option.

2. If a format is given via tconv-fmt it is assumed to apply to all the tconvert columns, unless
this assumption is overridden by rule 1.
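As an illustration of these rules, consider the following sketch (the file vintages.csv and its columns are invented). The obsdate column is named under both options, so it is parsed using the tkey format, while reldate falls back on the --tconv-fmt specification.

join vintages.csv y --tkey="obsdate,%m/%d/%Y" \
  --tconvert="obsdate,reldate" --tconv-fmt="%Y-%m-%d" \
  --filter="reldate<=20101231" --aggr=max(reldate)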

8.4 Getting a certain data vintage


The most common application of realtime data is to “travel back in time” and retrieve the data that
were current as of a certain date in the past. This would enable you to replicate a forecast or other
statistical result that could have been produced at that date.
For example, suppose we are interested in a variable of monthly frequency named INDPRO, realtime
data on which is stored in an Alfred file named INDPRO.txt, and we want to check the status quo
as of June 15th 2011.
If we don’t already have a suitable dataset into which to import the INDPRO data, our first steps will
be to create an appropriately dimensioned empty dataset using the nulldata command and then
specify its time-series character via setobs, as in

nulldata 132
setobs 12 2004:01

For convenience we can put the name of our realtime file into a string variable. On Windows this
might look like

4 Email received from Travis May of the Federal Reserve Bank of St. Louis, 2013-10-17. This closes off the possibility that a given vintage could lapse or expire some time before the next vintage becomes available, hence giving rise to a “hole” in an Alfred realtime file.

string fname = "C:/Users/yourname/Downloads/INDPRO.txt"

We can then import the data vintage 2011-06-15 using join, arbitrarily choosing the self-explanatory
identifier ip_asof_20110615.

join @fname ip_asof_20110615 --tkey=observation_date --data=INDPRO \
  --tconvert="realtime_start_date" \
  --filter="realtime_start_date<=20110615" --aggr=max(realtime_start_date)

Here some detailed explanations of the various options are warranted:

• The --tkey option specifies the column which should be treated as holding the observation
period identifiers to be matched against the periods in the current gretl dataset.5 The more
general form of this option is --tkey="colname,format" (note the double quotes here), so
if the dates do not come in standard format, we can tell gretl how to parse them by using
the appropriate conversion specifiers as shown in Table 7.2. For example, here we could have
written --tkey="observation_date,%Y-%m-%d".

• Next, --data=INDPRO tells gretl that we want to retrieve the entries stored in the column
named INDPRO.

• As explained in section 7.11 the --tconvert option selects certain columns in the right-hand
data file for conversion from date strings to 8-digit numbers on the pattern YYYYMMDD. We’ll
need this for the next step, filtering, since the transformation to numerical values makes
it possible to perform basic arithmetic on dates. Note that since date strings in Alfred files
conform to gretl’s default assumption it is not necessary to use the --tconv-fmt option here.

• The --filter option specification in combination with the subsequent --aggr aggregation
treatment is the central piece of our data retrieval; notice how we use the date constant
20110615 in ISO basic form to do numerical comparisons, and how we perform the numerical
max operation on the converted column realtime_start_date. It would also have been
possible to predefine a scalar variable, as in
vintage = 20110615

and then use vintage in the join command instead. Here we tell join that we only want to
extract those publications that (1) already appeared before (and including) June 15th 2011,
and (2) were not yet obsoleted by a newer release.6

As a result, your dataset will now contain a time series named ip_asof_20110615 with the values
that a researcher would have had available on June 15th 2011. Of course, all values for the observa-
tions after June 2011 will be missing (and probably a few before that, too), because they only have
become available later on.

8.5 Getting the n-th release for each observation period


For some purposes it may be useful to retrieve the n-th published value of each observation, where
n is a fixed positive integer, irrespective of when each of these n-th releases was published. Sup-
pose we are interested in the third release, then the relevant join command becomes:

join @fname ip_3rdpub --tkey=observation_date --data=INDPRO --aggr="seq:3"

5 Strictly speaking, using --tkey is unnecessary in this example because we could just have relied on the default, which is to use the first column in the source file for the periods. However, being explicit is often a good idea.
6 By implementing the second condition through the max aggregation on the realtime_start_date column alone, without using the realtime_end_date column, we make use of the fact that Alfred files cannot have “holes” as explained before.

Since we do not need the realtime_start_date information for this retrieval, we have dropped
the --tconvert option here. Note that this formulation assumes that the source file is ordered
chronologically, otherwise using the option --aggr="seq:3", which retrieves the third value from
each sequence of matches, could have yielded a result different from the one intended. However,
this assumption holds for Alfred files and is probably rather safe in general.
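By the same token a negative index counts back from the end of each sequence of matches, so, assuming chronological ordering as above, a sketch for retrieving the most recent available release of each observation would be

join @fname ip_latest --tkey=observation_date --data=INDPRO --aggr="seq:-1"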
The values of the variable imported as ip_3rdpub in this way were published at different dates,
so the variable is effectively a mix of different vintages. Depending on the type of variable, this
may also imply drastic jumps in the values; for example, index numbers are regularly re-based
to different base periods. This problem also carries over to inflation-adjusted economic variables,
where the base period of the price index changes over time. Mixing vintages in general also means
mixing different scales in the output, with which you would have to deal appropriately.7

8.6 Getting the values at a fixed lag after the observation period
New data releases may take place on any day of the month, and as we have seen the specific day
of each release is recorded in realtime files from Alfred. However, if you are working with, say,
monthly or quarterly data you may sometimes want to adjust the granularity of your realtime axis
to a monthly or quarterly frequency. For example, in order to analyse the data revision process for
monthly industrial production you might be interested in the extent of revisions between the data
available two and three months after each observation period.
This is a relatively complicated task and there is more than one way of accomplishing it. Either you
have to make several passes through the outer dataset or you need a sophisticated filter, written
as a hansl function. Either way you will want to make use of some of gretl’s built-in calendrical
functions.
We’ll assume that a suitably dimensioned workspace has been set up as described above. Given
that, the key ingredients of the join are a filtering function which we’ll call rel_ok (for “release is
OK”) and the join command which calls it. Here’s the function:

function series rel_ok (series obsdate, series reldate, int p)
    series y_obs, m_obs, y_rel, m_rel
    # get year and month from observation date
    isoconv(obsdate, &y_obs, &m_obs)
    # get year and month from release date
    isoconv(reldate, &y_rel, &m_rel)
    # find the delta in months
    series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
    # and implement the filter
    return dm <= p
end function

And here’s the command:

scalar lag = 3 # choose your fixed lag here

join @fname ip_plus3 --data=INDPRO --tkey=observation_date \
  --tconvert="observation_date,realtime_start_date" \
  --filter="rel_ok(observation_date, realtime_start_date, lag)" \
  --aggr=max(realtime_start_date)

Note that we use --tconvert to convert both the observation date and the realtime start date (or
release date) to 8-digit numerical values. Both of these series are passed to the filter, which uses the
7 Some user-contributed functions may be available that address this issue, but it is beyond our scope here. Another even more complicated issue in the realtime context is that of “benchmark revisions” applied by statistical agencies, where the underlying definition or composition of a variable changes on some date, which goes beyond a mere rescaling. However, this type of structural change is not, in principle, a feature of realtime data alone, but applies to any time-series data.

built-in function isoconv to extract year and month. We can then calculate dm, the “delta months”
since the observation date, for each release. The filter condition is that this delta should be no
greater than the specified lag, p.8
This filter condition may be satisfied by more than one release, but only the latest of those will
actually be the vintage that was current at the end of the n-th month after the observation period,
so we add the option --aggr=max(realtime_start_date). If instead you want to target the
release at the beginning of the n-th month you would have to use a slightly more complicated filter
function.
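By way of a sketch (the function name is ours, and this is just one way of doing it), such a filter could accept only the releases published before the n-th month begins; the max aggregation then picks out the vintage current at that point.

function series rel_ok_start (series obsdate, series reldate, int p)
    series y_obs, m_obs, y_rel, m_rel
    isoconv(obsdate, &y_obs, &m_obs)
    isoconv(reldate, &y_rel, &m_rel)
    series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
    # accept releases up to the end of month p-1 after the
    # observation period, that is, before month p begins
    return dm < p
end function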

An illustration
Figure 8.1 shows four time series for the monthly index of US industrial production from October
2005 to June 2009: the value as of first publication plus the values current 3, 6 and 12 months out
from the observation date.9 From visual inspection it would seem that over much of this period
the Federal Reserve was fairly consistently overestimating industrial production at first release and
shortly thereafter, relative to the figure they arrived at with a lag of a year.
The script that produced this Figure is shown in full in Listing 8.1. Note that in this script we are
using a somewhat more efficient version of the rel_ok function shown above, where we pass the
required series arguments in “pointer” form to avoid having to copy them (see chapter 14).


Figure 8.1: Successive revisions to US industrial production

8.7 Getting the revision history for an observation


For our final example we show how to retrieve the revision history for a given observation (again
using Alfred data on US industrial production). In this exercise we are switching the time axis: the
observation period is a fixed point and time is “vintage time”.
A suitable script is shown in Listing 8.2. We first select an observation to track (January 1970). We
start the clock in the following month, when a data-point for this period was first published, and let

8 The filter is written on the assumption that the lag is expressed in months; on that understanding it could be used with annual or quarterly data as well as monthly. The idea could be generalized to cover weekly or daily data without much difficulty.
9 Why not a longer series? Because if we try to extend it in either direction we immediately run into the index re-basing problem mentioned in section 8.5, with big (staggered) leaps downward in all the series.

Listing 8.1: Retrieving successive realtime lags of US industrial production

function series rel_ok (series *obsdate, series *reldate, int p)
    series y_obs, m_obs, d_obs, y_rel, m_rel, d_rel
    isoconv(obsdate, &y_obs, &m_obs, &d_obs)
    isoconv(reldate, &y_rel, &m_rel, &d_rel)
    series dm = (12*y_rel + m_rel) - (12*y_obs + m_obs)
    return dm < p || (dm == p && d_rel <= d_obs)
end function

nulldata 45
setobs 12 2005:10

string fname = "INDPRO.txt"

# initial published values
join @fname firstpub --data=INDPRO --tkey=observation_date \
  --tconvert=realtime_start_date --aggr=min(realtime_start_date)

# plus 3 months
join @fname plus3 --data=INDPRO --tkey=observation_date \
  --tconvert="observation_date,realtime_start_date" \
  --filter="rel_ok(&observation_date, &realtime_start_date, 3)" \
  --aggr=max(realtime_start_date)

# plus 6 months
join @fname plus6 --data=INDPRO --tkey=observation_date \
  --tconvert="observation_date,realtime_start_date" \
  --filter="rel_ok(&observation_date, &realtime_start_date, 6)" \
  --aggr=max(realtime_start_date)

# plus 12 months
join @fname plus12 --data=INDPRO --tkey=observation_date \
  --tconvert="observation_date,realtime_start_date" \
  --filter="rel_ok(&observation_date, &realtime_start_date, 12)" \
  --aggr=max(realtime_start_date)

setinfo firstpub --graph-name="First publication"
setinfo plus3 --graph-name="Plus 3 months"
setinfo plus6 --graph-name="Plus 6 months"
setinfo plus12 --graph-name="Plus 12 months"

# set --output=realtime.pdf for PDF
gnuplot firstpub plus3 plus6 plus12 --time --with-lines \
  --output=display { set key left bottom; }

it run to the end of the vintage history (in this file, March 2013). Our outer time key is the realtime
start date and we filter on observation date; we name the imported INDPRO values as ip_jan70.
Since it sometimes happens that more than one revision occurs in a given month we need to select
an aggregation method: here we choose to take the last revision in the month.
Recall from section 8.2 that Alfred records a new revision only when the data-point in question
actually changes. This means that our imported series will contain missing values for all months
when no real revision took place. However, we can apply a simple autoregressive rule to fill in the
blanks: each missing value equals the prior non-missing value.
Figure 8.2 displays the revision history. Over this sample period the periodic re-basing of the index
overshadows amendments due to accrual of new information.

Listing 8.2: Retrieving a revision history

# choose the observation to track here (YYYYMMDD)
scalar target = 19700101

nulldata 518 --preserve
setobs 12 1970:02

join INDPRO.txt ip_jan70 --data=INDPRO --tkey=realtime_start_date \
  --tconvert=observation_date \
  --filter="observation_date==target" --aggr=seq:-1

ip_jan70 = ok(ip_jan70) ? ip_jan70 : ip_jan70(-1)

gnuplot ip_jan70 --time --with-lines --output=display


Figure 8.2: Vintages of the index of US industrial production for January 1970
Chapter 9

Temporal disaggregation

9.1 Introduction
This chapter describes and explains the facility for temporal disaggregation in gretl.1 This is im-
plemented by the tdisagg function, which supports three variants of the method of Chow and Lin
(1971); the method of Fernández (1981); and two variants of the method of Denton (1971) as modi-
fied by Cholette (1984). Given the analytical similarities between them, the three Chow–Lin variants
and the Fernández method will be grouped in the discussion below as “Chow–Lin methods”.
The balance of this section provides a gentle introduction to the idea of temporal disaggregation;
experts may wish to skip to the next section.
Basically, temporal disaggregation is the business of taking time-series data observed at some given
frequency (say, annually) and producing a counterpart series at a higher frequency (say, quarterly).
The term “disaggregation” indicates the inverse operation of aggregation, and to understand tem-
poral disaggregation it’s helpful first to understand temporal aggregation. In aggregating a high
frequency series to a lower frequency there are three basic methods, the appropriate method de-
pending on the nature of the data. Here are some illustrative examples.

• GDP: say we have quarterly GDP data and wish to produce an annual series. This is a flow
variable and the annual flow will be the sum of the quarterly values (unless the quarterly
values are annualized, in which case we would aggregate by taking their mean).

• Industrial Production: this takes the form of an index reporting the level of production over
some period relative to that in a base period in which the index is by construction 100. To
aggregate from (for example) monthly to quarterly we should take the average of the monthly
values. (The sum would give a nonsense result.) The same goes for price indices, and also for
ratios of stocks to flows or vice versa (inventory to sales, debt to GDP, capacity utilization).

• Money stock: this is typically reported as an end-of-period value, so in aggregating from
monthly to quarterly we’d take the value from the final month of each quarter. In case a
stock variable is reported as a start-of-period value, the aggregated version would be that of
the first month of the quarter. (The three rules are sketched in hansl just below.)
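The following fragment is only an illustration (it assumes a quarterly series y in an open dataset), but it shows the arithmetic involved in the three aggregation rules:

matrix yq = {y}                         # quarterly values, length 4*T
matrix C = I(rows(yq)/4) ** ones(1,4)   # annual-from-quarterly aggregation matrix
matrix y_sum = C * yq                   # flow: annual totals
matrix y_avg = C * yq / 4               # index or ratio: annual averages
matrix idx = 4 * seq(1, rows(yq)/4)     # positions 4, 8, 12, ...
matrix y_eop = yq[idx]                  # stock: end-of-period (fourth quarter) values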

A central idea in temporal disaggregation is that the high frequency series must respect both the
given low frequency data and the aggregation method. So for example, whatever numbers we come
up with for quarterly GDP, given an annual series as starting point, our numbers must sum to
the annual total. If money stock is measured at the end of the period then whatever numbers
we come up with for monthly money stock, given quarterly data, the figure for the last month of
the quarter must match that for the quarter as a whole. This is why temporal disaggregation is
sometimes called “benchmarking”: the given low frequency data constitute a benchmark which the
constructed high frequency data must match, in a well defined sense that depends on the nature
of the data.
Colloquially, we might describe temporal disaggregation as “interpolation,” but strictly speaking
interpolation applies only to stock variables. We have a known end-of-quarter value (say), which is
also the value at the end of the last month of the quarter, and we’re trying to figure out what the
1 We are grateful to Tommaso Di Fonzo, Professor of Statistical Science at the University of Padua, for detailed and precise comments on earlier drafts. Any remaining errors are, of course, our responsibility.


value might have been at the end of months 1 and 2. We’re filling in the blanks, or interpolating.
In the GDP case, however, the procedure is distribution rather than interpolation. We have a given
annual total and we’re trying to figure out how it should be distributed over the quarters. We’re
also doing distribution for variables taking the form of indices or ratios, except in this case we’re
seeking plausible values whose mean equals the given low-frequency value.
While matching the low frequency benchmark is an important constraint, it obviously does not
tie down the high frequency values. That is a job for either regression-based methods such as
Chow–Lin or non-regression methods such as Denton. Details are provided in section 9.7.

9.2 Notation and design


Some notation first: the two main ingredients in temporal disaggregation are

• a T × g matrix Y (holding the series to be disaggregated) and

• a matrix X with k columns and (s · T + m) rows (to aid in the disaggregation).

The idea is that Y contains time series data sampled at some frequency f , while each column
of X contains time series data at a higher frequency, sf . So for each observation Yt we have s
corresponding rows in X. The object is to produce a transformation of Y to frequency sf , with
the help of X (whose columns are typically called “related series” or “indicators” in the temporal
disaggregation literature), via either distribution or interpolation depending on the nature of the
data. For most of this document we will assume that g = 1, or in other words we are performing
temporal disaggregation on a single low-frequency series, but tdisagg supports “batch processing”
of several series and we return to this point in section 9.9.
If the m in (s · T + m) is greater than zero, that implies that there are some “extra” high-frequency
observations available for extrapolation — see section 9.4 for details.
We need to say something more about what goes into X. Under the Denton methods this must be a
single series, generally known as the “preliminary series”.2 For the Chow–Lin methods, X can hold
a combination of deterministic terms (e.g. constant, trend) and stochastic series. Naturally, suitable
candidates for the role of preliminary series or indicator will be variables that are correlated with Y
(and in particular, might be expected to share short-run dynamics with Y). However, it is possible
to carry out disaggregation using deterministic terms only — in the simplest case, with X containing
nothing but a constant. Experts in the field tend to frown on this, with reason: in the absence of
any genuine high-frequency information disaggregation just amounts to a “mechanical” smoothing.
But some people may have a use for such smoothing, and it’s permitted by tdisagg.
We should draw attention to a design decision in tdisagg: we have separated the specification of
indicators in X from certain standard deterministic terms that might be wanted, namely, a constant,
linear trend or quadratic trend. If you want a disaggregation without stochastic indicators, you can
omit (or set to null) the argument corresponding to X. In that case a constant (only) will be
employed automatically, but for the Chow–Lin methods one can adjust the deterministic terms
used via an option named det, described below. In other words the content of X becomes implicit.
See section 9.6 for more detail.
Here’s an important point to note when X is given explicitly: although this matrix may contain
extra observations “at the end” we assume that Y and X are correctly aligned at the start. Take
for example the annual to quarterly case: if the first observation in annual Y is for 1980 then the
first observation in quarterly X must be for the first quarter of 1980. Ensuring this is the user’s
responsibility. We will have some more to say about this in the following section.

2 There’s nothing to stop a user from constructing such a series using several primary series as input—by taking the first principal component or some other means—but that possibility is beyond our scope here.

9.3 Overview of data handling


The tdisagg function has three basic arguments, representing Y, X and s respectively (plus several
options; see below). The first two arguments can be given either in matrix form as such, or as
“dataset objects”—that is, a series for Y and a series or list of series for X. Or, as mentioned above,
X can be omitted (left implicit). This gives rise to five cases; which is most convenient will depend
on the user’s workflow.

1. Both Y and X are matrices. In this case, the size and periodicity of the currently open dataset
(if any) are irrelevant. If Y has T rows X must, of course, have at least s · T rows; if that
condition is not satisfied an “Invalid argument” error will be flagged.
2. Y is a series (or list) and X a matrix. In this case we assume that the periodicity of the
currently open dataset is the lower one, and T will be taken as equal to $nobs (the number of
observations in the current sample range). Again, X must have at least s · T rows.
3. Y is a matrix and X a series or list. We then assume that the periodicity of the currently open
dataset is the higher one, so that $nobs defines (s · T + m). And Y is supposed to be at the
lower frequency, so its number of rows gives T . We should then be able to find m as $nobs
minus s · T ; if m < 0 an error is flagged.
4. Both Y and X are “dataset objects”. We have two sub-cases here.
(a) If X is a series, or an ordinary list of series, the periodicity of the currently open dataset is
taken to be the higher one. The series (or list) containing Y should hold the appropriate
entries every s elements. For example, if s = 4, Y1 will be taken from the first observation,
Y2 from the fifth, Y3 from the ninth, and so on. In practical terms, series of this sort are
likely to be composed by repeating each element of a low-frequency variable s times.
(b) Alternatively, X could be a “MIDAS list”. The concept of a MIDAS list is fully explained in
chapter 20 but for example, in a quarterly dataset a MIDAS list would be a list of three
series, for the third, second and first month (note the ordering). In this case, the current
periodicity is taken to be the lower one and X will contain one column, corresponding to
the high-frequency representation of the MIDAS list.
5. X is omitted. If Y is given as a matrix it is taken to have T rows. Otherwise the interpretation
is determined heuristically: if the Y series is recognized by gretl as composed of repeated
low-frequency observations, or if a series result is requested, it is taken as having length sT ,
otherwise its length is taken to be T .
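For instance, a minimal sketch of case 1, using made-up numbers and requiring no open dataset, could look like this:

matrix Y = {100; 110; 125}      # three annual observations
matrix X = muniform(12, 1)      # twelve quarterly rows of a fictitious indicator
matrix yq = tdisagg(Y, X, 4)    # returns a 12 x 1 matrix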

In the previous section we flagged the importance of correct alignment of X and Y at the start of the
data; we’re now in a position to say a little more about this. If either X or Y are given in matrix form
alignment is truly the user’s responsibility. But if they are dataset objects gretl can be more helpful.
We automatically advance the start of the sample range to exclude any leading missing values, and
retard the end of the sample ranges for X and Y to exclude trailing missing values (allowing for the
possibility that X may extend beyond Y). In addition we further advance the sample start if this is
required to ensure that the X data begin in the first high-frequency sub-period (e.g. the first quarter
of a year or the first month of a quarter). But please note: when gretl automatically excludes leading
or trailing missing values, intra-sample missing values will still provoke an error.

9.4 Extrapolation
As mentioned above, if X holds covariate data which extend beyond the range of the original series
to be disaggregated then extrapolation is supported. But this is inherently risky, and becomes
riskier the longer the horizon over which it is attempted. In tdisagg extrapolation is by default
limited to one low-frequency period (= s high-frequency periods) beyond the end of the original
data. The user can adjust this behavior via the extmax member of the opts bundle passed to
tdisagg, described in the next section.
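For example, the following sketch (assuming y and X are already in place) lifts the default limit and extrapolates as far as the covariate data allow:

series ys = tdisagg(y, X, 4, _(extmax=-1))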

9.5 Function signature


The signature of tdisagg is:

matrix tdisagg(Y0, [X], int s, [bundle opts], [bundle results])

where square brackets indicate optional arguments. Note that while the return value is a matrix, if
Y0 contains a single column or series it can be assigned to a series as in

series ys = tdisagg(Y0, ...)

provided it’s of the right length to match the current dataset, or the current sample range. Details
on the arguments follow.

Y0 : Y, as a matrix, series or list.

X (optional): X as a matrix, series or list. This should not contain standard deterministic terms,
since they are handled separately (see det under opts below). If this matrix is omitted, then
disaggregation will be performed using deterministic terms only.

s (int): The temporal expansion factor, for example 3 for quarterly to monthly, 4 for annual to
quarterly or 12 for annual to monthly. We do not support cases such as monthly to weekly or
monthly to daily, where s is not a fixed integer value common to all observations; otherwise,
anything goes.

opts (bundle, optional): a bundle holding additional options. The recognized keys are (in alpha-
betical order):

aggtype (string): Specifies the type of temporal aggregation appropriate to the series in ques-
tion. The value must be one of sum (each low-frequency value is a sum of s high-frequency
values, the default); avg (each low-frequency value is the average of s high frequency val-
ues); or last or first, indicating respectively that each low-frequency value is the last
or first of s high frequency values.
det (int): Relevant only when one of the Chow–Lin methods is selected. This is a numeric
code for the deterministic terms to be included in the regressions: 0 means none; 1,
constant only; 2, constant and linear trend; 3, constant and quadratic trend. The default
is 1.
extmax (int): the maximum number of high-frequency periods over which extrapolation should
be carried out, conditional on the availability of covariate data. A zero value means no
extrapolation; a value of −1 means as many periods as possible; and a positive value
limits extrapolation to the specified number of periods. See section 9.4 for a statement
of the default value.
method (string): Selects the method of disaggregation (see the listing below). Note that the
Chow–Lin methods employ an autoregression coefficient, ρ, which captures the persis-
tence of the target series at the higher frequency and is used in GLS estimation of the
parameters linking X to Y.
• chow-lin (the default) is modeled on the original method proposed by Chow and
Lin. It uses a value of ρ computed as the transformation of a maximum-likelihood
estimate of the low-frequency autocorrelation coefficient.
• chow-lin-mle is equivalent to the method called chow-lin-maxlog in the tempdis-
agg package for R; ρ is estimated by iterated GLS using the loglikelihood as criterion,
as recommended by Bournay and Laroque (1979). (The BFGS algorithm is used inter-
nally).
• chow-lin-ssr is equivalent to the method called chow-lin-minrss-quilis in tem-
pdisagg; ρ is estimated by iterated GLS using the sum of squared GLS residuals as
criterion (L-BFGS is used internally).

• fernandez is basically “Chow–Lin with ρ = 1”. It is suitable if the target series has a
unit root, and is not cointegrated with the indicator series.
• denton-pfd is the proportional first differences variant of Denton, as modified by
Cholette. See Di Fonzo and Marini (2012) for details.
• denton-afd is the additive first differences variant of Denton (again, as modified by
Cholette). In contrast to the Chow–Lin methods, neither Denton procedure involves
regression.
plot (int): If a non-zero value is given, a simple plot is displayed by way of a “sanity check”
on the final series. See section 9.8 for details.
rho (scalar): Relevant only when one of the Chow–Lin methods is selected. If the method
is chow-lin, then rho is treated as a fixed value for ρ, thus enabling the user to by-
pass the default estimation procedure altogether. If the method is chow-lin-mle or
chow-lin-ssr, on the other hand, the supplied ρ value is used to initialize the numeri-
cal optimization algorithm.
verbose (int): Controls the verbosity of Chow–Lin or Fernández output. If 0 (the default)
nothing is printed unless an error occurs; if 1, summary output from the relevant regres-
sion is shown; if 2, in addition output from the optimizer for the iterated GLS procedure
is shown, if applicable.

results (bundle, optional): If present, this argument must be a previously defined bundle. Upon
successful completion of any of the methods other than denton it contains details of the
disaggregation under the following keys:

method : the method employed
rho : the value of ρ used
lnl : loglikelihood (maximized by the chow-lin-mle method)
SSR : sum of squared residuals (minimized by the chow-lin-ssr method)
coeff : the GLS (or OLS) coefficients
stderr : standard errors for the coefficients

If ρ is set to zero—either by specification of the user or because the estimate ρ̂ turned out
to be non-positive—then estimation of the coefficients is via OLS. In that case the lnl and
SSR values are calculated using the OLS residuals (which will be on a different scale from the
weighted residuals in GLS).
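Here is a sketch of how the two bundle arguments might be used together, assuming Y and X are already defined:

bundle opts = _(method="chow-lin-mle", aggtype="avg")
bundle res = null
matrix yq = tdisagg(Y, X, 4, opts, res)
printf "rho = %g, lnl = %g\n", res.rho, res.lnl
matrix cf = res.coeff   # the GLS (or OLS) coefficients
print cf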

9.6 Handling of deterministic terms


It may be helpful to set out clearly, in one place, how deterministic terms are handled by tdisagg.

• If X is given explicitly: No deterministic term is added when the Denton method is used (since
a single preliminary series is wanted) but a constant is added when one of the Chow–Lin
methods is selected. The latter default can be overridden (i.e. the constant removed, or a
trend added) by means of the det entry in the options bundle.

• If X is omitted: By default a constant is used for all methods. Again, for Chow–Lin this can be
overridden by specifying a det value. If for some reason you wanted Denton with just a trend
you would have to supply X containing a trend.
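For example, the following sketch requests a Chow–Lin disaggregation from annual to quarterly frequency using a constant and linear trend but no stochastic indicator (y is assumed to be a suitable low-frequency series):

series ys = tdisagg(y, null, 4, _(det=2))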

9.7 Some technical details


In this section we provide some technical details on the methods used by tdisagg. We will refer to
the version of Y converted to the high frequency sf as the “final series”.

As regards the Cholette-modified Denton methods, for the proportional first difference variant we
calculate the final series using the solution described by Di Fonzo and Marini (2012), specifically
equation (4) on page 5, and for the additive variant we draw on Di Fonzo (2003), pages 3 and 5 in
particular. Note that these procedures require the construction and inversion of a matrix of order
(s + 1)T . If both s and T are large it can therefore take some time, and be quite demanding of RAM.
As regards Chow–Lin, let ρ0 indicate the rho value passed via the options bundle (if applicable). We
then take these steps:

1. If ρ0 > 0 set ρ = ρ0 and go to step 6 if the method is chow-lin or step 7 otherwise. But if
ρ0 < 0 set ρ0 = 0.

2. Estimate via OLS a regression of Y on CX,3 where C is the appropriate aggregation matrix. Let
β̂OLS equal the coefficients from this regression. If ρ0 = 0 and the method is chow-lin go to
step 8.

3. Calculate the (low frequency) first order autocorrelation of the OLS residuals, ρ̂L . If ρ̂L ≥ 10−6
go to step 4. Otherwise, if the method is chow-lin set ρ = 0 and go to step 8, else set ρ = 0.5
and go to step 7.

4. Refine the positive estimate of ρ̂L via Maximum Likelihood estimation of the AR(1) specifica-
tion as described in Davidson and MacKinnon (2004).

5. If ρ̂L < 0.999, set ρ to the high-frequency counterpart of ρ̂L using the approach given in Chow
and Lin (1971). Otherwise set ρ = 0.999. If the method is chow-lin, go to step 6, otherwise
go to step 7.

6. Perform GLS with the given value of ρ, store the coefficients as β̂GLS and go to step 9.

7. Perform iterated GLS starting from the prior value of ρ, adjusting ρ with the goal of either
maximizing the loglikelihood (method chow-lin-mle) or minimizing the sum of squared GLS
residuals (chow-lin-ssr); set β̂GLS to the final coefficient estimates; and go to step 9.

8. Calculate the final series as Xβ̂OLS + C′ (CC′ )−1 ûOLS , where ûOLS indicates the OLS residuals,
and stop.

9. Calculate the final series as Xβ̂GLS + VC′ (CVC′ )−1 ûGLS , where ûGLS indicates the GLS residuals
and V is the estimated high-frequency covariance matrix.

A few notes on our Chow–Lin algorithm follow.

• One might question the value of performing steps 2 to 5 when the method is one that calls
for GLS iteration (chow-lin-mle or chow-lin-ssr), but our testing indicates that it can be
helpful to have a reasonably good estimate of ρ in hand before embarking on these iterations.

• Conversely, one might wonder why we bother with GLS iterations if we find ρ̂L < 10−6 . But
this allows for the possibility (most likely associated with small sample size) that iteration
will lead to ρ > 0 even when the estimate based on the initial OLS residuals is zero or negative.

• Note that in all cases we are discarding an estimate of ρ < 0 (truncating to 0), which we take
to be standard in this field. In our iterated GLS we achieve this by having the optimizer pick
values x in [−∞, +∞] which are translated to [0, 1] via the logistic CDF, ρ = 1/(1 + exp(−x)).
To be precise, that’s the case with chow-lin-mle. But we find that the chow-lin-ssr method
is liable to overestimate ρ and proceed to values arbitrarily close to 1, resulting in numerical
problems. We therefore bound this method to x in [−20, +6.9], corresponding to ρ values
between near-zero and approximately 0.999.4
3 Strictly speaking, CX uses only the first sT rows of X if m > 0.
4 It may be worth noting that the tempdisagg package for R limits both methods to a maximum ρ of 0.999. We find, however, that the ML method can “look after itself”, and does not require the fixed upper bound short of 1.0.

Temporal disaggregation (chow-lin)



Figure 9.1: Example output from plot option, showing annual GNP (red) and quarterly final series (blue) using
quarterly industrial production as indicator.

As for the Fernández method, this is quite straightforward. The place of the high-frequency co-
variance matrix V in Chow–Lin is taken by (D′ D)−1 , where D is the approximate first-differencing
matrix, with 1 on the diagonal and −1 on the first sub-diagonal. For efficient computation, however,
we store neither D nor D′ D as such, and do not perform any explicit inversion. The special struc-
ture of (D′ D)−1 makes it possible to produce the effect of pre-multiplication by this matrix with
O(T 2 ) floating-point operations. Estimation of ρ is not an issue since it equals 1 by assumption.

9.8 The plot option


The semantics of this option may be enriched in future but for now it’s a simple boolean switch. The
effect is to produce a time series plot of the final series along with the original low-frequency series,
shown in “step” form. If aggregation is by sum the final series is multiplied by s for comparability
with the original. If the disaggregation has been successful these two series should track closely
together, with the final series showing plausible short-run dynamics. An example is shown in
Figure 9.1.
If there are many observations, the two lines may appear virtually coincident. In that case one
can see what’s going on in more detail by exploiting the “Zoom” functionality of the plot, which is
accessed via the right-click menu in the plot window.
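For example, with an annual series y_low and a quarterly indicator x_ind (hypothetical names), something along the following lines should produce such a plot; we assume the switch is passed in the options bundle under the key plot, alongside the other options illustrated in Listing 9.1:

list X = x_ind
series y_q = tdisagg(y_low, X, 4, _(aggtype="sum", plot=1))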

9.9 Multiple low-frequency series


We now return to a point mentioned in section 9.2, namely, that Y may be given as a T × g matrix
with g > 1, or a list of g series. This means that a single call to tdisagg can be used to process
several input series (“batch processing”), in which case the return value is a matrix with (s · T + m)
rows and g columns.
There are some restrictions. First and most obviously, a single call to tdisagg implies a single
selection of “indicators” or “related series” (X) and a single selection of options (aggregation type
of the data, deterministic terms, disaggregation method, and so on). So this possibility will be
relevant only if you have several series that “want the same treatment.” In addition, if g > 1 the
plot and verbose options are ignored and the results bundle is not filled; if you need those
features you should supply a single series or vector in Y.

The advantage of batch processing lies in the spreading of fixed computational cost, leading to
shorter execution time. However, the relative importance of the fixed cost differs substantially
according to the disaggregation method. For the Chow–Lin methods the fixed cost is relatively
small and so little speed-up can be expected, but for the Denton methods it dominates, and (in our
testing) you can process g > 1 series in little more time than it takes to process a single series.
As they say, “Your mileage may vary,” but if you have a large number of series to be disaggregated
via one of the Denton methods you may well find it much faster to use the batch facility of tdisagg.
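A minimal sketch of batch usage follows; the series names are hypothetical and the options are kept to those already illustrated above.

# three low-frequency series that want the same treatment
list Y = y1 y2 y3
list X = x_ind                  # a single common indicator
matrix M = tdisagg(Y, X, 3, _(aggtype="sum"))
# M has (s*T + m) rows and one column per member of Y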

9.10 Examples
Listing 9.1 shows an example of usage and its output. The data are drawn from the St Louis
Fed; we disaggregate quarterly GDP to monthly with the help of industrial production and payroll
employment, using the default Chow–Lin method.
Several other example scripts are available from http://gretl.sourceforge.net/tdisagg/.

Listing 9.1: Example of tdisagg usage


Input:
### Traditional Chow-Lin: y is a series with repetition
### and X is a list of series. This corresponds to case 4(a)
### as described in section 9.3 of the documentation above.
###

# ensure that no data are in place
clear

# open gretl’s St Louis Fed database
open fedstl.bin

# import two monthly series
data indpro payems

# import quarterly GDP (values are repeated)
data gdpc1

# restrict sample to complete data
smpl --no-missing

# disaggregate GDP from quarterly to monthly, using
# industrial production and payroll employment as indicators
scalar s = 3
list X = indpro payems
series gdpm = tdisagg(gdpc1, X, s, _(verbose=1, aggtype="sum"))

Output:
Aggregation type sum
GLS estimates (chow-lin) T = 294
Dependent variable: gdpc1

              coefficient    std. error    t-ratio    p-value
  ------------------------------------------------------------
  const        312.394       263.372        1.186     0.2365
  indpro        10.9158        1.75785      6.210     1.83e-09  ***
  payems         0.0242860     0.00171935  14.13      7.39e-35  ***

rho = 0.999, SSR = 51543.9, lnl = -1604.98

Generated series gdpm (ID 4)


Chapter 10

Special functions in genr

10.1 Introduction
The genr command provides a flexible means of defining new variables. At the same time, the
somewhat paradoxical situation is that the “genr” keyword is almost never visible in gretl scripts.
For example, it is not really recommended to write a line such as genr b = 2.5, because there are
the following alternatives:

• scalar b = 2.5, which also invokes the genr apparatus in gretl, but provides explicit type
information about the variable b, which is usually preferable. (gretl’s language hansl is stati-
cally typed, so b cannot switch from scalar to string or matrix, for example.)

• b = 2.5, leaving it to gretl to infer the admissible or most “natural” type for the new object,
which would again be a scalar in this case.

• matrix b = {2.5}: This formulation is required if b is going to be expanded with additional
rows or columns later on. Otherwise, gretl’s static typing would not allow b to be promoted
from scalar to matrix, so it must be a matrix right from the start, even if it is of dimension
1 × 1 initially. (This definition could also be written as matrix b = 2.5, but the more explicit
form is recommended.)

In addition to scalar or matrix, the other type keywords that can be used in place of the generic
genr term are those enumerated in chapter 11 below. In the case of an array, the concrete
specification should be used, that is, one of matrices, strings, lists or bundles.1
Therefore, there’s only a handful of special cases where it is really necessary to use the “genr”
keyword:

• genr time — Creates a time trend variable (1,2,3,. . . ) under the name time. Note that within
an appropriately defined panel dataset this variable honors the panel structure and is a true
time index. (In a cross-sectional dataset, the command will still work and produces the same
result as genr index below, but of course no temporal meaning exists.)

• genr index — Creates an observation variable named index, running from 1 to the sample
size.

• genr unitdum — In the context of panel data, creates a set of dummies for the panel groups
or “units”. These are named du_1, du_2, and so forth. Actually, this particular genr usage is
not strictly necessary, because a list of group dummies can also be obtained as:

series gr = $unit
list groupdums = dummify(gr, NA)

(The NA argument to the dummify function has the effect of not skipping any unit as the
reference group, thus producing the full set of dummies.)

1 A recently added advanced datatype is an array of arrays, with the associated type specifier arrays.


• genr timedum — Again for panel data, creates a set of dummies for the time periods, named
dt_1, dt_2, . . . . And again, a list-producing variant without genr exists, using the special
accessor $obsminor which indexes time in the panel context and can be used as a substitute
for time from above:

series tindex = $obsminor
list timedums = dummify(tindex, NA)

Finally, there also exists genr dummy, which produces a set of seasonal dummies. However, it is
recommended to use the seasonals() function instead, which can also return centered dummies.
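For example, the following call should yield a full set of centered seasonal dummies (the argument values are based on the function’s documented signature; see the Function Reference for details):

list sdums = seasonals(0, 1)   # baseline 0 = omit no period; 1 = centered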
The rest of this chapter discusses other special function aspects.

10.2 Cumulative densities and p-values


The two functions cdf and pvalue provide complementary means of examining values from 17
probability distributions (as of July 2021), the most important of which are the standard normal,
Student’s t, χ², F, gamma and binomial. The syntax of these functions is set out in the Gretl
Command Reference; here we expand on some subtleties.
The cumulative density function or CDF for a random variable is the integral of the variable’s
density from its lower limit (typically either −∞ or 0) to any specified value x. The p-value (at
least the one-tailed, right-hand p-value as returned by the pvalue function) is the complementary
probability, the integral from x to the upper limit of the distribution, typically +∞.
In principle, therefore, there is no need for two distinct functions: given a CDF value p0 you could
easily find the corresponding p-value as 1 − p0 (or vice versa). In practice, with finite-precision com-
puter arithmetic, the two functions are not redundant. This requires a little explanation. In gretl,
as in most statistical programs, floating point numbers are represented as “doubles” — double-
precision values that typically have a storage size of eight bytes or 64 bits. Since there are only so
many bits available, only so many floating-point numbers can be represented: doubles do not model
the real line. Typically doubles can represent numbers over the range (roughly) ±1.7977 × 10³⁰⁸,
but only to about 15 digits of precision.
Suppose you’re interested in the left tail of the χ² distribution with 50 degrees of freedom: you’d
like to know the CDF value for x = 0.9. Take a look at the following interactive session:

? scalar p1 = cdf(X, 50, 0.9)
Generated scalar p1 = 8.94977e-35
? scalar p2 = pvalue(X, 50, 0.9)
Generated scalar p2 = 1
? scalar test = 1 - p2
Generated scalar test = 0

The cdf function has produced an accurate value, but the pvalue function gives an answer of 1,
from which it is not possible to retrieve the answer to the CDF question. This may seem surprising
at first, but consider: if the value of p1 above is correct, then the correct value for p2 is
1 − 8.94977 × 10⁻³⁵. But there’s no way that value can be represented as a double: that would require over 30
digits of precision.
Of course this is an extreme example. If the x in question is not too far off into one or other tail
of the distribution, the cdf and pvalue functions will in fact produce complementary answers, as
shown below:

? scalar p1 = cdf(X, 50, 30)
Generated scalar p1 = 0.0111648
? scalar p2 = pvalue(X, 50, 30)
Generated scalar p2 = 0.988835
? scalar test = 1 - p2
Generated scalar test = 0.0111648

But the moral is that if you want to examine extreme values you should be careful in selecting the
function you need, in the knowledge that values very close to zero can be represented as doubles
while values very close to 1 cannot.
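In practice this means: for a probability deep in the right tail, call pvalue directly rather than computing 1 − cdf. A small illustration (the printed values are approximate):

# right tail of the standard normal at x = 10
scalar a = pvalue(z, 10)     # about 7.6e-24, representable as a double
scalar b = 1 - cdf(z, 10)    # evaluates to 0, since cdf(z, 10) rounds to 1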

10.3 Retrieving internal variables (dollar accessors)


A very useful feature is the ability to retrieve, in a script, various values calculated by gretl in the
course of estimating models or testing hypotheses. Since these variables all start with a literal $
character, they are called “dollar accessors”. The variables that can be retrieved in this way are
listed in the Gretl Command Reference or in the built-in function help under the Help menu. The dollar accessors
can be used like other gretl objects in script assignments or statements. Some of those accessors
are actually independent of any estimation or test and describe, for example, the context of the
running gretl program. But here we say a bit more about the special variables $test and $pvalue.
These variables hold, respectively, the value of the last test statistic calculated using an explicit
testing command and the p-value for that test statistic. If no such test has been performed at the
time when these variables are referenced, they will produce the missing value code. Some “explicit
testing commands” that work in this way are as follows (among others): add (joint test for the sig-
nificance of variables added to a model); adf (Augmented Dickey–Fuller test, see below); arch (test
for ARCH); chow (Chow test for a structural break); coeffsum (test for the sum of specified coef-
ficients); coint (Engle-Granger cointegration test); cusum (the Harvey–Collier t-statistic); difftest
(test for a difference of two groups); kpss (KPSS stationarity test, no p-value available); modtest
(see below); meantest (test for difference of means); omit (joint test for the significance of vari-
ables omitted from a model); reset (Ramsey’s RESET); restrict (general linear restriction); runs
(runs test for randomness); and vartest (test for difference of variances). In most cases both a
$test and a $pvalue are stored; the exception is the KPSS test, for which a p-value is not currently
available.
The modtest command (which must follow an estimation command) offers several diagnostic tests;
the particular test performed depends on the option flag provided. Please see the Gretl Command
Reference and, for example, chapters 32 and 31 of this Guide for details.
An important point to notice about this mechanism is that the internal variables $test and $pvalue
are over-written each time one of the tests listed above is performed. If you want to reference these
values, you must do so at the correct point in the sequence of gretl commands.
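A minimal sketch (with hypothetical series names y, x1 and x2): estimate a model, run one test, and store the results immediately, before any subsequent test overwrites them.

ols y 0 x1 x2
modtest --white              # White's heteroskedasticity test
scalar W = $test             # the test statistic
scalar W_pv = $pvalue        # its p-value
printf "White test = %g (p-value %g)\n", W, W_pv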
Chapter 11

Gretl data types

11.1 Introduction
Gretl offers the following data types:

scalar   holds a single numerical value
series   holds n numerical values, where n is the number of observations in the current dataset
matrix   holds a rectangular array of numerical values, of any (two) dimensions
list     holds the ID numbers of a set of series
string   holds an array of characters
bundle   holds zero or more objects of various types
array    holds zero or more objects of a given type

The “numerical values” mentioned above are all double-precision floating point numbers.
In this chapter we give a run-down of the basic characteristics of each of these types and also
explain their “life cycle” (creation, modification and destruction). The list and matrix types, whose
uses are relatively complex, are discussed at greater length in chapters 15 and 17 respectively.

11.2 Series
We begin with the series type, which is the oldest and in a sense the most basic type in gretl. When
you open a data file in the gretl GUI, what you see in the main window are the ID numbers, names
(and descriptions, if available) of the series read from the file. All the series existing at any point in
a gretl session are of the same length, although some may have missing values. The variables that
can be added via the items under the Add menu in the main window (logs, squares and so on) are
also series.
For a gretl session to contain any series, a common series length must be established. This is
usually achieved by opening a data file, or importing a series from a database, in which case the
length is set by the first import. But one can also use the nulldata command, which takes as its
single argument the desired length, a positive integer.
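For example, the following establishes an empty dataset of 100 observations and then adds a series of that length:

nulldata 100
series eps = normal()   # 100 standard normal draws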
Each series has these basic attributes: an ID number, a name, and of course n numerical values.
A series may also have a description (which is shown in the main window and is also accessible
via the labels command), a “display name” for use in graphs, a record of the compaction method
used in reducing the variable’s frequency (for time-series data only) and flags marking the variable
as discrete and/or as a numeric encoding of a qualitative characteristic. These attributes can be
edited in the GUI by choosing Edit Attributes (either under the Variable menu or via right-click), or
by means of the setinfo command.
In the context of most commands you are able to reference series by name or by ID number as you
wish. The main exception is the definition or modification of variables via a formula; here you must
use names since ID numbers would get confused with numerical constants.
Note that series ID numbers are always consecutive, and the ID number for a given series will change
if you delete a lower-numbered series. In some contexts, where gretl is liable to get confused by
such changes, deletion of low-numbered series is disallowed.

Discrete series
It is possible to mark variables of the series type as discrete. The meaning and uses of this facility
are explained in chapter 12.

String-valued series
It is generally expected that series in gretl will be “properly numeric” (on a ratio or at least an
ordinal scale), or the sort of numerical indicator variables (0/1 “dummies”) that are traditional in
econometrics. However, “string-valued” series are also supported — see chapter 16 for details.

11.3 Scalars
The scalar type is relatively simple: just a convenient named holder for a single numerical value.
Scalars have none of the additional attributes pertaining to series, do not have ID numbers, and
must be referenced by name. A common use of scalar variables is to record information made
available by gretl commands for further processing, as in scalar s2 = $sigma^2 to record the
square of the standard error of the regression following an estimation command such as ols.
You can define and work with scalars in gretl without having any dataset in place.
In the gretl GUI, scalar variables can be inspected and their values edited via the “Icon view” (see
the View menu in the main window).

11.4 Matrices
Matrices in gretl work much as in other mathematical software (e.g. MATLAB, Octave). Like scalars
they have no ID numbers and must be referenced by name, and they can be used without any
dataset in place. Matrix indexing is 1-based: the top-left element of matrix A is A[1,1]. Matrices
are discussed at length in chapter 17; advanced users of gretl will want to study this chapter in
detail.
Matrices have two optional attributes beyond their numerical content: they may have column and/or
row names attached; these are displayed when the matrix is printed. See the cnameset and
rnameset functions for details.
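A minimal sketch of attaching such names (see the Function Reference for the full calling options):

matrix A = {1, 2; 3, 4}
cnameset(A, "alpha beta")   # column names
rnameset(A, "r1 r2")        # row names
print A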
In the gretl GUI, matrices can be inspected, analysed and edited via the Icon view item under the
View menu in the main window: each currently defined matrix is represented by an icon.

11.5 Lists
As with matrices, lists merit an explication of their own (see chapter 15). Briefly, named lists can
(and should!) be used to make command scripts less verbose and repetitious, and more easily
modifiable. Since lists are in fact lists of series ID numbers they can be used only when a dataset is
in place.
In the gretl GUI, named lists can be inspected and edited under the Data menu in the main window,
via the item Define or edit list.

11.6 Strings
String variables may be used for labeling, or for constructing commands. They are discussed in
chapter 15. They must be referenced by name; they can be defined in the absence of a dataset.
Such variables can be created and modified via the command-line in the gretl console or via script;
there is no means of editing them via the gretl GUI.

11.7 Bundles
A bundle is a container or wrapper for various sorts of objects — primarily scalars, matrices,
strings, arrays and bundles. (Yes, a bundle can contain other bundles). Secondarily, series and
lists can be placed in bundles but this is subject to important qualifications noted below.
A bundle takes the form of a hash table or associative array: each item placed in the bundle is
associated with a key which can be used to retrieve it subsequently. We begin by explaining the
mechanics of bundles then offer some thoughts on what they are good for.
There are several ways of creating a bundle. Here are the first two:

• Just “declare” it, as in


bundle foo

• or define an empty bundle using the null keyword:


bundle foo = null

These formulations are basically equivalent, in that they both create an empty bundle. The differ-
ence is that the second variant may be reused — if a bundle named foo already exists the effect is
to empty it — while the first may only be used once in a given gretl session; it is an error to attempt
to declare a variable that already exists.
To create a bundle and populate it with some members in one go, you can use the defbundle
function, introduced in gretl 2017b. For example:

bundle foo = defbundle("x", 13, "mat", I(3), "str", "some string")

Here the arguments come in pairs: key followed by the object to be associated with the key, with
all terms comma-separated. However, you may prefer to use one or other of the alternative idioms
introduced in gretl 2021a. The first of these looks like this:

bundle foo = _(x = 13, mat = I(3), str = "some string")

It’s more streamlined than defbundle but not quite so flexible. You don’t have to quote the keys,
but that also means that you can’t give the name of a key as a string variable; it’s always taken as a
string literal. Yet more streamlined but also less flexible is this variant:

bundle foo = _(x, mat, str)

which works if and only if there are existing objects x, mat and str in scope and you want to add
them to the bundle under keys equal to their own names.
For more on the defbundle function, see the Gretl Command Reference or the Function Reference
under Help in the GUI program.
To add an object to a bundle you assign to a compound left-hand value: the name of the bundle
followed by the key. Two forms of syntax are acceptable in this context. The recommended syntax
(for most uses) is bundlename.key ; that is, the name of the bundle followed by a dot, then the key.
Both the bundle name and the key must be valid gretl identifiers.1 For example, the statement

foo.matrix1 = m
1 As a reminder: 31 characters maximum, starting with a letter and composed of just letters, numbers or underscore.

adds an object called m (presumably a matrix) to bundle foo under the key matrix1. If you wish to
make it explicit that m is supposed to be a matrix you can use the form

matrix foo.matrix1 = m

Alternatively, a bundle key may be given as a string enclosed in square brackets, as in

foo["matrix1"] = m

This syntax offers greater flexibility in that the key string does not have to be a valid identifier (for
example it can include spaces). In addition, when using the square bracket syntax it is possible to
use a string variable to define or access the key in question. For example:

string s = "matrix 1"
foo[s] = m # matrix is added under key "matrix 1"

To get an item out of a bundle, again use the name of the bundle followed by the key, as in

matrix bm = foo.matrix1
# or using the alternative notation
matrix bm = foo["matrix1"]
# or using a string variable
matrix bm = foo[s]

Note that the key identifying an object within a given bundle is necessarily unique. If you reuse an
existing key in a new assignment, the effect is to replace the object which was previously stored
under the given key. It is not required that the type of the replacement object is the same as that
of the original.
Also note that when you add an object to a bundle, what in fact happens is that the bundle acquires
a copy of the object. The external object retains its own identity and is unaffected if the bundled
object is replaced by another. Consider the following script fragment:

bundle foo
matrix m = I(3)
foo.mykey = m
scalar x = 20
foo.mykey = x

After the above commands are completed bundle foo does not contain a matrix under mykey, but
the original matrix m is still in good health.
To delete an object from a bundle use the delete command, with the bundle/key combination, as
in

delete foo.mykey

This destroys the object associated with mykey and removes the key from the hash table.
To determine whether a bundle contains an object associated with a given key, use the inbundle()
function. This takes two arguments: the name of the bundle and the key string. The value returned
by this function is an integer which codes for the type of the object (0 for no match, 1 for scalar, 2
for series, 3 for matrix, 4 for string, 5 for bundle and 6 for array). The function typestr() may be
used to get the string corresponding to this code. For example:

scalar type = inbundle(foo, x)
if type == 0
    print "no matching object was found"
else
    printf "the object is of type %s\n", typestr(type)
endif
