Gnuplot Tips for Data Visualization
You may ask, why bother changing the size in the gnuplot command file? After all, PDF and EPS are
both vector formats, so the graphs can be scaled at will. True, but a uniform scaling will also affect
the font size, which may end up looking wrong. You can get optimal results by experimenting with
the font and size options to gnuplot’s set term command. Here are some examples (comments
follow below).
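# pdfcairo, regular size (the point sizes here are merely illustrative)
set term pdfcairo font "Sans,6" size 5in,3.5in
# or small size
set term pdfcairo font "Sans,5" size 3in,2in
# pdf, regular size
set term pdf font "Helvetica,8" size 5in,3.5in
# or small size
set term pdf font "Helvetica,6" size 3in,2in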
# postscript, regular
set term post eps solid font "Helvetica,16"
# or small
set term post eps solid font "Helvetica,12" size 3in,2in
On the first line we set a sans serif font for pdfcairo at a suitable size for a 5 × 3.5 inch plot
(which you may find looks better than the rather “letterboxy” default of 5 × 3). And on the second
we illustrate what you might do to get a smaller 3 × 2 inch plot. You can specify the plot size in
centimeters if you prefer, as in
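# for example (the dimensions here are merely illustrative)
set term pdfcairo font "Sans,6" size 13cm,9cm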
We then repeat the exercise for the pdf terminal. Notice that here we’re specifying one of the 35
standard PostScript fonts, namely Helvetica. Unlike pdfcairo, the plain pdf driver is unlikely to
be able to find fonts other than these.
In the third pair of lines we illustrate options for the postscript driver (which, as you see, can
be abbreviated as post). Note that here we have added the option solid. Unlike most other
drivers, this one uses dashed lines unless you specify the solid option. Also note that we’ve
(apparently) specified a much larger font in this case. That’s because the eps option in effect tells
the postscript driver to work at half-size (among other things), so we need to double the font
size.
Table 6.1 summarizes the basics for the three drivers we have mentioned.
To find out more about gnuplot visit www.gnuplot.info. This site has documentation for the current
version of the program in various formats.
Additional tips
To be written. Line widths, enhanced text. Show a “before and after” example.
gnuplot y1 y2 x
will give you a basic XY plot of the two series y1 and y2 on the vertical axis versus the series x on
the horizontal axis. In general, the arguments to the gnuplot command are a list of series, the last
of which goes on the x-axis, while all the others go on the y-axis. By default, the gnuplot
command gives you a scatterplot. If you have just one variable on the y-axis, then gretl will also
draw the OLS interpolation, if the fit is good enough.2
Several aspects of the behavior described above can be modified. You do this by appending options
to the command. Most options can be broadly grouped in three categories:
1. Plot styles: we support points (the default choice), lines, lines and points together, and im-
pulses (vertical lines).
2. Algorithm for the fitted line: here you can choose between linear, quadratic and cubic inter-
polation, but also more exotic choices, such as semi-log, inverse or loess (non-parametric). Of
course, you can also turn this feature off.
3. Input and output: you can choose whether you want your graph on your computer screen
(and possibly use the in-built graphical widget to further customize it — see above, page 37),
or rather save it to a file. We support several graphical formats, among which PNG and PDF,
to make it easy to incorporate your plots into text documents.
The following script uses the AWM dataset to exemplify some traditional plots in macroeconomics:
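# a sketch of such a script; the command lines are reconstructed from the
# description that follows
open AWM.gdt --quiet
# 1: consumption against income, scatter with an automatic fit
gnuplot PCR YER
# 2: the same, but force immediate display even from a script
gnuplot PCR YER --output=display
# 3: the two series plotted against time
gnuplot PCR YER --time-series --output=display

# controlling the fitted line
gnuplot PCR YER --fit=none --output=display
gnuplot PCR YER --fit=inverse --output=display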
2 The technical condition for this is that the two-tailed p-value for the slope coefficient should be under 10%.
These examples use variables from the “area-wide model” dataset by the European Central Bank
(ECB) which is shipped with gretl in the AWM.gdt file. PCR is aggregate private real consumption
and YER is real GDP. The first command line above thus plots consumption against income as a
kind of Keynesian consumption function. More precisely, it produces a simple scatter plot with
an automatic linear fit. If this is executed in the gretl console the plot will be directly
shown in a new window, but if this line is contained in a script then instead a file with the plot
commands will be saved for later execution. The second example line changes this behavior for a
script command and forces the plot to be shown directly.
The third line instead asks for a plot of the two variables as two separate curves against time on
the x-axis. Each observation point is drawn separately with a certain symbol determined by gnuplot
defaults. If you add the option --with-lines the points will be connected with a continuous line
and the symbols omitted.
The second set of example lines above demonstrates how the fitted line in the scatter plot can be
controlled from gretl’s side. The option --fit=none overrides gnuplot’s default to draw a line if it
deems the fit to be “good enough”. The effect of --fit=inverse is to consider the variable on the
y-axis as a function of 1/X instead of X and draw the corresponding hyperbolic branch. For the
workings of a Loess fit (locally-weighted polynomial regression) please refer to the documentation
of the loess function.
For more detail, consult the Gretl Command Reference.
options: specify multiple options on a single line; if more than one option is given on a line, the
options should be separated by spaces.
printf: a printf statement whose result will be passed to gnuplot literally; this allows the use of
string variables without having to resort to @-style string substitution.
The options available are basically those of the current gnuplot command, but with a few dif-
ferences. For one thing you don’t need the leading double-dash in an "option" (or "options") line.
Besides that,
• You can’t use the option --matrix=whatever with plot: that possibility is handled by pro-
viding the name of a matrix on the initial plot line.
• The --input=filename option is not supported: use gnuplot for the case where you’re
supplying the entire plot specification yourself.
• The several options pertaining to the presence and type of a fitted line are replaced in plot
by a single option fit which requires a parameter. Supported values for the parameter are:
none, linear, quadratic, cubic, inverse, semilog and loess. Example:
option fit=quadratic
As with gnuplot, the default is to show a linear fit in an X-Y scatter if it’s significant at the 10
percent level.
Here’s a simple example, the plot specification from the “bandplot” package, which shows how
to achieve the same result via the gnuplot command and a plot block, respectively — the latter
occupies a few more lines but is clearer.
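A sketch of the single-command version (reconstructed for illustration only, and assuming the matrix plotmat has three columns to be plotted):

gnuplot 1 2 3 --matrix=plotmat --with-lines --fit=none --output=display \
  { set linetype 3 lc rgb "#0000ff"; set nokey; \
    set title "@title"; set xlabel "@xname"; }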
plot plotmat
options with-lines fit=none
literal set linetype 3 lc rgb "#0000ff"
literal set nokey
printf "set title \"%s\"", title
printf "set xlabel \"%s\"", xname
end plot --output=display
Note that --output=display is appended to end plot; also note that if you give a matrix to plot
it’s assumed you want to plot all the columns. In addition, if you give a single series and the dataset
is time series, it’s assumed you want a time-series plot.
• The data for the plot are passed through a matrix in which we set column names via the
cnameset function; those names are then automatically used by the plot environment.
• In this example, we make extensive use of the set literal construct for refining the plot by
passing instruction to gnuplot; the power of gnuplot is impossible to overstate. We encourage
you to visit the “demos” version of gnuplot’s website (http://gnuplot.sourceforge.net/)
and revel in amazement.
• In the plot environment you can use all the quantities you have in your script. This is the
way we calibrate the histogram width (try setting the scalar k in the script to different values).
Note that the printf command has a special meaning inside a plot environment.
• The script displays the plot on your screen. If you want to save it to a file instead, replace
--output=display at the end with --output=filename.
• It’s OK to insert comments in the plot environment; actually, it’s a rather good idea to com-
ment as much as possible (as always)!
Listing 6.1: Plotting the log wage from the Mroz example dataset
###
### prepare matrix with data for plot
###
###
### create plot
###
plot plotmat
# move legend
literal set key outside rmargin
# set line style
literal set linetype 2 dashtype 2 linewidth 2
# set histogram color
literal set linetype 1 lc rgb "#777777"
# set histogram style
literal set style fill solid 0.25 border
# set histogram width
printf "set boxwidth %4.2f\n", 0.5/k
options with-lines=2 with-boxes=1
end plot --output=display
Figure: frequency histogram and density of the log wage, as produced by Listing 6.1 (x-axis: log wage).
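Along the same lines, a plot block can build the gnuplot plot command itself via printf. The fragment below (part of a longer listing) draws Student's t densities for the degrees-of-freedom values dfs[1], dfs[2] and dfs[3], using a helper function tplot that returns the plot expression for a single density: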
plot
literal set xrange [-4.5:4.5]
literal set yrange [0:0.45]
literal Binv(p,q) = exp(lgamma(p+q)-lgamma(p)-lgamma(q))
literal stud(x,m) = Binv(0.5*m,0.5)/sqrt(m)*(1.0+(x*x)/m)**(-0.5*(m+1.0))
printf "plot %s, %s, %s", tplot(dfs[1]), tplot(dfs[2]), tplot(dfs[3])
end plot --output=display
Figure: Student's t densities for t(2), t(4) and t(16).
6.3 Boxplots
These plots (after Tukey and Chambers) display the distribution of a variable. Their shape depends
on a few quantities: the first quartile Q1, the median m, the third quartile Q3, the sample mean x̄,
and the interquartile range R = Q3 − Q1.
The central box encloses the middle 50 percent of the data, i.e. it goes from Q1 to Q3; therefore, its
height equals R. A line is drawn across the box at the median m and a “+” sign identifies the mean x̄.
The length of the “whiskers” depends on the presence of outliers. The top whisker extends from
the top of the box up to a maximum of 1.5 times the interquartile range, but can be shorter if the
sample maximum is lower than that value; that is, it reaches min[xmax , Q3 + 1.5R]. Observations
larger than Q3 + 1.5R, if any, are considered outliers and represented individually via dots.3 The
bottom whisker obeys the same logic, with obvious adjustments. Figure 6.5 provides an example
of all this by using the variable FAMINC from the sample dataset mroz87.
Figure 6.5: Boxplot of FAMINC, with xmax, the outliers, Q3, the mean x̄, the median m, Q1 and xmin marked.
In the case of boxplots with confidence intervals, dotted lines show the limits of an approximate 90
percent confidence interval for the median. This is obtained by the bootstrap method, which can
take a while if the data series is very long. For details on constructing boxplots, see the entry for
boxplot in the Gretl Command Reference or use the Help button that appears when you select one
of the boxplot items under the menu item “View, Graph specified vars” in the main gretl window.
3 To give you an intuitive idea, if a variable is normally distributed, the chances of picking an outlier by this definition
are well below 1 percent.
Factorized boxplots
A nice feature which is quite useful for data visualization is the conditional, or factorized boxplot.
This type of plot allows you to examine the distribution of a variable conditional on the value of
some discrete factor.
As an example, we’ll use one of the datasets supplied with gretl, that is rac3d, which contains an
example taken from Cameron and Trivedi (2013) on the health conditions of 5190 people. The
script below compares the unconditional (marginal) distribution of the number of illnesses in the
past 2 weeks with the distribution of the same variable, conditional on age classes.
open rac3d.gdt
# unconditional boxplot
boxplot ILLNESS --output=display
# create a discrete variable for age class:
# 0 = below 20, 1 = between 20 and 39, etc
series age_class = floor(AGE/0.2)
# conditional boxplot
boxplot ILLNESS age_class --factorized --output=display
After running the code above, you should see two graphs similar to Figure 6.6. By comparing the
marginal plot to the factorized one, the effect of age on the mean number of illnesses is quite
evident: by joining the green crosses you get what is technically known as the conditional mean
function, or regression function if you prefer.
Figure 6.6: Boxplot of ILLNESS (left) and the same variable factorized by age_class (right).
Chapter 7
Joining data sources
7.1 Introduction
Gretl provides two commands for adding data from file to an existing dataset in the program’s
workspace, namely append and join. The append command, which has been available for a long
time, is relatively simple and is described in the Gretl Command Reference. Here we focus on the
join command, which is much more flexible and sophisticated. This chapter gives an overview
of the functionality of join along with a detailed account of its syntax and options. We provide
several toy examples and discuss one real-world case at length.
First, a note on terminology: in the following we use the terms “left-hand” and “inner” to refer
to the dataset that is already in memory, and the terms “right-hand” and “outer” to refer to the
dataset in the file from which additional data are to be drawn.
Two main features of join are worth emphasizing at the outset:
• “Key” variables can be used to match specific observations (rows) in the inner and outer
datasets, and this match need not be 1 to 1.
• A row filter may be applied to screen out unwanted observations in the outer dataset.
As will be explained below, these features support rather complex concatenation and manipulation
of data from different sources.
A further aspect of join should be noted — one that makes this command particularly useful when
dealing with very large data files. That is, when gretl executes a join operation it does not, in gen-
eral, read into memory the entire content of the right-hand side dataset. Only those columns that
are actually needed for the operation are read in full. This makes join faster and less demanding
of computer memory than the methods available in most other software. On the other hand, gretl’s
asymmetrical treatment of the “inner” and “outer” datasets in join may require some getting used
to, for users of other packages.
If the series being imported is new on the left, any observations outside of the current sample are set to NA; if it exists already then
observations outside of the current sample are left unchanged.
The case where you want to rename a series on import is handled by the --data option. This option
has one required argument, the name by which the series is known on the right. At this point we
need to explain something about right-hand variable names (column headings).
Right-hand names
We accept on input arbitrary column heading strings, but if these strings do not qualify as valid
gretl identifiers they are automatically converted, and in the context of join you must use the
converted names. A gretl identifier must start with a letter, contain nothing but (ASCII) letters,
digits and the underscore character, and must not exceed 31 characters. The rules used in name
conversion are:
1. Skip any leading characters that are not ASCII letters.
2. Until the 31-character limit is reached or the input is exhausted: transcribe “legal” characters; skip
“illegal” characters apart from spaces; and replace one or more consecutive spaces with an
underscore, unless the last character transcribed is an underscore, in which case the space is
skipped.
In the unlikely event that this policy yields an empty string, we replace the original with coln,
where n is replaced by the 1-based index of the column in question among those used in the
join operation. If you are in doubt regarding the converted name of a given column, the function
fixname() can be used as a check: it takes the original string as an argument and returns the
converted name. Examples:
? eval fixname("valid_identifier")
valid_identifier
? eval fixname("12. Some name")
Some_name
Returning to the use of the --data option, suppose we have a column headed "12. Some name"
on the right and wish to import it as x. After figuring out how the right-hand name converts, we can do
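# the file name here is just a placeholder
join somefile.csv x --data=Some_name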
No right-hand names?
Some data files have no column headings; they jump straight into the data (and you need to deter-
mine from accompanying documentation what the columns represent). Since gretl expects column
headings, you have to take steps to get the importation right. It is generally a good idea to insert a
suitable header row into the data file. However, if for some reason that’s not practical, you should
give the --no-header option, in which case gretl will name the columns on the right as col1, col2
and so on. If you do not do either of these things you will likely lose the first row of data, since
gretl will attempt to make variable names out of it, as described above.
7.3 Filtering
Rows from the outer dataset can be filtered using the --filter option. The required parameter
for this option is a Boolean condition, that is, an expression which evaluates to non-zero (true,
include the row) or zero (false, skip the row) for each of the outer rows. The filter expression may
include any of the following terms: up to three “right-hand” series (under their converted names as
explained above); scalar or string variables defined “on the left”; any of the operators and functions
available in gretl (including user-defined functions); and numeric or string constants.
Here are a few simple examples of potentially valid filter options (assuming that the specified right-
hand side columns are found):
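# 2. comparison of a right-hand series with a constant
--filter="nkids>2"
# 3. comparison against a string constant (the series name is illustrative)
--filter="gender==\"F\""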
# 5. compound condition
--filter="x < 100 && (x > 0 || y > 0)"
Note that if you are comparing against a string constant (as in example 3 above) it is necessary
to put the string in “escaped” double-quotes (each double-quote preceded by a backslash) so the
interpreter knows that F is not supposed to be the name of a variable.
It is safest to enclose the whole filter expression in double quotes; however, this is not strictly
required unless the expression contains spaces or the equals sign.
In general, an error is flagged if a missing value is encountered in a series referenced in a filter
expression. This is because the condition then becomes indeterminate; taking example 2 above, if
the nkids value is NA on any given row we are not in a position to evaluate the condition nkids>2.
However, you can use the missing() function — or ok(), which is a shorthand for !missing()— if
you need a filter that keys off the missing or non-missing status of a variable.
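To illustrate the basic mechanism, consider the two CSV files shown in Table 7.1: people.csv holds person-level data keyed by person id (pid) and household id (hid), while hholds.csv holds household-level data keyed by hid. A minimal script to pull the household variable xh into the person-level dataset (reconstructed to match the discussion below) is:

open people.csv --quiet
join hholds.csv xh --ikey=hid
print pid hid xh --byobs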
The basic key option is named ikey; this indicates “inner key”, that is, the key variable found in the
left-hand or inner dataset. By default it is assumed that the right-hand dataset contains a column of
the same name, though as we’ll see below that assumption can be overridden. The join command
above says, find a series named xh in the right-hand dataset and add it to the left-hand one, using
the values of hid to match rows. Looking at the data in Table 7.1 we can see how this should
work. Persons 1 and 2 are both members of household 1, so they should both get values of 1 for
xh; persons 3 and 4 are members of household 2, so that xh = 4; and so on. Note that the order
in which the key values occur on the right-hand side does not matter. The gretl output from the
print command is shown in the lower panel of Table 7.1.
people.csv:
pid,hid,gender,age,xp
1,1,M,50,1
2,1,F,40,2
3,2,M,30,3
4,2,F,25,2
5,3,M,40,3
6,4,F,35,4
7,4,M,70,3
8,4,F,60,3
9,5,F,20,4
10,6,M,40,4

hholds.csv:
hid,country,xh
1,US,1
6,IT,12
3,UK,6
4,IT,8
2,US,4
5,IT,10

Output of print:
pid hid xh
1 1 1
2 1 1
3 2 4
4 2 4
5 3 6
6 4 8
7 4 8
8 4 8
9 5 10
10 6 12
Table 7.1: Two linked CSV data files, and the effect of a join
Note that key variables are treated conceptually as integers. If a specified key contains fractional
values these are truncated.
Two extensions of the basic key mechanism are available.
• If the outer dataset contains a relevant key variable but it goes under a different name from
the inner key, you can use the --okey option to specify the outer key. (As with other right-
hand names, this does not have to be a valid gretl identifier.) So, for example, if hholds.csv
contained the hid information, but under the name HHOLD, the join command above could
be modified as
join hholds.csv xh --ikey=hid --okey=HHOLD
• If a single key is not sufficient to generate the matches you want, you can specify a double key
in the form of two series names separated by a comma; in this case the importation of data is
restricted to those rows on which both keys match. The syntax here is, for example
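join foo.csv x --ikey=key1,key2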
Again, the --okey option may be used if the corresponding right-hand columns are named
differently. The same number of keys must be given on the left and the right, but when a
double key is used and only one of the key names differs on the right, the name that is in
common may be omitted (although the comma separator must be retained). For example, the
second of the following lines is acceptable shorthand for the first:
join foo.csv x --ikey=key1,Lkey2 --okey=key1,Rkey2
join foo.csv x --ikey=key1,Lkey2 --okey=,Rkey2
The match between the inner and outer keys need not be one to one; in particular:
• Some rows on the left have multiple matches on the right (“1 to n matching”).
• Some rows on the right have multiple matches on the left (“n to 1 matching”).
• Some rows on the left have no match at all on the right.
The first case is addressed in detail in the next section; here we discuss the others.
The n to 1 case is straightforward. If a particular key value (or combination of key values) occurs at
each of n > 1 observations on the left but at a single observation on the right, then the right-hand
value is entered at each of the matching slots on the left.
The handling of the case where there’s no match on the right depends on whether the join operation
is adding a new series to the inner dataset or modifying an existing one. If it’s a new series, then
unmatched rows automatically get NA for the imported data. However, if join is pulling in values
for a series already present on the left only matched rows will be updated. In other words we do
not overwrite an existing value on the left with NA when there's no match on the right.
These defaults may not produce the desired results in every case but gretl provides the means to
modify the effect if need be. We will illustrate with two scenarios.
First consider adding a new series recording “number of hours worked” when the inner dataset
contains individuals and the outer file contains data on jobs. If an individual does not appear in
the jobs file, we may want to take her hours worked as implicitly zero rather than NA. In this case
gretl’s misszero() function can be used to turn NA into 0 in the imported series.
Second, consider updating a series via join when the outer file is presumed to contain all available
updated values, such that “no match” should be taken as an implicit NA. In that case we want the
(presumably out-of-date) values on any unmatched rows to be overwritten with NA. Let the series
in question be called x (both on the left and the right) and let the common key be called pid. The
solution is then
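# a sketch; the outer file name is hypothetical
join update.csv tmpvar --data=x --ikey=pid
x = tmpvar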
As a new variable, tmpvar will get NA for all unmatched rows; we then transcribe its values into
x. In a more complicated case one might use the smpl command to limit the sample range before
assigning tmpvar to x, or use the conditional assignment operator ?:.
One further point: given some missing values in an imported series you may want to know whether
(a) the NAs were explicitly represented in the outer data file or (b) they arose due to “no match”. You
can find this out by using a method described in the following section, namely the count variant of
the aggregation option: this will give you a series with 0 values for all and only unmatched rows.
7.5 Aggregation
In the case of 1 to n matching of rows (n > 1) the user must specify an “aggregation method”; that
is, a method for mapping from n rows down to one. This is handled by the --aggr option, which
requires a single argument. The methods referred to in this chapter include count (the number of
matching rows), avg, sum, min and max (the mean, sum, minimum and maximum of the matching
values, with min and max optionally taking an auxiliary variable in parentheses), and seq:i (the i-th
value in the sequence of matches).
Note that the count aggregation method is special, in that there is no need for a “data series” on
the right; the imported series is simply a function of the specified key(s). All the other methods
require that “actual data” are found on the right. Also note that when count is used, the value
returned when no match is found is (as one might expect) zero rather than NA.
The basic use of the seq method is shown above: following the colon you give a positive integer rep-
resenting the (1-based) position of the observation in the sequence of matched rows. Alternatively,
a negative integer can be used to count down from the last match (seq:-1 selects the last match,
seq:-2 the second-last match, and so on). If the specified sequence number is out of bounds for a
given observation this method returns NA.
Referring again to the data in Table 7.1, suppose we want to import data from the persons file into
a dataset established at household level. Here’s an example where we use the individual age data
from people.csv to add the average and minimum age of household members.
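A sketch of the commands (the names given to the imported series are illustrative):

open hholds.csv --quiet
join people.csv avgage --ikey=hid --data=age --aggr=avg
join people.csv minage --ikey=hid --data=age --aggr=min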
Here’s a further example where we add to the household data the sum of the personal data xp, with
the twist that we apply filters to get the sum specifically for household members under the age of
40, and for women.
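Again a sketch, with illustrative names for the imported series:

join people.csv xp_under40 --ikey=hid --data=xp --filter="age<40" --aggr=sum
join people.csv xp_women --ikey=hid --data=xp --filter="gender==\"F\"" --aggr=sum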
The possibility of using an auxiliary variable with the min and max modes of aggregation gives extra
flexibility. For example, suppose we want for each household the income of its oldest member:
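# treating xp as the income-like variable for illustration
join people.csv xp_oldest --ikey=hid --data=xp --aggr=max(age)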
Keys may be string-valued as well as numerical, and you can mix numerical and string keys, but naturally you cannot mix a string variable on the left
(via ikey) with a numerical one on the right (via okey), or vice versa.
Here’s a simple example. Suppose that alongside hholds.csv we have a file countries.csv with
the following content:
country,GDP
UK,100
US,500
IT,150
FR,180
The variable country, which is also found in hholds.csv, is string-valued. We can pull the GDP of
the country in which the household resides into our households dataset with
open hholds.csv -q
join countries.csv GDP --ikey=country
which gives

     hid  country  GDP
1      1        1  500
2      6        2  150
3      3        3  100
4      4        2  150
5      2        1  500
6      5        2  150

(country being shown here by its numeric coding).
It is also possible to import more than one series with a single join command, in either of two ways:
1. The varname field in the command can take the form of a space-separated list of names rather
than a single name.
2. Alternatively, you can give the name of an array of strings in place of varname: the elements
of this array should be the names of the series to import.
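A sketch of the two variants (file and series names hypothetical):

join somefile.csv x1 x2 x3 --ikey=hid
# or, equivalently, via an array of strings
strings vnames = defarray("x1", "x2", "x3")
join somefile.csv vnames --ikey=hid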
Two limitations should be noted when importing several series at once:
1. The --data option, which permits the renaming of a series on import, is not available. When
importing multiple series you are obliged to accept their “outer” names, fixed up as described
in section 7.2.
2. While the other join options are available, they necessarily apply uniformly to all the series
imported via a given command. This means that if you want to import several series but using
different keys, filters or aggregation methods you must use a sequence of commands.
We then restrict the sample to married women from 30 to 60 years of age, and additionally restrict
the sample of women to those who are either heads of households or spouses of the head.
For compatibility with the Mroz dataset as presented in the gretl data file mroz87.gdt, we rename
the age and education variables as WA and WE respectively, we compute the CIT dummy and finally
we store the reduced base dataset in gretl format.
rename ETA WA
rename STUDIO WE
series CIT = (ACOM4C > 2)
store mroz_rep.gdt
The next step will be to get data on working hours from the jobs file allb1.csv. (The ASCII/CSV data
files for the 2010 survey are available at http://www.bancaditalia.it/statistiche/indcamp/
bilfait/dismicro/annuale/ascii/ind10_ascii.zip.) There's a complication here. We need the
total hours worked over the course of the year (for both the women and their husbands). This is
not available as such, but the variables ORETOT and MESILAV give,
respectively, average hours worked per week and the number of months worked in 2010, each on a
per-job basis. If each person held at most one job over the year we could compute his or her annual
hours as
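# one plausible formulation: weekly hours times approximate weeks worked
series HRS = ORETOT * 52 * MESILAV/12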
However, some people had more than one job, and in this case what we want is the sum of annual
hours across their jobs. We could use join with the seq aggregation method to construct this sum,
but it is probably more straightforward to read the allb1 data, compute the HRS values per job as
shown above, and save the results to a temporary CSV file.
Now we can reopen the base dataset and join the hours variable from HRS.csv. Note that we need
a double key here: the women are uniquely identified by the combination of NQUEST and NORD. We
don’t need an okey specification since these keys go under the same names in the right-hand file.
We define labor force participation, LFP, based on hours.
open mroz_rep.gdt
join HRS.csv WHRS --ikey=NQUEST,NORD --data=HRS --aggr=sum
WHRS = misszero(WHRS)
LFP = WHRS > 0
For reference, here’s how we could have used seq to avoid writing a temporary file:
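# a sketch, assuming at most two jobs per person
join allb1.csv h1 --ikey=NQUEST,NORD --data=ORETOT --aggr=seq:1
join allb1.csv m1 --ikey=NQUEST,NORD --data=MESILAV --aggr=seq:1
join allb1.csv h2 --ikey=NQUEST,NORD --data=ORETOT --aggr=seq:2
join allb1.csv m2 --ikey=NQUEST,NORD --data=MESILAV --aggr=seq:2
series WHRS = misszero(h1 * 52 * m1/12) + misszero(h2 * 52 * m2/12)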
To generate the work experience variable, AX, we use the file lavoro.csv: this contains a variable
named ETALAV which records the age at which the person first started work.
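A sketch of one way to do this (the aggregation choice and the definition of experience are assumptions):

join lavoro.csv ETALAV --ikey=NQUEST,NORD --aggr=min
series AX = WA - ETALAV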
We compute the woman’s hourly wage, WW, as the ratio of total employment income to annual
working hours. This requires drawing the series YL (payroll income) and YM (net self-employment
income) from the persons file rper10.csv.
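A sketch (the treatment of women with zero hours is just one possibility):

join rper10.csv YL YM --ikey=NQUEST,NORD
series WW = (WHRS > 0) ? (YL + YM)/WHRS : 0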
The family’s net disposable income is available as Y in the file rfam10.csv; we import this as
FAMINC.
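# import Y under the name FAMINC, keyed on the household identifier
join rfam10.csv FAMINC --ikey=NQUEST --data=Y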
Data on number of children are now obtained by applying the count method. For the Mroz repli-
cation we want the number of children under the age of 6, and also the number aged 6 to 18.
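A sketch using the count method (the name of the file holding all household members and its age variable, ETA, are assumptions):

join members.csv KL6 --ikey=NQUEST --aggr=count --filter="ETA<6"
join members.csv K618 --ikey=NQUEST --aggr=count --filter="ETA>=6 && ETA<=18"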
We want to add data on the women’s husbands, but how do we find them? To do this we create an
additional inner key which we’ll call H_ID (husband ID), by sub-sampling in turn on the observations
falling into each of two classes: (a) those where the woman is recorded as head of household and
(b) those where the husband has that status. In each case we want the individual ID (NORD) of the
household member whose status is complementary to that of the woman in question. So for case
(a) we subsample using PARENT==1 (head of household) and filter the join using PARENT==2 (spouse
of head); in case (b) we do the converse. We thus construct H_ID piece-wise.
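A sketch of the piece-wise construction (the file of household members is hypothetical):

series H_ID = NA
# case (a): the woman is head of household, her husband is the spouse
smpl PARENT==1 --restrict
join members.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==2"
smpl full
# case (b): the converse
smpl PARENT==2 --restrict
join members.csv H_ID --ikey=NQUEST --data=NORD --filter="PARENT==1"
smpl full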
Now we can use our new inner key to retrieve the husbands’ data, matching H_ID on the left with
NORD on the right within each household.
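For example, to import the husband's age (the right-hand names are assumptions):

join members.csv HA --ikey=NQUEST,H_ID --okey=,NORD --data=ETA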
The remainder of the script is straightforward and does not require discussion here: we recode
the education variables for compatibility; delete some intermediate series that are not needed any
more; add informative labels; and save the final product. See the Appendix for details.
To compare the results from this dataset with those from the earlier US data used by Mroz, one can
copy the input file heckit.inp (supplied with the gretl package) and substitute mroz_rep.gdt for
mroz87.gdt. It turns out that the results are qualitatively very similar.
(For its own display purposes gretl separates the year from the month with a colon, as in 2013:06. A difficulty with following ISO here is that in a statistical context a string
such as 1980-10 may look more like a subtraction than a date.) Anyway, at present we are more
interested in the parsing of dates on input rather than in what gretl prints. And in that context note
that “excess precision” is acceptable: a month may be represented by its first day (e.g. 2005-05-01
for May, 2005), and a quarter may be represented by its first month and day (2005-07-01 for the
third quarter of 2005).
Some additional points regarding dates will be taken up as they become relevant in practical cases
of joining data.
• The default assumptions are: (1) the time key appears in the first column; (2) the heading
of this column is either left blank or is one of obs, date, year, period, observation, or
observation_date (on a case-insensitive comparison); and (3) the time format conforms
to ISO 8601 where applicable (“extended” daily date format YYYY-MM-DD, monthly format
YYYY-MM, or annual format YYYY).
• If dates do not appear in the first column of the outer file, or if the column heading or format
is not as just described, the --tkey option can be used to indicate which column should be
used and/or what format should be assumed.
--tkey="Period,%m/%d/%Y"
--tkey="Period"
--tkey="obsperiod"
--tkey=",%Ym%m"
The first of these applies if Period is not the first column on the right, and dates are given in the
US format of month, day, year, separated by slashes. The second implies that although Period is
not the first column, the date format is ISO 8601. The third again implies that the date format is
OK; here the name is required even if obsperiod is the first column since this heading is not one
recognized by gretl’s heuristic. The last example implies that dates are in the first column (with
one of the recognized headings), but are given in the non-standard format year, “m”, month.
The date format string should be composed using the codes employed by the POSIX function
strptime; Table 7.2 contains a list of the most relevant codes.3
Code Meaning
%% The % character.
%b The month name according to the current locale, either abbreviated
or in full.
%C The century number (0–99).
%d The day of month (1–31).
%D Equivalent to %m/%d/%y. (This is the American style date, very con-
fusing to non-Americans, especially since %d/%m/%y is widely used in
Europe. The ISO 8601 standard format is %Y-%m-%d.)
%H The hour (0–23).
%j The day number in the year (1–366).
%m The month number (1–12).
%n Arbitrary whitespace.
%q The quarter (1–4).
%w The weekday number (0–6) with Sunday = 0.
%y The year within century (0–99). When a century is not otherwise spec-
ified, values in the range 69–99 refer to years in the twentieth century
(1969–1999); values in the range 00–68 refer to years in the twenty-
first century (2000–2068).
%Y The year, including century (for example, 1991).
Date,Open,High,Low,Close,Volume,Adj Close
2013-08-02,195.50,195.50,193.22,195.16,3861000,195.16
2013-08-01,196.65,197.17,195.41,195.81,2856900,195.81
2013-07-31,194.49,196.91,194.49,195.04,3810000,195.04
Note that the data are in reverse time-series order — that won't matter to join; the data can appear
in any order. Also note that the first column is headed Date and holds daily dates as ISO 8601
extended. That means we can pull the data into gretl very easily. In the following fragment we
create a suitably dimensioned empty daily dataset then rely on the default behavior of join with
time-series data to import the closing stock price.
nulldata 500
setobs 5 2012-01-01
join IBM.csv Close
3 The %q code for quarter is not present in strptime; it is added for use with join since quarterly data are common
in macroeconomics.
To make explicit what we’re doing, we could accomplish exactly the same using the --tkey option:
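# the same import, with the time key spelled out
join IBM.csv Close --tkey="Date,%Y-%m-%d"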
Frequency,Period,Country,Value,Flags
"Quarterly","Q1-1960","France",463876.148126845,E
"Quarterly","Q1-1960","Germany",768802.119278467,E
"Quarterly","Q1-1960","Italy",414629.791450547,E
"Quarterly","Q1-1960","United Kingdom",578437.090291889,E
"Quarterly","Q2-1960","France",465618.977328614,E
"Quarterly","Q2-1960","Germany",782484.138122549,E
"Quarterly","Q2-1960","Italy",420714.910290157,E
"Quarterly","Q2-1960","United Kingdom",572853.474696578,E
"Quarterly","Q3-1960","France",469104.41925852,E
"Quarterly","Q3-1960","Germany",809532.161494483,E
"Quarterly","Q3-1960","Italy",426893.675840156,E
"Quarterly","Q3-1960","United Kingdom",581252.066618986,E
"Quarterly","Q4-1960","France",474664.327992619,E
"Quarterly","Q4-1960","Germany",817806.132384948,E
"Quarterly","Q4-1960","Italy",427221.338414114,E
...
Table 7.3: Example of CSV file as provided by the OECD statistical website
This is an instance of data in what we call atomic format, that is, a format in which each line of the
outer file contains a single data-point and extracting data mainly requires filtering the appropriate
lines. The outer time key is under the Period heading, and has the format Q<quarter>-<year>.
Assuming that the file in Table 7.3 has the name oecd.csv, the following script reconstructs the
time series of Gross Domestic Product for several countries:
nulldata 220
setobs 4 1960:1
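# a sketch; the names of the imported series are illustrative
join oecd.csv GDP_FR --data=Value --tkey="Period,Q%q-%Y" --filter="Country==\"France\""
join oecd.csv GDP_DE --data=Value --tkey="Period,Q%q-%Y" --filter="Country==\"Germany\""
join oecd.csv GDP_IT --data=Value --tkey="Period,Q%q-%Y" --filter="Country==\"Italy\""
join oecd.csv GDP_UK --data=Value --tkey="Period,Q%q-%Y" \
  --filter="Country==\"United Kingdom\""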
Note the use of the format codes %q for the quarter and %Y for the 4-digit year. A touch of elegance
could have been added by storing the invariant options to join using the setopt command, as in
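# a sketch of the setopt approach
setopt join persist --data=Value --tkey="Period,Q%q-%Y"
join oecd.csv GDP_FR --filter="Country==\"France\""
join oecd.csv GDP_DE --filter="Country==\"Germany\""
# ...and so on, then
setopt join clear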
If one were importing a large number of such series it might be worth rewriting the sequence of
joins as a loop, as in
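# a sketch using string arrays; series names are illustrative
strings countries = defarray("France", "Germany", "Italy", "United Kingdom")
strings vnames = defarray("GDP_FR", "GDP_DE", "GDP_IT", "GDP_UK")
loop i=1..4
    string cname = countries[i]
    string vname = vnames[i]
    join oecd.csv @vname --data=Value --tkey="Period,Q%q-%Y" \
      --filter="Country==\"@cname\""
endloop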
4 Retrieved 2013-08-05. The OECD files in fact contain two leading columns with very long labels; these are irrelevant
to the present example and can be omitted without altering the sample script.
• If a specified column is not selected for a substantive role in the join operation (as data to be
imported, as a key, or as an auxiliary variable for use in aggregation) the column in question
is not read and so no conversion is carried out.
• If a specified column contains numerical rather than string values, no conversion is carried
out.
• If a string value in a selected column fails parsing using the relevant time format (user-
specified or default), the converted value is NA.
• On successful conversion, the output is always in daily-date form as stated above. If you
specify a monthly or quarterly time format, the converted date is the first day of the month
or quarter.
So far we have read the OECD data into a time-series dataset; suppose we want the GDP data in panel format instead (stacked time series). How can we do this
with join?
As a reminder, here’s what the OECD data look like:
Frequency,Period,Country,Value,Flags
"Quarterly","Q1-1960","France",463876.148126845,E
"Quarterly","Q1-1960","Germany",768802.119278467,E
"Quarterly","Q1-1960","Italy",414629.791450547,E
"Quarterly","Q1-1960","United Kingdom",578437.090291889,E
"Quarterly","Q2-1960","France",465618.977328614,E
and so on. If we have four countries and quarterly observations running from 1960:1 to 2013:2 (T
= 214 quarters) we might set up our panel workspace like this:
scalar N = 4
scalar T = 214
scalar NT = N*T
nulldata NT --preserve
setobs T 1.1 --stacked-time-series
The relevant outer keys are obvious: Country for the country and Period for the time period. Our
task is now to construct matching keys in the inner dataset. This can be done via two panel-specific
options to the setobs command. Let’s work on the time dimension first:
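setobs 4 1960:1 --panel-time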
This variant of setobs allows us to tell gretl that time in our panel is quarterly, starting in the
first quarter of 1960. Having set that, the accessor $obsdate will give us a series of 8-digit dates
representing the first day of each quarter — 19600101, 19600401, 19600701, and so on, repeating
for each country. As we explained in section 7.11, we can use the --tconvert option on the outer
series Period to get exactly matching values (in this case using a format of Q%q-%Y for parsing the
Period values).
Now for the country names:
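# a sketch of the group-names step
string cstrs = sprintf("France Germany Italy \"United Kingdom\"")
setobs country cstrs --panel-groups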
Here we write into the string cstrs the names of the countries, using escaped double-quotes to
handle the space in “United Kingdom”, then pass this string to setobs with the --panel-groups
option, preceded by the identifier country. This asks gretl to construct a string-valued series
named country, in which each name will repeat T times.
We’re now ready to join. Assuming the OECD file is named oecd.csv we do
# Real GDP
Period,France,Germany,Italy,"United Kingdom"
"Q1-1960",463863,768757,414630,578437
"Q2-1960",465605,782438,420715,572853
"Q3-1960",469091,809484,426894,581252
"Q4-1960",474651,817758,427221,584779
"Q1-1961",482285,826031,442528,594684
...
Call this file side_by_side.csv. Assuming the same initial set-up as above, we can panelize the
data by setting the sample to each country’s time series in turn and importing the relevant column.
The only point to watch here is that the string “United Kingdom”, being a column heading, will
become United_Kingdom on importing (see section 7.2) so we’ll need a slightly different set of
country strings.
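A sketch for the first two countries (the remaining ones follow the same pattern; option details are reconstructed):

series qdate = $obsdate
smpl country=="France" --restrict
join side_by_side.csv GDP --data=France \
  --ikey=qdate --okey=Period --tconvert=Period --tconv-fmt="Q%q-%Y"
smpl full
smpl country=="Germany" --restrict
join side_by_side.csv GDP --data=Germany \
  --ikey=qdate --okey=Period --tconvert=Period --tconv-fmt="Q%q-%Y"
smpl full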
If our working dataset and the outer data file are dimensioned such that there are just as many
time-series observations on the right as there are time slots on the left — and the observations
on the right are contiguous, in chronological order, and start on the same date as the working
dataset—we could dispense with the key apparatus and just use the first line of the join command
shown above. However, in general it is safer to use keys to ensure that the data end up in correct
registration.
flag effect
--data Give the name of the data column on the right, in case it differs from
varname (7.2); single import only
--filter Specify a condition for filtering data rows (7.3)
--ikey Specify up to two keys for matching data rows (7.4)
--okey Specify outer key name(s) in case they differ from the inner ones (7.4)
--aggr Select an aggregation method for 1 to n joins (7.5)
--tkey Specify right-hand time key (7.10)
--tconvert Select outer date columns for conversion to numeric form (7.11)
--tconv-fmt Specify a format for use with tconvert (7.11)
--no-header Treat the first row on the right as data (7.2)
--verbose Report on progress in reading the outer data
Chapter 8
Realtime data
8.1 Introduction
As of gretl version 1.9.13 the join command (see chapter 7) has been enhanced to deal with so-
called realtime datasets in a straightforward manner. Such datasets contain information on when
the observations in a time series were actually published by the relevant statistical agency and how
they have been revised over time. Probably the most popular sources of such data are the “Alfred”
online database at the St. Louis Fed (http://alfred.stlouisfed.org/) and the OECD’s StatEx-
tracts site, http://stats.oecd.org/. The examples in this chapter deal with files downloaded
from these sources, but should be easy to adapt to files with a slightly different format.
As already stated, join requires a column-oriented plain text file, where the columns may be sepa-
rated by commas, tabs, spaces or semicolons. Alfred and the OECD provide the option to download
realtime data in this format (tab-delimited files from Alfred, comma-delimited from the OECD). If
you have a realtime dataset in a spreadsheet file you must export it to a delimited text file before
using it with join.
Representing revision histories is more complex than just storing a standard time series, because
for each observation period you have in general more than one published value over time, along
with the information on when each of these values were valid or current. Sometimes this is repre-
sented in spreadsheets with two time axes, one for the observation period and another one for the
publication date or “vintage”. The filled cells then form an upper triangle (or a “guillotine blade”
shape, if the publication dates do not reach back far enough to complete the triangle). This format
can be useful for giving a human reader an overview of realtime data, but it is not optimal for
automatic processing; for that purpose “atomic” format is best.
1 If you choose to download in Excel format from OECD you get a file in the triangular or guillotine format mentioned
above.
2 In the Alfred file we have used commas rather than tabs as the column delimiter; in the OECD example we have
Note that, besides the observation date and the data value, the Alfred file has two further date columns, which jointly record the interval over which a given data vintage was current. Daily
dates might, however, be considered overly precise for the first column, since the data period may
well be the year, quarter or month (as it is in fact here). However, following Alfred’s practice it is
acceptable to specify a daily date, indicating the first day of the period, even for non-daily data.3
Compare the first data line of the OECD example. There’s a greater amount of leading metadata,
which is left implicit in the Alfred file. Here Time is the equivalent of Alfred’s observation_date,
and Edition the equivalent of Alfred’s realtime_start_date. So we read that in February 1999
a value of 100 was current for the UK index of industrial production for January 1990, and from
the next line we see that in the same vintage month a value of 99.3 was current for industrial
production in February 1990.
Besides the different names and ordering of the columns, there are a few more substantive differ-
ences between Alfred and OECD files, most of which are irrelevant for join but some of which are
(possibly) relevant.
The first (irrelevant) difference is the ordering of the lines. It appears (though we’re not sure how
consistent this is) that in Alfred files the lines are sorted by observation date first and then by
publication date—so that all revisions of a given observation are grouped together —while OECD
files are sorted first by revision date (Edition) and then by observation date (Time). If we want the
next revision of UK industrial production for January 1990 in the OECD file we have to scan down
several lines until we find
This difference is basically irrelevant because join can handle the case where the lines appear in
random order, although some operations can be coded more conveniently if we’re able to assume
chronological ordering (either on the Alfred or the OECD pattern, it doesn’t matter).
The second (also irrelevant) difference is that the OECD seems to include periodic “Edition” lines
even when there is no change from the previous value (as illustrated above, where the UK industrial
production index for January 1990 is reported as 100 as of March 1999, the same value that we saw
to be current in February 1999), while Alfred reports a new value only when it differs from what
was previously current.
A third difference lies in the dating of the revisions or editions. As we have seen, Alfred gives
a specific daily date while (in the UK industrial production file at any rate), the OECD just dates
each edition to a month. This is not necessarily relevant for join, but it does raise the question of
whether the OECD might date revisions to a finer granularity in some of their files, in which case
one would have to be on the lookout for a different date format.
The final difference is that Alfred supplies an “end date” for each data vintage while the OECD
3 Notice that this implies that in the Alfred example it is not clear without further information whether the observation
period is the first quarter of 1960, the month January 1960, or the day January 1st 1960. However, we assume that this
information is always available in context.
supplies only a starting date. But there is less to this difference than meets the eye: according to
the Alfred webmaster, “by design, a new vintage must start immediately following (the day after)
the lapse of the old vintage”—so the end date conveys no independent information.4
1. If a format is given with the --tkey option it always applies to the tkey column alone; and
for that column it overrides any format given via the --tconv-fmt option.
2. If a format is given via tconv-fmt it is assumed to apply to all the tconvert columns, unless
this assumption is overridden by rule 1.
nulldata 132
setobs 12 2004:01
For convenience we can put the name of our realtime file into a string variable. On Windows this
might look like
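# the path and file name are hypothetical
string fname = "C:/Users/yourname/Downloads/INDPRO_vintages.txt"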
4 Email received from Travis May of the Federal Reserve Bank of St. Louis, 2013-10-17. This closes off the possibility
that a given vintage could lapse or expire some time before the next vintage becomes available, hence giving rise to a
“hole” in an Alfred realtime file.
We can then import the data vintage 2011-06-15 using join, arbitrarily choosing the self-explanatory
identifier ip_asof_20110615.
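A command along the following lines does the job (the layout here is a sketch; the options are explained in turn below):

join @fname ip_asof_20110615 --data=INDPRO --tkey=observation_date \
  --tconvert=realtime_start_date \
  --filter="realtime_start_date<=20110615" \
  --aggr=max(realtime_start_date)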
• The --tkey option specifies the column which should be treated as holding the observation
period identifiers to be matched against the periods in the current gretl dataset.5 The more
general form of this option is --tkey="colname,format" (note the double quotes here), so
if the dates do not come in standard format, we can tell gretl how to parse them by using
the appropriate conversion specifiers as shown in Table 7.2. For example, here we could have
written --tkey="observation_date,%Y-%m-%d".
• Next, --data=INDPRO tells gretl that we want to retrieve the entries stored in the column
named INDPRO.
• As explained in section 7.11 the --tconvert option selects certain columns in the right-hand
data file for conversion from date strings to 8-digit numbers on the pattern YYYYMMDD. We’ll
need this for the next step, filtering, since the transformation to numerical values makes
it possible to perform basic arithmetic on dates. Note that since date strings in Alfred files
conform to gretl’s default assumption it is not necessary to use the --tconv-fmt option here.
• The --filter option specification in combination with the subsequent --aggr aggregation
treatment is the central piece of our data retrieval; notice how we use the date constant
20110615 in ISO basic form to do numerical comparisons, and how we perform the numerical
max operation on the converted column realtime_start_date. It would also have been
possible to predefine a scalar variable, as in
vintage = 20110615
and then use vintage in the join command instead. Here we tell join that we only want to
extract those publications that (1) already appeared before (and including) June 15th 2011,
and (2) were not yet obsoleted by a newer release.6
As a result, your dataset will now contain a time series named ip_asof_20110615 with the values
that a researcher would have had available on June 15th 2011. Of course, all values for the observa-
tions after June 2011 will be missing (and probably a few before that, too), because they only have
become available later on.
5 Strictly speaking, using --tkey is unnecessary in this example because we could just have relied on the default,
which is to use the first column in the source file for the periods. However, being explicit is often a good idea.
6 By implementing the second condition through the max aggregation on the realtime_start_date column alone,
without using the realtime_end_date column, we make use of the fact that Alfred files cannot have “holes” as explained
before.
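Another common exercise is to retrieve, say, the third value published for each observation period, regardless of when it appeared. A sketch of such a command:

join @fname ip_3rdpub --data=INDPRO --tkey=observation_date --aggr="seq:3"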
Since we do not need the realtime_start_date information for this retrieval, we have dropped
the --tconvert option here. Note that this formulation assumes that the source file is ordered
chronologically, otherwise using the option --aggr="seq:3", which retrieves the third value from
each sequence of matches, could have yielded a result different from the one intended. However,
this assumption holds for Alfred files and is probably rather safe in general.
The values of the variable imported as ip_3rdpub in this way were published at different dates,
so the variable is effectively a mix of different vintages. Depending on the type of variable, this
may also imply drastic jumps in the values; for example, index numbers are regularly re-based
to different base periods. This problem also carries over to inflation-adjusted economic variables,
where the base period of the price index changes over time. Mixing vintages in general also means
mixing different scales in the output, with which you would have to deal appropriately.7
8.6 Getting the values at a fixed lag after the observation period
New data releases may take place on any day of the month, and as we have seen the specific day
of each release is recorded in realtime files from Alfred. However, if you are working with, say,
monthly or quarterly data you may sometimes want to adjust the granularity of your realtime axis
to a monthly or quarterly frequency. For example, in order to analyse the data revision process for
monthly industrial production you might be interested in the extent of revisions between the data
available two and three months after each observation period.
This is a relatively complicated task and there is more than one way of accomplishing it. Either you
have to make several passes through the outer dataset or you need a sophisticated filter, written
as a hansl function. Either way you will want to make use of some of gretl’s built-in calendrical
functions.
We’ll assume that a suitably dimensioned workspace has been set up as described above. Given
that, the key ingredients of the join are a filtering function which we’ll call rel_ok (for “release is
OK”) and the join command which calls it. Here’s the function:
Note that we use --tconvert to convert both the observation date and the realtime start date (or
release date) to 8-digit numerical values. Both of these series are passed to the filter, which uses the
7 Some user-contributed functions may be available that address this issue, but it is beyond our scope here. Another
even more complicated issue in the realtime context is that of “benchmark revisions” applied by statistical agencies,
where the underlying definition or composition of a variable changes on some date, which goes beyond a mere rescaling.
However, this type of structural change is not, in principle, a feature of realtime data alone, but applies to any time-series
data.
built-in function isoconv to extract year and month. We can then calculate dm, the “delta months”
since the observation date, for each release. The filter condition is that this delta should be no
greater than the specified lag, p.8
This filter condition may be satisfied by more than one release, but only the latest of those will
actually be the vintage that was current at the end of the n-th month after the observation period,
so we add the option --aggr=max(realtime_start_date). If instead you want to target the
release at the beginning of the n-th month you would have to use a slightly more complicated filter
function.
An illustration
Figure 8.1 shows four time series for the monthly index of US industrial production from October
2005 to June 2009: the value as of first publication plus the values current 3, 6 and 12 months out
from the observation date.9 From visual inspection it would seem that over much of this period
the Federal reserve was fairly consistently overestimating industrial production at first release and
shortly thereafter, relative to the figure they arrived at with a lag of a year.
The script that produced this Figure is shown in full in Listing 8.1. Note that in this script we are
using a somewhat more efficient version of the rel_ok function shown above, where we pass the
required series arguments in “pointer” form to avoid having to copy them (see chapter 14).
Figure 8.1: The monthly index of US industrial production, October 2005 to June 2009: the value as of first publication and the values current 3, 6 and 12 months after the observation date.
8 The filter is written on the assumption that the lag is expressed in months; on that understanding it could be used
with annual or quarterly data as well as monthly. The idea could be generalized to cover weekly or daily data without
much difficulty.
9 Why not a longer series? Because if we try to extend it in either direction we immediately run into the index re-basing
problem mentioned in section 8.5, with big (staggered) leaps downward in all the series.
nulldata 45
setobs 12 2005:10
# plus 3 months
join @fname plus3 --data=INDPRO --tkey=observation_date \
--tconvert="observation_date,realtime_start_date" \
--filter="rel_ok(&observation_date, &realtime_start_date, 3)" \
--aggr=max(realtime_start_date)
# plus 6 months
join @fname plus6 --data=INDPRO --tkey=observation_date \
--tconvert="observation_date,realtime_start_date" \
--filter="rel_ok(&observation_date, &realtime_start_date, 6)" \
--aggr=max(realtime_start_date)
# plus 12 months
join @fname plus12 --data=INDPRO --tkey=observation_date \
--tconvert="observation_date,realtime_start_date" \
--filter="rel_ok(&observation_date, &realtime_start_date, 12)" \
--aggr=max(realtime_start_date)
To track the revisions of a single data point (here the January 1970 value of INDPRO) we can set up a monthly dataset and let
it run to the end of the vintage history (in this file, March 2013). Our outer time key is the realtime
start date and we filter on observation date; we name the imported INDPRO values as ip_jan70.
Since it sometimes happens that more than one revision occurs in a given month we need to select
an aggregation method: here we choose to take the last revision in the month.
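A sketch of the command (seq:-1 picks the last match, that is, the latest revision within the month, given Alfred's chronological ordering):

join @fname ip_jan70 --data=INDPRO --tkey=realtime_start_date \
  --tconvert=observation_date \
  --filter="observation_date==19700101" --aggr=seq:-1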
Recall from section 8.2 that Alfred records a new revision only when the data-point in question
actually changes. This means that our imported series will contain missing values for all months
when no real revision took place. However, we can apply a simple autoregressive rule to fill in the
blanks: each missing value equals the prior non-missing value.
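In hansl this can be done in a single line, relying on the recursive evaluation of a series that refers to its own lag:

ip_jan70 = ok(ip_jan70) ? ip_jan70 : ip_jan70(-1)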
Figure 8.2 displays the revision history. Over this sample period the periodic re-basing of the index
overshadows amendments due to accrual of new information.
Figure 8.2: Vintages of the index of US industrial production for January 1970
Chapter 9
Temporal disaggregation
9.1 Introduction
This chapter describes and explains the facility for temporal disaggregation in gretl.1 This is implemented by the tdisagg function, which supports three variants of the method of Chow and Lin (1971); the method of Fernández (1981); and two variants of the method of Denton (1971) as modified by Cholette (1984). Given the analytical similarities between them, the three Chow–Lin variants and the Fernández method will be grouped in the discussion below as “Chow–Lin methods”.
The balance of this section provides a gentle introduction to the idea of temporal disaggregation;
experts may wish to skip to the next section.
Basically, temporal disaggregation is the business of taking time-series data observed at some given
frequency (say, annually) and producing a counterpart series at a higher frequency (say, quarterly).
The term “disaggregation” indicates the inverse operation of aggregation, and to understand tem-
poral disaggregation it’s helpful first to understand temporal aggregation. In aggregating a high
frequency series to a lower frequency there are three basic methods, the appropriate method de-
pending on the nature of the data. Here are some illustrative examples.
• GDP: say we have quarterly GDP data and wish to produce an annual series. This is a flow
variable and the annual flow will be the sum of the quarterly values (unless the quarterly
values are annualized, in which case we would aggregate by taking their mean).
• Industrial Production: this takes the form of an index reporting the level of production over
some period relative to that in a base period in which the index is by construction 100. To
aggregate from (for example) monthly to quarterly we should take the average of the monthly
values. (The sum would give a nonsense result.) The same goes for price indices, and also for
ratios of stocks to flows or vice versa (inventory to sales, debt to GDP, capacity utilization).
• Money stock: this is a stock variable, measured at a point in time rather than over a period. In aggregating (say) monthly data to quarterly frequency we simply take the value recorded for the last month of each quarter (or, for some series, the first month).
A central idea in temporal disaggregation is that the high frequency series must respect both the
given low frequency data and the aggregation method. So for example, whatever numbers we come
up with for quarterly GDP, given an annual series as starting point, our numbers must sum to
the annual total. If money stock is measured at the end of the period then whatever numbers
we come up with for monthly money stock, given quarterly data, the figure for the last month of
the quarter must match that for the quarter as a whole. This is why temporal disaggregation is
sometimes called “benchmarking”: the given low frequency data constitute a benchmark which the
constructed high frequency data must match, in a well defined sense that depends on the nature
of the data.
Colloquially, we might describe temporal disaggregation as “interpolation,” but strictly speaking
interpolation applies only to stock variables. We have a known end-of-quarter value (say), which is
also the value at the end of the last month of the quarter, and we’re trying to figure out what the
1 We are grateful to Tommaso Di Fonzo, Professor of Statistical Science at the University of Padua, for detailed and
precise comments on earlier drafts. Any remaining errors are, of course, our responsibility.
value might have been at the end of months 1 and 2. We’re filling in the blanks, or interpolating.
In the GDP case, however, the procedure is distribution rather than interpolation. We have a given
annual total and we’re trying to figure out how it should be distributed over the quarters. We’re
also doing distribution for variables taking the form of indices or ratios, except in this case we’re
seeking plausible values whose mean equals the given low-frequency value.
While matching the low frequency benchmark is an important constraint, it obviously does not
tie down the high frequency values. That is a job for either regression-based methods such as
Chow–Lin or non-regression methods such as Denton. Details are provided in section 9.7.
The idea is that Y (a matrix with T rows and g columns) contains time series data sampled at some frequency f, while each column of X (a matrix with s · T + m rows) contains time series data at a higher frequency, sf. So for each observation Yt we have s corresponding rows in X. The object is to produce a transformation of Y to frequency sf, with the help of X (whose columns are typically called “related series” or “indicators” in the temporal disaggregation literature), via either distribution or interpolation depending on the nature of the data. For most of this chapter we will assume that g = 1, or in other words we are performing temporal disaggregation on a single low-frequency series, but tdisagg supports “batch processing” of several series and we return to this point in section 9.9.
If m is greater than zero, there are some “extra” high-frequency observations available for extrapolation — see section 9.4 for details.
We need to say something more about what goes into X. Under the Denton methods this must be a
single series, generally known as the “preliminary series”.2 For the Chow–Lin methods, X can hold
a combination of deterministic terms (e.g. constant, trend) and stochastic series. Naturally, suitable
candidates for the role of preliminary series or indicator will be variables that are correlated with Y
(and in particular, might be expected to share short-run dynamics with Y). However, it is possible
to carry out disaggregation using deterministic terms only — in the simplest case, with X containing
nothing but a constant. Experts in the field tend to frown on this, with reason: in the absence of
any genuine high-frequency information disaggregation just amounts to a “mechanical” smoothing.
But some people may have a use for such smoothing, and it’s permitted by tdisagg.
We should draw attention to a design decision in tdisagg: we have separated the specification of
indicators in X from certain standard deterministic terms that might be wanted, namely, a constant,
linear trend or quadratic trend. If you want a disaggregation without stochastic indicators, you can
omit (or set to null) the argument corresponding to X. In that case a constant (only) will be
employed automatically, but for the Chow–Lin methods one can adjust the deterministic terms
used via an option named det, described below. In other words the content of X becomes implicit.
See section 9.6 for more detail.
Here’s an important point to note when X is given explicitly: although this matrix may contain
extra observations “at the end” we assume that Y and X are correctly aligned at the start. Take
for example the annual to quarterly case: if the first observation in annual Y is for 1980 then the
first observation in quarterly X must be for the first quarter of 1980. Ensuring this is the user’s
responsibility. We will have some more to say about this in the following section.
2 There’s nothing to stop a user from constructing such a series using several primary series as input—by taking the
first principal component or some other means—but that possibility is beyond our scope here.
There are several acceptable ways of supplying the Y and X data, as follows:
1. Both Y and X are matrices. In this case, the size and periodicity of the currently open dataset
(if any) are irrelevant. If Y has T rows X must, of course, have at least s · T rows; if that
condition is not satisfied an “Invalid argument” error will be flagged.
2. Y is a series (or list) and X a matrix. In this case we assume that the periodicity of the
currently open dataset is the lower one, and T will be taken as equal to $nobs (the number of
observations in the current sample range). Again, X must have at least s · T rows.
3. Y is a matrix and X a series or list. We then assume that the periodicity of the currently open dataset is the higher one, so that $nobs defines (s · T + m). And Y is supposed to be at the lower frequency, so its number of rows gives T. We should then be able to find m as $nobs minus s · T; if m < 0 an error is flagged. (A sketch of this case follows the listing below.)
4. Both Y and X are “dataset objects”. We have two sub-cases here.
(a) If X is a series, or an ordinary list of series, the periodicity of the currently open dataset is
taken to be the higher one. The series (or list) containing Y should hold the appropriate
entries every s elements. For example, if s = 4, Y1 will be taken from the first observation,
Y2 from the fifth, Y3 from the ninth, and so on. In practical terms, series of this sort are
likely to be composed by repeating each element of a low-frequency variable s times.
(b) Alternatively, X could be a “MIDAS list”. The concept of a MIDAS list is fully explained in
chapter 20 but for example, in a quarterly dataset a MIDAS list would be a list of three
series, for the third, second and first month (note the ordering). In this case, the current
periodicity is taken to be the lower one and X will contain one column, corresponding to
the high-frequency representation of the MIDAS list.
5. X is omitted. If Y is given as a matrix it is taken to have T rows. Otherwise the interpretation
is determined heuristically: if the Y series is recognized by gretl as composed of repeated
low-frequency observations, or if a series result is requested, it is taken as having length sT ,
otherwise its length is taken to be T .
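By way of illustration, here is a hypothetical sketch of case 3 above: a quarterly dataset is open, the annual values are supplied as a column vector, and a quarterly series (here called indpro, purely for illustration) serves as X:
matrix Y_ann = {100; 103; 107; 112}     # T = 4 annual observations
matrix Yq = tdisagg(Y_ann, indpro, 4)   # quarterly result, s*T rows plus any extrapolated periods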
In the previous section we flagged the importance of correct alignment of X and Y at the start of the
data; we’re now in a position to say a little more about this. If either X or Y are given in matrix form
alignment is truly the user’s responsibility. But if they are dataset objects gretl can be more helpful.
We automatically advance the start of the sample range to exclude any leading missing values, and
retard the end of the sample ranges for X and Y to exclude trailing missing values (allowing for the
possibility that X may extend beyond Y). In addition we further advance the sample start if this is
required to ensure that the X data begin in the first high-frequency sub-period (e.g. the first quarter
of a year or the first month of a quarter). But please note: when gretl automatically excludes leading
or trailing missing values, intra-sample missing values will still provoke an error.
9.4 Extrapolation
As mentioned above, if X holds covariate data which extend beyond the range of the original series
to be disaggregated then extrapolation is supported. But this is inherently risky, and becomes
riskier the longer the horizon over which it is attempted. In tdisagg extrapolation is by default
limited to one low-frequency period (= s high-frequency periods) beyond the end of the original
data. The user can adjust this behavior via the extmax member of the opts bundle passed to
tdisagg, described in the next section.
The function takes up to five arguments, namely Y0, X, s, opts and results (described below), of which only Y0 and s are required. Note that while the return value is a matrix, if Y0 contains a single column or series the result can be assigned to a series, provided it is of the right length to match the current dataset, or the current sample range. Details on the arguments follow.
Y0: the low-frequency data to be disaggregated, given as a matrix, series or list (see the listing of acceptable forms above).
X (optional): X as a matrix, series or list. This should not contain standard deterministic terms,
since they are handled separately (see det under opts below). If this matrix is omitted, then
disaggregation will be performed using deterministic terms only.
s (int): The temporal expansion factor, for example 3 for quarterly to monthly, 4 for annual to
quarterly or 12 for annual to monthly. We do not support cases such as monthly to weekly or
monthly to daily, where s is not a fixed integer value common to all observations; otherwise,
anything goes.
opts (bundle, optional): a bundle holding additional options; a usage sketch follows the argument listing below. The recognized keys are (in alphabetical order):
aggtype (string): Specifies the type of temporal aggregation appropriate to the series in ques-
tion. The value must be one of sum (each low-frequency value is a sum of s high-frequency
values, the default); avg (each low-frequency value is the average of s high frequency val-
ues); or last or first, indicating respectively that each low-frequency value is the last
or first of s high frequency values.
det (int): Relevant only when one of the Chow–Lin methods is selected. This is a numeric
code for the deterministic terms to be included in the regressions: 0 means none; 1,
constant only; 2, constant and linear trend; 3, constant and quadratic trend. The default
is 1.
extmax (int): the maximum number of high-frequency periods over which extrapolation should
be carried out, conditional on the availability of covariate data. A zero value means no
extrapolation; a value of −1 means as many periods as possible; and a positive value
limits extrapolation to the specified number of periods. See section 9.4 for a statement
of the default value.
method (string): Selects the method of disaggregation (see the listing below). Note that the
Chow–Lin methods employ an autoregression coefficient, ρ, which captures the persis-
tence of the target series at the higher frequency and is used in GLS estimation of the
parameters linking X to Y.
• chow-lin (the default) is modeled on the original method proposed by Chow and
Lin. It uses a value of ρ computed as the transformation of a maximum-likelihood
estimate of the low-frequency autocorrelation coefficient.
• chow-lin-mle is equivalent to the method called chow-lin-maxlog in the tempdisagg package for R; ρ is estimated by iterated GLS using the loglikelihood as criterion, as recommended by Bournay and Laroque (1979). (The BFGS algorithm is used internally).
• chow-lin-ssr is equivalent to the method called chow-lin-minrss-quilis in tempdisagg; ρ is estimated by iterated GLS using the sum of squared GLS residuals as criterion (L-BFGS is used internally).
• fernandez is basically “Chow–Lin with ρ = 1”. It is suitable if the target series has a
unit root, and is not cointegrated with the indicator series.
• denton-pfd is the proportional first differences variant of Denton, as modified by
Cholette. See Di Fonzo and Marini (2012) for details.
• denton-afd is the additive first differences variant of Denton (again, as modified by
Cholette). In contrast to the Chow–Lin methods, neither Denton procedure involves
regression.
plot (int): If a non-zero value is given, a simple plot is displayed by way of a “sanity check”
on the final series. See section 9.8 for details.
rho (scalar): Relevant only when one of the Chow–Lin methods is selected. If the method
is chow-lin, then rho is treated as a fixed value for ρ, thus enabling the user to by-
pass the default estimation procedure altogether. If the method is chow-lin-mle or
chow-lin-ssr, on the other hand, the supplied ρ value is used to initialize the numeri-
cal optimization algorithm.
verbose (int): Controls the verbosity of Chow–Lin or Fernández output. If 0 (the default)
nothing is printed unless an error occurs; if 1, summary output from the relevant regres-
sion is shown; if 2, in addition output from the optimizer for the iterated GLS procedure
is shown, if applicable.
results (bundle, optional): If present, this argument must be a previously defined bundle. Upon
successful completion of any of the methods other than denton it contains details of the
disaggregation under a number of keys, including lnl (the loglikelihood) and SSR (the sum of squared residuals), which are referred to below.
If ρ is set to zero—either by specification of the user or because the estimate ρ̂ turned out
to be non-positive—then estimation of the coefficients is via OLS. In that case the lnl and
SSR values are calculated using the OLS residuals (which will be on a different scale from the
weighted residuals in GLS).
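By way of illustration, here is a minimal sketch of how an options bundle might be assembled and passed to tdisagg. The series names are purely illustrative; the example assumes a quarterly dataset in which gdp_ann holds each annual value repeated four times and indicator is a quarterly related series.
bundle opts = null
opts.method = "chow-lin-mle"   # iterated GLS, loglikelihood criterion
opts.aggtype = "sum"           # each annual value is the sum of four quarters
opts.det = 2                   # constant plus linear trend
opts.verbose = 1               # print the GLS regression summary
series gdp_q = tdisagg(gdp_ann, indicator, 4, opts)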
The handling of deterministic terms is as follows.
• If X is given explicitly: No deterministic term is added when the Denton method is used (since
a single preliminary series is wanted) but a constant is added when one of the Chow–Lin
methods is selected. The latter default can be overridden (i.e. the constant removed, or a
trend added) by means of the det entry in the options bundle.
• If X is omitted: By default a constant is used for all methods. Again, for Chow–Lin this can be
overridden by specifying a det value. If for some reason you wanted Denton with just a trend
you would have to supply X containing a trend.
As regards the Cholette-modified Denton methods, for the proportional first difference variant we
calculate the final series using the solution described by Di Fonzo and Marini (2012), specifically
equation (4) on page 5, and for the additive variant we draw on Di Fonzo (2003), pages 3 and 5 in
particular. Note that these procedures require the construction and inversion of a matrix of order
(s + 1)T . If both s and T are large it can therefore take some time, and be quite demanding of RAM.
As regards Chow–Lin, let ρ0 indicate the rho value passed via the options bundle (if applicable). We
then take these steps:
1. If ρ0 > 0 set ρ = ρ0 and go to step 6 if the method is chow-lin or step 7 otherwise. But if
ρ0 < 0 set ρ0 = 0.
2. Estimate via OLS a regression of Y on CX,3 where C is the appropriate aggregation matrix. Let
β̂OLS equal the coefficients from this regression. If ρ0 = 0 and the method is chow-lin go to
step 8.
3. Calculate the (low frequency) first order autocorrelation of the OLS residuals, ρ̂L. If ρ̂L ≥ 10⁻⁶
go to step 4. Otherwise, if the method is chow-lin set ρ = 0 and go to step 8, else set ρ = 0.5
and go to step 7.
4. Refine the positive estimate of ρ̂L via Maximum Likelihood estimation of the AR(1) specifica-
tion as described in Davidson and MacKinnon (2004).
5. If ρ̂L < 0.999, set ρ to the high-frequency counterpart of ρ̂L using the approach given in Chow
and Lin (1971). Otherwise set ρ = 0.999. If the method is chow-lin, go to step 6, otherwise
go to step 7.
6. Perform GLS with the given value of ρ, store the coefficients as β̂GLS and go to step 9.
7. Perform iterated GLS starting from the prior value of ρ, adjusting ρ with the goal of either
maximizing the loglikelihood (method chow-lin-mle) or minimizing the sum of squared GLS
residuals (chow-lin-ssr); set β̂GLS to the final coefficient estimates; and go to step 9.
8. Calculate the final series as Xβ̂OLS + C′(CC′)⁻¹ûOLS, where ûOLS indicates the OLS residuals, and stop.
9. Calculate the final series as Xβ̂GLS + VC′(CVC′)⁻¹ûGLS, where ûGLS indicates the GLS residuals and V is the estimated high-frequency covariance matrix.
• One might question the value of performing steps 2 to 5 when the method is one that calls
for GLS iteration (chow-lin-mle or chow-lin-ssr), but our testing indicates that it can be
helpful to have a reasonably good estimate of ρ in hand before embarking on these iterations.
• Conversely, one might wonder why we bother with GLS iterations if we find ρ̂L < 10⁻⁶. But this allows for the possibility (most likely associated with small sample size) that iteration will lead to ρ > 0 even when the estimate based on the initial OLS residuals is zero or negative.
• Note that in all cases we are discarding an estimate of ρ < 0 (truncating to 0), which we take
to be standard in this field. In our iterated GLS we achieve this by having the optimizer pick
values x in [−∞, +∞] which are translated to [0, 1] via the logistic CDF, ρ = 1/(1 + exp(−x)).
To be precise, that’s the case with chow-lin-mle. But we find that the chow-lin-ssr method
is liable to overestimate ρ and proceed to values arbitrarily close to 1, resulting in numerical
problems. We therefore bound this method to x in [−20, +6.9], corresponding to ρ values
between near-zero and approximately 0.999.4
3 Strictly speaking, CX uses only the first sT rows of X if m > 0.
4 It may be worth noting that the tempdisagg package for R limits both methods to a maximum ρ of 0.999. We find, however, that the ML method can “look after itself”, and does not require the fixed upper bound short of 1.0.
Figure 9.1: Example output from plot option, showing annual GNP (red) and quarterly final series (blue) using
quarterly industrial production as indicator.
As for the Fernández method, this is quite straightforward. The place of the high-frequency covariance matrix V in Chow–Lin is taken by (D′D)⁻¹, where D is the approximate first-differencing matrix, with 1 on the diagonal and −1 on the first sub-diagonal. For efficient computation, however, we store neither D nor D′D as such, and do not perform any explicit inversion. The special structure of (D′D)⁻¹ makes it possible to produce the effect of pre-multiplication by this matrix with O(T²) floating-point operations. Estimation of ρ is not an issue since it equals 1 by assumption.
The advantage of batch processing lies in the spreading of fixed computational cost, leading to
shorter execution time. However, the relative importance of the fixed cost differs substantially
according to the disaggregation method. For the Chow–Lin methods the fixed cost is relatively
small and so little speed-up can be expected, but for the Denton methods it dominates, and (in our
testing) you can process g > 1 series in little more time than it takes to process a single series.
As they say, “Your mileage may vary,” but if you have a large number of series to be disaggregated
via one of the Denton methods you may well find it much faster to use the batch facility of tdisagg.
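As a rough sketch (with purely illustrative names), batch processing just means passing several low-frequency series at once. For example, in an annual dataset, with Xq a one-column matrix of quarterly indicator values aligned at the start of the sample:
list ANNUAL = y1 y2 y3          # three annual series to be disaggregated
bundle opts = null
opts.method = "denton-pfd"      # Denton requires a single preliminary series
matrix M = tdisagg(ANNUAL, Xq, 4, opts)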
9.10 Examples
Listing 9.1 shows an example of usage and its output. The data are drawn from the St Louis
Fed; we disaggregate quarterly GDP to monthly with the help of industrial production and payroll
employment, using the default Chow–Lin method.
Several other example scripts are available from http://gretl.sourceforge.net/tdisagg/.
Output:
Aggregation type sum
GLS estimates (chow-lin) T = 294
Dependent variable: gdpc1
Chapter 10
Special functions in genr
10.1 Introduction
The genr command provides a flexible means of defining new variables. At the same time, somewhat paradoxically, the “genr” keyword is almost never visible in gretl scripts. For example, writing a line such as genr b = 2.5 is not really recommended, since the following alternatives are preferable:
• scalar b = 2.5, which also invokes the genr apparatus in gretl, but provides explicit type
information about the variable b, which is usually preferable. (gretl’s language hansl is stati-
cally typed, so b cannot switch from scalar to string or matrix, for example.)
• b = 2.5, leaving it to gretl to infer the admissible or most “natural” type for the new object,
which would again be a scalar in this case.
In addition to scalar or matrix, the other type keywords that can be used in place of the generic genr term are those enumerated in chapter 11 below. In the case of an array the concrete specification should be used, that is, one of matrices, strings, lists or bundles.1
Therefore, there’s only a handful of special cases where it is really necessary to use the “genr”
keyword:
• genr time — Creates a time trend variable (1, 2, 3, . . . ) under the name time. Note that within an appropriately defined panel dataset this variable honors the panel structure and is a true time index. (In a cross-sectional dataset the command will still work, producing the same result as genr index below, but of course with no temporal meaning.)
• genr index — Creates an observation variable named index, running from 1 to the sample
size.
• genr unitdum — In the context of panel data, creates a set of dummies for the panel groups
or “units”. These are named du_1, du_2, and so forth. Actually, this particular genr usage is
not strictly necessary, because a list of group dummies can also be obtained as:
series gr = $unit
list groupdums = dummify(gr, NA)
(The NA argument to the dummify function has the effect of not skipping any unit as the
reference group, thus producing the full set of dummies.)
1 A recently added advanced datatype is an array of arrays, with the associated type specifier arrays.
• genr timedum — Again for panel data, creates a set of dummies for the time periods, named
dt_1, dt_2, . . . . And again, a list-producing variant without genr exists, using the special
accessor $obsminor which indexes time in the panel context and can be used as a substitute
for time from above:
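# presumably along these lines, by analogy with the group dummies above:
series tindex = $obsminor
list timedums = dummify(tindex, NA)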
Finally, there also exists genr dummy, which produces a set of seasonal dummies. However, it is
recommended to use the seasonals() function instead, which can also return centered dummies.
The rest of this chapter discusses other special function aspects.
In the far tail of a distribution the cdf function can produce an accurate (tiny) value where the pvalue function simply returns 1, from which it is not possible to recover the answer to the CDF question. This may seem surprising at first, but consider: if the lower-tail probability p1 equals, say, 8.94977 × 10⁻³⁵, then the correct value for the complementary probability p2 is 1 − 8.94977 × 10⁻³⁵. But there is no way that value can be represented as a double: that would require over 30 digits of precision.
Of course this is an extreme example. If the x in question is not too far off into one or other tail of the distribution, the cdf and pvalue functions will in fact produce complementary answers, as in this fragment from a console session (where p2 holds the pvalue computed for a moderate x):
? scalar test = 1 - p2
Generated scalar test = 0.0111648
But the moral is that if you want to examine extreme values you should be careful in selecting the
function you need, in the knowledge that values very close to zero can be represented as doubles
while values very close to 1 cannot.
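To make the point concrete, here is a small sketch; the value 12 is arbitrary, and the point is that for x far out in the right tail one should ask pvalue for the upper-tail probability directly rather than computing one minus the CDF.
scalar x = 12
scalar lo_tail = cdf(z, -x)      # about 1.8e-33: tiny, but representable
scalar up_tail = pvalue(z, x)    # same magnitude, computed accurately
scalar bad = 1 - cdf(z, x)       # rounds to exactly zero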
Chapter 11
Gretl data types
11.1 Introduction
Gretl offers the following data types:
• series: n numerical values, where n is the number of observations in the current dataset
• scalar: a single numerical value
• matrix: a rectangular array of numerical values, of arbitrary dimensions
• list: a set of references to series, by ID number
• string: an array of characters
• bundle: a container for zero or more objects of various types
• array: a container for zero or more objects of a given type
The “numerical values” mentioned above are all double-precision floating point numbers.
In this chapter we give a run-down of the basic characteristics of each of these types and also
explain their “life cycle” (creation, modification and destruction). The list and matrix types, whose
uses are relatively complex, are discussed at greater length in chapters 15 and 17 respectively.
11.2 Series
We begin with the series type, which is the oldest and in a sense the most basic type in gretl. When
you open a data file in the gretl GUI, what you see in the main window are the ID numbers, names
(and descriptions, if available) of the series read from the file. All the series existing at any point in
a gretl session are of the same length, although some may have missing values. The variables that
can be added via the items under the Add menu in the main window (logs, squares and so on) are
also series.
For a gretl session to contain any series, a common series length must be established. This is
usually achieved by opening a data file, or importing a series from a database, in which case the
length is set by the first import. But one can also use the nulldata command, which takes as its single argument the desired length, a positive integer.
Each series has these basic attributes: an ID number, a name, and of course n numerical values.
A series may also have a description (which is shown in the main window and is also accessible
via the labels command), a “display name” for use in graphs, a record of the compaction method
used in reducing the variable’s frequency (for time-series data only) and flags marking the variable
as discrete and/or as a numeric encoding of a qualitative characteristic. These attributes can be
edited in the GUI by choosing Edit Attributes (either under the Variable menu or via right-click), or
by means of the setinfo command.
In the context of most commands you are able to reference series by name or by ID number as you
wish. The main exception is the definition or modification of variables via a formula; here you must
use names since ID numbers would get confused with numerical constants.
Note that series ID numbers are always consecutive, and the ID number for a given series will change
if you delete a lower-numbered series. In some contexts, where gretl is liable to get confused by such changes, deletion of series with low ID numbers is disallowed.
Discrete series
It is possible to mark variables of the series type as discrete. The meaning and uses of this facility
are explained in chapter 12.
String-valued series
It is generally expected that series in gretl will be “properly numeric” (on a ratio or at least an
ordinal scale), or the sort of numerical indicator variables (0/1 “dummies”) that are traditional in
econometrics. However, “string-valued” series are also supported — see chapter 16 for details.
11.3 Scalars
The scalar type is relatively simple: just a convenient named holder for a single numerical value.
Scalars have none of the additional attributes pertaining to series, do not have ID numbers, and
must be referenced by name. A common use of scalar variables is to record information made
available by gretl commands for further processing, as in scalar s2 = $sigma^2 to record the
square of the standard error of the regression following an estimation command such as ols.
You can define and work with scalars in gretl without having any dataset in place.
In the gretl GUI, scalar variables can be inspected and their values edited via the “Icon view” (see
the View menu in the main window).
11.4 Matrices
Matrices in gretl work much as in other mathematical software (e.g. MATLAB, Octave). Like scalars
they have no ID numbers and must be referenced by name, and they can be used without any
dataset in place. Matrix indexing is 1-based: the top-left element of matrix A is A[1,1]. Matrices
are discussed at length in chapter 17; advanced users of gretl will want to study this chapter in
detail.
Matrices have two optional attributes beyond their numerical content: they may have column and/or
row names attached; these are displayed when the matrix is printed. See the cnameset and
rnameset functions for details.
In the gretl GUI, matrices can be inspected, analysed and edited via the Icon view item under the
View menu in the main window: each currently defined matrix is represented by an icon.
11.5 Lists
As with matrices, lists merit an explication of their own (see chapter 15). Briefly, named lists can
(and should!) be used to make command scripts less verbose and repetitious, and more easily
modifiable. Since lists are in fact lists of series ID numbers they can be used only when a dataset is
in place.
In the gretl GUI, named lists can be inspected and edited under the Data menu in the main window,
via the item Define or edit list.
11.6 Strings
String variables may be used for labeling, or for constructing commands. They are discussed in
chapter 15. They must be referenced by name; they can be defined in the absence of a dataset.
Such variables can be created and modified via the command-line in the gretl console or via script;
there is no means of editing them via the gretl GUI.
11.7 Bundles
A bundle is a container or wrapper for various sorts of objects — primarily scalars, matrices,
strings, arrays and bundles. (Yes, a bundle can contain other bundles). Secondarily, series and
lists can be placed in bundles but this is subject to important qualifications noted below.
A bundle takes the form of a hash table or associative array: each item placed in the bundle is associated with a key which can be used to retrieve it subsequently. We begin by explaining the mechanics of bundles, then offer some thoughts on what they are good for.
There are several ways of creating a bundle. Here are the first two:
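# presumably the two displays were along these lines:
bundle foo          # simple declaration; usable only once per session for a given name
bundle foo = null   # assignment form; may be reused, emptying an existing foo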
These formulations are basically equivalent, in that they both create an empty bundle. The difference is that the second variant may be reused — if a bundle named foo already exists the effect is to empty it — while the first may only be used once in a given gretl session; it is an error to attempt to declare a variable that already exists.
To create a bundle and populate it with some members in one go, you can use the defbundle
function, introduced in gretl 2017b. For example:
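# presumably something along these lines (keys and contents illustrative):
bundle b1 = defbundle("s", "Sample string", "m", I(3))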
Here the arguments come in pairs: key followed by the object to be associated with the key, with
all terms comma-separated. However, you may prefer to use one or other of the alternative idioms
introduced in gretl 2021a. The first of these looks like this:
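# presumably the key=value form, with illustrative keys and contents:
bundle b2 = _(s="Sample string", m=I(3))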
It’s more streamlined than defbundle but not quite so flexible. You don’t have to quote the keys,
but that also means that you can’t give the name of a key as a string variable; it’s always taken as a
string literal. Yet more streamlined but also less flexible is this variant:
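# presumably, given existing objects named x, mat and str:
bundle b3 = _(x, mat, str)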
which works if and only if there are existing objects x, mat and str in scope and you want to add
them to the bundle under keys equal to their own names.
For more on the defbundle function, see the Gretl Command Reference or the Function Reference
under Help in the GUI program.
To add an object to a bundle you assign to a compound left-hand value: the name of the bundle
followed by the key. Two forms of syntax are acceptable in this context. The recommended syntax
(for most uses) is bundlename.key ; that is, the name of the bundle followed by a dot, then the key.
Both the bundle name and the key must be valid gretl identifiers.1 For example, the statement
foo.matrix1 = m
1 As a reminder: 31 characters maximum, starting with a letter and composed of just letters, numbers or underscore.
adds an object called m (presumably a matrix) to bundle foo under the key matrix1. If you wish to make it explicit that m is supposed to be a matrix you can use the form
matrix foo.matrix1 = m
The alternative syntax is to give the key as a string in square brackets following the bundle name, as in
foo["matrix1"] = m
This syntax offers greater flexibility in that the key string does not have to be a valid identifier (for
example it can include spaces). In addition, when using the square bracket syntax it is possible to
use a string variable to define or access the key in question. For example:
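# presumably something like this, with the key held in a string variable:
string s = "matrix1"
foo[s] = m      # equivalent to foo["matrix1"] = m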
To get an item out of a bundle, again use the name of the bundle followed by the key, as in
matrix bm = foo.matrix1
# or using the alternative notation
matrix bm = foo["matrix1"]
# or using a string variable
matrix bm = foo[s]
Note that the key identifying an object within a given bundle is necessarily unique. If you reuse an
existing key in a new assignment, the effect is to replace the object which was previously stored
under the given key. It is not required that the type of the replacement object is the same as that
of the original.
Also note that when you add an object to a bundle, what in fact happens is that the bundle acquires
a copy of the object. The external object retains its own identity and is unaffected if the bundled
object is replaced by another. Consider the following script fragment:
bundle foo
matrix m = I(3)
foo.mykey = m
scalar x = 20
foo.mykey = x
After the above commands are completed bundle foo does not contain a matrix under mykey, but
the original matrix m is still in good health.
To delete an object from a bundle use the delete command, with the bundle/key combination, as
in
delete foo.mykey
This destroys the object associated with mykey and removes the key from the hash table.
To determine whether a bundle contains an object associated with a given key, use the inbundle()
function. This takes two arguments: the name of the bundle and the key string. The value returned
by this function is an integer which codes for the type of the object (0 for no match, 1 for scalar, 2
for series, 3 for matrix, 4 for string, 5 for bundle and 6 for array). The function typestr() may be
used to get the string corresponding to this code. For example:
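# a sketch, adding a scalar to the bundle foo used above:
foo.alpha = 3.14
scalar t = inbundle(foo, "alpha")
if t
    printf "alpha holds an object of type %s\n", typestr(t)   # should print "scalar"
endif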