Hidden Markov Models for Modeling of Daily Station Rainfall
Introduction
The Hidden
Markov Model (HMM) provides a framework for modeling daily rainfall occurrences
and amounts on multisite rainfall networks. The HMM fits a model to observed
rainfall records by introducing a small number of discrete rainfallstates.
These states allow a diagnostic interpretation of observed rainfall variability
in terms of a few rainfall patterns. They are not directly observable,
or 'hidden' from the observer.
The time sequence
of which state is active on each day follows a Markov chain. Thus, the
state which is active 'today' depends only on the state which was active
'yesterday' according to transition probabilities.
The HMM also
allows you to simulate
rainfall at each of the station locations, such that key statistical properties
(eg. rainfall probabilities, dry/wet spell lengths) of the simulated rainfall
match those of the observed rainfall records. This can be useful for generating
large numbers of synthetic realizations of rainfall for input into statistical
analysis, or input into a crop simulation model, for example.
The HMM also
provides a basis for downscaling GCM simulations to the station scale,
or calibrating estimates of observed rainfall. Downscaling is accomplished
using a nonhomogeneous
HMM (NHMM) in which predictors are incorporated into the HMM. The predictors
modulate the transitions between the states over time. The transition
matrix is no longer homogeneous in time, hence the name.
The HMM toolbox
contains models of varying complexity that you can use to analyze sets
of rainfall observations. The HMM represents the rainfallgauge observations
themselves, without any of the treatment needed to average the pointscale
observations onto grids. It represents the raw observations. Missing observations
are allowed, and a treated in a consistent probabilistic manner. Several
options are available for modeling spatial dependencies between stations.
The HMM can be
thought of as a cluster analysis of multisite rainfall observations,
with an explicit (Markov chain) time dependence. In its ability to make
simulations, it also bears similarities to existing stochastic weather
generators.
This tutorial provides
a basic introduction to the use of the Toolbox for analysis of rainfall
observations, covering the capabilities in the GUI implementation.
Typical Usage of HMMTool
Launch HMMTool,
either from the command line (linux and Mac), or by doubleclicking the
HMMTool icon (Windows). Please see the platformspecific installation
instructions for
each platform (Linux, Mac or Windows XP). This will bring up the Main
Frame of the HMMTool graphical user interface.
For a chosen model, analysis with HMMTool consists of a sequence
of Actions, performed consecutively from the Action
Menu at the top of the main frame. First estimate
the model parameters (learn action), secondly estimate the mostlikely
state sequence (viterbi action), and thirdly generate rainfall simulations.
1.Describe Your Data
Using the
Main Frame
of HMMTool, select your rainfall data file, an output directory for
the results files, and fill in the boxes specifying the number of stations,
number of sequences (e.g. one for each summer season), length of sequences
(i.e. number of days in each season), and number of hidden states. A
4state model often makes a reasonable starting point. We return to
the number of hidden states later. Apart from the number of states,
which can be any value, the values specifying the size of the rainfall
data set need to match your input file.At least two input
files are needed: one containing the rainfall data, and one
containing the longitudelatitude coordinates of the station. In the
case for NHMM
or NMIXTURE, an additional input file is needed for the predictors.
In the rainfall data file, columns should correspond to stations, and
rows to days.
2.Choose
the Model
Having filled
in the basic options, the next step is to choose the model type from
the Model
Menu, and specify its details.
a.
Model Types There are 4 model types under the Model menu:
HMM
 Hidden Markov Model
Model
with firstorder Markovian time dependence
NHMM
 Nonhomogeneous Hidden Markov Model
HMM
with predictors, sometimes called 'inputs.' These inputs consist
of one or more daily realvalued timeseries. A logistic regression
is used to model model the dependence of stateprobability on the
inputs.
MIXTURE
 Mixture model
Model
without explicit time dependence. This is the same as the HMM, but
without the modeled Markov time dependence.
NMIXTURE  Nonhomogeneous
Mixture model
Model MIXTURE
with predictors.
b.
Specifying the Model Details
The
way you wish to model the rainfall data is controlled under the
Advanced
Options tab, by specifying the spatial model and the rainfall
distribution.
Spatial
Model
Choice
of spatial model can be important to the properties of simulated
rainfall in applications where capturing spatial correlations between
stations may be important (e.g. hydrology). Two spatial models are
available: (a) Independent,
in which rainfall at each station is assumed independent of the
other stations, or (b) ChowLiu,
in which spatial dependence is modeled using tree models.
Please
note that the ChowLiu
option is currently only working for certain choices of model. The
Independent model is selected by default, and is sufficient in most
cases. Some spatial dependence will be captured through the state
variable even under the Independent
model, for example if a state is characterized by high rainfall
probabilities at all stations.
Rainfall
Distribution Type
HMMTool
can model binary rainfall occurrence data, or real valued rainfall
amounts. If the data are binary, then the Bernoulli
distribution should be used. If the data are real, the rainfall
amounts are modeled using a mixture of distributions, consisting
of a deltafunction at zero and a set of exponentials or gamma distributions.
These two possibilities are named deltaexponential
and deltagamma
respectively on the right of the Advanced
Options panel. The number of components included in the
mixture (including the delta function) needs to be entered in the
line below. For example, the common (default) choice of a delta
function together with two exponentials would be deltaexponential
with 3 components entered. A delta function together
with one gamma distribution would be deltagamma
with 2 components.
Other
Advanced Options
Make
sure learn
is selected from the Action
Menu.
There
are two further options for the learn
action: Number
of restarts, and EM
precision. The number of restarts specifies how many
times the EM algorithm is executed from different random starting
points. HMMTool uses an iterative maximum likelihood method, selecting
the restart with the highest loglikelihood as its best estimate
of the global maximum in the likelihood. While larger numbers of
restarts are preferable, this slows execution. Inspection of loglikelihood
values in the output can help decide on an appropriate number. Typically,
10 is a relatively safe choice, but a much smaller number can be
used for test purposes. The EM
precision controls the stopping point of iteration,
and can be left at its default value.
3.Make a Preliminary
Analysis
We
are now ready to make our first analysis, performing the three actions
consecutively from the Action
Menu: estimate the model parameters (learn),
estimate the mostlikely state sequence (viterbi),
and thirdly generate rainfall simulations (simulate).
a. Learn
Action
Make
sure the learn
is selected from the Action
Menu. Select Launch
from the main frame. Print
to Screen prints the loglikelihood values at each iteration,
although execution will have to be completed before anything appears.
Random
number seed allows for different starting points of the
pseudo random number generator; the results are repeatable from each
seed value.
Execution
times may be several minutes, and much longer on slower machines for
large datasets with large numbers of hidden states.
Once
execution is complete, the results can be plotted from the Graphics
Menu, as maps of rainfall
probability and mean rainfall
amount on wet days at each station. The mean rainfall
amount (for deltaexponential
and deltagamma)
is computed from the parametric form of the distribution.
All
the estimated model parameters are written to a *.out file in the
specified output directory. The output filenames are based on the
rainfall input filename, and include the action (here learn), modeltype
and options.
b. Viterbi
Action
Once
the model parameters have been estimated, we are ready to estimate
the mostlikely state sequence, using the Viterbi
action from the Action
Menu. No options or advanced options need to be changed.
Once Viterbi
has been selected, simply click Launch
on the Main
Frame.
Execution
of the Viterbi
action is rapid. Once it completes, a visualization of the state sequence
can be viewed from the Graphics
Menu. The sequence is written to a *.out file in the output
directory, with a filename including the word
'viterbi.' Note that
the states are numbered from 0 in this file, with 0 corresponding
to state 1. Each row of the file contains a single temporal sequence,
ordered from first to last.
The
Viterbi sequence can be used offline to construct composites of atmospheric
circulation, from reanalysis data.
c. Simulate
Action
Once
the model parameters have been estimated, the model can be used to
make a set of rainfall simulations, using the simulate
action from the Action
Menu. When simulate
is selected, an additional option appears on the Main
Frame to specify the number of simulations. If the dataset
has 30 sequences of 90 days, 2 simulations would produce 2 sets of
30 sequences of 90 days, with the second set below the first set.
No advanced options need to be changed. Once simulate
has been selected, simply click Launch
on the main frame.
The stochastic simulations are written to a *.out file in the output
directory, with the filename including the word
'simulation' .
The
simulations can be analyzed offline to test the model performance
in terms of key statistics, such as dryspell run lengths, for example.
4.Choosing
the Number of States
The
most appropriate number of hidden states (k)
is a subjective choice based on the goal of the analysis. It can be
guided by the BIC (Bayes Information Criterion, a penalized likelihood
measure) values obtained with different choices of k,
found in the 'learn'.out files. A graph of the BIC vs. k
(for k
= 2 .. 15, say) will generally increase with k
to a certain point, before reaching a maximum or a plateau. The smallest
k
for which the BIC no longer increases appreciably makes a reasonable
choice, although examination of the states themselves in terms of composites
of atmospheric winds etc can provide further guidance.
The
NHMM
and NMIXTURE
model types require an additional predictor
input file on the Main
Frame, that specifies the timeseries of a set of predictors.
For the learn
and Viterbi
actions, this file needs to have the same number of rows (number of
sequences x sequencelength) as the rainfall data. Each whitespacedelimited
column of the file should contain a separate predictor, with the number
of columns matching the number of predictors specified in the main panel.
For
the simulation
action, the number of rows in the predictor
input file must match the desired length of the simulations.
The
output files from the Viterbi
and simulation
actions with predictors have the same form as for models without predictors.
However, note that the learn
action with predictors produces an output file of a sightly different
form, because there is no simple transition matrix in this case. A transition
matrix can be reconstructed from the Viterbi sequence.
