lnre {zipfR} | R Documentation |
LNRE model constructor. Returns an object representing an LNRE model with the specified parameters, or estimates the parameters automatically from an observed frequency spectrum.
lnre(type=c("zm", "fzm", "gigp"), spc=NULL, debug=FALSE, cost=c("chisq", "linear", "smooth.linear", "mse", "exact"), m.max=15, method=c("Custom", "NLM", "Nelder-Mead", "SANN"), exact=TRUE, sampling=c("Poisson", "multinomial"), ...)
type |
class of LNRE model to use (see "LNRE Models" below) |
spc |
observed frequency spectrum used to estimate model parameters |
debug |
if TRUE, detailed debugging information will be printed during parameter estimation |
cost |
cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a listing of available cost functions). |
m.max |
number of spectrum elements considered by the cost function (see "Cost Functions" below for more information) |
method |
algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms) |
exact |
if FALSE, certain LNRE models will be allowed to use approximations when calculating expected values and variances, in order to improve performance and numerical stability. However, the computed values might be inaccurate or inconsistent in "extreme" situations. In particular, E[V] might be larger than N when N is very small; \sum_m E[V_m] can be larger than E[V] at the same N; and \sum_m (m * E[V_m]) can be larger than N |
sampling |
type of random sampling model to use. Poisson sampling is mathematically simpler and allows fast and robust calculations, while multinomial sampling is more accurate, especially for very small samples. Poisson sampling is the default and should be unproblematic for sample sizes N \ge 10000. NB: The multinomial sampling option has not been implemented yet. |
... |
all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters) |
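The consistency conditions listed for the exact argument can be checked directly on a model. The following is a minimal sketch, assuming the zipfR package is installed. The fZM parameter values are the illustrative ones from the examples at the end of this page, and N = 10 is an arbitrary choice of a very small sample. Whether any check actually fails depends on the model and on N.

```r
library(zipfR)

## fZM model with directly specified parameters, approximations allowed
model <- lnre("fzm", alpha = 0.5, A = 1e-6, B = 0.01, exact = FALSE)

N <- 10                                           # deliberately tiny sample (arbitrary)
ev  <- EV(model, N)                               # E[V] at sample size N
evm <- sapply(1:N, function(k) EVm(model, k, N))  # E[V_1], ..., E[V_N]

## each entry is TRUE if the corresponding consistency condition holds
checks <- c(
  "E[V] <= N"             = ev <= N,
  "sum_m E[V_m] <= E[V]"  = sum(evm) <= ev,
  "sum_m m * E[V_m] <= N" = sum(seq_len(N) * evm) <= N
)
print(checks)
```

Repeating the check with exact=TRUE on the same parameters gives a baseline for comparison.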
Currently, the following LNRE models are supported by the zipfR package:
The Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details).
The finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details).
The Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details).
If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.
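As a hedged illustration of fixing one parameter, the sketch below estimates a GIGP model twice on the Dickens spectrum: once fully automatically, and once with gamma held at -0.5. It assumes the zipfR package is installed. The value -0.5 is an arbitrary choice for demonstration, not a recommendation, and estimation with a fixed parameter may still fail on other data.

```r
library(zipfR)
data(Dickens.spc)

## fully automatic estimation: all GIGP parameters are estimated
m.free <- lnre("gigp", Dickens.spc)

## gamma is fixed at -0.5 (arbitrary illustrative value); only the
## remaining parameters are estimated from the spectrum
m.fixed <- lnre("gigp", Dickens.spc, gamma = -0.5)

## compare parameter values and goodness-of-fit in the printed summaries
m.free
m.fixed
```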
An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm").
This object represents an LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.
The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).
Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.
For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.
Parameter estimation is a crucial and often delicate step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor or counterintuitive fit, or may fail altogether.
Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.
The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for vocabulary size and the first elements of the frequency spectrum (V_1, \ldots, V_m, where m is given by the m.max argument):
chisq:
linear:
smooth.linear:
mse:
exact: adjusts m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied by estimating parameters is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determine parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.
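To make the role of m.max concrete, here is a minimal sketch of a mean-squared-error style cost in plain R. It is not zipfR's implementation: the function name toy_mse_cost, its arguments, and the way V and V_1, ..., V_m are combined are assumptions made for illustration, and all numbers are made up.

```r
## Hypothetical helper (not part of zipfR): mean squared difference between
## observed and expected values of V and the first m.max spectrum elements.
toy_mse_cost <- function(obs.V, obs.Vm, exp.V, exp.Vm, m.max = 15) {
  m.max <- min(m.max, length(obs.Vm), length(exp.Vm))
  idx <- seq_len(m.max)
  diffs <- c(obs.V - exp.V, obs.Vm[idx] - exp.Vm[idx])
  mean(diffs^2)
}

## made-up observed and expected values, for illustration only
obs.V  <- 100
obs.Vm <- c(50, 20, 10, 6, 4)
exp.V  <- 98
exp.Vm <- c(52, 19, 10, 5, 8)

toy_mse_cost(obs.V, obs.Vm, exp.V, exp.Vm, m.max = 2)  # uses V, V_1, V_2: 3
toy_mse_cost(obs.V, obs.Vm, exp.V, exp.Vm, m.max = 5)  # uses V, V_1, ..., V_5: 26/6
```

In this toy data the poorly matched V_5 only enters the cost when m.max is large enough, which is one way a smaller m.max can change which parameter values look best.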
Several different minimization algorithms can be used for parameter estimation and are selected with the method argument:
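Custom: a custom minimization algorithm for the chosen type of LNRE model. If no custom algorithm is available, lnre falls back on the Nelder-Mead or NLM algorithm.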
NLM: the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.
Nelder-Mead: the Nelder-Mead algorithm, as implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is also used internally by most custom minimization algorithms.
SANN: simulated annealing, as implemented by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).
See the nlm and optim manpages for more information about the minimization algorithms used and key references.
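The three general-purpose algorithms can be tried side by side on a toy problem using only base R. The sketch below does not use an LNRE cost function: the synthetic "observed" spectrum, the power-law curve C * m^(-a) and the starting values are assumptions chosen for illustration.

```r
## Toy cost (not an LNRE cost): mean squared distance between a synthetic
## "observed" spectrum and a two-parameter power-law curve C * m^(-a).
m <- 1:15
observed <- 1000 * m^(-1.8)
cost <- function(par) {
  expected <- exp(par[1]) * m^(-par[2])   # par = c(log(C), a)
  mean((observed - expected)^2)
}
start <- c(log(500), 1)

fit.nlm  <- nlm(cost, start)                            # numerical derivatives
fit.nm   <- optim(start, cost, method = "Nelder-Mead")  # derivative-free simplex
set.seed(1)                                             # SANN is stochastic
fit.sann <- optim(start, cost, method = "SANN")         # simulated annealing

## final cost reached by each algorithm (smaller is better)
c(NLM = fit.nlm$minimum, "Nelder-Mead" = fit.nm$value, SANN = fit.sann$value)
```

The number of cost function evaluations used by optim is reported in the counts component of its result, which gives a rough impression of the relative speed of Nelder-Mead and SANN.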
Detailed descriptions of the different LNRE models provided by zipfR and their parameters can be found on the manpages lnre.zm, lnre.fzm and lnre.gigp.
Useful methods for trained models are lnre.spc, lnre.vgc, EV, EVm, VV, VVm. Suitable implementations of the print and summary methods are also provided (see print.lnre for details). Note that the methods N, V and Vm can be applied to LNRE models with estimated parameters and return information about the observed frequency spectrum used for parameter estimation.
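A brief sketch of two of these methods, assuming the zipfR package is installed: the ZM parameter values are the illustrative ones from the examples on this page, and the sample sizes are arbitrary. The call to lnre.vgc assumes that it takes the model and a vector of sample sizes N; see its manpage for the full argument list.

```r
library(zipfR)

## ZM model with directly specified parameters (illustrative values)
model <- lnre("zm", alpha = 0.5, B = 0.01)

## expected frequency spectrum at N = 1000, first 10 elements
lnre.spc(model, N = 1000, m.max = 10)

## expected vocabulary growth curve at a few sample sizes
Ns <- seq(1000, 10000, by = 1000)
vgc <- lnre.vgc(model, N = Ns)
vgc

## the same expected vocabulary sizes, computed one by one with EV
sapply(Ns, function(n) EV(model, n))
```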
The lnre.details manpage gives details about the implementation of LNRE models and the internal structure of lnre objects, while estimate.model has more information on the parameter estimation procedure (both manpages are intended for developers).
See lnre.goodness.of.fit for a complete description of the goodness-of-fit test that is automatically performed after parameter estimation (and which is reported in the summary of the LNRE model). This function can also be used to evaluate the predictions of the LNRE model on a different data set than the one used for parameter estimation.
## load Dickens dataset
data(Dickens.spc)

## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m

## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m,1)

## expected V and V_m and their variances for arbitrary N
EV(m,100e6)
VV(m,100e6)
EVm(m,1,100e6)
VVm(m,1,100e6)

## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10

## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact

## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm

## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm

## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx

## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)