lnre {zipfR}R Documentation

LNRE Models (zipfR)

Description

LNRE model constructor, returns an object representing a LNRE model with the specified parameters, or allows parameters to be estimated automatically from an observed frequency spectrum.

Usage


  lnre(type=c("zm", "fzm", "gigp"),
       spc=NULL, debug=FALSE,
       cost=c("chisq", "linear", "smooth.linear", "mse", "exact"),
       m.max=15,
       method=c("Custom", "NLM", "Nelder-Mead", "SANN"),
       exact=TRUE, sampling=c("Poisson", "multinomial"), 
       ...)

Arguments

type class of LNRE model to use (see "LNRE Models" below)
spc observed frequency spectrum used to estimate model parameters
debug if TRUE, detailed debugging information will be printed during parameter estimation
cost cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a listing of available cost functions).
m.max number of spectrum elements considered by the cost function (see "Cost Functions" below for more information)
method algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms)
exact if FALSE, certain LNRE models will be allowed to use approximations when calculating expected values and variances, in order to improve performance and numerical stability. However, the computed values might be inaccurate or inconsistent in "extreme" situations: in particular, E[V] might be larger than N when N is very small; \sum_m E[V_m] can be larger than E[V] at the same N; sum_m (m * E[V_m]) can be larger than N
sampling type of random sampling model to use. Poisson sampling is mathematically simpler and allows fast and robust calculations, while multinomial sampling is more accurate especially for very small samples. Poisson sampling is the default and should be unproblematic for sample sizes N \ge 10000. NB: The multinomial sampling option has not been implemented yet.
... all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters)

Details

Currently, the following LNRE models are supported by the zipfR package:

The Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details).

The finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details).

The Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details).

If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.

Value

An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm"). This object represents a LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.

The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).

Parameter Estimation

Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.

For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.

Parameter estimation is a crucial and often also quite critical step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor and counter-intuitive fit, or may fail altogether.

Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.

Cost Functions

The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for vocabulary size and the first elements of the frequency spectrum (V_1, \ldots, V_m, where m is given by the m.max argument):

chisq:
cost function based on a simplified version of the multivariate chi-squared test for goodness-of-fit (assuming independence between the random variables V_m). This cost function usually achieves the best results in the goodness-of-fit evaluation and is used by default.

linear:
linear cost function, which sums over the absolute differences between observed and expected values. This cost function puts more weight on fitting the vocabulary size and the first few elements of the frequency spectrum (where absolute differences are much larger than for higher spectrum elements).

smooth.linear:
modified version of the linear cost function, which smoothes the kink of the absolute value function for a difference of 0 (since non-differentiable cost functions might be problematic for gradient-base minimization algorithms)

mse:
mean squared error cost function, averaging over the squares of differences between observed and expected values. This cost function penalizes large absolute differences more heavily than linear cost (and therefore puts even greater weight on fitting vocabulary size and the first spectrum elements).

exact:
this "virtual" cost function attempts to match the observed vocabulary size and first spectrum elements exactly, ignoring differences for all higher spectrum elements. This is achieved by adjusting the value of m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied by estimating parameters is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determined parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.

Minimization Algorithms

Several different minimization algorithms can be used for parmeter estimation and are selected with the method argument:

Custom:
by default, a custom estimation procedure is used for each type of LNRE model, which may exploit special mathematical properties of the model in order to calculate one or more of the parameter values directly. For example, one parameter of the ZM and fZM models can easily be determined from the constraint E[V] = V (but note that this additional constraint leads to a different fit than is obtained by plain minimization of the cost function!). Custom estimation might also apply special configuration settings to improve convergence of the minimization process, based on knowledge about the valid ranges and "behaviour" of model parameters. If no custom estimation procedure has been implemented for the selected LNRE model, lnre falls back on the Nelder-Mead or NLM algorithm.

NLM:
a standard Newton-type algorithm for nonlinear minimization, implemented by the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.

Nelder-Mead:
the Nelder-Mead algorithm, implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is also used internally by most custom minimization algorithms.

SANN
minimization by simulated annealing, also provided by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).

See the nlm and optim manpages for more information about the minimization algorithms used and key references.

See Also

Detailed descriptions of the different LNRE models provided by zipfR and their parameters can be found on the manpages lnre.zm, lnre.fzm and lnre.gigp.

Useful methods for trained models are lnre.spc, lnre.vgc, EV, EVm, VV, VVm. Suitable implementations of the print and summary methods are also provided (see print.lnre for details). Note that the methods N, V and Vm can be applied to LNRE models with estimated parameters and return information about the observed frequency spectrum used for parameter estimation.

The lnre.details manpage gives details about the implementation of LNRE models and the internal structure of lnre objects, while estimate.model has more information on the parameter estimation procedure (both manpages are intended for developers).

See lnre.goodness.of.fit for a complete description of the goodness-of-fit test that is automatically performed after parameter estimation (and which is reported in the summary of the LNRE model). This function can also be used to evaluate the predictions of the LNRE model on a different data set than the one used for parameter estimation.

Examples


## load Dickens dataset
data(Dickens.spc)

## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m


## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m,1)


## expected V and V_m and their variances for arbitrary N 
EV(m,100e6)
VV(m,100e6)
EVm(m,1,100e6)
VVm(m,1,100e6)

## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10

## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact


## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm

## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm


## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx


## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)

m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)


[Package zipfR version 0.6-5 Index]