tfl {zipfR} | R Documentation |
In the zipfR library, tfl objects are used to represent a type frequency list, which specifies the observed frequency of each type in a corpus. For mathematical reasons, expected type frequencies are rarely considered.
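As a rough illustration of what a type frequency list contains, the following base R sketch counts the types in a small made-up token vector and arranges the result in columns named like those of a tfl object. The toy corpus and the variable names `tokens`, `counts` and `toy` are assumptions for illustration only; the sketch does not call zipfR and is not how the package builds its objects.

```r
# Toy corpus (hypothetical data, for illustration only)
tokens <- c("the", "cat", "saw", "the", "dog", "and", "the", "dog", "ran")

# Count how often each distinct type occurs among the tokens
counts <- table(tokens)

# Arrange the counts in columns named like those of a type frequency list:
# k = type ID, f = observed frequency, type = type label
toy <- data.frame(
  k = seq_along(counts),
  f = as.vector(counts),
  type = names(counts),
  stringsAsFactors = FALSE
)
print(toy)
```

The frequencies in `toy$f` add up to the number of tokens, and the number of rows is the number of distinct types.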
With the tfl constructor function, an object can be initialized directly from the specified data vectors. It is more common to read a type frequency list from a disk file with read.tfl or, in some cases, derive it from an observed frequency spectrum with spc2tfl.
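The relationship between a frequency spectrum and a type frequency list can be sketched in base R. A spectrum records how many types (Vm) occur exactly m times, so expanding it yields one frequency per type but no type labels. The values of `m` and `Vm` below are made up, and the ordering (most frequent first) is an assumption of this sketch, not necessarily what spc2tfl does.

```r
# Hypothetical frequency spectrum: Vm[i] types occur exactly m[i] times
m  <- c(1, 2, 5)
Vm <- c(4, 2, 1)

# Expand the spectrum into one frequency per type, most frequent first.
# Type labels cannot be recovered, so the IDs are just running numbers.
f <- rep(rev(m), times = rev(Vm))
k <- seq_along(f)

# Going back: tabulate how many types share each frequency value
spectrum <- table(f)
print(spectrum)
```

The round trip preserves the frequency information (sample size and number of types) but not the identity of the types.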
tfl objects should always be treated as read-only.
tfl(f, k=1:length(f), type=NULL, f.min=min(f), f.max=max(f), incomplete=!(missing(f.min) && missing(f.max)), N=NA, V=NA, delete.zeros=FALSE)
k |
integer vector of type IDs k (if omitted, natural numbers 1, 2, ... are assigned automatically) |
f |
vector of corresponding type frequencies f_k |
type |
optional character vector of type representations (e.g. word forms or lemmata), used for informational and printing purposes only |
incomplete |
indicates that the type frequency list is incomplete, i.e. only contains types in a certain frequency range (typically, the lowest-frequency types may be excluded). Incomplete type frequency lists are rarely useful. |
N, V |
sample size and vocabulary size corresponding to the type frequency list; these have to be specified explicitly for incomplete lists |
f.min, f.max |
frequency range represented in an incomplete type frequency list (see details below) |
delete.zeros |
if TRUE, delete types with f=0 from the type frequency list, after assigning type IDs. This operation does not make the resulting tfl object incomplete. |
If f.min and f.max are not specified, but the list is marked as incomplete (with incomplete=TRUE), they are automatically determined from the frequency vector f (making the assumption that all types in this frequency range are listed).
Explicit specification of either f.min or f.max implies an incomplete list. In this case, all types outside the specified range will be deleted from the list. If incomplete=FALSE is explicitly given, N and V will be determined automatically from the input data (which is assumed to be complete), but the resulting type frequency list will still be incomplete.
If you just want to remove types with f=0 without marking the type frequency list as incomplete, use the option delete.zeros=TRUE.
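The reason N and V must be supplied for an incomplete list, while deleting zeros is harmless, can be seen with plain arithmetic. The base R sketch below uses a made-up frequency vector and hypothetical variable names; it mimics the filtering described above and does not call tfl itself.

```r
# Hypothetical frequencies for 8 type IDs (two types never occur)
f <- c(120, 45, 7, 3, 1, 1, 0, 0)
k <- seq_along(f)

# Totals of the full sample
N_full <- sum(f)        # sample size: total number of tokens
V_full <- sum(f > 0)    # vocabulary size: number of types that occur

# Restricting to a frequency range (here f >= 3) drops low-frequency types,
# so N and V can no longer be recovered from the remaining rows.
f_min <- 3
keep <- f >= f_min
N_visible <- sum(f[keep])
V_visible <- sum(keep)

# Dropping only the zero-frequency types loses no information about N or V
nonzero <- f > 0
N_nonzero <- sum(f[nonzero])
V_nonzero <- sum(nonzero)

print(c(N_full = N_full, N_visible = N_visible, N_nonzero = N_nonzero))
print(c(V_full = V_full, V_visible = V_visible, V_nonzero = V_nonzero))
```

After range filtering, the visible rows undercount both totals, whereas removing zeros leaves them unchanged.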
A tfl object is a data frame with the following variables:
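k |
integer type ID |
f |
observed frequency of the type |
type |
optional type representation (e.g. word form or lemma) |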
The data frame always has to be sorted with respect to the k column (ascending order).
The following attributes are used to store additional information about the type frequency list:
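N, V |
sample size and vocabulary size. For a complete list these can be determined from the f variable, but they are essential for an incomplete list. |
incomplete |
if TRUE, the type frequency list is incomplete, i.e. it lists only types in the frequency range given by f.min and f.max |
f.min, f.max |
frequency range represented in the list (relevant when the incomplete flag is set) |
hasTypes |
indicates whether the type variable is present |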
An object of class tfl representing the specified type frequency list. This object should be treated as read-only (although such behaviour cannot be enforced in R).
read.tfl, write.tfl, sample.tfl, spc2tfl, tfl2spc
Generic methods supported by tfl objects are print, summary, N, V and Vm.
Implementation details and non-standard arguments for these methods can be found on the manpages print.tfl, summary.tfl, N.tfl, V.tfl, etc.
library(zipfR)
## typically, you will read a tfl from a file
## (see examples in the read.tfl manpage)
## or you can load a ready-made tfl
data(Brown.tfl)
summary(Brown.tfl)
print(Brown.tfl)
## or create it from a spectrum (with different ids and
## no type labels)
data(Brown.spc)
Brown.tfl2 <- spc2tfl(Brown.spc)
## same frequency information as Brown.tfl
## but with different ids and no type labels
summary(Brown.tfl2)
print(Brown.tfl2)
## how to draw a Zipf's rank/frequency plot
## by extracting frequencies from a tfl
plot(sort(Brown.tfl$f, decreasing=TRUE), log="y", xlab="rank", ylab="frequency")
## simulating a tfl
Zipfian.tfl <- tfl(1000/(1:1000))
plot(Zipfian.tfl$f, log="y")
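The simulated list in the examples uses frequencies of the form 1000/rank, which should follow Zipf's law with exponent 1. As a quick numerical check that needs no plotting, the base R sketch below fits a line to log-frequency against log-rank. The variable names are illustrative, and the fit is a plain least-squares regression rather than any of zipfR's own models.

```r
# Same frequencies as the simulated example: f = 1000 / rank
r <- 1:1000
f <- 1000 / r

# Fit a straight line to log-frequency against log-rank
fit <- lm(log(f) ~ log(r))

# The slope estimates the negative Zipf exponent
slope <- unname(coef(fit)[2])
print(slope)
```

For real data such as Brown.tfl the points will not lie exactly on a line, so the fitted slope is only a rough summary.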