zipfR-package {zipfR} | R Documentation |
The zipfR package performs Large-Number-of-Rare-Events (LNRE) modeling of (linguistic) type frequency distributions (Baayen 2001) and provides utilities to run various forms of lexical statistics analysis in R.
The best way to get started with zipfR is to read the tutorial, which you can find via the HTML documentation (follow the Overview link); you can also download it from http://purl.org/stefan.evert/zipfR/
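For readers new to the terminology, the quantities this package works with can be illustrated in a few lines of base R. This is only a minimal sketch on a made-up toy token vector. The names tokens, type_freq and spectrum are illustrative, and no zipfR functions are used.

```r
## toy sample of 7 tokens (made-up data, for illustration only)
tokens <- c("the", "cat", "the", "dog", "a", "the", "cat")

## type frequency list: how often each distinct type occurs
type_freq <- table(tokens)

## frequency spectrum: number of types V_m occurring exactly m times
spectrum <- table(type_freq)

N  <- length(tokens)        # sample size (number of tokens): 7
V  <- length(type_freq)     # vocabulary size (number of types): 4
V1 <- sum(type_freq == 1)   # number of hapax legomena: 2

print(spectrum)
```

In zipfR itself, the corresponding values are obtained with N(), V() and Vm() on a frequency spectrum object.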
zipfR is released under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)
Stefan Evert <stefan.evert@uos.de> and Marco Baroni <marco.baroni@unitn.it>
Maintainer: Stefan Evert <stefan.evert@uos.de>
zipfR Website: http://purl.org/stefan.evert/zipfR/
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Baroni, Marco (to appear). Distributions in text. To appear in: A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 39. Mouton de Gruyter, Berlin.
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/
Evert, Stefan (2004). A simple LNRE model for random character sequences. Proceedings of JADT 2004, 411-422.
Evert, Stefan and Baroni, Marco (2006). Testing the extrapolation quality of word frequency models. Proceedings of Corpus Linguistics 2005.
Evert, Stefan and Baroni, Marco (2006). The zipfR library: Words and other rare events in R. useR! 2006: The second R user conference.
The zipfR tutorial: available from http://purl.org/stefan.evert/zipfR/ and via the HTML documentation (by following the Overview link)
Some good entry points into the zipfR documentation are spc, vgc, tfl, read.spc, read.tfl, read.vgc, lnre, lnre.vgc, plot.spc, and plot.vgc.
The same authors also develop the corpora library (available on CRAN), which supports simple inferential statistics for corpus analysis.
Harald Baayen's LEXSTATS tools: http://www.mpi.nl/world/persons/private/baayen/software.html
Stefan Evert's UCS tools: http://collocations.de/
## load Oliver Twist and Great Expectations frequency spectra
data(DickensOliverTwist.spc)
data(DickensGreatExpectations.spc)

## check sample size and vocabulary and hapax counts
N(DickensOliverTwist.spc)
V(DickensOliverTwist.spc)
Vm(DickensOliverTwist.spc,1)
N(DickensGreatExpectations.spc)
V(DickensGreatExpectations.spc)
Vm(DickensGreatExpectations.spc,1)

## compute binomially interpolated growth curves
ot.vgc <- vgc.interp(DickensOliverTwist.spc,(1:100)*1570)
ge.vgc <- vgc.interp(DickensGreatExpectations.spc,(1:100)*1865)

## plot them
plot(ot.vgc,ge.vgc,legend=c("Oliver Twist","Great Expectations"))

## load Dickens' works frequency spectrum
data(Dickens.spc)

## compute Zipf-Mandelbrot model from Dickens data
## and look at model summary
zm <- lnre("zm",Dickens.spc)
zm

## plot observed and expected spectrum
zm.spc <- lnre.spc(zm,N(Dickens.spc))
plot(Dickens.spc,zm.spc)

## obtain expected V and V1 values at arbitrary sample sizes
EV(zm,1e+8)
EVm(zm,1,1e+8)

## generate expected V and V1 growth curves up to a sample size
## of 10 million tokens and plot them, with vertical line at
## estimation size
ext.vgc <- lnre.vgc(zm,(1:100)*1e+5,m.max=1)
plot(ext.vgc,N0=N(zm),add.m=1)