Information Extraction with Hidden Markov Models (Abstract)

Eom Jae-Hong
Seoul National University, KOREA

Recently, due to the popularity of computers and the Internet, the amount of
information provided to users has increased exponentially. This implies that
information technology needs to go beyond simple information retrieval. In other
words, information technology must be able to support more advanced information
processing techniques such as information extraction and automatic document
summarization. A hidden Markov model (HMM) is a kind of probabilistic automaton
whose internal state transitions are governed by probabilities. HMMs are widely
used in applications involving temporal data with sequential characteristics,
such as speech data. Here, we present a new, effective method for building the
HMM structure for information extraction tasks, for which we use a modified HMM
structure. Traditional HMMs use a pre-constructed, static model structure and
train the model parameters only after the structure has been built. We present
here a new HMM called S-HMM (Self-Organizing Hidden Markov Model) that
constructs its structure using rules obtained from the training dataset.
We used CFP (Call For Papers or Participation) documents of computer science and
biology conferences as a dataset. The proposed HMM learns to distinguish the
fields, and then extracts conference names, dates and locations, conference
URLs, deadlines for paper submission and contact information (phone, fax,
e-mail) from the CFP text data. We also tested our S-HMM on CMU online seminar
announcement data and LA restaurant review and recommendation data. We
construct the model structure using S-HMM, starting from an initial abstract
structure and refining it into a more detailed one with the set of rules learned
from the training data. With this set of rules, we could find a more appropriate
structure. The experimental
results show improved average extraction accuracy of about 12% increase in
average extraction speed in comparison with a fixed-state HMM.
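
As a rough illustration of how an HMM can label tokens with fields for
extraction, the following minimal sketch runs Viterbi decoding over a small
fixed-structure model. It does not implement S-HMM or its rule-based structure
construction; the state names, vocabulary, probabilities, and the helper names
viterbi and extract are all hypothetical and chosen only for this example.

```python
import math

# Hypothetical toy model: states, tokens and probabilities are illustrative
# assumptions, not parameters or results from the paper.
STATES = ["background", "date", "location"]
START = {"background": 0.8, "date": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.6, "date": 0.2, "location": 0.2},
    "date": {"background": 0.5, "date": 0.4, "location": 0.1},
    "location": {"background": 0.5, "date": 0.1, "location": 0.4},
}
EMIT = {
    "background": {"held": 0.4, "in": 0.4, "on": 0.2},
    "date": {"may": 0.5, "2003": 0.5},
    "location": {"seoul": 0.6, "korea": 0.4},
}
UNSEEN = 1e-6  # small floor probability for tokens a state never emits


def viterbi(tokens):
    """Return the most probable state sequence for tokens (log-space)."""
    def emit(state, tok):
        return math.log(EMIT[state].get(tok, UNSEEN))

    # best[s] = (log-probability, path) of the best path ending in state s
    best = {s: (math.log(START[s]) + emit(s, tokens[0]), [s]) for s in STATES}
    for tok in tokens[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor state that maximizes the path score into s.
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES
            )
            new_best[s] = (score + emit(s, tok), path + [s])
        best = new_best
    return max(best.values())[1]


def extract(tokens):
    """Group tokens by their decoded field label, skipping background."""
    fields = {}
    for tok, label in zip(tokens, viterbi(tokens)):
        if label != "background":
            fields.setdefault(label, []).append(tok)
    return fields


if __name__ == "__main__":
    print(extract(["held", "in", "seoul", "korea", "on", "may", "2003"]))
```

With these toy parameters, the tokens "seoul" and "korea" are decoded as
location and "may" and "2003" as date. In a fixed-state HMM like this one, the
state set is chosen by hand before training, which is the design choice that a
self-organizing structure is meant to relax.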