Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm

dc.contributor.author: Barrett, Neil
dc.contributor.author: Weber-Jahnke, Jens
dc.date.accessioned: 2013-10-28T17:00:51Z
dc.date.available: 2013-10-28T17:00:51Z
dc.date.copyright: 2011
dc.date.issued: 2011-06-09
dc.description: BioMed Central
dc.description.abstract: Background: Tokenization is an important component of language processing, yet there is no widely accepted tokenization method for English texts, including biomedical texts. Other than rule-based techniques, tokenization in the biomedical domain has been regarded as a classification task. Biomedical classifier-based tokenizers either split or join textual objects through classification to form tokens. The idiosyncratic nature of each biomedical tokenizer’s output complicates adoption and reuse. Furthermore, biomedical tokenizers generally lack guidance on how to apply an existing tokenizer to a new domain (subdomain). We identify and complete a novel tokenizer design pattern and suggest a systematic approach to tokenizer creation. We implement a tokenizer based on our design pattern that combines regular expressions and machine learning. Our machine learning approach differs from the previous split-join classification approaches. We evaluate our approach against three other tokenizers on the task of tokenizing biomedical text. Results: Medpost and our adapted Viterbi tokenizer performed best, with accuracies of 92.9% and 92.4%, respectively. Conclusions: Our evaluation of our design pattern and guidelines supports our claim that the design pattern and guidelines are a viable approach to tokenizer construction (producing tokenizers matching leading custom-built tokenizers in a particular domain). Our evaluation also demonstrates that ambiguous tokenizations can be disambiguated through POS tagging. In doing so, POS tag sequences and training data have a significant impact on proper text tokenization.
dc.description.reviewstatus: Reviewed
dc.description.scholarlevel: Faculty
dc.identifier.citation: Barrett and Weber-Jahnke: Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm. BMC Bioinformatics 2011 12(Suppl 3):S1.
dc.identifier.uri: http://www.biomedcentral.com/1471-2105/12/S3/S1
dc.identifier.uri: http://dx.doi.org/10.1186/1471-2105-12-S3-S1
dc.identifier.uri: http://hdl.handle.net/1828/4993
dc.language.iso: en
dc.publisher: BioMed Central
dc.subject.department: Department of Computer Science
dc.title: Building a biomedical tokenizer using the token lattice design pattern and the adapted Viterbi algorithm
dc.type: Article

Files

Original bundle
Name: Barret_Neil_BMCBioinformatics_2011.pdf
Size: 349.34 KB
Format: Adobe Portable Document Format
Description: Barret_Neil_BMCBioinformatics_2011.pdf

License bundle
Name: license.txt
Size: 1.74 KB
Format: Item-specific license agreed upon to submission