Practical Earley parsing and the SPARK toolkit

Date

2018-05-23

Authors

Aycock, John Daniel

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Domain-specific, “little” languages are commonplace in computing. So too is the need to implement such languages; to meet this need, we have created SPARK (Scanning, Parsing, And Rewriting Kit), a toolkit for little language implementation in Python, an object-oriented scripting language. SPARK greatly simplifies the task of little language implementation. It requires little code to be written, and accommodates a wide range of users—even those without a background in compiler theory. Our toolkit is seeing increasing use on a variety of diverse projects. SPARK was designed to be easy-to-use with few limitations, and relies heavily on Earley's general parsing algorithm internally, which helps in meeting these design goals. Earley's algorithm, in its standard form, can be hard to use; indeed, experience with SPARK has highlighted several problems with the practical use of Earley's algorithm. Our research addresses and provides solutions for these problems, making some significant improvements to the implementation and use of Earley's algorithm. First, Earley's algorithm suffers from the performance problem . Even under optimum conditions, a standard Earley parser is burdened with overhead. We extend directly-executable parsing techniques for use in Earley parsers, the results of which run in time comparable to the much-more-specialized LALR(1) parsing algorithm. Second is what we call the delayed action problem. General parsers like Earley must, in the worst case, read the entire input before executing any semantic actions associated with the grammar rules. We attack this problem in two ways. We have identified conditions under which it is safe to execute semantic actions on the fly during recognition; as a side effect, this has yielded space savings of over 90% for some grammars. The other approach to the delayed action problem deals with the difficulty of handling context-dependent tokens. Such tokens are easy to handle using what we call “Schrödinger's tokens,” a superposition of token types. Finally, Earley parsers are complicated by the need to process grammar rules with empty right-hand sides. We present a simple, efficient way to handle these empty rules, and prove that our new method is correct. We also show how our method may be used to create a new type of LR(0) automaton which is ideally suited for use in Earley parsers. Our work has made Earley parsing faster and more space-efficient, turning it into an excellent candidate for practical use in many applications.

Description

Keywords

Parsing (Computer grammar), SPARK (Computer program language), Programming languages (Electronic computers)

Citation