
An interesting article about parsing techniques that were developed half a century ago, when there were still severe memory limitations. It would be nice if it at least mentioned that these techniques were developed because back-tracking was not considered a valid option for parsing: it was simply not feasible to store a whole file in memory. Nowadays the whole source code of the Linux kernel easily fits in RAM.

For many applications, using a back-tracking parser with some memoization/caching is feasible. IParse Studio [1] is an example of an interpreting parser that parses a grammar and an input on each change to the input, and it demonstrates that this works for rather complex grammars [2] on realistic input [3].

[1] https://fransfaase.github.io/MCH2022ParserWorkshop/IParseStu... [2] https://fransfaase.github.io/MCH2022ParserWorkshop/C_grammar... [3] https://fransfaase.github.io/MCH2022ParserWorkshop/scan_pc.t...
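To make the memoization/caching idea concrete, here is a minimal sketch of a back-tracking recursive-descent parser with a memo table (packrat style) for a toy expression grammar. The grammar and all names are hypothetical; this is not the IParse Studio implementation, just the general technique.

  // Minimal sketch of a backtracking parser with memoization (packrat style).
  // Toy grammar: Expr -> Term ('+' Term)* ; Term -> number | '(' Expr ')'
  type Result = { ok: true; value: number; next: number } | { ok: false };

  class Parser {
    private memo = new Map<string, Result>();
    constructor(private input: string) {}

    // Memoize each (rule, position) pair so repeated backtracking stays linear.
    private cached(rule: string, pos: number, parse: () => Result): Result {
      const key = `${rule}@${pos}`;
      const hit = this.memo.get(key);
      if (hit) return hit;
      const res = parse();
      this.memo.set(key, res);
      return res;
    }

    expr(pos: number): Result {
      return this.cached("expr", pos, () => {
        let left = this.term(pos);
        if (!left.ok) return { ok: false };
        while (left.ok && this.input[left.next] === "+") {
          const right = this.term(left.next + 1);
          if (!right.ok) break;              // backtrack: keep what parsed so far
          left = { ok: true, value: left.value + right.value, next: right.next };
        }
        return left;
      });
    }

    term(pos: number): Result {
      return this.cached("term", pos, () => {
        if (this.input[pos] === "(") {
          const inner = this.expr(pos + 1);
          if (inner.ok && this.input[inner.next] === ")") {
            return { ok: true, value: inner.value, next: inner.next + 1 };
          }
          return { ok: false };
        }
        const m = /^\d+/.exec(this.input.slice(pos));
        return m ? { ok: true, value: Number(m[0]), next: pos + m[0].length } : { ok: false };
      });
    }
  }

  // Usage: the whole input string is held in memory, as the comment above assumes.
  const r = new Parser("(1+2)+3").expr(0);
  console.log(r.ok ? r.value : "parse error");   // -> 6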



Here is another example of an LR parser that reads the grammar on the fly and immediately parses the input (which, among other things, also defines its own grammar) with it: https://marketplace.visualstudio.com/items?itemName=Practal....

But it seems the LR approach is just not flexible enough for my needs, so I am experimenting with replacing it with parser combinators + Earley parsing. The LR approach also has the disadvantage of having to precompute tables, which is fine for a single file; but if you want to deal with modules, I am not sure how to compute these tables incrementally, which is kind of a prerequisite for a good user experience.
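For readers unfamiliar with the combinator half of that idea, here is a minimal sketch (all names hypothetical, not the extension's actual code). The point relevant to the comment above is that combinators are ordinary values, so per-module grammar fragments can be composed at runtime without a precomputed table to invalidate.

  // Minimal parser-combinator sketch.
  type P<T> = (input: string, pos: number) => { value: T; next: number } | null;

  // Match an exact string.
  const lit = (s: string): P<string> => (input, pos) =>
    input.startsWith(s, pos) ? { value: s, next: pos + s.length } : null;

  // Match a regular expression anchored at the current position.
  const regex = (re: RegExp): P<string> => (input, pos) => {
    const m = re.exec(input.slice(pos));
    return m && m.index === 0 ? { value: m[0], next: pos + m[0].length } : null;
  };

  // Sequence two parsers; fail if either fails.
  const seq = <A, B>(a: P<A>, b: P<B>): P<[A, B]> => (input, pos) => {
    const ra = a(input, pos);
    if (!ra) return null;
    const rb = b(input, ra.next);
    return rb ? { value: [ra.value, rb.value], next: rb.next } : null;
  };

  // Ordered choice: try the first alternative, fall back to the second.
  const alt = <T>(a: P<T>, b: P<T>): P<T> => (input, pos) => a(input, pos) ?? b(input, pos);

  // Map the parsed value.
  const map = <A, B>(p: P<A>, f: (a: A) => B): P<B> => (input, pos) => {
    const r = p(input, pos);
    return r ? { value: f(r.value), next: r.next } : null;
  };

  // Grammars are built by composing values, no table generation step.
  const number = map(regex(/\d+/), Number);
  const sum = map(seq(number, seq(lit("+"), number)), ([a, [, b]]) => a + b);
  console.log(sum("12+30", 0));   // -> { value: 42, next: 5 }

Plain combinators like these handle left recursion and ambiguity poorly, which is presumably why the comment pairs them with Earley parsing for the general case.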


"I suspect that parsing performance worries date back to the period when parsing techniques were under heavy development. LR parsing was invented in 1965, a time when computers were painfully slow [19] and resource poor. "

"When combined with a couple of other techniques to squeeze the statetable’s memory footprint [22], even the most puny modern machine can run an arbitrary LR parser at impressive speeds."


> works (...) on realistic input

However, sometimes the input is generated by code; think of, e.g., huge switch statements. It is certainly not nice if a parser breaks down on that.


The source file isn't the main memory concern; it's the memoization.


There is no need for full memoization. The above JavaScript implementation caches, for each non-terminal, the last location at which a parsing attempt was made and the outcome of that attempt: either false, or true together with an abstract syntax tree and the next location after the part of the input that was parsed.
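As a rough sketch of that scheme (not the actual IParse code, names hypothetical): the cache keeps a single slot per non-terminal, so memory grows with the number of non-terminals rather than with non-terminals times input positions, as a full memo table would.

  // Single-entry cache per non-terminal: only the most recent attempt is
  // remembered, instead of a full memo table over every input position.
  interface Ast { rule: string; children: (Ast | string)[] }
  type Outcome = { ok: false } | { ok: true; ast: Ast; next: number };

  class RuleCache {
    // One slot per non-terminal: the last position tried and its outcome.
    private last = new Map<string, { pos: number; outcome: Outcome }>();

    attempt(rule: string, pos: number, parse: () => Outcome): Outcome {
      const hit = this.last.get(rule);
      if (hit && hit.pos === pos) return hit.outcome;   // re-use last attempt
      const outcome = parse();                          // otherwise (re)parse
      this.last.set(rule, { pos, outcome });            // overwrite the slot
      return outcome;
    }
  }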



