The two top answers are in conflict. The second answer with 43 points is closer ...

WalterBright · on May 2, 2020

The implementation of D has a lexer that is independent of the parser, and a parser that is independent of the rest of the implementation. I have resisted enhancement proposals that would put holes in those walls.

billconan · on May 2, 2020

could you give an example for

> Any programming language that is massively adopted is context free?

I want to have an intuition on what is context free and what is not.

Is D context free?

I mean I want to make a useful language, is being context free practical?

WalterBright · on May 2, 2020

D is context free in that the parser does not require semantic analysis in order to complete its parse.

cxr · on May 3, 2020

The best way would probably be to work on a language that isn't context-free. When you run into the problems context-sensitivity introduces, then you'll really start to understand it.

In the past, Pascal was widely used in teaching contexts, for both learning to program and later how to implement a compiler. I've seen Walter say in the past that he got into doing languages from a series of articles in Byte magazine that included the listings for a small Pascal compiler. For your task, you can use Oberon instead, which is a smaller and simpler than Pascal. Oberon-07 is context free up to a point, which is another way of saying that it's not context-free. It'll work well for this exercise. Don't worry about writing a full-fledged compiler, because you'd end up spending forever on the code generator backend. Instead implement half of a compiler.

Here's one project that fits the description of being one half of a compiler—a documentation generator:

https://github.com/schierlm/OberonXref/

Implementing this requires doing the lexer and the parser, but its "backend" streams out documentation pages instead of object files or assembler. The truth is, you wouldn't even need to go that far. Just do the lexer and then start implementing the parser—let's say that your program should just read in source code and then say whether it's valid or not—whether it will parse. Since Oberon is mostly context-free and the natural way to do things is to use a recursive descent parser, you'll get into a rhythm implementing things, until you run into the part that's context-sensitive, at which point things will get uncomfortable. You will find true enlightenment then. This can be done over a weekend, but if you reach the end of the first weekend and realize it will take you two, don't sweat it.

> I mean I want to make a useful language, is being context free practical?

It is. If you do the exercise above, it should be straightforward to figure out several different ways to change the grammar to eliminate context-sensitivity in your parser. You'll realize that the transformations would be minor. They won't compromise the power or practicality of the language, and it's possible to end up with one that is almost exactly the same. The biggest casualty in this case would be reworking a dubious design decision made for aesthetic reasons.

billconan · on May 2, 2020

Thank you very much for the references. They are very good reads!

In the article about Python, the author said:

> Most languages of interest are context sensitive, and yet we are trying to use inappropriate tools (context-free grammars and context-free parser generators) to specify and parse them. This leads to bad specifications and bad implementations.

Then what do you think is a better tool to specify and parse languages?

chubot · on May 3, 2020

what do you think is a better tool to specify and parse languages?

That's unfortunately an open problem. As I mentioned here, lexing is "solved" but parsing isn't:

https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-...

Here are what some recent languages do:

Go: Maintain a written spec [1], with multiple hand-written parsers. (I think the main one is in Go, while gccgo has one in C++?) I believe Go used to be parsed with yacc, but they moved to a hand-written parser for error messages. It looks like [2] was an early attempt to keep it generated.

[1] https://golang.org/ref/spec

[2] https://research.swtch.com/yyerror

-----

Rust: The parser is hand-written, like Go. There's a grammar in the spec, but like Go's, it's not used to generate code, so it isn't tested. They also have an LALR(1) grammar which was suppoed to be an executable specification, but as far as I can tell that effort has not made progress recently.

It used to be done with GNU yacc but it looks like they moved to a Rust-based tool [3]

[1] https://doc.rust-lang.org/grammar.html

[2] https://github.com/rust-lang/lang-team/tree/master/working-g...

[3] https://github.com/rust-lang/wg-grammar/tree/master/grammar

-----

So basically the state of the art is fairly laborious. It's basically write down the grammar, implement it by hand, and try not to make any mistakes. If you need to implement it again, which is extremely common (e.g. for gccgo, for Rust's language server), also try not to make any mistakes.

When you need to change the language, update the spec, and change all the parsers by hand.

I've written multiple recursive-descent parsers based on grammars, and it's not hard if you have examples to test with, but mistakes can easily creep in.

Rely on users to report bugs. There is a "long tail" of programs that will tickle various corner cases.

-----

Here's the approach I took with Oil:

How to Parse Shell Like a Programming Language http://www.oilshell.org/blog/2019/02/07.html

which is basically to do as much work as possible in the lexer, use a high-level programming language translated to C++ for the parser, and to use a DSL for specifying the AST. This approach keeps the mistakes and tedium down because shell is an extremely large language syntactically. (It could be the biggest of any language except for perhaps Perl. Shell is relatively small semantically, but huge syntactically (which is why I'm preoccupied with parsing :) ).

matheusmoreira · on May 2, 2020

> A main point here is that you have to consider the lexer and parser separately. Grammars don't address this distinction, which arises in essentially all programming languages.

So why does this happen? Couldn't context-free language parsers emulate a lexer by treating individual characters as symbols?

chubot · on May 2, 2020

If there's no feedback, you can consider the lexer and parser separately as languages.

- The lexer recognizes a set of strings of characters.

- The parser recognizes a set of strings of tokens (as returned by the lexer)

But those two languages will have different power in general, so, like Jim points out, when you say something like "Python is context-free" or "Haskell context-free", it's not clear what you're talking about. And it's really false under any reasonable interpretation.

So you can consider them separately, and make a precise statement. But if there's feedback between the two, then you can't do that anymore. The theory doesn't tell you any properties a that the (lexer + parser + feedback mechanism) possess.

That is, regular languages and context-free languages have all sorts of properties proven about them, including ones that let you write code generators. You don't get any of those properties when you have ad hoc feedback mechanism between the lexer and parser.

----

Someone could come up with formalisms for specific types of feedback.

I think OCaml's Menhir has done some of this, but I don't remember the details off hand.

They have written parsers for C (CompCert) and POSIX shell and addressed some of the gaps between theory and practice. I don't use it but they're at least tackling the right problems.

But note that C uses a specific type of feedback (the lexer hack), which isn't identical to what other languages use. So you would have to come up with a formalism for each one, and probably nobody has done that.

zzo38computer · on May 2, 2020

Is Haskell context-free if you don't use the indentation layout mode? Haskell does support using braces and semicolons too. However, this is not true of Python (as far as I know).

massysett · on May 2, 2020

In theory Haskell supports braces and semicolons but in practice nobody uses them, leading to bugs like this one:

https://github.com/haskell/haddock/issues/12

zzo38computer · on May 3, 2020

I do use braces and semicolons when programming in Haskell, and I have encountered that bug before, and I hope that they will fix it (although it still seems to be unfixed after six years, but a few people clearly do care, even if others don't care).

tanilama · on May 2, 2020

what is context free language anyway? context free grammar had a clear definition in parsing, not sure i understand how that can be extended to languages.

chubot · on May 3, 2020

The OP is abusing the term "context free". He's saying it avoids "the lexer hack" [1]:

Context free grammars. What this really means is the code should be parseable without having to look things up in a symbol table

That's NOT what context free means. That's a narrow view from someone designing a C-like language and trying to avoid a very specific property of C.

Another example of being context-sensitive, which has nothing to do with symbol tables, is that resolving LALR(1) conflicts in a yacc-style grammar can make the language not context-free. To resolve ambiguity the parser is doing stuff "outside" the grammar.

----

"Context free" is a mathematical term with a precise definition. It's a class of languages that's part of the Chomsky hierarchy [2], which was discovered and described by linguists and mathematicians before any programming language existed, and has applications outside programming languages. Wikipedia does a good job:

https://en.wikipedia.org/wiki/Context-free_grammar

A formal grammar is considered "context free" when its production rules can be applied regardless of the context of a nonterminal. No matter which symbols surround it, the single nonterminal on the left hand side can always be replaced by the right hand side. This is what distinguishes it from a context-sensitive grammar.

Simple examples of languages that's are context-sensitive (not context-free): Lua string literals, Rust raw strings (see other comments in this thread), and shell here docs:

http://lua-users.org/wiki/StringsTutorial

Yes:

    [[ mystring ]]
    [=[ mystring ]=]
    [==[ mystring ]==]

No:

    [=[ mystring ]]    # mismatched

Matching this language requires a context-sensitive grammar and can't be done with a context-free grammar. It's not all that easy to prove: See the posts from Trevor Jim regarding proofs.

The C language is also not context-free.

[1] http://www.oilshell.org/blog/2017/12/15.html#appendix-lexing...

[2] https://en.wikipedia.org/wiki/Chomsky_hierarchy

bmn__ · on May 3, 2020

> [CF] was discovered and described by linguists and mathematicians before any programming language existed

Nitpick: that was 1956, by then a few PL did already exist.

wtetzner · on May 3, 2020

I read it to mean that the language's grammar is context free, but I guess I'm not sure if that's what was meant.