Formal grammar

I ran across the Wikipedia article on this subject today. While I found it fascinating and enthralling, it also took me about half an hour to properly decode the notation being used, so here I have interpreted it for the layman.

What is a formal grammar?

Symbols

Begin with symbols. A symbol can be anything. One common example is an upper- or lower-case letter of the Roman alphabet.

"a"
"b"
"W"
etc.

If we allow other special characters to be symbols, then we could also have:

"_"
"'"
";"
etc.

But our symbols could also be English words. If this were the case, then our symbols would be things like:

"spectacular"
"adventure"
"heroes"

A symbol in isolation does not need to mean anything.

Alphabets

A set of symbols is called an alphabet. An alphabet must be finite.

Strings

A string is a finite series of symbols. All the symbols must be chosen from the same alphabet. If we use Roman letters as our alphabet, then some example strings would be:

"cat"
"misanthropy"
"sfdggkklsl"

If our alphabet contained special characters, then some more example strings might be:

"__^(sdfsd_+_$%"
"dd'dddddd"
";kfllfddf7777lskl¬¬"

If, instead, our alphabet consisted of English words, then a string (or sentence) would simply be a series of words, such as

"bucket July embarrass whatsoever isn't a the"

Although every string must be finite, notice how there are infinitely many possible finite strings.

"bucket"
"bucket bucket"
"bucket bucket bucket"
"bucket bucket bucket bucket"
etc.

For every alphabet, there is also a single empty string:

""

Notice, again, how strings have no meaning ascribed to them.

Languages

Given an alphabet, a formal language L is any set of finite strings over that alphabet.

  • A formal language may contain a finite number of strings.

    • The smallest possible language is the empty language, containing no strings at all (this is not to be confused with the language containing only the empty string).

  • Or, a language may contain infinitely many strings.

    • The largest possible language is the one containing every possible finite string.

Strings which are in L can be considered "correct". Strings which are not in L can be considered "incorrect". For example, if our alphabet was the set of all English words, then our language could be the set of all grammatically correct English sentences:

"The cat sat on the mat"
"No"
"Colourless green ideas sleep furiously"
etc.

For a language L to be well-defined, we only need one thing: a process which generates all of the strings in L.

Grammars

A formal grammar is a finite set of rules for generating "grammatically correct" sentences.

A formal grammar works by always starting with the same single "unfinished sentence" or "root". Then, different rules are applied in turn, modifying the sentence each time, until the sentence is deemed to be "finished" and the process terminates.

(By definition,) the set of sentences which can be generated by following the rules of a formal grammar comprise a formal language. The sentences which cannot be generated in this way do not fall in the formal language.

Once again, note that a grammar ascribes no meaning to the language it generates.

Formal definition

A formal grammar has four components.

  • First we have a finite set Σ of terminal symbols. This is our alphabet. The final sentence must consist only of terminal symbols. For our example we'll take:

    Σ = { "a", "the", "dog", "cat", "barked", "napped"}

  • Next we have a finite set N of nonterminal symbols. Non-terminal symbols are special. I think of them as wildcards. If a sentence still contains non-terminal symbols then it is not finished yet. To help keep this intuitive, my non-terminal symbols will all have <angle brackets>.

    N = { "<sentence>", "<noun>", "<verb>", "<article>" }

    Note that there can be no symbols shared between Σ and N.

  • One of the non-terminal symbols, S, must be nominated as a start symbol from which all sentences must be built.

    S = "<sentence>"

  • Finally, we have a finite set P of production rules, which is things we are allowed to do to construct sentences.

    Every rule must contain at least one non-terminating symbol on the left.

    P = {

    1. "<sentence>" → "<article><noun><verb>"
    2. "<article>" → "a"
    3. "<article>" → "the"
    4. "<noun>" → "dog"
    5. "<noun>" → "cat"
    6. "<verb>" → "barked"
    7. "<verb>" → "napped"

    }

A grammar G is thus expressed as

G = { N, Σ, P, S }

How do we apply the rules?

Obviously, we can theoretically concatenate any combination of symbols, to make example strings like

"<verb> <verb> <verb> a dog a dog <sentence>"
"cat napped a"
"a a a a a a a a a a a <noun>"

However, starting from S and following the rules P, only some of these can be constructed.

  • A sentence y can be derived in one step from another sentence x if there is a rule which can be applied to connect the two. For example:

    x = "cat cat cat <noun> <verb> the"

    Using rule 4, "<noun>" → "dog", we could generate:

    y = "cat cat cat dog <verb> the"

    Or, using rule 7, "<verb>" → "napped", we could generate:

    y = "cat cat cat <noun> napped the"

  • A sentential form is any sentence which can be derived in a finite number of steps, starting from S. Examples are:

    "<sentence>" (0 steps)

    "<article> <noun> <verb>" (1 step)

    "<article> <noun> barked" (2 steps)

    "a <noun> barked" (3 steps)

    "a cat barked" (4 steps)

  • A finished sentential form is any sentential form which contains no non-terminating symbols. Only the last of the previous five examples, "a cat barked", counts. All the others have "wildcards" in.

  • Finally, the formal language L(G) is the set of all finished sentences:

    L(G) = {

    "a dog barked"
    "a dog napped"
    "a cat barked"
    "a cat napped"
    "the dog barked"
    "the dog napped"
    "the cat barked"
    "the cat napped"

    }

A more complicated example

These rules are a little more complex.

G = {

N = { S, B }

Σ = { a, b, c }

S = S, obviously

P = {

  1. SaBSc
  2. Sabc
  3. BaaB
  4. Bbbb

}

}

Here are some productions which follow this grammar. Check to see which rule is being followed for each step.

S
abc

S
aBSc
aBabcc
aaBbcc
aabbcc

S
aBSc
aBaBScc
aBaBabccc
aBaaBbccc
aBaabbccc
aaBabbccc
aaaBbbccc
aaabbbccc

The finished sentences are the last ones in each block: abc, aabbcc and aaabbbccc. All the others contain Ss or Bs. So, notice how the language described by this example grammar is:

L(G) = {

abc
aabbcc
aaabbbccc
...

}

Note that while N, Σ and P must always be finite (though they may be very large), L is infinite in this case.

So what?

The concept of a formal grammar is not a mathematical curio but of critical importance in the modern world.

Linguistics

Using the concept of a formal grammar to try to approach actual human languages like English and Chinese is possible, though extremely difficult and of dubious usefulness.

Mathematics

Formal grammars are used in mathematics to define what does or does not constitute a syntactically correct mathematical statement or equation. We can create a language of mathematics.

Bad:

488023a ++ 5353 ==== / 5 / 5 /
π * 2.3.34534532.!
993333339X999999

Good:

a2 + b2 = c2 + 17.3
X ⇒ ( ( XY ) ⇒ Y)
2 + 2 = 87

Note again that a grammar ascribes no formal meaning to the language it describes. The fact that the sentence is syntactically correct does not imply that it is factually correct. It doesn't even imply that the finished sentence is a statement of fact at all! Grammatical correctness draws a line between gibberish and nonsense, but not between nonsense and sense.

Very simple formal grammars are used as the basis for the statements of propositional logic, which itself is the basis for much more valuable mathematics.

Computing

Formal grammars are also used to define syntactical correctness in machine languages: that is, programming languages (such as C, Java, Python, and almost any other language you can think of), markup languages (most notably HTML, which is used to describe web pages), data languages (XML, JSON and CSV) and querying languages (SQL and its many brethren).

If you know HTML, a great example can be found by checking out the Document Type Definition of the web page that you're currently looking at: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. A DTD defines what constitutes valid HTML in the document to follow, and it explains which attributes each HTML element may have, and which elements may nest inside that element. Although the syntax is a little different from the one above, you can probably work out how to read the DTD and spot the wildcards. Notice how the syntax of the DTD is itself very strict and rigorous; it, too, is subject to the constraints of a formal grammar.

Validation and parsing

Validation is the process of determining whether a sentence is "correct"; that is, whether it is contained within the formal language generated by a grammar.

Parsing is the process of determining the grammatical structure of a sentence with respect to a formal grammar. Specifically, it means determining the sequence of production steps by which the sentence was constructed.

The only way to validate a sentence is to demonstrate that it can be successfully parsed. Likewise, an invalid sentence, by definition, cannot be parsed. Therefore, in the general case, parsing and validation are almost exactly the same thing. Each entails and implies the other. The only distinction is that parsing involves collecting more information.

For an arbitrary sentence and an arbitrary formal grammar, parsing is very difficult. It requires a fully-functional Turing machine and can be extremely computationally intensive. So, how can we make our machine languages more practical to parse?

We do this by putting restrictions on the types of rules and symbols we use in our formal grammar. By adding further restrictions we can classify all of the formal grammars into a hierarchy. Grammars higher in the hierarchy are very powerful, but also very difficult to validate. Grammars lower in the hierarchy are more constrained and have less expressive power, but they are also easier to validate.

Context-free grammars

The most important of the various types of restricted grammar is a context-free grammar. This is about halfway up the hierarchy, striking a good balance between simplicity of validation and expressive power.

In a context-free grammar, all of the production rules must start with a single non-terminating character. The rule

"<verb>" → "barked"

is non-contextual.

The rules

"<noun> <verb>" → "dog barked"

"<noun> <noun>" → "dog"

"dog <verb>" → "dog barked"

are contextual. They provide restrictions on when the sentence may be transformed - in the last example, we are dependent on the existence of the word "dog" immediately before the "<verb>", to provide context. If we had the sentence "cat <verb>" then this rule could not be used, because of the incorrect context.

In a context-free grammar, these rules would not be permitted. The position and context of the single non-terminating character would not affect the operations which could be performed upon it.

(Note how the first example grammar in this writeup is context-free as originally presented, while the second is not.)

Sentences in context-free languages can easily be broken down into their original components and production steps, and the automaton required need not be as complex as a full Turing machine - you can get away with a simpler machine called a pushdown automaton. Validating sentences in context-free grammars is computationally simpler, more mathematically tractable and has in fact been the subject of a great deal of valuable study. Taking an existing contextual grammar and formulating it to become context-free (by changing its rules, without altering the resulting formal language) is a very useful endeavour.

Next: parser combinators

Back to Blog
Back to Things Of Interest

Facebook Twitter Reddit Email Hacker News StumbleUpon

Discussion (12)

2010-01-13 01:41:09 by Chris:

Interesting, and like a lot of things in maths, too elaborate for it's own good. I've just finished a course on Formal Languages and Automata, no grammar stuff though, but I was considering taking the course further for my 4th year project, which would almost certainly involve this kinda stuff.

2010-01-13 01:45:10 by Supergrunch:

Nice writeup - I should have guessed you'd like early Chomskyan stuff. Of course as you say, you can't carry out effective formal analyses of natural language(s) with phrase structure grammars (as Chomsky demonstrates in Syntactic Structures), but there was a bit of a hunt for a while to see whether any natural languages aren't context free. And they aren't, although as Geoffrey Pullum explains in his article "Footlose and context-free," (http://www.jstor.org/stable/4047619) it's all a bit vague as to when exactly this fact was discovered.

2010-01-13 03:32:38 by TSC:

So, er, you HAVE read Gödel, Escher, Bach, right?

2010-01-14 04:10:11 by dankuck:

Thank you for explaining the term context-free grammar! My textbook made no effort to explain what a context is.

2010-01-15 05:11:47 by Josh:

You nailed it, good explanation of formal languages and grammars. Also, context-free grammars pretty much equal push-down automaton. For any context-free grammar, there exists a push-down automaton that can generate the same language as the CFG.

Did you check out Finite Automaton and Regular Expressions? Kind of a step down from CFG and PDA.

2010-01-18 16:33:03 by Dentin:

"Unfortunately, for an arbitrary sentence and an arbitrary formal grammar, this process is very difficult, requires a fully-functional Turing machine and can be extremely computationally intensive."

This phrase immediately brought to my newbie mind encryption, though I don't know enough about grammars to know if it could really be used for that purpose. There would have to be a way to do some kind of asymmetric hard/easy verification of sentences to do public key encryption, and there's always the possibility that sentences would have to be ridiculously large to be sufficiently resistant to attack.

2010-01-23 00:59:56 by Drake:

Dentin, fun thought, but not for CFGs. There are efficient (polynomial-time) parsing algorithms floating around (CKY, Earley's, etc) for CFGs, however, recursively enumerable languages (which allow any kind of production rules) could have some unpleasant "parse" times - however they would all be decidable, as that is guaranteed by the class RE.

However (maybe a complexity person can correct me), I think this is isomorphic to the current approach of recovering prime factors.

2011-06-12 22:51:15 by sean:

Parsing is deriving the syntactic structure from a sentence. No meaning/semantics involved here. Have a look at the Wikipedia page you are linking to...

2011-10-05 15:37:57 by Kobi:

"For every alphabet, there is also a single empty string ..."

That is not correct. For instance: the alphabet for the language defined here:

A-> ab

is composed of the letters 'a' and 'b' only.

2011-10-05 15:45:07 by Sam:

An alphabet can only contain letters/characters. It can't contain an "empty character" and it can't contain strings, certainly not the empty string.

The set of finite strings over an alphabet is different from the alphabet itself. The empty string is always a member of this set, regardless of what the alphabet contains.

2012-01-25 18:17:55 by Lee:

Formal languages and grammars use set theory to establish the tractability of creating automatons for validating sentences using a set of rules. By definition of a set, all sets contain the empty set. Therefore all sets used to define the set of well-formed sentences contain the empty set (i.e. "").

2012-01-25 18:32:49 by Sam:

Not all sets contain the empty set. For example, the empty set does not. It's empty.