Formal grammar

I ran across the Wikipedia article on this subject today. While I found it fascinating and enthralling, it also took me about half an hour to properly decode the notation being used, so here I have interpreted it for the layman.

What is a formal grammar?

Symbols

Begin with symbols. A symbol can be anything. One common example is an upper- or lower-case letter of the Roman alphabet.

"a"
"b"
"W"
etc.

If we allow other special characters to be symbols, then we could also have:

"_"
"'"
";"
etc.

But our symbols could also be English words. If this were the case, then our symbols would be things like:

"spectacular"
"adventure"
"heroes"

A symbol in isolation does not need to mean anything.

Alphabets

A set of symbols is called an alphabet. An alphabet must be finite.

Strings

A string is a finite series of symbols. All the symbols must be chosen from the same alphabet. If we use Roman letters as our alphabet, then some example strings would be:

"cat"
"misanthropy"
"sfdggkklsl"

If our alphabet contained special characters, then some more example strings might be:

"__^(\\\sdfsd_+_$%"
"dd'dddddd"
";kfllfddf7777lskl¬¬"

If, instead, our alphabet consisted of English words, then a string (or sentence) would simply be a series of words, such as

"bucket July embarrass whatsoever isn't a the"

Although every string must be finite, notice how there are infinitely many possible finite strings.

"bucket"
"bucket bucket"
"bucket bucket bucket"
"bucket bucket bucket bucket"
etc.

For every alphabet, there is also a single empty string:

""

Notice, again, how strings have no meaning ascribed to them.

Languages

Given an alphabet, a formal language L is any set of finite strings over that alphabet.

Strings which are in L can be considered "correct". Strings which are not in L can be considered "incorrect". For example, if our alphabet was the set of all English words, then our language could be the set of all grammatically correct English sentences:

"The cat sat on the mat"
"No"
"Colourless green ideas sleep furiously"
etc.

For a language L to be well-defined, we only need one thing: a process which generates all of the strings in L.

Grammars

A formal grammar is a finite set of rules for generating "grammatically correct" sentences.

A formal grammar works by starting with a single "unfinished sentence" or "root". Then, different rules are applied in turn, modifying the sentence each time, until the sentence is deemed to be "finished" and the process terminates.

(By definition,) the set of sentences which can be generated by following the rules of a formal grammar comprise a formal language. The sentences which cannot be generated in this way do not fall in the formal language.

Once again, note that a grammar ascribes no meaning to the language it generates.

Formal definition

A formal grammar has four components.

A grammar G is thus expressed as

G = { N, Σ, P, S }

How do we apply the rules?

Obviously, we can theoretically concatenate any combination of symbols, to make example strings like

"<verb> <verb> <verb> a dog a dog <sentence>"
"cat napped a"
"a a a a a a a a a a a <noun>"

However, starting from S and following the rules P, only some of these can be constructed.

A more complicated example

These rules are a little more complex.

G = {

N = { S, B }

Σ = { a, b, c }

S = S, obviously

P = {

  1. SaBSc
  2. Sabc
  3. BaaB
  4. Bbbb

}

}

Here are some productions which follow this grammar. Check to see which rule is being followed for each step.

S
abc

S
aBSc
aBabcc
aaBbcc
aabbcc

S
aBSc
aBaBScc
aBaBabccc
aBaaBbccc
aBaabbccc
aaBabbccc
aaaBbbccc
aaabbbccc

The finished sentences are the last ones in each block: abc, aabbcc and aaabbbccc. All the others contain Ss or Bs. So, notice how the language described by this example grammar is:

L(G) = {

abc
aabbcc
aaabbbccc
...

}

Note that while N, Σ and P are strictly finite (though they may be very large), L is infinite in this case.

So what?

The concept of a formal grammar is not a mathematical curio but of critical importance in the modern world.

Linguistics

Using the concept of a formal grammar to try to approach actual human languages like English and Chinese is possible, though extremely difficult and of dubious usefulness.

Mathematics

Formal grammars are used in mathematics to define what does or does not constitute a syntactically correct mathematical statement or equation. We can create a language of mathematics.

Bad:

488023a ++ 5353 ==== / 5 / 5 /
π * 2.3.34534532.!
993333339X999999

Good:

a2 + b2 = c2 + 17.3
X ⇒ ( ( XY ) ⇒ Y)
2 + 2 = 87

Note again that a grammar ascribes no formal meaning to the language it describes. The fact that the sentence is syntactically correct does not imply that it is factually correct. It doesn't even imply that the finished sentence is a statement of fact at all! Grammatical correctness draws a line between gibberish and nonsense, but not between nonsense and sense.

Very simple formal grammars are used as the basis for the statements of propositional logic, which itself is the basis for much more valuable mathematics.

Computing

Formal grammars are also used to define syntactical correctness in machine languages: that is, programming languages (such as C, Java, Python, and almost any other language you can think of), markup languages (most notably HTML, which is used to describe web pages), data languages (XML, JSON and CSV) and querying languages (SQL and its many brethren).

If you know HTML, a great example can be found by checking out the Document Type Definition of the web page that you're currently looking at: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. A DTD defines what constitutes valid HTML in the document to follow, and it explains which attributes each HTML element may have, and which elements may nest inside that element. Although the syntax is a little different from the one above, you can probably work out how to read the DTD and spot the wildcards. Notice how the syntax of the DTD is itself very strict and rigorous; it, too, is subject to the constraints of a formal grammar.

Validation and parsing

Validation is the process of determining whether a sentence is contained within the formal language generated by a grammar.

Parsing is the process of determining the grammatical structure of a sentence with respect to a formal grammar. Specifically, it means determining the sequence of steps by which the sentence was constructed.

The only way to validate a sentence is to demonstrate that it can be successfully parsed. Likewise, an invalid sentence, by definition, cannot be parsed. Therefore, in the general case, parsing and validation are almost exactly the same thing. Each entails and implies the other. The only distinction is that parsing involves collecting more information.

For an arbitrary sentence and an arbitrary formal grammar, parsing is very difficult. It requires a fully-functional Turing machine and can be extremely computationally intensive. So, how can we make our machine languages more practical to parse?

We do this by putting restrictions on the types of rules and symbols we use in our formal grammar. By adding further restrictions we can classify all of the formal grammars into a hierarchy. Grammars higher in the hierarchy are very powerful, but also very difficult to validate. Grammars lower in the hierarchy are more constrained and have less expressive power, but they are also easier to validate.

Context-free grammars

The most important of the various types of restricted grammar is a context-free grammar. This is about halfway up the hierarchy, striking a good balance between simplicity of validation and expressive power.

In a context-free grammar, all of the production rules must start with a single non-terminating character. The rule

"<verb>" → "barked"

is non-contextual.

The rules

"<noun> <verb>" → "dog barked"

"<noun> <noun>" → "dog"

"dog <verb>" → "dog barked"

are contextual. They provide restrictions on when the sentence may be transformed - in the last example, we are dependent on the existence of the word "dog" immediately before the "<verb>", to provide context. If we had the sentence "cat <verb>" then this rule could not be used, because of the incorrect context.

In a context-free grammar, these rules would not be permitted. The position and context of the single non-terminating character would not affect the operations which could be performed upon it.

(Note how the first example grammar in this writeup is context-free as originally presented, while the second is not.)

Sentences in context-free languages can easily be broken down into their original components and production steps, and the automaton required need not be as complex as a full Turing machine - you can get away with a simpler machine called a pushdown automaton. Validating sentences in context-free grammars is computationally simpler, more mathematically tractable and has in fact been the subject of a great deal of valuable study. Taking an existing contextual grammar and formulating it to become context-free (by changing its rules, without altering the resulting formal language) is a very useful endeavour.

Next: parser combinators

Back to Blog
Back to Things Of Interest
StumbleUpon Twitter Hacker News Facebook Reddit Digg del.icio.us Email

Discussion (12)

2010-01-13 01:41:09 by Chris:

Interesting, and like a lot of things in maths, too elaborate for it's own good. I've just finished a course on Formal Languages and Automata, no grammar stuff though, but I was considering taking the course further for my 4th year project, which would almost certainly involve this kinda stuff.

2010-01-13 01:45:10 by Supergrunch:

Nice writeup - I should have guessed you'd like early Chomskyan stuff. Of course as you say, you can't carry out effective formal analyses of natural language(s) with phrase structure grammars (as Chomsky demonstrates in Syntactic Structures), but there was a bit of a hunt for a while to see whether any natural languages aren't context free. And they aren't, although as Geoffrey Pullum explains in his article "Footlose and context-free," (http://www.jstor.org/stable/4047619) it's all a bit vague as to when exactly this fact was discovered.

2010-01-13 03:32:38 by TSC:

So, er, you HAVE read Gödel, Escher, Bach, right?

2010-01-14 04:10:11 by dankuck:

Thank you for explaining the term context-free grammar! My textbook made no effort to explain what a context is.

2010-01-15 05:11:47 by Josh:

You nailed it, good explanation of formal languages and grammars. Also, context-free grammars pretty much equal push-down automaton. For any context-free grammar, there exists a push-down automaton that can generate the same language as the CFG.

Did you check out Finite Automaton and Regular Expressions? Kind of a step down from CFG and PDA.

2010-01-18 16:33:03 by Dentin:

"Unfortunately, for an arbitrary sentence and an arbitrary formal grammar, this process is very difficult, requires a fully-functional Turing machine and can be extremely computationally intensive."

This phrase immediately brought to my newbie mind encryption, though I don't know enough about grammars to know if it could really be used for that purpose. There would have to be a way to do some kind of asymmetric hard/easy verification of sentences to do public key encryption, and there's always the possibility that sentences would have to be ridiculously large to be sufficiently resistant to attack.

2010-01-23 00:59:56 by Drake:

Dentin, fun thought, but not for CFGs. There are efficient (polynomial-time) parsing algorithms floating around (CKY, Earley's, etc) for CFGs, however, recursively enumerable languages (which allow any kind of production rules) could have some unpleasant "parse" times - however they would all be decidable, as that is guaranteed by the class RE.

However (maybe a complexity person can correct me), I think this is isomorphic to the current approach of recovering prime factors.

2011-06-12 22:51:15 by sean:

Parsing is deriving the syntactic structure from a sentence. No meaning/semantics involved here. Have a look at the Wikipedia page you are linking to...

2011-10-05 15:37:57 by Kobi:

"For every alphabet, there is also a single empty string ..."

That is not correct. For instance: the alphabet for the language defined here:

A-> ab

is composed of the letters 'a' and 'b' only.

2011-10-05 15:45:07 by Sam:

An alphabet can only contain letters/characters. It can't contain an "empty character" and it can't contain strings, certainly not the empty string.

The set of finite strings over an alphabet is different from the alphabet itself. The empty string is always a member of this set, regardless of what the alphabet contains.

2012-01-25 18:17:55 by Lee:

Formal languages and grammars use set theory to establish the tractability of creating automatons for validating sentences using a set of rules. By definition of a set, all sets contain the empty set. Therefore all sets used to define the set of well-formed sentences contain the empty set (i.e. "").

2012-01-25 18:32:49 by Sam:

Not all sets contain the empty set. For example, the empty set does not. It's empty.

add comment