I ran across the Wikipedia article on this subject today. While I found it fascinating and enthralling, it also took me about half an hour to properly decode the notation being used, so here I have interpreted it for the layman.
Begin with symbols. A symbol can be anything. One common example is an upper- or lower-case letter of the Roman alphabet.
"a"
"b"
"W"
etc.
If we allow other special characters to be symbols, then we could also have:
"_"
"'"
";"
etc.
But our symbols could also be English words. If this were the case, then our symbols would be things like:
"spectacular"
"adventure"
"heroes"
A symbol in isolation does not need to mean anything.
A set of symbols is called an alphabet. An alphabet must be finite.
A string is a finite series of symbols. All the symbols must be chosen from the same alphabet. If we use Roman letters as our alphabet, then some example strings would be:
"cat"
"misanthropy"
"sfdggkklsl"
If our alphabet contained special characters, then some more example strings might be:
"__^(sdfsd_+_$%"
"dd'dddddd"
";kfllfddf7777lskl¬¬"
If, instead, our alphabet consisted of English words, then a string (or sentence) would simply be a series of words, such as
"bucket July embarrass whatsoever isn't a the"
Although every string must be finite, notice how there are infinitely many possible finite strings.
"bucket"
"bucket bucket"
"bucket bucket bucket"
"bucket bucket bucket bucket"
etc.
For every alphabet, there is also a single empty string:
""
Notice, again, how strings have no meaning ascribed to them.
Given an alphabet, a formal language L is any set of finite strings over that alphabet.
A formal language may contain a finite number of strings.
The smallest possible language is the empty language, containing no strings at all (this is not to be confused with the language containing only the empty string).
Or, a language may contain infinitely many strings.
The largest possible language is the one containing every possible finite string.
Strings which are in L can be considered "correct". Strings which are not in L can be considered "incorrect". For example, if our alphabet was the set of all English words, then our language could be the set of all grammatically correct English sentences:
"The cat sat on the mat"
"No"
"Colourless green ideas sleep furiously"
etc.
For a language L to be well-defined, we only need one thing: a process which generates all of the strings in L.
A formal grammar is a finite set of rules for generating "grammatically correct" sentences.
A formal grammar works by always starting with the same single "unfinished sentence" or "root". Then, different rules are applied in turn, modifying the sentence each time, until the sentence is deemed to be "finished" and the process terminates.
(By definition,) the set of sentences which can be generated by following the rules of a formal grammar comprise a formal language. The sentences which cannot be generated in this way do not fall in the formal language.
Once again, note that a grammar ascribes no meaning to the language it generates.
A formal grammar has four components.
First we have a finite set Σ of terminal symbols. This is our alphabet. The final sentence must consist only of terminal symbols. For our example we'll take:
Σ = { "a", "the", "dog", "cat", "barked", "napped"}
Next we have a finite set N of nonterminal symbols. Non-terminal symbols are special. I think of them as wildcards. If a sentence still contains non-terminal symbols then it is not finished yet. To help keep this intuitive, my non-terminal symbols will all have <angle brackets>.
N = { "<sentence>", "<noun>", "<verb>", "<article>" }
Note that there can be no symbols shared between Σ and N.
One of the non-terminal symbols, S, must be nominated as a start symbol from which all sentences must be built.
S = "<sentence>"
Finally, we have a finite set P of production rules, which is things we are allowed to do to construct sentences.
Every rule must contain at least one non-terminating symbol on the left.
P = {
- "<sentence>" → "<article><noun><verb>"
- "<article>" → "a"
- "<article>" → "the"
- "<noun>" → "dog"
- "<noun>" → "cat"
- "<verb>" → "barked"
- "<verb>" → "napped"
}
A grammar G is thus expressed as
G = { N, Σ, P, S }
Obviously, we can theoretically concatenate any combination of symbols, to make example strings like
"<verb> <verb> <verb> a dog a dog <sentence>"
"cat napped a"
"a a a a a a a a a a a <noun>"
However, starting from S and following the rules P, only some of these can be constructed.
A sentence y can be derived in one step from another sentence x if there is a rule which can be applied to connect the two. For example:
x = "cat cat cat <noun> <verb> the"
Using rule 4, "<noun>" → "dog", we could generate:
y = "cat cat cat dog <verb> the"
Or, using rule 7, "<verb>" → "napped", we could generate:
y = "cat cat cat <noun> napped the"
A sentential form is any sentence which can be derived in a finite number of steps, starting from S. Examples are:
"<sentence>" (0 steps)
"<article> <noun> <verb>" (1 step)
"<article> <noun> barked" (2 steps)
"a <noun> barked" (3 steps)
"a cat barked" (4 steps)
A finished sentential form is any sentential form which contains no non-terminating symbols. Only the last of the previous five examples, "a cat barked", counts. All the others have "wildcards" in.
Finally, the formal language L(G) is the set of all finished sentences:
L(G) = {
"a dog barked"
"a dog napped"
"a cat barked"
"a cat napped"
"the dog barked"
"the dog napped"
"the cat barked"
"the cat napped"
}
These rules are a little more complex.
G = {
N = { S, B }
Σ = { a, b, c }
S = S, obviously
P = {
- S → aBSc
- S → abc
- Ba → aB
- Bb → bb
}
}
Here are some productions which follow this grammar. Check to see which rule is being followed for each step.
S
abc
S
aBSc
aBabcc
aaBbcc
aabbcc
S
aBSc
aBaBScc
aBaBabccc
aBaaBbccc
aBaabbccc
aaBabbccc
aaaBbbccc
aaabbbccc
The finished sentences are the last ones in each block: abc, aabbcc and aaabbbccc. All the others contain Ss or Bs. So, notice how the language described by this example grammar is:
L(G) = {
abc
aabbcc
aaabbbccc
...
}
Note that while N, Σ and P must always be finite (though they may be very large), L is infinite in this case.
The concept of a formal grammar is not a mathematical curio but of critical importance in the modern world.
Using the concept of a formal grammar to try to approach actual human languages like English and Chinese is possible, though extremely difficult and of dubious usefulness.
Formal grammars are used in mathematics to define what does or does not constitute a syntactically correct mathematical statement or equation. We can create a language of mathematics.
Bad:
488023a ++ 5353 ==== / 5 / 5 /
π * 2.3.34534532.!
993333339X999999
Good:
a2 + b2 = c2 + 17.3
X ⇒ ( ( X ⇒ Y ) ⇒ Y)
2 + 2 = 87
Note again that a grammar ascribes no formal meaning to the language it describes. The fact that the sentence is syntactically correct does not imply that it is factually correct. It doesn't even imply that the finished sentence is a statement of fact at all! Grammatical correctness draws a line between gibberish and nonsense, but not between nonsense and sense.
Very simple formal grammars are used as the basis for the statements of propositional logic, which itself is the basis for much more valuable mathematics.
Formal grammars are also used to define syntactical correctness in machine languages: that is, programming languages (such as C, Java, Python, and almost any other language you can think of), markup languages (most notably HTML, which is used to describe web pages), data languages (XML, JSON and CSV) and querying languages (SQL and its many brethren).
If you know HTML, a great example can be found by checking out the Document Type Definition of the web page that you're currently looking at: http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. A DTD defines what constitutes valid HTML in the document to follow, and it explains which attributes each HTML element may have, and which elements may nest inside that element. Although the syntax is a little different from the one above, you can probably work out how to read the DTD and spot the wildcards. Notice how the syntax of the DTD is itself very strict and rigorous; it, too, is subject to the constraints of a formal grammar.
Validation is the process of determining whether a sentence is "correct"; that is, whether it is contained within the formal language generated by a grammar.
Parsing is the process of determining the grammatical structure of a sentence with respect to a formal grammar. Specifically, it means determining the sequence of production steps by which the sentence was constructed.
The only way to validate a sentence is to demonstrate that it can be successfully parsed. Likewise, an invalid sentence, by definition, cannot be parsed. Therefore, in the general case, parsing and validation are almost exactly the same thing. Each entails and implies the other. The only distinction is that parsing involves collecting more information.
For an arbitrary sentence and an arbitrary formal grammar, parsing is very difficult. It requires a fully-functional Turing machine and can be very computationally intensive.
In the worst case scenario, you just have to generate every single possible grammatically correct sentence, waiting for your chosen sentence to show up. If it shows up, the sentence is valid; if not, you just have to keep waiting, forever, because it still might!
So, how can we make our machine languages more practical to parse?
We do this by putting restrictions on the types of rules and symbols we use in our formal grammar. By adding further restrictions we can classify all of the formal grammars into a hierarchy. Grammars higher in the hierarchy are very powerful, but also very difficult to validate. Grammars lower in the hierarchy are more constrained and have less expressive power, but they are also easier to validate.
The most important of the various types of restricted grammar is a context-free grammar. This is about halfway up the hierarchy, striking a good balance between simplicity of validation and expressive power.
In a context-free grammar, all of the production rules must start with a single non-terminating character. The rule
"<verb>" → "barked"
is non-contextual.
The rules
Ba → aB
Bb → bb
"<noun> <verb>" → "dog barked"
"<noun> <noun>" → "dog"
"dog <verb>" → "dog barked"
are contextual. They provide restrictions on when the sentence may be transformed - in the last example, we are dependent on the existence of the word "dog" immediately before the "<verb>", to provide context. If we had the sentence "cat <verb>" then this rule could not be used, because of the incorrect context.
In a context-free grammar, these rules would not be permitted. The position and context of the single non-terminating character would not affect the operations which could be performed upon it.
(Note how the first example grammar in this writeup is context-free as originally presented, while the second is not.)
Sentences in context-free languages can easily be broken down into their original components and production steps, and the automaton required need not be as complex as a full Turing machine - you can get away with a simpler machine called a pushdown automaton. Validating sentences in context-free grammars is computationally simpler, more mathematically tractable and has in fact been the subject of a great deal of valuable study. Taking an existing contextual grammar and formulating it to become context-free (by changing its rules, without altering the resulting formal language) is a very useful endeavour.
Discussion (16)
2010-01-13 01:41:09 by Chris:
2010-01-13 01:45:10 by Supergrunch:
2010-01-13 03:32:38 by TSC:
2010-01-14 04:10:11 by dankuck:
2010-01-15 05:11:47 by Josh:
2010-01-18 16:33:03 by Dentin:
2010-01-23 00:59:56 by Drake:
2011-06-12 22:51:15 by sean:
2011-10-05 15:37:57 by Kobi:
2011-10-05 15:45:07 by qntm:
2012-01-25 18:17:55 by Lee:
2012-01-25 18:32:49 by qntm:
2014-11-27 20:29:07 by Lana:
2015-01-31 17:08:55 by Cristiano:
2016-05-02 15:15:14 by Jonathan:
2016-08-16 15:01:13 by Aaron: