## Formal language

For a formal grammar G = { N, Σ, P, S }, where...

• N is a finite set of nonterminal symbols
• Σ is a finite set of terminal symbols, disjoint from N
• P is a finite set of production rules, each consisting of
• A finite string of symbols on the left (with at least one of them being nonterminal in N)
• A finite string of symbols on the right
• A special starting symbol S which is nonterminal in N

...the formal language L(G) is the set of all finite strings of symbols which can be produced by...

1. Starting with a string consisting of just an S
2. Applying a production rule to the current string, to turn it into a new string
3. Repeating step 2 until there are no nonterminal symbols left in the string.

At this point, all the symbols in the string are terminal in Σ, which means no more production rules can be applied. So, the sentence itself is terminal.

While G is finite, L may be finite, countably infinite, or empty. L is distinct from the set of all finite strings of terminal symbols in Σ, because L only contains the producible strings. L is the set of all terminal strings.

### The relationship between G and L

It is a relatively straightforward recursive process to derive all the the members of L from G.

2. Stage 2: take each production rule in turn. Try to apply it to S. Sometimes the rule is not applicable, but sometimes a new string is produced. Some of these are terminal. Add these to L. The rest are nonterminal. Hand these off to Stage 3.
3. Stage 3: take each production rule in turn. Take each nonterminal string from Stage 2 in turn. Try to apply the rule to the string. This produces many, many more strings. Some of these are terminal. Add these to L. The rest are nonterminal. Hand these off to Stage 4...
4. Continue...

This process may continue forever, as we know, because L may be infinite.

But what about reversing this process: turning a set of strings L back into a finite formal grammar G? This is much harder.

If L is finite, e.g....

L = {

fjjw
a
moocow
floipojwerpingelrnv
kkkkkkk
peas
...
zzzzz

}

...then in theory we can just enumerate all the members of L and have a grammar which produces each one in turn...

G = {

N = { S }

Σ = { a, b, ..., z } [every terminal character seen in L]

S = S, obviously

P = {

Sfjjw
Sa
Smoocow
Sfloipojwerpingelrnv
Skkkkkkk
Speas
...
Szzzzz

}

}

However, this is trivial, and not terribly helpful, because G is as huge as L and therefore doesn't tell us anything about L, which is the point of the exercise.

If we wish to make a small formal grammar from L, then the task becomes substantially more difficult. If L is infinite, then the task can even become impossible.

#### Proof that some languages have no formal grammar

If a language is a set of finite strings, then a language L is a subset of the set (call it V) of all possible finite strings. This means that the set of all languages is the set of all possible subsets of V. This is called the power set of V, which is written P(V). A power set is always strictly larger than the original set. V is countably infinite, meaning that P(V) is uncountably infinite. There are uncountably many languages.

However, every formal grammar G is finite. This means that the set of all possible formal grammars is countably infinite. Each formal grammar generates a single language L, which means, even including duplicates, that there only exists a countably infinite number of formal languages.

This means that uncountably many languages must not be formal languages; they cannot be generated by a finite formal grammar.

### Discussion (4)

#### 2010-02-11 03:57:51 by pascal:

doesn't really belong with the theory, but 3 years ago there was some buzz around a nice (linear time, if I remember correctly?) algorithm that produced a grammar from input strings. (The researchers wanted to use it for compression, i.e.: save the grammar and the way through the graph, produce the data by evaluating it)

#### 2010-02-12 04:18:06 by joel:

"If we wish to make a small formal grammar from L, then the task becomes substantially more difficult. If L is infinite, then the task can even become impossible." But this is what the brain does when it learns languages, isn't it? On the other hand, one could argue that the brain, being a neural network, heuristically reconstructs the grammar by trial and error in the process of network training (viz, constant exposure to language samples), which complicates matters somewhat. Meh.

#### 2010-02-12 10:38:11 by qntm:

Firstly, the human brain is the most powerful pattern recognition machine ever. The ability to handle language is what separates us from beasts. Also, not *all* formal languages are impossible to deconstruct into finite, small formal grammars. The impossible-to-deconstruct ones are the wild, irrational ones without patterns. Human languages generally obey some kind of rules, otherwise they are incomprehensible. Finally, natural human languages are not formal languages, and trying to formulate them as such is often counter-productive.

#### 2017-03-10 10:00:07 by Ingvar:

Very late-coming comment, but... This is eerily reminiscent of Kolmogorov complexity, although applied to sets of strings rather than single strings.

#### New comment by :

Plain text only. Line breaks become `<br/>`

The square root of minus one: