Learn regular expressions in about 55 minutes

This is another one from the "adapted from educational materials produced for work" department (previously).

I know there are a quadrillion tutorials on this subject but none of them explain regular expressions in exactly the way I prefer: specifically, by stating up front that a regular expression is just a computer program (i.e. a sequence of instructions) written in a terse domain-specific programming language.

I think the main area for improvement for this article is in the "further reading" section, which currently doesn't exist.

Discussion (31)

2014-03-09 23:30:34 by DanielLC:

When they taught me the theory, they made it sound more like a regular expression was equivalent to a finite state machine, which is a program. I suppose the only difference is that regular expressions have an interesting compiler.

2014-03-10 00:32:24 by qntm:

The finite state machine explanation is okay if you care about formal language theory, but if you're trying to teach someone to use regexes productively, it wastes a whole heap of time and gets you absolutely nowhere.

As soon as you introduce word/line/text boundaries, non-greedy multipliers, capture groups or back-references - all of which are extremely basic, entry-level regex features - the link to FSMs is broken, to such an extent that it wasn't even worth examining in the first place.

Even staying in "strictly regular" land, you have to jump through all kinds of hoops to make the behaviour of an FSM consuming a string line up with the behaviour of a text editor running a regex search. It's just not useful. I say this as someone who loves FSMs unconditionally.
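To make the back-reference point concrete, here is a minimal sketch in Python's re module. The pattern and the sample sentences are made up for illustration and are not taken from the tutorial.

```python
import re

# A back-reference (\1) must repeat exactly what group 1 captured,
# which is the kind of feature a plain finite state machine cannot express.
doubled_word = re.compile(r"\b(\w+) \1\b")

print(doubled_word.search("it was the the best of times").group(0))  # the the
print(doubled_word.search("it was the best of times"))  # None
```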

2014-03-10 00:47:44 by Jake:

I noticed that you escaped the open square brackets in the character class. You don't actually need to do that - since character classes don't nest, [ has no special meaning in them.

That might be worth mentioning in the "rules for inside character classes are different from outside" note.
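A quick way to check this in one engine is the sketch below, which uses Python's re module and an invented sample string. Other tools may behave differently.

```python
import re

text = "f(x)[0]{y}"  # illustrative input only

# The same character class, with and without a backslash before "[".
escaped = re.compile(r"[(\[{]")
unescaped = re.compile(r"[([{]")

print(escaped.findall(text))    # ['(', '[', '{']
print(unescaped.findall(text))  # ['(', '[', '{']
```

The "[" is deliberately not the first character of the class here, since newer Python versions warn about a class that begins with "[[".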

2014-03-10 00:55:40 by hobbs:

You might want to avoid the heading "Basic Regular Expression Syntax". It's clear to me that you're not talking about the awful POSIX "Basic RE", but it might not be clear to everyone.

You also might or might not want to mention (depending on whether you think it would cause too much confusion) that your rules about backslashes and metacharacters apply to almost all *modern* implementations (basically, the ones that copied off of Perl), but that in some older ones (especially found in text editors and unix tools), sometimes metacharacters are literal by default, but adding a backslash makes them meta.

2014-03-10 01:23:52 by KimikoMuffin:

@DanielLC: How did they define the different states?

2014-03-10 01:34:39 by qntm:

"All characters are literal and only become metacharacters when escaped" is how they do things in the Mirror Universe. The only reason for a tool to work like that in 2014 is because it was created in 1971 and never updated to reflect modern usage. And/or, to irritate people.

2014-03-10 04:14:34 by Mike:

"^.*$ will find your entire text, because a line break is a character and . will find it. To find a single line, use a non-greedy multiplier, ^.*?$."

This isn't always the case - for example, in Python this is an option (http://docs.python.org/2/library/re.html#re.DOTALL ) which is off by default.
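A minimal sketch of the Python behaviour, using a made-up three-line text. The flags are the real re.DOTALL and re.MULTILINE; everything else is illustrative.

```python
import re

text = "one\ntwo\nthree"

# Default flags: "." does not match a line break and "$" only matches at the
# very end of the text, so this pattern cannot match a multi-line text at all.
print(re.search(r"^.*$", text))  # None

# With DOTALL, "." also matches line breaks, so the whole text is found.
whole = re.search(r"^.*$", text, re.DOTALL)
print(whole.group(0) == text)  # True

# With DOTALL and MULTILINE, the non-greedy form stops at the first line end.
first_line = re.search(r"^.*?$", text, re.DOTALL | re.MULTILINE)
print(first_line.group(0))  # one
```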

2014-03-10 05:25:36 by MichaelSzegedy:

@Sam: Unfortunately, it is how Vim does it, and a huge number of people still use Vim (because it has merits that outweigh its silly treatment of regular expressions). I'm not sure if it's worth including in the tutorial, but it did cause me a lot of grief until I figured it out. (Why are my expressions with + in them always failing to match? They work with egrep.)

2014-03-10 13:43:40 by ianso:

May I suggest using highlighting colours of varying hues? This would make the syntax highlighting better for people with various types of colour-blindness.

2014-03-10 16:08:30 by Geoff:

Note: You use alternation to explain things in the non-greed section, just before the actual alternation section. If this is meant to be read straight through by people who don't know regular expressions, you may wish to flip the order of these sections. Or move alternation earlier, if you want all the multiplier sections together in that order.

2014-03-10 17:09:36 by qntm:

Geoff, good spot, that has been fixed.

2014-03-10 17:15:38 by Dominik:

"Suppose our regular expression is (\w+) had a ((\w+) \w+). If our input text is I had a nice day, then"...

I think you want * there instead of +. And in the example in the red box below that, I think the first capture group is dog and the second empty, going through the parentheses from left to right (I don't actually know regular expressions, so if this is wrong, it would probably be a good idea to explain why in the tutorial).

2014-03-10 17:20:11 by qntm:

In the first case I definitely want \w+, which means "a word of one or more characters". If I put \w*, that would match the empty string, which is not desirable (although it would still have the same result in the given example).

The second point is correct and has been fixed. I transposed those two to make the explanation clearer but didn't fix it completely.
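For anyone who wants to check the \w+ versus \w* point, here is a small sketch in Python's re module, using the regular expression and input text quoted above.

```python
import re

text = "I had a nice day"

# Groups are numbered by their opening parenthesis, left to right.
plus = re.search(r"(\w+) had a ((\w+) \w+)", text)
print(plus.groups())  # ('I', 'nice day', 'nice')

# With \w* the result on this input is the same...
star = re.search(r"(\w*) had a ((\w*) \w*)", text)
print(star.groups() == plus.groups())  # True

# ...but \w* is also happy to match nothing at all, and \w+ is not.
print(re.fullmatch(r"\w*", "") is not None)  # True
print(re.fullmatch(r"\w+", "") is not None)  # False
```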

2014-03-11 15:51:05 by Paul:

One thing that might be worth noting is that the behavior of backslash-escaping inside bracket expressions can vary between tools. On my machine (debian), perl and awk treat the backslash as an escape character, while sed and egrep treat it as literal.

2014-03-12 00:57:59 by qntm:

Adding exercises has surely increased the length of this thing past 55 minutes. Oh well.

Soon: answers to exercises will be hidden by default.

2014-03-13 12:32:28 by Itai Bar-Natan:

"Write a regular expression to match an integer between 1 and 31 inclusive. Remember, [1-31] is not the right answer."

I think I may have gone a bit overboard with that one.

0*(([1-9]|[12]\d|30)(\.\d*)?|31(\.0*)?)|0b0*1[01]{0,4}|0o0*([1-7]|[1-3][0-7])|0x0*([1-9a-fA-F]|1[0-9a-fA-F])

2014-03-13 19:00:16 by qntm:

"02.5" is an integer now?

2014-03-14 15:24:00 by Itai Bar-Natan:

Yes, I forgot they're supposed to be integers. This should be slightly less extravagant:

(0*([1-9]|[12]\d|3[01])|0b0*1[01]{0,4}|0o0*([1-7]|[1-3][0-7])|0x0*([1-9a-fA-F]|1[0-9a-fA-F]))(\.0*)?
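A hedged sanity check of the expression above, in Python. It assumes the intent is to accept the decimal, binary, octal and hexadecimal spellings that Python's own str, bin, oct and hex produce, which is my reading of the comment rather than something it states.

```python
import re

# [1-31] is a character class holding 1, 2 and 3 (the range 1-3 plus a
# redundant 1), so it can only ever match a single character.
naive = re.compile(r"[1-31]")
print(naive.fullmatch("31"))  # None
print(naive.fullmatch("2") is not None)  # True

# The second expression from the comment, split over two lines for width.
extravagant = re.compile(
    r"(0*([1-9]|[12]\d|3[01])|0b0*1[01]{0,4}|0o0*([1-7]|[1-3][0-7])"
    r"|0x0*([1-9a-fA-F]|1[0-9a-fA-F]))(\.0*)?"
)

spellings = (str, bin, oct, hex)
in_range = all(extravagant.fullmatch(f(n)) for n in range(1, 32) for f in spellings)
out_of_range = any(extravagant.fullmatch(f(n)) for n in (0, 32) for f in spellings)
print(in_range)      # True
print(out_of_range)  # False
print(extravagant.fullmatch("02.5"))  # None
```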

2014-03-14 19:44:26 by Ethan:

Towards the beginning of the tutorial for regex you have:

Any metacharacter can be escaped using a backslash, \. This turns it back into a literal. So the regular expression

c\.t
means "find a c, followed by a full stop, followed by a t".

Shouldn't it say:
means "find a c, followed by a period, followed by a t".

2014-03-18 15:52:26 by DZ:

@Sam: This writeup treats regexes as programs and really goes into depth on implementation: http://swtch.com/~rsc/regexp/regexp2.html

@Ethan: A full stop is a period.

2014-03-22 12:33:39 by Veky:

Also, in "avocado" example, I know of no regex engine that puts "v" in \2. Everywhere I looked, it's "o".

2014-03-24 17:41:26 by qntm:

Noted and fixed, thank you.

2014-03-25 06:31:01 by Henry:

I'm someone aspiring to study NLP, and I found this tutorial really useful. Thanks!

2014-04-22 21:12:25 by Todd:

As a big user of Linux and a programmer, this tutorial has helped me tons! I've awk'd and grep'd and sed'd all over the place and pieced together regexes from different places. Your tutorial helped me understand exactly what I was doing. Thanks for the help!

2014-06-11 09:00:56 by Scuzzball:

The only improvement I can think of is a link to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags in the HTML parsing section.

Also, thank you for linking the email regex, I couldn't find it last time I wanted to show someone how complex emails could be.

2014-08-04 13:39:23 by MilesOfSand:

I've gotta say this is probably the first helpful/useful tutorial I've run across.

2014-11-06 22:08:13 by David:

It is simply the best regex tutorial, and the only one I would recommend anytime to anyone really interested in them. It is so clear that it becomes impossible not to learn.

Really, great, so great job man.

Thanks a lot!

2015-01-24 00:29:30 by ckp:

i have to say i agree with David. i've taken a look at several over the last week or three, and this is quite simply the best of that lot. having said that, i'm encountering the "this tool does it THIS way" syndrome Way Too Much ... sad there have to be so many different "standards"

2015-07-02 02:42:38 by Clearwater:

This was an excellent tutorial but you lost me towards the middle. I can tell you're an expert at this. How can we get more teachings? Will be happy to pay a small fee for downloadable documents.

2015-11-14 05:11:18 by Quexint:

In the chapter "word boundary", the answer of the exercise is a line with the max length of 76 characters.

2016-09-18 17:48:54 by Leo:

I enjoyed your tutorial very much! Thank you for this fantastic introduction to regexes.

After completing the lecture I found two unclear parts. In the chapter Alternation:

[1-9]|[12][0-9]|3[01] may not find two-digit numbers because it first finds the first digit and then searches only the rest – so 21 will be found as 2 and 1. Better sort the alternatives for longer numbers to the beginning of the expression: [12][0-9]|3[01]|[1-9]
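The effect can be reproduced in Python's re module with the sketch below. Engines that pick the longest alternative rather than the first one that matches may behave differently, so treat this as one engine's behaviour.

```python
import re

original = r"[1-9]|[12][0-9]|3[01]"
reordered = r"[12][0-9]|3[01]|[1-9]"

# When searching, the first alternative that matches wins, so "21" is split.
print(re.findall(original, "21"))   # ['2', '1']
print(re.findall(reordered, "21"))  # ['21']

# When the whole string must match, the engine backtracks into the later
# alternatives, so the original order still accepts "21".
print(re.fullmatch(original, "21").group(0))  # 21
```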

In the chapter Line Boundaries:

The longest line is 3260: End of Project Gutenberg's The Time Machine, by H. G. (Herbert George) Wells (76 characters)