A technical question about the qntm.org URL schema

Hey everybody.

So, you may or may not know that I designed qntm.org to have extremely terse, human-readable, human-memorable URLs. The full rationale behind this design decision can be found in my article from a few years ago, On short URLs.

One of the reasons given for this scheme is:

Because each page has a short, unique "slug" (e.g. in "http://qntm.org/destroy", the slug is "destroy"), each page has a unique URL. There is no redundancy in my URL schema - i.e., it is not possible to reach the same resource via multiple URLs. This keeps my site's complexity to a minimum and I'm told it's also good for SEO, although I honestly pay very little attention to the latter.

Here's what happens, then, when you try to load various URLs:

Now, new content on qntm.org is made available via an RSS feed which is in turn accessed by many feed readers. It was recently brought to my attention that the feedly feed reader was beginning to append query strings to the end of URLs when users visit them. For example, although the announced URL for my review of the movie Oblivion was

http://qntm.org/oblivion

, the URL that users of feedly were sent to when they clicked on that link (or however it is that feedly works) was

http://qntm.org/oblivion?utm_source=feedly

, which, as described above, results in a 404 response code. Unfortunately, this casts qntm.org as an unreliable website, which is undesirable. The intended purpose of these extra keys and values in the query string is outlined here. In theory I could gather up this information and use it to carry out analytics on my site usage, but I don't actually care.

This behaviour was brought to feedly's attention by a reader, and feedly added qntm.org to a whitelist overriding this behaviour. But the question remains.

Is it normal to add an arbitrary query string to the end of an existing URL and expect the modified URL to still locate the same resource? Is qntm.org's current behaviour in all four of the above cases technically correct? What are the pros and cons of, for example, serving a 301 Moved Permanently redirection to the canonical URL, instead of the 404?

Update 8 May 2013

I've implemented the suggested change of serving a 301 Moved Permanently to people who append random query strings. I also did some plumbing, so please let me know if you notice problems with the way that qntm.org is working.
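
For illustration only, the behaviour described in this update could be sketched roughly as follows in Python. The function name `respond` and the set `KNOWN_SLUGS` are hypothetical and say nothing about how qntm.org is actually implemented. The sketch only shows the idea of answering a known slug plus an unexpected query string with a 301 pointing at the canonical URL.

```python
from urllib.parse import urlsplit

# Hypothetical set of known slugs; the real site's lookup is not described in the post.
KNOWN_SLUGS = {"oblivion", "destroy"}

def respond(url):
    """Return (status, location) for a requested URL.

    location is the redirect target for a 301, otherwise None.
    """
    parts = urlsplit(url)
    slug = parts.path.lstrip("/")
    if slug not in KNOWN_SLUGS:
        return 404, None
    if parts.query:
        # Known page, unexpected query string: point at the canonical URL.
        return 301, f"{parts.scheme}://{parts.netloc}{parts.path}"
    return 200, None
```

Under these assumptions, a request for http://qntm.org/oblivion?utm_source=feedly would be answered with a 301 to http://qntm.org/oblivion, while the bare URL would get a 200.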

Update 14 May 2013

Some backend changes are coming, none of which should (ideally) change the way that qntm.org functions other than to improve performance. I'm building some nominal testing procedures but please let me know.


Discussion (28)

2013-05-03 20:25:57 by Joshua:

God I love technical minutiae

2013-05-03 20:44:16 by DavidHaitch:

My first knee-jerk response would be to truncate incoming requests on the question mark, just sticking a bit of string processing before the resource lookup. Unless some pages have question marks in the name?
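
A minimal sketch of the truncation idea in this comment, in Python. The function name `slug_from_request_target` is hypothetical, and the sketch assumes, as the comment does, that no slug contains a question mark.

```python
def slug_from_request_target(target):
    """Drop everything from the first '?' onward, then strip the leading '/'."""
    path, _sep, _query = target.partition("?")
    return path.lstrip("/")
```

With this in front of the resource lookup, "/oblivion" and "/oblivion?utm_source=feedly" would both resolve to the slug "oblivion".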

2013-05-03 20:46:04 by qntm:

That's a good question. No, there is no page with a question mark in the slug, nor is that something I want to do in the future.

2013-05-03 20:49:32 by speising:

i think defensive programming is a Good Thing.
That means ignoring any parameters you don't recognize. I don't know if the standards have anything to say about this, but i have the feeling that this is the commonly expected behaviour.

2013-05-03 20:56:48 by DanielParks:

> Is it normal to add an arbitrary query string to the end of an existing URL and expect the modified URL to still locate the same resource?

Yes, this is typical.

> What are the pros and cons of, for example, serving a 301 Moved Permanently redirection to the canonical URL, instead of the 404?

I'd say a 301 is a better solution than 404. It's obvious what is desired by the linker, and most tools that link to you aren't going to change just because you don't support something that most other web sites do. (Not to say that you're wrong.)

I would ignore the query and return a 200 response, and include a <link rel="canonical" href="http://qntm.org/whatever" /> in the header (unconditionally). That's what seems to be typical.

BTW: it seems that your comment form doesn't like &quot;.
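
As a rough sketch of the 200-plus-canonical-link suggestion above, the tag could be built from the requested URL like this in Python. The helper name `canonical_link_tag` is made up for illustration, and the sketch assumes the canonical URL is simply the requested URL with the query string and fragment removed.

```python
from html import escape
from urllib.parse import urlsplit

def canonical_link_tag(requested_url):
    """Build a <link rel="canonical"> tag that omits any query string or fragment."""
    parts = urlsplit(requested_url)
    canonical = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'
```

The page body would be served unchanged whatever the query string was; only the tag tells spiders which URL is the preferred one.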

2013-05-03 21:02:01 by Evan:

I'd go with Postel's robustness principle (http://tools.ietf.org/html/rfc793#section-2.10) here: “Be conservative in what you do, be liberal in what you accept from others.” Never use the ?=whatever links yourself, but don't fail for the sake of it when other people use them on you.

2013-05-03 21:03:09 by DavidSimon:

Your best bet is to use a 301 permanent redirect to the actual target page. This is the spec-correct way to do it, since the response basically means "Ah, I think I know what you want, but here's the proper way to get it, go there instead". Spiders treat it that way, as do things like link validators. And the RFC agrees: http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

The situation you describe also reminds me of cache-busting techniques. When you want to force clients to redownload a file even if it's cached (for example, if you don't control the CDN and can't make it just return useful caching headers in the first place), it's common to just append a meaningless argument to the URL. Every web server I've worked with supports this behavior by default, though they usually just return the contents directly instead of returning a redirect code.
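
To make the cache-busting technique concrete, here is a small Python sketch. The function name `add_cache_buster`, the parameter name "cb" and the example.com URLs are all assumptions chosen for illustration; any parameter the server ignores would do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_cache_buster(url, token):
    """Append a throwaway query parameter so caches treat the URL as new."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("cb", str(token)))  # "cb" is an arbitrary, hypothetical name
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, calling it on http://example.com/style.css with the token 42 gives http://example.com/style.css?cb=42, which only helps if the server ignores the unknown parameter and serves the same file.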

2013-05-03 21:05:07 by hpc:

You seem to be describing canonical URLs. There's quite a few ways to handle it, but the easiest and most generically useful is HTTP permanent redirect.

Behind the redirection you continue to just have one URL per resource. So you continue to keep your site's simplicity this way. A properly functioning user agent interprets a permanent redirect as "this URL is wrong, here's what you actually want". For browsers, this means changing the URL in the address bar. For search engines, this means any references to a redirecting link are instead considered to be the redirect target. This is better for SEO. All those previous RSS links which were throwing a 404 now point to real content. Google will see from the redirect that the ?utm crap doesn't belong, and those previously useless links now contribute to the ranking of your existing pages.

I don't have an Apache instance in front of me at the moment, but something vaguely along the lines of what you want is:
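# 301 to the same path when a query string is present; the trailing "?" drops the query
RewriteCond %{QUERY_STRING} .
RewriteRule ^/?(.*)$ /$1? [R=301,L]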

2013-05-03 21:05:32 by Graeme:

Amusingly, I tried to visit this page from feedly, clicked on the title link ('A technical question...'), and got precisely the failure mode you describe thanks to an appended ?utm_source=feedly. I hadn't noticed this issue before, as I normally click on the link embedded in the story - it picks up your 'Blog>>' breadcrumb as lede - which is unaltered. So if they are claiming to have fixed it, the correction is only partial, but there is a way for feedly users to still access your content without having to reformat the URL by hand.

2013-05-03 21:08:30 by LeanderHarding:

I do hope that if this turns into a theological dispute about RESTful design it's at least an interesting one.

To answer your actual question, although feedly's behavior is odd and sort of possibly-impolite in the manner of getting confused whether one pays at the counter or waits for the wait staff to take your check at a diner, in my experience the expectation has been that web servers just ignore unknown query string parameters. The suggestion to just cut URLs at the beginning of the query in your server makes sense to me (since you are never interested in the query).

If anything, all of the various sites with URLs like http://localhost/?p=the_slug are horribly abusing how URLs are supposed to work. I think the confusion arises because many web developers never really gain a mental distinction between Locating a Resource and passing parameters to a remote procedure call (as, for instance, this very comment form).

2013-05-04 00:25:04 by slucidi:

qntm.org's response to the query parameters is not what I think anyone would consider wrong, exactly, but it is a bit unexpected. Since everything after the ? is just additional GET parameters, typical (and generally expected) behavior is just for the server to disregard any parameters or even headers that are unnecessary or unexpected.

I personally wouldn't throw a 301 or other redirect because the parameters aren't part of the main, hierarchical part of the URI-- they asked for the right resource, they just asked for it with a cherry on top, which you don't care about and don't have to honor.

2013-05-04 02:49:06 by Andrew:

Just seconding what others have said — if you know what the client meant, but you think they're asking for it by the "wrong" name, you should 301 them to the canonical location. It's friendly, it complies with web standards, it works equally well with browsers and automated UAs, and it keeps the SEO benefit of not having multiple URLs for the same resource.

If you're not able to redirect for some reason, <link rel="canonical"> is also useful — browsers rarely, if ever, do anything with it, but spiders understand it.

2013-05-04 06:49:24 by Scuzzball:

You should just accept it, and ignore it. You're already using mod rewrite, so it SHOULDN'T be hard, but I have used mod rewrite, so I wish you good luck, and few|no unexplained errors.


LeanderHarding, the thing about doing http://localhost/?p=the_slug is it's how you do dynamic content, at least from what I've seen.
In his actual post on his url scheme (http://qntm.org/urls) he mentions using lots of mod rewrite. It probably rewrites to something like http://qntm.org/?p=slug
But I do agree that it's terrible to leave it like that where users can see it.

Also, your comment html validator demands you close breaks. Self closing things are gone in HTML5. Besides that, it was claiming something else didn't close and I gave up.

Also, I really like the content of your site. It's quite interesting.

Scuzz

2013-05-04 11:38:53 by Michael:

I think that people who go around appending query strings to URLs are doing it wrong. What if I happen to set up my website such that those query strings have meaning, and give a different page? (E.g. utm_source = Ultraman source, if you think that Ultraman comes from planet Feedly, click here, but if from Planet Xargon, click here.)

Next, we have what to do about it. There is nothing wrong with giving a 404 in cases where a page doesn't exist. However, to be friendly to your users, having these non-existent pages redirect to the correct page (301) is the best thing to do. Simply ignoring the query string, without redirecting can lead to all sorts of hassle later. E.g. someone starts spreading a url that looks like http://qntm.org/oblivion?utm_source=feedly, and then your nice short URL is actually not so short or nice.

Cheers.

2013-05-04 13:24:33 by Tom:

qntm.org/?primer goes to qntm.org/primer.

2013-05-04 14:48:16 by Calvin:

Is it normal to add an arbitrary query string... and expect... the same resource? Yes, although such is extremely poor design.

Is... current behaviour... technically correct? Yes. However, current design suggests second case should also work.

What are the pros and cons of... redirection... instead...?
Pros: Robust front-end.
Cons: Encourages poor practices, referer should suffice.

2013-05-05 20:11:10 by Veky:

301 vs 200 is reasonable discussion (and my subjective view is it really should be 200, but I understand the 301 arguments - you can even use 405 if you really want to punish/teach people:)), but the 404 crowd is factually wrong in an important detail.

Resource exists. In http://qntm.org/tax?param=value, resource _is not_ identified by /tax?param=value. Resource is identified by /tax. The ?= syntax is not just some convention among some web developers (or, it is a convention in same way HTTP itself is). Things are specified quite precisely in the standard.

Let's see somewhat analogous example, free of form parameters. What will server do if you GET http://qntm.org/tax#bla? (Sam, try to answer without checking first.;) What is the resource? What is the fragment and what is done with it if it is not recognized?

I think it is perfectly clear now. HTH,

2013-05-06 14:35:13 by onapthanh:

[onaprsc.com.vn]-I tried to visit this page from feedly, clicked on the title link I hadn't noticed this issue before, which is unaltered. So if they are claiming to have fixed it, the correction is only partial, but there is a way for feedly users to still access your content without having to reformat the URL by hand. I think that people go around appending query strings to URLs are going it wrong. And so, we have what to do about it. There is nothing wrong with giving a 404 in cases where a page doesn't exist. However, to be friendly to your users, having these non-existent pages redirect to the correct page (301) is the best thing to do. Simply ignoring the query string, without redirecting can lead to all sorts of hassle later@ http://onaprsc.com.vn

2013-05-06 17:32:52 by speising:

^^ interesting kind of spambot :/

2013-05-07 07:32:28 by Veky:

Yeah, it even knows the square root of minus one. :-D

2013-05-07 14:45:31 by Michael:

I just want to comment on Veky's incorrect comment. Veky says:
"In http://qntm.org/tax?param=value, resource _is not_ identified by /tax?param=value. Resource is identified by /tax."
The query string is an integral part of identifying the resource. People (e.g. Feedly) who append query strings to URLs should not be surprised if it breaks the website.

The fragment (everything after a hash #) is not, and is never passed back to the server (in the ordinary scheme of things, AJAX has complicated matters slightly).
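
The syntactic split being argued over here can be inspected with Python's standard urllib.parse module. This sketch only shows how a URL divides into path, query and fragment, and which parts end up in the request line; it does not by itself settle whether the query helps identify the resource. The function names `components` and `request_target` are hypothetical.

```python
from urllib.parse import urlsplit

def components(url):
    """Split a URL into the three pieces discussed in this thread."""
    parts = urlsplit(url)
    return {"path": parts.path, "query": parts.query, "fragment": parts.fragment}

def request_target(url):
    """Return what a client would normally put in the HTTP request line.

    The path and query are sent to the server; the fragment is kept client-side.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target
```

So for http://qntm.org/tax?param=value the server sees "/tax?param=value", whereas for http://qntm.org/tax#bla it only ever sees "/tax".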

2013-05-08 06:07:47 by Kochier:

Just putting in my two cents: I would ignore everything after the ? if you don't use it. It makes it easier for users. Perhaps do a quick redirect to one of your URL-friendly pages, but they should still get where they want to go. Also, I think it's a little rude to edit your URL like that; you never know how a website handles those kinds of things.

2013-05-08 22:09:19 by slucidi:

The change seems to be working well. I'm sure this will make things easier for some of your readers-via-rss.

2013-06-05 19:10:22 by Psycho:

One thing I've noticed: there's now a black bar at the top of the website. Is that intentional? It seems rather out-of-place in an otherwise homogeneously coloured website.
1920x1080, 64-bit Chrome, Windows 7.

2013-06-05 21:07:03 by qntm:

Turn your ad blocker off.

2013-06-07 05:41:34 by Psycho:

Ah. Thank-you.
So I know how stupid to feel, was that previously mentioned?

2013-06-07 05:43:31 by Psycho:

Wait. How long have you had ads? I don't remember them being there before I saw the black line, and that was relatively recently.

2013-07-01 02:02:37 by OvermindDL:

Bit late, but relevant: http://www.mattcutts.com/blog/rel-canonical-html-head/

Why does the name not allow numbers, there are real names with numbers in them (experienced in my own job).