Crowdsourced atomic translation

You are familiar, of course, with Urban Dictionary, in which users submit definitions of various slang terms, and the most popular (presumably the most accurate) definition gets voted up to the top of the stack of definitions underneath each slang term. That's a pretty smart way to build a dictionary of slang, which is, of course, the set of all words whose definitions are loose, fluid and pretty much whatever the world defines them as.

Let's extend this model from slang terms to all English language terms. And also, terms in other languages. And also, multi-word sentence fragments. And also, complete sentences. Instead of a definition, what you submit is a translation into some other language. And then you vote up or down on other translations of the same entry. That could work, couldn't it?
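
To make the submit-and-vote model concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `TranslationStore`, `submit`, `vote` and `ranked` are hypothetical, and a real service would need persistent storage and some protection against vote-stuffing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    score: int = 0  # upvotes minus downvotes


class TranslationStore:
    """Hypothetical in-memory stand-in for the shared translation database."""

    def __init__(self):
        # (source language, target language, phrase) -> {translation: Candidate}
        self._entries = defaultdict(dict)

    def submit(self, src, dst, phrase, translation):
        """Add a translation if it is new; return its Candidate either way."""
        key = (src, dst, phrase)
        return self._entries[key].setdefault(translation, Candidate(translation))

    def vote(self, src, dst, phrase, translation, delta):
        """Vote a translation up (+1) or down (-1)."""
        if delta not in (1, -1):
            raise ValueError("delta must be +1 or -1")
        self.submit(src, dst, phrase, translation).score += delta

    def ranked(self, src, dst, phrase):
        """All known translations of the phrase, highest score first."""
        candidates = self._entries.get((src, dst, phrase), {})
        return sorted(candidates.values(), key=lambda c: c.score, reverse=True)


store = TranslationStore()
store.submit("en", "fr", "where is it?", "où c'est ?")
store.submit("en", "fr", "where is it?", "où est-il ?")
store.vote("en", "fr", "where is it?", "où est-il ?", +1)
print([c.text for c in store.ranked("en", "fr", "where is it?")])
```

The entry key includes both languages, so the same phrase can carry separate stacks of translations for each language pair.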

There are a few more critical components to this idea.

  • One is an API which can be integrated into, among other things, an instant messaging client. This means that when somebody sends you a sentence in a language you don't understand, instead of copying and pasting that into Babel Fish, your client automatically sends out a request and receives a selection of possible translations in return. You then have the option of selecting from the list the translation which turned out to be the most accurate - or responding with a more finely-tuned translation of your own.

  • Another (the slightly more questionable component of my idea) is the bit relating to what the machine does if the specifically requested original phrase isn't present in the database. In a situation like that, we don't particularly want the machine to just go "urk, I got nothing". Even a wild guess pieced together from direct dictionary lookups would be better than nothing - because then the receiving user can go, "Well, that's just nonsensical, but if I alter a few words I can repair it, okay, here is a slightly better translation" and send it back. Thus, the translation stored in the database will iteratively migrate towards something which is more accurate. I hope.

  • The last hurdle is the problem of there being millions and squillions and quintillions of possible sentences, many of them differing from each other by only a very small, trivial "edit distance". I hope to solve this by drastically restricting the maximum length of a message in the database. I'm talking "atoms": five words, or 64 bytes, or something like that. This could be the fatal flaw in my plan, but the key point of my plan is that it must require no linguistic skill on my part, which means that it cannot require any kind of language-specific translation intelligence, only basic algorithms and raw data.

    My belief is this: it is easier to make a human being manually break up a complicated sentence into shorter, simpler, machine-translatable sentences than it is for a machine to accurately translate the original longer sentence. If we can train people to communicate with greater precision and terseness - and we're well on our way, we already have Twitter and "txtspk" - then we can effectively train ourselves to communicate in an unambiguous sub-language of English/French/Chinese/whatever, which a machine can translate perfectly.
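
The three components above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the function names (`is_atom`, `translate`, `submit_correction`), the plain dictionaries standing in for the database and the word list, and the exact limits are all hypothetical. It needs no language-specific intelligence, only lookups.

```python
MAX_WORDS = 5    # "five words, or 64 bytes, or something like that"
MAX_BYTES = 64


def is_atom(phrase):
    """True if the phrase is short enough to be stored as an atom."""
    return (len(phrase.split()) <= MAX_WORDS
            and len(phrase.encode("utf-8")) <= MAX_BYTES)


def translate(phrase, atoms, dictionary):
    """Return (candidates, is_guess) for one phrase.

    atoms:      dict mapping a phrase to {translation: vote score}
    dictionary: dict mapping a single word to a single translated word
    """
    if not is_atom(phrase):
        raise ValueError("too long: split it into shorter sentences first")
    votes = atoms.get(phrase)
    if votes:
        # Known atom: offer every stored translation, best-voted first.
        return sorted(votes, key=votes.get, reverse=True), False
    # Unknown atom: a wild guess from word-by-word lookups. Words missing
    # from the dictionary pass through unchanged so a human can repair them.
    guess = " ".join(dictionary.get(w, w) for w in phrase.lower().split())
    return [guess], True


def submit_correction(phrase, better, atoms):
    """Store a user's repaired translation so later lookups can find it."""
    votes = atoms.setdefault(phrase, {})
    votes[better] = votes.get(better, 0) + 1


atoms = {}
dictionary = {"where": "où", "is": "est", "it": "il"}
print(translate("where is it", atoms, dictionary))   # (['où est il'], True)
submit_correction("where is it", "où est-il ?", atoms)
print(translate("where is it", atoms, dictionary))   # (['où est-il ?'], False)
```

The `is_guess` flag lets a client warn the reader that the text came from dictionary lookups rather than from a voted translation, which is the cue to send back a repaired version.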

Obviously, some boffin could build on this data set (once it's populated) and make an algorithm capable of translating longer sentences by referring to the various shorter sentences, but that's for another day.

This idea, like all of mine, is unrefined. Somebody want to attempt it?

Discussion (26)

2009-09-13 18:49:59 by YarKramer:

Hmm.

This has the smell of "crazy-awesome enough to work, but doomed to failure due to human stupidity."

2009-09-13 20:18:53 by Robert:

That's actually how Facebook does its translations into various languages, at least with the crowdsourcing. Last time I heard, the stat they were bandying about was that Facebook was translated to French in under 24 hours?

The tricky bit is that the atoms would likely need to be context free. Not all languages deal with "that" or "it" very well and 5 words would likely be far enough away to lose the context.

2009-09-13 20:49:59 by qntm:

That's why you'd have a range of alternative translations, just like a slang term can have multiple definitions.

2009-09-13 20:53:05 by Boter:

"...then we can effectively train ourselves to communicate in an unambiguous sub-language of English/French/Chinese/whatever..."

...which, in science-fiction at least, would end up leading to a whole new language, and might even promote language convergence into a new language, not constructed but with accelerated growth away from other present languages.

Nifty.

2009-09-14 12:54:03 by Jason:

I've heard similar ideas thrown around in human computation circles. Another problem you face is getting enough people motivated to actually do it. Urban Dictionary is based on something everyone has knowledge of - slang. The reward for using it is amusement; often the most humorous, correct entries move to the top. GWAP.com uses games as a reward: the human effort is rewarded by points and entertainment. Amazon's Mechanical Turk rewards with cash. If you can find a good motivator, it will go a long way towards making your site/idea a success.

I'm curious about your notion at the end, of training people to communicate in an unambiguous sub-language of English/French/whatever. If this is what you mean by the flaw in your plan, I'd agree. Getting people to use a controlled language does require training and a lot of effort. It's not how we naturally think. Twitter or txtspeak is not a controlled language, it's a loosely-agreed upon set of abbreviations. If the issue were just removing some words or letters, people readily do that. But what you are saying is that people will have to alter their own grammar. Also, I don't see how what you're proposing would even do this. Can you elaborate on that process?

2009-09-14 13:26:18 by qntm:

Twitter and txtspeak artificially constrain how you express yourself by making it so you only have a few bytes for your complete message, and/or making it so that typing out complete words is a long and time-consuming procedure. By limiting the maximum size of a sentence in the database, the simple situation arises in which people who use short sentences receive accurate translations, and people who use long sentences do not. As a result, people will naturally gravitate towards using short sentences if they want to be understood.

2009-09-14 21:11:17 by Val:

About the artificially created "world language": there has already been an attempt, Esperanto. It generated a lot of interest, but it faded away, and today it is forgotten by all but a few enthusiasts.

2009-09-15 03:47:34 by ZhenLin:

Grammar, especially word order, is likely to be the biggest problem with such a scheme. For example, "Where is the cake I made yesterday", in Japanese has almost exactly the reversed word order - "Yesterday I made cake the where is". However, I suppose with large enough "atoms", it should still be possible to get something intelligible out of it.

A more advanced version of this might use some statistical machine translation technology to automatically correlate words and grammatical structures to offer translations of unknown "atoms". Google Translate operates on this principle, actually, if I remember correctly. It even has the crowdsourcing bit with the "Contribute a better translation" feature.

2009-09-15 07:57:14 by qntm:

I would reduce that to two atoms:

"I made a cake yesterday"
"Where is it?"

And if the word order is obviously wrong, as in your suggestion, then it's very easy for a human to correct it.

2009-09-15 17:09:43 by kRemit:

too much redundancy will be generated, if you accommodate such diverse languages. suppose you want to say "where are the cake and the loaf of bread i made yesterday?" - your atoms would then need to be "i made a cake yesterday", "i made a loaf of bread yesterday", "where are they?" - 15 words and a lot of brains going into something that can be expressed easily with 12 words and no thinking at all - scale that upwards to more complicated speaking situations, and you've got yourself a deal breaker. also, from a linguistic perspective, there's another deal breaker in the form of the Sapir-Whorf hypothesis ( http://en.wikipedia.org/wiki/Linguistic_relativity ), that seems to throw a spanner into your works, or at the very least will limit this idea to child-like language levels. Sorry.

2009-09-15 17:24:05 by qntm:

Correction: 15 words and a little thought as an alternative to 12 words *which would be really difficult to translate using a machine*.

Yes, it constrains freedom of expression to some extent, but the prize you get for restricting yourself to machine-translatable sentence fragments is that your sentences are machine-translatable and the other guy can understand you. Yes, all languages contain huge quantities of nuance and idiom and cultural context. Machines can't easily translate nuance, idiom and cultural context. What's the solution? This is mine. There are others.

2009-09-15 17:33:52 by kRemit:

but yours isn't a solution, because you're not translating language or speech. all you're doing is translating some (not even all!) lexemes in that language/speech, ignoring everything else. this will simply not work. I recommend you read up on basic grammar (beware: dry stuff).

2009-09-15 17:48:30 by qntm:

Sure, it only translates tiny sentence fragments. I never said it was perfect. Just because a solution isn't perfect doesn't mean it's completely without merit. I maintain that the majority of concepts worth expressing can be expressed using atoms, and a majority is good enough. Also, the whole point of this specific idea is that the administrator of the translation database and the systems to which it is connected doesn't actually have to know any grammar, in English or otherwise. The whole point is to see how well we can do without actually doing any language research.

2009-09-15 23:07:44 by Fjord:

To you doubters: How do pidgin and creole languages get their beginning? This is the same thing: one person or group which doesn't understand what the other person or group is saying simplifies their speech for the benefit of the other party in the hopes that they can develop a method of communication. Yes, "Do you know where the cake and pie I made yesterday are?" is simpler to speak and type, but only if you are talking to someone who understands you. Say you were talking to a Frenchman, and you knew a little French, but not enough to say that particular sentence. So you'd end up with something like, "Erm, the - the gateau from yesterday? Where is it? Hier soir? Ou?" and you'd hope that the pie was in the same place or something.

Anyway, my point is that that's the basic principle here. If you KNOW that you're chatting with someone who speaks a different language, you would automatically try to make yours as them-friendly as possible, just as when you modify your speech slightly for a four-year-old.

2009-09-16 06:00:28 by Boter:

Val, if you were referring to me, I mentioned how this would evolve a common language - which is different from Esperanto, which is constructed from the ground up. This new language that would (in sci-fi, anyways) evolve would do so naturally, not in a constrained manner.

2009-09-18 09:24:26 by Val:

Boter: you're right, there are examples of it already. In the Old West, a kind of language combining English, Spanish and native languages was understood by many. However, enforcing something like this as a universal "World language" would result in immense cultural loss. I think it would only be useful in informal communication with strangers, and even the languages of choice would depend on location, etc.

There are, however, situations where, without a native speaker, any attempt with software would result in epic failure: Hungarian, for example. With any other European language I tried translating via computer, the result, even if not perfect, made the general meaning easy to understand. Translating to or from Hungarian, even with the best of our current software, is close to impossible. The result will not even resemble the original meaning.

2009-09-18 20:29:57 by Azrael:

I see an alternative here to the atomic approach, with all the horrible grammar issues that might generate, though I'm not sure if it's practical. Translate whole sentences at a time. As Sam rightly points out, there's an inconceivably large number of sentences possible and most of them will vary only slightly from others. Surely it would be possible to create an algorithm which, when it doesn't find a perfect match, looks for close matches and offers them with a warning. Then some of that unmanageable range of sentences could be reduced to a smaller set of very similar sentence groups? Any of the linguists want to tell me why this wouldn't work? :-)

2009-09-25 08:57:46 by kRemit:

again: this might "work" (for a reasonable subset of "actually") for a majority of language situations/expressions, especially when these are restricted to a certain usage - e.g. via internet, talking business or computer-things.

BUT: it won't result in a "real" new language akin to creole or pidgin - these took a looooong time to evolve, i can't see how this process could be accelerated by the technological additions proposed here. what is more, a new language won't evolve by just using it over the internet - nobody will use anything other than rudimentary elements (certain vocabs, just like some people use 1337-speak in real life) in everyday life. part-time languages generally don't evolve, they die.

Also: in almost all of the aforementioned speaking situations (business, computer-talk, most other things that are talked about on the internet), english is already used. and english proficiency is spreading even as we speak - so i don't really see a need for this "solution" anyway.

2009-10-02 11:10:11 by MikeUnwalla:

Quote: If we can train people to communicate with greater precision and terseness... then we can effectively train ourselves to communicate in an unambiguous sub-language of English/French/Chinese/whatever, which a machine can translate perfectly.

If you train people to communicate clearly, your tool is not necessary, because machine translation is sufficiently good (http://www.international-english.co.uk/mt-evaluation-en-es.html).

2009-10-07 13:07:53 by D:

Actually, this is being done right now, after a fashion. I recently attended a talk by Peter Norvig of Google, who is using automatically-correlated words in different-language versions of web pages to assemble a database of corresponding language tokens much like you describe. The video is on the following page, if you care to view it:
http://ucberkeley.citris-uc.org/events/RE-Sept02

2009-10-12 01:27:52 by doomsought:

It would probably help to set up some sort of grammar learning AI.
Latin would work as a strong basis point or control: most western languages feel its influence, and the suffixes make it easier to parse a word and identify its place in the sentence.

2009-10-23 09:52:55 by Artanis:

This is an intriguing idea, but I think it is not really needed and, to echo a previous issue, there is little motivation to draw a crowd. It may be an interesting experiment, though.

I do think Google's take on the issue is fairly ideal: build a translation algorithm that can handle many statements on its own (in a highly generalized example: identify statement parts such as subject and verb, translate by definition, then reconstruct in target-language word order), and solicit input on those translations via suggestions, because that process is lossy.

2010-11-03 05:03:26 by Dennis:

What if you had a computer program to do the simplifying? It would interrogate the user, "Do you mean this? Or this? Give some alternatives for this." Etc. Much like a spelling checker, though more elaborate.

2010-11-03 05:05:14 by Dennis:

Pardon me, I see you had this at the beginning! However, the point seems to have been lost in the discussion.

2011-05-17 16:45:01 by Eugene:

Why this would not work:

* when somebody sends you a sentence in a language you don't understand... your client automatically sends out a request and receives a selection of possible translations in return...

Comment: 9 times out of 10, the request will come up empty (if no machine translation is employed).

* selecting from the list the translation which turned out to be the most accurate

Comment: but in order to do that, I'd already have to know what the foreign-language message said, innit?

* or responding with a more finely-tuned translation of your own

Comment: See above. And even if I do know, it's a significant investment of my time for a very uncertain return.

* wild guess pieced together from direct dictionary lookups would be better than nothing - because then the receiving user can go, "Well, that's just nonsensical, but if I alter a few words I can repair it, okay, here is a slightly better translation" and send it back. Thus, the translation stored in the database will iteratively migrate towards something which is more accurate. I hope.

Comment: Return on investment. Also, the result space of conceivable alterations, almost all of which will be lateral or even worse, is vastly larger than the space of improvements.

* squillions and quintillions of possible sentences, many of them differing from each other by only a very small, trivial "edit distance". I hope to solve this by drastically restricting the maximum length of a message in the database. I'm talking "atoms": five words, or 64 bytes, or something like that.

Comment: It's still "infinity". A lesser "infinity", to be sure, but still. (Let's say that English has a million words -- a low estimate, no matter how you define "word" -- then the binomial coefficient for combinations of any five elements is technically not infinite; however, natural languages never stop gaining new words.)

* an unambiguous sub-language of English/French/Chinese/whatever, which a machine can translate perfectly.

Comment: All previous efforts to accomplish this have failed, and failed resoundingly.

Others have mentioned the ubiquity of English as a second language. Even poor students of ESL do better at communicating to you than machine translation from Foreign into English most of the time. And they can usually understand your English better than any machine translation into their language. You can aid their comprehension by keeping it simple, no need to break it down into atoms.

The above is about quick, online communication. Huge quantities of text benefit from being translated professionally into the languages of target audiences and that's exactly what happens if someone is willing to foot the bill or to volunteer (cf. open-source community). Human translators employ a wide range of tools to aid in this task, from word processing software to research on and off the net, online dictionaries, "translation memories", mutual help forums, and more.

There will always be a niche for machine translation, to "get the gist" of what a foreign-language text says (but sometimes that fails and all you can hope for is to get an indication of the field -- petrochemicals, not law; engineering, not music). Incremental improvements will continue to be made.

2011-05-18 00:26:37 by Eugene:

Ah, for the want of an edit button!

I was wrong, of course, when I said that all efforts have failed. There is at least one sub-sub-field where this approach does work. The Canadian weather service has been known to employ machine translation reliably to translate its weather forecasts between French and English -- forecasts that are written using a tightly circumscribed set of locutions according to a rulebook.

However, just as a Roomba vacuum cleaner cannot replace a housekeeper, this limited counter-example does not invalidate my contention. Ever since Marvin Minsky started making grand promises in the 1950s, broadly useful machine translation has been said to be "just around the corner". Apparently the difference nowadays is that Google has stepped in as the sugar daddy in place of the government (i.e., the U.S. Department of Defense).

If artificial intelligence is a pipe dream (I saw that you have a blog post on that topic although I haven't had the time to read it yet) then so is machine translation (except in the niches where it is useful). The best hand-held translator still is a curvaceous Lithuanian blonde that you can take to bed with you :)