“Keyboard? How Quaint!”…well, maybe not?

I have several post ideas queued up waiting for my brain to reorganize itself after several very stressful weeks, but today I came across an article thanks to Slashdot that I feel a need to comment upon.

The article is entitled “Rest in Peas: The Unrecognized Death of Speech Recognition”, by a fellow named Robert Fortner. I don’t know who Mr Fortner is, honestly, but that doesn’t matter to me much. I mean, not a lot of people know who I am, either, and that doesn’t keep me from venturing opinions I hope people will find interesting :-)

ANYWAY, the gist of Mr Fortner’s article is that, bluntly, speech recognition is a failed technology, and possibly an impractical one for the foreseeable future. Despite 40 years or so of research, despite Google releasing a corpus of a trillion words to feed to recognition engines, speech recognition accuracy has more or less topped out at 80%, and stalled there for over a decade. No really serious research seems to be ongoing either into existing approaches or into completely new ones. The longstanding belief that, if we could teach computers language, that would lead to AI, is being turned on its head, with some people now certain that, without true AI, computers will never really understand language.

It’s a disappointing conclusion, but one that I can’t really disagree with. The truth is that reliable speech input is an active part of our cultural imagination in large part because it’s been so heavily used as a dramatic device in science fiction television and movies. It is much more viscerally satisfying to have characters tell the computer what they want it to do than to have them type at a keyboard. It’s also easier for the audience to figure out what a computer is doing on its user’s behalf if it says so than if the camera has to focus on a screen. This was especially true in the pre-HDTV days (in other words, until about last year), when most people had TV sets that only displayed 480i or 576i resolution and would render any meaningful quantity of text unreadable.

The original Star Trek pilot, “The Cage”, gets around this by having an actual printer on the bridge and having someone read off what the printout says. The rest of Classic Trek uses a mixture of having Spock relay computer findings and actually having the computer talk (and be talked to), and this is more or less the beginning of speech recognition in the common imagination. This function is parodied in Galaxy Quest, where the computer can be spoken to, but only by its operator.

Other shows have also relied on this device. Zen and Orac in Blake’s 7, for example, are really characters in their own right. Terry Nation could have written the show such that the crew figured out the ship by somehow hacking into and reading the ship’s library or some such thing, but it was much more dramatic to have them verbally convince the ship’s central computer to work with them.

Anyway, given that many other Star Trek technologies have been realized or even exceeded by our modern era, it seems frustrating that speech recognition should remain so far away. But I have to concur with researchers who believe that, without AI, there’s no way speech recognition will ever really work. Natural language is more than just stringing words together in a prescribed order. It’s a highly contextual thing, more so in some languages than others. Add in things such as accent and regional variation and you merely increase the contextual complexity. Think about how hard it is to converse over the phone with someone whose accent you’re not familiar with (and I’m not just singling out Mumbai call-center workers, although certainly that’s going to come to mind almost immediately)!

In the end, what allows you to have that conversation at all is the fact that the human brain is an incredible engine for pattern-matching and contextualization. This ability is what enables me, for example, to have a conversation with someone from Boston and know, based on what conversation we’re having, whether, when he says /spahk/, he means a bridging of electrons between two points, a brand of RISC CPUs made by Sun Microsystems, the NCC-1701’s science officer, or the famed child development expert.

The problem is even worse in a language like Mandarin. First of all, even the written language is highly context-dependent and idiomatic. Second of all, the spoken language is rife with homophones, even if you distinguish the different tones correctly. And if you don’t, the spoken language is pretty much impossible; the wrong tone can turn an innocent phrase into a vile epithet whose meaning is not remotely related to what you meant to say!

So, in the end, I find myself in almost perfect agreement with the article and its underpinnings. True AI seems to be decades if not centuries away, at this point, and without true AI that is capable of reasonably emulating human thought processes, I don’t see any way we can expect truly natural, accurate speech recognition. Human comprehension of speech simply depends too strongly on how our brains work!

4 Comments

  • Sally wrote:

    I think that we may need to think a bit like the computer in order for it to do what we want, or phrase things with the computer in mind. Just in dealing with databases, I find myself needing to anticipate what form a particular Chief Complaint or Medication needs to be asked for. I know this has to do with who programs it, but that’s true in any case.

    Perhaps the verbal recognition dictation is a start for dealing with the wording piece. They have accent packages for that. There is a particular cardiologist, who I never thought would be able to dictate to a computer, and yet he is.

  • Charissa wrote:

    Personally, I think the biggest barrier to computers & speech relates more to culture than anything else. Maybe that’s just context as you noted above, but a word is not just a word. Words grew out of the culture they were created in (or vice versa; I’m not an anthropologist, so maybe this is a chicken-and-egg problem or maybe people more knowledgeable than myself know this already) and to take them out of that culture creates issues. Computers don’t really get those subtle elements, and computers today need to be more “multicultural” than they’re capable of being, and so, until you can program cultural subtleties into a computer, voice recognition is never going to be as great as it seems like it should be.

  • Berwyn wrote:

    Because typing is difficult for me, I tried VR for a bit. Was amused that it couldn’t understand simple words like ‘blue’ (glue? flew?), but had no problem at all with ‘Jararvellir’.

  • Blaise Pascal wrote:

    Sally: I think that we may need to think a bit like the computer.

    We’ve been down that road with handwriting recognition. The earliest consumer-level handwriting recognition devices tried to recognise human handwriting and mostly failed. Handwriting recognition didn’t take off in consumer devices until Palm introduced Graffiti, in which the humans had to learn how to “think a bit like the computer” and write in a prescribed manner.

    However, where is handwriting recognition now? It’s been 9 Moore-doublings since Graffiti was introduced. Yet a 512-fold increase in processing power has not resulted in a corresponding increase in acceptance of handwriting recognition. I’d go as far as to say that handwriting recognition, Graffiti and all, has been rejected by the market.

    I suspect there are parallels to the voice recognition issue as well.
