Sunday, October 17, 2010

Panda 1.7.1 and Speech Recognition with SAPI 5.4 on Win 7 x64



Thanks to my classmate Navi for showing this method to do speech recognition. I'd have had a tougher time had an introduction not been given!
I also evaluated PocketSphinx to do the same and ended up sharing minor bug fixes with the developer. But due to lack of documentation and support w.r.t. how to set up the code in the environment I was in, I had to ditch that approach after a few days of work :(

Before we proceed, readers may be interested in checking out PySpeech if Python 2.4 or 2.5 is being used.

Installation

  1. Install pywin32-214.win32-py2.6.exe into the Python that ships with Panda 1.7.1. Each Panda release bundles a different Python version, and the pywin32 installer has to match that version.
  2. Run the following command so that pywin32 generates the Python wrappers for the speech COM library (assuming Panda was installed to C: ). MakePy shows a list of type libraries; pick the Microsoft Speech Object Library.
    C:\Panda3D-1.7.1\python\python.exe C:\Panda3D-1.7.1\python\Lib\site-packages\win32com\client\makepy.py
  3. If the program does not work, the Microsoft SDK may be missing. Install it. YOU'LL NEED INTERNET FOR THIS AND IT'S A 600+ MB INSTALL. This should not be required if Visual Studio 2008 or newer is installed on the machine.
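A quick way to confirm that step 1 worked is to ask Panda's bundled interpreter whether it can import the two pywin32 modules that SpeechRecognition.py uses. The sketch below is only a sanity check, not part of the original setup; can_import is a helper name I made up for illustration, and it should be run with C:\Panda3D-1.7.1\python\python.exe rather than any other Python on the machine.

```python
def can_import(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # Modules that SpeechRecognition.py imports from pywin32.
    for name in ("win32com.client", "pythoncom"):
        status = "found" if can_import(name) else "MISSING"
        print("%s: %s" % (name, status))
```

If either module is reported as MISSING, the pywin32 installer most likely went into a different Python than the one Panda uses.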
SpeechRecognition.py
(The code is based on ActiveState's code, which in turn can be found at a lot of places on the internet, like here (who I *think* is the original author, Inigo Surguy), here, here and many other places I came across while making/customizing the solution.)
from win32com.client import constants
import win32com.client
import pythoncom
import sys
#import the panda modules to make a task to push speech values to code.
import direct.directbase.DirectStart    #for taskMgr
from direct.task import Task            #for Task.cont
from pandac.PandaModules import *

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""

"""Sample code for using the Microsoft Speech SDK 5.4 via COM in Python."""
class SpeechRecognition:
    """ Initialize the speech recognition with the passed in list of words """
    def __init__(self, wordsToAdd):
        print wordsToAdd
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener
        self.listener = win32com.client.Dispatch("SAPI.SpSharedRecognizer")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar, that is top level (so it begins
        # a recognition) and dynamic (ie we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                        constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary first time, but if we're changing it
        # dynamically then it's useful)
        self.wordsRule.Clear()
        # And go through the list of words, adding each to the rule
        for word in wordsToAdd:
            self.wordsRule.InitialState.AddWordTransition(None, word)
        # Commit the rule, then set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # And add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started using speech synthesis
        self.say("Welcome!")
        #Add the task that'll push recognized sounds every frame.
        taskMgr.add(self.pushMsgs, "pushMsgs")


    """Speak a word or phrase"""
    def say(self, phrase):
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    """The engine and audio input are inactive and no audio is being read,
    even if there are rules active. The audio device will be closed in this state.
    http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
    def stopListening(self):
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive is 1 and SRSActiveAlways is 2; both mean the engine is listening
        return self.listener.State in (constants.SRSActive, constants.SRSActiveAlways)


"""The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported. """
class ContextEvents(win32com.client.getevents("SAPI.SpSharedRecoContext")):
    """Called when a word/phrase is successfully recognized  -
        ie it is found in a currently open grammar with a sufficiently high
        confidence"""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ",newResult.PhraseInfo.GetText()
        #raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())
Using the class in Angela.py
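#Import the speech handler class
from SpeechRecognition import *
#DirectObject provides accept(); import it explicitly instead of relying on the star import
from direct.showbase.DirectObject import DirectObject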


WORDS_TO_RECOGNIZE = [ "One", "Two", "Three", "Four", "ego" ] 
"""After running this, then saying 'One', 'Two', 'Three', 'Four' or 'ego' should 
display 'Guest said One' etc on the console. When 'ego' is said, an additional 
line saying 'Angela caught the event' should also be displayed. The recognition 
can be a bit shaky at first until you've trained it (via the Speech entry in the 
Windows Control Panel)."""
class Angela(DirectObject):
    def __init__(self):
        #INITIALIZE SPEECH RECOGNITION
        self.speechReco = SpeechRecognition(WORDS_TO_RECOGNIZE)

        #Events accepted by the world
        self.accept('ego', self.event_message_ego)
        self.accept('s', self.event_keypress_s_toggleSpeechRecognition)

    def event_message_ego(self):
        print 'Angela caught the event.'

    def event_keypress_s_toggleSpeechRecognition(self):
        if self.speechReco.isListening():
            self.speechReco.stopListening()
        else:
            self.speechReco.startListening()

if __name__ == '__main__':
    spooky = Angela()
    run()

Explaining Angela.py

Once a word is recognized, SpeechRecognition.py raises an event named after that word, just as Panda does for a keypress (the example handles the 's' keypress the same way). Since 'ego' is one of the words, I handle the case when 'ego' is said exactly as I would handle a keystroke. Anything else that is said, if it is not a command, will be ignored by the speech engine system-wide (as we are using the library in shared mode).
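To make the "a spoken word is just another event" idea concrete, here is a tiny stand-in for the accept/send pattern that runs without Panda3D or SAPI. TinyMessenger is a hypothetical class written for this illustration only; it is a simplification and not how Panda3D's messenger is implemented.

```python
class TinyMessenger(object):
    """Hypothetical stand-in for Panda3D's messenger, for illustration only."""

    def __init__(self):
        self.handlers = {}

    def accept(self, event_name, handler):
        # Remember which function handles which event name.
        self.handlers[event_name] = handler

    def send(self, event_name):
        # Call the handler if one is registered; otherwise ignore the event.
        handler = self.handlers.get(event_name)
        if handler is None:
            return False
        handler()
        return True


caught = []
tiny = TinyMessenger()
tiny.accept('ego', lambda: caught.append('ego'))   # a recognized word
tiny.accept('s', lambda: caught.append('s'))       # a keypress

tiny.send('ego')    # what OnRecognition does after hearing "ego"
tiny.send('s')      # what happens when the 's' key is pressed
tiny.send('Three')  # in the word list, but nobody accepted it: ignored
print(caught)
```

Only 'ego' and 's' end up in caught. 'Three' is recognized and printed by SpeechRecognition.py, but since nothing accepts that event name, no handler runs.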

Adding more features

You can add more functionality to SpeechRecognition.py by exploring sapi.dll (%systemroot%\System32\Speech\Common\sapi.dll) in any COM browser. Here is how to view it in the COM browser that comes with the Microsoft SDK.
  1. Install Microsoft SDK if not already installed.
  2. Fire up "C:\Program Files\Microsoft SDKs\Windows\v6.0A\Bin\OleView.Exe" OR Start Menu > Programs > Microsoft Windows SDK v6.0A > Tools > OLE-COM Object Viewer
  3. Select File > View TypeLib
  4. Select %systemroot%\System32\Speech\Common\sapi.dll in it.
  5. Click View > Group By Kind to see the code organized in a meaningful way.


    Note:
    • When a class implements multiple interfaces, as SAPI.SpSharedRecognizer does, the methods of the interface with the red icon are the ones used, as that interface is marked as the default for the class.
    • When you select a member of the class and its description (in the right pane) says propget, it can only appear on the right-hand side of an = sign. This means you can read the value but never assign anything to it. You can call as a method, or assign a value to, a member that has propput in its description.
    • Everything defined as a constant, including enum members, is available as constants.__getattr__("constantName").
      e.g.: SpeechRecognizerState is
      enum SpeechRecognizerState {SRSInactive, SRSActive, SRSActiveAlways, SRSInactiveWithPurge}
      
      We can use its constant as
      constants.__getattr__("SRSInactiveWithPurge")
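As a worked illustration of the enum above: assuming its members are numbered in declaration order starting at 0, the four states come out as 0 to 3, which matches the 1 and 2 that isListening() checks for. The sketch below mirrors that check in plain Python so it can be tried without SAPI. The names STATE_NAMES, SPEECH_RECOGNIZER_STATE and is_listening are mine, not part of win32com; in real code the values should still be read from constants.

```python
# Members of SpeechRecognizerState in declaration order. Assumption: an enum
# declared without explicit values counts up from 0.
STATE_NAMES = ["SRSInactive", "SRSActive", "SRSActiveAlways", "SRSInactiveWithPurge"]
SPEECH_RECOGNIZER_STATE = dict((name, value) for value, name in enumerate(STATE_NAMES))


def is_listening(state):
    """Mirror of SpeechRecognition.isListening() for a raw state value."""
    return state in (SPEECH_RECOGNIZER_STATE["SRSActive"],
                     SPEECH_RECOGNIZER_STATE["SRSActiveAlways"])


for name in STATE_NAMES:
    value = SPEECH_RECOGNIZER_STATE[name]
    print("%s = %d, listening: %s" % (name, value, is_listening(value)))
```

Writing the check against names instead of bare numbers makes it easier to see why both SRSActive and SRSActiveAlways count as "listening".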

EDIT (Oct 21 2010):

Avoiding Windows commands: SpInProcRecognizer

The following code will detect only the words you have defined in your vocabulary, i.e. the wordsToAdd list. This will avoid random things happening when the speech engine recognizes Windows commands like 'close', 'escape', 'switch' etc.
Thinking aloud, it would be a good idea to have a bigger vocabulary, i.e. the engine's default list of words, associated with your program, and to fire off/handle events only when the word you want has been detected. This will take care of the ambient noise problem to a great extent. If you don't do this and keep only a small list of words as the vocabulary, you may end up getting many wrong matches, as the engine will try to match any sound you utter against the list it has. One way around this would be to monitor the confidence level of the match, but I don't know if SAPI will let me know that. Sphinx does.
Well, while writing this post, I went out looking for a good default word list that I could load into the program. After some searching, I found Grady Ward's Moby word lists. I used singles.txt from that archive and ended up adding 177,470+ words. The speech engine took a good few minutes to register these words, and the sensitivity went out of control. So I discarded that list, picked up commons.txt, replaced all capital letters with their lowercase form, and removed all lines containing ' ', ',', '!', '\', '/', '&' or '-', as they were causing the engine to throw an exception. We want to detect only single words, and the list should contain only them, but in a manageable number. So now I am working with 17,700+ words, and the list takes around 10 seconds to load on a quad-core CPU with an NVIDIA GT330.
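The clean-up described above can be sketched as a small function. This is a reconstruction of the idea rather than the exact script I ran: the name clean_word_list is made up for the example, the refused characters are the ones listed in the paragraph above, and dropping duplicates is an extra step I am assuming is wanted, since lowercasing can turn two different entries into the same word.

```python
REFUSED_CHARACTERS = [' ', ',', '!', '\\', '/', '&', '-']


def clean_word_list(lines, refused=REFUSED_CHARACTERS):
    """Return lowercase single words, skipping blanks, duplicates and
    any entry that contains a refused character."""
    cleaned = []
    seen = set()
    for line in lines:
        word = line.strip().lower()
        if not word or word in seen:
            continue
        if any(symbol in word for symbol in refused):
            continue
        seen.add(word)
        cleaned.append(word)
    return cleaned


sample = ["Apple\n", "ice cream\n", "well-known\n", "apple\n", "\n", "Zebra\n"]
print(clean_word_list(sample))
```

For the sample above this keeps only 'apple' and 'zebra'. To clean a real file, pass the open file object as lines, since iterating over a file yields one line at a time.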
Refining code:

Speech recognition has its own challenges. I strongly suggest testing in an environment that resembles your presentation space. 99% of the time, the engine will not give you the word you want but a similar-sounding one. You want to collect all these similar-sounding words and fire the same event when any of them is detected. Also, different word lists will give different results. For example, for the word 'ego', I have added the following words and will test more with the live performer before the final show.
self.accept('ego', self.event_message_ego)
self.accept('ito', self.event_message_ego)
self.accept('ido', self.event_message_ego)
self.accept('edile', self.event_message_ego)
self.accept('beetle', self.event_message_ego)
self.accept('beadle', self.event_message_ego)
self.accept('kiel', self.event_message_ego)
self.accept('yell', self.event_message_ego)
self.accept('gold', self.event_message_ego)
self.accept('told', self.event_message_ego)
self.accept('toll', self.event_message_ego)
self.accept('gaul', self.event_message_ego)
self.accept('whole', self.event_message_ego)
These have been added after speaking softly and not-so-softly into the mic and then writing what the speech engine thought of my speech.
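Since every one of those lines registers the same handler, the list of sound-alikes can live in one place and be registered in a loop. The sketch below shows the idea with a stand-in for self.accept so that it runs without Panda3D; EGO_SOUNDALIKES, accept_all and fake_accept are names I made up for the example.

```python
EGO_SOUNDALIKES = ['ego', 'ito', 'ido', 'edile', 'beetle', 'beadle', 'kiel',
                   'yell', 'gold', 'told', 'toll', 'gaul', 'whole']


def accept_all(accept, words, handler):
    """Register handler for every word using the given accept function."""
    for word in words:
        accept(word, handler)


# Stand-in for self.accept so the sketch runs without Panda3D.
registered = {}


def fake_accept(event_name, handler):
    registered[event_name] = handler


def event_message_ego():
    return 'Angela caught the event.'


accept_all(fake_accept, EGO_SOUNDALIKES, event_message_ego)
print(len(registered))
```

Inside Angela.__init__ the call would be accept_all(self.accept, EGO_SOUNDALIKES, self.event_message_ego), and adding a newly discovered sound-alike becomes a one-word change to the list.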
Caveats: The code will throw an exception if Windows cannot detect a mic connected to the computer.

from win32com.client import constants
import win32com.client
import pythoncom
import sys
#import the panda modules to make a task to push speech values to code.
from pandac.PandaModules import *
#loadPrcFileData("", "want-directtools #t")
#loadPrcFileData("", "want-tk #t")
import direct.directbase.DirectStart    #for taskMgr
from direct.task import Task            #for Task.cont

"""SAPI 5.4 docs: http://msdn.microsoft.com/en-us/library/ee125077%28v=VS.85%29.aspx"""

"""Sample code for using the Microsoft Speech SDK 5.1 via COM in Python.
    Requires that the SDK be installed; it's a free download from
            http://microsoft.com/speech
    and that MakePy has been used on it (in PythonWin,
    select Tools | COM MakePy Utility | Microsoft Speech Object Library 5.1).

    After running this, then saying "One", "Two", "Three" or "Four" should
    display "You said One" etc on the console. The recognition can be a bit
    shaky at first until you've trained it (via the Speech entry in the Windows
    Control Panel)."""
class SpeechRecognition:
    """ Initialize the speech recognition with the passed in list of words """
    def __init__(self, wordsToAdd = None):
        # For text-to-speech
        self.speaker = win32com.client.Dispatch("SAPI.SpVoice")
        # For speech recognition - first create a listener
        self.listener = win32com.client.Dispatch("SAPI.SpInProcRecognizer")
        #Set the mic (as recognized by Windows MultiMedia layer) to speech recognition engine.
        self.listener.AudioInputStream =  win32com.client.Dispatch("SAPI.SpMMAudioIn")
        # Then a recognition context
        self.context = self.listener.CreateRecoContext()
        # which has an associated grammar
        self.grammar = self.context.CreateGrammar()
        # Do not allow free word recognition - only command and control
        # recognizing the words in the grammar only
        self.grammar.DictationSetState(0)
        # Create a new rule for the grammar, that is top level (so it begins
        # a recognition) and dynamic (ie we can change it at runtime)
        self.wordsRule = self.grammar.Rules.Add("wordsRule",
                        constants.SRATopLevel + constants.SRADynamic, 0)
        # Clear the rule (not necessary first time, but if we're changing it
        # dynamically then it's useful)
        self.wordsRule.Clear()
        # And go through the list of words, adding each to the rule
        print '\nSpeechRecognition.py: Starting to add words to be recognized.'
        numWordsAdded = 0
        #BEGIN DIRTY CODE TO GET USABLE FILE FROM COMMON.TXT
        #refuse = [' ', '.', '-', '/', '!', '&']

        #writeOut = []
        #wordList = open("COMMON.TXT", "r")
        #for line in wordList:    #iterate directly; calling readline() twice per loop skips every other line
            #add = True
            #word = line.strip()
            #for sym in refuse:
                #if sym in word:
                    #add = False

            #if add:
                #writeOut.append(word)

        #newC = open("commonWords.txt", "w")
        #for w in writeOut:
            #newC.write(w+'\n')
        #newC.close()
        #wordList.close()

        #sys.exit()
        #END DIRTY CODE TO GET USABLE FILE FROM COMMON.TXT
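        if wordsToAdd is None:
            wordList = open("commonWords.txt", "r")
            # Iterate over the file directly: calling readline() both in the
            # loop test and in the loop body would skip every other word
            for line in wordList:
                word = line.strip()
                if word:    # ignore blank lines
                    self.wordsRule.InitialState.AddWordTransition(None, word)
                    numWordsAdded += 1
            wordList.close()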
        else:
            for word in wordsToAdd:
                self.wordsRule.InitialState.AddWordTransition(None, word)
                numWordsAdded += 1
        print 'SpeechRecognition.py: ', numWordsAdded, ' words added.'

        # Set the wordsRule to be active
        self.grammar.Rules.Commit()
        self.grammar.CmdSetRuleState("wordsRule", 1)
        # Commit the changes to the grammar
        self.grammar.Rules.Commit()
        # And add an event handler that's called back when recognition occurs
        self.eventHandler = ContextEvents(self.context)
        # Announce we've started using speech synthesis
        #self.say("Welcome!")
        #Add the task that'll push recognized sounds every frame.
        taskMgr.add(self.pushMsgs, "pushMsgs")
        print 'SpeechRecognition.py: Done setting up speech recognition.\n'

    """Speak a word or phrase"""
    def say(self, phrase):
        self.speaker.Speak(phrase)

    def pushMsgs(self, task):
        pythoncom.PumpWaitingMessages()
        return Task.cont

    """The engine and audio input are inactive and no audio is being read,
    even if there are rules active. The audio device will be closed in this state.
    http://msdn.microsoft.com/en-us/library/ee431860%28v=VS.85%29.aspx"""
    def stopListening(self):
        self.listener.State = constants.__getattr__("SRSInactiveWithPurge")

    def startListening(self):
        self.listener.State = constants.__getattr__("SRSActive")

    def isListening(self):
        # SRSActive is 1 and SRSActiveAlways is 2; both mean the engine is listening
        return self.listener.State in (constants.SRSActive, constants.SRSActiveAlways)


"""The callback class that handles the events raised by the speech object.
    See "Automation | SpSharedRecoContext (Events)" in the MS Speech SDK
    online help for documentation of the other events supported. """
class ContextEvents(win32com.client.getevents("SAPI.SpInProcRecoContext")):
    """Called when a word/phrase is successfully recognized  -
        ie it is found in a currently open grammar with a sufficiently high
        confidence"""
    def OnRecognition(self, StreamNumber, StreamPosition, RecognitionType, Result):
        newResult = win32com.client.Dispatch(Result)
        print "Guest said: ",newResult.PhraseInfo.GetText()
        #raise an event with the said word
        messenger.send(newResult.PhraseInfo.GetText())


3 comments:

KAV2 said...

"code based on ActiveState's code.. which in turn can be found at a lot of places on the internet"

You should provide a link to a reference even if it "can be found at a lot of places on the internet."

Also, be more careful with your wording. Your code really just adds three little things to another person's code. All you did was add an accessor function and two mutator functions to the original class (yes this does make it easier to change the state, but it could have been done without them), sent a message, and moved the pumpwaitingmessages call to a task from a while loop (a very trivial thing to do).

Be sure to give credit where credit is due. A simple link to the reference (assuming it is a site) would be more than sufficient and it would act as a way of showing your appreciation to the person/people who provided the reference material.

Dev Ghai said...

It is a trivial thing to do when one knows the COM interface and how the pywin module accesses it. The problem is I knew neither: I first had to learn what COM is, and then discover the complete COM interface using the MS SDK's COM/OLE viewer. The COM viewer (combrowser.py) that comes with pywin doesn't help a lot. In addition, the documentation at MSDN did not help me much, as I could not find it mentioning that the ISpeechRecognizer interface takes precedence over all the other interfaces mentioned there. I spent the whole trying to stop the engine from listening and kept getting the same error. :(

This solution is specially tailored for the Panda3D engine and is provided to help students at my school focus on the task at hand instead of spending time connecting the dots.

Thanks for your criticism. References have been appropriately added. I hope you found this post useful in some way.

Dev Ghai said...

whole night*

[Me and my stupid habit to miss writing words :(]