Google Voice Search


Honta Sergiy



The explosion of mobile devices in recent years, especially web-enabled smartphones, has created new user expectations and needs. One of them is ubiquitous availability: users increasingly expect constant access to the information and services of the web. A goal at Google is to make spoken access ubiquitously available. We would like to let the user choose; they should be able to take it for granted that spoken interaction is always an option. Achieving ubiquity requires two things: availability and performance. Voice search is a step toward our long-term vision of ubiquitous access.

Two themes underlie much of the technical approach we are taking: delivery from the cloud and operating at large scale. Delivery of services from the cloud offers a number of advantages when developing new services and new technologies. In general, research and development at Google is conducted «in-vivo», as part of actual services. Operating at scale matters because mobile voice search is a challenging problem for many reasons: vocabularies are huge, input is unpredictable, and noise conditions vary tremendously across the wide range of mobile usage scenarios. Many of the techniques and research directions described below focus on building models that can take advantage of huge amounts of data and of the compute power needed to train and run them.

In addition to the cloud and our ability to operate at large scale, we also take advantage of other recent technological advances. The maturing of powerful search engines provides a very effective way to give users what they want if we can recognize the words of a query. Searching for information by voice has been part of our everyday lives since long before the internet became prevalent. This basic dialog has been ingrained in our minds since long before interactive voice response (IVR) systems replaced all or part of the live operator interaction.

In March 2008 Google introduced its first multi-modal speech application, for Google Maps for Mobile (GMM). Because results appear on screen, the contact information, address, and any other meta information about the business can easily be displayed. A second major advantage relates to the time it takes the user to both search for and digest information. Mobile web search is a rapidly growing area of interest. Internet-enabled smartphones account for an increasing share of the mobile devices sold throughout the world, and users are increasingly turning to their mobile devices for web searches. Although mobile device usability has improved, typing search queries can still be cumbersome, error-prone, and even dangerous in some usage scenarios.

In November 2008 Google introduced Google Mobile App (GMA) for iPhone (Figure 4), which included a search by voice feature. The goal of Google Search by Voice is to recognize any spoken search query. Choosing appropriate metrics to track the quality of the system is critical to success. The word error rate measures misrecognitions at the word level: it compares the words output by the recognizer to those the user really spoke. For Google Search by Voice, individual word errors do not necessarily affect the search results shown. We therefore also track the semantic quality of the recognizer (WebScore), which measures how often the search result for the recognition hypothesis differs from the search result for a human transcription. Perplexity is, crudely speaking, a measure of the size of the set of words that can be recognized next, given the previously recognized words in the query. The out-of-vocabulary rate tracks the percentage of words spoken by the user that are not modeled by our language model. Latency is defined as the total time (in seconds) it takes to complete a search request by voice.

Acoustic models provide an estimate of the likelihood of the observed features in a frame of speech given a particular phonetic context. The basic form of the acoustic models used is common in the literature. Potential bugs in experiments make it hard to learn from negative results in speech recognition: when some technique doesn't improve things, there is always the question of whether the implementation was wrong. The growing user base of voice search, together with Google's computational infrastructure, provides a great opportunity to scale our acoustic models.

Developers use written queries to bootstrap the language model for Google Voice Search. Written queries contain a fair number of cases which require special attention to convert to spoken form.

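
The word error rate, out-of-vocabulary rate, and WebScore described above can be illustrated with a minimal Python sketch. This is not Google's implementation: the function names are hypothetical, the word error rate uses the standard word-level edit distance, and `web_score` assumes that only the top search result is compared and that the caller supplies a `top_result` function standing in for a real search engine.

```python
from typing import Callable, Hashable, Sequence, Set, Tuple


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by the number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current row) and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


def oov_rate(spoken_words: Sequence[str], vocabulary: Set[str]) -> float:
    """Fraction of spoken words that the language model vocabulary lacks."""
    if not spoken_words:
        raise ValueError("spoken_words must not be empty")
    missing = sum(1 for word in spoken_words if word not in vocabulary)
    return missing / len(spoken_words)


def web_score(
    pairs: Sequence[Tuple[str, str]],
    top_result: Callable[[str], Hashable],
) -> float:
    """Fraction of (transcript, hypothesis) pairs with the same top result.

    Assumption: `top_result` is a stand-in for a search backend; comparing
    only the top result is a simplification chosen for this sketch.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    matches = sum(
        1 for transcript, hypothesis in pairs
        if top_result(transcript) == top_result(hypothesis)
    )
    return matches / len(pairs)


if __name__ == "__main__":
    spoken = "pictures of the golden gate bridge".split()
    recognized = "picture of golden gate bridge".split()
    # One substitution plus one deletion over six reference words.
    print(word_error_rate(spoken, recognized))
```

The example shows why the two quality measures can diverge: the hypothesis above has a word error rate of one third, yet a search engine may well return the same result for both queries, in which case the pair would still count as a match for `web_score`.
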
«Multi-modal» features, like Google Voice Search, provide a highly flexible and data-rich alternative to the voice-only telephone applications that preceded them. After all, they take advantage of the best aspects of both speech and graphical modalities. Another general advantage of mobile voice search is the flexibility and control it accords users.

Capturing clear and complete user utterances is of paramount importance to any speech application. While most mobile speech apps require the user to press a button to initiate recording, only some require the user to manually stop the recording after speaking, either by pressing the same button again or by pressing another one. While endpointers may be convenient for mobile speech, they seem to be better suited to applications like web search or voice commands, in which the input is shorter, generally one phrase. Another way to endpoint manually is to press and hold the button while speaking. This is based on the «walkie talkie» model. If the button is released too early, the resulting premature endpointing can cause misrecognition.

Putting buttons aside for the moment, gesture-based triggers for initiating speech are another strategy, one which has been implemented in the iPhone version of GMA. Speech recognition isn't perfect, and designing speech-based applications requires paying special attention to these inevitable errors. When an error occurs, making sure the n-best list is easily accessible saves the user the frustration of having to re-speak the utterance and fosters a more positive impression of the application in general. There are several reasons for this.




