The O-COCOSDA 2012 Organizing committee is pleased to announce the following distinguished keynote speakers to give plenary talks at the conference:

  • Prof. Isabel Trancoso, Speech Technologies Applied to eHealth and eLearning
  • Prof. Aijun Li, Spoken Language Resources at CASS: Challenges and New Orientation
  • Prof. Mark Liberman, Towards Automatic Phonetic Analysis of Unrestricted Text


Speech Technologies Applied to eHealth and eLearning
(Isabel Trancoso, Alberto Abad, Thomas Pellegrini)


Prof. Isabel Trancoso
IEEE Fellow
Instituto Superior Técnico / INESC-ID Lisbon, Portugal

Isabel Trancoso received the Licenciado, Mestre, Doutor and Agregado degrees in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, Portugal, in 1979, 1984, 1987 and 2002, respectively. She has been a lecturer at this University since 1979, having coordinated the EEC course for 6 years. She is currently a Full Professor, teaching speech processing courses. She is also a senior researcher at INESC ID Lisbon, having launched the speech processing group, now restructured as L2F, in 1990. Her first research topic was medium-to-low bit rate speech coding. From October 1984 through June 1985, she worked on this topic at AT&T Bell Laboratories, Murray Hill, New Jersey. Her current scope is much broader, encompassing many areas in speech recognition and synthesis, with a special emphasis on tools and resources for the Portuguese language. She was a member of the ISCA (International Speech Communication Association) Board (1993-1998), the IEEE Speech Technical Committee (since 1999) and the Permanent Council for the Organization of the International Conferences on Spoken Language Processing (since 1998). She was elected Editor in Chief of the IEEE Transactions on Speech and Audio Processing (2003-2005), Member-at-Large of the IEEE Signal Processing Society Board of Governors (2006-2008), Vice-President of ISCA (2005-2007) and President of ISCA (2007-2011). She chaired the Organizing Committee of the INTERSPEECH'2005 Conference that took place in September 2005, in Lisbon. She received the 2009 IEEE Signal Processing Society Meritorious Service Award, and was elevated to IEEE Fellow in 2011.

Spoken language technologies have reached enough maturity to be integrated into many applications in eHealth and eLearning. The challenges and the potential are enormous. The same claim could be made for many other areas, but these two share many technical issues and, of course, a huge significance from a social point of view. This was the driving force for our recent efforts at the Spoken Language Systems Lab of INESC-ID in eHealth and eLearning. This talk gives an overview of these efforts; although they will be demonstrated for the Portuguese language, it will also emphasize how easily they can be extended to new languages.
Our most recent eHealth project focuses on aphasia patients. The Virtual Therapist platform (Vithea) has two key features: personalization and modularity. The first takes into account the importance of being able to create new exercises for the patients, adapted to their hobbies or their favorite memories, thus adding to their motivation for completing the exercises. The second will hopefully facilitate the adaptation of the platform to other diseases such as Alzheimer’s or Parkinson’s. The main speech module is currently based on keyword spotting, but by integrating other speech analysis modules, many different therapy / diagnosis tools can be developed, targeting pathologies such as dysarthria, sigmatism, cleft lip and palate, removed larynx, cancer of the oral cavity, etc.
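As an illustration, the keyword-spotting check at the core of such a word-naming exercise can be sketched as follows. This is a minimal sketch, not the Vithea implementation: the hypothesis format (word/confidence pairs) and the confidence threshold are assumptions.

```python
# Hypothetical sketch of keyword spotting for a therapy exercise:
# the system accepts the patient's answer if the expected keyword
# appears in the recogniser's hypothesis with enough confidence.

def answer_is_correct(expected, hypothesis, threshold=0.6):
    """hypothesis: list of (word, confidence) pairs from an ASR decoder."""
    return any(word == expected and conf >= threshold
               for word, conf in hypothesis)

hyp = [("the", 0.9), ("cat", 0.8)]    # invented decoder output
print(answer_is_correct("cat", hyp))  # True
print(answer_is_correct("dog", hyp))  # False
```

Swapping this module for, say, a pronunciation-scoring or fluency module is what the platform's modularity is meant to make easy.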
Although the talk focuses mostly on therapy tools, speech and language technologies are also of paramount importance to the areas of active ageing and independent living. Here, our most recent work is part of the DIRHA European project (Distant speech interaction for robust home applications), which aims at integrating speech technologies into an automated home equipped with digital microphone arrays, evaluated by motor-impaired end-users.
In terms of eLearning, and in particular of CALL (Computer Assisted Language Learning), our efforts started with the development of a Portuguese version of a tutoring system from Carnegie Mellon University focused on vocabulary learning. In the baseline version, students can learn from real texts selected from an open corpus such as the Web, on topics for which they previously marked their preference. Although the REAP (Reading Practice) platform provided the framework for several interesting theses, dealing with readability measures, generation of distractors for cloze questions, etc., the work rapidly extended beyond the original goal of vocabulary learning.
One of the two main directions was the area of serious games. Our continuously growing set of games targets quite different goals such as learning grammar, practicing vocabulary, or improving its perception, just to name a few. Practically every NLP or speech technology module available at our lab has found an application in these games, from statistical machine translation to speech synthesis and recognition, with 3D technologies integrated as well to make the games more appealing. Sometimes the games may even use side information provided by these modules: for instance, the confusion matrix of a speech recognition module may be used to generate distractors in a listening comprehension game.
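The confusion-matrix idea can be sketched in a few lines: words that the recogniser frequently confuses with the target make plausible "sounds-like" distractors. The confusion counts below are invented for illustration, not taken from any real recogniser.

```python
# Hypothetical sketch: picking listening-comprehension distractors
# from an ASR confusion matrix (rows: spoken word, columns: recognised word).

confusions = {
    "ship": {"ship": 90, "sheep": 7, "chip": 3},
    "sheep": {"sheep": 88, "ship": 9, "cheap": 3},
}

def distractors(target, matrix, n=2):
    """Return the n words most often confused with `target`."""
    row = matrix[target]
    candidates = [(count, word) for word, count in row.items() if word != target]
    candidates.sort(reverse=True)  # most frequent confusions first
    return [word for count, word in candidates[:n]]

print(distractors("ship", confusions))  # ['sheep', 'chip']
```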
The other direction was a multimedia version of REAP. Students may now learn vocabulary from documents beyond text, such as automatically aligned audiobooks or automatically recognized TV documentaries. In fact, our Daily REAP version is updated every day to let students learn from the written or broadcast news of the last 7 days, on the topics they choose. This version uses all the different technologies integrated in our long broadcast news processing chain, starting with audio segmentation and speech recognition (marking the words recognized with lower confidence), and including capitalization, punctuation, story segmentation, and topic indexation as well.
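The shape of such a chain, where each module transforms the output of the previous one and low-confidence words are flagged for the learner, can be sketched as below. All stage implementations are hypothetical stubs, not the actual L2F system; only the staged structure and the confidence-marking step reflect the description above.

```python
# Illustrative sketch of a staged broadcast-news processing chain.
# Every stage here is a placeholder standing in for a real module.

def segment_audio(audio):
    # Would split the recording into speech segments; stub keeps it whole.
    return [audio]

def recognise(segment):
    # ASR stub: would return recognised words with confidence scores.
    return [("breaking", 0.95), ("news", 0.91), ("tonight", 0.42)]

def mark_low_confidence(words, threshold=0.5):
    # Flag words recognised with low confidence, as Daily REAP does.
    return [(w if c >= threshold else f"*{w}*") for w, c in words]

def process(audio):
    tokens = []
    for seg in segment_audio(audio):
        tokens.extend(mark_low_confidence(recognise(seg)))
    # Capitalization, punctuation, story segmentation and topic
    # indexation would follow as further stages.
    return " ".join(tokens)

print(process("evening_news.wav"))  # breaking news *tonight*
```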
ELearning, however, is far from restricted to language learning. Tutoring systems are now being developed for many curricular activities, and the existence of online video courses such as the Khan Academy opens fascinating possibilities of using, for instance, question answering in these courses. Soft skills (such as presentation skills) can also be the target of eLearning tools, and here too speech and language technologies may play a dominant role.
In conclusion, I hope that the examples that will be shown in this talk will contribute to illustrate my take home message: we have hardly begun to explore the full potential of speech and language technologies in these two fields.

Spoken Language Resources at CASS: Challenges and New Orientation
(Aijun Li)


Prof. Aijun Li
Professor and Director, Laboratory of Phonetics and Speech Science, Institute of Linguistics, Chinese Academy of Social Sciences

Aijun Li is Professor and Director of the Laboratory of Phonetics and Speech Science, Institute of Linguistics, Chinese Academy of Social Sciences. Her academic interests include speech corpus collection and annotation, first and second language acquisition (especially its phonetic aspects), and emotional and expressive speech production and perception. She serves as Vice Chair of the SIG-CSLP, Associate Editor of the Chinese Journal of Phonetics, and Vice President of the Phonetics Association of China.

Speech corpora are a kind of infrastructure indispensable to the analysis and processing of spoken language. This talk will first introduce a variety of spoken language resources collected in the traditional way, known as ‘field collection’, by the Laboratory of Phonetics and Speech Science, Institute of Linguistics, CASS. Secondly, the talk will point out the challenges that the appearance of ‘network speech’ poses for traditional linguistic and phonetic theories. Nowadays, with the development of web-based applications and technologies, people communicate in a virtual and instant way. This web-based communication leads to the emergence of huge amounts of ‘virtual spontaneous speech’ produced by millions of internet users exchanging instant messages. This new variety of speech calls both for more compatible theories and for a new, web-based way to collect speech data, in the spirit of the ‘Wizard of Oz’: the only ‘power’ researchers need is to own a user-friendly data-sharing platform attractive enough for netizens to visit, through which they can simply collect what they want rather than conducting the data recording face-to-face.

Towards Automatic Phonetic Analysis of Unrestricted Text
(Mark Liberman)


Prof. Mark Liberman
University of Pennsylvania

Mark Liberman worked at AT&T Bell Laboratories from 1975 until 1990, when he moved to the University of Pennsylvania, where he is Trustee Professor of Phonetics, Director of the Linguistic Data Consortium, and Faculty Director of College Houses and Academic Services. He was a co-founder of the popular linguistics blog Language Log. His current research is focused on the scientific application of techniques from speech and language engineering, and on the phonetics and phonology of tone and intonation.

For a century and a half, phoneticians, sociolinguists, and psycholinguists have tested theories through careful study of how manipulations of selected factors (e.g., phonological or phrasal context, regional or ethnic varieties, formality, word identity, word frequency, speaking rate, vocal effort) affect dependent variables such as vowel quality, voice onset time, segment duration, “g-dropping”, “t/d deletion”, and others.  This work relies heavily on manual phonetic classification and measurement of the large number of tokens needed to provide sufficient statistical power for the intended analysis. This raises a host of problems that include human error, experimenter bias, and very high costs in human annotation time.
Modern machine-learning methods now make it possible to automate the selection, classification, and measurement of instances of allophonic variation. In the limit, this approach promises to yield automated phonetic transcriptions, along with automated measurement of relevant phonetic dimensions, for speech datasets comprising millions or billions of words.
The approach begins by using an HMM speech-recognition system in “forced alignment” mode to determine phone boundaries and to classify some types of allophonic variation. Given this basic alignment, we can use appropriate acoustic features and machine-learning techniques to make finer phonetic distinctions and accurate phonetic measurements. For example, a Bayesian approach to formant analysis has successfully replicated the results of a human analysis comprising 134,000 manual formant measurements in a large corpus of regional variation in North American English. As another example, it is possible to automatically classify and measure the voice onset time and closure duration of intervocalic stops at human-like levels given only an audio recording and a transcript. In one large dataset our algorithms produced VOT measurements with a mean absolute difference of only 1.9 ms relative to the careful annotations of two experienced phoneticians (who themselves exhibit a mean absolute difference of 1.5 ms).
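The evaluation metric quoted above, mean absolute difference between automatic and manual measurements, is simple to state precisely. The sketch below illustrates it; the VOT values are invented for the example, not drawn from the study.

```python
# Sketch of the evaluation described above: comparing automatic VOT
# estimates against a phonetician's manual annotations.

def mean_abs_diff(auto_ms, manual_ms):
    """Mean absolute difference between paired VOT measurements, in ms."""
    assert len(auto_ms) == len(manual_ms)
    return sum(abs(a - m) for a, m in zip(auto_ms, manual_ms)) / len(auto_ms)

auto = [52.0, 48.5, 61.0, 44.0]    # automatic VOTs (hypothetical values)
manual = [50.0, 50.0, 60.0, 45.5]  # manual annotations (hypothetical values)

print(mean_abs_diff(auto, manual))  # 1.5
```

The same statistic computed between two human annotators gives the inter-annotator baseline against which the automatic system is judged.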
In this presentation, I will discuss the application of these techniques to other features, including /l/ allophony, “g-dropping”, vowel nasalization, and t/d deletion; and I will describe the development of open-source packages designed to allow others to use such methods in their own research.
(Describes joint work with Jiahong Yuan, Neville Ryant, Stephen Isard and Keelan Evanini.)