Research Topics

We conduct a wide range of research, from theory to practice, centered on speech recognition and extending to language processing and spoken dialogue. The following is a list of our research areas.

  • Speech recognition
  • Spoken dialogue systems
  • Natural language processing
  • Conversational voice interfaces
  • Voice interaction
  • Voice interactive content and media

The following is an overview of each study and examples of research themes.

Speech Recognition

Automatic speech recognition, which transcribes human speech into text, is a fundamental technology for speech media processing. It is built by integrating information processing technologies ranging from signal processing to language processing and speech understanding. For many years, we have been working on speech recognition based on statistical models and machine learning.

Highly efficient voice decoding

Research on improving the efficiency and accuracy of continuous speech recognition algorithms. One typical goal is a robust speech recognition engine that can run across a variety of environments, devices, and tasks, from embedded IC chips to cloud environments.

Low latency decoding

In voice interfaces operated in real time, response delay is a major problem. We are working on speech recognition algorithms that achieve extremely fast responses by finalizing the recognition result immediately after, or even before, the end of an utterance.
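One common strategy in such incremental decoding is to commit the part of the hypothesis that has stopped changing across successive partial results, so the interface can act on it before the utterance ends. A minimal sketch of that idea (the function name and the stability criterion are ours, not a specific system's API):

```python
def stable_prefix(partials, k=3):
    """Return the word prefix shared by the last k partial hypotheses.

    Committing this stable prefix early lets the interface start
    responding before the utterance (and full decoding) has finished.
    """
    if len(partials) < k:
        return []
    recent = [p.split() for p in partials[-k:]]
    prefix = []
    for words in zip(*recent):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix

# Successive partial results from a (hypothetical) streaming recognizer:
partials = ["turn", "turn on", "turn on the", "turn on the light"]
print(stable_prefix(partials))  # -> ['turn', 'on']
```

Real systems weigh the stability threshold against latency: a larger k finalizes later but revises less.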

Speech classification

Research on methods for extracting and classifying the various kinds of information carried by the human voice. Examples include language-proficiency assessment for second-language learners, estimation of speaking style in lectures, and speech-emotion classification.
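As a minimal illustration of this kind of classification, a nearest-centroid classifier over toy prosodic features (the feature values and class labels below are invented for the sketch; real systems use learned acoustic features):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(x, centroids):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

# Toy prosodic features (mean pitch in Hz, mean energy); values are invented.
train = {
    "neutral": [[120.0, 0.30], [125.0, 0.35]],
    "excited": [[220.0, 0.80], [210.0, 0.75]],
}
centroids = {label: centroid(vs) for label, vs in train.items()}
print(classify([215.0, 0.70], centroids))  # -> excited
```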

Sound environment detection

A study on the automatic classification and identification of various acoustic events and acoustic scenes in everyday life.

Spoken Dialogue System

We are conducting research on spoken dialogue systems, in which a user speaks to a machine and receives a spoken response. Achieving human-like intelligent dialogue requires not only speech recognition and speech synthesis but also a variety of intelligent processes such as speech understanding and dialogue management. Our laboratory is engaged in research on models, design, and real-world system operation.

Mobile-oriented conversational agent

Research and development of a dialogue system in mobile environments that allows users to communicate anytime and anywhere.

Statistical spoken dialogue system

Research on statistical, data-driven modeling of spoken dialogue, including automatic completion of dialogue scenarios, system construction from small amounts of task data, task adaptation, and automatic scenario extension.

Situation-aware multi-modal dialogue modeling

A dialogue system operating in a real-world environment must understand the user's context well. We are researching a multimodal interaction system that understands not only the user's voice but also the context of their behavior and the surrounding environment, and responds appropriately.

Direct voice-to-response modeling

Using only the recognized text discards the non-verbal information contained in the voice. We are therefore investigating the direct use of acoustic information at the phoneme, senone, or frame level to select a dialogue response.
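The idea can be sketched as a scoring function that mixes a text-similarity term with an acoustic-similarity term when ranking candidate responses. The embeddings below are hand-made stand-ins for encoder outputs, and the function names are ours:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def select_response(text_emb, acoustic_emb, candidates, alpha=0.5):
    """Rank candidate responses by a weighted mix of text match and
    acoustic match (e.g. frame-level prosody), and return the best reply."""
    def score(c):
        return (alpha * cosine(text_emb, c["text_emb"])
                + (1 - alpha) * cosine(acoustic_emb, c["acoustic_emb"]))
    return max(candidates, key=score)["reply"]

# Two candidates whose text fits equally well; the acoustic term decides.
candidates = [
    {"reply": "calm reply",     "text_emb": [1.0, 0.0], "acoustic_emb": [1.0, 0.0]},
    {"reply": "cheerful reply", "text_emb": [1.0, 0.0], "acoustic_emb": [0.0, 1.0]},
]
print(select_response([1.0, 0.0], [0.0, 1.0], candidates))  # -> cheerful reply
```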

Natural Language Processing

We have been working on end-to-end (E2E) spoken dialogue and response sentence generation based on neural-network (NN) natural language processing since 2017, and have actively participated in international competitions such as DSTC7 and DSTC8.

  • E2E dialogue sentence generation with natural topic transitions
  • Emphasizing speaking styles from small corpora in E2E dialogue
  • Modeling the individuality of speech
  • Response sentence selection in multi-party conversation
  • Robust NN-based spoken language understanding and dialogue state tracking
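As a minimal illustration of one of these themes, dialogue state tracking maintains the user's goal as slot-value pairs across turns. A toy keyword-based tracker (real trackers learn this with NNs; the slot ontology here is invented):

```python
def track_state(state, utterance, ontology):
    """Update the dialogue state with any slot values mentioned in the
    utterance. A toy keyword matcher standing in for a learned tracker."""
    new_state = dict(state)
    text = utterance.lower()
    for slot, values in ontology.items():
        for value in values:
            if value in text:
                new_state[slot] = value
    return new_state

ontology = {"food": ["italian", "chinese"], "area": ["north", "south"]}
state = {}
state = track_state(state, "I want Italian food", ontology)
state = track_state(state, "somewhere in the north, please", ontology)
print(state)  # -> {'food': 'italian', 'area': 'north'}
```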

Conversational Voice Interface

Voice interfaces that let users operate machines by voice have the potential to become the next major interface, replacing the keyboard, mouse, and touch input. We are researching an interactive voice interface that allows anyone to carry out simple interactions as naturally as talking to a person.

Usability assessment

Research on the usability of voice interfaces, including users’ psychological and cognitive load. We also study techniques, designs, and evaluation scales that help users acquire “interactive cognition”: the sense of the machine as a conversational partner to whom they can speak naturally.

Multi-agent interface

Research on multi-agent interfaces that simultaneously present multiple interactive agents according to the multiple tasks they handle.

Voice Interaction

Research aimed mainly at reducing barriers in conversations with humanoid agents. We analyze and evaluate approaches in engineering terms, focusing on cognitive aspects such as interactive cognition, conversational affordances, and the role of emotions. Basic research is conducted from paralinguistic, cognitive-scientific, and interactional perspectives.

The affordance of conversationality

Research to elucidate the “affordance of speech input” and the “affordance of conversation” that a spoken dialogue system or speech interface should express so that anyone can speak to a machine naturally.

Engagement in voice interaction system

Research on a human-friendly voice interaction system that estimates, and attunes itself to, a person’s state and traits.

Dialogue as a Media

In the near future, when interactive speech interfaces and systems are in widespread use, users will be able to freely choose among many designs (speech style, personality) for interactive machines such as smartphones, operating systems, and cars. At that point, elements such as the content of the dialogue, the content of the responses, and the design of the appearance will gradually detach from the technology itself and become independent “content” that is produced and consumed in society. In our laboratory, we define these elements of a spoken dialogue system, such as dictionaries for speech recognition, voice models for speech synthesis, and dialogue management units, as “spoken dialogue content.”

Packaging spoken dialogue systems as media content

Research on the design of an infrastructure for treating the elements of spoken dialogue as system-independent content. We develop the open-source toolkit MMDAgent as a test implementation.

Distribution of Spoken Dialogue Content

Research on element definitions, content structures, and description methods to enable the free distribution, appropriation, and modification of spoken dialogue content.
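As a rough illustration, distributable spoken dialogue content might be described by a self-contained package descriptor listing its assets. The field names and file extensions below are hypothetical, not an actual MMDAgent or distribution format:

```python
import json

# Hypothetical descriptor for a distributable piece of spoken dialogue
# content; field names and file names are illustrative only.
package = {
    "name": "weather-guide",
    "version": "1.0",
    "assets": {
        "recognition_dict": "weather.dic",
        "synthesis_voice": "guide.voice",
        "dialogue_script": "weather.fst",
    },
    "license": "CC-BY",
}
print(json.dumps(package, indent=2))
```

A machine-readable descriptor like this is what would let content be freely distributed, reused, and modified independently of any one system.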

Promoting content creation

Design and development of tools that allow creators to freely build voice interactive content.

Toward user-generated media

This research aims to establish voice interactive content as user-generated media, in the manner of YouTube and Wikipedia. We study users’ motivations and incentives, and conduct demonstrations.

Talking with CG character agents

Dialogues and conversations with CG characters and virtual beings on screen have been attracting much attention recently: “VTubers” are becoming widely accepted and pervasive in society. We have begun targeted research on dialogue systems with anime-style 2-D CG characters. In particular, we are now tackling a new hybrid approach that mixes human-to-human and human-to-machine dialogue by combining a spoken dialogue system with “avatar” control.

Dialogue-capability cognition with CG character agents

We are studying design methodologies and dialogue control schemes that enable humans to easily come to perceive CG characters as “natural, smooth, and sustainable conversation partners” (dialogue perception).

Multi-modal dialogue behavior modeling for CG characters

This research focuses on modeling CG-specific conversation styles and behaviors. Although a CG character’s behavior should be grounded in natural human behavior, it often includes exaggeration and emphasis: styles and movements that are unnatural yet acceptable. Referring to the conversation styles of so-called VTubers as examples of CG-based interaction, we aim to build data-driven models for the automatic generation or conversion of CG-specific dialogue behavior.