Thursday, 13 June 2024

Tsinghua Researchers Develop Brain-Inspired AI Model for Advanced Speech Separation


On a subway commute or in a noisy restaurant, surrounded by announcements, music, and other people’s voices, you can still follow a conversation with your companions without apparent interference. This is a special advantage the brain exhibits when processing auditory information: it can focus attention on conversations or sounds of interest while ignoring irrelevant noise.

In fact, more than 70 years ago, neuroscientists noticed this remarkable ability of the brain, which they termed the “cocktail party effect.”

“Although we unconsciously exploit this advantage of the brain every day, effortlessly picking out specific target sounds from mixed noise, making computers do the same is very challenging.” Recently, a team led by Professor Yuan Kexin of the School of Biomedical Engineering at Tsinghua University, in collaboration with Professor Hu Xiaolin’s team from the university’s Department of Computer Science, drew on the principles by which the mammalian thalamus and cortex integrate multimodal sensory information to develop a new brain-inspired AI model, CTCNet. The model achieves a breakthrough in mixed-speech separation, allowing computers to mimic more closely how the human brain “listens.” The findings were published in the latest issue of IEEE Transactions on Pattern Analysis and Machine Intelligence.

Computers Still Cannot Effectively Separate Two Voices

Existing multimodal speech separation methods mostly simulate the mammalian cortical pathway from lower functional areas (such as the primary visual and auditory cortices) to higher functional areas (such as the higher visual and auditory cortices). Even so, the field of artificial intelligence (AI) has so far lacked an effective way for computers to cleanly separate two people’s voices.

In the integration of auditory and visual information, however, subcortical structures, most notably the thalamus, play an essential role.

In a recent interview, Yuan Kexin explained that his team has long focused on the brain’s auditory processing mechanisms, and that a series of studies in recent years has gradually revealed the connections, functions, and working mechanisms of the superior auditory thalamus.

“As a key hub for processing auditory information, the superior auditory thalamus is bimodal, responding to both auditory and visual input. Its ventromedial part plays a crucial role in mediating the perception of danger triggered by auditory and visual stimuli, while its dorsal part receives projections from layer 5 of the auditory cortex as well as layer 5 of the visual cortex, forming an overall cortical-thalamic-cortical (CTC) loop architecture,” Yuan Kexin said, noting that the superior auditory thalamus may enhance auditory perception by integrating auditory and visual information through these special connection patterns.

Inspired by the connectivity characteristics of the dorsal superior auditory thalamus, Yuan Kexin’s team collaborated with Hu Xiaolin’s team to propose a cortical-thalamic-cortical neural network (CTCNet) to perform audio-visual speech separation tasks.

New Model Opens Brain-Inspired Paradigm

The CTCNet model consists of three modules: an auditory sub-network, a visual sub-network, and an auditory-visual fusion sub-network, which simulate the auditory cortex, the visual cortex, and the dorsal superior auditory thalamus, respectively.

Its basic working principle is as follows: first, auditory information (speech) and visual information (lip movements) are processed independently, bottom-up, in the separate auditory and visual sub-networks; next, the processed auditory and visual information is fused at multiple temporal resolutions in the auditory-visual fusion sub-network through top-down connections; finally, the fused information is fed back to the auditory and visual sub-networks. This cycle is repeated several times, and the final output is taken from the auditory sub-network.
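The loop described above can be sketched in a few lines of NumPy. This is a minimal toy illustration of the cortical-thalamic-cortical cycle only, not the paper’s actual architecture: the real sub-networks are deep convolutional/recurrent modules fusing at multiple temporal scales, whereas here each sub-network is a single random linear layer with a ReLU, the fusion step is a plain concatenation, and all dimensions and names (`D`, `T`, `n_cycles`, `subnet`) are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not from the paper): D features, T time steps
D, T = 8, 16

def subnet(x, W):
    # Stand-in for one sub-network: a single linear layer with ReLU
    return np.maximum(W @ x, 0.0)

# Randomly initialised stand-in weights for the three modules
W_aud = rng.standard_normal((D, D)) * 0.1       # auditory sub-network ("auditory cortex")
W_vis = rng.standard_normal((D, D)) * 0.1       # visual sub-network ("visual cortex")
W_fus = rng.standard_normal((D, 2 * D)) * 0.1   # fusion sub-network ("thalamus")

def ctc_cycle(audio_feat, visual_feat, n_cycles=3):
    """Repeat the bottom-up / fuse / top-down cycle n_cycles times."""
    a, v = audio_feat, visual_feat
    for _ in range(n_cycles):
        # 1. bottom-up processing, independently per modality
        a = subnet(a, W_aud)
        v = subnet(v, W_vis)
        # 2. fuse both modalities in the thalamic module
        fused = subnet(np.concatenate([a, v], axis=0), W_fus)
        # 3. feed the fused signal back into both sub-networks (top-down)
        a = a + fused
        v = v + fused
    # the final output is taken from the auditory sub-network
    return a

out = ctc_cycle(rng.standard_normal((D, T)), rng.standard_normal((D, T)))
print(out.shape)  # (8, 16)
```

The key structural point the sketch preserves is that fusion is not a one-shot step: the fused representation is injected back into both unimodal streams before the next pass, mirroring the top-down thalamo-cortical feedback the article describes.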

In tests on three speech separation benchmark datasets, CTCNet accurately separated mixed speech with the assistance of visual information (lip movements) while using remarkably few parameters.

Yuan Kexin stated that the speech separation model CTCNet is a brain-inspired AI model constructed based on the fusion ability of the dorsal superior auditory thalamus and its cortical-thalamic-cortical loop connection architecture. By repeatedly fusing and processing auditory and visual information, AI can better achieve the “cocktail party effect.”

“The speech separation performance of the new model is significantly ahead of existing methods. It not only provides a new brain-inspired paradigm for computational perception but also has the potential to play an important role in intelligent assistants, autonomous driving, and other fields,” said Yuan Kexin, one of the study’s lead authors. He added that, going forward, the team plans to use its self-developed tissue optical clearing methods to analyze the input-output connection patterns of the superior auditory thalamus at single-neuron resolution, and thereby further enhance the perceptual capabilities of AI systems in natural scenes.

Li Kai, a master’s student at Tsinghua University, is the first author of the paper; postdoctoral fellow Xie Fenghua and doctoral student Chen Hang are the second and third authors, respectively; and Yuan Kexin and Hu Xiaolin are the co-corresponding authors.
