E. Tsiang, "System for recognizing speech", U.S. Patent 5377302, 1994.

A pattern recognition system particularly useful for recognizing speech or handwriting. An input signal is first filtered by a filter bank having two stages, in which the outputs of the first stage are fed forward to the second stage for a significant number of the filters, and the outputs of the second stage are fed back to the first stage for a significant number of the filters. Such feedback enhances the signal-to-noise ratio and resembles the coupling between the different sections of the basilar membrane of the cochlea. The output of the filter bank is a two-dimensional frequency-time representation of the original signal. A second set of filters, which takes two-dimensional signals as input, detects the presence of elementary tonotopic features such as the onset, rise, fall, and frequency of any significant tones in a speech signal. A third set of filters detects contrasts in the elementary features at various levels of resolution. After such filtering, a neural network is employed to learn patterns formed from the multi-resolution contrasts in the identified features, so that the system recognizes symbols from an input signal that is continuous in time. In the case of speech, the system recognizes continuous speech in a speaker-independent manner, and is also tolerant of noise.
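The two-stage feedback arrangement can be sketched in miniature. The following single-channel model assumes a two-pole resonator for the first stage and a one-pole smoother for the second, with the stage-2 output fed back into the stage-1 input one sample later; the filter forms, pole radius, and feedback gain are invented for illustration and are not the patented design.

```python
import math

def two_stage_filter(x, f0, fs, alpha=0.005):
    """One channel: stage 1 is a two-pole resonator at f0 Hz; stage 2 is a
    one-pole smoother; the stage-2 output is fed back (gain alpha) into the
    stage-1 input. All coefficients are illustrative assumptions."""
    w = 2 * math.pi * f0 / fs
    r = 0.95                                   # stage-1 pole radius (assumed)
    a1, a2 = -2 * r * math.cos(w), r * r
    y1_prev, y1_prev2, y2 = 0.0, 0.0, 0.0      # filter state
    out = []
    for s in x:
        # Stage 1: resonator driven by the input plus stage-2 feedback.
        y1 = (s + alpha * y2) - a1 * y1_prev - a2 * y1_prev2
        # Stage 2: leaky smoother of the stage-1 output.
        y2 = 0.5 * y2 + 0.5 * y1
        y1_prev, y1_prev2 = y1, y1_prev
        out.append(y2)
    return out
```

With a small feedback gain the loop stays stable, and the channel still responds much more strongly to a tone at its characteristic frequency than to one far away.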

E. Tsiang, "A Cochlea Filter Bank for Speech Analysis", Proc. International Conference on Signal Processing Applications and Technology, pp. 1674-1678, 1997.

The filter bank consists of resonators whose characteristic frequencies are ordered tonotopically. By tonotopy, we mean any monotonically increasing function of frequency, such as the logarithm. The proposed filter bank differs from other cochlea filter banks in its substantial inter-filter feedback. Each filter may be coupled to all other filters at higher frequencies. The feedback enhances the filter bank response for a selective range of frequency modulations. For a filter bank design based on the human cochlea, this range coincides with formant movements in speech signals. Decreasing the inter-filter feedback has little effect on the response at the locations of formants, but significantly attenuates their changes. In the case of feedback from only the adjacent higher-frequency filter, the filter bank reduces to a delay line. Because formant dynamics encode perceptual information, this filter bank may be useful as an initial analysis stage for automatic speech recognition. The filter bank design derives from a three-dimensional solution to the equation of motion of the basilar membrane. In particular, the inter-filter coupling constants depend quadratically on the breadth of the basilar membrane, which may vary along its length. Increasing the inter-filter feedback sharpens the frequency responses of individual filters. We describe a DSP implementation of a design based on the physical properties of the average human cochlea. Cochleagrams for various sample signals, including some speech signals, demonstrate the enhancement of formant movements. This design is currently used as the front-end processor of a speech recognizer.
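The inter-filter feedback structure can be illustrated with a toy resonator bank. Each channel below receives feedback only from channels of higher characteristic frequency, so the coupling graph is strictly triangular and the bank is stable whenever the individual resonators are. The resonator form, pole radius, and the uniform coupling constant `beta` are assumptions for demonstration, not the coupling constants the paper derives from the basilar membrane.

```python
import math

def cochlea_bank(x, freqs, fs, beta=0.002, r=0.95):
    """Resonator bank with inter-filter feedback. freqs must be sorted
    ascending; channel i receives feedback from every channel j > i
    (higher characteristic frequency). Returns a time x channel
    'cochleagram'. beta and r are illustrative assumptions."""
    n = len(freqs)
    a1 = [-2 * r * math.cos(2 * math.pi * f / fs) for f in freqs]
    a2 = r * r
    y = [0.0] * n       # outputs at the current sample
    y1 = [0.0] * n      # outputs at t-1
    y2 = [0.0] * n      # outputs at t-2
    frames = []
    for s in x:
        prev = y[:]     # previous-sample outputs, for the feedback terms
        for i in range(n):
            # Feedback only from higher-frequency channels (one-sample delay).
            fb = beta * sum(prev[j] for j in range(i + 1, n))
            y[i] = (s + fb) - a1[i] * y1[i] - a2 * y2[i]
        y2, y1 = y1, y[:]
        frames.append(y[:])
    return frames
```

Because no channel feeds a higher-frequency one, the feedback cannot form a loop; each channel still peaks at its own characteristic frequency.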

E. Tsiang, "Multiresolution Elementary Tonotopic Features for Speech Perception", Proc. International Conference on Neural Networks, pp. 575-579, June 1997.

We define multiresolution elementary tonotopic features (ETFs) in general, and present specific functions and decompositions for computing them. Such decompositions, when cast in the form of local, fixed-weight FIR neural networks, have definite architectures. Results of their use as front-end inputs to a speaker-independent continuous-speech phoneme recognizer are encouraging. We analyze the dependence of the recognition performance on the various ETFs at different levels of resolution.
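A minimal sketch of what local, fixed-weight FIR kernels over a time-frequency representation might look like: 2x2 kernels for onset, rise, and fall, applied at several resolutions via a simple 2x2-averaging pyramid. The kernels, kernel size, and pyramid are assumptions for demonstration; the actual functions and decompositions are specified in the paper.

```python
def correlate2d(S, K):
    """Valid-mode 2D correlation of a spectrogram S (time x freq) with kernel K."""
    kt, kf = len(K), len(K[0])
    return [[sum(K[i][j] * S[t + i][f + j] for i in range(kt) for j in range(kf))
             for f in range(len(S[0]) - kf + 1)]
            for t in range(len(S) - kt + 1)]

def downsample2(S):
    """One pyramid level: halve both axes by 2x2 averaging."""
    return [[(S[t][f] + S[t][f + 1] + S[t + 1][f] + S[t + 1][f + 1]) / 4.0
             for f in range(0, len(S[0]) - 1, 2)]
            for t in range(0, len(S) - 1, 2)]

# Hypothetical fixed-weight 2x2 kernels: rows are time, columns are frequency.
# Rise and fall are sign-opposites here; a real system would rectify the
# responses to separate the two directions.
KERNELS = {
    "onset": [[-1, -1], [1, 1]],   # energy appearing over time
    "rise":  [[1, -1], [-1, 1]],   # ridge moving up in frequency
    "fall":  [[-1, 1], [1, -1]],   # ridge moving down in frequency
}

def etf_pyramid(S, levels=3):
    """ETF maps at several resolutions, coarsening the spectrogram each level."""
    out = []
    for _ in range(levels):
        out.append({name: correlate2d(S, K) for name, K in KERNELS.items()})
        S = downsample2(S)
    return out
```

Cast as a neural network, each kernel is one channel of a local, fixed-weight FIR layer, and each pyramid level is the same layer applied to a coarser input, which is what gives the decomposition a definite architecture.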

E. Tsiang, "A Neural Architecture for Computing Acoustic-phonetic Invariants", Proc. ICASSP, pp. II-1109, May 1998.

The proposed neural architecture consists of an analytic lower net and a synthetic upper net. This paper focuses on the upper net. The lower net performs a 2D multiresolution wavelet decomposition of an initial spectral representation to yield a multichannel representation of local frequency modulations at multiple scales. From this representation, the upper net synthesizes increasingly complex features, resulting in a set of acoustic observables at the top layer with multiscale context dependence. The upper net also provides invariance under frequency shifts and under dilations of tone intervals and time intervals, by building these transformations into the architecture. Application of this architecture to the recognition of gross and fine phonetic categories from continuous speech of diverse speakers shows that it provides high accuracy and strong generalization from modest amounts of training data.
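The lower net's 2D multiresolution wavelet decomposition can be sketched with a Haar-style pyramid; the Haar basis and the averaging normalization here are assumptions, since the abstract does not name the wavelet. The detail subbands at each level capture local modulations along frequency, along time, and along the diagonal at that scale.

```python
def haar2d_level(S):
    """One 2D Haar-style analysis level on S (time x freq).
    Returns (LL, LH, HL, HH) subbands at half resolution."""
    T, F = len(S) // 2 * 2, len(S[0]) // 2 * 2
    LL, LH, HL, HH = [], [], [], []
    for t in range(0, T, 2):
        ll, lh, hl, hh = [], [], [], []
        for f in range(0, F, 2):
            a, b = S[t][f], S[t][f + 1]
            c, d = S[t + 1][f], S[t + 1][f + 1]
            ll.append((a + b + c + d) / 4)   # local average (approximation)
            lh.append((a - b + c - d) / 4)   # frequency-direction detail
            hl.append((a + b - c - d) / 4)   # time-direction detail
            hh.append((a - b - c + d) / 4)   # diagonal detail (local FM)
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

def wavelet_pyramid(S, levels=3):
    """Recursively decompose the approximation band, yielding detail
    channels at successively coarser scales plus the final approximation."""
    bands = []
    for _ in range(levels):
        LL, LH, HL, HH = haar2d_level(S)
        bands.append({"LH": LH, "HL": HL, "HH": HH})
        S = LL
        if len(S) < 2 or len(S[0]) < 2:
            break
    return bands, S
```

The stack of detail channels across levels is the kind of multichannel, multiscale modulation representation the abstract describes the upper net consuming.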