Less Is More: A Unified Architecture for Device-Directed Speech Detection with Multiple Invocation Types



Suppressing unintended invocation of the device, whether caused by speech that sounds like the wake word or by accidental button presses, is critical for a good user experience, and is referred to as False-Trigger-Mitigation (FTM). When multiple invocation options are available, the traditional approach to FTM is to use either invocation-specific models or a single model for all invocations. Both approaches are sub-optimal: the memory cost of the former grows linearly with the number of invocation options, which is prohibitive for on-device deployment, and it does not take advantage of shared training data; the latter is unable to accurately capture acoustic differences across invocation types. To this end, we propose a Unified Acoustic Detector (UAD) for FTM when multiple invocation options are available on device. The proposed UAD is trained using a multi-task learning framework, in which a jointly trained acoustic encoder model is augmented with invocation-specific classification layers. In the context of the FTM task, we show for the first time that, using a shared model architecture across invocations (thus keeping the model size similar to that of a monolithic model used for a single invocation type), we can not only match but largely improve on the accuracy of the invocation-specific models. In particular, in the challenging case of touch-based invocation, we obtain 50% and 35% relative improvement in false positive rate at 99% true positive rate, compared to a single-output model for both invocations and to separate models per invocation, respectively. Furthermore, we propose streaming and non-streaming variants of the UAD, and show that both outperform a traditional ASR-based approach to FTM.
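The memory argument and the shared-encoder design can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the single dense `tanh` encoder layer, the feature and hidden dimensions, the invocation names `"voice"` and `"touch"`, and the helper names `linear_params`, `separate_models_params`, `unified_params`, and `UnifiedDetector`. It only shows how one shared encoder with a small per-invocation classification head keeps the parameter count close to that of a single model, whereas separate models scale linearly with the number of invocation types.

```python
import math
import random


def linear_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def separate_models_params(feat_dim, hidden_dim, num_invocations):
    """One full encoder plus one binary head per invocation type."""
    per_model = linear_params(feat_dim, hidden_dim) + linear_params(hidden_dim, 1)
    return num_invocations * per_model


def unified_params(feat_dim, hidden_dim, num_invocations):
    """One shared encoder plus one small binary head per invocation type."""
    return linear_params(feat_dim, hidden_dim) + num_invocations * linear_params(hidden_dim, 1)


class UnifiedDetector:
    """Toy shared encoder with invocation-specific heads (illustrative only)."""

    def __init__(self, feat_dim, hidden_dim, invocation_types, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Shared across all invocation types (hypothetical one-layer encoder).
        self.encoder = init(hidden_dim, feat_dim)
        self.encoder_bias = [0.0] * hidden_dim
        # One (weights, bias) pair per invocation type.
        self.heads = {name: (init(1, hidden_dim)[0], 0.0) for name in invocation_types}

    def score(self, features, invocation):
        """Probability-like score that the input is device-directed."""
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(self.encoder, self.encoder_bias)
        ]
        weights, bias = self.heads[invocation]  # route to the matching head
        logit = sum(w * h for w, h in zip(weights, hidden)) + bias
        return 1.0 / (1.0 + math.exp(-logit))

    def num_params(self):
        encoder = len(self.encoder) * len(self.encoder[0]) + len(self.encoder_bias)
        heads = sum(len(w) + 1 for w, _ in self.heads.values())
        return encoder + heads


if __name__ == "__main__":
    detector = UnifiedDetector(feat_dim=40, hidden_dim=64, invocation_types=["voice", "touch"])
    print("separate models:", separate_models_params(40, 64, 2))
    print("unified model:  ", detector.num_params())
    print("touch score:    ", detector.score([0.5] * 40, "touch"))
```

With these assumed sizes, two separate models need 5378 parameters, while the unified model needs 2754, only 65 more than a single-invocation model. In a multi-task training setup, each example would update the shared encoder and only the head matching its invocation type; that training loop is omitted here.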