- Sentence-based speech-to-text, specifying a grammar
- 💡 Stateful & low-latency ASR: proposed architecture
`grammar.js` is a basic demo of the Vosk recognizer using a specified grammar (the output structure now allows different alternatives).
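For reference, here is a minimal sketch of what `grammar.js` may look like, assuming the standard vosk npm API (the actual demo may differ, e.g. in how it parses the WAV header or prints latencies):

```js
const fs = require('fs')
const vosk = require('vosk')

const MODEL_DIRECTORY = '../models/vosk-model-small-en-us-0.15'
const SPEECH_FILE = '../audio/2830-3980-0043.wav'

// each item is an allowed sentence; '[unk]' catches out-of-grammar words
const GRAMMAR = [
  'experience proves this',
  'why should one hold on the way',
  'your power is sufficient i said',
  'oh one two three four five six seven eight nine zero',
  '[unk]'
]

vosk.setLogLevel(-1)

const model = new vosk.Model(MODEL_DIRECTORY)
const recognizer = new vosk.Recognizer({ model, sampleRate: 16000, grammar: GRAMMAR })
recognizer.setMaxAlternatives(1)
recognizer.setWords(true)

// simplistic WAV handling: skip the 44-byte header and feed the raw PCM
const pcm = fs.readFileSync(SPEECH_FILE).subarray(44)
recognizer.acceptWaveform(pcm)
console.log(recognizer.finalResult())

recognizer.free()
model.free()
```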
```
$ node grammar.js

model directory : ../models/vosk-model-small-en-us-0.15
speech file name : ../audio/2830-3980-0043.wav
grammar : experience proves this,why should one hold on the way,your power is sufficient i said,oh one two three four five six seven eight nine zero,[unk]
load model latency : 328ms
{
  alternatives: [
    {
      confidence: 197.583099,
      result: [
        { end: 1.02, start: 0.36, word: 'experience' },
        { end: 1.35, start: 1.02, word: 'proves' },
        { end: 1.98, start: 1.35, word: 'this' }
      ],
      text: ' experience proves this'
    }
  ]
}
transcript latency : 118ms
```
IMPORTANT: latency is very low if grammar sentences are provided!
See details here:
- https://github.com/alphacep/vosk-api/blob/master/nodejs/index.js#L198
- https://github.com/alphacep/vosk-api/blob/91a128b3edf7e84d55649d8fa9a60664b5386292/src/vosk_api.h#L114
- alphacep/vosk-api#500
That's not an issue, just a question/discussion for you/everyone about the proposed architecture.
Preamble about latencies: Vosk decoding is very fast! On my PC, for short (few-word) utterances I got the following transcript latencies:
- Using a grammar-enabled model (e.g. the pretrained vosk-model-small-en-us-0.15):
  - if I DO NOT specify any grammar, I get a latency of ~500-600 ms
  - if I DO specify a grammar (even a pretty long one), I get a few tens of ms (<< 100 ms)
- Using a large / static-graph model (e.g. vosk-model-en-us-aspire-0.2), I got ~400-500 ms latency (with better accuracy for open-domain utterances).
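These timings can be reproduced with plain `process.hrtime.bigint()` timers around the relevant calls; a minimal sketch, reusing the model and audio paths from the demo above (the raw-PCM extraction is simplified):

```js
const fs = require('fs')
const vosk = require('vosk')

// skip the 44-byte WAV header, keep the raw 16 kHz PCM
const pcm = fs.readFileSync('../audio/2830-3980-0043.wav').subarray(44)

let t = process.hrtime.bigint()
const model = new vosk.Model('../models/vosk-model-small-en-us-0.15')
console.log(`load model latency : ${Number(process.hrtime.bigint() - t) / 1e6} ms`)

// no grammar here: per the numbers above, expect ~500-600 ms on a small model
const recognizer = new vosk.Recognizer({ model, sampleRate: 16000 })

t = process.hrtime.bigint()
recognizer.acceptWaveform(pcm)
const result = recognizer.finalResult()
console.log(`transcript latency : ${Number(process.hrtime.bigint() - t) / 1e6} ms`)
```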
Considering a stateful (task-oriented, closed-domain) voice-assistant platform, I want to experiment with how far I can push latencies down using a stateful ASR. My idea is to connect the Vosk ASR to a state-based dialog manager (such as my own open-source NaifJs).
Workflow:
- Initialization phase:
  - load a model that allows grammars (e.g. vosk-model-small-en-us-0.15)
  - prepare/create N different Vosk Recognizers, one for each `grammar(N)` (one grammar for each `state(N)`)
- Run-time (decoding time):
  - a "Decode Manager" decides which Recognizer is to be used, depending on the state injected by the dialog manager
  - the Decode Manager could use a fallback Recognizer, based on the original model without any grammar specified, for a final decision (see the sketch below)
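To make the idea concrete, here is a minimal sketch of this workflow (the `DecodeManager` class, its interface, and the wiring to the dialog manager are hypothetical):

```js
const vosk = require('vosk')

// Hypothetical Decode Manager: one pre-built Recognizer per dialog state,
// plus a grammar-less fallback Recognizer on the same model.
class DecodeManager {
  // grammars: Map<stateName, sentences[]>, supplied by the dialog manager
  constructor (modelDirectory, grammars, sampleRate = 16000) {
    this.model = new vosk.Model(modelDirectory)

    // Initialization phase: build all grammar-based Recognizers up front,
    // so no Recognizer has to be created at decoding time.
    this.recognizers = new Map()
    for (const [state, grammar] of grammars) {
      this.recognizers.set(state, new vosk.Recognizer({ model: this.model, sampleRate, grammar }))
    }

    // fallback Recognizer, based on the original model, without any grammar
    this.fallback = new vosk.Recognizer({ model: this.model, sampleRate })
  }

  // Run-time: pick the Recognizer matching the state injected by the dialog manager
  transcript (pcm, state) {
    const recognizer = this.recognizers.get(state) || this.fallback
    recognizer.acceptWaveform(pcm)
    return recognizer.finalResult()
  }
}
```

The dialog manager (e.g. NaifJs) would then call `transcript(pcm, currentState)` for each utterance, falling back to the grammar-less Recognizer when the state has no grammar or for a final decision.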
See the diagram:
```
state(S-1) -> grammar(S-1)
┌────────────────────────────────────────────────────────────┐
│ │
│ │
│ │
│ (1) │
┌──────────▼─────────┐ │
│ │ │
│ │ (2) │
│ │ ┌──────────────┐ ┌───────────┐ │
│ │ │ │ │ │ │
│ │ │ Grammar 1 │ │ │ │
│ ◄───┤ Recognizer 1 ◄───┤ │ │
│ │ │ │ │ │ │ (3)
│ │ │ │ │ │ ┌─────┴─────┐
│ │ └──────────────┘ │ │ │ │
│ │ │ │ │ │
│ │ ┌──────────────┐ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ Grammar 2 │ │ │ │ │
│ ◄───┤ Recognizer 2 ◄───┤ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ └──────────────┘ │ │ │ │
pcm audio │ DECODER │ │ MODEL │ │ DIALOG │
───────────► MANAGER │ ┌──────────────┐ │ ALLOWING │ │ MANAGER ├───────►
│ │ │ │ │ GRAMMARS │ │ │
│ │ │ Grammar N │ │ │ │ │
│ ◄───┤ Recognizer N ◄───┤ │ │ │
│ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ └──────────────┘ │ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ ┌──────────────┐ │ │ │ │
│ │ │ │ │ │ │ │
│ │ │ No-Grammar │ │ │ └─────▲─────┘
│ ◄───┤ Recognizer 0 ◄───┤ │ │
│ │ │ │ │ │ │
│ │ │ │ │ │ │
│ ┌────────────────┐ │ └──────────────┘ └───────────┘ │
│ │ acceptWaveForm │ │ │
│ │ │ │ │
│ └───────┬────────┘ │ │
│ │ │ │
│ │ │ │
└─────────┼──────────┘ │
│ │
│ │
│ │
│ │
└─────────────────────────────────────────────────────────────┘
decode result S
```
That approach would minimize the elapsed time of `new Recognizer` creation at run-time, even if I noticed this creation latency is really low (a few ms) when a grammar is specified, whereas it increases to many tens of ms if a grammar is NOT specified.
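For example, that creation cost can be checked in isolation (a sketch; `model` is assumed to be an already loaded grammar-enabled `vosk.Model`, and the grammar is a hypothetical placeholder):

```js
let t = process.hrtime.bigint()
const recWithGrammar = new vosk.Recognizer({
  model, // assumed already loaded
  sampleRate: 16000,
  grammar: ['one two three', '[unk]'] // hypothetical placeholder grammar
})
console.log(`new Recognizer with grammar    : ${Number(process.hrtime.bigint() - t) / 1e6} ms`)

t = process.hrtime.bigint()
const recWithoutGrammar = new vosk.Recognizer({ model, sampleRate: 16000 })
console.log(`new Recognizer without grammar : ${Number(process.hrtime.bigint() - t) / 1e6} ms`)
```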
See also: alphacep/vosk-api#553