Very randomly, I happen to have experimented with exactly this a few months ago, based on a Japanese advent calendar project[1] - the code is all over the place and only works with Japanese speech, but the gist is as follows. Also in [2].

The trick is to NOT wait for the LLM to finish talking, but:

  1 at end of user VAD, call the LLM and stream the response into a buffer (simple enough)
  2 chunk the response at [commas, periods, newlines], and queue the sentence-oid texts
  3 pipe queued sentence-oid fragments into a fast classical TTS and queue the audio snippets
  4 play queued sentence-oid audio snippets, maintaining the correspondence between consumed audio and text
  5 at onset of user VAD, stop and clear everything everywhere, dropping queued-but-unplayed audio and nuking unplayed text from the chat log
  6 at end of user VAD, feed the amended transcript - now canonical to the user's ears - back into step 1
  7 (make sure to parallelize it all; rough sketches of steps 1-4 and 5-6 below)
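For steps 1-4, a minimal sketch in Python (not the gist's actual code; llm_stream and tts() are stand-ins for whatever streaming LLM client and classical TTS engine you have):

  import queue, re

  text_q, audio_q = queue.Queue(), queue.Queue()        # sentence-oid text / audio snippets
  SENTENCE_BREAK = re.compile(r"(.+?[、。,.\n])", re.S)  # break at commas, periods, newlines

  def chunk_stream(llm_stream):
      # steps 1+2: accumulate streamed tokens, emit a sentence-oid fragment
      # as soon as a comma/period/newline shows up
      buf = ""
      for token in llm_stream:
          buf += token
          while (m := SENTENCE_BREAK.match(buf)):
              text_q.put(m.group(1))
              buf = buf[m.end():]
      if buf.strip():
          text_q.put(buf)
      text_q.put(None)                                  # end-of-response sentinel

  def tts_worker(tts):
      # step 3: keep (text, audio) paired so the chat log can later be trimmed
      # to exactly what was actually played (needed for steps 5-6)
      while (frag := text_q.get()) is not None:
          audio_q.put((frag, tts(frag)))
      audio_q.put(None)

Step 4 is then just a player thread draining audio_q and appending each fragment's text to the chat log as it plays.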
This flow (hypothetically) allows interactions such as:

  user: "what's the date today"
  sys:  "[today][is thursday], [decem"
  user: "sorry yesterday"
  sys:  "[...uh,][wednesday?][Usually?]"

  1: https://developers.cyberagent.co.jp/blog/archives/44592/
  2: https://gist.github.com/numpad0/18ae612675688eeccd3af5eabcfdf686



