Monday 19 October 2020

Audio signal split at word level boundary

I am working with an audio file using webrtcvad and pydub. Currently each fragment is split at the silences between sentences. Is there any way to split at word-level boundaries instead (after each spoken word)? If librosa, ffmpeg, or pydub has a feature like this, is it possible to split at each vocal? After splitting, I also need the start and end time of each vocal part, exactly as it is positioned in the original file. One simple way to split with ffmpeg is described here:

https://gist.github.com/vadimkantorov/00bf4fbe4323360722e3d2220cc2915e

but this also splits by silence, and the result changes with each padding value or frame size. I am trying to split by vocal. As an example, I did this manually: the original file, the split words, and their time positions in JSON are in a folder at this link:

www.mediafire.com/file/u4ojdjezmw4vocb/attached_problem.tar.gz
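
To make the timing requirement concrete, here is a minimal sketch of the kind of output I am after. It is not a confirmed solution: it uses only short-time energy, so it can separate words only where there is a brief dip in loudness between them, and words that run together will stay in one span. The function names, the 10 ms frame, the 0.02 threshold, and the 30 ms minimum gap are all assumptions chosen for illustration, and the samples are assumed to be mono floats in the range -1.0 to 1.0.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each consecutive, non-overlapping frame."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def voiced_spans(samples, sample_rate, frame_ms=10, threshold=0.02, min_gap_ms=30):
    """Return (start_sec, end_sec) spans measured from the start of `samples`.

    A span is closed once at least `min_gap_ms` of consecutive frames fall
    below `threshold` (hypothetical default values, tune per recording).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len < 1:
        raise ValueError("frame_ms is too short for this sample rate")
    min_gap = max(1, min_gap_ms // frame_ms)

    frames = frame_rms(samples, frame_len)
    spans = []
    start = None  # index of the first loud frame of the open span
    quiet = 0     # number of quiet frames seen since the last loud frame

    for idx, rms in enumerate(frames):
        if rms >= threshold:
            if start is None:
                start = idx
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                end = idx - quiet + 1  # frame right after the last loud frame
                spans.append((start * frame_ms / 1000, end * frame_ms / 1000))
                start, quiet = None, 0

    # Close a span that is still open when the audio ends.
    if start is not None:
        end = len(frames) - quiet
        spans.append((start * frame_ms / 1000, end * frame_ms / 1000))
    return spans
```

Because the spans are expressed as offsets into the original signal, they can be written straight to JSON, and with pydub each piece could then be cut by slicing the `AudioSegment` in milliseconds, for example `audio[int(start * 1000):int(end * 1000)]`.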



