Thursday, 24 December 2020

Compare the similarity of 2 sounds using Python Librosa

I have about 30 sound clips that are each a preset from a synthesizer. I want to compare these sounds to find out which ones are similar, and then sort the sounds so that each sound is adjacent in a list to 2 sounds which are similar to it. Frequency is not the only thing I want to look for. I would rather 2 saw waves which are a tone apart be considered similar than a saw wave and a sine wave which are the same note.

These sounds would be considered similar, for example:

(image: waveforms of the two similar sounds)


Below is some code I have written in order to separate a song clip into individual sounds. At the end I have an array named transientSamples which holds the ~30 sounds from the sound file. I would like to sort this list by sound similarity. This code works and I do not need help with it. I am wondering how to do the next step, which is sorting the list of sounds. If you answer with code, please explain your code and/or fill it with lots of comments. I am new to audio analysis and have little experience with the concepts described in the librosa documentation, but I feel like this page of functions on feature extraction will be useful here.

This is the sound file I am using

import librosa
import numpy as np
import os
import soundfile as sf


def transients_from_onsets(onset_samples):
    """Takes a list of onset positions for an audio file and returns the list
    of start and stop positions for that audio file.

    Args:
        onset_samples ([int]): Onset positions, in samples.

    Returns:
        [(int, int)]: A list of start and stop positions for each sound change
    """
    starts = onset_samples[0:-1]
    stops = onset_samples[1:]
    transients = []
    for s in range(len(starts)):
        transients.append((starts[s], stops[s]))
    return transients


def transient_samples_from_times(transientTimes, y):
    transientSamples = []
    for (start, stop) in transientTimes:
        transientSamples.append(y[start:stop])
    return transientSamples


def transients_from_sound_file(fileName, sr=44100):
    """Takes the path to an audio file and returns the list of start and stop
    positions (in samples) for each sound change in that audio file.

    Args:
        fileName (string): The path to an audio file
        sr (int, optional): The sample rate of the audio file. Defaults to 44100.

    Returns:
        [(int, int)]: A list of start and stop positions for each sound change
    """
    y, sr = librosa.load(fileName, sr=sr)
    C = np.abs(librosa.cqt(y=y, sr=sr))
    o_env = librosa.onset.onset_strength(sr=sr, S=librosa.amplitude_to_db(C, ref=np.max))
    onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)

    onset_samples = librosa.frames_to_samples(onset_frames)
    # Append the end of the signal so the last sound gets a stop position too
    onset_samples = np.concatenate((onset_samples, [len(y)]))
    transientTimes = transients_from_onsets(onset_samples)
    transientSamples = transient_samples_from_times(transientTimes, y)
    return transientTimes, transientSamples


def main():
    soundFile = "first-four-seconds.wav"
    transientTimes, transientSamples = transients_from_sound_file(soundFile)

I need this explained at a low level of abstraction.

I made spectrograms for all the sounds

S = [np.abs(librosa.stft(y)) for y in transientSamples[0:3]]

This is the spectrogram array. Since there are only 3, I would like the sound which sounds most like the other 2 to be in the middle. I don't know how to compare these arrays to each other.

[
array([[5.4624897e-01, 2.4768096e-01, 1.4556460e-01, ..., 5.7867712e-01,
        4.5554334e-01, 5.4170746e-01],
       [2.2452552e+00, 1.5994334e+00, 1.1147156e+00, ..., 2.4819484e-01,
        3.9017826e-01, 2.6280621e-01],
       [4.1671777e+00, 3.8175552e+00, 3.0202076e+00, ..., 6.3510722e-01,
        6.0488397e-01, 8.2328826e-01],
       ...,
       [4.5242737e-04, 2.0710655e-04, 2.2738833e-04, ..., 1.3031939e-04,
        2.1871932e-04, 3.3146524e-04],
       [5.5690703e-04, 3.5128437e-04, 3.1929201e-04, ..., 2.3604959e-04,
        4.4018662e-04, 5.8832398e-04],
       [5.9645117e-04, 2.4223674e-04, 3.1842748e-04, ..., 2.6871334e-04,
        2.3753567e-04, 6.7172351e-04]], dtype=float32),
array([[5.28261960e-01, 3.40773761e-01, 3.64203081e-02, ...,
        4.94968325e-01, 3.40660214e-02, 6.10241592e-02],
       [1.47714257e+00, 8.36813867e-01, 6.81992233e-01, ...,
        1.36160457e+00, 7.20932424e-01, 5.77513456e-01],
       [2.13315511e+00, 2.37118530e+00, 2.42295027e+00, ...,
        2.90257120e+00, 1.71677089e+00, 1.24927975e-01],
       ...,
       [2.86732073e-04, 2.23182797e-04, 1.68425191e-04, ...,
        4.66386628e-05, 9.50504182e-05, 1.01365811e-04],
       [2.26811855e-04, 1.57420873e-04, 2.92457495e-04, ...,
        7.80835835e-05, 1.15517389e-04, 9.55912183e-05],
       [3.52372968e-04, 2.86036695e-04, 6.16093748e-05, ...,
        7.29371095e-05, 8.64024987e-05, 1.62979442e-04]], dtype=float32), 
array([[1.2992793e-01, 4.9643394e-02, 5.8306026e-01, ..., 8.2840703e-02,
        2.5342130e-03, 2.2889185e-01],
       [1.6738531e-01, 1.6829827e-01, 5.0865650e-01, ..., 3.2662424e-01,
        3.1669825e-01, 4.0731630e-01],
       [1.2706667e-01, 1.0827064e-01, 4.0436751e-01, ..., 1.7236206e-01,
        3.3250493e-01, 5.8282650e-01],
       ...,
       [3.9454570e-04, 2.7970120e-04, 6.7032655e-05, ..., 5.1820884e-04,
        4.5072302e-04, 2.3226207e-04],
       [7.6335639e-04, 4.5884529e-04, 1.4658972e-04, ..., 5.8171444e-04,
        6.4832438e-04, 2.1206622e-04],
       [8.9842471e-04, 4.6509632e-04, 1.3202771e-04, ..., 7.2093250e-04,
        6.4046000e-04, 2.2259250e-04]], dtype=float32)
]
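One possible way to compare these arrays, sketched under my own assumptions rather than taken from the librosa docs: the spectrograms may have different numbers of time frames, so first reduce each one to a fixed-length vector by averaging over time, then compare the vectors with cosine similarity. The random arrays below are toy stand-ins for the three |STFT| arrays in S.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D vectors; 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def spectrogram_to_vector(spec):
    # spec has shape (n_freq_bins, n_frames); averaging over time (axis=1)
    # turns clips of different lengths into comparable fixed-length vectors
    return spec.mean(axis=1)

# Toy stand-ins for the three spectrograms (real data replaces these)
S = [np.abs(np.random.default_rng(i).normal(size=(1025, 40))) for i in range(3)]
vectors = [spectrogram_to_vector(s) for s in S]

# Pairwise similarity matrix for the three sounds
sim = np.array([[cosine_similarity(u, v) for v in vectors] for u in vectors])
print(sim)
```

The sound whose row has the highest total similarity to the other two would then be the one to place in the middle. Time-averaging throws away how the sound evolves, so this is only a rough first measure.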

I also computed rms, mfcc, and centroids using

rms = [librosa.feature.rms(S=s) for s in S]
centroids = [librosa.feature.spectral_centroid(y=y, sr=sr) for y in transientSamples]
mfccs = [librosa.feature.mfcc(y=y, sr=sr) for y in transientSamples]

But again, I don't know how to compare the arrays to each other.
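For the actual sorting step, one heuristic I can sketch (my own assumption, not a librosa feature): average each MFCC array over time to get one vector per sound, then build the ordering greedily — start from any sound and repeatedly append the closest not-yet-used sound by Euclidean distance. The random vectors below stand in for the time-averaged MFCCs, which in the real code would be something like `[m.mean(axis=1) for m in mfccs]`.

```python
import numpy as np

def greedy_order(features):
    """Greedy nearest-neighbour ordering: start at sound 0, then repeatedly
    append the closest not-yet-used sound (Euclidean distance)."""
    remaining = list(range(len(features)))
    order = [remaining.pop(0)]
    while remaining:
        last = features[order[-1]]
        # the remaining sound whose feature vector is closest to the last one
        nearest = min(remaining, key=lambda i: np.linalg.norm(features[i] - last))
        remaining.remove(nearest)
        order.append(nearest)
    return order

# Stand-in feature vectors for 5 sounds (replace with time-averaged MFCCs)
rng = np.random.default_rng(0)
feats = [rng.normal(size=20) for _ in range(5)]
print(greedy_order(feats))  # a permutation of 0..4 starting at 0
```

This is only a nearest-neighbour heuristic; it does not guarantee the globally best chain, but it does make each sound adjacent to something close to it, which is what the question asks for.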



from Compare the similarity of 2 sounds using Python Librosa
