I am normalizing my datasets, but the data contains a lot of zeros because of padding. I can mask them during model training, but these zeros are also affected when I apply normalization.
I am currently using the sklearn library to do the normalization:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
The amount of zero-padding varies from batch to batch, because these are features extracted from audio files of varying length using a fixed window size. For example, given a 3D array with dimensions (4, 3, 5) as (batch, step, features):
[[[  0   0   0   0   0]
  [  0   0   0   0   0]
  [  0   0   0   0   0]]

 [[  1   2   3   4   5]
  [  4   5   6   7   8]
  [  9  10  11  12  13]]

 [[ 14  15  16  17  18]
  [  0   0   0   0   0]
  [ 24  25  26  27  28]]

 [[  0   0   0   0   0]
  [423   2 230  60  70]
  [  0   0   0   0   0]]]
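For completeness, here is a runnable version of this toy example (the variable name X_train is my own choice):

import numpy as np

X_train = np.array([
    [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
    [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [9, 10, 11, 12, 13]],
    [[14, 15, 16, 17, 18], [0, 0, 0, 0, 0], [24, 25, 26, 27, 28]],
    [[0, 0, 0, 0, 0], [423, 2, 230, 60, 70], [0, 0, 0, 0, 0]],
], dtype=float)  # shape (4, 3, 5): (batch, step, features)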
I wish to perform normalization by column, so:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.reshape(-1,X_train.shape[-1])).reshape(X_train.shape)
X_test = scaler.transform(X_test.reshape(-1,X_test.shape[-1])).reshape(X_test.shape)
However, in this case the zeros are treated as valid values. For example, the minimum value of the first feature column should be 1 (the smallest non-padded value), not 0.
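To illustrate, a small check on the toy array above (the variable names here are my own):

flat = X_train.reshape(-1, X_train.shape[-1])   # (batch*step, features)
print(flat[:, 0].min())                         # 0.0 -- padded rows included
valid = ~np.all(flat == 0, axis=1)              # True where a step is not an all-zero padding row
print(flat[valid, 0].min())                     # 1.0 -- padded rows excluded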
Furthermore, the 0 values are also changed after applying the scalers, but I wish to keep them as 0 so I can mask them during training:

model.add(tf.keras.layers.Masking(mask_value=0.0, input_shape=(X_train.shape[1], X_train.shape[2])))
Is there any way to mask the zeros during normalization, so that only the non-padded steps in this example are used to fit the scaler?
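To make the goal concrete, here is a sketch of the behaviour I am after, assuming a padded step is always an exactly all-zero row (the function names are placeholders of my own):

def fit_scaler_ignoring_padding(X, scaler):
    # Fit the scaler only on steps that are not all-zero padding.
    flat = X.reshape(-1, X.shape[-1])
    valid = ~np.all(flat == 0, axis=1)
    scaler.fit(flat[valid])
    return scaler

def transform_keeping_padding(X, scaler):
    # Transform only the non-padded steps; padded steps stay exactly 0.
    flat = X.reshape(-1, X.shape[-1]).astype(float)
    valid = ~np.all(flat == 0, axis=1)
    flat[valid] = scaler.transform(flat[valid])
    return flat.reshape(X.shape)

scaler = fit_scaler_ignoring_padding(X_train, MinMaxScaler())
X_train_scaled = transform_keeping_padding(X_train, scaler)
X_test_scaled = transform_keeping_padding(X_test, scaler)

One caveat I can already see with this: MinMaxScaler would map the smallest real value of each feature to 0.0, which collides with mask_value=0.0, so something like MinMaxScaler(feature_range=(0.1, 1)) might be needed.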
In addition, the actual array for my project is bigger, with dimensions (2000, 50, 68), and among the 68 features the differences in value can be very large. I tried to normalize them by dividing each element by the largest element in its row, to avoid the impact of the zeros, but this did not work out well.
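For reference, this is roughly what that attempt looked like (a sketch; "row" here means the 68 feature values of one step):

row_max = X_train.max(axis=-1, keepdims=True)  # (2000, 50, 1): largest feature per step
row_max[row_max == 0] = 1                      # avoid dividing the all-zero padded steps by 0
X_train_rowscaled = X_train / row_max          # padded steps stay 0, but scaling is per step, not per feature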