Monday, 1 February 2021

Unsupervised clustering of demand into groups of hours

I have the following DataFrame that contains, for each hour, the corresponding consumption of a product. I want to group those hours by similar demand, but each group must consist of consecutive hours in order to make sense. For instance, a meaningful group of hours could be 10-12, but not (10-12, 2, 4-5).

1970-01-01 08:00:00     9
1970-01-01 09:00:00    11
1970-01-01 10:00:00    28
1970-01-01 11:00:00    26
1970-01-01 12:00:00    26
1970-01-01 13:00:00    32
1970-01-01 14:00:00    24
1970-01-01 15:00:00    30
1970-01-01 16:00:00    23
1970-01-01 17:00:00    32
1970-01-01 18:00:00    27
1970-01-01 19:00:00    21
1970-01-01 20:00:00    16
1970-01-01 21:00:00    13
1970-01-01 22:00:00     1
1970-01-01 23:00:00     0
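
The post does not show how `df` is built. As a minimal sketch, assuming it is a pandas Series of hourly demand indexed by timestamp (the values are copied from the listing above), it could be reconstructed roughly like this:

```python
import pandas as pd

# Assumption: `df` is a Series (one demand value per hour),
# not a multi-column DataFrame; the post does not say which.
demand = [9, 11, 28, 26, 26, 32, 24, 30, 23, 32, 27, 21, 16, 13, 1, 0]

# Hourly timestamps from 08:00 to 23:00 on 1970-01-01, as in the listing.
index = pd.to_datetime([f"1970-01-01 {hour:02d}:00:00" for hour in range(8, 24)])

df = pd.Series(demand, index=index)
print(df)
```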

import numpy as np
import scipy.cluster.hierarchy as hcluster
temp_data = df.values

# fclusterdata expects 2-D observations, so each value is duplicated into a point (td, td)
ndata = [[td, td] for td in temp_data]
data = np.array(ndata)

# clustering
# Threshold: 15% of the total range of the data
thresh = (15.0 / 100.0) * (max(temp_data) - min(temp_data))

clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

total_clusters = max(clusters)

# Collect the positions that belong to each cluster (labels start at 1)
clustered_index = [[] for _ in range(total_clusters)]
for i, label in enumerate(clusters):
    clustered_index[label - 1].append(i)

# Report the (min, max) demand value found in each cluster
clustered_range = []
for x in clustered_index:
    cluster_values = [temp_data[y] for y in x]
    clustered_range.append((min(cluster_values), max(cluster_values)))
print(clustered_range)

The code above (like most general-purpose unsupervised clustering algorithms) produces ranges of cluster values, but it is not aware that the hours must be consecutive; it simply clusters the values. Any ideas on how to enforce consecutive groups of hours while clustering?
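
To make the constraint concrete, here is a minimal sketch of one possible approach, not necessarily the best one: treat the problem as splitting the ordered sequence into consecutive runs and pick the split that minimises the within-group sum of squared deviations, using dynamic programming. The function name `segment_consecutive`, the squared-error criterion, and the choice of `n_groups=4` are all assumptions made for illustration; the original code uses a distance threshold rather than a fixed number of groups.

```python
def segment_consecutive(values, n_groups):
    """Split `values` into `n_groups` consecutive runs that minimise the
    total within-run sum of squared deviations (hypothetical helper)."""
    n = len(values)
    if not 1 <= n_groups <= n:
        raise ValueError("n_groups must be between 1 and len(values)")

    # Prefix sums give the cost of any run values[i:j] in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        # Sum of squared deviations from the mean of values[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting values[:j] into k runs.
    # cut[k][j]: start index of the last run in that optimal split.
    best = [[inf] * (n + 1) for _ in range(n_groups + 1)]
    cut = [[0] * (n + 1) for _ in range(n_groups + 1)]
    best[0][0] = 0.0
    for k in range(1, n_groups + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j] = c
                    cut[k][j] = i

    # Walk the stored cut points backwards to recover the runs.
    bounds = []
    j = n
    for k in range(n_groups, 0, -1):
        i = cut[k][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Demand values and hours copied from the listing in the question.
demand = [9, 11, 28, 26, 26, 32, 24, 30, 23, 32, 27, 21, 16, 13, 1, 0]
hours = list(range(8, 24))

groups = segment_consecutive(demand, n_groups=4)  # 4 is an arbitrary choice
for start, stop in groups:
    print(f"{hours[start]:02d}:00-{hours[stop - 1]:02d}:00", demand[start:stop])
```

Each returned pair is a half-open index range `(start, stop)`, so every group is a block of consecutive hours by construction. The number of groups still has to be chosen, for example by comparing the total cost for several values of `n_groups`.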


