I have the following DataFrame, which contains the consumption of a product for each hour. I want to group the hours by similar demand, but each group must consist of consecutive hours in order to make sense. For instance, 10-12 would be a meaningful group of hours, but (10-12, 2, 4-5) would not.
1970-01-01 08:00:00 9
1970-01-01 09:00:00 11
1970-01-01 10:00:00 28
1970-01-01 11:00:00 26
1970-01-01 12:00:00 26
1970-01-01 13:00:00 32
1970-01-01 14:00:00 24
1970-01-01 15:00:00 30
1970-01-01 16:00:00 23
1970-01-01 17:00:00 32
1970-01-01 18:00:00 27
1970-01-01 19:00:00 21
1970-01-01 20:00:00 16
1970-01-01 21:00:00 13
1970-01-01 22:00:00 1
1970-01-01 23:00:00 0
import numpy as np
import scipy.cluster.hierarchy as hcluster

# Demand values as a flat 1-D array (works for a Series or a single-column DataFrame)
temp_data = df.values.ravel()
# fclusterdata expects 2-D observations, so each value is duplicated into a pair
data = np.array([[td, td] for td in temp_data])

# clustering
# Threshold: 15% of the total range of the data
thresh = (15.0 / 100.0) * (max(temp_data) - min(temp_data))
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")

# Collect the row positions that belong to each cluster (labels start at 1)
total_clusters = max(clusters)
clustered_index = [[] for _ in range(total_clusters)]
for i, label in enumerate(clusters):
    clustered_index[label - 1].append(i)

# Report the (min, max) demand value of each cluster
clustered_range = []
for x in clustered_index:
    clustered_index_x = [temp_data[y] for y in x]
    clustered_range.append((min(clustered_index_x), max(clustered_index_x)))
print(clustered_range)
The code above (like standard unsupervised clustering algorithms) produces ranges of cluster values, but it is not aware that the hours must be consecutive; it simply clusters the values. Any ideas on how to enforce this constraint, so that every group is a run of consecutive hours?
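One possible direction, offered only as a sketch: because the hours are ordered, the consecutiveness constraint turns the task into splitting a 1-D sequence into contiguous runs. The example below uses dynamic programming to find the split into `k` runs with the smallest total within-run squared deviation from the run mean. The function name `segment_consecutive`, the squared-error criterion, and the choice of `k = 4` are all assumptions made for illustration; none of them come from the question, and the number of groups would still have to be chosen somehow.

```python
from itertools import accumulate


def segment_consecutive(values, k):
    """Split `values` into k contiguous runs with minimal total squared error.

    Hypothetical helper: returns a list of (start, stop) index pairs,
    where each run is values[start:stop].
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums make the squared error of any run an O(1) lookup.
    s1 = [0.0] + list(accumulate(float(v) for v in values))
    s2 = [0.0] + list(accumulate(float(v) ** 2 for v in values))

    def sse(i, j):
        """Sum of squared deviations from the mean of values[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting values[:j] into m runs.
    # cut[m][j]: start index of the last run in that best split.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    cut[m][j] = i

    # Walk the stored cut points backwards to recover the runs.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = cut[m][j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]


# Demand values from the table above, for hours 08:00 to 23:00.
hours = list(range(8, 24))
demand = [9, 11, 28, 26, 26, 32, 24, 30, 23, 32, 27, 21, 16, 13, 1, 0]

# k = 4 is an arbitrary choice for this illustration.
for start, stop in segment_consecutive(demand, 4):
    print(f"{hours[start]:02d}:00-{hours[stop - 1]:02d}:00", demand[start:stop])
```

Every group returned this way is a run of consecutive hours by construction. The triple loop is O(k·n²), which is negligible for a day of hourly data but may need a faster method for long series.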
from Unsupervised clustering of demand into groups of hours