This is current problem I am working on. I don’t know how to explain it properly but I’ll try. There is single linear time series of data collected in which there are multiple increasing sequences are present. For example consider this series of numbers (y) collected at time (x) generated with three linear equations.
x,y 2,14 2,12 3,37 4,22 4,14 5,26 5,15 6,73 6,30 7,34 8,97 8,38 8,18 9,109 9,19 10,121 10,20 11,133 11,50 11,21 12,145 12,54 13,23 14,169 14,62 15,181 16,26 18,217 18,78 19,29
When we plot this data on a chart we can see that there are three sequences in them.
The problem is to isolate these three clusters. Since I have no idea how to do this, I was first going for a k-means clustering algorithm with 3 clusters. which gave me this,
This is clearly wrong since we have a series which is forward moving both on x axis and y axis so we cannot have the blue cluster possibly occur linearly. This is when I though might be a graph based clustering algorithm might help. I can put all my rules in making the graph where only linearly possible clusters are connected and then just partition the graph. If it is too dense then I might be able to run some community detection algorithm to get the clusters out of it.
As an initial experiment, I made a graph between all these points (nodes) where distance is the euclidean distance between them. Then I applied the rules where for two nodes a, b (points) a link can exist from a to b only if
- b(x) > a(x),
- b(y) > a(y) and
- b(x) – a(x) is not more than 5
The resulting graph looks like this,
This seems good progress since I seem to have 2 connected components (ignoring the lone node) where one of them is a clear linear sequence. Then when I ran a random walk on the graph, I get three clusters,
we seem to be able to cluster linear sequences out of the data, where except for when these linear sequences are really close. This looks very promising for the stuff I am working on! Will see how this works with real data and post the update.
ps. I would really like to know if there is already a method which can extract multiple linear sequences out of a data similar to what I am trying here. Please mention in the comments if you think anything is relevant.