This is current problem I am working on. I don’t know how to explain it properly but I’ll try. There is single linear time series of data collected in which there are multiple increasing sequences are present. For example consider this series of numbers (y) collected at time (x) generated with three linear equations.
x,y 2,14 2,12 3,37 4,22 4,14 5,26 5,15 6,73 6,30 7,34 8,97 8,38 8,18 9,109 9,19 10,121 10,20 11,133 11,50 11,21 12,145 12,54 13,23 14,169 14,62 15,181 16,26 18,217 18,78 19,29
When we plot this data on a chart we can see that there are three sequences in them.
The problem is to isolate these three clusters. Since I have no idea how to do this, I was first going for a k-means clustering algorithm with 3 clusters. which gave me this,
This is clearly wrong since we have a series which is forward moving both on x axis and y axis so we cannot have the blue cluster possibly occur linearly. This is when I though might be a graph based clustering algorithm might help. I can put all my rules in making the graph where only linearly possible clusters are connected and then just partition the graph. If it is too dense then I might be able to run some community detection algorithm to get the clusters out of it.
As an initial experiment, I made a graph between all these points (nodes) where distance is the euclidean distance between them. Then I applied the rules where for two nodes a, b (points) a link can exist from a to b only if
- b(x) > a(x),
- b(y) > a(y) and
- b(x) – a(x) is not more than 5
The resulting graph looks like this,
This seems good progress since I seem to have 2 connected components (ignoring the lone node) where one of them is a clear linear sequence. Then when I ran a random walk on the graph, I get three clusters,
we seem to be able to cluster linear sequences out of the data, where except for when these linear sequences are really close. This looks very promising for the stuff I am working on! Will see how this works with real data and post the update.
ps. I would really like to know if there is already a method which can extract multiple linear sequences out of a data similar to what I am trying here. Please mention in the comments if you think anything is relevant.
2 thoughts on “Extracting multiple linear sequences out of two dimensional data”
What you have actually might be better described as a missing variable that might not be available in your data. I don’t think there is a particular method to define and figure this variable out and it’s not related to clustering. The comparable example is three data sets with three linear equations so i.e.
y = ax
y = bx
y = cx
a < b < c
Here your third variable is just the gradient. so the actual multivariate model is y = zx (where z : a or b or c)
In that case you can use multivariate regression/analysis which you can factor it in. it usually covers only when the variables affect all the data.
The other way would be to find a way to segment the data if in particular it is not a common variable. But that is more like a unknown missing variable which you can only deal with by adding something that tells you there is three set of results and you analyse each set one by one, but I don't think that is what you are looking for.
Good idea to ask maybe a PHD statistician!
Thanks! Will look into applying multivariate regression. the advantage here is that x is time and there is an overall order to all the observations. I have asked people, lets see if a more elegant way comes up.