Clustering in R


I need to cluster this database and then make predictions. How can I correctly substitute the values in this case?

Which type of clustering would fit best?

I am a beginner in the data area and I am trying to solve this problem, as I believe it will be a great challenge for me to learn.

To be clear: I would like to convert the data to numbers so that I can feed it to kmeans, for example, but I am open to suggestions.

asked by anonymous 19.10.2017 / 14:25

1 answer


With only what you described in the question, it is difficult to give you a specific answer. Next time, or if this answer is not satisfactory, I suggest explaining a bit about what the database describes.

The first step is preprocessing, and what it involves depends on the algorithm you want to use. For an algorithm like K-Means, identifying outliers and imputing missing values is almost essential, since the algorithm is based on means, which outliers distort and which cannot be computed with missing values.
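As a rough illustration of that preprocessing, here is a minimal sketch in base R. The data frame, its values, the median imputation and the 1.5 * IQR outlier rule are all assumptions chosen for the example, not something taken from your database; only the column names order_status and price echo the ones mentioned in this answer.

```r
# Toy data; the values are made up for illustration.
orders <- data.frame(
  order_status = c("delivered", "canceled", "delivered",
                   "shipped", "delivered", "shipped"),
  price        = c(20, 25, NA, 22, 500, 24),
  stringsAsFactors = TRUE
)

# 1. Impute missing prices with the median, which is less sensitive
#    to outliers than the mean.
orders$price[is.na(orders$price)] <- median(orders$price, na.rm = TRUE)

# 2. Flag outliers with the 1.5 * IQR rule (one common convention) and drop them.
q <- quantile(orders$price, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- orders$price < q[1] - fence | orders$price > q[2] + fence
clean <- orders[!is_outlier, ]

# 3. Turn the categorical column into 0/1 indicator columns.
status_dummies <- model.matrix(~ order_status - 1, data = clean)

# 4. Standardize the numeric column so it is on a comparable scale.
features <- cbind(status_dummies, price = as.numeric(scale(clean$price)))
```

The resulting features matrix is fully numeric, so it can be passed to kmeans(). Whether dropping outliers is appropriate, rather than capping or keeping them, depends on what the data represents.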

The k-means algorithm is one of the 10 most used algorithms in data mining, and it has been around for decades. Knowing this, I think it would be a good starting point, applied to grouping time series. Time series are data whose values are recorded as a function of time. In your case, one idea is to group weeks that behave similarly to each other. You can do this with different types of attributes: for example, you can see which weeks have similar behavior for order_status, price, etc.
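A minimal sketch of the weekly grouping idea, using only base R. The daily data is simulated, and the date column, the two weekly summaries (mean and standard deviation of price) and the choice of 2 clusters are assumptions made for the example; with real data you would pick the attributes and the number of clusters yourself.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Simulated daily records: 12 weeks, cheaper in the first six (hypothetical data).
days <- seq(as.Date("2017-01-02"), by = "day", length.out = 12 * 7)
daily <- data.frame(
  date  = days,
  price = rnorm(length(days), mean = rep(c(20, 60), each = 6 * 7), sd = 3)
)

# Assign each day to its week (weeks start on Monday by default).
daily$week <- cut(daily$date, breaks = "week")

# One row per week, described by two numeric attributes.
weekly <- data.frame(
  week       = levels(daily$week),
  mean_price = as.numeric(tapply(daily$price, daily$week, mean)),
  sd_price   = as.numeric(tapply(daily$price, daily$week, sd))
)

# Standardize the attributes, then group the weeks with k-means.
X <- scale(weekly[, c("mean_price", "sd_price")])
fit <- kmeans(X, centers = 2, nstart = 25)
weekly$cluster <- fit$cluster

print(weekly)
```

Each week ends up with a cluster label, so weeks sharing a label are the ones k-means considers similar on the chosen attributes. The nstart argument reruns the algorithm from several random starts and keeps the best result, which makes the outcome less dependent on the initial centers.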

Another good algorithm for those starting out is DBSCAN, which is a density-based algorithm (k-means is prototype-based). It is very simple, and you do not even have to worry about outliers, as they are very likely to be labeled as noise and left out of the clusters. I leave it to you to decide where to apply it, since you have a better sense of where this database came from.
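To show how DBSCAN treats an outlier, here is a small sketch. It assumes the CRAN package dbscan is installed, and the simulated points and the eps and minPts values are illustrative choices, not recommendations for your data.

```r
set.seed(1)

# Two dense blobs plus one far-away point (all simulated).
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
  c(20, 20)
)

# Run only if the dbscan package is available.
if (requireNamespace("dbscan", quietly = TRUE)) {
  # eps: neighborhood radius; minPts: points needed to form a dense region.
  fit <- dbscan::dbscan(pts, eps = 1, minPts = 5)

  # Cluster 0 is the noise label; the isolated point (row 61) falls there.
  print(table(fit$cluster))
  print(fit$cluster[61])
}
```

Unlike k-means, you do not choose the number of clusters in advance, but the result is sensitive to eps and minPts, so it is worth trying a few values.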

Good luck.

23.10.2017 / 02:23