
How To Use Cluster Analysis in R To Answer Business Questions

September 14, 2016

Aptitive has partnered with a variety of companies to build extensive platforms for analyzing their data. The real value of a business intelligence environment is that it lets users answer far more complex and interesting questions. Beyond visualizing a question in a dashboard, it often makes sense to model the problem with statistical software, for example by performing cluster analysis in R.

As an example, let’s pretend we belong to a Supply Chain group that needs to answer the question: “How do I classify my products to determine the appropriate service level?” In other words, how do I group products together to decide how to stock them in the warehouse?

There are a variety of tools and analyses that make sense, but I’ve decided to use RStudio to create a clustering model.

Load Data Set

First, I created a file that contains 1000 rows and three columns: Product ID, Distinct Customer Count over the last year, and Revenue (in thousands) over the last year. Note that I used only two inputs for the sake of visualization; for a bigger analysis, we could include more relevant variables.

I imported the data to RStudio, loaded the component columns into vectors, and created a matrix for my analysis:

 

[Figure: Sample Data]

For your reference, here is the sample R code:

#pull columns from source .csv into vectors
CustomerCount <- SampleData$CustomerCount
RevenueK <- SampleData$RevenueK
#create a matrix using the vectors; the row count comes from the data rather than a hard-coded 1000
myMatrix <- matrix(c(CustomerCount, RevenueK), nrow = length(CustomerCount), ncol = 2)
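
If you want to try the import step without the original file, here is a minimal sketch. It assumes the column names used in this post, and it writes an invented data set to a temporary file first, so the path and every value in it are placeholders rather than the real data.

```r
# Synthetic stand-in for the source file: all values below are invented.
set.seed(1)
csv_path <- tempfile(fileext = ".csv")   # hypothetical location
write.csv(
  data.frame(
    ProductID     = 1:1000,
    CustomerCount = rpois(1000, lambda = 20),
    RevenueK      = round(rgamma(1000, shape = 2, scale = 50), 1)
  ),
  csv_path,
  row.names = FALSE
)

# Import the file, then keep only the two numeric inputs as a matrix.
SampleData <- read.csv(csv_path)
myMatrix   <- as.matrix(SampleData[, c("CustomerCount", "RevenueK")])

dim(myMatrix)       # rows x columns
colnames(myMatrix)  # column names carried over from the data frame
```

Selecting the columns by name leaves Product ID out of the distance calculation and keeps the column labels on the matrix.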

Perform the Cluster Analysis

Next, I used the hierarchical clustering function in R (hclust) to map out the possible clusters based on the two components in my matrix. Reading the resulting chart (i.e., the Cluster Dendrogram) from the top down, all products start in the same group, and each branch point splits a group into two smaller groups of more similar products. Cutting the tree at a given height therefore yields “k” clusters. Based on the dendrogram, I tried to split the 1000 products into reasonably even, distinct groups and concluded that four clusters make sense. I then passed that count to k-means (kmeans), which performs the actual assignment of products to clusters.

 

[Figure: Cluster Dendrogram. The y-axis is a measure of “closeness” of individual clusters.]

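#Use technique native in R to create clusters (distances are computed on both columns of the matrix)
myclust <- hclust(dist(myMatrix))
plot(myclust)
#based on breakdown, decide appropriate number of clusters to create
clustcnt <- 4
rect.hclust(myclust, k = clustcnt)
#k-means starts from random centers, so set a seed (arbitrary value) to make the assignment repeatable
set.seed(42)
fit <- kmeans(myMatrix, centers = clustcnt)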
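
To see how cutting the dendrogram turns into group membership, here is a small sketch that uses cutree() on the hierarchical tree itself. The post's own code hands the cluster count to kmeans() instead, so treat this as an illustration of the idea and not the method used above. The data is invented: four deliberately well-separated blobs of 25 products each.

```r
set.seed(2)

# Invented data: four well-separated blobs of 25 products each.
centers <- rbind(c(20, 20), c(20, 100), c(100, 20), c(100, 100))
toyMatrix <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(25, centers[i, 1], 2), rnorm(25, centers[i, 2], 2))
}))
colnames(toyMatrix) <- c("CustomerCount", "RevenueK")

# Build the tree (hclust uses complete linkage by default),
# then cut it so that exactly four groups remain.
toyClust <- hclust(dist(toyMatrix))
groups   <- cutree(toyClust, k = 4)

table(groups)  # number of products in each group
```

With real data the groups are rarely this clean, which is why plotting the dendrogram before choosing the count is worthwhile.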

Visualize the Clusters

Next, I assigned the products their appropriate category and colored the plot point to reflect the result:

 

[Figure: Customer Count vs. Revenue, colored by cluster]

#bind the cluster assignment to the result set
out <- cbind(myMatrix, ClusterNum = fit$cluster)
colnames(out)[1] <- "CustomerCount"
colnames(out)[2] <- "RevenueK"
#designate output vectors using the resulting matrix and plot by color
CustomerCountOutput <- out[, "CustomerCount"]
RevenueKOutput <- out[, "RevenueK"]
clustcolor <- out[, "ClusterNum"]
#map cluster numbers 1 to 4 onto colors
clustpalette <- c("blue", "purple", "red", "green")
plot(CustomerCountOutput, RevenueKOutput, main = "Product Cluster", xlab = "Customer Count", ylab = "Revenue in Ks", col = clustpalette[clustcolor])

Analyze the Results

Finally, I am able to create a story about each category and make mindful decisions about how I would manage the stock:

  • Blue (High Revenue, Low Customer variety): I might store these items directly at the customer’s site and maintain high safety stock since they bring in more money.
  • Purple (Mid Revenue, Low/Mid Customer variety): I would perform more analysis to try to push the product into the “High Revenue” category.
  • Red (Low Revenue, Low Customer variety): We might not bother to stock these items at all and instead create a just-in-time (JIT) system for ordering.
  • Green (Low Revenue, High/Mid Customer variety): We might try to minimize these product lines and carefully watch our costs to ensure we are getting a good return.
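
Labels like “High Revenue, Low Customer variety” can be checked against per-cluster averages. The sketch below shows one way to build such a profile with aggregate(). The data is invented (two obvious groups, not the four from the post), and the names products and profile are placeholders. Keep in mind that k-means cluster numbers are arbitrary, so which number (and therefore which color) lands on which group can change between runs.

```r
set.seed(3)

# Invented data standing in for the product matrix: two obvious groups.
products <- data.frame(
  CustomerCount = c(rnorm(30, 10, 2),   rnorm(30, 60, 5)),
  RevenueK      = c(rnorm(30, 300, 20), rnorm(30, 40, 10))
)

fit <- kmeans(products, centers = 2, nstart = 10)
products$ClusterNum <- fit$cluster

# Average of each input per cluster, plus the number of products in it.
profile <- aggregate(cbind(CustomerCount, RevenueK) ~ ClusterNum,
                     data = products, FUN = mean)
profile$Products <- as.vector(table(products$ClusterNum))

profile
```

Reading the averages row by row is usually enough to decide which story belongs to which cluster number.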

Also note that if the firm added a new product line, it would only be a matter of rerunning the model to derive new classifications. No more long, painful projects to answer that same question all over again.
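
One way to make that rerun a single call is to wrap the steps in a function. This is only a sketch: classify_products(), catalog and new_line are hypothetical names, the data is invented, and the seed value is arbitrary. Because k-means is refit from scratch, existing products can move to a different cluster number after a rerun.

```r
# Hypothetical helper: refit k-means on the two inputs and attach the result.
classify_products <- function(data, k = 4, seed = 42) {
  inputs <- as.matrix(data[, c("CustomerCount", "RevenueK")])
  set.seed(seed)                                # arbitrary seed, for repeatability
  fit <- kmeans(inputs, centers = k, nstart = 10)
  cbind(data, ClusterNum = fit$cluster)
}

# Invented catalog of existing products.
set.seed(4)
catalog <- data.frame(
  ProductID     = 1:200,
  CustomerCount = rpois(200, lambda = 20),
  RevenueK      = round(rgamma(200, shape = 2, scale = 50), 1)
)

# Invented new product line, appended to the catalog before the rerun.
new_line <- data.frame(
  ProductID     = 201:220,
  CustomerCount = rpois(20, lambda = 35),
  RevenueK      = round(rgamma(20, shape = 2, scale = 80), 1)
)

reclassified <- classify_products(rbind(catalog, new_line))
table(reclassified$ClusterNum)  # products per cluster after the rerun
```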

Conclusion

By performing more advanced analyses, companies can make better, data-driven decisions. What questions could this type of analysis answer in your business?

  • Who are my organization’s best members?
  • Who is most likely to buy my services?
  • What types of products are most profitable?
  • …we can go on forever.

Beyond clustering, there is tremendous potential to use data for predictive forecasting, regression modeling, principal component analysis, and much more! Of course, the consultants at Aptitive love to work with data. We partner with companies to answer the questions that save them money, help them run their business more effectively, and give them a competitive edge. Please reach out if you would like to discuss more.

This post was originally published on Medium.