Abstract
K-Means is a popular clustering algorithm with wideapplications in Computer Vision, Data mining, Data Visu-alization, etc. Clustering is an important step for indexingand searching of documents, images, video, etc. Clusteringlarge numbers of high-dimensional vectors is very compu-tation intensive. In this paper, we present the design andimplementation of theK-Means clustering algorithm on themodern GPU. All steps are performed entirely on the GPUefficiently in our approach. We also present a load balancedmulti-node, multi-GPU implementation which can handleup to 6 million, 128-dimensional vectors. We use efficientmemory layout for all steps to get high performance. TheGPU accelerators are now present on high-end workstationsand low-end laptops. Scalability in the number and dimen-sionality of the vectors, the number of clusters, as well as inthe number of cores available for processing are importantfor usability to different users. Our implementation scaleslinearly or near-linearly with different problem parameters.We achieve up to 2 times increase in speed compared to thebest GPU implementation forK-Means on a single GPU.We obtain a speed up of over 170 on a single Nvidia FermiGPU compared to a standard sequential implementation.We are able to execute one iteration ofK-Means in 136seconds on off-the-shelf GPUs to cluster 6 million vectorsof 128 dimensions into 4K clusters and in 2.5 seconds tocluster 125K vectors of 128 dimensions into 2K clusters.