@inproceedings{bib_Shar_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Shared-Memory Parallel Algorithms for Community Detection in Dynamic Graphs}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2024}}
Community detection is the problem of identifying natural divisions in networks. A relevant challenge in this problem is to find communities on rapidly evolving graphs. In this paper, we present our parallel Dynamic Frontier (DF) approach. Given a batch update of edge deletions or insertions, this approach incrementally identifies an approximate set of affected vertices in the graph with minimal overhead. We apply this approach to both Louvain, a high quality, and Label Propagation Algorithm (LPA), a fast static community detection algorithm. Our approach achieves a mean speedup of 7.3x and 6.7x, when applied to Louvain and LPA respectively, compared to our parallel and optimized implementation of ∆-screening, a recently proposed state-of-the-art approach. Finally, we show how to combine Louvain and LPA with the DF approach to arrive at a hybrid algorithm. This algorithm produces highquality communities while providing a speedup of 2.0x on top of DF-based Louvain.
@inproceedings{bib_Shar_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Shared-Memory Parallel Dynamic Louvain Algorithm for Community Detection}, BOOKTITLE = {International Parallel and Distributed Processing Symposium}. YEAR = {2024}}
Community detection refers to the identification of coherent partitions in networks. In this poster, we present a parallel dynamic Louvain algorithm that finds communities in rapidly evolving graphs. Given a batch update of edge deletions or insertions, our algorithm identifies an approximate set of affected vertices in the graph with minimal overhead and updates the community membership of each vertex. This process repeats until convergence. Our approach achieves a mean speedup of 7.3x, compared to our parallel and optimized implementation of Delta-screening combined with Louvain, a recently proposed stateof-the-art approach.
@inproceedings{bib_Lock_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Hemalatha Eedi, Sathya Peri}, TITLE = {Lock-free Computation of PageRank in Dynamic Graphs}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2024}}
PageRank is a metric that assigns importance to the vertices of a graph based on its neighbors and their scores. Recently, there has been increasing interest in computing PageRank on dynamic graphs, where the graph structure evolves due to edge insertions and deletions. However, traditional barrier-based approaches for updating PageRanks encounter significant wait times on certain graph structures, leading to high overall runtimes. Additionally, the growing trend of multicore architectures with increased core counts has raised concerns about random thread delays and failures. In this study, we propose a lock-free algorithm for updating PageRank scores on dynamic graphs. First, we introduce our Dynamic Frontier (DF) approach, which identifies and processes vertices likely to change PageRanks with minimal overhead. Subsequently, we integrate DF with our lock-free and fault-tolerant PageRank (Alg. DF_LF), incorporating a helping mechanism among threads between its two phases. Experimental results demonstrate that Alg. DF_LF not only eliminates waiting times at iteration barriers but also withstands random thread delays and crashes. On average, it is 4.6x faster than lock-free Naive-dynamic PageRank (Alg. ND_LF).
@inproceedings{bib_Fast_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Fast Leiden Algorithm for Community Detection in Shared Memory Setting}, BOOKTITLE = {International Conference on Parallel Processing}. YEAR = {2024}}
Community detection is the problem of identifying natural divisions in networks. Efficient parallel algorithms for identifying such divisions is critical in a number of applications, where the size of datasets have reached significant scales. This paper presents one of the most efficient implementations of the Leiden algorithm, a high quality community detection method. On a server equipped with dual 16-core Intel Xeon Gold 6226R processors, our Leiden implementation, which we term as GVE-Leiden, outperforms NetworKit Leiden and cuGraph Leiden (running on NVIDIA A100 GPU) by 8.2× and 3.0× respectively — achieving a processing rate of 403M edges/s on a 3.8B edge graph. In addition, GVE-Leiden improves performance at a rate of 1.6× for every doubling of threads.
DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs
@inproceedings{bib_DF*__2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Hemalatha Eedi, Sathya Peri}, TITLE = {DF* PageRank: Incrementally Expanding Approaches for Updating PageRank on Dynamic Graphs}, BOOKTITLE = {International European Conference on Parallel and Distributed Computing (was International Conferenc}. YEAR = {2024}}
PageRank is a widely used centrality measure that assesses the significance of vertices in a graph. Efficiently updating PageRank on dynamic graphs is essential for various applications due to the increasing scale of datasets. This paper introduces our Dynamic Frontier (DF) and Dynamic Frontier with Pruning (DF-P) approaches. Given a batch update comprising edge insertions and deletions, these approaches iteratively identify vertices likely to change their ranks with minimal overhead. On a server featuring a 64-core AMD EPYC-7742 processor, our approaches outperform Static and Dynamic Traversal PageRank by and respectively - on real-world dynamic graphs, and by and on large static graphs with random batch updates. Our approaches scale at a rate of for every doubling of threads.
High-Performance Implementation of Louvain Algorithm with Representational Optimizations
@inproceedings{bib_High_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {High-Performance Implementation of Louvain Algorithm with Representational Optimizations}, BOOKTITLE = {Complex Networs and their Applications}. YEAR = {2024}}
Community detection is the problem of identifying natural divisions in networks. Efficient parallel algorithms for identifying such divisions is critical in a number of applications, where the size of datasets have reached significant scales. This paper presents one of the most efficient multicore implementations of the Louvain algorithm, a high quality community detection method. On a server equipped with dual 16-core Intel Xeon Gold 6226R processors, our Louvain, which we term as GVE-Louvain, outperforms Vite, Grappolo, NetworKit Louvain, and cuGraph Louvain (running on NVIDIA A100 GPU) by 50x, 22x, 20x, and 5.8x faster respectively - achieving a processing rate of 560M edges/s on a 3.8B edge graph. In addition, GVE-Louvain improves performance at an average rate of 1.6x for every doubling of threads.
@inproceedings{bib_Evol_2024, AUTHOR = {Karan Nijhawan, S Rajendraprasad, Subhajit Sahu, Kishore Kothapalli}, TITLE = {EvolvGraph: A Tool for Property-Constrained Generation of Dynamic Graphs}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2024}}
Graphs, including, for example, social networks, collaboration networks, and epidemiological networks, offer a way to represent relationships among entities. Graphs arising from most real-world phenomena are not static and evolve. Dealing with such dynamic graphs requires special algorithms, known as dynamic graph algorithms, that update the required graph analytic based on the change in the current graph instead of recomputing the analytic from scratch. Parallel dynamic graph algorithms are now known for various graph problems such as connectivity, spanning trees, shortest paths, centrality metrics, and community detection. Dynamic algorithms often work in settings where a batch of edge insertions/deletions affects the current graph.
A Fast Parallel Approach for Neighborhood‐Based Link Prediction by Disregarding Large Hubs
@inproceedings{bib_A_Fa_2024, AUTHOR = {Subhajit Sahu, Kishore Kothapalli}, TITLE = {A Fast Parallel Approach for Neighborhood‐Based Link Prediction by Disregarding Large Hubs}, BOOKTITLE = {Concurrency and Computation: Practice and Experience}. YEAR = {2024}}
Link prediction can help rectify inaccuracies in various graph algorithms, stemming from unaccounted-for or overlooked links within networks. However, many existing works use a baseline approach, which incurs unnecessary computational costs due to its high time complexity. Further, many studies focus on smaller graphs, which can lead to misleading conclusions. Here, we study the prediction of links using neighborhood-based similarity measures on large graphs. In particular, we improve upon the baseline approach (IBase), and propose a heuristic approach that additionally disregards large hubs (DLH), based on the idea that high-degree nodes contribute little similarity among their neighbors. On a server equipped with dual 16-core Intel Xeon Gold 6226R processors, DLH is on average 1019x faster than IBase, especially on web graphs and social networks, while maintaining similar prediction accuracy. Notably, DLH achieves a link prediction rate of 38.1M edges/s and improves performance by 1.6x for every doubling of threads.
Accelerating Computer Vision Tasks on GPUs using Ramanujan Graph Product Framework
@inproceedings{bib_Acce_2023, AUTHOR = {Shaik Furqan Ahmed, Konduru Thejasvi, Girish Varma, Kishore Kothapalli}, TITLE = {Accelerating Computer Vision Tasks on GPUs using Ramanujan Graph Product Framework}, BOOKTITLE = {Joint International Conference on Data Science & Management of Data}. YEAR = {2023}}
Sparse neural networks have been proven to generate efficient and better runtimes when compared to dense neural networks. Accelera- tion in runtime is better achieved with structured sparsity. However, generating an efficient sparsity structure to maintain both runtime and accuracy is a challenging task. In this paper, we implement the RBGP4 sparsity pattern derived from the Ramanujan Bipartite Graph Product (RBGP) framework on various Computer Vision tasks and test how well it performs w.r.t accuracy and runtime. Us- ing this approach, we generate structured sparse neural networks which has multiple levels of block sparsity that generates good connectivity due to the presence of Ramanujan bipartite graphs. We benchmark our approach on Semantic Segmentation and Pose Estimation tasks on an edge device (Jetson Nano 2GB) as well as server (V100) GPUs. We compare the results obtained for RBGP4 sparsity pattern with the unstructured and block sparsity patterns. When compared to sparsity patterns like unstructured and block, we obtained significant speedups while maintaining accuracy. KEYWORDS Sparse Neural Networks, Ramanujan Graphs, Semantic Segmenta- tion, Pose Detection ACM Reference Format: Thejasvi Konduru, Furqan Ahmed Shaik, Girish Varma, and Kishore Kotha- palli. 2023. Accelerating Computer Vision Tasks on GPUs using Ramanujan Graph Product Framework. In 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD) (CODS-COMAD 2023), January 4–7, 2023, Mumbai, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3570991.3571044
@inproceedings{bib_Rama_2023, AUTHOR = {V DHARMA TEJA, Girish Varma, Kishore Kothapalli}, TITLE = {Ramanujan bipartite graph products for efficient block sparse neural networks}, BOOKTITLE = {Concurrency and Computation: Practice and Experience}. YEAR = {2023}}
Sparse neural networks are shown to give accurate predictions competitive to denser versions, while also minimizing the number of arithmetic operations performed. However current GPU hardware can only exploit structured sparsity patterns for better efficiency. We propose a framework for generating structured multilevel block sparse neural networks by using the theory of graph products. Our Ramanujan bipartite graph product (RBGP) framework uses products of Ramanujan graphs to obtain the best connectivity for a given level of sparsity. This essentially ensures that the i.) the networks has the structured block sparsity for which runtime efficient algorithms exists, ii.) the model gives high prediction accuracy, due to the better expressive power derived from the connectivity of the graph, iii.) the graph data structure has a succinct representation that can be stored efficiently in memory. We use our framework to design a specific connectivity pattern called RBGP4 which makes efficient use of the memory hierarchy available on GPU. We benchmark our approach on image classification and machine translation tasks with an edge (Jetson Nano 2GB) as well as server (V100) GPUs. When compared with commonly used sparsity patterns like unstructured and block, we obtain significant speedups while achieving the same level of accuracy.
Efficient parallel algorithms for dynamic closeness‐and betweenness centrality
Sai Charan Regunt,Sai Harsh Tondomker,Kshitij Shukla,Kishore Kothapalli
Concurrency and Computation: Practice and Experience, CCPE, 2023
Abs | | bib Tex
@inproceedings{bib_Effi_2023, AUTHOR = {Sai Charan Regunt, Sai Harsh Tondomker, Kshitij Shukla, Kishore Kothapalli}, TITLE = {Efficient parallel algorithms for dynamic closeness‐and betweenness centrality}, BOOKTITLE = {Concurrency and Computation: Practice and Experience}. YEAR = {2023}}
Finding the centrality measures of nodes in a graph is a problem of fundamental importance due to various applications from social networks, biological networks, and transportation networks. Given the large size of such graphs, it is natural to use parallelism as a recourse. Several studies show how to compute the various centrality measures of nodes in a graph on parallel architectures, including multi‐core systems and GPUs. However, as these graphs evolve and change, it is pertinent to study how to update the centrality measures on changes to the underlying graph. In this article, we show novel parallel algorithms for updating the betweenness‐ and closeness‐centrality values of nodes in a dynamic graph. Our algorithms process a batch of updates in parallel by extending the approach of handling a single update for betweenness‐ and closeness‐centrality. For the latter, we also introduce techniques based on
BlockVac: A Universally Acceptable and Ideal Vaccination System on Blockchain
Manika Sharma,Kishore Kothapalli,Sujit P Gujar
International Conference on Blockchain, Blockchain, 2022
@inproceedings{bib_Bloc_2022, AUTHOR = {Manika Sharma, Kishore Kothapalli, Sujit P Gujar}, TITLE = {BlockVac: A Universally Acceptable and Ideal Vaccination System on Blockchain}, BOOKTITLE = {International Conference on Blockchain}. YEAR = {2022}}
2022 IEEE International Conference on Blockchain (Blockchain) BLOC KVAC: A Un iversally Acceptable and Ideal Vaccination System on Blockchain Manik a Sharma Center for Securit y, Theory, and Algorithmic Research Intern ationa l Institute of Inform ation Technology, Hyderabad Gach ibowli, Hyderabad, India 500 032 manik a.sharm a @research.iiit.ac.in Kishore Kothap alli Cent er for Securit y, Theo ry, and Algorithmic Research Interna tional Institute of Information Technology, Hyderabad Gachibowli, Hyderabad , Indi a 500 032 kisho re @iiit.ac .in Sujit Gujar Machin e Learning Lab International Institute of Information Technology, Hyderabad Gachib owli, Hyderabad , India 500 032 suj it.gujar @iiit.ac.in Abstract-Vaccines are essential in pro tecting indivi dua ls and commun ities from the risk of ha rmful an d fatal diseases. The wor ld is migra ting towards the digitization of vacci nation docu- mentation. Still, the current digita l vacci nation is a centra lized mode l, lacking a nnive rsal acceptance. A Universally Acceptable an d Id eal Vacci nat ion System (UAIVS ) sho nld possess (a) a sta ndard vaccination cert ificate ver ificat ion algorithm, (h) p r i- vacy to per sonal in form a tion, (c) sca lab ilit y, (d) s npport lor all vacci nat ions, (e) acco u ntab ility for vaccination information access, and (f) a sco pe for scientific researc h a nd d evelopm en t. Blockc ha in tec hno logy int ro dnces d ecentr alizat ion, immn tability, acco unta bility, per sistence, and liveliness. T hese pro per ties ma ke block cha in an ideal can di da te to solve the existing pro blems in digital vacci nat ion. In this pa per, we prop ose a novel blockchain- base d syste m, n am ely BLO CKVAC, to implem ent a UA IVS . F urt her more , o ur wor k provi des a way to track an individ nal's an d fa mily's vacc ination histo ry, which is esse ntial for sc ient ific research a nd developm ent . BLOCKVAC is highly modular, sca l- a ble, an d cost -efficien t, pr oviding eno ugh roo m for fnt n re feature addi tio ns. The c urre nt works lack cost and sca lab ility ana lysis. WI' show thc scalability of o ur system t hro ugh a th oron gh a nalys is. O ur impl em entation shows t hat add ing one vacc inati on re cor d costs 7 . l(j ~2 ~ x 10- 6 ETH ( ~ U.U26US D) for reg istratio n, 2.5089 x 10- 5 ETH ( ~ 0.07 8USD) for vaccina tion, a nd th e syste m ca n easi ly han dl e np to 20K vaccinations per hon r. Ind ex Terms- B1ockc ha in, Eth ere um, Sm art Contrac ts, Public Acco unta bility, Sca lab ilit y I. I N TR O D UC TI O N The vaccination process, also known as imm unization , was introduced in the late 18th century by the British physician Edward Jenn er against the deadly smallpox virus [ I]. Since then. it has become an inevitab le medical proc edur e, protecting individuals and communities from harmful and fatal diseases. The World Hea lth Organi zation (WHO) estimates that vac- cination saves around 2.5 millio n child lives every year [2]. A record of vacci nation details massively aid s in providing qualit y health care to individuals [3]. Whi le secure storage and verificat ion of vaccination details are challenges even within a country, it is tediou s to merge them from across all the countries [4 ]. Moreover. a definitive process is the need of the hour to verify the required vaccinati on certification for travelers when they enter a country. We wou ld like to than k MeitY and Ripp le for fundin g this research . O ver view of existing vac cina tion mod els: Traditional paper- based vaccination records [5] are prone to challenges such as the possibil ity of losing, damaging, or forging the records without any accountability. Digital solutions replace the need for paper, provide an immutable card, and reduc e manual labor [6]. However, the current vacci nation systems follow a centralized model, where the users need to trust the cent ral autho rity. Further. these sys tems are prone to single-point failures and jeopardize data security P 1. Moreover, current vaccination portal s mainly deal with just slot bookings and issuin g non -verifiab le vaccination certificates. Mo tiva tion: Th e Covid-19 pandemic has caused a digital revolution in vaccination data managem ent. Countri es are shifting towards a digital solution to secure vaccination data ami issue certificates. However, such a system should be universally acceptabl e. We identi fy the properties requ ired for such a sys tem and call it a Universally Acceptable and Ideal Vaccination System (UA IVS). We furt her categorize the proper- ties of UAIVS into two subcategories: Universally Acceptable Vaccination System and Ideal Va ccination System. A Universally Acce ptable Vaccination System co
Shared-Memory Parallel Algorithms for Fully Dynamic Maintenance of 2-Connected Components
Chirayu Anant Haryan, G. Ramakrishna,Kishore Kothapalli,Dip Sankar Banerjee
International Parallel and Distributed Processing Symposium, IPDPS, 2022
@inproceedings{bib_Shar_2022, AUTHOR = {Chirayu Anant Haryan, G. Ramakrishna, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Shared-Memory Parallel Algorithms for Fully Dynamic Maintenance of 2-Connected Components}, BOOKTITLE = {International Parallel and Distributed Processing Symposium}. YEAR = {2022}}
Finding the biconnected components of a graph has a large number of applications in many other graph problems including planarity testing, computing the centrality metrics, finding the (weighted) vertex cover, coloring, and the like. Recent years saw the design of efficient algorithms for this problem across sequential and parallel computational models. However, current algorithms do not work in the setting where the underlying graph changes over time in a dynamic manner via the insertion or deletion of edges. Dynamic algorithms in the sequential setting that obtain the biconnected components of a graph upon insertion or deletion of a single edge are known from over two decades ago. Parallel algorithms for this problem are not heavily studied. In this paper, we design shared-memory parallel algorithms that obtain the biconnected components of a graph subsequent to the insertion or deletion of a batch of edges
Dynamic Batch Parallel Algorithms for Updating PageRank
Subhajit Sahu,Kishore Kothapalli,Dip Sankar Banerjee
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2022
@inproceedings{bib_Dyna_2022, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Dynamic Batch Parallel Algorithms for Updating PageRank}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2022}}
The design and implementation of parallel algorithms for dynamic graph problems is attracting significant research attention in the recent years, driven by numerous applications to social network analysis, neuroscience, and protein interaction networks. One such problem is the computation of PageRank values of vertices in a directed graph. This paper presents two new parallel algorithms for recomputing the PageRank values of vertices in a dynamic graph. Our techniques require the recomputation of the PageRank of only the vertices affected by the insertion/deletion of a batch of edges. We conduct detailed experimental studies of our algorithm on a set of 11 realworld graphs. Our results on Intel Xeon Silver 4116 CPU and NVIDIA Tesla V100 PCIe 16GB GPU indicate that our algorithms outperform static and dynamic update algorithms by 6.1×and 8.6×on the CPU, and by 9.8×and 9.3×on the GPU respectively. We also compare the performance of the algorithms in batched mode to cumulative single-edge updates.
Efficient Parallel Algorithms for Computing Percolation Centrality
C Athreya,Sayantan Jana,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {C Athreya, Sayantan Jana, Kishore Kothapalli}, TITLE = {Efficient Parallel Algorithms for Computing Percolation Centrality}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2021}}
Centrality measures on graphs have found appli- cations in a large number of domains including modeling the spread of an infection/disease, social network analysis, and transportation networks. As a result, parallel algorithms for computing various centrality metrics on graphs are gaining significant research attention in recent years. In this paper, we study parallel algorithms for the percolation centrality measure which extends the betweenness-centrality measure by incorporating a time dependent state variable with every node. We present parallel algorithms that compute the source-based and source-destination variants of the percolation centrality values of nodes in a network. Our algorithms extend the algorithm of Brandes, introduce optimizations aimed at exploiting the structural properties of graphs, and extend the algorithmic techniques introduced by Sariyuce et al. [26] in the context of centrality computation. Experimental studies of our algorithms on an Intel Xeon(R) Silver 4116 CPU and an Nvidia Tesla V100 GPU on a collection of 12 real-world graphs indicate that our algorithmic techniques offer a significant speedup.
Efficient Distributed Algorithms in the k-machine model via PRAM Simulations
John Augustine,Kishore Kothapalli,Gopal Pandurangan
International Parallel and Distributed Processing Symposium, IPDPS, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {John Augustine, Kishore Kothapalli, Gopal Pandurangan}, TITLE = {Efficient Distributed Algorithms in the k-machine model via PRAM Simulations }, BOOKTITLE = {International Parallel and Distributed Processing Symposium}. YEAR = {2021}}
We study several fundamental problems in the kmachine model, a message-passing model for large-scale distributed computations where k 2 machines jointly perform computations on a large input of size N, (typically, N k). The input is initially partitioned (randomly or in a balanced fashion) among the k machines, a common implementation in many real-world systems. Communication is point-to-point, and the goal is to minimize the number of communication rounds of the computation. Our main result is a general technique for designing efficient deterministic distributed algorithms in the k-machine model using PRAM algorithms. Our technique works by efficiently simulating PRAM algorithms in the k-machine model in a deterministic way. This simulation allows us to arrive at new algorithms in the k-machine model for some problems for which no efficient k-machine algorithms are known before and also improve on existing results in the k-machine model for some problems. While our simulation allows us to obtain k-machine algorithms for any problem with a known PRAM algorithm, we mainly focus on graph problems. For an input graph on n vertices and m edges, we obtain O˜(m/k2) round4 algorithms for various graph problems such as r–connectivity for r = 1, 2, 3, 4, minimum spanning tree (MST), maximal independent set (MIS), (Δ+1)-coloring, maximal matching, ear decomposition, and spanners under the assumption that the edges of the input graph are partitioned (randomly, or in an arbitrary, but balanced, fashion) among the k machines. For problems such as connectivity and MST, the above bound is (essentially) the best possible (up to logarithmic factors). Our simulation technique allows us to obtain the first known efficient deterministic algorithms in the k-machine model for other problems with known deterministic PRAM algorithms.
Effect of stepwise adjustment of Damping factor upon PageRank
Subhajit Sahu,Kishore Kothapalli,Dip Sankar Banerjee
Technical Report, arXiv, 2021
@inproceedings{bib_Effe_2021, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Effect of stepwise adjustment of Damping factor upon PageRank}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
The effect of adjusting damping factor α, from a small initial value α0 to the final desired αf value, upon then iterations needed for PageRank computation is observed. Adjustment of the damping factor is done in one or more steps. Results show no improvement in performance over a fixed damping factor based PageRank.
Adjusting PageRank parameters and Comparing results
Subhajit Sahu,Kishore Kothapalli,Dip Sankar Banerjee
Technical Report, arXiv, 2021
@inproceedings{bib_Adju_2021, AUTHOR = {Subhajit Sahu, Kishore Kothapalli, Dip Sankar Banerjee}, TITLE = {Adjusting PageRank parameters and Comparing results}, BOOKTITLE = {Technical Report}. YEAR = {2021}}
— The effect of adjusting damping factor α and tolerance τ on iterations needed for PageRank computation is studied here. Relative performance of PageRank computation with L1, L2, and L∞ norms used as convergence check, are also compared with six possible mean ratios. It is observed that increasing the damping factor α linearly increases the iterations needed almost exponentially. On the other hand, decreasing the tolerance τ exponentially decreases the iterations needed almost exponentially. On average, PageRank with L∞ norm as convergence check is the fastest, quickly followed by L2 norm, and then L1 norm. For large graphs, above certain tolerance τ values, convergence can occur in a single iteration. On the contrary, below certain tolerance τ values, sensitivity issues can begin to appear, causing computation to halt at maximum iteration limit without convergence. The six mean ratios for relative performance comparison are based on arithmetic, geometric, and harmonic mean, as well as the order of ratio calculation. Among them GM-RATIO, geometric mean followed by ratio calculation, is found to be most stable, followed by AM-RATIO.
Efficient Parallel Algorithms for Computing Percolation Centrality
Athreya Chandramouli,Sayantan Jana,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2021
@inproceedings{bib_Effi_2021, AUTHOR = {Athreya Chandramouli, Sayantan Jana, Kishore Kothapalli}, TITLE = {Efficient Parallel Algorithms for Computing Percolation Centrality}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2021}}
— Centrality measures on graphs have found applications in a large number of domains including modeling the spread of an infection/disease, social network analysis, and transportation networks. As a result, parallel algorithms for computing various centrality metrics on graphs are gaining significant research attention in recent years. In this paper, we study parallel algorithms for the percolation centrality measure which extends the betweenness-centrality measure by incorporating a time dependent state variable with every node. We present parallel algorithms that compute the source-based and source-destination variants of the percolation centrality values of nodes in a network. Our algorithms extend the algorithm of Brandes, introduce optimizations aimed at exploiting the structural properties of graphs, and extend the algorithmic techniques introduced by Sariyuce et al. [26] in the context of centrality computation. Experimental studies of our algorithms on an Intel Xeon(R) Silver 4116 CPU and an Nvidia Tesla V100 GPU on a collection of 12 real-world graphs indicate that our algorithmic techniques offer a significant speedup.
Message from the General Co-chairs
Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2021
@inproceedings{bib_Mess_2021, AUTHOR = {Kishore Kothapalli}, TITLE = {Message from the General Co-chairs}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2021}}
Welcome to virtual HiPC 2021, the 28th edition of the IEEE International Conference on High Performance Computing, Data, and Analytics. It serves as a forum to present current work by accomplished HPC researchers from around the globe, to explore developing topics in the field, and to highlight related activities in Asia.
Efficient Parallel Algorithms for Betweenness- and Closeness-Centrality in Dynamic Graphs
Shukla Kshitij Prem Sarita,Sai Harsh Tondomker,Sai Charan Regunta,Kishore Kothapalli
ACM International Conference on Supercomputing, ICS, 2020
@inproceedings{bib_Effi_2020, AUTHOR = {Shukla Kshitij Prem Sarita, Sai Harsh Tondomker, Sai Charan Regunta, Kishore Kothapalli}, TITLE = {Efficient Parallel Algorithms for Betweenness- and Closeness-Centrality in Dynamic Graphs}, BOOKTITLE = {ACM International Conference on Supercomputing}. YEAR = {2020}}
Finding the centrality measures of nodes in a graph is a problem of fundamental importance due to various applications from social networks, biological networks, and transportation networks. Given the large size of such graphs, it is natural to use parallelism as a recourse. There have been several studies that show how to compute the various centrality measures of nodes in a graph on parallel architectures, including multi-core systems and GPUs. However, as these graphs evolve and change, it is pertinent to study how to update the centrality measures on changes to the underlying graph. In this paper, we show novel parallel algorithms for updating the betweenness- and closeness-centrality values of nodes in a dynamic graph. Our algorithms process a batch of updates in parallel by extending the approach of handling a single update for etweennessand closeness-centrality by Jamour et al. [16] and Sarıyüce et al. [27], respectively. Besides, our algorithms incorporate mechanisms to exploit the structural properties of graphs for enhanced performance.
Sample-and-Gather: Fast Ruling Set Algorithms in the Low-Memory MPC Model
Kishore Kothapalli,Shreyas Pai,Sriram V. Pemmaraju
@inproceedings{bib_Samp_2020, AUTHOR = {Kishore Kothapalli, Shreyas Pai, Sriram V. Pemmaraju}, TITLE = {Sample-and-Gather: Fast Ruling Set Algorithms in the Low-Memory MPC Model}, BOOKTITLE = {DROPS}. YEAR = {2020}}
Motivated by recent progress on symmetry breaking problems such as maximal independent set (MIS) and maximal matching in the low-memory Massively Parallel Computation (MPC) model (e.g., Behnezhad et al.~PODC 2019; Ghaffari-Uitto SODA 2019), we investigate the complexity of ruling set problems in this model. The MPC model has become very popular as a model for large-scale distributed computing and it comes with the constraint that the memory-per-machine is strongly sublinear in the input size. For graph problems, extremely fast MPC algorithms have been designed assuming Ω~(n) memory-per-machine, where n is the number of nodes in the graph (e.g., the O(loglogn) MIS algorithm of Ghaffari et al., PODC 2018). However, it has proven much more difficult to design fast MPC algorithms for graph problems in the low-memory MPC model, where the memory-per-machine is restricted to being strongly sublinear in the number of nodes, i.e., $O(n^eps)$ for $0 < eps < 1$. In this paper, we present an algorithm for the 2-ruling set problem, running in O~(log1/6Δ) rounds whp, in the low-memory MPC model. We then extend this result to β-ruling sets for any integer β>1. Specifically, we show that a β-ruling set can be computed in the low-memory MPC model with $O(n^eps)$ memory-per-machine in O~(β⋅log1/(2β+1−2)Δ) rounds, whp. From this it immediately follows that a β-ruling set for β=Ω(logloglogΔ)-ruling set can be computed in in just O(βloglogn) rounds whp. The above results assume a total memory of $tilde{O}(m + n^{1+eps})$. We also present algorithms for β-ruling sets in the low-memory MPC model assuming that the total memory over all machines is restricted to O~(m).
Dynamic Block Sparse Reparameterization of Convolutional Neural Networks
V DHARMA TEJA,Girish Varma,Kishore Kothapalli
International Conference on Computer Vision Workshops, ICCV-W, 2019
@inproceedings{bib_Dyna_2019, AUTHOR = {V DHARMA TEJA, Girish Varma, Kishore Kothapalli}, TITLE = {Dynamic Block Sparse Reparameterization of Convolutional Neural Networks}, BOOKTITLE = {International Conference on Computer Vision Workshops}. YEAR = {2019}}
Sparse neural networks are efficient in both memory and compute when compared to dense neural networks. But on parallel hardware such as GPU, sparse neural networks result in small or no runtime performance gains. On the other hand, structured sparsity patterns like filter, channel and block sparsity result in large performance gains due to regularity induced by structure. Among structured sparsities, block sparsity is a generic structured sparsity pattern with filter and channel sparsity being sub cases of block sparsity. In this work, we focus on block sparsity and generate efficient block sparse convolutional neural networks using our approach DBSR (Dynamic block sparse reparameterization). Our DBSR approach, when applied on image classification task over Imagenet dataset, decreases parameters and FLOPS of ResneXt50 by a factor of 2x with only increase of 0.48 in Top-1 error. And when extended to the task of semantic segmentation, our approach reduces parameters and FLOPS by 30% and 20% respectively with only 1% decrease in mIoU for ERFNet over Cityscapes dataset
Symmetric Key based Secure Resource Sharing
Bruhadeshwar Bezawada,Kishore Kothapalli,Dugyala Raman,Rui Li
International Symposium on Security in Computing and Communication, SSCC, 2019
@inproceedings{bib_Symm_2019, AUTHOR = {Bruhadeshwar Bezawada, Kishore Kothapalli, Dugyala Raman, Rui Li}, TITLE = {Symmetric Key based Secure Resource Sharing}, BOOKTITLE = {International Symposium on Security in Computing and Communication}. YEAR = {2019}}
We focus on the problem of symmetric key distribution for securing shared resources among large groups of users in distributed applications like cloud storage, shared databases, and collaborative editing, among others. In such applications, resources such as data, are sensitive in nature and it is necessary that only authorized users are allowed access without the presence of on-line monitoring system. The de-facto approach is to encrypt a shared resource and deploy a key distribution mechanism, which enables only authorized users to generate the respective decryption key for the resource. The key distribution approach has two major challenges: first, the applications are dynamic i.e., users might join and leave arbitrarily, and second, for a large number of users, it is required that the cryptographic technique be scalable and efficient. In this work, we describe an approach that overcomes these challenges by using two key techniques: first, flattening the access structure and applying efficient symmetric key distribution techniques. By flattening the access structure, we reduce the problem to that of key distribution of a resource among all the users sharing that resource. We consider this smaller flattened access structure and devise a unified key distribution technique that is sufficient for key distribution across all such structures. Our key distribution techniques have an important feature of a public secret and a private secret, which allows the group controller to publish updates to the keying material using the public secret and therefore, does not necessitate the users to be in constant communication with the group controller. Using this model we describe two efficient key distribution techniques that scale logarithmically with the group size and also handle group additions and removals. Furthermore, a user can be off-line for any amount of time and need not be aware of the dynamics of the system, which is important as it overcomes the problems posed by lossy channels. We have performed an experimental evaluation of our scheme against a popular existing scheme and show that they perform better for this scheme with the same security guarantees. As our approaches are easy to implement they are especially suitable for practical applications where security is viewed as an overhead rather than as a necessity.
BRICS – Efficient Techniques for Estimating the Farness-Centrality in Parallel
Sai Charan Regunta,Sai Harsh Tondomker,Kishore Kothapalli
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2019
@inproceedings{bib_BRIC_2019, AUTHOR = {Sai Charan Regunta, Sai Harsh Tondomker, Kishore Kothapalli}, TITLE = {BRICS – Efficient Techniques for Estimating the Farness-Centrality in Parallel}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2019}}
In this paper, we study scalable parallel algorithms for estimating the farness-centrality value of the nodes in a given undirected and connected graph. Our algorithms consider approaches that are more suitable for sparse graphs. To this end, we propose four optimization techniques based on removing redundant nodes, removing identical nodes, removing chain nodes, and making use of decomposition based on the biconnected components of the input graph.We test our techniques on a collection of real-world graphs for the time taken and the average error percentage. We further analyze the applicability of our techniques on various classes of real-world graphs. We suggest why certain techniques work better on certain classes of graphs
Efficient Sparse Neural Networks Using Regularized Multi Block Sparsity Pattern on a GPU
V DHARMA TEJA,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2019
@inproceedings{bib_Effi_2019, AUTHOR = {V DHARMA TEJA, Kishore Kothapalli}, TITLE = {Efficient Sparse Neural Networks Using Regularized Multi Block Sparsity Pattern on a GPU}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2019}}
A large portion of the computation in sparse neural networks comprises of multiplying a sparse matrix with a dense matrix, denoted SDMM in this paper. The SDMM operation with an unstructured sparsity pattern cannot be efficiently processed on modern architectures such as GPUs due to irregularity in compute and memory accesses. However, efficient parallel algorithms on a GPU can be designed for SDMM when the sparsity pattern is more structured. Thus, the run time performance of sparse neural networks on a GPU is dependent on the sparsity pattern present in the underlying matrix.In sparse neural networks obtained using pruning based approaches, the choice of sparsity pattern not only effects the run time, but also effects the accuracy of the task for which the neural network is trained for. Sparsity patterns which have a good run time performance on a GPU may not have a good accuracy and vice-versa. The real challenge then is to given a target architecture, identify a sparsity pattern, a storage format,and an algorithm for SDMM that leads to sparse neural networks which are efficient in both run time and accuracy.In this work, we propose a novel, structured, flexible, and generic sparsity pattern called the RMB (Regularized MultiBlock) sparsity pattern, and an efficient storage format (CRMB),and a fast GPU algorithm for processing RMBMM (SDMMwith the multiplicand having RMB sparsity pattern). Using theRMB sparsity pattern, we achieve better trade-offs between the accuracy, and the run time performance of sparse neural networks on a GPU when compared to commonly used sparsity patterns like unstructured, and block sparsity patterns.
BRICS – Efficient Techniques for Estimating the Farness-Centrality in Parallel
Sai Charan Regunta,Tondomker Sai Harsh,Kishore Kothapalli
International Symposium on Parallel and Distributed Processing Workshops and Phd Forum , IPDPSW, 2019
@inproceedings{bib_BRIC_2019, AUTHOR = {Sai Charan Regunta, Tondomker Sai Harsh, Kishore Kothapalli}, TITLE = {BRICS – Efficient Techniques for Estimating the Farness-Centrality in Parallel}, BOOKTITLE = {International Symposium on Parallel and Distributed Processing Workshops and Phd Forum }. YEAR = {2019}}
In this paper, we study scalable parallel algorithms for estimating the farness-centrality value of the nodes in a given undirected and connected graph. Our algorithms consider approaches that are more suitable for sparse graphs. To this end, we propose four optimization techniques based on removing redundant nodes, removing identical nodes, removing chain nodes, and making use of decomposition based on the biconnected components of the input graph. We test our techniques on a collection of real-world graphs for the time taken and the average error percentage. We further analyze the applicability of our techniques on various classes of real-world graphs. We suggest why certain techniques work better on certain classes of graphs.
Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs
SHALEEN GARG,Kishore Kothapalli,Venkata Suresh Reddy Purini
INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, HiPCDA, 2018
@inproceedings{bib_Shar_2018, AUTHOR = {SHALEEN GARG, Kishore Kothapalli, Venkata Suresh Reddy Purini}, TITLE = {Share-a-GPU: Providing Simple and Effective Time-Sharing on GPUs}, BOOKTITLE = {INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS}. YEAR = {2018}}
Time-sharing, which allows for multiple users to use a shared resource, is an important and fundamental aspect of modern computing systems. However, accelerators such as GPUs, that come without a native operating system do not support time sharing. The inability of accelerators to support time-sharing limits their applicability especially as they are getting deployed in Platform-as-a-Service and Resource-as-a-Service environments. In the former, elastic demands may require preemption where as in the latter, fine-grained economic models of service cost can be supported with time sharing. In this paper, we extend the concept of time sharing to the GPGPU computational space using a cooperative multitasking approach. Our technique is applicable to any GPGPU program written in Compute Unified Device Architecture (CUDA) API provided for C/C++ programming languages. With minimal support from the programmer, our framework incorporates process scheduling, light-weight memory management, and multiGPU support. Our framework provides an abstraction where, in a round-robin manner, every workload can use a GPU(s) over a time quantum exclusively. We demonstrate the applicability of our scheduling framework by running many workloads concurrently in a time sharing manner
Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems
DEBARSHI DUTTA,MEHER CHAITANYA P,Kishore Kothapalli,Debajyoti Bera
International Journal of Networking and Computing, IJNC, 2018
@inproceedings{bib_Appl_2018, AUTHOR = {DEBARSHI DUTTA, MEHER CHAITANYA P, Kishore Kothapalli, Debajyoti Bera}, TITLE = {Applications of Ear Decomposition to Efficient Heterogeneous Algorithms for Shortest Path/Cycle Problems}, BOOKTITLE = {International Journal of Networking and Computing}. YEAR = {2018}}
Graph algorithms play an important role in several fields of sciences and engineering. Prominent among them are the All-Pairs-Shortest-Paths (APSP) and related problems. Indeed there are several efficient implementations for such problems on a variety of modern multi- and manycore architectures. It can be noticed that for several graph problems, parallelism offers only a limited success as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, some of these graphs exhibit clear structural properties due to their sparsity. This calls for particular solution strategies aimed at scalable processing of large, sparse graphs on modern parallel architectures. In this paper, we study the applicability of an ear decomposition of graphs to problems such as all-pairs-shortest-paths and minimum cost cycle basis. Through experimentation, we show that the resulting solutions are scalable in terms of both memory usage and also their speedup over best known current implementations. We believe that our techniques have the potential to be relevant for designing scalable solutions for other computations on large sparse graphs.
Expediting Parallel Graph Connectivity Algorithms
MIHIR WADWEKAR,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2018
@inproceedings{bib_Expe_2018, AUTHOR = {MIHIR WADWEKAR, Kishore Kothapalli}, TITLE = {Expediting Parallel Graph Connectivity Algorithms}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2018}}
Finding whether a graph is k-connected, and the identification of its k-connected components is a fundamental problem in graph theory. For this reason, there have been several algorithms for this problem in both the sequential and parallel settings. Several recent sequential and parallel algorithms fork-connectivity rely on one or more breadth-first traversals of the input graph.While BFS can be made very efficient in a sequential setting,the same cannot be said in the case of parallel environments. A major factor in this difficulty is due to the inherent requirement to use a shared queue, balance work among multiple threads in every round, synchronization, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. For this reason, it can be noticed that the time spent by the current parallel graph connectivity algorithms on BFS operations is usually a significant portion of their overall runtime.In this paper, we study how one can, in the context of algorithms for graph connectivity, mitigate the practical in efficiency of relying on BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but actually can work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing stepson suitably defined small auxiliary graphs. Our experiments on finding the 2, and 3-connectivity of graphs on Nvidia K40c GPUs improve the state-of-the-art on the corresponding problems by a factor 2.2x, and 2.1x respectively
Parallelizing Hines matrix solver in neuron simulations on GPU
V DHARMA TEJA,Kishore Kothapalli,Upinder S Bhalla
International Conference on High Performance Computing, HiPC, 2017
@inproceedings{bib_Para_2017, AUTHOR = {V DHARMA TEJA, Kishore Kothapalli, Upinder S Bhalla}, TITLE = {Parallelizing Hines matrix solver in neuron simulations on GPU}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2017}}
Hines matrices arise in the simulations of mathematical models describing initiation and propagation of action potentials in a neuron. In this work, we exploit the structural properties of Hines matrices and design a scalable, linear work, recursive parallel algorithm for solving a system of linear equations where the underlying matrix is a Hines matrix, using the Exact Domain Decomposition Method (EDD). We give a general form for representing a Hines matrix and use the general form to prove that the intermediate matrix obtained via the EDD has the same structural properties as that of a Hines matrix. Using the above observation, we propose a novel decomposition strategy called fine decomposition which is suitable for a GPU architecture. Our algorithmic approach R-FINE-TPT based on fine decomposition outperforms the previously known approach in all the cases and gives a speedup of 2.5x on average for a variety of input neuron morphologies. We further perform experiments to understand the behaviour of R-FINE-TPT approach and show its robustness. We also employ a machine learning technique called linear regression to effectively guide recursion in our algorithm.
GPUScheduler: User Level Preemptive Scheduling for NVIDIA GPUs
SHALEEN GARG,Kishore Kothapalli,Venkata Suresh Reddy Purini
International Conference on High Performance Computing, HiPC, 2017
@inproceedings{bib_GPUS_2017, AUTHOR = {SHALEEN GARG, Kishore Kothapalli, Venkata Suresh Reddy Purini}, TITLE = {GPUScheduler: User Level Preemptive Scheduling for NVIDIA GPUs}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2017}}
In this paper, we present a novel approach to extend the concepts of time sharing and preemption to GPGPU computational space. Our technique is applicable to any batch GPGPU program written in Compute Unified Device Architecture (CUDA) API provided for C/C++ programming languages. It also gives the user freedom to use different scheduling algorithms to schedule the GPGPU programs satisfying specific system level agreements. Our easy-to-use and minimal API makes it very easy for users to run their programs using the GPUScheduler.
A fast GPU algorithm for biconnected components
MIHIR WADWEKAR,Kishore Kothapalli
International Conference on Contemporary Computing, IC3, 2017
@inproceedings{bib_A_fa_2017, AUTHOR = {MIHIR WADWEKAR, Kishore Kothapalli}, TITLE = {A fast GPU algorithm for biconnected components}, BOOKTITLE = {International Conference on Contemporary Computing}. YEAR = {2017}}
Finding the articulation points and the biconnected components of an undirected graph has been a problem of huge interest in graph theory. Over the years, several sequential and parallel algorithms have been presented for this problem. Our paper here presents and implements a fast parallel algorithm on GPU which is to the best of our knowledge the first such attempt and also the fastest implementation across architectures. The implementation is on an average 4× faster then the next best implementation. We also apply an edge-pruning technique which results in a further 2× speedup for dense graphs
A study of graph decomposition algorithms for parallel symmetry breaking
SAYYAD NAYYARODDEEN,MAHAK GAMBHIR,Kishore Kothapalli
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2017
@inproceedings{bib_A_st_2017, AUTHOR = {SAYYAD NAYYARODDEEN, MAHAK GAMBHIR, Kishore Kothapalli}, TITLE = {A study of graph decomposition algorithms for parallel symmetry breaking}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2017}}
Parallel algorithms on large graphs play a prominent role in various problems from several domains of sciences and engineering. Of these, symmetry breaking problems such as matchings and colorings are fundamental owing to a large number of applications. Algorithms for symmetry breaking problems are studied in several parallel/distributed settings over the decades. However, as the size of graphs corresponding to real-world phenomenon increase, it is imperative to use not only just parallelism but also algorithmic enhancements based on the structure of real-world graphs. A popular approach in this context is to use a graph decomposition to break the problem into multiple subproblems the solutions of which can be used to solve the original problem. A big question in this direction that is left unanswered by current research is to study which decomposition is appropriate for a given computation and a target architecture. In this context, we address the above question with respect to three problems: Maximal Matching (MM), Vertex Coloring (COLOR), and Maximal Independent Set (MIS). For these three problems, we study three different decomposition techniques including one based on bridges, one based on a random partitioning, and one based on vertices of degree k for a given k. We show that existing algorithms for the above computations on a multi-core CPU and a GPU can be significantly improved by making use of an appropriate decomposition of the input graph. Our study indicates that the exact decomposition to use depends more on the problem and is largely independent of target architecture between the CPU and the GPU.
Nearly Balanced Work Partitioning for Heterogeneous Algorithms
Hardhik Mallipeddia,Dip Sankar Banerjee,Kiran Raj Ramamoorthy,Srinathan Kannan,Kishore Kothapalli
International Conference on Parallel Processing, ICPP, 2017
@inproceedings{bib_Near_2017, AUTHOR = {Hardhik Mallipeddia, Dip Sankar Banerjee, Kiran Raj Ramamoorthy, Srinathan Kannan, Kishore Kothapalli}, TITLE = {Nearly Balanced Work Partitioning for Heterogeneous Algorithms}, BOOKTITLE = {International Conference on Parallel Processing}. YEAR = {2017}}
The architectural trend towards heterogeneity has pushed heterogeneous computing to the fore of parallel computing research. Heterogeneous algorithms, often carefully handcrafted, have been designed for several important problems from parallel computing such as sorting, graph algorithms, matrix computations, and the like. A majority of these algorithms follow a work partitioning approach where the input is divided into appropriate sized parts so that individual devices can process the “right-sized” parts of the input. However, arriving at a good work partitioning is usually non-trivial and may crucially depend also on the input instance apart from the heterogeneous algorithm used. An extensive empirical search is not helpful as such an empirical search can potentially offset any gains accrued out of heterogeneous algorithms. Other recently proposed approaches too are in general inadequate. In this paper, we propose a simple and effective technique for work partitioning in the context of heterogeneous algorithms. Our technique is based on randomized sampling and therefore can adapt to both the algorithm used and the input instance. Our technique is generic in its applicability as we will demonstrate in this paper. We validate our technique on three problems: multiplying two unstructured sparse matrices (spmm), multiplying two scale-free sparse matrices, and finding the connected components of a graph (CC). For these problems, we show that using our method, we can find the required threshold that is under 10% away from the best possible thresholds respectively.
Efficient parallel ear decomposition of graphs with application to betweenness-centrality
CHARUDATT RAJENDRA PACHORKAR,MEHER CHAITANYA P,Kishore Kothapalli,Debajyoti Bera
International Conference on High Performance Computing, HiPC, 2016
@inproceedings{bib_Effi_2016, AUTHOR = {CHARUDATT RAJENDRA PACHORKAR, MEHER CHAITANYA P, Kishore Kothapalli, Debajyoti Bera}, TITLE = {Efficient parallel ear decomposition of graphs with application to betweenness-centrality}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2016}}
Parallel graph algorithms continue to attract a lot of research attention given their applications to several fields of sciences and engineering. Efficient design and implementation of graph algorithms on modern manycore accelerators has to however contend with a host of challenges including not being able to reach full memory system throughput and irregularity. Of late, focusing on real-world graphs, researchers are addressing these challenges by using decomposition and preprocessing techniques guided by the structural properties of such graphs. In this direction, we present a new GPU algorithm for obtaining an ear decomposition of a graph. Our implementation of the proposed algorithm on an NVidia Tesla K40c improves the state-of-the-art by a factor of 2.3x on average on a collection of real-world and synthetic graphs. The improved performance of our algorithm is due to our proposed characterization that identifies edges of the graph as redundant for the purposes of an ear decomposition. We then study an application of the ear decomposition of a graph in computing the betweenness-centrality values of nodes in the graph. We use an ear decomposition of the input graph to systematically remove nodes of degree two. The actual computation of betweenness-centrality is done on the remaining nodes and the results are extended to nodes removed in the previous step. We show that this approach improves the state-of-the-art for computing betweenness-centrality on an NVidia K40c GPU by a factor of 1.9x on average over a collection of real-world graphs.
STIC-D: Algorithmic techniques for efficient parallel PageRank computation on Real-World Graphs
PARITOSH GARG,Kishore Kothapalli
International Conference on Distributed Computing and Networking, ICDCN, 2016
@inproceedings{bib_STIC_2016, AUTHOR = {PARITOSH GARG, Kishore Kothapalli}, TITLE = {STIC-D: Algorithmic techniques for efficient parallel PageRank computation on Real-World Graphs}, BOOKTITLE = {International Conference on Distributed Computing and Networking}. YEAR = {2016}}
Computing metrics on nodes of a graph is an essential step in understanding the properties of the graph. Pagerank is one such metric that is popular and is being used to measure the importance of nodes in not only web graphs but also in social networks, biological networks, road networks, and the like. The core of the computation of pagerank can be seen as an iterative approach that updates the pageranks of nodes until the values converge.
Efficient multicore algorithms for identifying biconnected components
MEHER CHAITANYA P,Kishore Kothapalli
International Journal of Networking and Computing, IJNC, 2016
@inproceedings{bib_Effi_2016, AUTHOR = {MEHER CHAITANYA P, Kishore Kothapalli}, TITLE = {Efficient multicore algorithms for identifying biconnected components}, BOOKTITLE = {International Journal of Networking and Computing}. YEAR = {2016}}
In this paper we design and implement an algorithm for finding the biconnected components of a given graph. Our algorithm is based on experimental evidence that finding the bridges of a graph is usually easier and faster in the parallel setting. We use this property to first decompose the graph into independent and maximal 2-edge-connected subgraphs. To identify the articulation points in these 2-edge connected subgraphs, we again convert this into a problem of finding the bridges on an auxiliary graph. It is interesting to note that during the conversion process, the size of the graph may increase. However, we show that this small increase in size and the run time is offset by the consideration that finding bridges is easier in a parallel setting. We implement our algorithm on an Intel i7 980X CPU running 12 threads. We show that our algorithm is on average 2.45 x faster than the best known current algorithms implemented on the same platform. Finally, we extend our approach to dense graphs by applying the sparsification technique suggested by Cong and Bader
Reporting and counting maximal points in a query orthogonal rectangle
Ananda Swarup Das,Prosenjit Gupta,Kishore Kothapalli,Srinathan Kannan
Journal on Discrete Alogorithms, JDA, 2015
@inproceedings{bib_Repo_2015, AUTHOR = {Ananda Swarup Das, Prosenjit Gupta, Kishore Kothapalli, Srinathan Kannan}, TITLE = {Reporting and counting maximal points in a query orthogonal rectangle}, BOOKTITLE = {Journal on Discrete Alogorithms}. YEAR = {2015}}
In this work we show that given a set S of n points with coordinates on an n × n grid, we can construct data structures for (i) reporting and (ii) counting the maximal points in an axes-parallel query rectangle in sub-logarithmic time. We assume our model of computation to be the word RAM with size of each word being log n bits.
A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse Matrices
S R KIRAN RAJ,DIP SANKAR BANERJEE,Srinathan Kannan,Kishore Kothapalli
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2015
@inproceedings{bib_A_No_2015, AUTHOR = {S R KIRAN RAJ, DIP SANKAR BANERJEE, Srinathan Kannan, Kishore Kothapalli}, TITLE = {A Novel Heterogeneous Algorithm for Multiplying Scale-Free Sparse Matrices}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2015}}
Multiplying two sparse matrices, denoted spmm, is a fundamental operation in linear algebra with several applications. Hence, efficient and scalable implementation of spmm has been a topic of immense research. Recent efforts are aimed at implementations on GPUs, multicore architectures, FPGAs, and such emerging computational platforms. Owing to the highly irregular nature of spmm, it is observed that GPUs and CPUs can offer comparable performance. In this paper, we study CPU+GPU heterogeneous algorithms for spmm where the matrices exhibit a scale-free nature. Focusing on such matrices, we propose an algorithm that multiplies two sparse matrices exhibiting scale-free nature on a CPU+GPU heterogeneous platform. Our experiments on a wide variety of real-world matrices from standard datasets show an average of 25% improvement over the best possible algorithm on a CPU+GPU heterogeneous platform. We show that our approach is both architecture-aware, and workload-aware.
Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures
DIP SANKAR BANERJEE,ASHUTOSH KUMAR,MEHER CHAITANYA P,Shashank Sharma,Kishore Kothapalli
Journal of Parallel and Distributed Computing, JPDC, 2015
@inproceedings{bib_Work_2015, AUTHOR = {DIP SANKAR BANERJEE, ASHUTOSH KUMAR, MEHER CHAITANYA P, Shashank Sharma, Kishore Kothapalli}, TITLE = {Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures}, BOOKTITLE = {Journal of Parallel and Distributed Computing}. YEAR = {2015}}
Graph algorithms play a prominent role in several fields of sciences and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real world datasets has been increasing. Parallelism offers only a limited succor to this situation as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular solution strategies aimed at processing large, sparse graphs on modern parallel architectures.
Implementation of Kirchhoff-Helmholtz transform on GPU for use in digital in-line holographic microscopy
GAGANDEEP SINGH DHANOA,MANISH KUMAR SHUKLA,Prabhakar Bhimalapuram,Kishore Kothapalli
ACM India Computing Conference, ACM-ICC, 2014
@inproceedings{bib_Impl_2014, AUTHOR = {GAGANDEEP SINGH DHANOA, MANISH KUMAR SHUKLA, Prabhakar Bhimalapuram, Kishore Kothapalli}, TITLE = {Implementation of Kirchhoff-Helmholtz transform on GPU for use in digital in-line holographic microscopy}, BOOKTITLE = {ACM India Computing Conference}. YEAR = {2014}}
GPUs with their massive Single Instruction Multiple Data (SIMD) capability have become attractive for scientific computation. The Kirchhoff-Helmholtz Transform (KHT) is a two dimensional integral commonly used in numerical reconstruction of the object from its experimentally measured diffraction pattern (called hologram). We explore the evaluation of KHT using GPU at various levels of optimisation. These optimisations are of two types:(a) algorithmic: exploiting the symmetries inherent in KHT, and (b) programmatic: optimizations that are specific to the GPU architecture like the lookup tables, scheduling of read/writes. From the numerical experiments, we report the speedup for each level of optimisation.
Architecture-and workload-aware heterogeneous algorithms for sparse matrix vector multiplication
I SIVARAMAKRISHNA,M MANOJ KUMAR,Kishore Kothapalli
ACM India Computing Conference, ACM-ICC, 2014
@inproceedings{bib_Arch_2014, AUTHOR = {I SIVARAMAKRISHNA, M MANOJ KUMAR, Kishore Kothapalli}, TITLE = {Architecture-and workload-aware heterogeneous algorithms for sparse matrix vector multiplication}, BOOKTITLE = {ACM India Computing Conference}. YEAR = {2014}}
Multiplying a sparse matrix with a vector, denoted spmv, is a fundamental operation in linear algebra with several applications. Hence, efficient and scalable implementation of spmv has been a topic of immense research. Recent efforts are aimed at implementations on GPUs, multicore architectures, and such emerging computational platforms. Owing to the highly irregular nature of spmv, it is observed that GPUs and CPUs can offer comparable performance.
GPU Accelerated Range Trees with Applications
M MANOJ KUMAR,Kishore Kothapalli
European Conference on Parallel Processing, ECPP, 2014
@inproceedings{bib_GPU__2014, AUTHOR = {M MANOJ KUMAR, Kishore Kothapalli}, TITLE = {GPU Accelerated Range Trees with Applications}, BOOKTITLE = {European Conference on Parallel Processing}. YEAR = {2014}}
Range searching is a primal problem in computational geometry with applications to database systems, mobile computing, geographical information systems, and the like. Defined simply, the problem is to preprocess a given a set of points in a d-dimensional space so that the points that lie inside an orthogonal query rectangle can be efficiently reported. Many practical applications of range trees require one to process a massive amount of points and a massive number of queries. In this context, we propose an efficient parallel implementation of range trees on manycore architectures such as GPUs. We extend our implementation to query processing. While queries can be batched together to exploit interquery parallelism, we also utilize intra-query parallelism. This inter- and intra-query parallelism greatly reduces the per query latency thereby increasing the throughput. On an input of 1 M points in a 2-dimensional space, our implementation on a single Nvidia GTX 580 GPU for constructing a range tree shows an improvement of 12X over a 12-threaded CPU implementation. We also achieve an average throughput of 10 M queries per second for answering 4 M queries on a range tree containing 1 M points on a Nvidia GTX 580 GPU. We extend our implementation to an application where we seek to report the set of maximal points in a given orthogonal query rectangle.
Brief announcement: Super-fast t-ruling sets
Tushar Bisht,Kishore Kothapalli,Sriram V. Pemmaraju
ACM symposium on Principles of distributed computing, PODC, 2014
@inproceedings{bib_Brie_2014, AUTHOR = {Tushar Bisht, Kishore Kothapalli, Sriram V. Pemmaraju}, TITLE = {Brief announcement: Super-fast t-ruling sets}, BOOKTITLE = {ACM symposium on Principles of distributed computing}. YEAR = {2014}}
A t-ruling set of a graph G=(V, E) is a vertex-subset S⊆ V that is independent and satisfies the property that every vertex v∈ V is at a distance of at most t hops from some vertex in S. A maximal independent set (MIS) is a 1-ruling set. Extending results from Kothapalli et al.(FSTTCS 2012) this note presents a randomized algorithm for computing, with high probability, a t-ruling set in O (t⋅ log 1/(t-1) n) rounds for 2< t≤√(log log n) and in (O (√(log log n))) rounds for t>√(log log n).
On reporting the L1 metric closest pair in a query rectangle
Ananda Swarup Das,Prosenjit Gupta,Kishore Kothapalli,Srinathan Kannan
Information Processing Letters, IPL, 2014
@inproceedings{bib_On_r_2014, AUTHOR = {Ananda Swarup Das, Prosenjit Gupta, Kishore Kothapalli, Srinathan Kannan}, TITLE = {On reporting the L1 metric closest pair in a query rectangle}, BOOKTITLE = {Information Processing Letters}. YEAR = {2014}}
In this work, we consider the problem of finding the closest pair (in L 1 metric) of points in an orthogonal query rectangle. Given a set of n static points on a U× U grid, we preprocess these points into a data structure of size O (m f (m) log 2 m) that can be queried in time O ((g (m)+ log log m) log 3 m), for m= O (n log U) and (i) f (m)= O (1) and g (m)= O (log ε m);(ii) f (m)= O (log log m) and g (m)= O (log log m);(iii) f (m)= O (log ε m) and g (m)= O (1). Here ε: 0< ε< 1 is a small but arbitrary constant.
On generalized planar skyline and convex hull range queries
NADEEM MOIDU,JATIN AGARWAL,SANKALP KHARE,Kishore Kothapalli,Srinathan Kannan
International Workshop on Algorithms and Computation, WALCOM, 2014
@inproceedings{bib_On_g_2014, AUTHOR = {NADEEM MOIDU, JATIN AGARWAL, SANKALP KHARE, Kishore Kothapalli, Srinathan Kannan}, TITLE = {On generalized planar skyline and convex hull range queries}, BOOKTITLE = {International Workshop on Algorithms and Computation}. YEAR = {2014}}
We present output sensitive techniques for the generalized reporting versions of the planar range maxima problem and the planar range convex hull problem. Our solutions are in the pointer machine model, for orthogonal range queries on a static point set. We solve the planar range maxima problem for two-sided, three-sided and four-sided queries. We achieve a query time of O(logn + c) using O(n) space for the two-sided case, where n denotes the number of stored points and c the number of colors reported. For the three-sided case, we achieve query time O(log2 n + clogn) using O(n) space while for four-sided queries we answer queries in O(log3 n + clog2 n) using O(nlogn) space. For the planar range convex hull problem, we provide a solution that answers queries in O(log2 n + clogn) time, using O(nlog2 n) space.
Work efficient parallel algorithms for large graph exploration
DIP SANKAR BANERJEE,Shashank Sharma,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2013
@inproceedings{bib_Work_2013, AUTHOR = {DIP SANKAR BANERJEE, Shashank Sharma, Kishore Kothapalli}, TITLE = {Work efficient parallel algorithms for large graph exploration}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2013}}
Graph algorithms play a prominent role in several fields of sciences and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of the above problems on a variety of modern multiprocessor architectures. It can be noticed in recent times that the size of the graphs that correspond to real world data sets has been increasing. Parallelism offers only a limited succor to this situation as current parallel architectures have severe short-comings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse in nature. This calls for particular work efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply the above technique on three fundamental graph algorithms: breadth first search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average of 35% improvement compared to state-ofthe-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
An empirical study of two MIS algorithms
Tushar Bisht,Kishore Kothapalli
International Conference on Advanced Computing, Networking and Security, ADCONS, 2013
@inproceedings{bib_An_e_2013, AUTHOR = {Tushar Bisht, Kishore Kothapalli}, TITLE = {An empirical study of two MIS algorithms}, BOOKTITLE = {International Conference on Advanced Computing, Networking and Security}. YEAR = {2013}}
Distributed algorithms for computing a maximal independent set of a graph have been very popular in the research community. Starting with the algorithm of Luby dating back to 25 years, several researchers have focused on distributed MIS algorithms for general graphs. with little success. Only recently, M'etevier et al. have proposed an algorithm for MIS in the distributed setting with a round complexity that matches Luby's algorithm asymptotically. However, very little is known about the relative performance of the algorithm of Luby and M'etevier in practice. Such a study can throw light on some aspects of the algorithms that are not brought out by a theoretical analysis. In this paper, we compare implementations of the algorithm of Luby and M'etevier and study their behavior on a class of random graphs and bipartite graphs. Further, we study how varying the probability that a node wishes to join the MIS in the case of Luby's algorithm affects the number of rounds required.
Allowing Multiple Rounds in the Shared Whiteboard Model: Some More Impossibility Results
DHARMEET SINGH HORA,PIYUSH BANSAL,Kishore Kothapalli,Srinathan Kannan
International Conference on Advanced Computing, Networking and Security, ADCONS, 2013
@inproceedings{bib_Allo_2013, AUTHOR = {DHARMEET SINGH HORA, PIYUSH BANSAL, Kishore Kothapalli, Srinathan Kannan}, TITLE = {Allowing Multiple Rounds in the Shared Whiteboard Model: Some More Impossibility Results}, BOOKTITLE = {International Conference on Advanced Computing, Networking and Security}. YEAR = {2013}}
The shared whiteboard model for distributed computing is one of the recent interesting models to be proposed (See Becker et al. (SPAA 2012)). In its basic form, this model allows all nodes to write a message of no more than O(log n) bits on a whiteboard that every node can read. However, each node can write at most once. In this model, a variety of problems from graphs are shown to be either possible or impossible. In this paper we extend the work of Becker et al. to allow for nodes to write on the shared whiteboard more than once. However, each node can write at most O(log n) bits at any one instant. Interestingly, in this model, we show that allowing just two rounds of writing on the whiteboard, one can color the vertices of a d-degenerate graph using d+1 colors. Similarly, we show that two rounds suffice to find maximal independent set (MIS), whereas 2-ruling sets can be computed in one round in simultaneous synchronous models. Finally, we show that for finding connected components in a graph, even O(1) rounds is not enough in general. We show that any deterministic algorithm that follows certain rules requires at least Omega(log nlog log n) rounds to find the connected components of an n-vertex graph. At the same time, we show the existence of a O(log n) round algorithm for the same. Thus, our results indicate that the multi-round shared whiteboard model has interesting consequences.
Efficient Range Reporting of Convex Hull
JATIN AGARWAL,NADEEM MOIDU,Kishore Kothapalli,Srinathan Kannan
Technical Report, arXiv, 2013
@inproceedings{bib_Effi_2013, AUTHOR = {JATIN AGARWAL, NADEEM MOIDU, Kishore Kothapalli, Srinathan Kannan}, TITLE = {Efficient Range Reporting of Convex Hull}, BOOKTITLE = {Technical Report}. YEAR = {2013}}
We consider the problem of reporting convex hull points in an orthogonal range query in two dimensions. Formally, let $ P $ be a set of $ n $ points in $\mathbb {R}^{2} $. A point lies on the convex hull of a point set $ S $ if it lies on the boundary of the minimum convex polygon formed by $ S $. In this paper, we are interested in finding the points that lie on the boundary of the convex hull of the points in $ P $ that also fall with in an orthogonal range $[x_ {lt}, x_ {rt}]\times {}[y_b, y_t] $. We propose a $ O (n\log^{2} n) $ space data structure that can support reporting points on a convex hull inside an orthogonal range query, in time $ O (\log^{3} n+ h) $. Here $ h $ is the size of the output. This work improves the result of (Brass et al. 2013)\cite {brass} that builds a data structure that uses $ O (n\log^{2} n) $ space and has a $ O (\log^{5} n+ h) $ query time. Additionally, we show that our data structure can be modified slightly to solve other related problems. For instance, for counting the number of points on the convex hull in an orthogonal query rectangle, we propose an $ O (n\log^{2} n) $ space data structure that can be queried upon in $ O (\log^{3} n) $ time. We also propose a $ O (n\log^{2} n) $ space data structure that can compute the $ area $ and $ perimeter $ of the convex hull inside an orthogonal range query in $ O (\log^{3} n $) time.
Fast, scalable parallel comparison sort on hybrid multicore architectures
DIP SANKAR BANERJEE,PARIKSHIT VISHWAS SAKURIKAR,Kishore Kothapalli
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2013
@inproceedings{bib_Fast_2013, AUTHOR = {DIP SANKAR BANERJEE, PARIKSHIT VISHWAS SAKURIKAR, Kishore Kothapalli}, TITLE = {Fast, scalable parallel comparison sort on hybrid multicore architectures}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2013}}
Sorting has been a topic of immense research value since the inception of Computer Science. Hybrid computing on multicore architectures involves computing simultaneously on a tightly coupled heterogeneous collection of devices. In this work, we consider a multicore CPU along with a many core GPU as our experimental hybrid platform. In this work, we present a hybrid comparison based sorting algorithm which utilizes a many-core GPU and a multi-core CPU to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters followed by creating independent sub lists. Sorting the independent sub lists results in sorting the entire original list. On a CPU+GPU platform consisting of an Intel i7 980 and an Nvidia GTX 580, our algorithm achieves a 20% gain over the current best known comparison sort result that was published by Davidson et. al. [In Par 2012]. On the above experimental platform, our results are better by 40% on average over a similar GPU-alone algorithm proposed by Leischner et. al. [IPDPS 2010]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid CPU+GPU platforms.
On the analysis of a label propagation algorithm for community detection
Kishore Kothapalli,Sriram V. Pemmaraju,Vivek Sardeshmukh
International Conference on Distributed Computing and Networking, ICDCN, 2013
@inproceedings{bib_On_t_2013, AUTHOR = {Kishore Kothapalli, Sriram V. Pemmaraju, Vivek Sardeshmukh}, TITLE = {On the analysis of a label propagation algorithm for community detection}, BOOKTITLE = {International Conference on Distributed Computing and Networking}. YEAR = {2013}}
This paper initiates formal analysis of a simple, distributed algorithm for community detection on networks. We analyze an algorithm that we call Max-LPA, both in terms of its convergence time and in terms of the “quality” of the communities detected. Max-LPA is an instance of a class of community detection algorithms called label propagation algorithms. As far as we know, most analysis of label propagation algorithms thus far has been empirical in nature and in this paper we seek a theoretical understanding of label propagation algorithms. In our main result, we define a clustered version of Erdös-Rényi random graphs with clusters V 1, V 2, …, V k where the probability p, of an edge connecting nodes within a cluster V i is higher than p′, the probability of an edge connecting nodes in distinct clusters. We show that even with fairly general restrictions on p and p′ ( p=Ω(1n1/4−ϵ) for any ε > 0, p′ = O(p 2), where n is the number of nodes), Max-LPA detects the clusters V 1, V 2, …, V n in just two rounds. Based on this and on empirical results, we conjecture that Max-LPA can correctly and quickly identify communities on clustered Erdös-Rényi graphs even when the clusters are much sparser, i.e., with p=clognn for some c > 1.
Planar Convex Hull Range Query and Related Problems.
NADEEM MOIDU,JATIN AGARWAL,Kishore Kothapalli
Annual Canadian Conference on Computational Geometry, CCCG, 2013
@inproceedings{bib_Plan_2013, AUTHOR = {NADEEM MOIDU, JATIN AGARWAL, Kishore Kothapalli}, TITLE = {Planar Convex Hull Range Query and Related Problems.}, BOOKTITLE = {Annual Canadian Conference on Computational Geometry}. YEAR = {2013}}
We consider the planar convex hull range query problem. Let P be a set of points in the plane. We preprocess these points into a data structure such that given an orthogonal range query, we can report the convex hull of the points in the range in O (log2 n+ h) time, where h is the size of the output. The data structure uses O (n log n) space. This improves the previous bound of O (log5 n+ h) time and O (n log2 n) space. Given a range query, it also supports extreme points in a given direction, tangent queries through a given point, and line-hull intersection queries on the points in the range in time O (log2 n) for each orthogonal query and O (log n) time for each additional query on that range. These problems have not been studied before.
Sparse matrix-matrix multiplication on modern architectures
KIRAN KUMAR M,I SIVARAMAKRISHNA,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2012
@inproceedings{bib_Spar_2012, AUTHOR = {KIRAN KUMAR M, I SIVARAMAKRISHNA, Kishore Kothapalli}, TITLE = {Sparse matrix-matrix multiplication on modern architectures}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2012}}
Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, respectively, among other applications find usage in various matrix formulations of graph problems. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of HPC community. These computations pose challenges such as load balancing, irregular nature of the computation, and difficulty in predicting the output size. It is even more challenging when combined with the GPU architectural constraints such as memory accesses, limited shared memory, strict SIMD and thread execution. To address these challenges on a GPU, we evaluate three possible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations targeted at sparse matrices. Our experiments indicate that the Row-Row formulation, which mostly outperforms the other formulations, is 3.5x faster on average compared to an optimized multi-core implementation in the Intel MKL library. We extend the Row-Row formulation to a CPU+GPU hybrid algorithm that simultaneously utilizes the CPU also. In this direction, we present heuristics to find the right amount of work division between the CPU and the GPU. Our hybrid row-row formulation of the spgemm operation performs 5.5x faster on average when compared to the optimized multi-core implementation in the Intel MKL library. Our experience indicates that it is difficult to identify right amount of work division between the CPU and the GPU. We therefore investigate a subclass of sparse matrices, band matrices, and present an analytical method to identify a good work division when multiplying two band matrices. Our GPU csrmm operation performs 2.5x faster on average when compared to a corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.
Discrete range searching primitive for the GPU and its applications
JYOTHISH SOMAN,Kishore Kothapalli,Narayanan P J
ACM Journal of Experimental Algorithmics, JEA, 2012
@inproceedings{bib_Disc_2012, AUTHOR = {JYOTHISH SOMAN, Kishore Kothapalli, Narayanan P J}, TITLE = {Discrete range searching primitive for the GPU and its applications}, BOOKTITLE = {ACM Journal of Experimental Algorithmics}. YEAR = {2012}}
Graphics processing units (GPUs) provide large computational power at a very low price, which position GPUs well as an ubiquitous accelerator. However, GPUs are space constrained, and hence applications developed for GPUs are space sensitive. Space-constrained computational devices such as GPUs can greatly benefit from representations that reduce space consumption drastically. One such representation is the succinct representation of trees. Succinct representation of trees generally allows for operations such as parent queries, least common ancestor queries, and so on. Mapping such a robust representation to the GPU for targeted applications can lead to substantial improvement in problem sizes that are processed at a given point of time. Space-saving methods such as succinct data structures remain largely unexplored on the GPU. In this work, a succinct representation of ordered trees on the GPU is explored, with application to discrete range searching (DRS). Based on the succinct representations found applicable, a space--saving solution for DRS is presented here. In our method, DRS is mapped to a least common ancestor query on a Cartesian tree. For space-efficient DRS queries, we store the succinct representation of the Cartesian tree of an array. Our method uses a maximum of 7.5 bits of additional space per element. Furthermore, the speed-up achieved by our method is in the range of 20--25 for preprocessing and 25--35 for batch querying over a sequential implementation. Compared to an 8-threaded implementation, our preprocessing and querying methods obtain a speed-up of 6--8. We also study the applications of the DRS on the GPU. Efficient primitives expand the range of applications performed on the GPU. DRS is one such primitive with direct applications to string processing, document and text retrieval systems, and least common ancestor queries. We suggest that graph algorithms that use the least common ancestor, can be enabled on the GPU based on DRS primitive. We also show some applications of DRS in tree queries and string querying.
On the Analysis of a Label Propagation Algorithm for Community Detection
Kishore Kothapalli,Sriram V. Pemmaraju,Vivek Sardeshmukh
International Conference on Distributed Computing and Networking, ICDCN, 2012
@inproceedings{bib_On_t_2012, AUTHOR = {Kishore Kothapalli, Sriram V. Pemmaraju, Vivek Sardeshmukh}, TITLE = {On the Analysis of a Label Propagation Algorithm for Community Detection}, BOOKTITLE = {International Conference on Distributed Computing and Networking}. YEAR = {2012}}
This paper initiates formal analysis of a simple, distributed algorithm for community detection on networks. We analyze an algorithm that we calltextsc {Max-LPA}, both in terms of its convergence time and in terms of the" quality" of the communities detected.textsc {Max-LPA} is an instance of a class of community detection algorithms calledtextit {label propagation} algorithms. As far as we know, most analysis of label propagation algorithms thus far has been empirical in nature and in this paper we seek a theoretical understanding of label propagation algorithms. In our main result, we define a clustered version ofer random graphs with clusters where the probability , of an edge connecting nodes within a cluster is higher than , the probability of an edge connecting nodes in distinct clusters. We show that even with fairly general restrictions on and ( for any , , where is the number of nodes),textsc {Max-LPA} detects the clusters in just two rounds. Based on this and on empirical results, we conjecture thattextsc {Max-LPA} can correctly and quickly identify communities on clustereder graphs even when the clusters are much sparser, ie, with for some .
On counting range maxima points in plane
ANIL KISHORE KALAVAGATTU,JATIN AGARWAL,ANANDA SWARUP DAS,Kishore Kothapalli
International Workshop Combinatorial Algorithm, IWOCA, 2012
@inproceedings{bib_On_c_2012, AUTHOR = {ANIL KISHORE KALAVAGATTU, JATIN AGARWAL, ANANDA SWARUP DAS, Kishore Kothapalli}, TITLE = {On counting range maxima points in plane}, BOOKTITLE = {International Workshop Combinatorial Algorithm}. YEAR = {2012}}
We consider the problem of reporting and counting maximal points in a given orthogonal query range in two-dimensions. Our model of computation is the pointer machine model. Let P be a static set of n points in ℝ2. A point is maximal in P if it is not dominated by any other point in P. We propose a linear space data structure that can support counting the number of maximal points inside a 3-sided orthogonal query rectangle unbounded on its right in O(logn) time. For counting the number of maximal points in a 4-sided orthogonal query rectangle, we propose an O(n logn) space data structure that can be constructed in O(n logn) time and queried upon in O(logn) time. This work proposes the first data structure for counting the number of maximal points in a query range. Das et al. proposed a data structure for the counting version in the word RAM model [WALCOM 2012]. For the corresponding reporting versions, we propose a linear size data structure for reporting maximal points inside a 3-sided query range in time O(logn + k), where k is the size of the output. We propose an O(n logn) size data structure for reporting the maximal points in a 4-sided orthogonal query range in time O(logn + k), where k is the size of the output. The methods we propose for reporting maximal points are simpler than previous methods and meet the best known bounds.
An on-demand fast parallel pseudo random number generator with applications
DIP SANKAR BANERJEE,AMAN KUMAR,Kishore Kothapalli
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2012
@inproceedings{bib_An_o_2012, AUTHOR = {DIP SANKAR BANERJEE, AMAN KUMAR, Kishore Kothapalli}, TITLE = {An on-demand fast parallel pseudo random number generator with applications}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2012}}
The use of manycore architectures and accelerators, such as GPUs, with good programmability has allowed them to be deployed for vital computational work. The ability to use randomness in computation is known to help in several situations. For such computations to be made possible on a general purpose computer, a source of randomness, or in general a pseudo random generator (PRNG), is essential. However, most of the PRNGs currently available on GPUs suffer from some basic drawbacks that we highlight in this paper. It is of high interest therefore to develop a parallel, quality PRNG that also works in an on demand model. In this paper we investigate a CPU+GPU hybrid technique to create an efficient PRNG. The basic technique we apply is that of random walks on expander graphs. Unlike existing generators available in the GPU programming environment, our generator can produce random numbers on demand as opposed to a onetime generation. Our approach produces 0.07 GNumbers per second. The quality of our generator is tested with industry standard tests. We also demonstrate two applications of our PRNG. We apply our PRNG to design a list ranking algorithm which demonstrates the on-demand nature of the algorithm and a Monte Carlo simulation which shows the high quality of our generator.
Range aggregate maximal points in the plane
Ananda Swarup Das,Prosenjit Gupta,ANIL KISHORE KALAVAGATTU,JATIN AGARWAL,Srinathan Kannan,Kishore Kothapalli
International Workshop on Algorithms and Computation, WALCOM, 2012
@inproceedings{bib_Rang_2012, AUTHOR = {Ananda Swarup Das, Prosenjit Gupta, ANIL KISHORE KALAVAGATTU, JATIN AGARWAL, Srinathan Kannan, Kishore Kothapalli}, TITLE = {Range aggregate maximal points in the plane}, BOOKTITLE = {International Workshop on Algorithms and Computation}. YEAR = {2012}}
In this work, we study the problem of reporting and counting maximal points in a query rectangle for a set of n integer points that lie on an n×n grid. A point is said to be maximal inside a query rectangle if it is not dominated by any other point inside the query rectangle. Our model of computation is unit-cost RAM model with word size of O(logn) bits. For the reporting version of the problem, we present a data structure of size words and support querying in time where k is the size of the output. For the counting version, we present a data structure of size words which supports querying in . Both the data structures are static in nature. The reporting version of the problem has been studied in [1] and [5]. To the best of our knowledge, this is the first sub-logarithmic result for the reporting version and the first work for the counting version of the problem.
Hybrid algorithms for list ranking and graph connected components
DIP SANKAR BANERJEE,Kishore Kothapalli
International Conference on High Performance Computing, HiPC, 2011
@inproceedings{bib_Hybr_2011, AUTHOR = {DIP SANKAR BANERJEE, Kishore Kothapalli}, TITLE = {Hybrid algorithms for list ranking and graph connected components}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2011}}
The advent of multicore and many-core architectures saw them being deployed to speed-up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general purpose computations has attracted a lot of attention given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing where the computation is performed simultaneously a control device, such as a CPU, and an accelerator such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show our hybrid algorithm for list ranking is faster by 50% compared to the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is faster by 25% compared to the best known GPU implementation [26].
Accelerating sparse matrix vector multiplication in iterative methods using GPU
KIRAN KUMAR M,Kishore Kothapalli
International Conference on Parallel Processing, ICPP, 2011
@inproceedings{bib_Acce_2011, AUTHOR = {KIRAN KUMAR M, Kishore Kothapalli}, TITLE = {Accelerating sparse matrix vector multiplication in iterative methods using GPU}, BOOKTITLE = {International Conference on Parallel Processing}. YEAR = {2011}}
Multiplying a sparse matrix with a vector (spmv for short) is a fundamental operation in many linear algebra kernels. Having an efficient spmv kernel on modern architectures such as the GPUs is therefore of principal interest. The computational challenges that spmv poses are significantlydifferent compared to that of the dense linear algebra kernels. Recent work in this direction has focused on designing data structures to represent sparse matrices so as to improve theefficiency of spmv kernels. However, as the nature of sparseness differs across sparse matrices, there is no clear answer as to which data structure to use given a sparse matrix. In this work, we address this problem by devising techniques to understand the nature of the sparse matrix and then choose appropriate data structures accordingly. By using our technique, we are able to improve the performance of the spmv kernel on an Nvidia Tesla GPU (C1060) by a factor of up to80% in some instances, and about 25% on average compared to the best results of Bell and Garland [3] on the standard dataset (cf. Williams et al. SC'07) used in recent literature. We also use our spmv in the conjugate gradient method and show an average 20% improvement compared to using HYB spmv of [3], on the dataset obtained from the The University of Florida Sparse Matrix Collection [9].
Finding Maximum Density Axes Parallel Regions for Weighted Point Sets.
ANANDA SWARUP DAS,Prosenjit Gupta,Srinathan Kannan,Kishore Kothapalli
Annual Canadian Conference on Computational Geometry, CCCG, 2011
@inproceedings{bib_Find_2011, AUTHOR = {ANANDA SWARUP DAS, Prosenjit Gupta, Srinathan Kannan, Kishore Kothapalli}, TITLE = {Finding Maximum Density Axes Parallel Regions for Weighted Point Sets.}, BOOKTITLE = {Annual Canadian Conference on Computational Geometry}. YEAR = {2011}}
In this work we study the problem of finding axesparallel regions of maximum density for weighted point sets in IR2 and IR3. The 2-d variant is motivated by applications in thermal analysis of VLSI chips.
On Finding Skyline Points for Range Queries in Plane.
ANIL KISHORE KALAVAGATTU,ANANDA SWARUP DAS,Kishore Kothapalli,Srinathan Kannan
Annual Canadian Conference on Computational Geometry, CCCG, 2011
@inproceedings{bib_On_F_2011, AUTHOR = {ANIL KISHORE KALAVAGATTU, ANANDA SWARUP DAS, Kishore Kothapalli, Srinathan Kannan}, TITLE = {On Finding Skyline Points for Range Queries in Plane.}, BOOKTITLE = {Annual Canadian Conference on Computational Geometry}. YEAR = {2011}}
We consider the dominating point set reporting problem in two-dimension. We propose a data structure for finding the set of dominating points inside a given orthogonal query rectangle. Given a set of n points in the plane, it supports 4-sided queries in O (log n+ k), where k is size of the output, using O (n log n) space. This work can be of application when range queries are generated using mobile devices with limited display capacity.
Distributed graph coloring in a few rounds
Kishore Kothapalli,Sriram Pemmaraju
ACM SIGACTSIGOPS Symposium on Principles of Distributed Computing, ACM PODC, 2011
@inproceedings{bib_Dist_2011, AUTHOR = {Kishore Kothapalli, Sriram Pemmaraju}, TITLE = {Distributed graph coloring in a few rounds}, BOOKTITLE = {ACM SIGACTSIGOPS Symposium on Principles of Distributed Computing}. YEAR = {2011}}
This paper considers the question of how many colors a distributed graph coloring algorithm would need to use if it had only k rounds available, for any positive integer k. In our main result, we present an algorithm that runs in O(k) rounds for any k bounded below by ©(log log n) and bounded above by O(√log n), and uses O(a Å n1/k) colors to color a graph with arboricity a. This result is optimal since the palette size matches the lower bound of Barenboim and Elkin (PODC 2008). This result is achieved via the use of several new results developed in this paper on coloring graphs whose edges have been acyclically oriented. For example, suppose that G is an n-vertex, acyclically oriented graph with maximum out-degree Δo. We present an algorithm that, for any k ≥ 2 loglog n, runs in O(k) rounds on G to produce an (i) O(Δo)-coloring when Δo Δ ©(maxkn2/k2log1+1/kn, 2k) and an (ii) O(Δo Å n2/k2)-coloring when Δo ∈ Ω(maxk log1+1/kn, 2k). These results are useful in any setting where it is possible to efficiently compute acyclic orientations of a graph with Δo << Δ. We derive non-trivial bounds on the palette size even when k < 2 loglog n. Our main technical contributions can be summarized as: (i) developing a k-round version of the algorithm of Kothapalli et al. (IPDPS 2006) which computes an O(?)-coloring of a graph in O(√log n) rounds, and (ii) developing an oriented version of the Brooks-Vizing coloring result of Grable and Panconesi (SODA 1998).
GPU accelerated Lanczos algorithm with applications
KIRAN KUMAR M,Kishore Kothapalli
International Conference on Advanced Information Networking and Applications Workshops, AINA-W, 2011
@inproceedings{bib_GPU__2011, AUTHOR = {KIRAN KUMAR M, Kishore Kothapalli}, TITLE = {GPU accelerated Lanczos algorithm with applications}, BOOKTITLE = {International Conference on Advanced Information Networking and Applications Workshops}. YEAR = {2011}}
Graphics Processing Units provide a large computational power at a very low price which position them as an ubiquitous accelerator. GPGPU is accelerating general purpose computations using GPU's. GPU's have been used to accelerate many Linear Algebra routines and Numerical Methods. Lanczos is an iterative method well suited for finding the extreme eigenvalues and the corresponding eigenvectors of large sparse symmetric matrices. In this paper, we present an implementation of Lanczos Algorithm on GPU using the CUDA programming model and apply it to two important problems : graph bisection using spectral methods, and image segmentation. Our GPU implementation of spectral bisection performs better when compared to both an Intel Math Kernel Library implementation and a Matlab implementation. Our GPU implementation shows a speedup up to 97.3 times over Matlab Implementation and 2.89 times over the Intel Math Kernel Library implementation on a Intel Core i7 920 Processor, which is a quad-core CPU. Similarly, our image segmentation implementation achieves a speed up of 3.27 compared to a multicore CPU based implementation using Intel Math Kernel Library and OpenMP. Through this work, we therefore wish to establish that the GPU may still be a better platform for also highly irregular and computationally intensive applications.
Fast two dimensional convex hull on the GPU
SRIKANTH SRUNGARAPU,B DURGA PRASAD REDDY,Kishore Kothapalli,Narayanan P J
International Conference on Advanced Information Networking and Applications Workshops, AINA-W, 2011
@inproceedings{bib_Fast_2011, AUTHOR = {SRIKANTH SRUNGARAPU, B DURGA PRASAD REDDY, Kishore Kothapalli, Narayanan P J}, TITLE = {Fast two dimensional convex hull on the GPU}, BOOKTITLE = {International Conference on Advanced Information Networking and Applications Workshops}. YEAR = {2011}}
General purpose programming on the graphics processing units(GPGPU) has received a lot of attention in the parallel computing community as it promises to offer a large computational power at a very low price. GPGPU is best suited for regular data parallel algorithms. They are not directly amenable for algorithms which have irregular data access patterns such as convex hull, list ranking etc. In this paper, we present a GPU-optimized implementation for finding the convex hull of a two dimensional point set. Our implementation tries to minimize the impact of irregular data access patterns. Our implementation can find the convex hull of 10 million random points in less than 0.2 seconds and achieves a speedup of up to 14 over the standard sequential CPU implementation. We also discuss some of the practical issues relating to the implementation of convex hull algorithms on massively multi-threaded architectures like that of the GPU.
Acyclic vertex coloring of graphs of maximum degree 5
KISHOR YADAV KOMMANABOINA,VARAGANI SATISH,Kishore Kothapalli,V Ch Venkaiah
Discrete Mathematics, DM, 2011
@inproceedings{bib_Acyc_2011, AUTHOR = {KISHOR YADAV KOMMANABOINA, VARAGANI SATISH, Kishore Kothapalli, V Ch Venkaiah}, TITLE = {Acyclic vertex coloring of graphs of maximum degree 5}, BOOKTITLE = {Discrete Mathematics}. YEAR = {2011}}
An acyclic vertex coloring of a graph is a proper vertex coloring such that there are no bichromatic cycles. The acyclic chromatic number of G, denoted a (G), is the minimum number of colors required for acyclic vertex coloring of graph G. For a family F of graphs, the acyclic chromatic number of F, denoted by a (F), is defined as the maximum a (G) over all the graphs G∈ F. In this paper we show that a (F)= 8 where F is the family of graphs of maximum degree 5 and give a linear time algorithm to achieve this bound.
Efficient Discrete Range Searching primitives on the GPU with applications
JYOTHISH SOMAN,KIRAN KUMAR M,Kishore Kothapalli,Narayanan P J
International Conference on High Performance Computing, HiPC, 2010
@inproceedings{bib_Effi_2010, AUTHOR = {JYOTHISH SOMAN, KIRAN KUMAR M, Kishore Kothapalli, Narayanan P J}, TITLE = {Efficient Discrete Range Searching primitives on the GPU with applications}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2010}}
Graphics processing units provide a large computational power at a very low price which position them as an ubiquitous accelerator. Efficient primitives that can expand the r ange of operations performed on the GPU are thus important. Discrete Range Searching(DRS) is one such primitive with direct applications to string processing, document and text retrieval systems, and least common ancestor queries. In this work, we present a GPU specific implementation of DRS with an optimal space-time trade off. Toward this end, we also present GPU amenable succinct representations and discuss limitations on the GPU. Our method uses 7.5 bits of additional space per element. The speedup achieved by our method is in the range of 20-25 for preprocessing, and 25-35 for batch querying over a sequential implementation. Compared to an 8-threaded implementation, our methods obtain a speedup of 6-8. We study applications of the DRS on the GPU. Also, we suggest that most graph algorithms which focus on using least common ancestor, can easily be enabled on the GPU based on range minima primitive. Beyond this, we show applications of DRS in string querying and tree queries, and suggest how DRS can be helpful in implementing tree based graph algorithms on the GPU.
Some GPU algorithms for graph connected components and spanning tree
JYOTHISH SOMAN,Kishore Kothapalli,Narayanan P J
Parallel Processing Letters, JPPL, 2010
@inproceedings{bib_Some_2010, AUTHOR = {JYOTHISH SOMAN, Kishore Kothapalli, Narayanan P J}, TITLE = {Some GPU algorithms for graph connected components and spanning tree}, BOOKTITLE = {Parallel Processing Letters}. YEAR = {2010}}
Graphics Processing Units (GPU) are application specific accelerators which provide high performance to cost ratio and are widely available and used, hence places them as a ubiquitous accelerator. A computing paradigm based on the same is the general purpose computing on the GPU (GPGPU) model. The GPU due to its graphics lineage is better suited for the data-parallel, data-regular algorithms. The hardware architecture of the GPU is not suitable for the data parallel but data irregular algorithms such as graph connected components and list ranking. In this paper, we present results that show how to use GPUs efficiently for graph algorithms which are known to have irregular data access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree. These two problems find applications in several graph theoretical problems. In this paper we arrive at efficient GPU implementations for the above two problems. The algorithms focus on minimising irregularity at both algorithmic and implementation level. Our implementation achieves a speedup of 11-16 times over a corresponding best sequential implementation.
The power of orientation in symmetry-breaking
SATYA KRISHNA PINDIPROLI,Kishore Kothapalli
International Conference on Advanced Information Networking and Applications, AINA, 2010
@inproceedings{bib_The__2010, AUTHOR = {SATYA KRISHNA PINDIPROLI, Kishore Kothapalli}, TITLE = {The power of orientation in symmetry-breaking}, BOOKTITLE = {International Conference on Advanced Information Networking and Applications}. YEAR = {2010}}
Symmetry breaking is a fundamental operation in distributed computing. It has applications to important problems such as graph vertex and edge coloring, maximal independent sets, and the like. Deterministic algorithms for symmetry breaking that run in a polylogarithmic number of rounds are not known. However, randomized algorithms that run in poly-logarithmic number of rounds are known starting from Luby's algorithm. Recently, orientation on edges was considered and it was shown that an O(Δ) coloring of the vertices of a given oriented graph can be arrived at using essentially O(log Δ + √log n) bits of communication. In this paper we further demonstrate the power of orientation on edges in symmetry-breaking. We present efficient algorithms to construct fractional independent sets in constant degree graphs using very low order communication between the vertices. For instance, we show that in bounded degree graphs and planar graphs, it is possible to construct a fractional independent set by exchanging O(1) bits. Further, we present algorithms to construct maximal independent sets in bounded degree graphs and oriented trees. Our algorithm for constructing an MIS of an oriented tree uses only O(log n) bits of communication.
A fully dynamic and self-stabilizing TDMA scheme for wireless ad-hoc networks
Bezawada Bruhadeshwar,Kishore Kothapalli,Indira Radhika Pulla
International Conference on Advanced Information Networking and Applications, AINA, 2010
@inproceedings{bib_A_fu_2010, AUTHOR = {Bezawada Bruhadeshwar, Kishore Kothapalli, Indira Radhika Pulla}, TITLE = {A fully dynamic and self-stabilizing TDMA scheme for wireless ad-hoc networks}, BOOKTITLE = {International Conference on Advanced Information Networking and Applications}. YEAR = {2010}}
One important challenge in wireless ad hoc networks is to achieve collision free communication. Many MAC layer protocols have been proposed by considering various communication models of the ad hoc networks. One such protocol is TDMA, which provides for interference free communication while providing some bound on the packet delay. However, most proposed TDMA schemes can perform scheduling transmissions among nodes that are within transmission range. This is due to the simplistic modeling of the underlying network. Recently, more practical network models and overlay construction algorithms have been proposed that consider the interference among nodes as well. Existing TDMA protocols are unsuitable for such models as they will have to consider the interference range of the nodes as well. This requires that the TDMA protocol operate in a global perspective while working as a local-control algorithm. In this work, we describe a novel TDMA protocol that is suitable for such realistic and practical network models. We also extend the TDMA protocol to handle other difficulties such as membership changes within transmission range and interference range. Some networks also have to deal with sleeping nodes that wake up periodically. Our TDMA scheme employs a simple control phase and a data phase to handle such tasks. Some of the benefits of our solution are that no knowledge of the network parameters such as the size or an estimate of the size are used. The overhead of our scheme is also very low and (control) messages exchanged are of small size. Our solution will also be self-stabilizing which is an important property for distributed systems. To the best of our knowledge, our TDMA protocol is the first MAC layer protocol for such realistic network models. Additionally, we report some experimental results of our scheme.
Fast GPU algorithms for graph connectivity
JYOTHISH SOMAN,Kishore Kothapalli,Narayanan P J
Workshop on Large Sacle Parallel Processing, LSPP, 2010
@inproceedings{bib_Fast_2010, AUTHOR = {JYOTHISH SOMAN, Kishore Kothapalli, Narayanan P J}, TITLE = {Fast GPU algorithms for graph connectivity}, BOOKTITLE = {Workshop on Large Sacle Parallel Processing}. YEAR = {2010}}
Graphics processing units provide a large compu-tational power at a very low price which position them as an ubiquitous accelerator. General purpose programming on the graphics processing units (GPGPU) is best suited for regular data parallel algorithms. They are not directly amenable for algorithms which have irregular data access patterns such as list ranking, and finding the connected components of a graph, and the like. In this work, we present a GPU-optimized implementation for finding the connected components of a given graph. Our implementation tries to minimize the impact of irregularity, both at the data level and functional level. Our implementation achieves a speed up of 9 to 12 times over the best sequential CPU implementation. For instance, our implementation finds connected components of a graph of 10 million nodes and 60 million edges in about 500 milliseconds on a GPU, given a random edge list. We also draw interesting observations on why PRAM algorithms, such as the Shiloach-Vishkin algorithm may not be a good fit for the GPU and how they should be modified.
Automatic analysis of distance bounding protocols
Sreekanth Malladi,Bezawada Bruhadeshwar,Kishore Kothapalli
Technical Report, arXiv, 2010
@inproceedings{bib_Auto_2010, AUTHOR = {Sreekanth Malladi, Bezawada Bruhadeshwar, Kishore Kothapalli}, TITLE = {Automatic analysis of distance bounding protocols}, BOOKTITLE = {Technical Report}. YEAR = {2010}}
Distance bounding protocols are used by nodes in wireless networks to calculate upper bounds on their distances to other nodes. However, dishonest nodes in the network can turn the calculations both illegitimate and inaccurate when they participate in protocol executions. It is important to analyze protocols for the possibility of such violations. Past efforts to analyze distance bounding protocols have only been manual. However, automated approaches are important since they are quite likely to find flaws that manual approaches cannot, as witnessed in literature for analysis pertaining to key establishment protocols. In this paper, we use the constraint solver tool to automatically analyze distance bounding protocols. We first formulate a new trace property called Secure Distance Bounding (SDB) that protocol executions must satisfy. We then classify the scenarios in which these protocols can operate considering the (dis) honesty of nodes and location of the attacker in the network. Finally, we extend the constraint solver so that it can be used to test protocols for violations of SDB in these scenarios and illustrate our technique on some published protocols.
GPU-accelerated genetic algorithms
RAJVI SHAH,Narayanan P J,Kishore Kothapalli
Technical Report, arXiv, 2010
@inproceedings{bib_GPU-_2010, AUTHOR = {RAJVI SHAH, Narayanan P J, Kishore Kothapalli}, TITLE = {GPU-accelerated genetic algorithms}, BOOKTITLE = {Technical Report}. YEAR = {2010}}
Genetic algorithms are effective in solving many optimiza-tion tasks. However, the long execution time associated withit prevents its use in many domains. In this paper, we pro-pose a new approach for parallel implementation of geneticalgorithm on graphics processing units (GPUs) using CUDAprogramming model. We exploit the parallelism within achromosome in addition to the parallelism across multiplechromosomes. The use of one thread per chromosome byprevious efforts does not utilize the GPU resources effec-tively. Our approach uses multiple threads per chromosome,thereby exploiting the massively multithreaded GPU moreeffectively. This results in good utilization of GPU resourceseven at small population sizes while maintaining impressivespeed up for large population sizes. Our approach is mod-eled after the GAlib library and is adaptable to a varietyof problems. We obtain a speedup of over 1500 over theCPU on problems involving a million chromosomes. Prob-lems of such magnitude are not ordinarily attempted due tothe prohibitive computation times.
A fast GPU algorithm for graph connectivity
JYOTHISH SOMAN,Kishore Kothapalli,Narayanan P J
International Parallel & Distributed Processing Symposium Workshops, IPDPS-W, 2010
@inproceedings{bib_A_fa_2010, AUTHOR = {JYOTHISH SOMAN, Kishore Kothapalli, Narayanan P J}, TITLE = {A fast GPU algorithm for graph connectivity}, BOOKTITLE = {International Parallel & Distributed Processing Symposium Workshops}. YEAR = {2010}}
Graphics processing units provide a large compu-tational power at a very low price which position them as anubiquitous accelerator. General purpose programming on thegraphics processing units (GPGPU) is best suited for regulardata parallel algorithms. They are not directly amenable foralgorithms which have irregular data access patterns suchas list ranking, and finding the connected components of agraph, and the like. In this work, we present a GPU-optimizedimplementation for finding the connected components of agiven graph. Our implementation tries to minimize the impactof irregularity, both at the data level and functional level.Our implementation achieves a speed up of 9 to 12 timesover the best sequential CPU implementation. For instance, ourimplementation finds connected components of a graph of 10million nodes and 60 million edges in about 500 millisecondson a GPU, given a random edge list. We also draw interestingobservations on why PRAM algorithms, such as the Shiloach-Vishkin algorithm may not be a good fit for the GPU and howthey should be modified.
A performance prediction model for the CUDA GPGPU platform
Kishore Kothapalli,RISHABH MUKHERJEE,MOHAMMED SUHAIL REHMAN,SURYAKANT PATIDAR,Narayanan P J,Srinathan Kannan
International Conference on High Performance Computing, HiPC, 2009
@inproceedings{bib_A_pe_2009, AUTHOR = {Kishore Kothapalli, RISHABH MUKHERJEE, MOHAMMED SUHAIL REHMAN, SURYAKANT PATIDAR, Narayanan P J, Srinathan Kannan}, TITLE = {A performance prediction model for the CUDA GPGPU platform}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2009}}
The significant growth in computational power of modern Graphics Processing Units (GPUs) coupled with the advent of general purpose programming environments like NVIDIA's CUDA, has seen GPUs emerging as a very popular parallel computing platform. Till recently, there has not been a performance model for GPGPUs. The absence of such a model makes it difficult to definitively assess the suitability of the GPU for solving a particular problem and is a significant impediment to the mainstream adoption of GPUs as a massively parallel (super)computing platform. In this paper we present a performance prediction model for the CUDA GPGPU platform. This model encompasses the various facets of the GPU architecture like scheduling, memory hierarchy, and pipelining among others. We also perform experiments that demonstrate the effects of various memory access strategies. The proposed model can be used to analyze pseudo code for a CUDA kernel to obtain a performance estimate, in a way that is similar to performing asymptotic analysis. We illustrate the usage of our model and its accuracy with three case studies: matrix multiplication, list ranking, and histogram generation.
Acyclic Vertex Coloring of Graphs of Maximum Degree∆ Indian Mathematical Society
Kishore Kothapalli,VARAGANI SATISH,V. Ch. Venkaiah
International Conference Theoretical Computer Science and Discrete Mathematics, DMTCS, 2009
@inproceedings{bib_Acyc_2009, AUTHOR = {Kishore Kothapalli, VARAGANI SATISH, V. Ch. Venkaiah}, TITLE = {Acyclic Vertex Coloring of Graphs of Maximum Degree∆ Indian Mathematical Society}, BOOKTITLE = {International Conference Theoretical Computer Science and Discrete Mathematics}. YEAR = {2009}}
An acyclic vertex coloring of a graph is a proper vertex coloring such that there are no bichromatic cycles. The acyclic chromatic number of G, denoted a (G), is the minimum number of colors required for acyclic vertex coloring of graph G=(V, E). In this paper we show that for any graph G with maximum degree∆, a (G)≤ 3∆ 2+ 4∆+ 8
Fast and scalable list ranking on the GPU
MOHAMMED SUHAIL REHMAN,Kishore Kothapalli,Narayanan P J
ACM International Conference on Supercomputing, ICS, 2009
@inproceedings{bib_Fast_2009, AUTHOR = {MOHAMMED SUHAIL REHMAN, Kishore Kothapalli, Narayanan P J}, TITLE = {Fast and scalable list ranking on the GPU}, BOOKTITLE = {ACM International Conference on Supercomputing}. YEAR = {2009}}
General purpose programming on the graphics processing units (GPGPU) has received a lot of attention in the parallel computing community as it promises to offer the highest performance per dollar. The GPUs have been used extensively on regular problems that can be easily parallelized. In this paper, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. We first present an implementation of Wyllie's algorithm based on pointer jumping. This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized, Recursive Helman-JaJa (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multi-threaded architectures like that of the GPU. Regular or coalesced memory accesses pattern and balanced load are critical to achieve good performance on the GPU.
Distributed Computing and Networking: 10th International Conference, ICDCN 2009, Hyderabad, India, January 3-6, 2009, Proceedings
Vijay Garg,Roger Wattenhofer,Kishore Kothapalli
International Conference on Distributed Computing and Networking, ICDCN, 2009
@inproceedings{bib_Dist_2009, AUTHOR = {Vijay Garg, Roger Wattenhofer, Kishore Kothapalli}, TITLE = {Distributed Computing and Networking: 10th International Conference, ICDCN 2009, Hyderabad, India, January 3-6, 2009, Proceedings}, BOOKTITLE = {International Conference on Distributed Computing and Networking}. YEAR = {2009}}
people volunteer their time and energy and work in a dedicated fashion to pull everything together each year, including our very supportive Steering Comm-tee members led by Sukumar Ghosh. However, the success of ICDCN is mainly due to the hard work of all those people who submit papers and/or attend the conference. We thank you all. January 2009 Prasad Jayanti Andrew T. Campbell Message from the Technical Program Chairs Welcome to the proceedings of the 10thInternationalConferenceon Distributed Computing and Networking (ICDCN) 2009. As ICDCN celebrates its 10th-niversary, ithasbecomeanimportantforumfordisseminatingthelatestresearch results in distributed computing and networking. We received 179 submissions from all over the world, including Algeria, A-tralia, Canada, China, Egypt, France, Germany, Hong Kong, Iran, Italy, Japan, Malaysia, The Netherlands, Poland, Singapore, South Korea, Taiwan, and the USA, besides India, the host country. The submissions were read and evaluated by the Program Committee, which consisted of 25 members for the Distributed Computing Track and 28 members for the Networking Track, with the ad-tional help of external reviewers. The Program Committee selected 20 regular papers and 32 short papers for inclusion in the proceedings and presentation at the conference. We were fortunate to have several distinguished scientists as keynote speakers. Andrew Campbell (Dartmouth College, USA), Maurice Herlihy (Brown University, USA), and PR Kumar (University of of Illinois, Urbana-Champaign) delivered the keynote address. Krithi Ramamritham from IIT Bombay, India
Routing protocol security using symmetric key based techniques
Bezawada Bruhadeshwar,Kishore Kothapalli,M POORNIMA,MUPPA DIVYA
International Conference on Availability, Reliability and Security, ARES, 2009
@inproceedings{bib_Rout_2009, AUTHOR = {Bezawada Bruhadeshwar, Kishore Kothapalli, M POORNIMA, MUPPA DIVYA}, TITLE = {Routing protocol security using symmetric key based techniques}, BOOKTITLE = {International Conference on Availability, Reliability and Security}. YEAR = {2009}}
In this paper, we address the security of routing protocols. Internet routing protocols are subject to attacks in the control plane as well as the data plane. In the control plane, a routing protocol, e.g., BGP, OSPF, exchanges routing state updates and enables routers to compute the best paths towards various destinations. During this phase, an attacker can modify or inject malicious control messages leading to incorrect computation of routing paths. In the data plane, the routers forward the data along the paths computed in the control plane. Even if an attacker is not successful during the control phase, he can choose not to use the correct routing paths and forward data along routes that benefit him. Research shows that, attacks on the control plane can be mitigated by ensuring message integrity and, attacks on the data plane can be mitigated by ensuring route integrity. Earlier works have addressed these two problems independently with many interesting solutions. However, due to the nature of these solutions, network architects cannot deploy security at both planes without increasing the overhead on the network. In this paper, we focus on an integrated approach and propose the use of symmetric key protocols for addressing the security at both the control and data planes. We describe approaches that enable the reuse of the symmetric key protocols thereby eliminating the need for separate solutions at different planes. We used symmetric key protocols as they are efficient and scalable. Our experimental results show that our approaches are practical and can be incrementally deployed as well.
Reducing the cost of session key establishment
Bezawada Bruhadeshwar,Kishore Kothapalli,M. SREE DEEPYA
International Conference on Availability, Reliability and Security, ARES, 2009
@inproceedings{bib_Redu_2009, AUTHOR = {Bezawada Bruhadeshwar, Kishore Kothapalli, M. SREE DEEPYA}, TITLE = {Reducing the cost of session key establishment}, BOOKTITLE = {International Conference on Availability, Reliability and Security}. YEAR = {2009}}
Scenarios such as online banking, mobile payment systems, stock trading, selling merchandise, and a host of other applications that need a high level of security have moved from the research domain to real world. Moreover, the nature of clients has been changing from traditional desktops to mobile and handheld devices. Protocols like SSL, SSH are the present standard for establishing secure channels. However, the drawback in these protocols is that both the server and the client need to perform computationally expensive public-key operations for secure channel establishment. In this paper, we present simple constructions that spread the cost of secure channel establishment over several sessions. Our constructions are incrementally deployable and can operate with existing protocols such as SSL and SSH. Experimental results indicate that our constructions are practical and efficient in reducing the computational load at the server as well as the client side.
Experimental analysis of distributed coloring algorithms
SATYA KRISHNA PINDIPROLI,Kishore Kothapalli
International Conference on Advanced Computing, ICoAC, 2009
@inproceedings{bib_Expe_2009, AUTHOR = {SATYA KRISHNA PINDIPROLI, Kishore Kothapalli}, TITLE = {Experimental analysis of distributed coloring algorithms}, BOOKTITLE = {International Conference on Advanced Computing}. YEAR = {2009}}
We aim at an empirical analysis of distributed vertex coloring algorithms. To this end, we compare the empirical performance of a recently proposed distributed vertex coloring algorithm [8] with that of Luby's algorithm. To get a good coverage we look at the cycle graph on n vertices, cliques, and random graphs from the family G(n, p) by controlling n, p and np. The results of our experiments fairly demonstrate the improvement in the bit complexity of the algorithm proposed in [8]. Our results also match those of the experiments of Panconesi et. al. [3] on Luby's algorithm.
Hybrid multi-core algorithms for regular image filtering applications
SHRENIK LAD,KRISHNA KUMAR SINGH,Kishore Kothapalli,Narayanan P J
Genetic Programming and Evolvable Machines, GPEM, 2009
@inproceedings{bib_Hybr_2009, AUTHOR = {SHRENIK LAD, KRISHNA KUMAR SINGH, Kishore Kothapalli, Narayanan P J}, TITLE = {Hybrid multi-core algorithms for regular image filtering applications}, BOOKTITLE = {Genetic Programming and Evolvable Machines}. YEAR = {2009}}
GPGPU has received lot of attention in the past few years mainly because of the performance gain GPUs offer at a low price. Recently, researchers have identified hybrid multi-core computing as a better solution compared to accelerator based computing for several prob-lems. In this paper, we evaluate two regular problems in image processing, bilateral filtering and convolution, on a hybrid multi-core platform. We provide a detailed analysis of these algorithms by comparing their performance on three platforms 1)CPU+GPU hybrid, 2) pure GPU and3)pure CPU. We show that performance gains as hig has 30-40%can be obtained by using basic techniques like data decomposition and overlapped execution, when a hybrid computing model is used. Finally, we conclude by discussing some future prospects in the area of hybrid computing
Parallelizing two dimensional convex hull on NVIDIA GPU and Cell BE
SRIKANTH S,B DURGA PRASAD REDDY,Kishore Kothapalli,Govindarajulu Regeti,Narayanan P J
International Conference on High Performance Computing, HiPC, 2009
@inproceedings{bib_Para_2009, AUTHOR = {SRIKANTH S, B DURGA PRASAD REDDY, Kishore Kothapalli, Govindarajulu Regeti, Narayanan P J}, TITLE = {Parallelizing two dimensional convex hull on NVIDIA GPU and Cell BE}, BOOKTITLE = {International Conference on High Performance Computing}. YEAR = {2009}}
Multicore processors are a shift of paradigm in computer architecture that promises dramatic increase in performance. But they also bring complexity in algorithmic design. In this paper we describe the challenges and design issues involved in parallelizing two dimensional convex hull on both CUDA and Cell Brodband Engine (Cell BE). We have parallelized the quickhull algorithm for two dimensional convex hull. The major advantage of this algorithm is that interprocessor communication cost is highly reduced.
Vertex Magic Total Labelings of Complete Graphs
KRISHNAPPA H K,Kishore Kothapalli,V. Ch. Venkaiah
AKCE International Journal of Graphs and Combinatorics, IJGC, 2009
@inproceedings{bib_Vert_2009, AUTHOR = {KRISHNAPPA H K, Kishore Kothapalli, V. Ch. Venkaiah}, TITLE = {Vertex Magic Total Labelings of Complete Graphs}, BOOKTITLE = {AKCE International Journal of Graphs and Combinatorics}. YEAR = {2009}}
A vertex magic total labeling of a graph G = (V, E) is a bijection f: V ≼ E → {1, 2,…, |V| + |E|} such that for every vertex w, the sum is a constant. It is well known that all complete graphs Kn admit a vertex magic total labeling. In this paper we present a new proof of this theorem using the concepts of twin factorization and magic square.
Cordial labelings of a Class of Planar Graphs
K. Ramanjaneyulu,V. Ch. Venkaiah,Kishore Kothapalli
AKCE International Journal of Graphs and Combinatorics, IJGC, 2009
@inproceedings{bib_Cord_2009, AUTHOR = {K. Ramanjaneyulu, V. Ch. Venkaiah, Kishore Kothapalli}, TITLE = {Cordial labelings of a Class of Planar Graphs}, BOOKTITLE = {AKCE International Journal of Graphs and Combinatorics}. YEAR = {2009}}
Let G = (V, E) be a graph and let f: V → {0, 1} be a mapping from the set of vertices to {0, 1} and for each edge (u, v) ∊ E assign the label |f(u) — f(v)|. If the number of vertices labeled with 0 and the number of vertices labeled with 1 differ by at most 1 and the number of edges labled with 0 and the number of edges labeled with 1 differ by at most 1, then f is called a cordial labeling. In this paper, we present two families of planar graphs, Pln and Plm,n defined shortly, that admit a cordial labeling. We also show that the class Plm,n admits a total product cordial labeling under certain conditions.
Graph Theoretic Approach for Studying Correlated Motions in Biomolecules
ESHITA MUTT,MONIKA SHARMA,Abhijit Mitra,JYOTHISH SOMAN,Kishore Kothapalli,Naveena Yanamala
World Congress on Nature & Biologically Inspired Computing, NaBIC, 2009
@inproceedings{bib_Grap_2009, AUTHOR = {ESHITA MUTT, MONIKA SHARMA, Abhijit Mitra, JYOTHISH SOMAN, Kishore Kothapalli, Naveena Yanamala}, TITLE = {Graph Theoretic Approach for Studying Correlated Motions in Biomolecules}, BOOKTITLE = {World Congress on Nature & Biologically Inspired Computing}. YEAR = {2009}}
Graph theoretic concepts have been applied to the dynamic cross-correlation data obtained from MD simulation of adenine riboswitch, in absence and presence of adenine. Our hybrid approach combines community detection algorithms that support edge weights and cliques. The effect of variations in the values of nearest neighbors (NN) and correlation coefficient threshold (T) in the community detection algorithm have been applied to identify and filter out coincidental correlations between rogue nodes. Our results have identified the correlations within the structural regions of the molecule, which provide strong clues regarding the functionality and stability of the molecule in absence and presence of adenine. Our results show that a prior application of our algorithm applied to the simulation data of biomolecules in automated fashion can provide strong leads for hypothesis formulation and subsequent hypothesis-driven manual investigation.
Anti-magic labellings of a class of planar graphs
K. Ramanjaneyulu,V. Ch. Venkaiah,Kishore Kothapalli
The Australasian Journal of Combinatorics, AJC, 2008
@inproceedings{bib_Anti_2008, AUTHOR = {K. Ramanjaneyulu, V. Ch. Venkaiah, Kishore Kothapalli}, TITLE = {Anti-magic labellings of a class of planar graphs}, BOOKTITLE = {The Australasian Journal of Combinatorics}. YEAR = {2008}}
This paper focuses on anti-magic labellings of planar graphs of type (a, b, c) introduced by Lih (Util. Math. 24 (1983), 165–197). We show that a certain class of planar graphs defined using the complete graphs and a class of planar graphs defined using the complete bipartite graphs are (1, 1, 0) and (1, 1, 1)-anti-magic under certain mild conditions.
A family of collusion resistant symmetric key protocols for authentication
Bruhadeshwar Bezawada,Kishore Kothapalli
International Conference on Distributed Computing and Networking, ICDCN, 2008
@inproceedings{bib_A_fa_2008, AUTHOR = {Bruhadeshwar Bezawada, Kishore Kothapalli}, TITLE = {A family of collusion resistant symmetric key protocols for authentication}, BOOKTITLE = {International Conference on Distributed Computing and Networking}. YEAR = {2008}}
We address the problem of message authentication in communication networks which are resource constrained or are performance bound. Recent research has focused on development of symmetric key protocols for authentication in such networks. In these protocols, the sender generates a pool of keys -used to sign the messages, and distributes a different subset of keys -used to verify the signatures, to each user. However, in these protocols, users can collude to combine their keys and impersonate the sender by generating the sender signatures. In this work, we describe a family of collusion resistant symmetric key distribution protocols for authentication which address the problem of collusion. We show that the collusion resistance achieved using our protocols is practical (and hence, sufficient) for networks whose communication diameter is known or is within fixed bounds. Furthermore, we show that some existing protocols in literature are members of our family of protocols.
How Far Must You See To Hear Reliably.
PRANAV KUMAR VASISHTA GV,ANUJ GUPTA,Prasant Gopal,PIYUSH BANSAL,RISHABH MUKHERJEE,M POORNIMA,Srinathan Kannan,Kishore Kothapalli
IACR Cryptology ePrint Archive, CEA, 2008
@inproceedings{bib_How__2008, AUTHOR = {PRANAV KUMAR VASISHTA GV, ANUJ GUPTA, Prasant Gopal, PIYUSH BANSAL, RISHABH MUKHERJEE, M POORNIMA, Srinathan Kannan, Kishore Kothapalli}, TITLE = {How Far Must You See To Hear Reliably.}, BOOKTITLE = {IACR Cryptology ePrint Archive}. YEAR = {2008}}
We consider the problem of probabilistic reliable communication (PRC) over synchronous networks modeled as directed graphs in the presence of a Byzantine adversary when players’ knowledge of the network topology is not complete. We show that possibility of PRC is extremely sensitive to the changes in players’ knowledge of the topology. This is in complete contrast with earlier known results on the possibility of perfectly reliable communication over undirected graphs where the case of each player knowing only its neighbours gives the same result as the case where players have complete knowledge of the network. Specifically, in either case,(2t+ 1)-vertex connectivity is necessary and sufficient, where t is the number of nodes that can be corrupted by the adversary [DDWY93, SKR+05]. We introduce a novel model for quantifying players’ knowledge of network topology, denoted by T K. Given a directed graph G, influenced by a Byzantine adversary that can corrupt up to any t players, we give a necessary and sufficient condition for possibility of PRC over G for any arbitrary topology knowledge T K. It follows from our main characterization theorem that knowledge of up to d=⌊ n− 2t
Self-stabilizing routing algorithms for wireless ad-hoc networks
KHOT ROHIT ASHOK,Ravikant Poola,Kishore Kothapalli,Srinathan Kannan
International Conference on Distributed Computing and Internet Technology, ICDCIT, 2007
@inproceedings{bib_Self_2007, AUTHOR = {KHOT ROHIT ASHOK, Ravikant Poola, Kishore Kothapalli, Srinathan Kannan}, TITLE = {Self-stabilizing routing algorithms for wireless ad-hoc networks}, BOOKTITLE = {International Conference on Distributed Computing and Internet Technology}. YEAR = {2007}}
This paper considers the problem of unicasting in wireless ad hoc networks. Unicasting is the problem of finding a route between a source and a destination and forwarding the message from the source to the destination. In theory, models that have been used oversimplify the problem of route discovery in ad hoc networks. The achievement of this paper is threefold. First we use a more general model in which nodes can have different transmission and interference ranges and we present a new routing algorithm for wireless ad hoc networks that has several nice features. We then combine our algorithm with that of known greedy algorithms to arrive at an average case efficient routing algorithm in the situation that GPS information is available. Finally we show how to schedule unicast traffic between a set of source-destination pairs by providing a proper vertex coloring of the nodes in the wireless ad hoc network. Our coloring algorithm achieves a O(Δ)–coloring that is locally distinct within the 2-hop neighborhood of any node.