IIITH

Practical time bundle adjustment for 3D reconstruction on the GPU

Siddharth Choudhary, Shubham Gupta, Narayanan P J

European Conference on Computer Vision, ECCV, 2010

Core Rank : A* Google Rank :206

@inproceedings{bib_Prac_2010, AUTHOR = {Siddharth Choudhary and Shubham Gupta and Narayanan P J}, TITLE = {Practical time bundle adjustment for 3D reconstruction on the GPU}, BOOKTITLE = {European Conference on Computer Vision}, YEAR = {2010}}

Abstract

Large-scale 3D reconstruction has received a lot of attention recently. Bundle adjustment is a key component of the reconstruction pipeline and often its slowest and most computationally resource-intensive step. It has not been parallelized effectively so far. In this paper, we present a hybrid implementation of sparse bundle adjustment on the GPU using CUDA, with the CPU working in parallel. The algorithm is decomposed into smaller steps, each of which is scheduled on the GPU or the CPU. We develop efficient kernels for the steps and make use of existing libraries for several steps. Our implementation outperforms the CPU implementation significantly, achieving a speedup of 30-40 times over the standard CPU implementation for datasets with up to 500 images on an Nvidia Tesla C2050 GPU.

Fast GPU algorithms for graph connectivity

Jyothish Soman, Kishore Kothapalli, Narayanan P J

Workshop on Large Scale Parallel Processing, LSPP, 2010

Core Rank : - Google Rank :-

@inproceedings{bib_Fast_2010, AUTHOR = {Jyothish Soman and Kishore Kothapalli and Narayanan P J}, TITLE = {Fast GPU algorithms for graph connectivity}, BOOKTITLE = {Workshop on Large Scale Parallel Processing}, YEAR = {2010}}

Abstract

Graphics processing units provide large computational power at a very low price, which positions them as a ubiquitous accelerator. General-purpose programming on graphics processing units (GPGPU) is best suited to regular data-parallel algorithms. GPUs are not directly amenable to algorithms with irregular data access patterns, such as list ranking and finding the connected components of a graph. In this work, we present a GPU-optimized implementation for finding the connected components of a given graph. Our implementation tries to minimize the impact of irregularity, both at the data level and at the functional level. It achieves a speedup of 9 to 12 times over the best sequential CPU implementation. For instance, our implementation finds the connected components of a graph of 10 million nodes and 60 million edges in about 500 milliseconds on a GPU, given a random edge list. We also draw interesting observations on why PRAM algorithms, such as the Shiloach-Vishkin algorithm, may not be a good fit for the GPU and how they should be modified.
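As an illustration only, the hook-and-pointer-jump pattern used by PRAM-style connectivity algorithms can be sketched sequentially in Python. This is not the paper's GPU implementation; the function name `connected_components` and the choice of labelling each component by its smallest node id are assumptions made for this sketch.

```python
def connected_components(num_nodes, edges):
    """Label each node with the smallest node id in its component.

    Sequential sketch of the hook / pointer-jump pattern; on a GPU each
    edge and each node would be handled by its own thread.
    """
    parent = list(range(num_nodes))  # every node starts as its own root
    changed = True
    while changed:
        changed = False
        # Hooking: for every edge joining two different trees, attach the
        # root with the larger id to the node with the smaller id.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:  # only roots may be hooked
                    parent[hi] = lo
                    changed = True
        # Pointer jumping: shorten every path until each node points
        # directly at its root.
        for i in range(num_nodes):
            while parent[i] != parent[parent[i]]:
                parent[i] = parent[parent[i]]
    return parent


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)
```

Because a larger id is always hooked under a smaller one, the parent pointers can never form a cycle, and the loop stops once no edge joins two different trees.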

Some GPU algorithms for graph connected components and spanning tree

Jyothish Soman, Kishore Kothapalli, Narayanan P J

Parallel Processing Letters, JPPL, 2010

Core Rank : - Google Rank :7

@article{bib_Some_2010, AUTHOR = {Jyothish Soman and Kishore Kothapalli and Narayanan P J}, TITLE = {Some GPU algorithms for graph connected components and spanning tree}, JOURNAL = {Parallel Processing Letters}, YEAR = {2010}}

Abstract

Graphics Processing Units (GPUs) are application-specific accelerators that provide a high performance-to-cost ratio and are widely available and used, which makes them a ubiquitous accelerator. A computing paradigm based on them is the general purpose computing on the GPU (GPGPU) model. Owing to its graphics lineage, the GPU is better suited to data-parallel, data-regular algorithms. Its hardware architecture is not well suited to algorithms that are data-parallel but data-irregular, such as graph connected components and list ranking. In this paper, we present results that show how to use GPUs efficiently for graph algorithms which are known to have irregular data access patterns. We consider two fundamental graph problems: finding the connected components and finding a spanning tree. These two problems find applications in several graph-theoretical problems. We arrive at efficient GPU implementations for both problems. The algorithms focus on minimising irregularity at both the algorithmic and the implementation level. Our implementation achieves a speedup of 11-16 times over a corresponding best sequential implementation.

Efficient Discrete Range Searching primitives on the GPU with applications

Jyothish Soman, Kiran Kumar M, Kishore Kothapalli, Narayanan P J

International Conference on High Performance Computing, HiPC, 2010

Core Rank : - Google Rank :24

@inproceedings{bib_Effi_2010, AUTHOR = {Jyothish Soman and Kiran Kumar M and Kishore Kothapalli and Narayanan P J}, TITLE = {Efficient Discrete Range Searching primitives on the GPU with applications}, BOOKTITLE = {International Conference on High Performance Computing}, YEAR = {2010}}

Abstract

Graphics processing units provide large computational power at a very low price, which positions them as a ubiquitous accelerator. Efficient primitives that can expand the range of operations performed on the GPU are thus important. Discrete Range Searching (DRS) is one such primitive, with direct applications to string processing, document and text retrieval systems, and least common ancestor queries. In this work, we present a GPU-specific implementation of DRS with an optimal space-time trade-off. Toward this end, we also present GPU-amenable succinct representations and discuss limitations on the GPU. Our method uses 7.5 bits of additional space per element. The speedup achieved by our method over a sequential implementation is in the range of 20-25 for preprocessing and 25-35 for batch querying. Compared to an 8-threaded implementation, our methods obtain a speedup of 6-8. We study applications of DRS on the GPU. We also suggest that most graph algorithms which rely on least common ancestor queries can easily be enabled on the GPU using the range minima primitive. Beyond this, we show applications of DRS in string querying and tree queries, and suggest how DRS can be helpful in implementing tree-based graph algorithms on the GPU.
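To make the range minima primitive concrete, here is a minimal Python sketch of a textbook sparse table that answers "index of the minimum in positions lo..hi" queries. It is a swapped-in technique, not the succinct representation described in the abstract: it uses O(n log n) words of extra space rather than 7.5 bits per element, and the names `build_sparse_table` and `range_min_index` are hypothetical.

```python
def build_sparse_table(values):
    """Level k stores, for each start i, the index of the minimum of
    values[i : i + 2**k]."""
    n = len(values)
    table = [list(range(n))]  # level 0: windows of length 1
    length = 1
    while 2 * length <= n:
        prev = table[-1]
        level = []
        for i in range(n - 2 * length + 1):
            a, b = prev[i], prev[i + length]
            # keep the leftmost minimum on ties
            level.append(a if values[a] <= values[b] else b)
        table.append(level)
        length *= 2
    return table


def range_min_index(values, table, lo, hi):
    """Index of the minimum of values[lo..hi] (inclusive bounds)."""
    k = (hi - lo + 1).bit_length() - 1  # largest power of two that fits
    a = table[k][lo]
    b = table[k][hi - (1 << k) + 1]
    return a if values[a] <= values[b] else b


values = [5, 2, 4, 7, 1, 3]
table = build_sparse_table(values)
print(range_min_index(values, table, 0, 2))
```

Each query combines two overlapping power-of-two windows, so a batch of queries is independent work items, which is the property that makes batch querying attractive on a data-parallel device.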

Large Graph Algorithms for Massively Multithreaded Architectures

Pawan Kumar Harish, Vibhav Vineet, Narayanan P J

Technical Report, arXiv, 2009

Core Rank : - Google Rank :-

@misc{bib_Larg_2009, AUTHOR = {Pawan Kumar Harish and Vibhav Vineet and Narayanan P J}, TITLE = {Large Graph Algorithms for Massively Multithreaded Architectures}, HOWPUBLISHED = {Technical Report, arXiv}, YEAR = {2009}}

Abstract

Graphics Processing Units (GPUs) provide high computation power at a low cost and are an important compute accelerator with a massively multithreaded architecture. In this paper, we present fast implementations of common graph operations like breadth-first search, st-connectivity, single-source shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow for undirected graphs on the GPU using the CUDA programming model. Our implementations exhibit high performance, especially on large graphs. We use two data-parallel programming methodologies for these algorithms. One is an iterative, mask-based approach that processes valid data elements like vertices and edges using independent threads for each. The other is a divide-and-conquer approach that reduces the problem into smaller problems that are handled later using the same approach. Parallel algorithms for such problems have been reported in the literature before, especially on supercomputers. The massively multithreaded model of the GPU makes it possible to adopt the data-parallel approach even to irregular algorithms like graph algorithms, using O(V) or O(E) simultaneous threads. The algorithms and the underlying techniques presented in this paper are likely to be applicable to many irregular algorithms. We show the impact of our implementations on random, scale-free, and real-life graphs of up to millions of vertices on high-end and low-end GPUs. The availability and spread of GPUs to desktops and laptops make them ideal candidates to accelerate graph operations over the CPU-only implementations. Practical implementations of basic operations go a long way in realizing their potential.

Scalable Split and Gather Primitives for the GPU

Suryakant Patidar, Narayanan P J

Technical Report, arXiv, 2009

Core Rank : - Google Rank :-

@misc{bib_Scal_2009, AUTHOR = {Suryakant Patidar and Narayanan P J}, TITLE = {Scalable Split and Gather Primitives for the GPU}, HOWPUBLISHED = {Technical Report, arXiv}, YEAR = {2009}}

Abstract

We present efficient implementations of two primitives for data mapping and distribution on the massively multithreaded architecture of the GPUs in this paper. The split primitive distributes elements of a list according to their category. Split is an important operation for data mapping and is used to build data structures, distribute work load, etc., in a massively parallel environment. The gather/scatter primitive performs fast, distributed data movement. Efficient data movement is critical to high performance on the GPUs, as suboptimal memory accesses can pay heavy penalties. The split we implement is a generalization of the binary split [Blelloch 1990] and is implemented using the shared memory and the atomic operations available on them. The split performance scales logarithmically with the number of categories, linearly with the list length, and linearly with the number of cores on the GPU. This makes it useful for applications that deal with large data sets. We also present a variant of split that partitions the indexes of records. This facilitates the use of the GPU as a coprocessor for split or sort, with the actual data movement handled separately. We can compute the split indexes for a list of 32 million records in 180 milliseconds for a 32-bit key and in 800 ms for a 96-bit key. The instantaneous locality of memory references plays a critical role in data movement on the current GPU memory architectures. For scatter and gather involving large records, we use collective data movement, in which multiple threads cooperate on individual records to improve the instantaneous locality. The split, gather, and their combinations find many applications, and we expect our primitives will be used by future GPU programmers. We show sorting of 16 million 128-byte records in 379 milliseconds with 4-byte keys and in 556 ms with 8-byte keys.
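The index-producing variant of split can be illustrated with a short sequential Python sketch: count the elements per category, take an exclusive prefix sum to get each category's start offset, then scatter record indexes to their slots. This shows only what the primitive computes; the shared-memory and atomic-operation implementation described above is not reproduced, and the names `split_indexes` and `gather` are hypothetical.

```python
def split_indexes(keys, num_categories):
    """Return (order, offsets): order lists record indexes grouped by
    category, offsets[c] is where category c starts in order."""
    # 1. Histogram: how many records fall in each category.
    counts = [0] * num_categories
    for k in keys:
        counts[k] += 1
    # 2. Exclusive prefix sum of the histogram gives start offsets.
    offsets = [0] * num_categories
    running = 0
    for c in range(num_categories):
        offsets[c] = running
        running += counts[c]
    # 3. Scatter each record index to the next free slot of its category.
    cursor = list(offsets)
    order = [0] * len(keys)
    for i, k in enumerate(keys):
        order[cursor[k]] = i
        cursor[k] += 1
    return order, offsets


def gather(records, order):
    """Move the records in a separate step, using the split indexes."""
    return [records[i] for i in order]


keys = [2, 0, 1, 0, 2, 1]
order, offsets = split_indexes(keys, 3)
print(order, offsets)
print(gather(keys, order))
```

Keeping the index computation separate from `gather` mirrors the idea of computing split indexes first and handling the actual data movement separately.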

Real-time ray tracing of implicit surfaces on the GPU

Jag Mohan Singh, Narayanan P J

IEEE Transactions on Visualization and Computer Graphics, TVCG, 2009

Core Rank : - Google Rank :89

@article{bib_Real_2009, AUTHOR = {Jag Mohan Singh and Narayanan P J}, TITLE = {Real-time ray tracing of implicit surfaces on the GPU}, JOURNAL = {IEEE Transactions on Visualization and Computer Graphics}, YEAR = {2009}}

Abstract

Compact representation of geometry using a suitable procedural or mathematical model and a ray-tracing mode of rendering fit the programmable graphics processor units (GPUs) well. Several such representations including parametric and subdivision surfaces have been explored in recent research. The important and widely applicable category of the general implicit surface has received less attention. In this paper, we present a ray-tracing procedure to render general implicit surfaces efficiently on the GPU. Though only the fourth or lower order surfaces can be rendered using analytical roots, our adaptive marching points algorithm can ray trace arbitrary implicit surfaces without multiple roots, by sampling the ray at selected points till a root is found. Adapting the sampling step size based on a proximity measure and a horizon measure delivers high speed. The sign test can handle any surface without multiple roots. The Taylor test that uses ideas from interval analysis can ray trace many surfaces with complex roots. Overall, a simple algorithm that fits the SIMD architecture of the GPU results in high performance. We demonstrate the ray tracing of algebraic surfaces up to order 50 and nonalgebraic surfaces including a Blinn’s blobby with 75 spheres at better than interactive frame rates.
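The idea of sampling the ray until a sign change brackets a root can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it marches with a fixed step and refines by bisection, so the adaptive step size (proximity and horizon measures) and the Taylor test from the abstract are not modelled. The function name `first_root_along_ray` and its parameters are hypothetical.

```python
def first_root_along_ray(f, origin, direction, t_max, step=0.05, refine_iters=40):
    """Return the smallest t in [0, t_max] where f(origin + t*direction)
    changes sign, or None if no sign change is seen at the sample points."""

    def value(t):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        return f(point)

    t_prev, v_prev = 0.0, value(0.0)
    if v_prev == 0.0:
        return t_prev
    t = step
    while t <= t_max:
        v = value(t)
        if v == 0.0:
            return t
        if (v < 0) != (v_prev < 0):
            # Sign test passed: a root lies in (t_prev, t); refine by bisection.
            lo, hi, v_lo = t_prev, t, v_prev
            for _ in range(refine_iters):
                mid = 0.5 * (lo + hi)
                v_mid = value(mid)
                if (v_mid < 0) == (v_lo < 0):
                    lo, v_lo = mid, v_mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        t_prev, v_prev = t, v
        t += step
    return None


def unit_sphere(p):
    """Implicit unit sphere: zero on the surface, negative inside."""
    x, y, z = p
    return x * x + y * y + z * z - 1.0


hit = first_root_along_ray(unit_sphere, (0.0, 0.0, -3.0), (0.0, 0.0, 1.0), t_max=10.0)
print(hit)
```

Every ray runs the same short loop independently, which is why this style of root finding suits a SIMD device; a fixed step can miss roots that lie closer together than the step, one reason the step size matters.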

Singular value decomposition on GPU using CUDA

Lahabar Sheetal Madhukar, Narayanan P J

International Parallel and Distributed Processing Symposium, IPDPS, 2009

Core Rank : A Google Rank :-

@inproceedings{bib_Sing_2009, AUTHOR = {Lahabar Sheetal Madhukar and Narayanan P J}, TITLE = {Singular value decomposition on GPU using CUDA}, BOOKTITLE = {International Parallel and Distributed Processing Symposium}, YEAR = {2009}}

Abstract

Linear algebra algorithms are fundamental to many computing applications. Modern GPUs are suited for many general purpose processing tasks and have emerged as inexpensive high performance co-processors due to their tremendous computing power. In this paper, we present the implementation of singular value decomposition (SVD) of a dense matrix on GPU using the CUDA programming model. SVD is implemented using the twin steps of bidiagonalization followed by diagonalization. It has not been implemented on the GPU before. Bidiagonalization is implemented using a series of Householder transformations which map well to BLAS operations. Diagonalization is performed by applying the implicitly shifted QR algorithm. Our complete SVD implementation significantly outperforms the MATLAB and Intel® Math Kernel Library (MKL) LAPACK implementations on the CPU. We show a speedup of up to 60 over the MATLAB implementation and up to 8 over the Intel MKL implementation on an Intel Dual Core 2.66 GHz PC with an NVIDIA GTX 280 for large matrices. We also give results for very large matrices on NVIDIA Tesla S1070.

Fast minimum spanning tree for large graphs on the GPU

Vibhav Vineet, Pawan Kumar Harish, Suryakant Patidar, Narayanan P J

Conference on High Performance Graphics, HPG, 2009

Core Rank : - Google Rank :-

@inproceedings{bib_Fast_2009, AUTHOR = {Vibhav Vineet and Pawan Kumar Harish and Suryakant Patidar and Narayanan P J}, TITLE = {Fast minimum spanning tree for large graphs on the GPU}, BOOKTITLE = {Conference on High Performance Graphics}, YEAR = {2009}}

Abstract

Graphics Processor Units are used for many general purpose processing tasks due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Borůvka’s approach for undirected graphs. We implement it using scalable primitives such as scan, segmented scan and split. The irregular steps of supervertex formation and recursive graph construction are mapped to primitives like split to categories involving vertex ids and edge weights. We obtain 30 to 50 times speedup over the CPU implementation on most graphs and 3 to 10 times speedup over our previous GPU implementation. We construct the minimum spanning tree on a 5 million node and 30 million edge graph in under 1 second on one quarter of the Tesla S1070 GPU.
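For readers unfamiliar with Borůvka’s approach, a minimal sequential Python sketch follows: in each round every component picks its cheapest outgoing edge, and the chosen edges merge components into supervertices. It uses a union-find structure in place of the scan, segmented scan and split primitives of the paper, so it illustrates the algorithm rather than the GPU formulation; the name `boruvka_mst` and the edge-index tie-break are assumptions of this sketch.

```python
def boruvka_mst(num_nodes, edges):
    """edges is a list of (u, v, weight); returns indexes of MST edges."""
    parent = list(range(num_nodes))

    def find(x):
        # Root of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while True:
        # Each component selects its cheapest outgoing edge. Ties are
        # broken by edge index so the selected edges cannot form a cycle.
        cheapest = {}
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                best = cheapest.get(r)
                if best is None or (w, idx) < (edges[best][2], best):
                    cheapest[r] = idx
        if not cheapest:
            break  # no edge leaves any component: done
        # Merge components along the selected edges (supervertex formation).
        for idx in cheapest.values():
            u, v, _ = edges[idx]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append(idx)
    return mst


edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (0, 3, 3), (0, 2, 2)]
tree = boruvka_mst(4, edges)
print(sorted(tree), sum(edges[i][2] for i in tree))
```

The number of components at least halves every round, so only a logarithmic number of rounds is needed, and the per-edge work inside a round is independent.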

Solving multilabel MRFs using incremental α-expansion on the GPUs

Vibhav Vineet, Narayanan P J

Asian Conference on Computer Vision, ACCV, 2009

Core Rank : B Google Rank :39

@inproceedings{bib_Solv_2009, AUTHOR = {Vibhav Vineet and Narayanan P J}, TITLE = {Solving multilabel MRFs using incremental α-expansion on the GPUs}, BOOKTITLE = {Asian Conference on Computer Vision}, YEAR = {2009}}

Abstract

Many vision problems map to the minimization of an energy function over a discrete MRF. Fast performance is needed if the energy minimization is one step in a control loop. In this paper, we present the incremental α-expansion algorithm for high-performance multilabel MRF optimization on the GPU. Our algorithm utilizes the grid structure of the MRFs for good parallelism on the GPU. We improve the basic push-relabel implementation of graph cuts using the atomic operations of the GPU and by processing blocks stochastically. We also reuse the flow using reparametrization of the graph from cycle to cycle and iteration to iteration for fast performance. We show results on various vision problems on standard datasets. Our approach takes 950 milliseconds on the GPU for stereo correspondence on Tsukuba image with 16 labels compared to 5.4 seconds on the CPU.