Abstract
Distributed data analytics engines are designed to process and analyze large data sets in a distributed
environment. The architecture of these engines is based on two key components: a distributed file
system and a distributed computing framework. In this thesis, we consider the following two problems:
coded data rebalancing in distributed file systems, and coded distributed computing.
For the data rebalancing problem, we consider replication-based distributed storage systems in which
each node stores the same amount of data and each stored data bit has the same replication factor across
the nodes. Such systems are referred to as balanced distributed databases. When existing nodes leave
or new nodes are added to such a system, the balanced nature of the database is lost, either because the
replication factor is reduced or because the storage across the nodes becomes non-uniform. This triggers a
rebalancing algorithm, which exchanges data between the nodes so that the balance of the database is
restored. The goal is then to design rebalancing schemes with minimal communication load. In a
recent work [1] by Krishnan et al., coded transmissions were used to rebalance a carefully designed
distributed database after a node removal or addition. These coded rebalancing schemes achieve the optimal
communication load; however, they require the file size to be at least exponential in the system parameters.
In this work, we consider a cyclic balanced database (in which data is placed cyclically across the system
nodes) and present coded rebalancing schemes for node removal and addition in such a database. These
databases (and the associated rebalancing schemes) require the file size to be only cubic in the number
of nodes in the system. We bound the advantage of our node-removal rebalancing scheme over the
uncoded scheme and show that our scheme has a smaller communication load. In the node-addition
scenario, the rebalancing scheme we present is a simple uncoded scheme, which we show achieves the
optimal load.
For the distributed computing problem, we consider the widely used MapReduce distributed computing framework, which consists of three phases, namely Map, Shuffle, and Reduce. In previous works
by Li et al. [2, 3], an optimal coded distributed computing scheme was presented. This optimal
scheme, however, requires a very high file complexity, i.e., one that is exponential in the number of servers K. To
address this issue, low-complexity distributed computing schemes using binary matrices derived from
combinatorial designs have been presented in [4, 5]. In our work, we use similar principles to construct
a distributed computing scheme via subspace designs, the q-analogs of combinatorial designs. While
our scheme requires only a low file complexity, it incurs a higher communication load than the optimal scheme with equivalent parameters, and it is primarily useful in scenarios with large local storage.
We also provide numerical comparisons with some existing baseline schemes.