Abstract
Our computing demands have grown to the point where robust distributed computing platforms are needed to process data. To feed such data-hungry systems, we need an equally robust distributed file system that spans multiple geographically separate locations. Most distributed file systems split each file into a set of blocks or chunks that are spread across the cluster. The major bottleneck is deciding where to place the blocks and their replicas so that the cluster is optimized on parameters such as disk utilization, network congestion, throughput, and power consumption. This paper proposes assigning a distance measure to each data source with respect to the others and placing each block on the disk, on the node, that minimizes the total distance of the last few requests made for that block. As the request pattern and parameters change, the distances are updated and the blocks are moved dynamically to minimize the distance, in effect optimizing the required parameters. The distance function is modeled on the cluster and the parameters to be optimized: it can be a function of bandwidth alone, of bandwidth and latency, or of further features such as disk utilization, processing power, disk speed, power consumption, cooling requirements, and temperature. A detailed performance analysis was carried out with disk bandwidth, network bandwidth, and disk utilization as the parameters; performance improved by 10% or more over the reference system (depending on specification differences), which has no knowledge of the different types of disks present or of the nature of the cluster. Owing to its performance-aware nature, the system was inherently able to exploit memory to speed up access through in-memory partitions.
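As a minimal sketch of the scheme, assuming a linear distance over network bandwidth, latency, and disk utilization (the node names, link metrics, weights, and the linear form are illustrative assumptions, not the system's actual model), the placement rule can be written as follows:

    # Minimal sketch of distance-based block placement. Node names, link
    # metrics, weights, and the linear distance are illustrative assumptions.

    # Measured link metrics between node pairs: (bandwidth in Mb/s, latency in ms).
    LINKS = {
        frozenset({"A", "B"}): (1000.0, 0.5),
        frozenset({"A", "C"}): (100.0, 8.0),
        frozenset({"B", "C"}): (100.0, 9.0),
    }

    # Per-node disk utilization as a fraction in [0, 1].
    DISK_UTIL = {"A": 0.90, "B": 0.40, "C": 0.20}

    def distance(requester, candidate, w_bw=100.0, w_lat=1.0, w_util=10.0):
        """Distance from a requester to a candidate node; lower is better.
        Penalizes slow links, high latency, and nearly full disks."""
        if requester == candidate:
            bw, lat = float("inf"), 0.0  # local access: no network cost
        else:
            bw, lat = LINKS[frozenset({requester, candidate})]
        return w_bw / bw + w_lat * lat + w_util * DISK_UTIL[candidate]

    def choose_placement(recent_requesters, candidates):
        """Place the block on the node that minimizes the total distance
        over the last few requests made for it."""
        return min(candidates,
                   key=lambda c: sum(distance(r, c) for r in recent_requesters))

    if __name__ == "__main__":
        # A block recently requested twice from node A and once from node C
        # lands on B: its fast link to A outweighs its higher disk utilization.
        print(choose_placement(["A", "A", "C"], ["A", "B", "C"]))  # -> B

As the request window slides and the measured metrics change, re-evaluating the placement in this way identifies the node a block should migrate to, which is how the dynamic rebalancing described above would operate.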