On exploiting location flexibility in data-intensive distributed systems




Yu, Boyang




With the rapid growth of data-intensive distributed systems, novel and principled approaches are needed to improve system efficiency, ensure service quality that satisfies user requirements, and lower operating cost. This dissertation studies design issues in data-intensive distributed systems, which differ from other systems in their heavy data-movement workload and in that the destination of each data flow is limited to a subset of the available locations, such as the servers holding the requested data. Moreover, even within the feasible subset, different locations may yield different performance. The studies in this dissertation improve data-intensive systems by exploiting flexibility in data storage location. The dissertation addresses how to determine data placement from measured request patterns, using the proposed hypergraph models for data placement to improve a range of performance metrics, such as data access latency, system throughput, and various costs. To implement the proposal with lower overhead, a sketch-based data placement scheme is presented, which constructs a sparsified hypergraph under a distributed, streaming system model, achieving a good approximation of the performance improvement. Because frequent data movement among storage nodes can make the network the bottleneck of distributed data-intensive systems, an online data placement scheme based on reinforcement learning is proposed, which determines the storage locations of each data item at the moment the item is written or updated, with joint awareness of network conditions and request patterns.
Meanwhile, since distributed memory caches are effective at reducing the load on backend storage systems, the auto-scaling of memory cache clusters is studied, with the aim of balancing the energy cost of the service against the performance it ensures. The schemes and methods designed in this dissertation help to improve the running efficiency of data-intensive distributed systems. They can therefore either improve the user-perceived service quality at the same level of system resource investment, or lower the monetary expense and energy consumption of maintaining the system at the same performance standard. From these two perspectives, both end users and system providers can benefit from the results of the studies.



Distributed systems, Data storage, Cache cluster, Hypergraphs, Reinforcement learning, System optimization, Data placement, Auto-scaling