The viability of large-scale multimedia applications,
such as video-on-demand systems, digital libraries, tele-medicine,
and distance learning, depends on the performance of storage systems
and communication networks.
In my work I focus on storage systems for continuous
media (CM) applications.
Common to these applications are high I/O bandwidth and storage
requirements, coupled with some form of QoS-constrained data delivery
for admitted users.
CM applications place high demands for
performance, scalability, and reliability on storage servers,
which in turn requires efficient data placement,
scheduling, admission control, and provision for fault tolerance.
Some early works focused on single-logical-disk designs, which do not scale up.
My work focuses on large-scale storage systems (with
parallel or distributed I/O).
Although many basic principles of parallelism apply,
some characteristics are unique to storage subsystems.
Unlike, say, CPU resources, storage resources are not all equivalent:
their utility is determined by the data placed on them, which is
largely fixed a priori.
This ``partitioning'' of resources (based on data placement) contributes
to some of the difficulties in designing large-scale I/O systems.
Inappropriate data placement can lead to load
imbalance problems due to skewness in data access patterns
(common to CM applications).
Moreover, most storage devices can be viewed as
``2-dimensional'' (storage and bandwidth) resources,
which contributes to the hardness of the related resource allocation problems.
For instance, in
[D18]
we consider data placement
on parallel disks with applications to CM servers,
show this problem to be NP-hard, and study
approximation algorithms with tight bounds.
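To make the two-dimensional nature of the problem concrete, the
following sketch places each object on the first disk with enough
remaining storage and bandwidth; the greedy heuristic and all
parameters are hypothetical illustrations, not the approximation
algorithms of [D18].

    # Hedged sketch: greedy placement under two capacity constraints per disk.
    # The heuristic and all parameters are illustrative, not taken from [D18].
    def first_fit_place(objects, disks):
        """objects: list of (name, storage_gb, bandwidth_mbps) demands.
        disks: list of dicts with remaining 'storage' and 'bandwidth'."""
        placement = {}
        # Consider high-bandwidth (popular) objects first.
        for name, size, bw in sorted(objects, key=lambda o: -o[2]):
            for i, disk in enumerate(disks):
                if disk['storage'] >= size and disk['bandwidth'] >= bw:
                    disk['storage'] -= size
                    disk['bandwidth'] -= bw
                    placement[name] = i
                    break
            else:
                raise RuntimeError("cannot place " + name + ": both constraints bind")
        return placement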
Data Sharing
Due to the skewness in data access patterns,
I/O bandwidth is often the critical resource for CM data access.
One approach to
reducing the I/O demand on the server is to
``share'' data between requests for the same object, thereby
increasing the number of simultaneous users that can be supported.
Prior to our work,
two approaches to data sharing were in use, namely batching and buffering.
Our work
[A1,
D4]
introduced a third, novel technique of stream merging
(termed ``adaptive piggybacking''), which adjusts the display rates of
requests in progress until their corresponding I/O streams can be
``merged'' into one.
It achieves a significant
reduction in I/O bandwidth demand (e.g., $50-80\%$)
while avoiding the additional latency or additional buffer space
incurred by batching or buffering, respectively.
This work has been extended by other groups, for example
at IBM T.J. Watson.
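The intuition behind merging can be captured with simple arithmetic.
The sketch below is illustrative only: it assumes a symmetric display
rate adjustment by a fraction alpha, rather than the specific policies
of [A1, D4], and computes how long a trailing stream takes to catch
its leader.

    # Illustrative catch-up time for stream merging; the symmetric rate
    # adjustment and the example numbers are assumptions, not the exact
    # adaptive piggybacking policies of [A1, D4].
    def merge_time(gap_seconds, alpha, adjust_both=True):
        """Wall-clock seconds until a trailing stream catches its leader when
        display rates are sped up / slowed down by a fraction alpha."""
        closing_rate = 2 * alpha if adjust_both else alpha  # media-sec per wall-sec
        return gap_seconds / closing_rate

    # e.g., a 60-second gap with a barely perceptible 5% rate change merges
    # after 600 s if both streams are adjusted, 1200 s if only the trailer is.
    print(merge_time(60, 0.05), merge_time(60, 0.05, adjust_both=False))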
Since batching, buffering, and merging are not mutually exclusive, we
continued our work and studied appropriate combinations of these techniques.
In all such techniques, once users are ``grouped'' for the purpose of
data sharing, they may later leave the group due to the use of VCR
functionality (e.g., fast-forward) and hence require additional resources.
Thus, we introduced models
[A10,
D9]
for determining the amount of resources required for supporting both
normal playback and VCR functionality under predefined constraints
on performance and QoS characteristics.
These models allow us to
maximize the benefits of data sharing techniques as well
as make system sizing
decisions (e.g., to determine the appropriate
compromise between use of buffer and I/O bandwidth resources).
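As a back-of-the-envelope illustration of such sizing, one can
estimate the number of channels needed when grouped users occasionally
break away due to VCR interactions; the break-away model and numbers
below are assumptions for illustration, not the analytical models of
[A10, D9].

    # Rough sizing sketch; the break-away model is an assumption used for
    # illustration, not the models developed in [A10, D9].
    def required_channels(num_users, num_shared_streams, breakaway_prob):
        """Expected channels = shared streams + expected broken-away streams."""
        return num_shared_streams + num_users * breakaway_prob

    # e.g., 1000 admitted users sharing 50 streams, with 3% of users expected
    # to be in a VCR-induced break-away state at any time: about 80 channels.
    print(required_channels(1000, 50, 0.03))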
In another work on data sharing
[A7],
we construct both a
theoretical framework and practical optimal algorithms for
reducing the I/O bandwidth requirements by partitioning
client requests into groups and simultaneously
servicing each group with a single set of resources.
Our framework can accommodate
a large class of objective
functions and a variety of data sharing techniques
and applications,
including those with more interactive environments.
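For intuition, when requests are ordered by arrival time and groups
are required to be contiguous, an optimal partition can be computed by
dynamic programming over split points. The sketch below is a hedged
illustration with a placeholder cost function, not the specific
algorithms of [A7]; a cost function might, for instance, charge one
stream per group plus a penalty that grows with the group's
arrival-time spread.

    # Hedged sketch: optimal contiguous grouping by dynamic programming.
    # The O(n^2) recurrence and the placeholder cost are illustrative.
    def optimal_grouping(arrivals, group_cost):
        """arrivals: sorted arrival times; group_cost(i, j) is the resource
        cost of serving requests i..j (inclusive) with one set of resources."""
        n = len(arrivals)
        best = [0.0] + [float("inf")] * n   # best[k]: cost of first k requests
        cut = [0] * (n + 1)
        for k in range(1, n + 1):
            for j in range(k):              # last group covers requests j..k-1
                cost = best[j] + group_cost(j, k - 1)
                if cost < best[k]:
                    best[k], cut[k] = cost, j
        groups, k = [], n                   # recover groups from recorded cuts
        while k > 0:
            groups.append((cut[k], k - 1))
            k = cut[k]
        return best[n], groups[::-1]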
Data Placement and Dynamic Resource Management
The scalability of a CM server's architecture refers to
its ability to maintain performance characteristics under
workload growth and under degradation of system resources
(losses in network and storage capacity). The choice of data placement
techniques can have a significant effect on the scalability and efficiency
of a distributed CM server.
In the past, a great deal of work focused on data placement
based on ``wide'' data striping as an approach to dealing with load
imbalance problems. Another approach to this problem is replication.
An appropriate compromise between the degree of striping and the degree
of replication, which we term a hybrid system, is crucial to the design of
scalable CM servers.
We show
[D19]
that hybrid designs, in conjunction with dynamic
replication techniques which adjust to changes
in data access patterns,
are less dependent on interconnection network
constraints, provide higher reliability (even under conservative
assumptions), and can be properly sized so
as to result in cost-effective scalable systems.
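A simple availability comparison conveys one part of this argument.
The sketch below is a hedged illustration under independent-failure
assumptions with hypothetical parameters, not the evaluation of [D19]:
with wide striping and no redundancy every disk must be up, whereas a
hybrid scheme keeps several replicas, each striped over a small group
of disks.

    # Hedged availability sketch under independent disk failures; the
    # parameters and the comparison itself are illustrative, not from [D19].
    def p_unavailable_wide(p_fail, num_disks):
        """Object striped across all disks, no redundancy."""
        return 1 - (1 - p_fail) ** num_disks

    def p_unavailable_hybrid(p_fail, group_size, replicas):
        """Each replica striped over a small group; all replicas must fail."""
        p_group_down = 1 - (1 - p_fail) ** group_size
        return p_group_down ** replicas

    # e.g., 1% per-disk failure probability: 100-disk wide striping gives
    # roughly 0.63, versus roughly 0.0016 for 2 replicas on 4-disk groups.
    print(p_unavailable_wide(0.01, 100), p_unavailable_hybrid(0.01, 4, 2))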
We also studied dynamic replication policies
[D23]
under changes in data access patterns.
One distinguishing characteristic of our work
[D23]
is that we consider
schemes which do not rely on knowledge of object access frequencies
but rather make the adjustments, in a threshold-based manner,
based on the amount of resources currently available for
servicing user requests for those objects.
We believe that this is an important consideration, as determining
when a change in frequency requires a change in system state may not
be a simple matter.
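The flavor of such a threshold-based rule is easy to convey. The
sketch below reacts to the service capacity currently left for an
object rather than to estimated access frequencies; the thresholds and
bookkeeping are hypothetical, not the policies of [D23].

    # Hedged sketch of a threshold-based replication trigger; thresholds
    # and bookkeeping are illustrative assumptions, not the policies of [D23].
    def replication_decision(free_streams, num_replicas, low=2, high=20):
        """free_streams: unused service slots across the object's replicas."""
        if free_streams < low:
            return "add_replica"       # demand is about to exhaust capacity
        if free_streams > high and num_replicas > 1:
            return "remove_replica"    # capacity sits idle; reclaim the space
        return "no_change"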
Our other work on dynamic resource management
[A6,
C1]
focuses on exploitation of the multiresolution property of
compressed CM streams in environments that face
fluctuations in workload (e.g., due to VCR functionality)
or lack of system resources --- here we use the flexibility
of scalable compression to adjust a stream's bit rate
as needed.
We design a data placement scheme and study its
effects on disk bandwidth and buffer space utilization,
user requests' start-up delay, and the probability of
request rejection due to system overload
[A6].
We also propose techniques for re-scheduling data retrieval, which
dynamically adjust the resolution of streams in progress in reaction to
fluctuations in workload and system resource availability while satisfying
QoS constraints
[C1].
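Below is a minimal sketch of this idea, assuming layered streams whose
enhancement layers can be shed independently; the shedding policy and
parameters are illustrative, not the algorithms of [A6, C1].

    # Hedged sketch: shed enhancement layers until the aggregate bit rate
    # fits the available bandwidth; base layers are kept to preserve a
    # minimal QoS. Policy and parameters are illustrative, not from [A6, C1].
    def shed_layers(streams, capacity):
        """streams: {stream_id: [base_rate, enh1_rate, ...]}, e.g., in Mb/s.
        Returns the number of layers to deliver for each stream."""
        layers = {s: len(rates) for s, rates in streams.items()}
        total = sum(sum(rates) for rates in streams.values())
        while total > capacity:
            candidates = [s for s in streams if layers[s] > 1]
            if not candidates:
                break                   # only base layers remain
            # Drop the single enhancement layer that frees the most bandwidth.
            victim = max(candidates, key=lambda s: streams[s][layers[s] - 1])
            total -= streams[victim][layers[victim] - 1]
            layers[victim] -= 1
        return layers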
Fault Tolerance and Tertiary Storage
Our work on fault tolerance
[D5]
introduced the first non-RAID
fault-tolerant design of a multidisk CM storage server.
This research produced data placement
and scheduling techniques which yield highly reliable yet
significantly more cost-effective servers, with higher throughput and
lower latency than the more traditional, previously pursued RAID-based
techniques.
More recent work focused on
reducing the degradation in system
performance under failure and on rapid, but ``non-performance-degrading'',
recovery from failure back to normal operation
[A9].
Construction of a framework for performance and reliability
evaluation of possible design choices is described in
[A2].
Our work on tertiary storage systems
focused on the analysis of striping techniques in
robotic tape storage libraries, which indicated that significant
improvements in system response time can be achieved through the use
of non-RAID striping techniques
[D6],
and on QoS-based delivery of CM objects from tertiary
storage
[D13].