Ethan L Miller
  • +1 831 459-1222
We compare and contrast long-term file system activity for different Unix environments over periods of 120 to 280 days. Our focus is on finding common long-term activity trends and reference patterns. Our analysis shows that 90% of all files are not used after initial creation, that those which are used are normally short-lived, and that if a file is not used in some manner the day after it is created, it will probably never be used. Additionally, we find that approximately 1% of all files are used daily. This information allows us to more accurately predict which files will never be used; such files can be compressed or moved to tertiary storage, enabling either more users per disk or larger user disk quotas.
Massively parallel file systems must provide high-bandwidth file access to programs running on their machines. Most accomplish this goal by striping files across arrays of disks attached to a few specialized I/O nodes in the massively parallel processor (MPP). This arrangement requires programmers to give the file system many hints on how their data is to be laid out on disk if they want to achieve good performance. Additionally, the custom interface makes massively parallel file systems hard for programmers to use and difficult to integrate seamlessly into an environment with workstations and tertiary storage. The RAMA file system addresses these problems by providing a massively parallel file system that does not need user hints to provide good performance. RAMA takes advantage of the recent decrease in physical disk size by assuming that each processor in an MPP has one or more disks attached to it. Hashing is then used to pseudo-randomly distribute data across all of these disks, ensuring high bandwidth regardless of access pattern. Since MPP programs often have many nodes accessing a single file in parallel, the file system must allow access to different parts of the file without relying on a particular node. In RAMA, a file request involves only two nodes: the node making the request and the node on whose disk the data is stored. Thus, RAMA scales well to hundreds of processors. Since RAMA needs no layout hints from applications, it fits well into systems where users cannot (or will not) provide such hints. Fortunately, this flexibility does not cause a large loss of performance. RAMA's simulated performance is within 10-15% of the optimum performance of a similarly-sized striped file system, and is a factor of 4 or more better than a striped file system with poorly laid out data.
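As a rough illustration of the placement rule described above, the sketch below hashes a (file, block) pair straight to a node, so any client can locate any block without layout hints or a central directory. The hash choice, names, and parameters are assumptions for illustration, not RAMA's actual implementation.

```python
# Illustrative sketch of hash-based block placement (not RAMA's actual code):
# every (file, block) pair maps pseudo-randomly to a node, so data spreads
# evenly across all disks regardless of access pattern.
import hashlib

def block_location(file_id: int, block_num: int, num_nodes: int) -> int:
    """Return the node whose local disk would hold this file block."""
    key = f"{file_id}:{block_num}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

# Blocks of one file end up scattered across a 128-node machine, so a request
# only ever involves the requesting node and the node the hash selects.
print([block_location(42, b, 128) for b in range(8)])
```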
The supercomputer center at the National Center for Atmospheric Research (NCAR) migrates large numbers of files to and from its mass storage system (MSS) because there is insufficient space to store them on the Cray supercomputer's local disks. This paper presents an analysis of file migration data collected over two years. The analysis shows that requests to the MSS are periodic, with one-day and one-week periods. Read requests to the MSS account for the majority of the periodicity, as write requests are relatively constant over the course of a week. Additionally, reads show a far greater fluctuation than writes over a day and a week, since reads are driven by human users while writes are machine-driven.
In this paper we present the analysis of two large-scale network file system workloads. We measured CIFS traffic for two enterprise-class file servers deployed in the NetApp data center for a three-month period. One file server was used by the marketing, sales, and finance departments and the other by the engineering department. Together these systems represent over 22 TB of storage used by over 1500 employees, making this the first large-scale study of the CIFS protocol. We analyzed how our network file system workloads compared to those of previous file system trace studies and took an in-depth look at access, usage, and sharing patterns. We found that our workloads were quite different from those previously studied; for example, our analysis found increased read-write file access patterns, decreased read-write ratios, more random file access, and longer file lifetimes. In addition, we found a number of interesting properties regarding file sharing, file re-use, and the access patterns of file types and users, showing that modern file system workloads have changed in the past 5–10 years. This change in workload characteristics has implications for the future design of network file systems, which we describe in the paper.
Users are storing ever-increasing amounts of information digitally, driven by many factors including government regulations and the public's desire to digitally record their personal histories. Unfortunately, many of the security mechanisms that modern systems rely upon, such as encryption, are poorly suited for storing data for indefinitely long periods of time: it is very difficult to manage keys and update cryptosystems to provide secrecy through encryption over periods of decades. Worse, an adversary who can compromise an archive need only wait for cryptanalysis techniques to catch up to the encryption algorithm used at the time of the compromise in order to obtain "secure" data. To address these concerns, we have developed POTSHARDS, an archival storage system that provides long-term security for data with very long lifetimes without using encryption. Secrecy is achieved by using provably secure secret splitting and spreading the resulting shares across separately-managed archives. Providing availability and data recovery in such a system can be difficult; thus, we use a new technique, approximate pointers, in conjunction with secure distributed RAID techniques to provide availability and reliability across independent archives. To validate our design, we developed a prototype POTSHARDS implementation, which has demonstrated "normal" storage and retrieval of user data using indexes, the recovery of user data using only the pieces a user has stored across the archives, and the reconstruction of an entire failed archive.
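To make the secret-splitting idea concrete, here is a minimal sketch of XOR-based n-of-n splitting, one simple information-theoretically secure scheme; POTSHARDS' actual splitting, approximate pointers, and distributed-RAID machinery are considerably more involved, and all names here are illustrative. Spreading the shares across separately-managed archives is what removes the need for long-lived encryption keys.

```python
# Minimal XOR n-of-n secret splitting: no subset of fewer than n shares
# reveals anything about the data, and XORing all n shares recovers it.
import secrets

def split(data: bytes, n: int) -> list[bytes]:
    """Split data into n shares, all of which are needed to reconstruct."""
    shares = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    last = bytearray(data)
    for share in shares:
        for i, b in enumerate(share):
            last[i] ^= b
    return shares + [bytes(last)]

def combine(shares: list[bytes]) -> bytes:
    """XOR all shares together to recover the original data."""
    out = bytearray(len(shares[0]))
    for share in shares:
        for i, b in enumerate(share):
            out[i] ^= b
    return bytes(out)

record = b"archival record"
shares = split(record, 4)            # store each share at a different archive
assert combine(shares) == record
```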
With the growing use of large-scale distributed systems, the likelihood that at least one node is compromised is increasing. Large-scale systems that process sensitive data such as geographic data with defense implications, drug modeling, nuclear explosion modeling, and private genomic data would benefit greatly from strong security for their storage. Nevertheless, many high performance computing (HPC), cloud, or secure content delivery network (SCDN) systems that handle such data still store them unencrypted or use simple encryption schemes, relying heavily on physical isolation to ensure confidentiality, providing little protection against compromised computers or malicious insiders. Moreover, current encryption solutions cannot efficiently provide fine-grained encryption for large datasets. Our approach, Horus, encrypts large datasets using keyed hash trees (KHTs) to generate different keys for each region of the dataset, providing fine-grained security: the key for one region cannot be used to access another region. Horus also reduces key management and distribution overhead while providing end-to-end data encryption and reducing the need to trust system operators or cloud service providers. Horus requires little modification to existing systems and user applications. Performance evaluation shows that our prototype's key distribution is highly scalable and robust: a single key server can provide 140,000 keys per second, theoretically enough to sustain more than 100 GB/s I/O throughput, and multiple key servers can efficiently operate in parallel to support load balancing and reliability.
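The key idea behind a keyed hash tree can be sketched in a few lines: each child key is a keyed hash of its parent's key and its position, so possession of a region's key yields keys only for data inside that region. The sketch below assumes HMAC-SHA256 and illustrative names; it is not Horus's exact construction.

```python
# Sketch of keyed-hash-tree (KHT) key derivation: keys flow downward only,
# so holding the key for one region cannot yield a sibling region's key.
import hashlib
import hmac

def child_key(parent_key: bytes, index: int) -> bytes:
    """Derive the key of the index-th child of a tree node."""
    return hmac.new(parent_key, str(index).encode(), hashlib.sha256).digest()

def region_key(start_key: bytes, path: list[int]) -> bytes:
    """Walk down the tree along `path`, deriving one key per level."""
    key = start_key
    for index in path:
        key = child_key(key, index)
    return key

root = b"\x00" * 32                        # stand-in; the real root key stays on the key server
subtree_key = region_key(root, [3])        # key handed to a client for region 3 only
block_key = region_key(subtree_key, [1])   # the client refines it locally...
assert block_key == region_key(root, [3, 1])  # ...matching what the key server would compute
```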
Galois Field arithmetic forms the basis of Reed-Solomon and other erasure coding techniques to protect storage systems from failures. Most implementations of Galois Field arithmetic rely on multiplication tables or discrete logarithms to perform this operation. However, the advent of 128-bit instructions, such as Intel's Streaming SIMD Extensions, allows us to perform Galois Field arithmetic much faster. This short paper details how to leverage these instructions for various field sizes, and demonstrates the significant performance improvements on commodity microprocessors. The techniques that we describe are available as open source software.
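For readers unfamiliar with the baseline being accelerated, the sketch below shows the discrete-logarithm table approach mentioned above for GF(2^8), using the common polynomial 0x11D; the paper's contribution is replacing these per-byte lookups with 128-bit shuffle instructions, which this illustrative snippet does not attempt.

```python
# Table-based GF(2^8) multiplication via log/antilog tables over the
# polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), with 2 as the generator.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:          # reduce modulo the field polynomial
        x ^= 0x11D
for i in range(255, 512):  # duplicate so gf_mul never needs a modulo
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

# Sanity checks: 1 is the multiplicative identity and there are no zero divisors.
assert all(gf_mul(a, 1) == a for a in range(256))
assert all(gf_mul(a, b) != 0 for a in range(1, 256) for b in range(1, 256))
```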
The scale of today's storage systems has made it increasingly difficult to find and manage files. To address this, we have developed Spyglass, a file metadata search system that is specially designed for large-scale storage systems. Using an optimized design, guided by an analysis of real-world metadata traces and a user study, Spyglass allows fast, complex searches over file metadata to help users and administrators better understand and manage their files. Spyglass achieves fast, scalable performance through the use of several novel metadata search techniques that exploit metadata search properties. Flexible index control is provided by an index partitioning mechanism that leverages namespace locality. Signature files are used to significantly reduce a query's search space, improving performance and scalability. Snapshot-based metadata collection allows incremental crawling of only modified files. A novel index versioning mechanism provides both fast index updates and "back-in-time" search of metadata. An evaluation of our Spyglass prototype using our real-world, large-scale metadata traces shows search performance that is 1-4 orders of magnitude faster than existing solutions. The Spyglass index can quickly be updated and typically requires less than 0.1% of disk space. Additionally, metadata collection is up to 10× faster than existing approaches.
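As a rough sketch of how a signature file prunes a search, the snippet below keeps a small Bloom-filter-style bit array per index partition; a query descends only into partitions whose signatures may contain the requested value. The sizes, hash choice, and class names are assumptions for illustration, not Spyglass's on-disk format.

```python
# Bloom-filter-style signature per partition: membership tests can return
# false positives but never false negatives, so "no" answers safely prune
# entire partitions from a metadata query.
import hashlib

SIG_BITS = 1024        # illustrative size
NUM_HASHES = 3

def _bit_positions(value: str):
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(digest[:4], "big") % SIG_BITS

class PartitionSignature:
    def __init__(self):
        self.bits = 0

    def add(self, value: str):
        """Record a metadata value seen while crawling this partition."""
        for pos in _bit_positions(value):
            self.bits |= 1 << pos

    def may_contain(self, value: str) -> bool:
        """False means the partition definitely does not contain the value."""
        return all((self.bits >> pos) & 1 for pos in _bit_positions(value))

sig = PartitionSignature()
sig.add("owner=alice")
print(sig.may_contain("owner=alice"), sig.may_contain("owner=bob"))  # True, (almost certainly) False
```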
As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, low-power, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequately fulfills all of these requirements. Tape-based archival systems suffer from poor random access performance, which prevents the use of inter-media redundancy techniques and auditing, and requires the preservation of legacy hardware. Many disk-based systems are ill-suited for long-term storage because their high energy demands and management requirements make them cost-ineffective for archival purposes. Our solution, Pergamum, is a distributed network of intelligent, disk-based storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests, and inter-disk data verification to be performed while the disk is powered off. Pergamum uses both intra-disk and inter-disk redundancy to guard against data loss, relying on hash tree-like structures of algebraic signatures to efficiently verify the correctness of stored data. If failures occur, Pergamum uses staggered rebuild to reduce peak energy usage while rebuilding large redundancy stripes. We show that our approach is comparable in both startup and ongoing costs to other archival technologies and provides very high reliability. An evaluation of our implementation of Pergamum shows that it provides adequate performance.
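One of the mechanisms above, deferring small writes into NVRAM so a sleeping disk stays spun down, can be sketched as below; the class, method names, and spin-up policy are illustrative assumptions, not Pergamum's implementation.

```python
# Illustrative deferred-write path: while the disk is powered off, small items
# accumulate in NVRAM and are flushed in one batch at the next spin-up.
class ArchivalNode:
    def __init__(self):
        self.nvram = []              # survives power loss: deferred writes, signatures, metadata
        self.disk_spinning = False

    def write(self, offset: int, data: bytes):
        if self.disk_spinning:
            self._disk_write(offset, data)
        else:
            self.nvram.append((offset, data))   # defer rather than wake the disk

    def spin_up(self):
        self.disk_spinning = True
        for offset, data in self.nvram:         # flush the backlog in one pass
            self._disk_write(offset, data)
        self.nvram.clear()

    def _disk_write(self, offset: int, data: bytes):
        pass                                    # stand-in for the real disk I/O path
```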
We have developed a scheme to secure network-attached storage systems against many types of attacks. Our system uses strong cryptography to hide data from unauthorized users; someone gaining complete access to a disk cannot obtain any useful data from the system, and backups can be done without allowing the super-user access to cleartext. While insider denial-of-service attacks cannot be prevented (an insider can physically destroy the storage devices), our system detects attempts to forge data. The system was developed using a raw disk, and can be integrated into common file systems. All of this security can be achieved with little penalty to performance. Our experiments show that, using a relatively inexpensive commodity CPU attached to a disk, our system can store and retrieve data with virtually no penalty for random disk requests and only a 15–20% performance loss over raw transfer rates for sequential disk requests. With such a minor performance penalty, there is no longer any reason not to include strong encryption and authentication in network file systems.
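A minimal sketch of the per-block protection described above, using AES-GCM from the third-party `cryptography` package as a stand-in for the paper's actual cipher suite: data on disk is unreadable without the key, and forged or relocated blocks fail authentication on read. The key handling shown here is deliberately simplified.

```python
# Authenticated encryption per disk block: binding the block number into the
# associated data means a block copied to a different location is rejected.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)   # simplified; real key management is the hard part
aead = AESGCM(key)

def write_block(block_num: int, plaintext: bytes) -> bytes:
    """Return the bytes that would actually be stored on disk."""
    nonce = os.urandom(12)
    ciphertext = aead.encrypt(nonce, plaintext, str(block_num).encode())
    return nonce + ciphertext

def read_block(block_num: int, stored: bytes) -> bytes:
    """Decrypt and verify; raises InvalidTag if the block was forged or moved."""
    nonce, ciphertext = stored[:12], stored[12:]
    return aead.decrypt(nonce, ciphertext, str(block_num).encode())

on_disk = write_block(7, b"sensitive payload")
assert read_block(7, on_disk) == b"sensitive payload"
```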
Stored data needs to be protected against device failure and irrecoverable sector read errors, yet doing so at exabyte scale can be challenging given the large number of failures that must be handled. We have developed RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, an approach to storage system redundancy that only uses XOR-based parity and employs a graph to lay out data and parity. The RESAR layout offers greater robustness and higher flexibility for repair at the same overhead as a declustered version of RAID 6. For instance, a RESAR-based layout with 16 data disklets per stripe has about 50 times lower probability of suffering data loss in the presence of a fixed number of failures than a corresponding RAID 6 organization. RESAR uses a layer of virtual storage elements to achieve better manageability, a broader potential for energy savings, as well as easier adoption of heterogeneous storage devices.
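The XOR-only redundancy that RESAR builds on can be shown in a few lines: a parity element is the XOR of the data elements it covers, and any single missing element is rebuilt by XORing the survivors. RESAR's actual contribution, the graph-based layout that decides which elements share parity, is not modeled in this illustrative sketch.

```python
# XOR parity over a stripe of equal-sized elements, plus single-failure repair.
def xor_together(elements: list[bytes]) -> bytes:
    out = bytearray(len(elements[0]))
    for element in elements:
        for i, b in enumerate(element):
            out[i] ^= b
    return bytes(out)

def rebuild(survivors: list[bytes], parity: bytes) -> bytes:
    """Recover the one missing element from the surviving elements and parity."""
    return xor_together(survivors + [parity])

stripe = [b"AAAA", b"BBBB", b"CCCC"]      # data disklets in one stripe
parity = xor_together(stripe)             # parity disklet
assert rebuild([stripe[0], stripe[2]], parity) == stripe[1]   # disklet 1 lost, then rebuilt
```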
Data used in high-performance computing (HPC) applications is often sensitive, necessitating protection against both physical compromise of the storage media and "rogue" computation nodes. Existing approaches to security may require trusting storage nodes and are vulnerable to a single computation node gathering keys that can unlock all of the data used in the entire computation. Our approach, Horus, encrypts petabyte-scale files using a keyed hash tree to generate different keys for each region of the file, supporting much finer-grained security. A client can only access a file region for which it has a key, and the tree structure allows keys to be generated for large and small regions as needed. Horus can be integrated into a file system or layered between applications and existing file systems, simplifying deployment. Keys can be distributed in several ways, including the use of a small stateless key cluster that strongly limits the size of the system that must be secured against attack. The system poses no added demand on the metadata cluster or the storage devices, and little added demand on the clients beyond the unavoidable need to encrypt and decrypt data, making it highly suitable for protecting data in HPC systems.
…and modeling; multiple self-adaptive system challenges: composition and openness; goals, objectives, and trust: the human side of autonomics. A working group was convened to study each problem. Each working group met in the afternoon and presented a report; these are briefly summarized next. Single self-adaptive systems: single self-adaptive systems can now be built, but systematic methods should be developed for building them. Systematic methods require good models for prediction, control, error detection/fault diagnosis, and optimization. Models must describe behavior at different time and detail scales, for different tasks (e.g., energy, error detection), and for different degrees of accuracy. Models can be self-learned or provided by expert human engineers. Models should describe both the system and its environment. Objectives need to be clearly defined for accountability, performance, and reliability of self-adaptive systems. Multiple self-adaptive systems: Multi...
In the past few years, the explosive growth of the Internet has allowed the construction of "virtual" systems containing hundreds or thousands of individual, relatively inexpensive computers. The agent paradigm is well-suited for this environment because it is based on distributed autonomous computation. Although the definition of a software agent varies widely, some common features are present in most definitions of agents. Agents should be autonomous, operating independently of their creators. Agents should have the ability to ...
Managing storage in the face of relentless growth in the number and variety of files on storage systems creates demand for rich file system metadata, as is made evident by the recent emergence of rich metadata support in many applications as well as file systems. Yet, little support exists for sharing metadata across file systems
Network file system usage has grown significantly in recent years due to the desire to lower costs, improve storage utilization, and ease administration by consolidating storage. In order to understand how future network file systems should be designed, it is critical to have a detailed understanding of how they are currently used in practice. We conducted an
