iSCSI PERFORMANCE OVER RDMA-ENABLED NETWORK
A Thesis by
Amarnath Pallampati
Bachelor’s Degree in Electronics and Communications Engineering, VNR VJIET, 2004
Submitted to Department of Electrical and Computer Engineering
and faculty of the Graduate School of
Wichita State University
in partial fulfillment of
the requirements for the degree of
Master of Science
July 2006
iSCSI PERFORMANCE OVER RDMA-ENABLED NETWORK
I have examined the final copy of this thesis for form and content, and recommend that it
be accepted in partial fulfillment of the requirements for the degree of Master of Science with a
major in Electrical and Computer Engineering.
______________________________
Dr. Ravi Pendse, Committee Chair
We have read this thesis and recommend its acceptance:
_______________________________
Dr. John Watkins, Committee Member
________________________________
Dr. Krishnan, Committee Member
DEDICATION
To my parents and my family who have supported me throughout my life. Without their help, it
would not have been possible to finish my thesis.
ACKNOWLEDGEMENTS
I would like to gratefully acknowledge my advisor, Dr. Ravi Pendse, for giving me an
opportunity to do my Master’s thesis under him in the field of storage networking. I would also
like to thank my committee for taking the time to work with me in this endeavor.
I am thankful to Amarnath Jasti and all my friends at the Advanced Networking Research
Center (ANRC) at Wichita State University for their help they have given me during this time.
Finally, I am forever indebted to my parents and family for supporting and encouraging
me throughout my life to achieve higher goals.
ABSTRACT
With an increase in the amount of information being exchanged every day, there is a need
for storage repositories and faster networks to retrieve data with a minimum amount of time
delay. Keeping this in mind, inroads have been made in developing different types of storage
network technologies including Direct Attached Storage (DAS), Network Attached Storage
(NAS), and Storage Area Network (SAN). Developments have also been made in building faster
and parallel Input/Output (I/O) peripheral interface systems such as the Small Computer System
Interface (SCSI). With the help of SCSI, high-speed bus systems have been built that quickly
transfer large amounts of data requested by the user, which is a very important need in storage
area networks.
Internet SCSI (iSCSI) protocol is a Small Computer System Interface (SCSI) transport
protocol developed by the Internet Engineering Task Force (IETF), which maps block-oriented
storage data over Transmission Control Protocol/Internet Protocol (TCP/IP) networks. This
iSCSI storage technology is a viable solution utilizing IP networks for low-cost implementation
of managing storage networks. Unlike the file access mechanism provided by protocols such as
Network File System (NFS), iSCSI implementation has a block access mechanism to provide
better performance and throughput. Remote Direct Memory Access (RDMA) technology
provided by the Internet Wide Area RDMA Protocol (iWARP), which runs over TCP/IP networks,
provides efficient data transfers. The RDMA protocol is primarily chosen because it efficiently uses
storage I/O (Input/Output) systems by providing a Zero Copy transfer mechanism.
Implementing iSCSI over RDMA-enabled network adapters exploits RDMA operations
for efficient data transfers using the existing infrastructure. This thesis involves the
implementation and performance evaluation of iSCSI over RDMA, a process that maps iSCSI
protocol over the iWARP protocol suite. The main idea behind this implementation is to offload
the data transfer operation incurred in iSCSI from TCP to RDMA operations, which has shown
the benefits of attaining better performance. Since a large portion of data is transferred during the
iSCSI storage I/O process, RDMA with its Zero Copy transfer ability provides better
performance.
iSCSI over RDMA uses the generic data movement of iSCSI to deliver data efficiently.
This particular implementation provides a chance for iSCSI to operate in conjunction with
RDMA over TCP/IP since RDMA-enabled Network Interface Controllers (RNICs) are expected
to become pervasive.
This study analyzes the performance of SAN storage devices, demonstrating that iSCSI
over RDMA performs better than running only iSCSI. In addition, the low cost and ease with
which an iSCSI over RDMA can be managed make it superior to iSCSI, considering the
performance gained.
The iSCSI target from iSCSI Enterprise Target (IET) installed on a Fedora Core 5
machine with a 2.6.16 kernel was used to evaluate iSCSI. The Initiator was built using a
Fedora Core 5 source iSCSI Initiator on a Fedora Core 5 machine with a 2.6.16 kernel and the
Microsoft iSCSI Initiator on a Microsoft Windows Server 2003 machine [1] [2].
This research used RNICs developed by Ammasso Corporation to directly place data
into Initiator buffers using Zero Copy transfer. Hence, unnecessary multiple copying procedures
are eliminated, thereby reducing memory bus utilization and improving performance, which definitely
impacts the overall performance of a system when a large amount of data is being transferred,
as occurs with iSCSI. The iSCSI over RDMA protocol implementation gives high performance
and low overhead for I/O storage.
TABLE OF CONTENTS

Chapter                                                                                          Page

1.	INTRODUCTION ...............................................................................................................1
	1.1	Growth of Storage Area Networks ..........................................................................1
		1.1.1	Direct Attached Storage.............................................................................2
		1.1.2	Network Attached Storage.........................................................................2
		1.1.3	Storage Area Network................................................................................4
	1.2	Objectives ................................................................................................................5
	1.3	Organization of Thesis.............................................................................................6
2.	INTRODUCTION TO SCSI................................................................................................7
	2.1	SCSI Commands......................................................................................................7
	2.2	SCSI Messages and Handshaking............................................................................8
	2.3	SCSI Bus Types ......................................................................................................8
		2.3.1	SCSI-1 ......................................................................................................8
		2.3.2	SCSI-2 ......................................................................................................9
		2.3.3	SCSI-3 ......................................................................................................9
		2.3.4	SCSI Bus Signals ....................................................................................10
		2.3.5	SCSI Phases ............................................................................................10
3.	OVERVIEW OF iSCSI PROTOCOL ................................................................................12
	3.1	Introduction to iSCSI .............................................................................................12
	3.2	iSCSI Naming ........................................................................................................14
	3.3	iSCSI Session Establishment Procedure................................................................17
	3.4	iSCSI Implementation............................................................................................19
	3.5	iSCSI Read and Write............................................................................................19
4.	INTRODUCTION TO RDMA PROTOCOL....................................................................22
	4.1	Introduction to RDMA...........................................................................................22
	4.2	RDMA Overview...................................................................................................23
	4.3	RDMA Layering Overview ...................................................................................24
		4.3.1	RDMA.....................................................................................................24
		4.3.2	Direct Data Placement ............................................................................25
		4.3.3	MPA........................................................................................................26
		4.3.4	TCP and IP..............................................................................................26
	4.4	RDMA Data Transfer Operations ..........................................................................27
		4.4.1	RDMA Send............................................................................................27
		4.4.2	RDMA Write ..........................................................................................28
		4.4.3	RDMA Read ...........................................................................................28
		4.4.4	RDMA Terminate ...................................................................................29
	4.5	RDMA Data Transfer Example .............................................................................29
	4.6	Ammasso RNIC.....................................................................................................30
	4.7	Advantage of Using RDMA ..................................................................................31
5.	PERFORMANCE ANALYSIS OF iSCSI OVER RDMA.................................................32
	5.1	Test Objectives.......................................................................................................35
	5.2	Test Procedure .......................................................................................................36
	5.3	Theoretical Calculations ........................................................................................37
	5.4	Analysis of Test Results and Parameters ...............................................................38
6.	CONCLUSIONS AND FUTURE WORK ..........................................................................43
LIST OF REFERENCES...............................................................................................................45
LIST OF FIGURES

Figure                                                                                           Page

1.1	Network attached storage.....................................................................................................3
1.2	Storage area network............................................................................................................4
2.1	SCSI phases ......................................................................................................................11
3.1	iSCSI protocol model.........................................................................................................12
3.2	iSCSI network entity..........................................................................................................16
3.3	iSCSI session establishment and login phase ....................................................................18
3.4	Normal SCSI write.............................................................................................................20
3.5	Normal SCSI read ..............................................................................................................21
4.1	RDMA layering overview..................................................................................................24
5.1	SCSI read with RDMA ......................................................................................................34
5.2	SCSI write with RDMA.....................................................................................................35
5.3	Test bed setup ....................................................................................................................38
5.4	IOPS in iSCSI and iSCSI over RDMA..............................................................................40
5.5	Percentage of CPU utilized in iSCSI and iSCSI over RDMA...........................................41
LIST OF ACRONYMS
ASIC ................................................................................... Application Specific Integrated Circuit
CCIL ............................................................................................. Cluster Core Interface Language
CDB ..................................................................................................... Command Descriptor Block
CPU............................................................................................................. Central Processing Unit
CRC...................................................................................................... Cyclical Redundancy Check
DAS.............................................................................................................Direct Attached Storage
DDP............................................................................................................... Direct Data Placement
EUI.........................................................................................................Extended Unique Identifier
FPDU ........................................................................................................................... Framed PDU
FC...................................................................................................................................Fedora Core
HBA ......................................................................................................................Host Bus Adapter
IET .............................................................................................................. iSCSI Enterprise Target
IETF ...............................................................................................Internet Engineering Task Force
I/O .................................................................................................................................Input/Output
IOPS......................................................................................Input/Output Operations Per Second
IP .............................................................................................................................Internet Protocol
IPSEC....................................................................................................... Internet Protocol Security
ISID...................................................................................................Initiator Session Identification
IT................................................................................................................ Information Technology
iSCSI ..............................................................................Internet Small Computer System Interface
iqn ................................................................................................................. iSCSI Qualified Name
iWARP.................................................................................... Internet Wide Area RDMA Protocol
LIST OF ACRONYMS (Continued)
JFS......................................................................................................................Journal File System
LAN ..................................................................................................................Local Area Network
LBA...............................................................................................................Logical Block Address
MPA..............................................................................Marker-Based Protocol-Data-Unit-Aligned
NAS.........................................................................................................Network Attached Storage
PDUs ..................................................................................................................Protocol Data Units
RDMA.............................................................................................. Remote Direct Memory Access
RNIC ........................................................................................RDMA Network Interface Controller
R2T .......................................................................................................................Ready to Transfer
RSH..............................................................................................................................Remote Shell
SAN................................................................................................................Storage Area Network
SCSI ............................................................................................ Small Computer System Interface
TCP/IP.................................................................. Transmission Control Protocol/Internet Protocol
TSIN..................................................................................................... Target Session Identification
ULP .................................................................................................................Upper Layer Protocol
CHAPTER 1
INTRODUCTION
1.1	Growth of Storage Area Networks
With the enormous amount of data being exchanged daily there is a definite need for
storage repositories and the ability to retrieve data with a minimum amount of time delay.
Storage capacity has grown significantly, and due to shrinking budgets, Information Technology
(IT) managers are forced to find storage solutions that improve disk utilization, data availability,
performance, and protection, and that reduce the number of data centers, maintenance costs, and
Central Processing Unit (CPU) loads on storage devices.
Basically, network storage is simply a method to store data and make it available to
clients over the network. Over the years, keeping the above requirements in mind, developments
have been made in technology in order to transfer data efficiently over networks in a minimum
amount of time [3].
The following is a brief summary of the various phases in the development of storing and
accessing data over the network.
1.1.1	Direct Attached Storage
First and foremost, Direct Attached Storage (DAS) is a simple and easy storage
mechanism whereby a storage device is directly attached to a host system. An ideal example of
DAS would be an internal hard drive connected directly to the computer, i.e., the client has direct
access to the storage device. However, with the need to maintain storage devices with increased
capacities and the need to share data over the network, this particular technology cannot provide
a shared repository that could be accessed appropriately by clients.
1.1.2	Network Attached Storage
Efforts to achieve these needs led to the development of a storage technology called
Network Attached Storage (NAS), as shown in Figure 1.1. Here data is stored over the network,
and clients/servers access the data in the form of files [4] by connecting to a storage device
identified by a unique Internet Protocol (IP) address assigned to a Network Interface Card (NIC).
The advantage of this type of network over DAS is its centralized administration, which is easier
and more secure. Also, there is no single point of failure, as in DAS; hence, more storage
devices are available over the network. This particular technology has grown widely over the
years and is used in many firms and organizations.
However, IT managers and companies needed an even better system that would provide
faster transmission of data between clients and storage devices, and also include the ability to
provide data back-up without impacting the Local Area Network (LAN) where actual clients
reside. Hence the need for a Storage Area Network (SAN).
1.1.3	Storage Area Network
SAN is a storage network specifically designed to interconnect high-speed storage
devices and servers, as shown in Figure 1.2. This specific network can span a large number of
systems and geographic locations. Unlike in NAS, SAN is connected behind the servers where
storage devices and clients/servers are on same local segment. SAN provides a block-level
interface to clients as opposed to the file-level interface in NAS, which provides a faster
mechanism to transfer data over the network, i.e., reading and writing of data occurs at block
level or the transfer of raw data at physical level.
Figure 1.1. Network attached storage.

Figure 1.2. Storage area network.

An example of how data is transferred in an actual SAN can be seen in Figure 1.2. When a
client on a network requests information from a storage device, the request is received by the server,
which in turn obtains the data from the storage device and sends it to the appropriate client over
the network. Hence, the server has full control over the storage device and transfers the
appropriate data requested by the client in the minimum possible amount of time. This particular
block-level transfer provides faster data transmission than file-level transfers in NAS
systems, since there is less overhead involved.
Also, backups in the SAN network do not affect the rest of the LAN, since the back-up
data passes only over the SAN, providing a LAN-free back-up environment and reducing the
chances of network congestion that occur with frequent backups of data over the LAN.
Hence, to provide faster transmission of data over a SAN, interfaces such as SCSI are used
to transfer blocks of data, unlike other interfaces used for storage.
1.2	Objectives
The major issue involving the Internet Small Computer System Interface (iSCSI) is the
overhead incurred in its data transfer, i.e., degradation incurred due to extra memory copies with
the host and due to an increase in CPU utilization. To address this issue, iSCSI over RDMA was
implemented to eliminate extra memory copies in Transmission Control Protocol/Internet Protocol
(TCP/IP).
This particular implementation allowed RDMA-enabled Network Interface Controllers
(RNICs) to directly place data into Initiator buffers using Zero Copy transfer. Packet reordering,
which is also required in the iSCSI protocol, experiences a few problems when there is a large
amount of data transferred over the network. Since each TCP segment does not have an iSCSI
header and there is a need for an Upper Layer Protocol (ULP) for signaling out-of-order packets,
a large amount of buffering is necessary, which is very costly and hence an obstacle to wide
deployment.
In order to avoid this bottleneck and to better utilize the memory subsystem, memory
bandwidth, and CPU cycles, an RDMA data-transfer procedure was employed. The iWARP
protocol provided RDMA semantics over TCP/IP networks, and iSCSI over RDMA mapped the
iSCSI protocol over iWARP protocol.
This thesis analyzes the performance of SAN storage devices, demonstrating that iSCSI
over RDMA performs better than running iSCSI alone. In addition, the low cost and ease with
which an iSCSI over RDMA can be managed make it superior to iSCSI when considering the
gain in performance. Also this thesis studies the role that implementing iSCSI over RDMA
played in evolving the RDMA-enabled Network Interface Controller (RNIC) and the
functionality of the iWARP protocol suite.
1.3	Organization of Thesis
The organization of the remainder of this thesis is as follows: Chapter 2 provides an
overview of the SCSI protocol and SCSI devices; Chapter 3 gives a summary of the iSCSI
protocol; Chapter 4 provides an overview of the RDMA protocol; Chapter 5 presents a
performance analysis of iSCSI over RDMA and its results; and Chapter 6 provides conclusions
and future work.
CHAPTER 2
INTRODUCTION TO SCSI
The Small Computer System Interface (SCSI), which is primarily used in SAN
environments, is a computer industry standard for connecting computers to peripheral devices
such as hard disk drives, CD-ROM drives, etc., where there is a need to transfer large amounts of
data quickly. SCSI is a local I/O (Input/Output) high-speed bus technology used to interconnect
peripheral devices to computers. The SCSI interface defines a logical-level rather than a devicelevel interface [5] to the disk, which allows the system's view of the disk to be independent of the
physical geometry of the disk device in order to independently develop systems and peripheral
devices that could be used together. This allows companies to integrate technology and costsaving advancements rapidly.
2.1	SCSI Commands
SCSI devices are connected to the computer using an SCSI bus, and each SCSI device
connected is identified using a unique SCSI identification number, which ranges from 0 to 15 [6].
SCSI devices transfer data using instructions called SCSI commands for reading and writing data
in blocks. In turn, these SCSI commands are contained in a Command Descriptor Block (CDB),
specifying the operation requested and number of bytes required to complete that operation.
The major difference between the SCSI protocol and other interfaces used for storage
devices is that SCSI commands address a device as a series of logical blocks rather than in terms
of heads, tracks, and sectors, as done in file-transfer protocols. This particular implementation
allows SCSI to be used with multi-vendor devices.
2.2	SCSI Messages and Handshaking
SCSI messages are used for communicating a number of possible messages between the
Initiator and Target, as described below, that indicate the successful completion of an operation,
requests, and status information. All messages are sent during the message phase. The SCSI
standard uses handshaking procedures so that SCSI devices request and acknowledge data and
control signals for reliable communication.
The SCSI information transfer phase uses this handshaking to transfer data or to control
information between the Initiator and Target, in either direction. As an example, the Initiator
senses a REQ signal, reads the data lines, and then asserts the ACK signal. When the Target
senses the ACK signal, it releases the data lines and negates the REQ signal. The Initiator then
senses that the REQ signal has been negated, and negates the ACK signal. After the Target
senses that the ACK signal has been negated, it can repeat the whole process again, to transfer
another byte of data.
2.3	SCSI Bus Types
2.3.1	SCSI-1
SCSI-1 [7], initially called SCSI, used an eight-bit-wide bus that operated at up to 5
MHz; up to eight devices could be attached to it and daisy-chained together to form a bus.
Any two devices on the SCSI bus could communicate by setting up a connection, exchanging
control information, and transferring data between each other. The device that initiates the
connection is called the Initiator, and the destination of the Initiator connection is called the
Target. In practice, generally the host-system interfaces that initiate communications over the bus
are Initiators, while peripherals such as disks, etc., are generally the Targets of these
communications. Individual devices on an SCSI bus are distinguished from one another through
a unique SCSI identification number, or SCSI ID, and SCSI allows each Target to be subdivided
into logical units, called Logical Unit Numbers (LUNs), up to a maximum of eight logical
units per Target.
2.3.2	SCSI-2
SCSI-2 [8] was developed in order to keep pace with technology changes, operating at
10 MHz as opposed to 5 MHz in SCSI-1, and the bus was widened to sixteen bits, called a Wide
bus. Widening the bus allowed for increased data transfer rates and up to 16 devices. Data
transfer rates on this 16-bit Wide bus are twice those of an 8-bit bus.
These additional features gave way to the development of Fast SCSI, Wide SCSI, and
Fast Wide SCSI. A Fast SCSI bus is 8 bits in width and can support eight devices operating at 10
MHz; a Wide SCSI bus is 16 bits in width, supporting a connection of up to a maximum of 16
devices operating at 5 MHz; and a Fast Wide SCSI bus is 16 bits in width and supports up to 16
devices operating at 10 MHz. Along with the increased number of bus types, SCSI-1 and SCSI-2
remain compatible with each other.
2.3.3	SCSI-3
Further research led to the development of the SCSI-3 standard, which increased the data
transfer cycles to 20 MHz (Ultra) and again up to 40 MHz (Ultra2). Ultra and Ultra2 SCSI are
supported in either 8-bit form or the 16-bit Wide Ultra SCSI and Wide Ultra2 SCSI.
2.3.4	SCSI Bus Signals
The SCSI specification defines 18 SCSI bus signals, with other lines used for grounding.
Nine of these signals are used to initiate and control transactions, and the other nine are used for
data transfer, including a parity bit, as follows:
a. Busy signal to indicate that the SCSI bus is currently in use.
b. Select signal to choose the Target among those available for communication.
c. Control signal and Data signal to indicate whether the command is sending control
information or data.
d. Input/Output signal to indicate the direction of data movement with respect to the Initiator.
e. Message signal to indicate the message phase by the Target.
f. Request signal to indicate the handshake procedure during the connection setup process
with SCSI devices.
g. Acknowledge signal in response to the Request signal sent.
h. Attention signal used by the Initiator to indicate to the Target its readiness to send messages.
i. Reset signal to release the SCSI bus so that other devices can use it.
j. Data signals, along with a parity bit, used during the data transfer operation.
2.3.5	SCSI Phases
Phases involved in SCSI transfer are as follows:
a. Bus Free: Indicates that no SCSI devices are using the bus and that the bus is available
for SCSI devices, as shown in Figure 2.1.
b. Arbitration: Permits an SCSI device to gain control of the SCSI bus. Devices wishing
to use the bus assert the Busy signal and put their SCSI IDs on the bus.
The device with the highest SCSI ID wins the arbitration.
c. Selection: Lets the device that won arbitration use the bus for data transfer.
d. Reselection: Disconnects and reconnects SCSI devices from the bus during lengthy
operations.
e. Command: Allows the Target to request a command from the Initiator.
f. Data: Allows the Target to request a transfer of data to or from the Initiator.
g. Status: Occurs when the Target requests that status information be sent to the Initiator.
h. Message: Allows the Target to request the transfer of a message to the Initiator. Messages
are small blocks of data that carry information or requests between the Initiator and the
Target. Multiple messages can be sent during this phase.
Figure 2.1. SCSI phases (Bus Free, Arbitration, Selection/Reselection, Reset, and Data
Transfer phases).
CHAPTER 3
OVERVIEW OF iSCSI PROTOCOL
3.1	Introduction to iSCSI
Internet SCSI (iSCSI) is an Internet Engineering Task Force (IETF) draft standard
protocol. It is known as a client/server protocol or an Initiator/Target protocol that uses the
TCP/IP connection to exchange SCSI commands. These SCSI commands are encapsulated in
Protocol Data Units (PDUs) called iSCSI PDUs [9]. The Target is responsible for providing
proper data requested by the Initiator and reporting to the Initiator completion of the data-transfer
operation. The iSCSI Protocol [10], as shown in Figure 3.1, maps the SCSI remote procedure
model to the TCP/IP protocol.
Figure 3.1. iSCSI protocol model (the Initiator and Target each stack the Application Layer,
SCSI Protocol, iSCSI Protocol, TCP, IP, Data Link Layer, and Physical Layer).
The iSCSI protocol [11] sits on top of the Transmission Control Protocol. Since
features like congestion control and flow control already exist in TCP, there are very limited
constraints in iSCSI, unlike other protocols that handle these features independently,
making their design and use more complex.
Since iSCSI runs over TCP, it can utilize features such as guaranteed in-order delivery
of data and congestion control, which provide a reliable connection over a variety
of physical media and interconnect topologies. In addition to these features, TCP
offers an end-to-end connection model independent of the underlying network. TCP also
acknowledges all packets that are received and automatically re-sends packets that
are not acknowledged within a certain time-out period, and it handles congestion control.
Hence, the iSCSI design is made simpler by eliminating those features that TCP already provides.
Although TCP has additional features that are not needed for transport of SCSI in
general, the designers of iSCSI felt that the benefits of using an existing, well-tested transport
protocol like TCP would justify its use. As mentioned previously, the iSCSI protocol defines its
packets as iSCSI Protocol Data Units (PDUs) consisting of a header and possible data, where the
data length is specified within the PDU header.
iSCSI runs in different modes, based on the parameters negotiated during the
login phase. This protocol can establish a single TCP connection between the Initiator and Target
or can have multiple paths using multiple TCP connections [12] for a single session, and can
exchange both data and control messages on all connections. The multiple paths are beneficial
for data integrity; when one of the links goes down, there is an alternate path to complete the
task.
In CRC mode, iSCSI performs a data-integrity Cyclic Redundancy Check (CRC) on its header and
PDU. It can also use authentication protocols and encryption, such as Internet
Protocol Security (IPSEC), negotiated at the beginning of the session. These, however,
were not the focus of this research work.
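The digest iSCSI negotiates for this check is CRC-32C (the Castagnoli polynomial). The bitwise sketch below computes that checksum; the function name and structure are illustrative only, not taken from the iSCSI implementations used in this work.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), the digest iSCSI negotiates for
    header and data integrity checking."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the bit-reflected CRC-32C polynomial
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for the CRC-32C polynomial
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

Production implementations use a table-driven or hardware-assisted (SSE4.2) version of the same polynomial; the bitwise loop above is only for clarity.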
Two main techniques are used for accessing remote data, namely file-access protocols and
block-access protocols. In file-access protocols, remote files are made to appear as local files;
in block-access protocols, remote disk blocks are made to appear as local. A block is the smallest
amount of data that can be read from or written to a disk by a device issuing a surface
number, cylinder number, and sector number.
The size of the disk sector is the size of the block. iSCSI, like most IP-based SAN [13]
protocols, uses the block-access protocol that is also used in the SCSI protocol. Unlike Network
File System (NFS), where files may be shared among multiple client machines, block protocols
such as iSCSI support a single client for each volume on the block server.
As mentioned earlier, Targets are accessed as block devices. Hence, only one system can
use an iSCSI device at a time, as opposed to Fibre Channel, where multiple machines may
access one file system. In iSCSI, each machine is allocated a chunk of one large device.
3.2 iSCSI Naming
All iSCSI devices have a unique address scheme namely, iSCSI Qualified Name (iqn)
and the Institute of Electrical and Electronics Engineers Extended Unique Identifier-64 (eui),
commonly known as the IEEE EUI-64 format.
A sample iSCSI address in iqn format [14] is given below:
iqn.1994-05.com.Cisco:Target.storage1.jfs
The string iqn. identifies the iSCSI initiator name as an iSCSI Qualified Name, to distinguish it from
an iSCSI initiator name in the "eui." format.
The notation 1994-05. is a date code in yyyy-mm format followed by a dot. This date
MUST be a date during which the naming authority owned the domain name used in the
iqn-formatted iSCSI initiator name. The naming authority for the Target is represented by giving the
domain name in reverse order (i.e., com.Cisco), and following a colon (:) is the Target.
The notation :Target is an optional string that must comply with the character set and length
boundaries that the owner of the domain name deems appropriate. The optional string must be
preceded by a colon. It may contain product types, serial numbers, host
identifiers, or software keys; i.e., Target.storage1.jfs identifies the iSCSI address for Target
device 1, which has a Journal File System (JFS). IEEE EUI-64 format addressing can be accessed
by following the link given in reference [15].
For example, if the Hewlett-Packard Company owned the domain name "stor.hp.com,"
registered in 2001, the iSCSI qualified names that might be generated by the Hewlett Packard
Company appear in the following example:
Type: “iqn”
Date: “2001-04”
Naming authority: “com.hp.stor”
String defined by "stor.hp.com": “initiator:master-host-ae12345”
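The iqn format above can be split mechanically into its parts. The sketch below is illustrative only: the regular expression is a simplification of the full grammar, and the function name `parse_iqn` is an assumption of this example.

```python
import re

# Illustrative pattern, not the complete iqn grammar:
# "iqn." + yyyy-mm date code + "." + reversed domain name
# + an optional ":"-prefixed string chosen by the naming authority.
IQN_RE = re.compile(
    r"^iqn\.(?P<date>\d{4}-\d{2})\.(?P<authority>[^:]+)(?::(?P<extra>.+))?$"
)

def parse_iqn(name: str) -> dict:
    """Split an iqn-formatted name into date code, naming authority,
    and the optional authority-defined string."""
    m = IQN_RE.match(name)
    if m is None:
        raise ValueError(f"not an iqn-formatted name: {name!r}")
    return m.groupdict()

parts = parse_iqn("iqn.2001-04.com.hp.stor:initiator:master-host-ae12345")
print(parts["date"])       # 2001-04
print(parts["authority"])  # com.hp.stor
```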
The iSCSI address provides a mechanism for multiple Initiators or Targets to share a
common IP network address, e.g., if the above address belongs to one of three Target devices
on a particular Target machine, all these Targets can be reached by an Initiator using each
unique iSCSI name in conjunction with the IP address of the machine. Similarly, multiple Initiators or
Targets can be reached via multiple IP addresses. For example, if the above system had two
NICs with two different IP addresses, then the Target identified by the above iSCSI Target
address could be reached by an Initiator through either NIC1 or NIC2, which is the basic idea
behind providing a redundant path to improve data integrity in a link-failure situation.
An iSCSI node is a single iSCSI Initiator or iSCSI Target, and a network entity, i.e., a
device or gateway accessible from the IP network, can contain more than one iSCSI node.
This could be the local system where the Initiator or Target device resides. As shown in
Figure 3.2, a network entity must have one or more network portals, such as NICs. Each network
portal is used by the iSCSI Initiators or Targets within that network entity to gain access to the
remote IP network. A network portal is a component having a unique TCP/IP network address
used for iSCSI sessions. The network portals of both Initiator and Target machines are identified by
their IP address and TCP port number pair. If the port number is not specified, the default port
number 3260 is used. These concepts are described in Figure 3.2 [16]. As shown, the iSCSI
client and iSCSI server are nodes in the network, and each network node is uniquely identified
by its combination of IP address and port number.
[Figure 3.2 here: an iSCSI client node and an iSCSI server node, each with a network portal, i.e., an NIC, at IP addresses 1.1.1.1 and 1.1.1.2 (mask 255.0.0.0), both on port 3260.]
Figure 3.2. iSCSI network entity.
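The portal's address/port pairing, with 3260 as the default when no port is given, can be sketched as follows; the helper name and the "address:port" string format are assumptions of this illustration (and the sketch handles only dotted-quad addresses, not IPv6).

```python
DEFAULT_ISCSI_PORT = 3260  # default iSCSI port, used when none is specified

def parse_portal(portal: str) -> tuple[str, int]:
    """Split an 'address[:port]' portal string into (address, port),
    falling back to the default iSCSI port when no port is given."""
    if ":" in portal:
        addr, _, port = portal.rpartition(":")
        return addr, int(port)
    return portal, DEFAULT_ISCSI_PORT

print(parse_portal("1.1.1.2"))       # ('1.1.1.2', 3260)
print(parse_portal("1.1.1.2:3261"))  # ('1.1.1.2', 3261)
```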
3.3 iSCSI Session Establishment Procedure
For an iSCSI Initiator to communicate with a Target, it first needs to establish a session
between them. The Network Entity may also contain one or more iSCSI Nodes, represented by
unique iSCSI names. Since Initiators establish iSCSI sessions with Targets, session IDs are
generated to uniquely identify individual conversations between specific iSCSI Nodes within the
corresponding Network Entities. An Initiator logging on to a target, for example, would include
its iSCSI name and an Initiator Session ID (ISID), the combination of which would be unique
within its host Network Entity. A Target, responding to the login request, would generate a
unique Target Session ID (TSID), which likewise, in combination with its iSCSI name, gives
that session a unique identity within the Network Entity in which it resides.
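The session-identity scheme just described, an Initiator-supplied ISID paired with a Target-generated TSID, can be modeled with a small sketch. All class and function names here are illustrative assumptions; real Targets choose TSIDs by their own schemes rather than a simple counter.

```python
from dataclasses import dataclass
import itertools

@dataclass(frozen=True)
class SessionId:
    """A session is identified by the Initiator-supplied ISID and the
    Target-generated TSID, each unique within its iSCSI name."""
    initiator_name: str
    isid: int
    target_name: str
    tsid: int

_tsid_counter = itertools.count(1)

def target_login(initiator_name: str, isid: int, target_name: str) -> SessionId:
    # The Target responds to a login request by generating a TSID that is
    # unique within its own Network Entity (a plain counter here).
    return SessionId(initiator_name, isid, target_name, next(_tsid_counter))

s1 = target_login("iqn.1994-05.com.cisco:init1", 0x1, "iqn.1994-05.com.cisco:tgt")
s2 = target_login("iqn.1994-05.com.cisco:init1", 0x2, "iqn.1994-05.com.cisco:tgt")
assert s1 != s2  # distinct ISIDs (and TSIDs) give distinct sessions
```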
A session [17] has two phases: the login phase and the full-featured phase. Within a session, one or
multiple TCP connections are established. The data and command exchange occurs within the
context of the session.
The login phase is started when the Initiator establishes a TCP connection to the Target
via a specified Target port. During this time, the Initiator and Target may authenticate each other
and parameters are negotiated in the form of login request and login response. Once the login
phase is completed, the session enters the full-featured phase where the Initiator discovers the
available Targets.
Only after that phase is completed can SCSI I/O begin. In the command phase, a
SCSI command in the form of a Command Descriptor Block (CDB) is encapsulated in an iSCSI
command PDU. The CDB describes the operation and associated parameters, e.g., the logical
block address (LBA) and the length of the requested data in the form of a “MaxBurstLength”
parameter. During the full-featured phase, the data PDUs are
transmitted from an Initiator to a Target. The iSCSI session establishment and phases are shown
in Figure 3.3.
Figure 3.3. iSCSI session establishment and login phase: a TCP three-way handshake (SYN, SYN+ACK, ACK) establishes the TCP connection; an iSCSI login request and response then establish the iSCSI session; finally, SCSI commands and responses flow between the Initiator and the Target.
iSCSI Initiators and Targets come either as hardware components or as software
installations. The hardware component is an iSCSI Host Bus Adapter (HBA), a variant
of a normal Ethernet card with an SCSI Application Specific Integrated Circuit (ASIC) onboard
to off-load all the work from the system CPU.
Software installation is done by a software driver that combines an NIC driver and an
iSCSI driver to handle all iSCSI and other requests. Since IT managers are expected to
choose the least expensive and most productive solution, software Initiators and Targets,
which carry no added cost, were chosen.
3.4 iSCSI Implementation
Several iSCSI protocol implementations are available commercially and as open-source
free versions. Under the Linux platform, this thesis work used the iSCSI Target implementation
from iSCSI Enterprise Target (IET), running the Fedora 2.6.16-1.2096_Fc5smp kernel, and the built-in
Initiator software package available in the Fedora Core 5 release. In a Microsoft operating
system environment, this research work employed Microsoft iSCSI Initiator 2.01 running in a
Windows 2003 Server environment.
3.5 iSCSI Read and Write
As previously discussed, in iSCSI all data and control traffic flows through the existing
TCP/IP infrastructure. Hence, to perform a SCSI read command, the Initiator first
creates an iSCSI PDU and sends it to the Target; once it reaches the Target machine, the Target
sends back the data requested. Since the size of an iSCSI PDU is limited, a read might need
more than one PDU for a particular request. Meanwhile, flow control is handled by TCP to
ensure that no buffer overflow occurs at the receiving end and that all data sent is acknowledged
by the receiver.
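Because a read may span several PDUs, the Target must split the requested data at the negotiated segment-length boundary. A minimal sketch, assuming a hypothetical 8 KB limit (the function name and default are illustrative, not negotiated values from this thesis's testbed):

```python
def split_into_pdus(data: bytes, max_data_segment: int = 8192) -> list[tuple[int, bytes]]:
    """Split a read result into (buffer_offset, payload) pieces, one per
    data PDU, each no larger than the assumed segment-length limit."""
    return [
        (offset, data[offset : offset + max_data_segment])
        for offset in range(0, len(data), max_data_segment)
    ]

pdus = split_into_pdus(b"x" * 20000)
print(len(pdus))                 # 3
print([off for off, _ in pdus])  # [0, 8192, 16384]
```

The per-PDU offset lets the receiver place each payload at the right position in the command's buffer even if PDUs are processed out of order.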
Figure 3.4 indicates the normal iSCSI write procedure [18] without involvement of
RDMA-enabled network adapters. The iSCSI protocol involves an exchange of commands and
responses between Initiator and Target using iSCSI PDUs, which encapsulate SCSI CDB
commands, status, and data.
In the SCSI write example, shown in Figure 3.4, the Ready to Transmit (R2T) PDUs
perform the role of SCSI flow control between Target and Initiator. These PDUs are issued by
the Target device as buffers become available to receive more data. At the completion of the
write, the Target issues the status and sense, indicating a successful transaction. The Target
controls the flow of data by indicating the amount of data it is able to receive via a transfer
length field present in the iSCSI header. If the Target does not respond, or responds with
corrupted or incomplete data, the Initiator may close the connection and establish a new one
for recovery.
Figure 3.4. Normal SCSI write: the Initiator issues a SCSI write; the Target answers with Ready to Transmit PDUs as buffers become available; the Initiator sends the SCSI data; the Target returns the SCSI response.
In the SCSI read example, shown in Figure 3.5, the Ready to Transmit (R2T) PDUs
perform the role of SCSI flow control between Target and Initiator. These PDUs are issued by
the Initiator device as buffers become available to receive more data. At the completion of the
read, the Initiator issues status and sense, indicating a successful transaction. The Initiator
controls the flow of data by indicating the amount of data it is able to receive via a transfer
length field present in the iSCSI header field. The status of the SCSI data transport during reads
and writes is monitored through status and data sequence numbers via TCP, which encapsulates
the iSCSI PDU.
Figure 3.5. Normal SCSI read: the Initiator issues a SCSI read; data is transferred from the Target through request-to-transfer exchanges; the Target returns the SCSI status.
All features of the iSCSI protocol discussed above make it easy to deploy SANs
using an iSCSI software package that can be included with operating systems. Moreover, the reduced
latency due to the block-level transfer of data and the lower-cost infrastructure make
iSCSI an attractive SAN solution. The only downside comes from utilization of the CPU for
processing the TCP stack for every I/O occurring in the Initiator or Target device, which requires
accessing the memory bus a minimum of three times per data packet. This remains a barrier as
data transfer rates increase over TCP/IP, and it can be overcome by using technologies such
as RDMA.
CHAPTER 4
INTRODUCTION TO RDMA PROTOCOL
4.1 Introduction to RDMA
With the advancement in computing and storage technologies, information technology
managers are forced to build faster data center networks, which requires more central processing
unit [19] power to process communication. Adding to this problem, TCP/IP data consumes
significant memory bus bandwidth, because this data typically crosses the memory bus three
times, as will be explained below. This overhead keeps the CPU busy, preventing it from doing
other useful work and increasing latency.
Also, since most networks use a wide variety of links to interconnect devices, the use of
multiple system and peripheral [20] bus interconnects decreases compatibility, interoperability,
and management efficiency, while adding cost for equipment and special training. Hence, to
increase efficiency and lower costs, the data center network infrastructure needs to be modified
into a unified, scalable, high-speed framework.
This concept of a modified network infrastructure is relatively new, requiring high
bandwidth and low latency that can move data efficiently over the network [21]. With the use of
more efficient communication protocols, processors can be less burdened, giving them a chance
to work in a more useful way.
Remote Direct Memory Access (RDMA) technology [22] is an emerging technology that
promises to accomplish these goals. Today, systems, applications, and storage communicate
over existing infrastructure such as TCP/IP, which means packets are processed by the main
system CPU. This creates a problem by consuming CPU resources to process incoming and
outgoing network traffic. In order to use CPU resources efficiently, RDMA is employed.
Research shows that even with an increase in CPU power, the processing unit is still
constrained by system memory bandwidth; for efficient use of CPU processor cycles, system
memory accesses must be reduced. Since memory bandwidth in modern architectures has rapidly
become a scarce resource, saving memory bandwidth is a major benefit.
4.2 RDMA Overview
As is commonly known, TCP and IP are a suite of protocols [23] providing intranet
and Internet access, and every device in the network uses this suite of protocols to communicate
with the others. Information is transmitted in the form of packets so that multi-vendor
communication is possible.
Today, TCP/IP stacks are implemented in operating system software, and because of this
implementation, all packets transmitted or received are processed by the system’s CPU [24].
As a result, protocol processing of incoming and outgoing network traffic consumes CPU cycles,
which can be more effectively used for other useful purposes. The amount of time consumed by
the CPU during traffic processing will lead to a reduction in the number of processes handled by
the CPU, increasing the overall delay in processing the packets.
With this already-burdened CPU, a finite amount of memory bus bandwidth in the system
causes even more delay in transmitting or receiving packets over the network. Both the TCP/IP
protocol overhead and limited memory bandwidth available hinder the deployment of faster
Ethernet networks. The use of RDMA over TCP technology [25] can provide a chance to
overcome these barriers by providing a chance to effectively use faster Ethernet networks.
RDMA technology was developed to move data from the memory of one computer
directly into the memory of another computer with minimal involvement from their CPUs. This
Zero Copy or Direct Data Placement (DDP) [26] capability provides the most efficient network
communication possible between systems.
4.3 RDMA Layering Overview
The RDMA protocol [27] suite eliminates data copy operations and therefore reduces
latencies by allowing an application to read and write data directly to the memory of a remote
system with minimal demands on memory bus bandwidth and CPU processing.
4.3.1 RDMA
RDMA over TCP [28] technology involves a set of layers, as shown in Figure 4.1,
performing different operations. The RDMA protocol uses RDMA write and RDMA read
commands to transfer data between Initiator and Target devices and passes them to the layer
that resides below it, namely Direct Data Placement (DDP). The DDP protocol in turn segments the
Figure 4.1. RDMA layering overview (Application, RDMA, DDP, MPA, TCP, IP, Data Link, Physical).
data obtained from the upper RDMA layer and also reassembles these segments into a DDP
message. The Marker-Based Protocol Data Unit Aligned (MPA) protocol adds a backward
marker at a fixed interval to the DDP segments and appends a length and a Cyclic Redundancy
Check (CRC) to each MPA segment.
The TCP transmits or receives traffic in terms of bytes, while DDP uses fixed protocol
data units to transmit or receive data. Hence, to enable RDMA, DDP needs a framing mechanism
for the TCP transport protocol which is taken care of by MPA. This facility allows the network
interface to place the data directly in the receiver's application buffer based on control
information carried in the header. Hence, an efficient use of DDP comes only with additional
usage of the MPA layer, allowing the system to avoid memory copy overhead and reduce the
memory requirement for handling out-of-order and dropped packets.
4.3.2 Direct Data Placement
DDP allows Upper Layer Protocol (ULP) data [29], such as application messages or disk
I/O, generated whenever data is read or written on Initiator or Target devices and carried
within DDP segments, to be placed directly into memory at its final destination without further
processing by the ULP. A DDP segment consists of a DDP header and ULP payload; the header
provides control and placement fields that define the final destination for the payload, which is
the actual data being transferred.
A DDP message is a ULP-defined unit of data interchange that is subdivided into one or
more DDP segments. This segmentation may occur for a variety of reasons, including respecting
the maximum segment size of TCP. A sequence of DDP messages is called a DDP stream.
DDP uses two data transfer models:
Tagged Buffer Model – Used to transfer Tagged buffers between the two members of the
transfer, namely the local peer and the remote peer. Tagged buffers are explicitly
advertised to the remote peer through exchange of a steering tag (STag), Tagged offset,
and length. An STag is simply the identifier of a Tagged buffer on a node, and the
Tagged offset identifies the base address of the buffer. They are typically used for large
data transfers, such as large data structures and disk I/O.
UnTagged Buffer Model – Used to transfer UnTagged buffers from the local peer to the
remote peer. UnTagged buffers are not explicitly advertised to the remote peer. They are
typically used for small control messages, such as operation and I/O status messages.
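The tagged model above can be pictured as a registry that hands out STags for advertised buffers and steers payloads into them by (STag, tagged offset, length). This is a toy model for illustration, not an RNIC API; all names are assumptions of this sketch.

```python
class TaggedBufferRegistry:
    """Toy model of DDP tagged buffers: an STag names a registered
    (advertised) buffer, and (STag, tagged offset) steers incoming
    payload into it."""

    def __init__(self):
        self._next_stag = 1
        self._buffers = {}

    def register(self, length: int) -> int:
        """Register a buffer for remote access and return its STag."""
        stag = self._next_stag
        self._next_stag += 1
        self._buffers[stag] = bytearray(length)
        return stag

    def place(self, stag: int, tagged_offset: int, payload: bytes) -> None:
        """Place payload directly at its final destination -- no
        intermediate copy, which is the point of DDP."""
        buf = self._buffers[stag]
        buf[tagged_offset : tagged_offset + len(payload)] = payload

    def read(self, stag: int) -> bytes:
        return bytes(self._buffers[stag])

reg = TaggedBufferRegistry()
stag = reg.register(16)   # advertise a 16-byte buffer
reg.place(stag, 4, b"disk")
print(reg.read(stag))  # b'\x00\x00\x00\x00disk\x00\x00\x00\x00\x00\x00\x00\x00'
```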
4.3.3 MPA
The MPA creates a Framed PDU (FPDU) by pre-pending a header, inserting markers,
and appending a CRC after the DDP segment. The MPA delivers the FPDU to the TCP. The
MPA-aware TCP sender puts the FPDUs into the TCP stream and segments the TCP stream so
that each TCP segment contains a single FPDU. At the receiver, the MPA locates and assembles
complete FPDUs within the stream, verifies their integrity, and removes information that is no
longer necessary. The MPA then provides the complete DDP segment to DDP.
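The framing MPA performs can be outlined as below. This sketch omits the marker insertion entirely and substitutes zlib's ordinary CRC-32 for the CRC that MPA actually specifies, so it shows only the length-header/payload/CRC shape of an FPDU; the function names are assumptions of this example.

```python
import struct
import zlib

def frame_fpdu(ddp_segment: bytes) -> bytes:
    """Build a simplified FPDU: 2-byte length header + DDP segment +
    4-byte CRC. (Real MPA also inserts markers at fixed intervals and
    uses a different CRC; both are simplified away here.)"""
    header = struct.pack(">H", len(ddp_segment))
    crc = struct.pack(">I", zlib.crc32(header + ddp_segment))
    return header + ddp_segment + crc

def deframe_fpdu(fpdu: bytes) -> bytes:
    """Locate the DDP segment via the length header, verify the trailing
    CRC, and return the segment for delivery to DDP."""
    (length,) = struct.unpack(">H", fpdu[:2])
    segment, crc = fpdu[2 : 2 + length], fpdu[2 + length :]
    if struct.unpack(">I", crc)[0] != zlib.crc32(fpdu[: 2 + length]):
        raise ValueError("FPDU CRC mismatch")
    return segment

assert deframe_fpdu(frame_fpdu(b"ddp segment payload")) == b"ddp segment payload"
```

The length header is what lets an MPA-aware receiver find FPDU boundaries in the TCP byte stream and place payload without buffering out-of-order data.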
4.3.4 TCP and IP
TCP guarantees the transfer of segments received from the upper-layer
protocols. It also takes care of flow control, congestion control, error correction, etc.
Storage devices make use of a TCP connection to build iSCSI sessions to transfer data between
Initiator and Target devices. Packets transmitted from Initiator to Target, or vice versa, are
acknowledged, informing the sender that the packet was received. IP does the actual routing needed
to reach the destination over the network. Each device in the storage network identifies itself
uniquely with the assigned IP address and iqn name identifier so that packets are routed to their
proper destinations.
4.4 RDMA Data Transfer Operations
The RDMA protocol provides seven different data transfer operations. RDMA information is
included in fields within the DDP header. With an RDMA Network Interface Controller
(RNIC), the CPUs of both source and destination devices are not involved in the data transfer
operations; the RNIC is responsible for generating outgoing and processing incoming RDMA
packets. In addition, the data is placed directly where the application advertised that it wanted it
and is pulled from where the application indicated it was located.
4.4.1 RDMA Send
RDMA uses four variations of the send operation:
Send – Transfers data from the data source (the peer sending the data payload) into a
buffer that has not been explicitly advertised by the data Target (the peer receiving the
data payload). The send message uses the DDP UnTagged buffer model to transfer the
ULP message into the data Target’s UnTagged buffer. Send operations are typically used
to transfer small amounts of control data where the overhead of creating a STag for DDP
does not justify the small amount of memory bandwidth consumed by the data copy.
Send with Invalidate Operation – Includes all functionality of the send message, plus the
capability to invalidate a previously advertised STag. After the message has been placed
and delivered at the data Target, the data Target’s buffer, identified by the STag included
in the message, can no longer be accessed remotely until the data Target’s ULP re-enables
access and advertises the buffer again.
Send with Solicited Event – Similar to the send message except that when the Send with
solicited event message has been placed and delivered, an event (for example, an
interrupt) may be generated at the recipient end, if the recipient is configured to generate
such an event. This allows the recipient to control the amount of interrupt overhead it will
encounter.
Send with Solicited Event and Invalidate – Combines the functionality of the Send with
an invalidate message and the Send with a solicited event message.
4.4.2 RDMA Write Operation
The RDMA write operation is used to transfer data from the data source to a previously
advertised buffer at the data Target. The ULP at the data Target enables the data Target Tagged
buffer for access and advertises the buffer’s size (length), location (Tagged offset), and STag to
the data source through a ULP-specific mechanism such as a prior-send message. The ULP at the
data source initiates the RDMA write operation. The RDMA write message uses the DDP
Tagged buffer model to transfer the ULP message into the data Target’s Tagged buffer. The
STag associated with the Tagged buffer remains valid until the ULP at the data Target
invalidates it or until the ULP at the data source invalidates it through a Send with Invalidate
Operation or a Send with Solicited Event and invalidate operation.
4.4.3 RDMA Read
The RDMA read operation transfers data to a Tagged buffer at the data Target from a
Tagged buffer at the data source. The ULP at the data source enables the data source Tagged
buffer for access and advertises the buffer’s size (length), location (Tagged offset), and STag [30]
to the data Target through a ULP-specific mechanism such as a prior send message. The ULP at
the data Target enables the data Target Tagged buffer for access and initiates the RDMA read
operation. The RDMA read operation consists of a single RDMA read request message and a
single RDMA read response message, which may be segmented into multiple DDP segments.
The RDMA read request message uses the DDP UnTagged buffer model to deliver to the
data source’s RDMA read request queue the STag, starting Tagged offset, and length for both the
data source and the data Target Tagged buffers. When the data source receives this message, it
then processes it and generates a read response message, which will transfer the data. The
RDMA read response message uses the DDP Tagged buffer model to deliver the data source’s
Tagged buffer to the data Target, without any involvement from the ULP at the data source.
The data source STag associated with the Tagged buffer remains valid until the ULP at
the data source invalidates it or until the ULP at the data Target invalidates it through a Send
with invalidate or Send with Solicited Event and Invalidate operation. The data Target STag
associated with the Tagged buffer remains valid until the ULP at the data Target invalidates it.
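The read exchange above can be modeled as an untagged request naming both STags, answered by a tagged response that writes straight into the data Target's buffer. A toy sketch with assumed buffer contents and names (not an RNIC interface):

```python
# Toy model of the RDMA read exchange: each peer's tagged buffers keyed
# by STag. The values below are assumed contents for illustration.
source_buffers = {0x10: bytearray(b"remote disk block contents")}  # data source
target_buffers = {0x20: bytearray(26)}                             # data Target

def rdma_read(src_stag: int, src_offset: int, length: int,
              dst_stag: int, dst_offset: int) -> None:
    """The read request (untagged) names both STags and offsets; the
    read response (tagged) places the payload straight into the data
    Target's buffer with no ULP involvement at the data source."""
    payload = source_buffers[src_stag][src_offset : src_offset + length]
    target_buffers[dst_stag][dst_offset : dst_offset + length] = payload

rdma_read(src_stag=0x10, src_offset=7, length=10, dst_stag=0x20, dst_offset=0)
print(bytes(target_buffers[0x20][:10]))  # b'disk block'
```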
4.4.4 RDMA Terminate
A terminate operation is included to tear down the connection when an error is
encountered. The Initiator or Target device can disconnect using this signal in case either the
Initiator or the Target device is not responding to the requests made to them.
4.5 RDMA Data Transfer Example
In a typical network data transfer, each packet accesses the memory bus three times
before it is stored. The first instance occurs when the receiving device writes the data to the device
driver buffer. From the device driver buffer, the data is copied to the temporary buffer and finally
copied to the application memory. This appears simple in a typical network scenario, but as the
network grows and packets become bigger, the problem worsens and will have adverse effects,
possibly not meeting the data transfer requirements.
With an increase in data rates, the memory bus is accessed more frequently. Below
is a brief explanation of how a packet transfer occurs using RDMA. RDMA enables
applications to issue commands directly to the NIC without having to execute a kernel call. This
is known as “Kernel Bypassing.” The RDMA-enabled NIC in the local system then transfers
data from the local buffer to the RDMA-enabled NIC in the remote system. From there, the
remote NIC places the data directly into its local memory without the intervention of its system
CPU. Once the data is placed, the NICs inform their system CPUs of the completion of the
operation.
An application issues a read or write request, and for every request of this kind, two
types of transfer phases take place: data and control operations. In RDMA, the transfer memory
must be registered before the application requests a data transfer; the registration is identified by
a special tag called a Steering Tag (STag), carried in the RDMA header field. Hence, during the
control operation, since memory is already registered, the transferred data can be stored in the
memory regions advertised by the receiver or destination. These advantages show that RNICs
provide better performance by reducing the percentage of CPU utilization.
4.6 AMMASSO RNIC
The AMMASSO Gigabit Ethernet Network Interface Card [31] used in this thesis
provides an implementation of the RDMA over TCP/IP-enabled NIC, which in turn provides low
latency and high bandwidth on a Gigabit Ethernet network. This card supports the legacy sockets
interface and the Cluster Core Interface Language (CCIL) interface. The CCIL [32] interface
uses the RDMA layer and off-loaded TCP/IP on the NIC to transmit the data. On the other hand,
the sockets interface still sends and receives the data through the traditional TCP/IP implemented
in the operating system kernel. The CCIL interface enables Zero-Copy and Kernel-Bypass data
transmission.
4.7 Advantage of Using RDMA
This method of transferring data directly to or from application memory and data buffers
to a remote memory over a network is known as Zero Copy networking. Since no CPU or
cache overhead is involved, Zero Copy networking is particularly useful in applications where
low CPU utilization and low latency [33] in the network are desired; the RDMA engine notifies
the CPU only upon completion of a transfer [34]. The addition of RDMA capability to Ethernet will simplify server deployment and
improve infrastructure flexibility. As a result [35], the data center infrastructure will become less
complex, easier to manage, and more adaptive. Hence, this immense potential of RDMA to
improve communication performance while being extremely conservative on resource
requirements has made RDMA the most popular of the current and future generation network
infrastructures.
CHAPTER 5
PERFORMANCE ANALYSIS OF iSCSI OVER RDMA
Typically, memory bandwidth is consumed by buffer copying, because the data received
from the network adapter did not arrive in the buffer that the application required. This buffer
copy occurs on both the transmit and the receipt of a packet. Apart from RDMA, there is no
known general-purpose algorithm for solving the receive-copy problem for TCP/IP without the
application being rewritten to accept buffers from the protocol stack instead of supplying them,
and this copy is the largest source of CPU utilization. Also, the interrupts generated
when the adapter has finished transmitting or receiving posted data keep the CPU busy. Hence,
CPU offload can also solve latency bottlenecks, which can slow down the application and
affect distributed applications running over the network. Latency in this case is a
significant problem for client-to-server database communications, which have many outstanding
transactions; therefore, the end-to-end latency of a single transfer is very critical.
In a typical TCP implementation, data arriving on a TCP connection is first copied into
temporary buffers; then the TCP driver checks connection identification information, such as port
numbers and source and destination addresses, to determine the intended receiver of the data.
The data is then copied into the receiver’s buffers. For SCSI data, there might be many pending
SCSI commands at any given instant, and the received data must typically be copied into the
specific buffer that was provided by the SCSI layer for the particular command. This entire
procedure might require the receiving host to copy the data a number of times before the data
reaches the final destination buffer. Such copies require a significant amount of CPU and
memory bus usage that would adversely affect system performance. Therefore, it is most
desirable to place the data in its final destination with a minimum number of copies.
The I/O bottleneck has always centered on the memory and I/O subsystems; the formula for
calculating I/O throughput from the memory and I/O bus frequencies is given in Equation 5.1.
Since I/O performance is particularly hindered by high CPU utilization, this research aims to
enhance the overall throughput of the system in terms of percentage CPU utilized and IOPS.
The RDMA protocol was slow to progress toward standardization, so the iSCSI protocol
initially could not make use of such a mechanism to increase the overall throughput of the
system. Once the RDMA proposal was standardized, capabilities such as Zero Copy transfers
became available and have since been widely used.
The information provided in an iSCSI data PDU header includes the following: a transfer
tag to identify the SCSI command and its corresponding buffer, a byte offset relative to the
beginning of the corresponding buffer, and a data length parameter indicating the number of
bytes being transferred in the current data packet. As described previously in section 3.5, we see
that for each SCSI read or write command, a large number of flow control packets need to be
sent from one end to the other and these packets need to be copied into temporary buffers and
then to device driver buffers multiple times before storing the data at the destination.
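The three data-PDU fields listed above (transfer tag, buffer offset, data length) can be illustrated with a toy encoding. This fixed 12-byte big-endian layout is a simplification for illustration, not the actual iSCSI PDU wire format:

```python
import struct

# Simplified data-PDU encoding: transfer tag, buffer offset, and data
# length, each as a 32-bit big-endian integer, followed by the payload.
HEADER = struct.Struct(">III")

def build_data_pdu(tag, offset, payload):
    """Prefix the payload with the three header fields."""
    return HEADER.pack(tag, offset, len(payload)) + payload

def parse_data_pdu(pdu):
    """Recover (tag, offset, payload) from an encoded PDU."""
    tag, offset, length = HEADER.unpack_from(pdu, 0)
    return tag, offset, pdu[HEADER.size:HEADER.size + length]

pdu = build_data_pdu(tag=0x22, offset=4096, payload=b"block")
print(parse_data_pdu(pdu))  # (34, 4096, b'block')
```

The tag tells the receiver which pending SCSI command (and hence which buffer) the data belongs to, and the offset places the payload within that buffer.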
Hence, each packet generates multiple interrupts at the destination. In a network
carrying an enormous amount of data traffic, multiple copy operations are needed, and the
system suffers from interrupt overhead, thus lowering system performance. Also, having
multiple small packets means that the iSCSI protocol layer and the underlying communication
layer must interact with each other multiple times to process these packets.
These interactions can also increase communication overhead. RNICs avoid all of these
unnecessary procedures: with the help of RDMA semantics, they enable the data to be placed at
the destination without multiple copy operations. To prevent this inherent performance
degradation, RDMA was used together with the iSCSI implementation. In iSCSI over RDMA, the
iSCSI nodes use RDMA operations, and the Initiator advertises its buffer using an identifier, the
STag, to the Target when the SCSI command for the data transfer is issued by the iSCSI layer.
Hence, for a SCSI read command, as shown in Figure 5.1, the STag
identifies the tagged buffer into which data from the target will be directly placed by the initiator
RNIC using the RDMA write operation. For a SCSI write command as shown in Figure 5.2, the
STag identifies the tagged buffer on the Initiator from which data is directly transferred by the
Target RNIC using the RDMA read operation.
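The STag exchange just described can be modeled in a few lines. The `RNIC` class and its method names are purely illustrative, not a real RDMA verbs API:

```python
# Toy model of the iSCSI/RDMA exchange: an STag names a pre-registered
# buffer, and the peer then places data directly with RDMA Write (for a
# SCSI read) or pulls it with RDMA Read (for a SCSI write).

class RNIC:
    def __init__(self):
        self.registered = {}   # STag -> buffer
        self._next_stag = 1

    def register(self, buf):
        """Register a memory region and return its STag."""
        stag = self._next_stag
        self._next_stag += 1
        self.registered[stag] = buf
        return stag

def scsi_read(initiator, target_data):
    buf = bytearray(len(target_data))
    stag = initiator.register(buf)       # STag advertised with the command
    # Target side: RDMA Write straight into the Initiator's tagged buffer.
    initiator.registered[stag][:] = target_data
    return bytes(buf)

def scsi_write(initiator, host_data):
    stag = initiator.register(bytes(host_data))  # advertise source buffer
    # Target side: RDMA Read pulls directly from the Initiator's buffer.
    return initiator.registered[stag]

rnic = RNIC()
print(scsi_read(rnic, b"disk-block"))   # b'disk-block'
print(scsi_write(rnic, b"new-block"))   # b'new-block'
```

Because the buffers are registered before the transfer starts, no intermediate staging buffers or flow-control packets appear in either direction.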
As shown in Figure 5.1, instead of using the normal SCSI read procedure described in
section 3.5, RDMA operations are used, and these provide several advantages. The RDMA
operations transfer data directly from the Target to buffers at the Initiator, since the memory
regions are already registered before the data transfer starts. Hence, software flow control is not
needed, eliminating the multiple copy operations required in each data transfer; the interrupt
frequency is also reduced, thereby increasing overall system performance.

Figure 5.1. SCSI read with RDMA.

Figure 5.2. SCSI write with RDMA.
As shown in Figure 5.2, instead of using the normal SCSI write operations from section 3.5, the
RDMA operations, as described in the RDMA section, transfer data directly from the buffers at
the Initiator to the Target, since the memory regions are already registered before the data
transfer begins. Hence, software flow control and multiple copy operations are no longer needed,
thus improving overall network performance by reducing the percentage of CPU utilized.
If the buffers are contiguous, only one RDMA operation is needed for the entire SCSI
read or write request. As a result, the software overhead of processing multiple packets is also
eliminated.
5.1 Test Objectives
Understanding the sources of disk I/Os made for every data transfer requested by the
Initiator or the Target helps to plan and configure nodes in SAN networks in a way that
maximizes performance of the overall system. The objectives of this performance testing are to
show that iSCSI in conjunction with RDMA improves throughput in terms of IOPS, which
defines the access rate of storage devices at different transaction or block sizes, compared to
running iSCSI alone. The observations made in this research work help in understanding how
I/O performance changes with variation in block size, and in understanding I/O requirements in
order to optimize storage devices. This research work focuses primarily on the I/O behavior and
the percentage of CPU utilized, which are recorded in the log file generated by IO Meter during
data transfer between the Initiator and Target devices, both in iSCSI and in iSCSI over RDMA.
5.2 Testing Procedure
The following steps describe the general test procedure implemented in this research
work:
a. Start all services, such as IO Meter and the iSCSI software, on the Initiator, Target, and
Manager systems.
b. Verify that cables used are not faulty, and check for proper connectivity between nodes
using Ping utility.
c. Configure and enable the Remote Shell (RSH) program on all nodes in the network. This
allows execution of a single command on a remote host, without logging in to that host,
for performing RDMA operations.
d. Connect each of the participating Initiators to its iSCSI Targets.
e. Login to the iSCSI Target using iSCSI by mounting all the storage devices available on
the Target device from the Initiator.
f. Once logged into the iSCSI Target device, tune the IO read sizes provided by the IO
Meter.
g. Start the IO Meter on the Manager and the Dynamo on the Initiators.
h. Create the IO Meter .icf configuration files, and set the access specifications (request
sizes) and the test setup parameters.
i. Start the test.
j. Name the output CSV file.
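Steps (b) and (c) above, the connectivity check and the remote-shell invocation, could be scripted along these lines. The node names and the example remote command are placeholders, not the actual addresses or commands used in this test bed:

```python
import subprocess

# Hypothetical node names for the test bed; substitute the real addresses.
NODES = ["initiator-1", "target-1", "manager-1"]

def check_node(host, count=2):
    """Step (b): return True if `host` answers `count` pings."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def rsh_command(host, command):
    """Step (c): build the remote-shell invocation used for RDMA setup."""
    return ["rsh", host, command]

# Example: the argument list that would run a command on the Target
# without logging in (the command itself is a placeholder).
print(rsh_command("target-1", "service rdma start"))
```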
5.3 Theoretical Calculations
The generic formula used to calculate I/O throughput from memory and I/O bus frequency [36]
is

I/O Throughput = Memory Bandwidth / ((Memory Clock / I/O Bus Clock) + 2)    (5.1)

With the implementation of RDMA, the two additional memory cycles needed to transfer data
are no longer required; therefore, the new I/O throughput is

I/O Throughput = Memory Bandwidth / (Memory Clock / I/O Bus Clock)    (5.2)

Hence, the difference between the I/O throughputs obtained in the two cases, expressed as a
percentage, is

Percentage = (New Throughput – Old Throughput) / New Throughput    (5.3)

An example system, such as the one used in this test scenario, might have an I/O bus speed of
80 MHz and a memory clock speed of 400 MHz. When these values are substituted into the
above equations, the performance gain obtained from the Zero Copy implementation ranges
from 20 to 30 percent.
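This example can be verified numerically. Memory bandwidth is left symbolic (set to 1.0), since it cancels in the ratio of Equation 5.3:

```python
# Equations 5.1-5.3 with the example clock speeds from the text:
# memory clock 400 MHz, I/O bus clock 80 MHz.

def throughput_with_copies(mem_bw, mem_clock, io_clock):
    """Equation 5.1: two extra memory cycles per transfer."""
    return mem_bw / ((mem_clock / io_clock) + 2)

def throughput_zero_copy(mem_bw, mem_clock, io_clock):
    """Equation 5.2: RDMA removes the two extra cycles."""
    return mem_bw / (mem_clock / io_clock)

old = throughput_with_copies(1.0, 400, 80)  # bandwidth / 7
new = throughput_zero_copy(1.0, 400, 80)    # bandwidth / 5
gain = (new - old) / new                    # Equation 5.3
print(f"{gain:.1%}")  # 28.6%, within the 20 to 30 percent range quoted
```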
5.4 Analysis of Test Results and Parameters Used
An experiment was executed using the test bed setup shown in Figure 5.3 for the following two
scenarios: iSCSI running without RDMA and iSCSI running over RDMA.
Figure 5.3. Test bed setup (Initiators connected through an IP cloud to the Target and its storage
device).
Performance measurements were made using the industry-standard I/O performance analysis and
testing tool IO (Input/Output) Meter, which measures performance in terms of Input/Output
operations Per Second (IOPS) and percentage CPU utilization. Tests were carefully performed
using IO Meter [37], and the results predict the performance of complex applications running
over production networks. The tests measured throughput performance, in terms of IOPS, using
RNICs provided by Ammasso, Inc., with various I/O request sizes provided by IO Meter.
IO Meter runs as a “Manager” on a Windows system and, on the managed client system
(the iSCSI Initiator), generates the workload, i.e., read requests as specified by the user. The
version of IO Meter used here was 2003.12.16.win32, run on the Manager system. Both the IO
Meter Manager and the managed systems were workstations running Windows 2003 Server or
Fedora Core 5 (kernel 2.6 SMP) with dual 2.4-gigahertz (GHz) Intel Pentium Xeon processors,
1 gigabyte (GB) of PC2100 RAM, an RNIC copper gigabit NIC, and a single 80 GB SCSI hard
disk drive. Microsoft’s iSCSI Initiator version 2.0 was used for all test configurations. The iSCSI
Target host, with dual 2.4-GHz Intel Pentium Xeon processors, 1 GB of PC2100 RAM, an RNIC
copper gigabit NIC, and a single 80 GB SCSI hard disk drive, ran Fedora Core 5 (kernel 2.6
SMP).
For the performance tests, the IO Meter system, the iSCSI Initiators, and the Targets were
connected to a 6-port Ethernet switch using separate 100 Mbps copper connections. All systems
were configured on the same subnet, and all traffic used standard 1,500-byte Ethernet frames. IO
Meter [38] was configured to generate a load on the iSCSI Initiators consisting of 100 percent
sequential reads.
During each test, the IO Meter Initiators, or clients, made sequential read requests using
different block sizes. For example, with the 512-byte request size, the worker process made
sequential read requests for 512 bytes of data. The test was conducted using the following
request sizes: 512 B, 1 KB, 2 KB, 4 KB, 16 KB, 32 KB, and 64 KB reads.
Results are represented graphically in Figures 5.4 and 5.5 for the test scenario shown in
Figure 5.3.
Figure 5.4. IOPS in iSCSI and iSCSI over RDMA (IOPS, from 0 to 100,000, versus block size
transferred, from 512 B to 64 KB).
For the maximum sustained IOPS tests, the results in Figure 5.4 show that iSCSI over
RDMA achieved its peak number of IOPS at a lower value than that obtained when running
iSCSI alone, at a request size of 512 bytes with 100 percent sequential reads. The same trend was
observed across the other request sizes depicted in Figure 5.4.

As shown in Figure 5.4, there is a decrease in the number of IOPS required to transfer the
same amount of data in iSCSI over RDMA compared to running iSCSI alone, because of the
Zero Copy implementation used in iSCSI over RDMA. With the RDMA implementation, as
discussed in previous sections, the multiple-copy overhead was eliminated, as was the
requirement for flow control. Hence, CPU interrupts were reduced, and the number of IOPS
required was also reduced. Overall, the test results in Figure 5.4 show that fewer IOPS are
needed in iSCSI over RDMA than in iSCSI, making the system more effective.
Figure 5.5. Percentage of CPU utilized in iSCSI and iSCSI over RDMA (percentage CPU
utilized, from 0 to 50, versus block size transferred, from 512 B to 64 KB).
As shown in Figure 5.5, the percentage of CPU utilized in iSCSI is comparatively higher
than that in iSCSI over RDMA, because RDMA frees CPU time through the Kernel Bypass
mechanism, thus providing efficient use of CPU cycles.
The calculations made in section 5.3 and the results obtained coincide with one another,
i.e., both show an increase in system performance of 20 to 25 percent when RDMA was
implemented. Hence, RDMA provides an opportunity to utilize CPU resources efficiently for
running other important applications in SAN environments and reduces processing delay in the
system, thereby providing efficient use of all network resources.
The difference in the ratio of the number of IOPS for each request size, as plotted in
Figure 5.4, is approximately 20 to 30 percent, i.e., iSCSI over RDMA provides a 20 to 30
percent increase in throughput in terms of IOPS. This difference was calculated from the test
results using Equation 5.3.
CHAPTER 6
CONCLUSIONS AND FUTURE WORK
The iSCSI over RDMA protocol was successfully implemented on an existing Ethernet
network, meeting all the objectives discussed in this research work.
The major issues of CPU and memory bandwidth addressed in this work are as follows:
• Zero Copy receives: RDMA Writes place the data directly into the application buffer.
• Protocol overhead is reduced.
• Completion events are reduced.
Performance improved in terms of IOPS required and percentage of CPU utilized when
data was transferred over the network.
Since IP storage is based on industry-standard protocols like iSCSI, a significant
performance gain from RDMA can be achieved when the same approach is implemented
in large clusters.
Storage networks implemented in this fashion effectively use an existing IP
infrastructure and provide better results at a reduced price.
There are a few security concerns in implementing RDMA technologies in SAN
environments, since they may expose memory on the network.
Hence, a better security implementation should be developed to enhance the security of
SAN environments using iSCSI over RDMA.
LIST OF REFERENCES
1. “Microsoft iSCSI Software Initiator 1.05 Users Guide,” Microsoft Corporation.
2. “Deploying IP SANs with the Microsoft iSCSI Architecture,” Microsoft Corporation,
July 2004.
3. Drew Bird, “Network Storage - The Basics,”http://www.enterprisestorageforum.com/
technology/features/article.php/947551.
4. De-Zhi Han, “SNINS: A Storage Network Integrating NAS and SAN,” Dept. of
Computer Science, Jinan University. Machine Learning and Cybernetics, 2005.
5. Michael T. LoBue, “Surveying Today’s Most Popular Storage Interfaces,” LoBue and
Majdalany Management Group.
6. Young, G.H.; Yiu, V.S.; Lai-Man Wan, “Parallel computing on SCSI network,”
Aerospace and Electronics Conference, 1997. Proceedings of the IEEE 1997
National.
7. SCSI-2 Specification, Document X3.131-1994, ANSI.
8. SCSI-2, www.storagereview.com/guide2000/ref/hdd/if/scsi/std/scsi2.html
9. “iSCSI Technology: Convergence of Networking and Storage,” Hewlett-Packard
Development Company, 2003.
10. Yingping Lu, Farrukh Noman and David H.C. Du, “Simulation Study of iSCSI-based
Storage System,” Department of Computer Science & Engineering, University of
Minnesota, Minneapolis.
11. Kalman Z. Meth and Julian Satran, “Design of the iSCSI Protocol,” IBM Haifa
Research Laboratory, Haifa, Israel.
12. Qing Yang, “On Performance of Parallel iSCSI Protocol for Networked Storage
Systems,” Dept. of Electrical and Computer Engineering, University of Rhode
Island.
13. A SNIA IP Storage Forum Whitepaper, “iSCSI Building Blocks for IP Storage
Networking”.
14. “Introduction to iSCSI,” Cisco Systems. http://www.cisco.com/warp/public/cc/
pd/rt/5420/prodlit/imdpm_wp.pdf.
15. “HP-UX iSCSI Software Initiator Support Guide: HP-UX 11i v1 & 11i v2,”
http://docs.hp.com/en/T1452-90011/ch04s01.html.
16. “An Introduction To iSCSI (Internet SCSI),” Embedded Systems & Product
Engineering, Storage Center of Excellence at Wipro.
17. “iSCSI Protocol Concepts and Implementation,” Cisco Systems.
18. Jiuxing Liu, Dhabaleswar K. Panda and Mohammad Banikazemi “Evaluating the
Impact of RDMA on Storage I/O over InfiniBand,” Department of Computer and
Information Science Engineering, Ohio State University.
19. Thadani, M. and Yousef, A. K., “An Efficient Zero-Copy I/O Framework for UNIX,”
http://www.sunmicrosystem.com, 2003.
20. “Ethernet RDMA Technologies,” Hewlett-Packard Development Company, 2003.
21. J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, “Protocols and Strategies for
Optimizing Remote Memory Operations on Clusters,” Proc. Communication
Architecture for Clusters Workshop of IPDPS, 2002.
22. D. D. Clark, V. Jacobson, J. Romkey, and H. Salwen, “An Analysis of TCP
Processing Overhead,” IEEE Communications Magazine, vol. 27, no. 6, June
1989, pp. 23-29.
23. R. Recio, P. Culley, D. Garcia, J. Hilland, and B. Metzler, “An RDMA Protocol
Specification,” April 2005. URL http://www.ietf.org/internet-drafts/draft-ietf-rddp-rdmap-04.txt.
24. Smyk, A. and Tudruj, M., “RDMA Control Support for Fine-Grain Parallel
Computations,” Institute of Computer Science, Polish Academy of Sciences, Poland.
Parallel, Distributed and Network-Based Processing, 2004. Proceedings 12th
Euromicro Conference.
25. Allyn Romanow, “An Overview of RDMA over IP,” Cisco Systems.
26. S. Bailey, T. Talpey, “The Architecture of Direct Data Placement (DDP) And Remote
Direct Memory Access (RDMA) On Internet Protocols,” December 2002.
27. RDMA Consortium. “Architectural specifications for RDMA over TCP/IP” URL
http://www.rdmaconsortium.org/.
28. Dennis Dalessandro and Pete Wyckoff “A Performance Analysis of the Ammasso
RDMA Enabled Ethernet Adapter and its iWARP API,” Ohio Supercomputer Center.
29. Hemal Shah, James Pinkerton, Renato Recio, and Paul Culley, “Direct Data
Placement over Reliable Transports,” February 2005. URL
http://www.ietf.org/internet-drafts/draft-ietf-rddp-ddp-04.txt.
30. Tipparaju, V., Santhanaraman, G., Nieplocha, J., and Panda, D. K., “Host-Assisted
Zero-Copy Remote Memory Access Communication on InfiniBand,” Pacific Northwest
National Laboratory, Parallel and Distributed Processing Symposium, 2004.
Proceedings 18th International.
31. Casey B. Reardon, Alan D. George and Clement T. Cole, “Comparative Performance
Analysis of RDMA-Enhanced Ethernet,” Department of Electrical and Computer
Engineering, University of Florida, Gainesville, FL.
32. “Ammasso 1100 Ethernet Adapters,” Ammasso Inc. URL
http://www.ammasso.com/products.htm.
33. Ariel Cohen, “RDMA Offers Low Overhead, High Speed,” Network World, 2003.
34. Pinkerton, Jim, “The Case for RDMA,” 2002.
35. Hyun-Wook Jin, Sundeep Narravula, Gregory Brown, Karthikeyan Vaidyanathan,
Pavan Balaji, and Dhabaleswar K. Panda, “Performance Evaluation of RDMA over
IP: A Case Study with the Ammasso Gigabit Ethernet NIC,” Department of
Computer Science and Engineering, Ohio State University.
36. Guojun Jin and Brian L. Tierney, “System Capability Effects on Algorithms for
Network Bandwidth Measurement,” Distributed Systems Department, Lawrence
Berkeley National Laboratory, Berkeley.
37. “IO Meter Complete Guide,” URL www.iometer.org.
38. Stephen Aiken, Dirk Grunwald, Andrew R. Pleszkun and Jesse Willeke, “A
Performance analysis of the iSCSI Protocol,” Colorado Center for Information
Storage University of Colorado, Boulder.