NFSv4                                                         S. Shepler
Internet-Draft                                                    Editor
Intended status: Standards Track                           March 6, 2006
Expires: September 7, 2006


                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-02.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 7, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This Internet-Draft describes the NFSv4 minor version 1 protocol
   extensions.  The most significant of these extensions are commonly
   called Sessions, Directory Delegations, and parallel NFS (pNFS).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1.  Protocol Data Types
     1.1.  Basic Data Types
     1.2.  Structured Data Types
   2.  Filehandles
     2.1.  Obtaining the First Filehandle
       2.1.1.  Root Filehandle
       2.1.2.  Public Filehandle
     2.2.  Filehandle Types
       2.2.1.  General Properties of a Filehandle
       2.2.2.  Persistent Filehandle
       2.2.3.  Volatile Filehandle
     2.3.  One Method of Constructing a Volatile Filehandle
     2.4.  Client Recovery from Filehandle Expiration
   3.  File Attributes
     3.1.  Mandatory Attributes
     3.2.  Recommended Attributes
     3.3.  Named Attributes
     3.4.  Classification of Attributes
     3.5.  Mandatory Attributes - Definitions
     3.6.  Recommended Attributes - Definitions
     3.7.  Time Access
     3.8.  Interpreting owner and owner_group
     3.9.  Character Case Attributes
     3.10. Quota Attributes
     3.11. mounted_on_fileid
     3.12. send_impl_id and recv_impl_id
     3.13. fs_layouttype
     3.14. layouttype
     3.15. layouthint
     3.16. Access Control Lists
       3.16.1.  ACE type
       3.16.2.  ACE Access Mask
       3.16.3.  ACE flag
       3.16.4.  ACE who
       3.16.5.  Mode Attribute
       3.16.6.  Interaction Between Mode and ACL Attributes
   4.  Filesystem Migration and Replication
     4.1.  Replication
     4.2.  Migration
     4.3.  Interpretation of the fs_locations Attribute
     4.4.  Filehandle Recovery for Migration or Replication
   5.  NFS Server Name Space
     5.1.  Server Exports
     5.2.  Browsing Exports
     5.3.  Server Pseudo Filesystem
     5.4.  Multiple Roots
     5.5.  Filehandle Volatility
     5.6.  Exported Root
     5.7.  Mount Point Crossing
     5.8.  Security Policy and Name Space Presentation
   6.  File Locking and Share Reservations
     6.1.  Locking
       6.1.1.  Client ID
       6.1.2.  Server Release of Clientid
       6.1.3.  lock_owner and stateid Definition
       6.1.4.  Use of the stateid and Locking
       6.1.5.  Sequencing of Lock Requests
       6.1.6.  Recovery from Replayed Requests
       6.1.7.  Releasing lock_owner State
       6.1.8.  Use of Open Confirmation
     6.2.  Lock Ranges
     6.3.  Upgrading and Downgrading Locks
     6.4.  Blocking Locks
     6.5.  Lease Renewal
     6.6.  Crash Recovery
       6.6.1.  Client Failure and Recovery
       6.6.2.  Server Failure and Recovery
       6.6.3.  Network Partitions and Recovery
     6.7.  Recovery from a Lock Request Timeout or Abort
     6.8.  Server Revocation of Locks
     6.9.  Share Reservations
     6.10. OPEN/CLOSE Operations
       6.10.1.  Close and Retention of State Information
     6.11. Open Upgrade and Downgrade
     6.12. Short and Long Leases
     6.13. Clocks, Propagation Delay, and Calculating Lease Expiration
     6.14. Migration, Replication and State
       6.14.1.  Migration and State
       6.14.2.  Replication and State
       6.14.3.  Notification of Migrated Lease
       6.14.4.  Migration and the Lease_time Attribute
   7.  Client-Side Caching
     7.1.  Performance Challenges for Client-Side Caching
     7.2.  Delegation and Callbacks
       7.2.1.  Delegation Recovery
     7.3.  Data Caching
       7.3.1.  Data Caching and OPENs
       7.3.2.  Data Caching and File Locking
       7.3.3.  Data Caching and Mandatory File Locking
       7.3.4.  Data Caching and File Identity
     7.4.  Open Delegation
       7.4.1.  Open Delegation and Data Caching
       7.4.2.  Open Delegation and File Locks
       7.4.3.  Handling of CB_GETATTR
       7.4.4.  Recall of Open Delegation
       7.4.5.  Clients that Fail to Honor Delegation Recalls
       7.4.6.  Delegation Revocation
     7.5.  Data Caching and Revocation
       7.5.1.  Revocation Recovery for Write Open Delegation
     7.6.  Attribute Caching
     7.7.  Data and Metadata Caching and Memory Mapped Files
     7.8.  Name Caching
     7.9.  Directory Caching
   8.  Security Negotiation
   9.  Clarification of Security Negotiation in NFSv4.1
     9.1.  PUTFH + LOOKUP
     9.2.  PUTFH + LOOKUPP
     9.3.  PUTFH + SECINFO
     9.4.  PUTFH + Anything Else
   10. NFSv4.1 Sessions
     10.1.  Sessions Background
       10.1.1.  Introduction to Sessions
       10.1.2.  Motivation
       10.1.3.  Problem Statement
       10.1.4.  NFSv4 Session Extension Characteristics
     10.2.  Transport Issues
       10.2.1.  Session Model
       10.2.2.  Connection State
       10.2.3.  NFSv4 Channels, Sessions and Connections
       10.2.4.  Reconnection, Trunking and Failover
       10.2.5.  Server Duplicate Request Cache
     10.3.  Session Initialization and Transfer Models
       10.3.1.  Session Negotiation
       10.3.2.  RDMA Requirements
       10.3.3.  RDMA Connection Resources
       10.3.4.  TCP and RDMA Inline Transfer Model
       10.3.5.  RDMA Direct Transfer Model
     10.4.  Connection Models
       10.4.1.  TCP Connection Model
       10.4.2.  Negotiated RDMA Connection Model
       10.4.3.  Automatic RDMA Connection Model
     10.5.  Buffer Management, Transfer, Flow Control
     10.6.  Retry and Replay
     10.7.  The Back Channel
     10.8.  COMPOUND Sizing Issues
     10.9.  Data Alignment
     10.10. NFSv4 Integration
       10.10.1.  Minor Versioning
       10.10.2.  Slot Identifiers and Server Duplicate Request Cache
       10.10.3.  Resolving server callback races with sessions
       10.10.4.  COMPOUND and CB_COMPOUND
       10.10.5.  eXternal Data Representation Efficiency
       10.10.6.  Effect of Sessions on Existing Operations
       10.10.7.  Authentication Efficiencies
     10.11. Sessions Security Considerations
       10.11.1.  Authentication
   11. Directory Delegations
     11.1.  Introduction to Directory Delegations
     11.2.  Directory Delegation Design (in brief)
     11.3.  Recommended Attributes in support of Directory Delegations
     11.4.  Delegation Recall
     11.5.  Delegation Recovery
   12. Introduction
   13. General Definitions
     13.1.  Metadata Server
     13.2.  Client
     13.3.  Storage Device
     13.4.  Storage Protocol
     13.5.  Control Protocol
     13.6.  Metadata
     13.7.  Layout
   14. pNFS protocol semantics
     14.1.  Definitions
       14.1.1.  Layout Types
       14.1.2.  Layout Iomode
       14.1.3.  Layout Segments
       14.1.4.  Device IDs
       14.1.5.  Aggregation Schemes
     14.2.  Guarantees Provided by Layouts
     14.3.  Getting a Layout
     14.4.  Committing a Layout
       14.4.1.  LAYOUTCOMMIT and mtime/atime/change
       14.4.2.  LAYOUTCOMMIT and size
       14.4.3.  LAYOUTCOMMIT and layoutupdate
     14.5.  Recalling a Layout
       14.5.1.  Basic Operation
       14.5.2.  Recall Callback Robustness
       14.5.3.  Recall/Return Sequencing
     14.6.  Metadata Server Write Propagation
     14.7.  Crash Recovery
       14.7.1.  Leases
       14.7.2.  Client Recovery
       14.7.3.  Metadata Server Recovery
       14.7.4.  Storage Device Recovery
   15. Security Considerations
     15.1.  File Layout Security
     15.2.  Object Layout Security
     15.3.  Block/Volume Layout Security
   16. The NFSv4 File Layout Type
     16.1.  File Striping and Data Access
       16.1.1.  Sparse and Dense Storage Device Data Layouts
       16.1.2.  Metadata and Storage Device Roles
       16.1.3.  Device Multipathing
       16.1.4.  Operations Issued to Storage Devices
     16.2.  Global Stateid Requirements
     16.3.  The Layout Iomode
     16.4.  Storage Device State Propagation
       16.4.1.  Lock State Propagation
       16.4.2.  Open-mode Validation
       16.4.3.  File Attributes
     16.5.  Storage Device Component File Size
     16.6.  Crash Recovery Considerations
     16.7.  Security Considerations
     16.8.  Alternate Approaches
   17. Layouts and Aggregation
     17.1.  Simple Map
     17.2.  Block Extent Map
     17.3.  Striped Map (RAID 0)
     17.4.  Replicated Map
     17.5.  Concatenated Map
     17.6.  Nested Map
   18. Minor Versioning
   19. Internationalization
     19.1.  Stringprep profile for the utf8str_cs type
     19.2.  Stringprep profile for the utf8str_cis type
     19.3.  Stringprep profile for the utf8str_mixed type
     19.4.  UTF-8 Related Errors
   20. Error Definitions
   21. NFS version 4.1 Procedures
     21.1.  Procedure 0: NULL - No Operation
     21.2.  Procedure 1: COMPOUND - Compound Operations
   22. NFS version 4.1 Operations
     22.1.  Operation 3: ACCESS - Check Access Rights
     22.2.  Operation 4: CLOSE - Close File
     22.3.  Operation 5: COMMIT - Commit Cached Data
     22.4.  Operation 6: CREATE - Create a Non-Regular File Object
     22.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
            Recovery
     22.6.  Operation 8: DELEGRETURN - Return Delegation
     22.7.  Operation 9: GETATTR - Get Attributes
     22.8.  Operation 10: GETFH - Get Current Filehandle
     22.9.  Operation 11: LINK - Create Link to a File
     22.10. Operation 12: LOCK - Create Lock
     22.11. Operation 13: LOCKT - Test For Lock
     22.12. Operation 14: LOCKU - Unlock File
     22.13. Operation 15: LOOKUP - Lookup Filename
     22.14. Operation 16: LOOKUPP - Lookup Parent Directory
     22.15. Operation 17: NVERIFY - Verify Difference in Attributes
     22.16. Operation 18: OPEN - Open a Regular File
     22.17. Operation 19: OPENATTR - Open Named Attribute Directory
     22.18. Operation 20: OPEN_CONFIRM - Confirm Open
     22.19. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     22.20. Operation 22: PUTFH - Set Current Filehandle
     22.21. Operation 24: PUTROOTFH - Set Root Filehandle
     22.22. Operation 25: READ - Read from File
     22.23. Operation 26: READDIR - Read Directory
     22.24. Operation 27: READLINK - Read Symbolic Link
     22.25. Operation 28: REMOVE - Remove Filesystem Object
     22.26. Operation 29: RENAME - Rename Directory Entry
     22.27. Operation 30: RENEW - Renew a Lease
     22.28. Operation 31: RESTOREFH - Restore Saved Filehandle
     22.29. Operation 32: SAVEFH - Save Current Filehandle
     22.30. Operation 33: SECINFO - Obtain Available Security
     22.31. Operation 34: SETATTR - Set Attributes
     22.32. Operation 35: SETCLIENTID - Negotiate Clientid
     22.33. Operation 36: SETCLIENTID_CONFIRM - Confirm Clientid
     22.34. Operation 37: VERIFY - Verify Same Attributes
     22.35. Operation 38: WRITE - Write to File
     22.36. Operation 39: RELEASE_LOCKOWNER - Release Lockowner State
     22.37. Operation 10044: ILLEGAL - Illegal operation
     22.38. SECINFO_NO_NAME - Get Security on Unnamed Object
     22.39. CREATECLIENTID - Instantiate Clientid
     22.40. CREATESESSION - Create New Session and Confirm Clientid
     22.41. BIND_BACKCHANNEL - Create a callback channel binding
     22.42. DESTROYSESSION - Destroy existing session
     22.43. SEQUENCE - Supply per-procedure sequencing and control
     22.44. GET_DIR_DELEGATION - Get a directory delegation
     22.45. LAYOUTGET - Get Layout Information
     22.46. LAYOUTCOMMIT - Commit writes made using a layout
     22.47. LAYOUTRETURN - Release Layout Information
     22.48. GETDEVICEINFO - Get Device Information
     22.49. GETDEVICELIST
   23. NFS version 4.1 Callback Procedures
     23.1.  Procedure 0: CB_NULL - No Operation
     23.2.  Procedure 1: CB_COMPOUND - Compound Operations
   24. NFS version 4.1 Callback Operations
     24.1.  Operation 3: CB_GETATTR - Get Attributes
     24.2.  Operation 4: CB_RECALL - Recall an Open Delegation
     24.3.  Operation 10044: CB_ILLEGAL - Illegal Callback Operation
     24.4.  CB_RECALLCREDIT - change flow control limits
     24.5.  CB_SEQUENCE - Supply callback channel sequencing and control
     24.6.  CB_NOTIFY - Notify directory changes
     24.7.  CB_RECALL_ANY - Keep any N delegations
     24.8.  CB_SIZECHANGED
     24.9.  CB_LAYOUTRECALL
   25. References
     25.1.  Normative References
     25.2.  Informative References
   Appendix A.  Acknowledgments
   Author's Address
   Intellectual Property and Copyright Statements

1.  Protocol Data Types

   The syntax and semantics to describe the data types of the NFS
   version 4 protocol are defined in the XDR RFC1832 [2] and RPC RFC1831
   [3] documents.  The next sections build upon the XDR data types to
   define types and structures specific to this protocol.

1.1.  Basic Data Types

   These are the base NFSv4 data types.

   +---------------+---------------------------------------------------+
   | Data Type     | Definition                                        |
   +---------------+---------------------------------------------------+
   | int32_t       | typedef int int32_t;                              |
   | uint32_t      | typedef unsigned int uint32_t;                    |
   | int64_t       | typedef hyper int64_t;                            |
   | uint64_t      | typedef unsigned hyper uint64_t;                  |
   | attrlist4     | typedef opaque attrlist4<>;                       |
   |               | Used for file/directory attributes                |
   | bitmap4       | typedef uint32_t bitmap4<>;                       |
   |               | Used in attribute array encoding.                 |
   | changeid4     | typedef uint64_t changeid4;                       |
   |               | Used in definition of change_info                 |
   | clientid4     | typedef uint64_t clientid4;                       |
   |               | Shorthand reference to client identification      |
   | component4    | typedef utf8str_cs component4;                    |
   |               | Represents path name components                   |
   | count4        | typedef uint32_t count4;                          |
   |               | Various count parameters (READ, WRITE, COMMIT)    |
   | length4       | typedef uint64_t length4;                         |
   |               | Describes LOCK lengths                            |
   | linktext4     | typedef utf8str_cs linktext4;                     |
   |               | Symbolic link contents                            |
   | mode4         | typedef uint32_t mode4;                           |
   |               | Mode attribute data type                          |
   | nfs_cookie4   | typedef uint64_t nfs_cookie4;                     |
   |               | Opaque cookie value for READDIR                   |
   | nfs_fh4       | typedef opaque nfs_fh4<NFS4_FHSIZE>;              |
   |               | Filehandle definition; NFS4_FHSIZE is defined as  |
   |               | 128                                               |
   | nfs_ftype4    | enum nfs_ftype4;                                  |
   |               | Various defined file types                        |
   | nfsstat4      | enum nfsstat4;                                    |
   |               | Return value for operations                       |
   | offset4       | typedef uint64_t offset4;                         |
   |               | Various offset designations (READ, WRITE, LOCK,   |
   |               | COMMIT)                                           |
   | pathname4     | typedef component4 pathname4<>;                   |
   |               | Represents path name for fs_locations             |
   | qop4          | typedef uint32_t qop4;                            |
   |               | Quality of protection designation in SECINFO      |
   | sec_oid4      | typedef opaque sec_oid4<>;                        |
   |               | Security Object Identifier. The sec_oid4 data     |
   |               | type is not really opaque. Instead it contains an |
   |               | ASN.1 OBJECT IDENTIFIER as used by GSS-API in the |
   |               | mech_type argument to GSS_Init_sec_context. See   |
   |               | RFC2743 [4] for details.                          |
   | seqid4        | typedef uint32_t seqid4;                          |
   |               | Sequence identifier used for file locking         |
   | utf8string    | typedef opaque utf8string<>;                      |
   |               | UTF-8 encoding for strings                        |
   | utf8str_cis   | typedef opaque utf8str_cis;                       |
   |               | Case-insensitive UTF-8 string                     |
   | utf8str_cs    | typedef opaque utf8str_cs;                        |
   |               | Case-sensitive UTF-8 string                       |
   | utf8str_mixed | typedef opaque utf8str_mixed;                     |
   |               | UTF-8 strings with a case sensitive prefix and a  |
   |               | case insensitive suffix.                          |
   | verifier4     | typedef opaque verifier4[NFS4_VERIFIER_SIZE];     |
   |               | Verifier used for various operations (COMMIT,     |
   |               | CREATE, OPEN, READDIR, SETCLIENTID,               |
   |               | SETCLIENTID_CONFIRM, WRITE).                      |
   |               | NFS4_VERIFIER_SIZE is defined as 8.               |
   +---------------+---------------------------------------------------+

                         End of Base Data Types

                                Table 1

1.2.  Structured Data Types

1.2.1.  nfstime4

   struct nfstime4 {
           int64_t         seconds;
           uint32_t        nseconds;
   };

   The nfstime4 structure gives the number of seconds and nanoseconds
   since midnight or 0 hour January 1, 1970 Coordinated Universal Time
   (UTC).  Values greater than zero for the seconds field denote dates
   after the 0 hour January 1, 1970.  Values less than zero for the
   seconds field denote dates before the 0 hour January 1, 1970.  In
   both cases, the nseconds field is to be added to the seconds field
   for the final time representation.
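
   As an illustration only, the split into seconds and nseconds can be
   sketched in Python.  The function names to_nfstime4 and
   from_nfstime4 are hypothetical and are not part of the protocol; the
   sketch assumes the caller holds the time as a signed count of
   nanoseconds relative to 0 hour January 1, 1970 UTC.  Floor division
   keeps nseconds non-negative, so times before the epoch get a negative
   seconds value with nseconds added to it.

```python
NSEC_PER_SEC = 1_000_000_000


def to_nfstime4(total_nsec):
    """Split a signed nanosecond count into (seconds, nseconds).

    divmod() floors toward negative infinity, so nseconds always lands
    in the range 0..999,999,999 and is *added* to seconds.
    """
    seconds, nseconds = divmod(total_nsec, NSEC_PER_SEC)
    return seconds, nseconds


def from_nfstime4(seconds, nseconds):
    """Recombine (seconds, nseconds) into a signed nanosecond count."""
    # The draft treats nseconds values above 999,999,999 as invalid.
    if not 0 <= nseconds < NSEC_PER_SEC:
        raise ValueError("nseconds out of range")
    return seconds * NSEC_PER_SEC + nseconds


# A quarter second before the epoch: seconds = -1, nseconds = 750000000.
print(to_nfstime4(-250_000_000))
```

   The design choice worth noting is that nseconds is never negative:
   the sign of the time value lives entirely in the seconds field.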
   For example, if the time to be represented is one-half second before
   0 hour January 1, 1970, the seconds field would have a value of
   negative one (-1) and the nseconds field would have a value of
   one-half second (500000000).  Values greater than 999,999,999 for
   nseconds are considered invalid.

   This data type is used to pass time and date information.  A server
   converts to and from its local representation of time when processing
   time values, preserving as much accuracy as possible.  If the
   precision of timestamps stored for a filesystem object is less than
   defined, loss of precision can occur.  An adjunct time maintenance
   protocol is recommended to reduce client and server time skew.

1.2.2.  time_how4

   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

1.2.3.  settime4

   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
           nfstime4        time;
    default:
           void;
   };

   The above definitions are used as the attribute definitions to set
   time values.  If set_it is SET_TO_SERVER_TIME4, then the server uses
   its local representation of time for the time value.

1.2.4.  specdata4

   struct specdata4 {
           uint32_t        specdata1; /* major device number */
           uint32_t        specdata2; /* minor device number */
   };

   This data type represents additional information for the device file
   types NF4CHR and NF4BLK.

1.2.5.  fsid4

   struct fsid4 {
           uint64_t        major;
           uint64_t        minor;
   };

1.2.6.  fs_location4

   struct fs_location4 {
           utf8str_cis     server<>;
           pathname4       rootpath;
   };

1.2.7.  fs_locations4

   struct fs_locations4 {
           pathname4       fs_root;
           fs_location4    locations<>;
   };

   The fs_location4 and fs_locations4 data types are used for the
   fs_locations recommended attribute, which is used for migration and
   replication support.

1.2.8.  fattr4

   struct fattr4 {
           bitmap4         attrmask;
           attrlist4       attr_vals;
   };

   The fattr4 structure is used to represent file and directory
   attributes.

   The bitmap is a counted array of 32-bit integers used to contain bit
   values.  The position of the integer in the array that contains bit n
   can be computed from the expression (n / 32) and its bit within that
   integer is (n mod 32).

   0            1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--

1.2.9.  change_info4

   struct change_info4 {
           bool            atomic;
           changeid4       before;
           changeid4       after;
   };

   This structure is used with the CREATE, LINK, REMOVE, RENAME
   operations to let the client know the value of the change attribute
   for the directory in which the target filesystem object resides.

1.2.10.  clientaddr4

   struct clientaddr4 {
           /* see struct rpcb in RFC1833 */
           string r_netid<>;    /* network id */
           string r_addr<>;     /* universal address */
   };

   The clientaddr4 structure is used as part of the SETCLIENTID
   operation to either specify the address of the client that is using a
   clientid or as part of the callback registration.  The r_netid and
   r_addr fields are specified in RFC1833 [9], but they are
   underspecified in RFC1833 [9] as far as what they should look like
   for specific protocols.

   For TCP over IPv4 and for UDP over IPv4, the format of r_addr is the
   US-ASCII string:

      h1.h2.h3.h4.p1.p2

   The prefix, "h1.h2.h3.h4", is the standard textual form for
   representing an IPv4 address, which is always four octets long.
   Assuming big-endian ordering, h1, h2, h3, and h4 are, respectively,
   the first through fourth octets each converted to ASCII-decimal.
   Assuming big-endian ordering, p1 and p2 are, respectively, the first
   and second octets each converted to ASCII-decimal.
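
   The h1.h2.h3.h4.p1.p2 form described above can be sketched as a pair
   of conversion helpers.  This is a minimal illustration for the IPv4
   case only; the names to_universal_addr and from_universal_addr are
   hypothetical, and Python's standard ipaddress module is used purely
   for validating the dotted-quad prefix.

```python
import ipaddress


def to_universal_addr(host, port):
    """Build the IPv4 universal address string h1.h2.h3.h4.p1.p2.

    host may be a dotted-quad string or a 32-bit integer in host
    notation (for example 0x0A010307).
    """
    addr = ipaddress.IPv4Address(host)
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    # p1 is the high-order octet of the port, p2 the low-order octet.
    p1, p2 = divmod(port, 256)
    return f"{addr}.{p1}.{p2}"


def from_universal_addr(uaddr):
    """Parse h1.h2.h3.h4.p1.p2 back into (dotted_quad, port)."""
    parts = uaddr.split(".")
    if len(parts) != 6:
        raise ValueError("expected six dot-separated fields")
    host = str(ipaddress.IPv4Address(".".join(parts[:4])))
    p1, p2 = int(parts[4]), int(parts[5])
    if not (0 <= p1 <= 255 and 0 <= p2 <= 255):
        raise ValueError("port octet out of range")
    return host, p1 * 256 + p2


print(to_universal_addr(0x0A010307, 0x020F))  # prints 10.1.3.7.2.15
```

   The IPv6 form would need separate handling, since the address part
   itself contains colons and permits alternative textual forms.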
   For example, if a host, in big-endian order, has an address of
   0x0A010307 and there is a service listening on, in big-endian order,
   port 0x020F (decimal 527), then the complete universal address is
   "10.1.3.7.2.15".

   For TCP over IPv4 the value of r_netid is the string "tcp".  For UDP
   over IPv4 the value of r_netid is the string "udp".

   For TCP over IPv6 and for UDP over IPv6, the format of r_addr is the
   US-ASCII string:

      x1:x2:x3:x4:x5:x6:x7:x8.p1.p2

   The suffix "p1.p2" is the service port, and is computed the same way
   as with universal addresses for TCP and UDP over IPv4.  The prefix,
   "x1:x2:x3:x4:x5:x6:x7:x8", is the standard textual form for
   representing an IPv6 address as defined in Section 2.2 of RFC1884
   [5].  Additionally, the two alternative forms specified in Section
   2.2 of RFC1884 [5] are also acceptable.

   For TCP over IPv6 the value of r_netid is the string "tcp6".  For UDP
   over IPv6 the value of r_netid is the string "udp6".

1.2.11.  cb_client4

   struct cb_client4 {
           unsigned int    cb_program;
           clientaddr4     cb_location;
   };

   This structure is used by the client to inform the server of its
   callback address; it includes the program number and client address.

1.2.12.  nfs_client_id4

   struct nfs_client_id4 {
           verifier4       verifier;
           opaque          id<NFS4_OPAQUE_LIMIT>;
   };

   This structure is part of the arguments to the SETCLIENTID operation.
   NFS4_OPAQUE_LIMIT is defined as 1024.

1.2.13.  open_owner4

   struct open_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

   This structure is used to identify the owner of open state.
   NFS4_OPAQUE_LIMIT is defined as 1024.

1.2.14.  lock_owner4

   struct lock_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

   This structure is used to identify the owner of file locking state.
   NFS4_OPAQUE_LIMIT is defined as 1024.

1.2.15.  open_to_lock_owner4

   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

   This structure is used for the first LOCK operation done for an
   open_owner4.  It provides both the open_stateid and lock_owner such
   that the transition is made from a valid open_stateid sequence to
   that of the new lock_stateid sequence.  Using this mechanism avoids
   the confirmation of the lock_owner/lock_seqid pair since it is tied
   to established state in the form of the open_stateid/open_seqid.

1.2.16.  stateid4

   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

   This structure is used for the various state sharing mechanisms
   between the client and server.  For the client, this data structure
   is read-only.  The starting value of the seqid field is undefined.
   The server is required to increment the seqid field monotonically at
   each transition of the stateid.  This is important since the client
   will inspect the seqid in OPEN stateids to determine the order of
   OPEN processing done by the server.

1.2.17.  layouttype4

   enum layouttype4 {
           LAYOUT_NFSV4_FILES  = 1,
           LAYOUT_OSD2_OBJECTS = 2,
           LAYOUT_BLOCK_VOLUME = 3
   };

   A layout type specifies the layout being used.  The implication is
   that clients have "layout drivers" that support one or more layout
   types.  The file server advertises the layout types it supports
   through the LAYOUT_TYPES file system attribute.  A client asks for
   layouts of a particular type in LAYOUTGET, and passes those layouts
   to its layout driver.  The set of well-known layout types must be
   defined.  In addition, a private range of layout types is to be
   defined by this document.  This would allow custom installations to
   introduce new layout types.

   [[Comment.1: Determine private range of layout types]]

   New layout types must be specified in RFCs approved by the IESG
   before becoming part of the pNFS specification.
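
   The relationship between advertised layout types and client layout
   drivers can be sketched as a small registry.  This is an
   illustration under stated assumptions, not a description of any
   existing client: the class LayoutDriverRegistry, its methods, and
   the idea of a driver being a plain callable that receives the opaque
   layout body are all hypothetical.  Only the three enumeration values
   come from this document.

```python
from enum import IntEnum


class LayoutType4(IntEnum):
    """Values of the layouttype4 enumeration defined in this section."""
    LAYOUT_NFSV4_FILES = 1
    LAYOUT_OSD2_OBJECTS = 2
    LAYOUT_BLOCK_VOLUME = 3


class LayoutDriverRegistry:
    """Hypothetical client-side table mapping layout types to drivers."""

    def __init__(self):
        self._drivers = {}

    def register(self, layout_type, driver):
        self._drivers[LayoutType4(layout_type)] = driver

    def pick(self, server_types):
        """Return the first server-advertised type with a local driver."""
        for raw in server_types:
            try:
                layout_type = LayoutType4(raw)
            except ValueError:
                continue  # layout type unknown to this client
            if layout_type in self._drivers:
                return layout_type
        return None

    def dispatch(self, layout_type, layout_body):
        """Hand an opaque layout body to the driver for its type."""
        return self._drivers[LayoutType4(layout_type)](layout_body)


registry = LayoutDriverRegistry()
registry.register(LayoutType4.LAYOUT_NFSV4_FILES,
                  lambda body: ("files", len(body)))
# The server advertises block/volume and file layouts; only the file
# layout has a driver registered here.
chosen = registry.pick([3, 1])
print(chosen.name)
```

   A client with no matching driver (pick returning None) would simply
   perform I/O through the metadata server with ordinary NFSv4
   operations rather than requesting a layout.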
   The LAYOUT_NFSV4_FILES enumeration specifies that the NFSv4 file
   layout type is to be used. The LAYOUT_OSD2_OBJECTS enumeration
   specifies that the object layout, as defined in [10], is to be used.
   Similarly, the LAYOUT_BLOCK_VOLUME enumeration specifies that the
   block/volume layout, as defined in [11], is to be used.

1.2.18.  pnfs_deviceid4

      typedef uint32_t pnfs_deviceid4;  /* 32-bit device ID */

   Layout information includes device IDs that specify a storage device
   through a compact handle. Addressing and type information is obtained
   with the GETDEVICEINFO operation. A client must not assume that
   device IDs are valid across metadata server reboots. Device IDs are
   qualified by the layout type and are unique per file system (FSID).
   This allows different layout drivers to generate device IDs without
   the need for coordination. See Section 14.1.4 for more details.

1.2.19.  pnfs_deviceaddr4

      struct pnfs_netaddr4 {
              string  r_netid<>;      /* network ID */
              string  r_addr<>;       /* universal address */
      };

      struct pnfs_deviceaddr4 {
              pnfs_layouttype4        type;
              opaque                  device_addr<>;
      };

   The device address is used to set up a communication channel with the
   storage device. Different layout types will require different types
   of structures to define how they communicate with storage devices.
   The opaque device_addr field must be interpreted based on the
   specified layout type.

   Currently, the only defined device address is that for the NFSv4 file
   layout (struct pnfs_netaddr4), which identifies a storage device by
   network IP address and port number. This is sufficient for the
   clients to communicate with the NFSv4 storage devices, and may also
   be sufficient for object-based storage drivers to communicate with
   OSDs. The other device address we expect to support is a SCSI volume
   identifier. The final protocol specification will detail the allowed
   values for device_type and the format of their associated location
   information.
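   Since the r_addr field of pnfs_netaddr4 carries a universal address,
   the following Python sketch shows one way a client might build and
   split such a string. The helper names to_universal_addr and
   from_universal_addr are hypothetical. The sketch assumes the port
   suffix "p1.p2" is the high-order and low-order octet of the port in
   decimal, which matches the worked example of address 0x0A010307 and
   port 0x020F given earlier in this document.

```python
def to_universal_addr(host, port):
    """Append the port as 'p1.p2' (high octet, low octet) to a host string.

    host is a dotted-quad IPv4 string or a textual IPv6 address.
    """
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port out of range")
    return "{}.{}.{}".format(host, port >> 8, port & 0xFF)


def from_universal_addr(uaddr):
    """Split a universal address into (host, port).

    Splitting on the last two dots works for both the IPv4 form and
    the IPv6 form, because the port is always the final two fields.
    """
    host, p1, p2 = uaddr.rsplit(".", 2)
    return host, (int(p1) << 8) | int(p2)
```

   For instance, to_universal_addr("10.1.3.7", 527) yields
   "10.1.3.7.2.15", and from_universal_addr reverses the conversion.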
   [NOTE: other device addresses will be added as the respective
   specifications mature. It has been suggested that a separate
   device_type enumeration be used as a switch to the pnfs_deviceaddr4
   structure (e.g., if multiple types of addresses exist for the same
   layout type). Until such a time as a real case is made and the
   respective layout types have matured, the device address structure
   will be left as is.]

1.2.20.  pnfs_devlist_item4

      struct pnfs_devlist_item4 {
              pnfs_deviceid4          id;
              pnfs_deviceaddr4        addr;
      };

   An array of these values is returned by the GETDEVICELIST operation.
   They define the set of devices associated with a file system.

1.2.21.  pnfs_layout4

      struct pnfs_layout4 {
              offset4                 offset;
              length4                 length;
              pnfs_layoutiomode4      iomode;
              pnfs_layouttype4        type;
              opaque                  layout<>;
      };

   The pnfs_layout4 structure defines a layout for a file. The layout
   type specific data is opaque within this structure and must be
   interpreted based on the layout type. Currently, only the NFSv4 file
   layout type is defined; see Section 16.1 for its definition. Since
   layouts are sub-dividable, the offset and length together with the
   file's filehandle, the clientid, iomode, and layout type identify
   the layout.

   [[Comment.2: there is a discussion of moving the striping
   information, or more generally the "aggregation scheme", up to the
   generic layout level. This creates a two-layer system where the top
   level is a switch on different data placement layouts, and the next
   level down is a switch on different data storage types. This lets
   different layouts (e.g., striping or mirroring or redundant servers)
   be layered over different storage devices. This would move geometry
   information out of nfsv4_file_layouttype4 and up into a generic
   pnfs_striped_layout type that would specify a set of pnfs_deviceid4
   and pnfs_devicetype4 to use for storage.
   Instead of nfsv4_file_layouttype4, there would be
   pnfs_nfsv4_devicetype4.]]

1.2.22.  pnfs_layoutupdate4

      struct pnfs_layoutupdate4 {
              pnfs_layouttype4        type;
              opaque                  layoutupdate_data<>;
      };

   The pnfs_layoutupdate4 structure is used by the client to return
   'updated' layout information to the metadata server at LAYOUTCOMMIT
   time. This structure provides a channel to pass layout type specific
   information back to the metadata server. E.g., for block/volume
   layout types this could include the list of reserved blocks that were
   written. The contents of the opaque layoutupdate_data argument are
   determined by the layout type and are defined in their context. The
   NFSv4 file-based layout does not use this structure, thus the
   layoutupdate_data field should have a zero length.

1.2.23.  layouthint4

      struct pnfs_layouthint4 {
              pnfs_layouttype4        type;
              opaque                  layouthint_data<>;
      };

   The layouthint4 structure is used by the client to pass in a hint
   about the type of layout it would like created for a particular file.
   It is the structure specified by the FILE_LAYOUT_HINT attribute
   described below. The metadata server may ignore the hint, or may
   selectively ignore fields within the hint. This hint should be
   provided at create time as part of the initial attributes within
   OPEN. The NFSv4 file-based layout uses the "nfsv4_file_layouthint"
   structure as defined in Section 16.1.

1.2.24.  pnfs_layoutiomode4

      enum pnfs_layoutiomode4 {
              LAYOUTIOMODE_READ       = 1,
              LAYOUTIOMODE_RW         = 2,
              LAYOUTIOMODE_ANY        = 3
      };

   The iomode specifies whether the client intends to read or write
   (with the possibility of reading) the data represented by the layout.
   The ANY iomode MUST NOT be used for LAYOUTGET; however, it can be
   used for LAYOUTRETURN and LAYOUTRECALL. The ANY iomode specifies that
   layouts pertaining to both READ and RW iomodes are being returned or
   recalled, respectively.
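   The iomode rules stated above can be sketched in Python as follows.
   This is an illustration only: the class LayoutIOMode4 and the
   functions iomode_allowed and iomode_matches are hypothetical names,
   and operations are identified by plain strings for brevity.

```python
from enum import IntEnum


class LayoutIOMode4(IntEnum):
    """Values of enum pnfs_layoutiomode4 as listed in this section."""
    LAYOUTIOMODE_READ = 1
    LAYOUTIOMODE_RW = 2
    LAYOUTIOMODE_ANY = 3


def iomode_allowed(op, iomode):
    """True if iomode may be used with the named layout operation."""
    if op == "LAYOUTGET":
        # The ANY iomode MUST NOT be used for LAYOUTGET.
        return iomode != LayoutIOMode4.LAYOUTIOMODE_ANY
    if op in ("LAYOUTRETURN", "LAYOUTRECALL"):
        return True
    raise ValueError("operation not covered by this sketch: " + op)


def iomode_matches(held, requested):
    """True if a held layout is covered by a return or recall iomode.

    ANY covers layouts of both the READ and RW iomodes.
    """
    return requested == LayoutIOMode4.LAYOUTIOMODE_ANY or held == requested
```

   A server processing a recall with LAYOUTIOMODE_ANY would therefore
   select every layout the client holds for the affected range,
   regardless of whether it was granted for READ or for RW.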
   The metadata server's use of the iomode may depend on the layout type
   being used. The storage devices may validate I/O accesses against the
   iomode and reject invalid accesses.

1.2.25.  nfs_impl_id4

      struct nfs_impl_id4 {
              utf8str_cis     nii_domain;
              utf8str_cs      nii_name;
              nfstime4        nii_date;
      };

   This structure is used to identify client and server implementation
   detail. The nii_domain field is the DNS domain name that the
   implementer is associated with. The nii_name field is the product
   name of the implementation and is completely free form. It is
   encouraged that the nii_name be used to distinguish machine
   architecture, machine platforms, revisions, versions, and patch
   levels. The nii_date field is the timestamp of when the software
   instance was published or built.

1.2.26.  impl_ident4

      struct impl_ident4 {
              clientid4               ii_clientid;
              struct nfs_impl_id4     ii_impl_id;
      };

   This is used for exchanging implementation identification between
   client and server.

2.  Filehandles

   The filehandle in the NFS protocol is a per server unique identifier
   for a filesystem object. The contents of the filehandle are opaque to
   the client. Therefore, the server is responsible for translating the
   filehandle to an internal representation of the filesystem object.

2.1.  Obtaining the First Filehandle

   The operations of the NFS protocol are defined in terms of one or
   more filehandles. Therefore, the client needs a filehandle to
   initiate communication with the server. With the NFS version 2
   protocol [RFC1094] and the NFS version 3 protocol [RFC1813], there
   exists an ancillary protocol to obtain this first filehandle. The
   MOUNT protocol, RPC program number 100005, provides the mechanism of
   translating a string based filesystem path name to a filehandle which
   can then be used by the NFS protocols.

   The MOUNT protocol has deficiencies in the area of security and use
   via firewalls.
   This is one reason that the use of the public filehandle was
   introduced in [RFC2054] and [RFC2055]. With the use of the public
   filehandle in combination with the LOOKUP operation in the NFS
   version 2 and 3 protocols, it has been demonstrated that the MOUNT
   protocol is unnecessary for viable interaction between NFS client and
   server.

   Therefore, the NFS version 4 protocol will not use an ancillary
   protocol for translation from string based path names to a
   filehandle. Two special filehandles will be used as starting points
   for the NFS client.

2.1.1.  Root Filehandle

   The first of the special filehandles is the ROOT filehandle. The ROOT
   filehandle is the "conceptual" root of the filesystem name space at
   the NFS server. The client uses or starts with the ROOT filehandle by
   employing the PUTROOTFH operation. The PUTROOTFH operation instructs
   the server to set the "current" filehandle to the ROOT of the
   server's file tree. Once this PUTROOTFH operation is used, the client
   can then traverse the entirety of the server's file tree with the
   LOOKUP operation. A complete discussion of the server name space is
   in the section "NFS Server Name Space".

2.1.2.  Public Filehandle

   The second special filehandle is the PUBLIC filehandle. Unlike the
   ROOT filehandle, the PUBLIC filehandle may be bound to, or represent,
   an arbitrary filesystem object at the server. The server is
   responsible for this binding. It may be that the PUBLIC filehandle
   and the ROOT filehandle refer to the same filesystem object. However,
   it is up to the administrative software at the server and the
   policies of the server administrator to define the binding of the
   PUBLIC filehandle and server filesystem object. The client may not
   make any assumptions about this binding. The client uses the PUBLIC
   filehandle via the PUTPUBFH operation.

2.2.  Filehandle Types

   In the NFS version 2 and 3 protocols, there was one type of
   filehandle with a single set of semantics.
   This type of filehandle is termed "persistent" in NFS Version 4. The
   semantics of a persistent filehandle remain the same as before. A new
   type of filehandle introduced in NFS Version 4 is the "volatile"
   filehandle, which attempts to accommodate certain server
   environments.

   The volatile filehandle type was introduced to address server
   functionality or implementation issues which make correct
   implementation of a persistent filehandle infeasible. Some server
   environments do not provide a filesystem level invariant that can be
   used to construct a persistent filehandle. The underlying server
   filesystem may not provide the invariant or the server's filesystem
   programming interfaces may not provide access to the needed
   invariant. Volatile filehandles may ease the implementation of server
   functionality such as hierarchical storage management or filesystem
   reorganization or migration. However, the volatile filehandle
   increases the implementation burden for the client.

   Since the client will need to handle persistent and volatile
   filehandles differently, a file attribute is defined which may be
   used by the client to determine the filehandle types being returned
   by the server.

2.2.1.  General Properties of a Filehandle

   The filehandle contains all the information the server needs to
   distinguish an individual file. To the client, the filehandle is
   opaque. The client stores filehandles for use in a later request and
   can compare two filehandles from the same server for equality by
   doing a byte-by-byte comparison. However, the client MUST NOT
   otherwise interpret the contents of filehandles. If two filehandles
   from the same server are equal, they MUST refer to the same file.
   Servers SHOULD try to maintain a one-to-one correspondence between
   filehandles and files but this is not required. Clients MUST use
   filehandle comparisons only to improve performance, not for correct
   behavior.
   All clients need to be prepared for situations in which it cannot be
   determined whether two filehandles denote the same object and in such
   cases, avoid making invalid assumptions which might cause incorrect
   behavior. Further discussion of filehandle and attribute comparison
   in the context of data caching is presented in the section "Data
   Caching and File Identity".

   As an example, in the case that two different path names when
   traversed at the server terminate at the same filesystem object, the
   server SHOULD return the same filehandle for each path. This can
   occur if a hard link is used to create two file names which refer to
   the same underlying file object and associated data. For example, if
   paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
   return the same filehandle for both path name traversals.

2.2.2.  Persistent Filehandle

   A persistent filehandle is defined as having a fixed value for the
   lifetime of the filesystem object to which it refers. Once the server
   creates the filehandle for a filesystem object, the server MUST
   accept the same filehandle for the object for the lifetime of the
   object. If the server restarts or reboots, the NFS server must honor
   the same filehandle value as it did in the server's previous
   instantiation. Similarly, if the filesystem is migrated, the new NFS
   server must honor the same filehandle as the old NFS server.

   The persistent filehandle will become stale or invalid when the
   filesystem object is removed. When the server is presented with a
   persistent filehandle that refers to a deleted object, it MUST return
   an error of NFS4ERR_STALE. A filehandle may become stale when the
   filesystem containing the object is no longer available.
   The file system may become unavailable if it exists on removable
   media and the media is no longer available at the server or the
   filesystem in whole has been destroyed or the filesystem has simply
   been removed from the server's name space (i.e. unmounted in a UNIX
   environment).

2.2.3.  Volatile Filehandle

   A volatile filehandle does not share the same longevity
   characteristics of a persistent filehandle. The server may determine
   that a volatile filehandle is no longer valid at many different
   points in time. If the server can definitively determine that a
   volatile filehandle refers to an object that has been removed, the
   server should return NFS4ERR_STALE to the client (as is the case for
   persistent filehandles). In all other cases where the server
   determines that a volatile filehandle can no longer be used, it
   should return an error of NFS4ERR_FHEXPIRED.

   The mandatory attribute "fh_expire_type" is used by the client to
   determine what type of filehandle the server is providing for a
   particular filesystem. This attribute is a bitmask with the following
   values:

   FH4_PERSISTENT
      The value of FH4_PERSISTENT is used to indicate a persistent
      filehandle, which is valid until the object is removed from the
      filesystem. The server will not return NFS4ERR_FHEXPIRED for this
      filehandle. FH4_PERSISTENT is defined as a value in which none of
      the bits specified below are set.

   FH4_VOLATILE_ANY
      The filehandle may expire at any time, except as specifically
      excluded (i.e. FH4_NOEXPIRE_WITH_OPEN).

   FH4_NOEXPIRE_WITH_OPEN
      May only be set when FH4_VOLATILE_ANY is set. If this bit is set,
      then the meaning of FH4_VOLATILE_ANY is qualified to exclude any
      expiration of the filehandle when it is open.

   FH4_VOL_MIGRATION
      The filehandle will expire as a result of migration. If
      FH4_VOLATILE_ANY is set, FH4_VOL_MIGRATION is redundant.

   FH4_VOL_RENAME
      The filehandle will expire during rename.
      This includes a rename by the requesting client or a rename by any
      other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is
      redundant.

   Servers which provide volatile filehandles that may expire while open
   (i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if
   FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set),
   should deny a RENAME or REMOVE that would affect an OPEN file or any
   of the components leading to the OPEN file. In addition, the server
   should deny all RENAME or REMOVE requests during the grace period
   upon server restart.

   Note that the bits FH4_VOL_MIGRATION and FH4_VOL_RENAME allow the
   client to determine that expiration has occurred whenever a specific
   event occurs, without an explicit filehandle expiration error from
   the server. FH4_VOLATILE_ANY does not provide this form of
   information. In situations where the server will expire many, but not
   all filehandles upon migration (e.g. all but those that are open),
   FH4_VOLATILE_ANY (in this case with FH4_NOEXPIRE_WITH_OPEN) is a
   better choice since the client may not assume that all filehandles
   will expire when migration occurs, and it is likely that additional
   expirations will occur (as a result of file CLOSE) that are separated
   in time from the migration event itself.

2.3.  One Method of Constructing a Volatile Filehandle

   A volatile filehandle, while opaque to the client, could contain:

   [volatile bit = 1 | server boot time | slot | generation number]

   o  slot is an index in the server volatile filehandle table

   o  generation number is the generation number for the table
      entry/slot

   When the client presents a volatile filehandle, the server makes the
   following checks, which assume that the check for the volatile bit
   has passed. If the server boot time is less than the current server
   boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
   NFS4ERR_BADHANDLE.
   If the generation number does not match, return NFS4ERR_FHEXPIRED.

   When the server reboots, the table is gone (it is volatile).

   If the volatile bit is 0, then it is a persistent filehandle with a
   different structure following it.

2.4.  Client Recovery from Filehandle Expiration

   If possible, the client SHOULD recover from the receipt of an
   NFS4ERR_FHEXPIRED error. The client must take on additional
   responsibility so that it may prepare itself to recover from the
   expiration of a volatile filehandle. If the server returns persistent
   filehandles, the client does not need these additional steps.

   For volatile filehandles, most commonly the client will need to store
   the component names leading up to and including the filesystem object
   in question. With these names, the client should be able to recover
   by finding a filehandle in the name space that is still available or
   by starting at the root of the server's filesystem name space.

   If the expired filehandle refers to an object that has been removed
   from the filesystem, obviously the client will not be able to recover
   from the expired filehandle.

   It is also possible that the expired filehandle refers to a file that
   has been renamed. If the file was renamed by another client, again it
   is possible that the original client will not be able to recover.
   However, in the case that the client itself is renaming the file and
   the file is open, it is possible that the client may be able to
   recover. The client can determine the new path name based on the
   processing of the rename request. The client can then regenerate the
   new filehandle based on the new path name. The client could also use
   the compound operation mechanism to construct a set of operations
   like:

      RENAME A B
      LOOKUP B
      GETFH

   Note that the COMPOUND procedure does not provide atomicity. This
   example only reduces the overhead of recovering from an expired
   filehandle.
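   To make the checks of Section 2.3 and the recovery approach of
   Section 2.4 concrete, the following Python sketch models a toy server
   and a client-side recovery helper. It is an illustration under
   stated assumptions, not a description of any real implementation:
   the class ToyServer, the function recover, the field widths in
   FH_FORMAT, and the use of plain strings for error codes are all
   hypothetical. The name space is reduced to a dictionary keyed by
   path, standing in for a LOOKUP walk over stored component names.

```python
import struct

NFS4_OK = "NFS4_OK"
NFS4ERR_FHEXPIRED = "NFS4ERR_FHEXPIRED"
NFS4ERR_BADHANDLE = "NFS4ERR_BADHANDLE"

# [volatile bit | server boot time | slot | generation number]
# The field widths are an assumption of this sketch.
FH_FORMAT = ">BIII"


class ToyServer:
    def __init__(self, boot_time, names):
        self.boot_time = boot_time
        self.names = names      # path -> object (toy name space)
        self.table = []         # slot -> [generation, path]

    def lookup(self, path):
        """Allocate a table slot and return a new volatile filehandle."""
        if path not in self.names:
            raise KeyError(path)
        self.table.append([1, path])
        slot = len(self.table) - 1
        return struct.pack(FH_FORMAT, 1, self.boot_time, slot, 1)

    def expire(self, slot):
        """Bump the slot's generation so handles issued for it expire."""
        self.table[slot][0] += 1

    def check_fh(self, fh):
        """Apply the checks in the order given in Section 2.3."""
        volatile, boot, slot, gen = struct.unpack(FH_FORMAT, fh)
        if volatile != 1:
            raise ValueError("persistent filehandles are not modeled")
        if boot < self.boot_time:
            return NFS4ERR_FHEXPIRED
        if slot >= len(self.table):
            return NFS4ERR_BADHANDLE
        if gen != self.table[slot][0]:
            return NFS4ERR_FHEXPIRED
        return NFS4_OK


def recover(server, path, fh):
    """Return a usable filehandle, re-resolving path if fh has expired."""
    if server.check_fh(fh) == NFS4ERR_FHEXPIRED:
        fh = server.lookup(path)
    return fh
```

   In this model a reboot is represented by constructing a new ToyServer
   with a later boot time and an empty table, so every filehandle issued
   by the previous instance fails the boot time check and the client
   falls back to the path it stored.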
3.  File Attributes

   To meet the requirements of extensibility and increased
   interoperability with non-UNIX platforms, attributes must be handled
   in a flexible manner. The NFS version 3 fattr3 structure contains a
   fixed list of attributes that not all clients and servers are able to
   support or care about. The fattr3 structure cannot be extended as new
   needs arise and it provides no way to indicate non-support. With the
   NFS version 4 protocol, the client is able to query what attributes
   the server supports and construct requests with only those supported
   attributes (or a subset thereof).

   To this end, attributes are divided into three groups: mandatory,
   recommended, and named. Both mandatory and recommended attributes are
   supported in the NFS version 4 protocol by a specific and
   well-defined encoding and are identified by number. They are
   requested by setting a bit in the bit vector sent in the GETATTR
   request; the server response includes a bit vector to list what
   attributes were returned in the response. New mandatory or
   recommended attributes may be added to the NFS protocol between major
   revisions by publishing a standards-track RFC which allocates a new
   attribute number value and defines the encoding for the attribute.
   See the section "Minor Versioning" for further discussion.

   Named attributes are accessed by the new OPENATTR operation, which
   accesses a hidden directory of attributes associated with a file
   system object. OPENATTR takes a filehandle for the object and returns
   the filehandle for the attribute hierarchy. The filehandle for the
   named attributes is a directory object accessible by LOOKUP or
   READDIR and contains files whose names represent the named attributes
   and whose data bytes are the value of the attribute.
   For example:

   +----------+-----------+---------------------------------+
   | LOOKUP   | "foo"     | ; look up file                  |
   | GETATTR  | attrbits  |                                 |
   | OPENATTR |           | ; access foo's named attributes |
   | LOOKUP   | "x11icon" | ; look up specific attribute    |
   | READ     | 0,4096    | ; read stream of bytes          |
   +----------+-----------+---------------------------------+

   Named attributes are intended for data needed by applications rather
   than by an NFS client implementation. NFS implementors are strongly
   encouraged to define their new attributes as recommended attributes
   by bringing them to the IETF standards-track process.

   The set of attributes which are classified as mandatory is
   deliberately small since servers must do whatever it takes to support
   them. A server should support as many of the recommended attributes
   as possible but by their definition, the server is not required to
   support all of them. Attributes are deemed mandatory if the data is
   both needed by a large number of clients and is not otherwise
   reasonably computable by the client when support is not provided on
   the server.

   Note that the hidden directory returned by OPENATTR is a convenience
   for protocol processing. The client should not make any assumptions
   about the server's implementation of named attributes and whether the
   underlying filesystem at the server has a named attribute directory
   or not. Therefore, operations such as SETATTR and GETATTR on the
   named attribute directory are undefined.

3.1.  Mandatory Attributes

   These MUST be supported by every NFS version 4 client and server in
   order to ensure a minimum level of interoperability. The server must
   store and return these attributes and the client must be able to
   function with an attribute set limited to these attributes. With just
   the mandatory attributes some client functionality may be impaired or
   limited in some ways.
   A client may ask for any of these attributes to be returned by
   setting a bit in the GETATTR request and the server must return their
   value.

3.2.  Recommended Attributes

   These attributes are understood well enough to warrant support in the
   NFS version 4 protocol. However, they may not be supported on all
   clients and servers. A client may ask for any of these attributes to
   be returned by setting a bit in the GETATTR request but must handle
   the case where the server does not return them. A client may ask for
   the set of attributes the server supports and should not request
   attributes the server does not support. A server should be tolerant
   of requests for unsupported attributes and simply not return them
   rather than considering the request an error. It is expected that
   servers will support all attributes they comfortably can and only
   fail to support attributes which are difficult to support in their
   operating environments. A server should provide attributes whenever
   it does not have to "tell lies" to the client. For example, a file
   modification time should be either an accurate time or should not be
   supported by the server. This will not always be comfortable to
   clients but the client is better positioned to decide whether and how
   to fabricate or construct an attribute or whether to do without the
   attribute.

3.3.  Named Attributes

   These attributes are not supported by direct encoding in the NFS
   Version 4 protocol but are accessed by string names rather than
   numbers and correspond to an uninterpreted stream of bytes which are
   stored with the filesystem object. The name space for these
   attributes may be accessed by using the OPENATTR operation. The
   OPENATTR operation returns a filehandle for a virtual "attribute
   directory" and further perusal of the name space may be done using
   READDIR and LOOKUP operations on this filehandle.
   Named attributes may then be examined or changed by normal READ and
   WRITE and CREATE operations on the filehandles returned from READDIR
   and LOOKUP. Named attributes may have attributes.

   It is recommended that servers support arbitrary named attributes. A
   client should not depend on the ability to store any named attributes
   in the server's filesystem. If a server does support named
   attributes, a client which is also able to handle them should be able
   to copy a file's data and meta-data with complete transparency from
   one location to another; this would imply that names allowed for
   regular directory entries are valid for named attribute names as
   well.

   Names of attributes will not be controlled by this document or other
   IETF standards track documents. See the section "IANA Considerations"
   for further discussion.

3.4.  Classification of Attributes

   Each of the Mandatory and Recommended attributes can be classified in
   one of three categories: per server, per filesystem, or per
   filesystem object. Note that it is possible that some per filesystem
   attributes may vary within the filesystem. See the "homogeneous"
   attribute for its definition. Note that the attributes
   time_access_set and time_modify_set are not listed in this section
   because they are write-only attributes corresponding to time_access
   and time_modify, and are used in a special instance of SETATTR.
   o  The per server attribute is:

         lease_time

   o  The per filesystem attributes are:

         supp_attr, fh_expire_type, link_support, symlink_support,
         unique_handles, aclsupport, cansettime, case_insensitive,
         case_preserving, chown_restricted, files_avail, files_free,
         files_total, fs_locations, homogeneous, maxfilesize, maxname,
         maxread, maxwrite, no_trunc, space_avail, space_free,
         space_total, time_delta, fs_layouttype, send_impl_id,
         recv_impl_id

   o  The per filesystem object attributes are:

         type, change, size, named_attr, fsid, rdattr_error, filehandle,
         ACL, archive, fileid, hidden, maxlink, mimetype, mode,
         numlinks, owner, owner_group, rawdev, space_used, system,
         time_access, time_backup, time_create, time_metadata,
         time_modify, mounted_on_fileid, layouttype, layouthint,
         layout_blksize, layout_alignment

   For quota_avail_hard, quota_avail_soft, and quota_used see their
   definitions below for the appropriate classification.

3.5.  Mandatory Attributes - Definitions

   +-----------------+----+------------+--------+----------------------+
   | name            | #  | Data Type  | Access | Description
   +-----------------+----+------------+--------+----------------------+
   | supp_attr       | 0  | bitmap     | READ   | The bit vector which would retrieve all mandatory and recommended attributes that are supported for this object. The scope of this attribute applies to all objects with a matching fsid.
   | type            | 1  | nfs4_ftype | READ   | The type of the object (file, directory, symlink, etc.)
   | fh_expire_type  | 2  | uint32     | READ   | Server uses this to specify filehandle expiration behavior to the client. See the section "Filehandles" for additional description.
   | change          | 3  | uint64     | READ   | A value created by the server that the client can use to determine if file data, directory contents or attributes of the object have been modified. The server may return the object's time_metadata attribute for this attribute's value but only if the filesystem object can not be updated more frequently than the resolution of time_metadata.
   | size            | 4  | uint64     | R/W    | The size of the object in bytes.
   | link_support    | 5  | bool       | READ   | True, if the object's filesystem supports hard links.
   | symlink_support | 6  | bool       | READ   | True, if the object's filesystem supports symbolic links.
   | named_attr      | 7  | bool       | READ   | True, if this object has named attributes. In other words, object has a non-empty named attribute directory.
   | fsid            | 8  | fsid4      | READ   | Unique filesystem identifier for the filesystem holding this object. fsid contains major and minor components each of which are uint64.
   | unique_handles  | 9  | bool       | READ   | True, if two distinct filehandles are guaranteed to refer to two different filesystem objects.
   | lease_time      | 10 | nfs_lease4 | READ   | Duration of leases at server in seconds.
   | rdattr_error    | 11 | enum       | READ   | Error returned from getattr during readdir.
   | filehandle      | 19 | nfs_fh4    | READ   | The filehandle of this object (primarily for readdir requests).
   +-----------------+----+------------+--------+----------------------+

3.6.  Recommended Attributes - Definitions

   +--------------------+-----+--------------+--------+----------------+
   | name               | #   | Data Type    | Access | Description
   +--------------------+-----+--------------+--------+----------------+
   | ACL                | 12  | nfsace4<>    | R/W    | The access control list for the object.
   | aclsupport         | 13  | uint32       | READ   | Indicates what types of ACLs are supported on the current filesystem.
   | archive            | 14  | bool         | R/W    | True, if this file has been archived since the time of last modification (deprecated in favor of time_backup).
   | cansettime         | 15  | bool         | READ   | True, if the server is able to change the times for a filesystem object as specified in a SETATTR operation.
   | case_insensitive   | 16  | bool         | READ   | True, if filename comparisons on this filesystem are case insensitive.
   | case_preserving    | 17  | bool         | READ   | True, if filename case on this filesystem is preserved.
   | chown_restricted   | 18  | bool         | READ   | If TRUE, the server will reject any request to change either the owner or the group associated with a file if the caller is not a privileged user (for example, "root" in UNIX operating environments or in Windows 2000 the "Take Ownership" privilege).
   | fileid             | 20  | uint64       | READ   | A number uniquely identifying the file within the filesystem.
   | files_avail        | 21  | uint64       | READ   | File slots available to this user on the filesystem containing this object - this should be the smallest relevant limit.
   | files_free         | 22  | uint64       | READ   | Free file slots on the filesystem containing this object - this should be the smallest relevant limit.
   | files_total        | 23  | uint64       | READ   | Total file slots on the filesystem containing this object.
   | fs_locations       | 24  | fs_locations | READ   | Locations where this filesystem may be found. If the server returns NFS4ERR_MOVED as an error, this attribute MUST be supported.
   | hidden             | 25  | bool         | R/W    | True, if the file is considered hidden with respect to the Windows API.
| homogeneous        | 26  | bool         | READ   | True, if this object's filesystem is     |
|                    |     |              |        | homogeneous, i.e., per-filesystem        |
|                    |     |              |        | attributes are the same for all of the   |
|                    |     |              |        | filesystem's objects.                    |
| maxfilesize        | 27  | uint64       | READ   | Maximum supported file size for the      |
|                    |     |              |        | filesystem of this object.               |
| maxlink            | 28  | uint32       | READ   | Maximum number of links for this         |
|                    |     |              |        | object.                                  |
| maxname            | 29  | uint32       | READ   | Maximum filename size supported for      |
|                    |     |              |        | this object.                             |
| maxread            | 30  | uint64       | READ   | Maximum read size supported for this     |
|                    |     |              |        | object.                                  |
| maxwrite           | 31  | uint64       | READ   | Maximum write size supported for this    |
|                    |     |              |        | object. This attribute SHOULD be         |
|                    |     |              |        | supported if the file is writable. Lack  |
|                    |     |              |        | of this attribute can lead to the        |
|                    |     |              |        | client either wasting bandwidth or not   |
|                    |     |              |        | receiving the best performance.          |
| mimetype           | 32  | utf8<>       | R/W    | MIME body type/subtype of this object.   |
| mode               | 33  | mode4        | R/W    | UNIX-style mode and permission bits for  |
|                    |     |              |        | this object.                             |
| no_trunc           | 34  | bool         | READ   | True, if a name longer than name_max is  |
|                    |     |              |        | used, an error will be returned and the  |
|                    |     |              |        | name is not truncated.                   |
| numlinks           | 35  | uint32       | READ   | Number of hard links to this object.     |
| owner              | 36  | utf8<>       | R/W    | The string name of the owner of this     |
|                    |     |              |        | object.                                  |
| owner_group        | 37  | utf8<>       | R/W    | The string name of the group ownership   |
|                    |     |              |        | of this object.                          |
| quota_avail_hard   | 38  | uint64       | READ   | For definition see "Quota Attributes"    |
|                    |     |              |        | section below.                           |
| quota_avail_soft   | 39  | uint64       | READ   | For definition see "Quota Attributes"    |
|                    |     |              |        | section below.                           |
| quota_used         | 40  | uint64       | READ   | For definition see "Quota Attributes"    |
|                    |     |              |        | section below.                           |
| rawdev             | 41  | specdata4    | READ   | Raw device identifier. UNIX device       |
|                    |     |              |        | major/minor node information. If the     |
|                    |     |              |        | value of type is not NF4BLK or NF4CHR,   |
|                    |     |              |        | the value returned SHOULD NOT be         |
|                    |     |              |        | considered useful.                       |
| space_avail        | 42  | uint64       | READ   | Disk space in bytes available to this    |
|                    |     |              |        | user on the filesystem containing this   |
|                    |     |              |        | object - this should be the smallest     |
|                    |     |              |        | relevant limit.                          |
| space_free         | 43  | uint64       | READ   | Free disk space in bytes on the          |
|                    |     |              |        | filesystem containing this object -      |
|                    |     |              |        | this should be the smallest relevant     |
|                    |     |              |        | limit.                                   |
| space_total        | 44  | uint64       | READ   | Total disk space in bytes on the         |
|                    |     |              |        | filesystem containing this object.       |
| space_used         | 45  | uint64       | READ   | Number of filesystem bytes allocated to  |
|                    |     |              |        | this object.                             |
| system             | 46  | bool         | R/W    | True, if this file is a "system" file    |
|                    |     |              |        | with respect to the Windows API?         |
| time_access        | 47  | nfstime4     | READ   | The time of last access to the object    |
|                    |     |              |        | by a read that was satisfied by the      |
|                    |     |              |        | server.                                  |
| time_access_set    | 48  | settime4     | WRITE  | Set the time of last access to the       |
|                    |     |              |        | object. SETATTR use only.                |
| time_backup        | 49  | nfstime4     | R/W    | The time of last backup of the object.   |
| time_create        | 50  | nfstime4     | R/W    | The time of creation of the object.      |
|                    |     |              |        | This attribute does not have any         |
|                    |     |              |        | relation to the traditional UNIX file    |
|                    |     |              |        | attribute "ctime" or "change time".      |
| time_delta         | 51  | nfstime4     | READ   | Smallest useful server time              |
|                    |     |              |        | granularity.                             |
| time_metadata      | 52  | nfstime4     | READ   | The time of last meta-data modification  |
|                    |     |              |        | of the object.                           |
| time_modify        | 53  | nfstime4     | READ   | The time of last modification to the     |
|                    |     |              |        | object.                                  |
| time_modify_set    | 54  | settime4     | WRITE  | Set the time of last modification to     |
|                    |     |              |        | the object. SETATTR use only.            |
| mounted_on_fileid  | 55  | uint64       | READ   | Like fileid, but if the target           |
|                    |     |              |        | filehandle is the root of a filesystem   |
|                    |     |              |        | return the fileid of the underlying      |
|                    |     |              |        | directory.                               |
| send_impl_id       | TBD | impl_ident4  | WRITE  | Client provides server with              |
|                    |     |              |        | implementation identity via SETATTR.     |
| recv_impl_id       | TBD | nfs_impl_id4 | READ   | Client obtains server implementation     |
|                    |     |              |        | via GETATTR.                             |
| dir_notif_delay    | TBD | R/W          | READ   | notification delays on directory         |
|                    |     |              |        | attributes                               |
| dirent_notif_delay | TBD | R/W          | READ   | notification delays on child attributes  |
| fs_layouttype      | TBD | layouttype4  | READ   | Layout types available for the           |
|                    |     |              |        | filesystem.                              |
| layouttype         | TBD | layouttype4  | READ   | Layout types available for the file.     |
| layouthint         | TBD | layouthint4  | WRITE  | Client specified hint for file layout.   |
| layout_blksize     | TBD | uint32_t     | READ   | Preferred block size for layout related  |
|                    |     |              |        | I/O.                                     |
| layout_alignment   | TBD | uint32_t     | READ   | Preferred alignment for layout related   |
|                    |     |              |        | I/O.                                     |
|                    | TBD |              | READ   | desc                                     |
|                    | TBD |              | READ   | desc                                     |
+--------------------+-----+--------------+--------+------------------------------------------+

3.7.  Time Access

   As defined above, the time_access attribute represents the time of
   last access to the object by a read that was satisfied by the server.
   The notion of what is an "access" depends on the server's operating
   environment and/or the server's filesystem semantics.  For example,
   for servers obeying POSIX semantics, time_access would be updated
   only by the READLINK, READ, and READDIR operations and not any of the
   operations that modify the content of the object.  Of course, setting
   the corresponding time_access_set attribute is another way to modify
   the time_access attribute.

   Whenever the file object resides on a writable filesystem, the server
   should make best efforts to record time_access into stable storage.
   However, to mitigate the performance effects of doing so, and most
   especially whenever the server is satisfying the read of the object's
   content from its cache, the server MAY cache access time updates and
   lazily write them to stable storage.  It is also acceptable to give
   administrators of the server the option to disable time_access
   updates.

3.8.  Interpreting owner and owner_group

   The recommended attributes "owner" and "owner_group" (and also users
   and groups within the "acl" attribute) are represented in terms of a
   UTF-8 string.  To avoid a representation that is tied to a particular
   underlying implementation at the client or server, the use of the
   UTF-8 string has been chosen.  Note that section 6.1 of [RFC2624]
   provides additional rationale.  It is expected that the client and
   server will have their own local representation of owner and
   owner_group that is used for local storage or presentation to the end
   user.  Therefore, it is expected that when these attributes are
   transferred between the client and server, the local representation
   is translated to a syntax of the form "user@dns_domain".  This allows
   a client and server that do not use the same local representation to
   translate to a common syntax that can be interpreted by both.

   Similarly, security principals may be represented in different ways
   by different security mechanisms.  Servers normally translate these
   representations into a common format, generally that used by local
   storage, to serve as a means of identifying the users corresponding
   to these security principals.
   When these local identifiers are translated to the form of the owner
   attribute associated with files created by such principals, they
   identify, in a common format, the users associated with each
   corresponding set of security principals.

   The translation used to interpret owner and group strings is not
   specified as part of the protocol.  This allows various solutions to
   be employed.  For example, a local translation table may be consulted
   that maps a numeric id to the user@dns_domain syntax.  A name service
   may also be used to accomplish the translation.  A server may provide
   a more general service, not limited by any particular translation
   (which would only translate a limited set of possible strings), by
   storing the owner and owner_group attributes in local storage without
   any translation, or it may augment a translation method by storing
   the entire string for attributes for which no translation is
   available while using the local representation for those cases in
   which a translation is available.

   Servers that do not provide support for all possible values of the
   owner and owner_group attributes should return an error
   (NFS4ERR_BADOWNER) when a string is presented that has no
   translation, as the value to be set for a SETATTR of the owner,
   owner_group, or acl attributes.  When a server does accept an owner
   or owner_group value as valid on a SETATTR (and similarly for the
   owner and group strings in an acl), it is promising to return that
   same string when a corresponding GETATTR is done.  Configuration
   changes and ill-constructed name translations (those that contain
   aliasing) may make that promise impossible to honor.  Servers should
   make appropriate efforts to avoid a situation in which these
   attributes have their values changed when no real change to ownership
   has occurred.

   The "dns_domain" portion of the owner string is meant to be a DNS
   domain name.  For example, user@ietf.org.  Servers should accept as
   valid a set of users for at least one domain.
   A server may treat other domains as having no valid translations.  A
   more general service is provided when a server is capable of
   accepting users for multiple domains, or for all domains, subject to
   security constraints.

   In the case where there is no translation available to the client or
   server, the attribute value must be constructed without the "@".
   Therefore, the absence of the "@" from the owner or owner_group
   attribute signifies that no translation was available at the sender
   and that the receiver of the attribute should not use that string as
   a basis for translation into its own internal format.  Even though
   the attribute value cannot be translated, it may still be useful.  In
   the case of a client, the attribute string may be used for local
   display of ownership.

   To provide a greater degree of compatibility with previous versions
   of NFS (i.e., v2 and v3), which identified users and groups by 32-bit
   unsigned uids and gids, owner and group strings that consist of
   decimal numeric values with no leading zeros can be given a special
   interpretation by clients and servers which choose to provide such
   support.  The receiver may treat such a user or group string as
   representing the same user as would be represented by a v2/v3 uid or
   gid having the corresponding numeric value.  A server is not
   obligated to accept such a string, but may return an NFS4ERR_BADOWNER
   instead.  To avoid this mechanism being used to subvert user and
   group translation, so that a client might pass all of the owners and
   groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
   error when there is a valid translation for the user or group
   designated in this way.  In that case, the client must use the
   appropriate name@domain string and not the special form for
   compatibility.

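   The interpretation rules above can be sketched in Python.  This is a
   minimal, non-normative illustration: the function names, the category
   labels, and the name_for_id table are hypothetical and not part of
   the protocol.  Treating the single digit "0" as a valid numeric
   string is also an assumption of this sketch, and a real server may
   choose to reject every numeric string.

```python
import re

# Hypothetical labels for this sketch; the protocol defines no such names.
NAME_AT_DOMAIN = "name@domain"
NUMERIC_ID = "numeric-id"
UNTRANSLATED = "untranslated"

# Decimal digits with no leading zeros ("0" itself is accepted here,
# which is an assumption of this sketch).
_DECIMAL = re.compile(r"0|[1-9][0-9]*")


def classify_owner(owner: str) -> str:
    """Classify an owner or owner_group string received from the peer."""
    if "@" in owner:
        return NAME_AT_DOMAIN
    # v2/v3 uids and gids are 32-bit unsigned values.
    if _DECIMAL.fullmatch(owner) and int(owner) <= 0xFFFFFFFF:
        return NUMERIC_ID
    # No "@": the sender had no translation, so the receiver should not
    # use the string as a basis for translation.
    return UNTRANSLATED


def check_numeric_owner(owner: str, name_for_id: dict) -> str:
    """Server-side check that only examines the numeric compatibility form.

    name_for_id is a hypothetical table mapping numeric ids to
    "user@dns_domain" strings.
    """
    if classify_owner(owner) == NUMERIC_ID and int(owner) in name_for_id:
        # A valid translation exists, so the client is expected to send
        # the name@domain form instead.
        return "NFS4ERR_BADOWNER"
    return "NFS4_OK"
```

   With a table such as {1001: "alice@example.org"}, the string "1001"
   would be refused while "2002" would be accepted as a numeric id.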
   The owner string "nobody" may be used to designate an anonymous user,
   which will be associated with a file created by a security principal
   that cannot be mapped through normal means to the owner attribute.

3.9.  Character Case Attributes

   With respect to the case_insensitive and case_preserving attributes,
   each UCS-4 character (which UTF-8 encodes) has a "long descriptive
   name" [RFC1345] which may or may not include the word "CAPITAL" or
   "SMALL".  The presence of SMALL or CAPITAL allows an NFS server to
   implement unambiguous and efficient table driven mappings for case
   insensitive comparisons, and non-case-preserving storage.  For
   general character handling and internationalization issues, see the
   section "Internationalization".

3.10.  Quota Attributes

   For the attributes related to filesystem quotas, the following
   definitions apply:

   quota_avail_soft
      The value in bytes which represents the amount of additional disk
      space that can be allocated to this file or directory before the
      user may reasonably be warned.  It is understood that this space
      may be consumed by allocations to other files or directories
      though there is a rule as to which other files or directories.

   quota_avail_hard
      The value in bytes which represents the amount of additional disk
      space beyond the current allocation that can be allocated to this
      file or directory before further allocations will be refused.  It
      is understood that this space may be consumed by allocations to
      other files or directories.

   quota_used
      The value in bytes which represents the amount of disk space used
      by this file or directory and possibly a number of other similar
      files or directories, where the set of "similar" meets at least
      the criterion that allocating space to any file or directory in
      the set will reduce the "quota_avail_hard" of every other file or
      directory in the set.

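   As a rough illustration of how a client might read the two
   "available" values, the following Python sketch classifies a planned
   allocation.  The function name and the returned labels are
   hypothetical, and the sketch ignores the fact that other files in the
   same quota set may consume the space concurrently.

```python
def classify_allocation(request_bytes: int,
                        quota_avail_soft: int,
                        quota_avail_hard: int) -> str:
    """Classify an additional allocation against the quota attributes."""
    if request_bytes > quota_avail_hard:
        # Beyond the hard limit further allocations will be refused.
        return "refuse"
    if request_bytes > quota_avail_soft:
        # Past the soft limit the user may reasonably be warned.
        return "warn"
    return "ok"
```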
   Note that there may be a number of distinct but overlapping sets of
   files or directories for which a quota_used value is maintained,
   e.g., "all files with a given owner", "all files with a given group
   owner", etc.  The server is at liberty to choose any of those sets
   but should do so in a repeatable way.  The rule may be configured
   per-filesystem or may be "choose the set with the smallest quota".

3.11.  mounted_on_fileid

   UNIX-based operating environments connect a filesystem into the
   namespace by connecting (mounting) the filesystem onto the existing
   file object (the mount point, usually a directory) of an existing
   filesystem.  When the mount point's parent directory is read via an
   API like readdir(), the returned results are directory entries, each
   with a component name and a fileid.  The fileid of the mount point's
   directory entry will be different from the fileid that the stat()
   system call returns.  The stat() system call is returning the fileid
   of the root of the mounted filesystem, whereas readdir() is returning
   the fileid stat() would have returned before any filesystems were
   mounted on the mount point.

   Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request
   to cross other filesystems.  The client detects the filesystem
   crossing whenever the filehandle argument of LOOKUP has an fsid
   attribute different from that of the filehandle returned by LOOKUP.
   A UNIX-based client will consider this a "mount point crossing".
   UNIX has a legacy scheme for allowing a process to determine its
   current working directory.  This relies on readdir() of a mount
   point's parent and stat() of the mount point returning fileids as
   previously described.  The mounted_on_fileid attribute corresponds to
   the fileid that readdir() would have returned as described
   previously.

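   The crossing check and the two fileid views described above can be
   sketched as follows.  This is an illustrative Python model only: the
   Fsid and Attrs types, the function names, and the sample numbers are
   hypothetical, standing in for attribute values a client would have
   obtained through GETATTR or READDIR.

```python
from typing import NamedTuple


class Fsid(NamedTuple):
    """Filesystem identifier, modeled here as a (major, minor) pair."""
    major: int
    minor: int


class Attrs(NamedTuple):
    """The subset of attributes this sketch needs for one object."""
    fsid: Fsid
    fileid: int
    mounted_on_fileid: int


def crossed_filesystem(dir_attrs: Attrs, child_attrs: Attrs) -> bool:
    """True if a LOOKUP from the directory to the child changed fsid."""
    return dir_attrs.fsid != child_attrs.fsid


def readdir_view_fileid(child_attrs: Attrs) -> int:
    """The fileid a readdir() of the parent would report for the child."""
    return child_attrs.mounted_on_fileid


def stat_view_fileid(child_attrs: Attrs) -> int:
    """The fileid a stat() of the child itself would report."""
    return child_attrs.fileid


# Sample values: /export is on one filesystem and /export/home is the
# root of another filesystem mounted on a directory whose fileid is 4711.
parent = Attrs(Fsid(1, 0), fileid=128, mounted_on_fileid=128)
mount_root = Attrs(Fsid(2, 0), fileid=2, mounted_on_fileid=4711)
```

   In this model crossed_filesystem(parent, mount_root) is true, the
   readdir() view of the mount point is 4711, and the stat() view is 2.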
   While the NFS version 4 client could simply fabricate a fileid
   corresponding to what mounted_on_fileid provides (and if the server
   does not support mounted_on_fileid, the client has no choice), there
   is a risk that the client will generate a fileid that conflicts with
   one that is already assigned to another object in the filesystem.
   Instead, if the server can provide the mounted_on_fileid, the
   potential for client operational problems in this area is eliminated.

   If the server detects that there is no mount point at the target file
   object, then the value for mounted_on_fileid that it returns is the
   same as that of the fileid attribute.

   The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
   provide it if possible, and for a UNIX-based server, this is
   straightforward.  Usually, mounted_on_fileid will be requested during
   a READDIR operation, in which case it is trivial (at least for
   UNIX-based servers) to return mounted_on_fileid since it is equal to
   the fileid of a directory entry returned by readdir().  If
   mounted_on_fileid is requested in a GETATTR operation, the server
   should obey an invariant that has it returning a value that is equal
   to the file object's entry in the object's parent directory, i.e.,
   what readdir() would have returned.  Some operating environments
   allow a series of two or more filesystems to be mounted onto a single
   mount point.  In this case, for the server to obey the aforementioned
   invariant, it will need to find the base mount point, and not the
   intermediate mount points.

3.12.  send_impl_id and recv_impl_id

   These recommended attributes are used to identify the client and
   server.  In the case of the send_impl_id attribute, the client sends
   its clientid4 value along with the nfs_impl_id4.  The use of the
   clientid4 value allows the server to identify and match specific
   client interaction.  In the case of the recv_impl_id attribute, the
   client receives the nfs_impl_id4 value.

   Access to this identification information can be most useful at both
   client and server.  Being able to identify specific implementations
   can help in planning by administrators or implementers.  For example,
   diagnostic software may extract this information in an attempt to
   identify implementation problems, performance workload behaviors or
   general usage statistics.  Since the intent of having access to this
   information is for planning or general diagnosis only, the client and
   server MUST NOT interpret this implementation identity information in
   a way that affects the interoperational behavior of the
   implementation.  The reason is that if clients and servers did such a
   thing, they might use fewer capabilities of the protocol than the
   peer can support, or the client and server might refuse to
   interoperate.

   Because it is likely some implementations will violate the protocol
   specification and interpret the identity information, implementations
   MUST allow the users of the NFSv4 client and server to set the
   contents of the sent nfs_impl_id structure to any value.

   Even though these attributes are recommended, if the server supports
   one of them it MUST support the other.

3.13.  fs_layouttype

   This attribute applies to a file system and indicates what layout
   types are supported by the file system.  We expect this attribute to
   be queried when a client encounters a new fsid.  This attribute is
   used by the client to determine if it has applicable layout drivers.

3.14.  layouttype

   This attribute indicates the particular layout type(s) used for a
   file.  This is for informational purposes only.  The client needs to
   use the LAYOUTGET operation in order to get enough information (e.g.,
   specific device information) in order to perform I/O.

3.15.  layouthint

   This attribute may be set on newly created files to influence the
   metadata server's choice for the file's layout.
   It is suggested that this attribute be set as one of the initial
   attributes within the OPEN call.  The metadata server may ignore this
   attribute.  This attribute is a sub-set of the layout structure
   returned by LAYOUTGET.  For example, instead of specifying particular
   devices, this would be used to suggest the stripe width of a file.
   It is up to the server implementation to determine which fields
   within the layout it uses.

   [[Comment.3: it has been suggested that the HINT is a well defined
   type other than pnfs_layoutdata4, similar to pnfs_layoutupdate4.]]

3.16.  Access Control Lists

   The NFS version 4 ACL attribute is an array of access control entries
   (ACE).  Although the client can read and write the ACL attribute, in
   the NFSv4 model the server does all access control based on the
   server's interpretation of the ACL.  If at any point the client wants
   to check access without issuing an operation that modifies or reads
   data or metadata, the client can use the OPEN and ACCESS operations
   to do so.  There are various access control entry types, as defined
   in Section 3.16.1.  The server is able to communicate which ACE types
   are supported by returning the appropriate value within the
   aclsupport attribute.  Each ACE covers one or more operations on a
   file or directory as described in Section 3.16.2.  It may also
   contain one or more flags that modify the semantics of the ACE as
   defined in Section 3.16.3.

   The NFS ACE attribute is defined as follows:

   typedef uint32_t        acetype4;
   typedef uint32_t        aceflag4;
   typedef uint32_t        acemask4;

   struct nfsace4 {
           acetype4        type;
           aceflag4        flag;
           acemask4        access_mask;
           utf8str_mixed   who;
   };

   To determine if a request succeeds, each nfsace4 entry is processed
   in order by the server.  Only ACEs which have a "who" that matches
   the requester are considered.  Each ACE is processed until all of the
   bits of the requester's access have been ALLOWED.
   Once a bit (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it
   is no longer considered in the processing of later ACEs.  If an
   ACCESS_DENIED_ACE is encountered where the requester's access still
   has unALLOWED bits in common with the "access_mask" of the ACE, the
   request is denied.  However, unlike the ALLOWED and DENIED ACE types,
   the ALARM and AUDIT ACE types do not affect a requester's access, and
   instead are for triggering events as a result of a requester's access
   attempt.  Therefore, all AUDIT and ALARM ACEs are processed until the
   end of the ACL.  When the ACL is fully processed, if there are bits
   in the requester's mask that have been neither allowed nor denied,
   the access is denied.

   Even though a request is denied, servers may choose to have other
   restrictions or implementation-defined security policies in place.
   In those cases, access may be decided outside of what is in the ACL.
   Examples of such security policies or restrictions are:

   o  The owner of the file will always be granted ACE4_WRITE_ACL and
      ACE4_READ_ACL permissions.  This would prevent the user from
      getting into the situation where they can't ever modify the ACL.

   o  The ACL may say that an entity is to be granted ACE4_WRITE_DATA
      permission, but the file system is mounted read only, therefore
      write access is denied.

   As mentioned before, this is one of the reasons that client
   implementations are not recommended to do their own access checking.

   The NFS version 4 ACL model is quite rich.  Some server platforms may
   provide access control functionality that goes beyond the UNIX-style
   mode attribute, but which is not as rich as the NFS ACL model.  So
   that users can take advantage of this more limited functionality, the
   server may indicate that it supports ACLs as long as it follows the
   guidelines for mapping between its ACL model and the NFS version 4
   ACL model.

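   The ACE processing order described earlier in this section can be
   sketched in Python.  This is a non-normative illustration under
   simplifying assumptions: "who" matching is reduced to plain string
   equality (group membership and special identifiers are ignored), the
   on_event callback is a hypothetical stand-in for system-dependent
   audit or alarm handling, and READ_BIT and WRITE_BIT are placeholder
   mask bits chosen for the example.  The ACE type values are those
   listed in Section 3.16.1.

```python
from typing import Callable, NamedTuple, Optional, Sequence

# ACE type values for acetype4 (see Section 3.16.1).
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001
ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002
ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003

# Placeholder access mask bits, assumed for illustration only.
READ_BIT = 0x1
WRITE_BIT = 0x2


class Ace(NamedTuple):
    """Mirror of the nfsace4 structure."""
    type: int
    flag: int
    access_mask: int
    who: str


def evaluate_acl(acl: Sequence[Ace], requester: str, requested_mask: int,
                 on_event: Optional[Callable[[Ace], None]] = None) -> bool:
    """Return True if every requested bit is ALLOWED before any denial."""
    remaining = requested_mask   # bits not yet ALLOWED
    decision = None              # None until the request is decided
    for ace in acl:
        if ace.who != requester:  # simplified "who" matching
            continue
        if ace.type in (ACE4_SYSTEM_AUDIT_ACE_TYPE,
                        ACE4_SYSTEM_ALARM_ACE_TYPE):
            # AUDIT and ALARM ACEs never change the outcome and are
            # processed through to the end of the ACL.
            if on_event is not None and ace.access_mask & requested_mask:
                on_event(ace)
            continue
        if decision is not None:
            continue
        if ace.type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            remaining &= ~ace.access_mask
            if remaining == 0:
                decision = True
        elif (ace.type == ACE4_ACCESS_DENIED_ACE_TYPE
              and ace.access_mask & remaining):
            decision = False
    # Bits that were neither allowed nor denied result in denial.
    return decision is True
```

   Because the ACEs are processed in order, a DENY entry placed before
   an ALLOW entry for the same bits wins, and the reverse ordering
   grants the access.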
   The situation is complicated by the fact that a server may have
   multiple modules that enforce ACLs.  For example, the enforcement for
   NFS version 4 access may be different from the enforcement for local
   access, and both may be different from the enforcement for access
   through other protocols such as SMB.  So it may be useful for a
   server to accept an ACL even if not all of its modules are able to
   support it.

   The guiding principle in all cases is that the server must not accept
   ACLs that appear to make the file more secure than it really is.

3.16.1.  ACE type

   Type         Description
   _____________________________________________________
   ALLOW        Explicitly grants the access defined in
                acemask4 to the file or directory.

   DENY         Explicitly denies the access defined in
                acemask4 to the file or directory.

   AUDIT        LOG (system dependent) any access
                attempt to a file or directory which
                uses any of the access methods specified
                in acemask4.

   ALARM        Generate a system ALARM (system
                dependent) when any access attempt is
                made to a file or directory for the
                access methods specified in acemask4.

   A server need not support all of the above ACE types.  The bitmask
   constants used to represent the above definitions within the
   aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

   The semantics of the "type" field follow the descriptions provided
   above.

   The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

   Clients should not attempt to set an ACE unless the server claims
   support for that ACE type.
   If the server receives a request to set an ACE that it cannot store,
   it MUST reject the request with NFS4ERR_ATTRNOTSUPP.  If the server
   receives a request to set an ACE that it can store but cannot
   enforce, the server SHOULD reject the request with
   NFS4ERR_ATTRNOTSUPP.

   Example: suppose a server can enforce NFS ACLs for NFS access but
   cannot enforce ACLs for local access.  If arbitrary processes can run
   on the server, then the server SHOULD NOT indicate ACL support.  On
   the other hand, if only trusted administrative programs run locally,
   then the server may indicate ACL support.

3.16.2.  ACE Access Mask

   The access_mask field contains values based on the following:

   ACE4_READ_DATA
      Operation(s) affected:
         READ
         OPEN
      Discussion:
         Permission to read the data of the file.

   ACE4_LIST_DIRECTORY
      Operation(s) affected:
         READDIR
      Discussion:
         Permission to list the contents of a directory.

   ACE4_WRITE_DATA
      Operation(s) affected:
         WRITE
         OPEN
      Discussion:
         Permission to modify a file's data anywhere in the file's
         offset range.  This includes the ability to write to any
         arbitrary offset and as a result to grow the file.

   ACE4_ADD_FILE
      Operation(s) affected:
         CREATE
         OPEN
      Discussion:
         Permission to add a new file in a directory.  The CREATE
         operation is affected when nfs_ftype4 is NF4LNK, NF4BLK,
         NF4CHR, NF4SOCK, or NF4FIFO.  (NF4DIR is not listed because it
         is covered by ACE4_ADD_SUBDIRECTORY.)  OPEN is affected when
         used to create a regular file.

   ACE4_APPEND_DATA
      Operation(s) affected:
         WRITE
         OPEN
      Discussion:
         The ability to modify a file's data, but only starting at EOF.
         This allows for the notion of append-only files, by allowing
         ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the same user
         or group.  If a file has an ACL such as the one described above
         and a WRITE request is made for somewhere other than EOF, the
         server SHOULD return NFS4ERR_ACCESS.

   ACE4_ADD_SUBDIRECTORY
      Operation(s) affected:
         CREATE
      Discussion:
         Permission to create a subdirectory in a directory.  The CREATE
         operation is affected when nfs_ftype4 is NF4DIR.

   ACE4_READ_NAMED_ATTRS
      Operation(s) affected:
         OPENATTR
      Discussion:
         Permission to read the named attributes of a file or to lookup
         the named attributes directory.  OPENATTR is affected when it
         is not used to create a named attribute directory.  This is
         when 1.) createdir is TRUE, but a named attribute directory
         already exists, or 2.) createdir is FALSE.

   ACE4_WRITE_NAMED_ATTRS
      Operation(s) affected:
         OPENATTR
      Discussion:
         Permission to write the named attributes of a file or to create
         a named attribute directory.  OPENATTR is affected when it is
         used to create a named attribute directory.  This is when
         createdir is TRUE and no named attribute directory exists.  The
         ability to check whether or not a named attribute directory
         exists depends on the ability to look it up, therefore, users
         also need the ACE4_READ_NAMED_ATTRS permission in order to
         create a named attribute directory.

   ACE4_EXECUTE
      Operation(s) affected:
         LOOKUP
      Discussion:
         Permission to execute a file or traverse/search a directory.

   ACE4_DELETE_CHILD
      Operation(s) affected:
         REMOVE
      Discussion:
         Permission to delete a file or directory within a directory.
         See section "ACE4_DELETE vs. ACE4_DELETE_CHILD" for information
         on how these two access mask bits interact.

   ACE4_READ_ATTRIBUTES
      Operation(s) affected:
         GETATTR of file system object attributes
      Discussion:
         The ability to read basic attributes (non-ACLs) of a file.  On
         a UNIX system, basic attributes can be thought of as the stat
         level attributes.  Allowing this access mask bit would mean the
         entity can execute "ls -l" and stat.

   ACE4_WRITE_ATTRIBUTES
      Operation(s) affected:
         SETATTR of time_access_set, time_backup, time_create,
         time_modify_set
      Discussion:
         Permission to change the times associated with a file or
         directory to an arbitrary value.  A user having ACE4_WRITE_DATA
         permission, but lacking ACE4_WRITE_ATTRIBUTES must be allowed
         to implicitly set the times associated with a file.

   ACE4_DELETE
      Operation(s) affected:
         REMOVE
      Discussion:
         Permission to delete the file or directory.  See section
         "ACE4_DELETE vs. ACE4_DELETE_CHILD" for information on how