Sophie: kernel-2.6.18-128.29.1.el5 src

kernel-2.6.18-128.29.1.el5.src.rpm

From: Peter Staubach <staubach@redhat.com>
Date: Fri, 18 Sep 2009 14:39:00 -0400
Subject: [nfs] r/w I/O perf degraded by FLUSH_STABLE page flush
Message-id: 4AB3D3C4.30603@redhat.com
O-Subject: [PATCH RHEL-5_3-Z] BZ521243 Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing [rhel-5.3.z]
Bugzilla: 521243
RH-Acked-by: Steve Dickson <SteveD@redhat.com>
RH-Acked-by: Jeff Layton <jlayton@redhat.com>

Hi.

Attached is a patch to address bz521243, "Read/Write NFS I/O
performance degraded by FLUSH_STABLE page flushing [rhel-5.3.z]".
This bugzilla describes a performance issue when reading and writing
files using small system call sizes.

The issue is that the NFS client currently implements something
akin to modify-write-read when updating a file.  When a write(2)
occurs first to a page, an empty page is allocated, the contents
of the write system call are transfered to the page, and the dirty
region is kept track of.  If a read(2) is then issued for something
on the page, the contents of the page must be written out before
the page can be read in from the server.  This writing of the data
must be done synchronously and must flush the data all of the to
stable storage on the server.  This data flushing tends to happen
in very small chunks, so is very slow.

The usual mechanism for handling situations like this is read-
modify-write.  When the write(2) occurs, the current contents of the
page are read in the from server and then the contents of the write
system call are copied to the page.  When the read(2) is issued,
the page contents are available and without needing to flush the
modified data to the server.

This patch has been accepted into RHEL-5.  It was tested by building
the RHEL-5_3-Z kernel RPMS and counting the number of the specific
NFS protocol operations generated.

	Thanx...

		ps

--- linux-2.6.18.i686/fs/nfs/file.c.org
+++ linux-2.6.18.i686/fs/nfs/file.c
@@ -296,6 +296,40 @@ nfs_fsync(struct file *file, struct dent
 }
 
 /*
+ * Decide whether a read/modify/write cycle may be more efficient
+ * then a modify/write/read cycle when writing to a page in the
+ * page cache.
+ *
+ * The modify/write/read cycle may occur if a page is read before
+ * being completely filled by the writer.  In this situation, the
+ * page must be completely written to stable storage on the server
+ * before it can be refilled by reading in the page from the server.
+ * This can lead to expensive, small, FILE_SYNC mode writes being
+ * done.
+ *
+ * It may be more efficient to read the page first if the file is
+ * open for reading in addition to writing, the page is not marked
+ * as Uptodate, it is not dirty or waiting to be committed,
+ * indicating that it was previously allocated and then modified,
+ * that there were valid bytes of data in that range of the file,
+ * and that the new data won't completely replace the old data in
+ * that range of the file.
+ */
+static int nfs_want_read_modify_write(struct file *file, struct page *page,
+				      unsigned offset, unsigned to)
+{
+	unsigned int pglen = nfs_page_length(page->mapping->host, page);
+
+	if ((file->f_mode & FMODE_READ) &&	/* open for read? */
+	    !PageUptodate(page) &&		/* Uptodate? */
+	    !PagePrivate(page) &&		/* i/o request already? */
+	    pglen &&				/* valid bytes of file? */
+	    (to < pglen || offset))		/* replace all valid bytes? */
+		return 1;
+	return 0;
+}
+
+/*
  * This does the "real" work of the write. The generic routine has
  * allocated the page, locked it, done all the page alignment stuff
  * calculations etc. Now we should just copy the data from user
@@ -303,10 +337,29 @@ nfs_fsync(struct file *file, struct dent
  *
  * If the writer ends up delaying the write, the writer needs to
  * increment the page use counts until he is done with the page.
+ *
+ * If a read-modify-write cycle is to be used, then AOP_TRUNCATED_PAGE
+ * needs to be returned to indicate that the caller should drop his
+ * reference to the page and look it up again.  This needs to occur
+ * because nfs_readpage() drops the lock on the page.  Additionally,
+ * any errors from nfs_readpage() should be returned.  However, if
+ * an error to be returned is not AOP_TRUNCATED_PAGE, then the page
+ * needs to be relocked so that the caller can unlock it.
  */
-static int nfs_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
+static int nfs_prepare_write(struct file *file, struct page *page,
+			     unsigned offset, unsigned to)
 {
-	return nfs_flush_incompatible(file, page);
+	int ret;
+
+	ret = nfs_flush_incompatible(file, page);
+	if (!ret && nfs_want_read_modify_write(file, page, offset, to)) {
+		ret = nfs_readpage(file, page);
+		if (!ret)
+			ret = AOP_TRUNCATED_PAGE;
+		else if (ret != AOP_TRUNCATED_PAGE)
+			lock_page(page);
+	}
+	return ret;
 }
 
 static int nfs_commit_write(struct file *file, struct page *page, unsigned offset, unsigned to)
--- linux-2.6.18.i686/fs/nfs/internal.h.org
+++ linux-2.6.18.i686/fs/nfs/internal.h
@@ -186,6 +186,7 @@ extern int nfs4_path_walk(struct nfs_ser
 
 /* read.c */
 extern int nfs_readpage_async(struct nfs_open_context *, struct inode *, struct page *);
+unsigned int nfs_page_length(struct inode *, struct page *);
 
 /*
  * Determine the device name as a string
--- linux-2.6.18.i686/fs/nfs/read.c.org
+++ linux-2.6.18.i686/fs/nfs/read.c
@@ -85,7 +85,6 @@ void nfs_readdata_release(void *data)
         nfs_readdata_free(data);
 }
 
-static
 unsigned int nfs_page_length(struct inode *inode, struct page *page)
 {
 	loff_t i_size = i_size_read(inode);