Re: [PATCH] prune_icache_sb

Andrew Morton wrote:

On Thu, 30 Nov 2006 11:05:32 -0500
Wendy Cheng <wcheng@redhat.com> wrote:
The idea is, instead of unconditionally dropping every buffer associated with the particular mount point (that defeats the purpose of page caching), base kernel exports the "drop_pagecache_sb()" call that allows page cache to be trimmed. More importantly, it is changed to offer the choice of not randomly purging any buffer but the ones that seem to be unused (i_state is NULL and i_count is zero). This will encourage filesystem(s) to proactively respond to vm memory shortage if they choose so.
argh.
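
As a rough illustration of the selection rule in the quoted paragraph (only touch inodes whose i_state and i_count are both zero), here is a minimal userspace sketch. It is not the patch itself: struct model_inode, inode_looks_unused() and trim_unused() are hypothetical names made up for this example, and the real struct inode carries far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few struct inode fields discussed here. */
struct model_inode {
    unsigned long i_state;  /* 0 means no state flags are set */
    int           i_count;  /* reference count; 0 means no active users */
    unsigned long nrpages;  /* pages cached for this inode */
};

/* The "seems to be unused" test: no state flags and no references. */
static int inode_looks_unused(const struct model_inode *inode)
{
    return inode->i_state == 0 && inode->i_count == 0;
}

/*
 * Drop cached pages only for inodes that look unused, instead of
 * purging every inode on the mount. Returns the number of pages dropped.
 */
unsigned long trim_unused(struct model_inode *inodes, size_t n)
{
    unsigned long dropped = 0;

    assert(inodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!inode_looks_unused(&inodes[i]))
            continue;           /* busy inode: leave its pages alone */
        dropped += inodes[i].nrpages;
        inodes[i].nrpages = 0;
    }
    return dropped;
}
```

An inode with a nonzero i_state or a held reference keeps its pages, which is the difference from an unconditional per-mount purge.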

I read this as "It is ok to give system admin(s) commands (that this "drop_pagecache_sb() call" is all about) to drop page cache. It is, however, not ok to give filesystem developer(s) this very same function to trim their own page cache if the filesystems choose to do so" ?

In Linux a filesystem is a dumb layer which sits between the VFS and the
I/O layer and provides dumb services such as reading/writing inodes,
reading/writing directory entries, mapping pagecache offsets to disk
blocks, etc.  (This model is to varying degrees incorrect for every
post-ext2 filesystem, but that's the way it is).

Linux kernel, particularly the VFS layer, is starting to show signs of inadequacy as the software components built upon it keep growing. I have doubts that it can keep up and handle this complexity with a development policy like you just described (filesystem is a dumb layer ?). Aren't these DIO_xxx_LOCKING flags inside __blockdev_direct_IO() a perfect example why trying to do too many things inside vfs layer for so many filesystems is a bad idea ? By the way, since we're on this subject, could we discuss a little bit about vfs rename call (or I can start another new discussion thread) ?

Note that linux do_rename() starts with the usual lookup logic, followed by "lock_rename", then a final round of dentry lookup, and finally comes to filesystem's i_op->rename call. Since lock_rename() only calls for vfs layer locks that are local to this particular machine, for a cluster filesystem, there exists a huge window between the final lookup and filesystem's i_op->rename calls such that the file could get deleted from another node before fs can do anything about it. Is it possible that we could get a new function pointer (lock_rename) in inode_operations structure so a cluster filesystem can do proper locking ?
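
To make the window concrete, here is a small userspace model of the sequence above, with an optional per-filesystem hook taken before the final lookup. Everything in it is an assumption for illustration only: struct model_file, struct model_inode_ops, remote_unlink() and the unlock_rename counterpart are invented names, and the real do_rename() and inode_operations look nothing like this simplified form.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a file as seen by every node in the cluster. */
struct model_file {
    int exists;          /* 0 once some node has unlinked it */
    int cluster_locked;  /* 1 while a cluster-wide lock is held */
};

/* Hypothetical ops table; lock_rename is the proposed optional hook. */
struct model_inode_ops {
    int  (*lock_rename)(struct model_file *f);    /* may be NULL */
    void (*unlock_rename)(struct model_file *f);  /* assumed counterpart */
    int  (*rename)(struct model_file *f);
};

/* Another node unlinks the file unless a cluster lock stops it. */
static void remote_unlink(struct model_file *f)
{
    if (!f->cluster_locked)
        f->exists = 0;
}

static int  fs_lock_rename(struct model_file *f)   { f->cluster_locked = 1; return 0; }
static void fs_unlock_rename(struct model_file *f) { f->cluster_locked = 0; }
static int  fs_rename(struct model_file *f)        { return f->exists ? 0 : -ENOENT; }

/* A filesystem that relies on node-local VFS locking only. */
const struct model_inode_ops local_only_ops = { NULL, NULL, fs_rename };
/* A cluster filesystem that supplies the proposed hook. */
const struct model_inode_ops cluster_ops = { fs_lock_rename, fs_unlock_rename, fs_rename };

int model_do_rename(const struct model_inode_ops *ops, struct model_file *f)
{
    int err;

    assert(ops != NULL && ops->rename != NULL);
    /* Node-local VFS locks would be taken here; other nodes ignore them. */
    if (ops->lock_rename) {
        err = ops->lock_rename(f);
        if (err)
            return err;
    }
    if (!f->exists) {       /* final round of lookup */
        err = -ENOENT;
        goto out;
    }
    remote_unlink(f);       /* the window: another node acts right now */
    err = ops->rename(f);   /* stands in for i_op->rename */
out:
    if (ops->lock_rename && ops->unlock_rename)
        ops->unlock_rename(f);
    return err;
}
```

In this model, model_do_rename(&local_only_ops, &f) passes the final lookup and then fails with -ENOENT inside the filesystem's rename, while the same call with cluster_ops succeeds because the remote unlink is held off.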

From our end (cluster locks are expensive - that's why we cache them), one of our kernel daemons will invoke this newly exported call based on a set of pre-defined tunables. It is then followed by a lock reclaim logic to trim the locks by checking the page cache associated with the inode (that this cluster lock is created for). If nothing is attached to the inode (based on i_mapping->nrpages count), we know it is a good candidate for trimming and will subsequently drop this lock (instead of waiting until the end of vfs inode life cycle).
Again, I don't understand why you're tying the lifetime of these locks to
the VFS inode reclaim mechanisms.  Seems odd.
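
For readers following the exchange, the reclaim step described in the quoted paragraph (drop a cached lock once its inode has no pages attached) could be modelled roughly as below. This is a userspace sketch under assumed names: struct cached_lock and trim_idle_locks() are hypothetical, and inode_nrpages merely stands in for inode->i_mapping->nrpages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one cached cluster lock and the inode it covers. */
struct cached_lock {
    unsigned long inode_nrpages;  /* stands in for i_mapping->nrpages */
    int cached;                   /* 1 while this node still holds the lock */
};

/*
 * Drop every cached lock whose inode has no page cache attached.
 * Returns the number of locks dropped.
 */
size_t trim_idle_locks(struct cached_lock *locks, size_t n)
{
    size_t dropped = 0;

    assert(locks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (locks[i].cached && locks[i].inode_nrpages == 0) {
            locks[i].cached = 0;  /* good candidate: nothing cached under it */
            dropped++;
        }
    }
    return dropped;
}
```

A lock whose inode still has cached pages is kept, since giving it up would force those pages to be flushed or invalidated.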

Cluster locks are expensive because:

1. Every node in the cluster has to agree about it upon granting the request (communication overhead).
2. It involves disk flushing if bouncing between nodes. Say one node requests a read lock after another node's write... before the read lock can be granted, the write node needs to flush the data to the disk (disk io overhead).

For optimization purposes, we want to hold off the disk flush after writes and hope (and encourage) the next person who requests the lock to be on the very same node (to take advantage of the OS write-back logic). That's why the locks are cached on the very same node. They will not get removed unless necessary. What would be better than building the lock caching on top of the existing inode cache logic - since these are the objects that the cluster locks are created for in the first place?

If you want to put an upper bound on the number of in-core locks, why not
string them on a list and throw away the old ones when the upper bound is
reached?
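
The list-with-an-upper-bound idea in the quoted suggestion can be sketched as a small doubly linked list that evicts its oldest entry once the bound is exceeded. This is only a userspace illustration with made-up names (struct lru_lock, struct lock_cache, lock_cache_add()); it is not DLM code and does not reflect how any in-kernel lock cache is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cached lock, linked into an age-ordered list. */
struct lru_lock {
    int id;
    struct lru_lock *prev, *next;
};

struct lock_cache {
    struct lru_lock *head;  /* newest entry */
    struct lru_lock *tail;  /* oldest entry */
    size_t count;
    size_t max;             /* upper bound on in-core locks, must be > 0 */
};

/*
 * Add a lock at the newest end. If that pushes the cache over its bound,
 * unlink and return the oldest lock so the caller can release it;
 * otherwise return NULL.
 */
struct lru_lock *lock_cache_add(struct lock_cache *c, struct lru_lock *l)
{
    struct lru_lock *victim = NULL;

    assert(c != NULL && l != NULL && c->max > 0);

    l->prev = NULL;
    l->next = c->head;
    if (c->head)
        c->head->prev = l;
    else
        c->tail = l;        /* first entry is both head and tail */
    c->head = l;
    c->count++;

    if (c->count > c->max) {
        /* max > 0 means at least two entries here, so tail->prev is set. */
        victim = c->tail;
        c->tail = victim->prev;
        c->tail->next = NULL;
        victim->prev = victim->next = NULL;
        c->count--;
    }
    return victim;
}
```

Note that this evicts purely by age. It has no notion of which lock still has cached data under it, which is the distinction the reply below is drawing.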

Don't get me wrong. DLM *has* a tunable to set the max lock counts. We do drop the locks, but to drop the right locks we need a little bit of help from the VFS layer. Latency requirements are difficult to manage.

Did you look at improving that lock-lookup algorithm, btw?  Core kernel has
no problem maintaining millions of cached VFS objects - is there any reason
why your lock lookup cannot be similarly efficient?

Don't be so confident. I did see some complaints from ext3 based mail servers in the past - when the storage size was large enough, people had to explicitly umount the filesystem from time to time to rescue their performance. I don't recall the details at this moment though.

For us with this particular customer, it is a 15TB storage.

-- Wendy