[RFC-2 PATCH 0/8] shared subtree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Enclosed 8 patches that implement shared subtree functionality as
detailed in Al Viro's RFC found at http://lwn.net/Articles/119232/

I have incorporated all the comments received earlier in first round.  Thanks
to Miklos and Pekka for the valuable comments.  Also I have optimized lots of
code, especially in pnode.c . Code is unit tested. However the code in its
current form does not handle ENOMEM error gracefully. I am working on it. 

The incremental patches provide the following functionality:

1) shared_private_slave.patch : Provides the ability to mark a subtree as
shared or private or slave.
        
2) unclone.patch : provides the ability to mark a subtree as unclonable.  NOTE:
this feature is an addition to Al Viro's RFC, to solve the vfsmount explosion.
The problem is  detailed here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0502.0/0468.html

3) rbind.patch : this patch adds the ability to propagate binds/rbinds across
vfsmounts.

4) move.patch : this patch provides the ability to move a
shared/private/slave/unclonable subtree to some other mount-point. It also
provides the same feature to pivot_root()

5) umount.patch: this patch provides the ability to propagate unmounts.

6) namespace.patch: this patch provides ability to clone a namespace, with
propagation set to vfsmounts in the new namespace.

7) automount.patch: this patch provides the automatic propagation for
mounts/unmounts done through automounter.

8) pnode_opt.patch: this patch optimizes the redundant code in pnode.c .

Looking forward for comments, 
RP
-----------------------------------------------------------------------

CHANGES DONE IN RESPONSE TO COMMENTS RECEIVED IN 1ST ROUND,


response to Pekka J Enberg Comments:


>>Inlining the patches to email would be greatly appreciated. Here are
>>some comments.

done

> +int
> +_do_make_mounted(struct nameidata *nd, struct vfsmount **mnt)

>>Use two underscores to follow naming conventions.

Yes done. In fact this function is renamed as make_mounted because
that seemed to make more sense. But in general throughout the patches
I have changed all newly introduced function that start with one
underscore to two underscores.


> Index: 2.6.12/fs/pnode.c
> ===================================================================
> --- /dev/null
> +++ 2.6.12/fs/pnode.c
> @@ -0,0 +1,362 @@
> +
> +#define PNODE_MEMBER_VFS  0x01
> +#define PNODE_SLAVE_VFS   0x02

>>Enums, please.

done


> +
> +static kmem_cache_t * pnode_cachep;
> +
> +/* spinlock for pnode related operations */
> + __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfspnode_lock);
> +
> +
> +static void
> +pnode_init_fn(void *data, kmem_cache_t *cachep, unsigned long flags)
> +{
> +	struct vfspnode *pnode = (struct vfspnode *)data;

>>Redundant cast.

yes. removed.

> +	INIT_LIST_HEAD(&pnode->pnode_vfs);
> +	INIT_LIST_HEAD(&pnode->pnode_slavevfs);
> +	INIT_LIST_HEAD(&pnode->pnode_slavepnode);
> +	INIT_LIST_HEAD(&pnode->pnode_peer_slave);
> +	pnode->pnode_master = NULL;
> +	pnode->pnode_flags = 0;
> +	atomic_set(&pnode->pnode_count,0);
> +}
> +
> +void __init
> +pnode_init(unsigned long mempages)
> +{
> +	pnode_cachep = kmem_cache_create("pnode_cache",
> +                       sizeof(struct vfspnode), 0,
> +                       SLAB_HWCACHE_ALIGN|SLAB_PANIC, pnode_init_fn, NULL);
> +}
> +
> +
> +struct vfspnode *
> +pnode_alloc(void)
> +{
> +	struct vfspnode *pnode =  (struct vfspnode *)kmem_cache_alloc(
> +			pnode_cachep, GFP_KERNEL);

>>Redundant cast.

yes removed.

> +struct inoutdata {

>>Wants a better name.

This datastructure is gone after optimizing pnode.c


> +	void *my_data; /* produced and consumed by me */
> +	void *in_data; /* produced by master, consumed by slave */
> +	void *out_data; /* produced by slave, comsume by master */
> +};
> +
> +struct pcontext {
> +	struct vfspnode *start;
> +	int 	flag;
> +	int 	traversal;
> +	int	level;
> +	struct vfspnode *master_pnode;
> +	struct vfspnode *pnode;
> +	struct vfspnode *slave_pnode;
> +};
> +
> +
> +#define PNODE_UP 1
> +#define PNODE_DOWN 2
> +#define PNODE_MID 3

>>Enums, please.

These #defines are gone after the optimizations and cleanup.

> +
> +/*
> + * Walk the pnode tree for each pnode encountered.  A given pnode in the tree
> + * can be returned a minimum of 2 times.  First time the pnode is encountered,
> + * it is returned with the flag PNODE_DOWN. Every time the pnode is encountered
> + * after having traversed through each of its children, it is returned with the
> + * flag PNODE_MID.  And finally when the pnode is encountered after having
> + * walked all of its children, it is returned with the flag PNODE_UP.
> + *
> + * @context: provides context on the state of the last walk in the pnode
> + * 		tree.
> + */
> +static int inline
> +pnode_next(struct pcontext *context)

>>Rather large function to be an inline.

yes. done.

> +{
> +	int traversal = context->traversal;
> +	int ret=0;
> +	struct vfspnode *pnode = context->pnode,
> +			*slave_pnode=context->slave_pnode,
> +			*master_pnode=context->master_pnode;

>>Add a separate declaration for each variable. The above is hard to read.

yes done.

> +	struct list_head *next;
> +
> +	spin_lock(&vfspnode_lock);
> +	/*
> +	 * the first traversal will be on the root pnode
> +	 * with flag PNODE_DOWN
> +	 */
> +	if (!pnode) {
> +		context->pnode = get_pnode(context->start);
> +		context->master_pnode = NULL;
> +		context->traversal = PNODE_DOWN;
> +		context->slave_pnode = NULL;
> +		context->level = 0;
> +		ret = 1;
> +		goto out;
> +	}
> +
> +	/*
> +	 * if the last traversal was PNODE_UP, than the current
> +	 * traversal is PNODE_MID on the master pnode.
> +	 */
> +	if (traversal == PNODE_UP) {
> +		if (!master_pnode) {
> +			/* DONE. return */
> +			put_pnode(pnode);
> +			ret = 0;

>> Using goto out and dropping the else branch would make this more readable.

This code is also gone after optimization and cleanups.


> +		} else {
> +			context->traversal = PNODE_MID;
> +			context->level--;
> +			context->pnode = master_pnode;
> +			put_pnode(slave_pnode);
> +			context->slave_pnode = pnode;
> +			context->master_pnode = (master_pnode ?
> +				master_pnode->pnode_master : NULL);
> +			ret = 1;
> +		}
> +	} else {
> +		if(traversal == PNODE_MID) {

>> Missing space before parenthesis.

Yes ensured that throughout the patches. Hope I have not missed any.


> +			next = slave_pnode->pnode_peer_slave.next;
> +		} else {
> +			next = pnode->pnode_slavepnode.next;
> +		}

>> Please drop the extra braces.

Yes done.


> +		put_pnode(slave_pnode);
> +		context->slave_pnode = NULL;
> +		/*
> +		 * if the last traversal was PNODE_MID or PNODE_DOWN, and the
> +		 * master pnode has some slaves to traverse, the current
> +		 * traversal will be PNODE_DOWN on the slave pnode.
> +		 */
> +		if ((next != &pnode->pnode_slavepnode) &&
> +			(traversal == PNODE_DOWN || traversal == PNODE_MID)) {
> +			context->traversal = PNODE_DOWN;
> +			context->level++;
> +			context->pnode = get_pnode(list_entry(next,
> +				struct vfspnode, pnode_peer_slave));
> +			context->master_pnode = pnode;
> +			ret = 1;
> +		} else {
> +			/*
> +			 * since there are no more children, the current traversal
> +			 * is PNODE_UP on the same pnode
> +			 */
> +			context->traversal = PNODE_UP;
> +			ret = 1;

>> Would probably make more sense to check if
>> (next == &pnode->pnode_slavepnode && traversal == PNODE_UP) and use goto out to
>> get rid of the else branch.

Again after code-cleanup and optimization, this code is gone.

> +		}
> +	}
> +out:
> +	spin_unlock(&vfspnode_lock);
> +	return ret;
> +}
> +
> +

> +static void
> +_pnode_disassociate_mnt(struct vfsmount *mnt)

>> Two underscores, please.

Yes. Did it!

> +struct vfsmount *
> +pnode_make_mounted(struct vfspnode *pnode, struct vfsmount *mnt, struct dentry *dentry)
> +{
> +	struct vfsmount *child_mnt;
> +	int ret=0,traversal,level;

>> Spaces, please.

Yes done.


> +     	struct vfspnode *slave_pnode, *master_pnode, *child_pnode, *slave_child_pnode;
> +     	struct vfsmount *slave_mnt, *member_mnt, *t_m;

>> Formatting damage.

Yep. Done


> +	while(pnode_next(&context)) {

>> Missing space before parenthesis.

Done


> +		traversal = context.traversal;
> +		level = context.level;
> +		pnode = context.pnode;
> +		slave_pnode = context.slave_pnode;
> +		master_pnode = context.master_pnode;
> +
> +		if (traversal == PNODE_DOWN ) {

>> Use switch statement here.

yep! did it.

> +			if (master_pnode) {
> +				child_pnode = (struct vfspnode *)p_array[level-1].in_data;

>> Redundant cast.

done.

> +			} else {
> +				child_pnode = NULL;
> +			}

>> Extra braces.

done

> +			while (!(child_pnode = pnode_alloc()))
> +				schedule();

>> This looks dangerous. Why this must not fail and in other places you 
>> return -ENOMEM?

I have changed it to return -ENOMEM. But its not perfect yet. The
code is not graceful returning. It leaves around some data-structures.
I am working on this last aspect currently.


> +			p_array[level].my_data = (void *)child_pnode;

>> Redundant cast.

Yes. done.

> +			child_pnode = (struct vfspnode *)p_array[level].my_data;

>> Redundant cast.

Yes. done.

> +#define MS_PRIVATE	262144
> +#define MS_SLAVE	524288
> +#define MS_SHARED	1048576

>> The expression (1<<bit) would make more sense here.

Yes. done.


> Index: 2.6.12/include/linux/pnode.h
> ===================================================================
> --- /dev/null
> +++ 2.6.12/include/linux/pnode.h
> @@ -0,0 +1,78 @@
> +/*
> + *  linux/fs/pnode.c
> + *
> + * (C) Copyright IBM Corporation 2005.
> + *	Released under GPL v2.
> + *
> + */
> +#ifndef _LINUX_PNODE_H
> +#define _LINUX_PNODE_H
> +#ifdef __KERNEL__

>> No need for the above. Kernel headers are not supposed to be included by
>> userspace.

Removed it.

> +
> +#include <linux/list.h>
> +#include <linux/mount.h>
> +#include <linux/spinlock.h>
> +#include <asm/atomic.h>
> +
> +struct vfspnode {
> +	struct list_head pnode_vfs; 	 /* list of vfsmounts anchored here */
> +	struct list_head pnode_slavevfs; /* list of slave vfsmounts */
> +	struct list_head pnode_slavepnode;/* list of slave pnode */
> +	struct list_head pnode_peer_slave;/* going through master's slave pnode
> +					    list*/
> +	struct vfspnode	 *pnode_master;	  /* master pnode */
> +	int 		 pnode_flags;
> +	atomic_t 	 pnode_count;
> +};
> +#define PNODE_MAX_SLAVE_LEVEL 10
> +#define PNODE_DELETE  0x01
> +#define PNODE_SLAVE   0x02

>> Enums, please.

I have left it as is for now. These defines can also go. I will fix
them.


> +
> +#define IS_PNODE_DELETE(pn)  ((pn->pnode_flags&PNODE_DELETE)==PNODE_DELETE)
> +#define IS_PNODE_SLAVE(pn)  ((pn->pnode_flags&PNODE_SLAVE)==PNODE_SLAVE)
> +#define SET_PNODE_DELETE(pn)  pn->pnode_flags |= PNODE_DELETE
> +#define SET_PNODE_SLAVE(pn)  pn->pnode_flags |= PNODE_SLAVE

>> Static inline functions are preferred over #define.

done

> +
> +extern spinlock_t vfspnode_lock;
> +extern void __put_pnode(struct vfspnode *);
> +
> +static inline struct vfspnode *
> +get_pnode(struct vfspnode *pnode)
> +{
> +	if (!pnode)
> +		return NULL;

>> Can pnode really be NULL here? Looking at the callers in this patch, it can't.
>> Please remember that you should do NULL checks like this only when it makes
>> sense from API point of view to call the function with NULL.

Yes there are cases where it they can be called with null. If I don't
have these checks, than the check has to be done in the caller. So I chose
it here.

> +//TOBEDONE WRITE BETTER MACROS. ..
>>Please use static inline functions instead.

yes done.


Response to Miklos comments follows:

> -struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
> +struct vfsmount *lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, struct dentry *root)
>  {

>> How about changing it to inline and calling it __lookup_mnt_root(),
>> and calling it from lookup_mnt() (which could keep the old signature)
>> and lookup_mnt_root().  That way the compiler can optimize away the
>> root check for the plain lookup_mnt() case, and there's no need to
>> modify callers of lookup_mnt().

Yes. DOne.

>  
> +struct vfsmount *do_make_mounted(struct vfsmount *mnt, struct dentry *dentry)
> +{

>> What does this function do?  Can we have a header comment?

Yes added a header.

> +int
> +_do_make_mounted(struct nameidata *nd, struct vfsmount **mnt)
> +{

>> Ditto.

Added a header.

> +/*
> + * recursively change the type of the mountpoint.
> + */
> +static int do_change_type(struct nameidata *nd, int flag)
> +{
> +	struct vfsmount *m, *mnt;
> +	struct vfspnode *old_pnode = NULL;
> +	int err;
> +
> +	if (!(flag & MS_SHARED) && !(flag & MS_PRIVATE)
> +			&& !(flag & MS_SLAVE))
> +		return -EINVAL;
> +
> +	if ((err = _do_make_mounted(nd, &mnt)))
> +		return err;


>> Why does this opertation do any mounting?  If it's a type change, it
>> should just change the type of something already mounted, no?

Ok. Now it returns -EINVAL.


> +		case MS_SHARED:
> +			SET_MNT_PRIVATE(m);
....
> +			break;
> +

>> Can this be split into three functions?

Yes split them into 3 functions.


> +static int inline
> +pnode_next(struct pcontext *context)
> +{

>> Is such a generic traversal function really needed?  Why?

As I said in my earlier mail, this function is equivalent of next_mnt
for traversing vfsmount tree.
I have optimized this function as well as pnode_traverse(). Hopefully you
may find the new code palatable.


> +struct vfsmount *
> +pnode_make_mounted(struct vfspnode *pnode, struct vfsmount *mnt, struct dentry *dentry)
> +{

>>Again a header comment would be nice, on what exactly this function
>>does.  Also the implementation is really cryptic, but I can't even
>>start to decipher without knowing what it's supposed to do.

Yes. added a header comment.


> +static inline struct vfspnode *
> +get_pnode_n(struct vfspnode *pnode, size_t n)
> +{

>> Seems to be unused throughout the patch series

True removed.


>  
> +
>  static inline struct vfsmount *mntget(struct vfsmount *mnt)

>> Please don't add empty lines.

Yes done.
-----------------------------------------------------------------------
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]
  Powered by Linux