On Sun, Mar 01, 2009 at 09:54:59AM -0800, bruce wrote:
>
> hi bruno.
>
> for my situation. i have a bunch of files being created by an upfront
> process, and on the backend, i have a number of client/child processes that
> get created, which have to operate/process the files. no file is processed
> by more than a single client app.
....
> i can easily setup a file read/write lock process where a client app
> gets/locks a file, and then copies/moves the required files from the initial
> dir to a tmp dir. after the move/copy, the lock is released, and the client
> can go ahead and do whatever with the files in the tmp dir.. the process
> allows multiple clients to operate in a pseudo parallel manner...
>
> i'm trying to figure out if there's a much better/faster approach that might
> be available.. which is where the academic/research issue was raised..
>
> the issue that i'm looking at is analogous to a FIFO, where i have lots of
> files being shoved in a dir from different processes.. on the other end, i
> want to allow multiple client processes to access unique groups of these
> files as fast as possible.. access being fetch/gather/process/delete the
> files. each file is only handled by a single client process.

You should benchmark some strategies. The standard flock(), lockf(),
stat(), fcntl() and friends should do the trick.

You do not want to do a copy.

Locking over NFS is 'interesting', so look for a 'lock free' strategy
if you expect this to run on NFS file systems now or in the future.

My simple-minded solution is to have the creation process or a
dispatcher process move files from the front end input directory into
a modest set of directories for the back end processes to pick from.
The number of back end processes and dirs can be tuned to match the
number of processors and the IO subsystem performance. Renaming a file
on the same device is a "quick" metadata transaction that does not
risk data loss.
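
A minimal sketch of the dispatcher idea above, in Python for brevity.
The function name dispatch() and the directory layout are made up for
illustration. It assumes every directory lives on the same file system,
so that os.rename() is a pure metadata operation, and that file names
are unique across the queue dirs.

```python
import os

def dispatch(incoming_dir, queue_dirs):
    """Move each regular file in incoming_dir into one of queue_dirs.

    Files are spread round-robin. No locks are taken: the hand-off is
    a single rename per file. Returns a list of (name, dest_dir) pairs.
    Hypothetical helper, not part of any library.
    """
    moved = []
    for i, name in enumerate(sorted(os.listdir(incoming_dir))):
        src = os.path.join(incoming_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories and other non-files
        dest_dir = queue_dirs[i % len(queue_dirs)]
        try:
            # Raises OSError (EXDEV) if dest_dir is on another device.
            # On POSIX an existing destination is silently replaced,
            # hence the unique-names assumption.
            os.rename(src, os.path.join(dest_dir, name))
        except FileNotFoundError:
            continue  # someone else moved or deleted it first
        moved.append((name, dest_dir))
    return moved
```

Each back end process then reads only from its own queue dir, so no
two of them ever compete for the same file.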

The renaming can also be done by the input process at the point that
the file creation is finished. It may be necessary to do this step to
ensure that a tail end process does not open a file before it is ready
for processing.

Depending on the complexity of the activity you may need to resort to
some of the tricks used by sendmail or postfix. For example, how do
you know whether a data file has been processed not at all, completely
or incompletely, and, if it matters, should it get processed twice?

Do consider the use of mmap(), as it can help you limit the pollution
of the page cache by a one time read activity. With mmap, revisit the
NFS topic.

The number of files in any one directory can be important. Establish
some design limits. Too many files in any one dir can bog down the
file system.

Also give consideration to the names of the files. Stuff like sorting
file names can be important, i.e.
    2009.28.07 vs. 2009.07.28 vs. 09.7.28 vs. 09.11.11.
Also, dots and dashes (".", "-") are shell or regexp metacharacters,
so this might be a better list to start with, i.e.
    2009_28_07 vs. 2009_07_28 vs. 09_7_28 vs. 09_11_11.
Letters and hex may play well... but design for backup, easy system
admin and general simplicity, including documentation (characters that
matter for documentation might be &gt; vs. > for html documents; see
also TeX, LaTeX, SGML, XML).

Predictable file names have been central to some security risks, so
depending on the value of the data and the risk involved, some
attention may need to be given to the creation steps.

-- 
	T o m  M i t c h e l l
	Found me a new hat, now what?