On Tue, 2010-05-18 at 16:49 -0400, aragonx@xxxxxxxxxx wrote:
> Hello all,
>
> I need some ideas.
>
> I have a backup server that contains 10 ext3 file systems, each with
> 12 million files scattered randomly over 4000 directories. The files'
> average size is 1MB.

So each filesystem is about 12*10^6 files * 1MB = 12*10^12 bytes, or 12
terabytes?

> Every day I expect to get 20 or so requests for files from this
> archive. The files were not stored in any logical structure that I
> can use to narrow down the search. This will be different moving
> forward, but it does not help me for the old data. Additionally,
> every day data is added and old data is removed to make space.
>
> So, now that you know a little about the environment, I need ideas on
> how to find the file I want to restore fast.
>
> Using find on the partition is slow.
>
> I thought about using find and piping the output to a file. I started
> it 50 minutes ago and it still isn't done on a single partition. Plus
> the file is currently about 1.3GB, and how would I maintain such a
> file?
>
> Would putting the file names + path in a database be faster?

You don't say what the file contents are like (text, structured data,
unstructured binary?), nor how you identify the file you want: is a
request equivalent to a text substring, a regular expression, or
something else? Knowing what the contents look like would help in
judging whether it's worth, say, generating a hash for subsections of
each file when it's stored. Alternatively, it could conceivably make
sense to search for strings on the raw disk and work backwards to the
files they belong to, who knows? In short, more info is needed to give
a sensible answer.
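To the database question, though: almost certainly yes. Walking each
tree once and querying an index afterwards will beat re-running find
for every one of the 20 daily requests. A minimal sketch, assuming
Python with its standard-library sqlite3 module; /archive/fs01, the
database path, and the table name are all made up for illustration:

#!/usr/bin/env python
# Sketch only: walk one archive filesystem and record every path in
# SQLite, so a restore request becomes a query instead of a find run.
import os
import sqlite3

db = sqlite3.connect('/var/tmp/file-index.db')
db.execute("CREATE TABLE IF NOT EXISTS files "
           "(path TEXT PRIMARY KEY, size INTEGER, mtime REAL)")

for root, dirs, names in os.walk('/archive/fs01'):
    for name in names:
        p = os.path.join(root, name)
        try:
            st = os.lstat(p)
        except OSError:
            continue      # file removed while we were walking
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                   (p, st.st_size, st.st_mtime))

db.commit()
db.close()

A lookup is then a single query, e.g.
db.execute("SELECT path FROM files WHERE path LIKE ?", ('%foo%',)),
which scans rows in the database rather than stat()ing 12 million
inodes. Since data comes and goes daily, a nightly re-walk with INSERT
OR REPLACE (plus deleting rows whose paths no longer exist) would keep
it current, and it's far easier to maintain than a 1.3GB flat find
dump.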
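On the subsection-hash idea, the shape would be something like the
following, done at store time. The 64 KiB chunk size is an arbitrary
guess, and note the limitation that a fragment can only be matched if
it lines up with these boundaries; whether any of this pays for itself
depends entirely on the unanswered questions above.

# Sketch only: hash fixed-size sections of a file when it is stored, so
# a fragment seen later can be matched back to a path.
import hashlib

CHUNK = 64 * 1024

def chunk_hashes(path):
    """Yield (offset, md5 hex digest) for each CHUNK-sized section."""
    with open(path, 'rb') as f:
        offset = 0
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            yield offset, hashlib.md5(block).hexdigest()
            offset += len(block)

The resulting (path, offset, digest) triples could live in the same
SQLite database as the path index.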
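And the raw-disk route, purely for illustration: scan the block device
read-only for a distinctive byte string, convert each match offset to a
filesystem block number, and map blocks back to inodes and paths with
debugfs's icheck and ncheck commands. The device name, needle, and
block size below are assumptions (check the real block size with
dumpe2fs -h), and expect a pass over 12 TB to take many hours.

# Sketch only: print the fs block number of each hit, then recover the
# owning file with:
#   debugfs -R 'icheck <block>' /dev/sdb1   -> inode number
#   debugfs -R 'ncheck <inode>' /dev/sdb1   -> path
DEVICE = '/dev/sdb1'
NEEDLE = b'some distinctive string'
BLOCKSIZE = 4096
READSIZE = 1024 * 1024

with open(DEVICE, 'rb') as dev:
    offset = 0    # device bytes consumed so far
    tail = b''    # carry-over so matches can span read boundaries
    while True:
        buf = dev.read(READSIZE)
        if not buf:
            break
        hay = tail + buf
        i = hay.find(NEEDLE)
        while i != -1:
            pos = offset - len(tail) + i    # absolute byte offset
            print(pos // BLOCKSIZE)         # block number for icheck
            i = hay.find(NEEDLE, i + 1)
        tail = hay[-(len(NEEDLE) - 1):] if len(NEEDLE) > 1 else b''
        offset += len(buf)

poc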