bruce wrote:
i know wget isn't as robust as nutch, but can someone tell me if wget keeps track of the URLs it's been through so it doesn't repeat/get stuck in a never-ending process...
I don't know about the implementation details, but if I create two pages that link to each other and tell wget to download them recursively, it does not loop. Maybe it could loop if there are references that can't be detected by examining the "stack" of links leading back to the first page.
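
For what it's worth, here's a minimal way to reproduce that test yourself (the host name and file names are just placeholders; the two pages need to be served over HTTP for wget to recurse):

  a.html:  <html><body><a href="b.html">to b</a></body></html>
  b.html:  <html><body><a href="a.html">to a</a></body></html>

  $ wget -r http://example.com/a.html

In my test wget fetched each page exactly once and then stopped, so it evidently remembers which URLs it has already visited within a run.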
You may want to look at the section of the man page detailing the "-nc" option. I use the options "-r -nc" when downloading a complex set of pages.
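
For example, a re-runnable download might look like this (the URL is a placeholder):

  $ wget -r -nc http://example.com/manual/index.html

As I understand it, "-nc" ("no clobber") tells wget to skip any file that already exists locally, so if the download is interrupted you can re-run the same command and it picks up roughly where it left off instead of re-fetching everything.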