Re: OT: autosave of google alert sites?

On Wed, Oct 13, 2004 at 11:24:07PM +0100, James Wilkinson wrote:
> Dave Stevens wrote:
> > I get google news alerts 
> > In an ideal world, I would be able to have a daily program (script?) run 
> > that would examine that day's alerts, resolve the URLs and save the pages. 
> 
> Alan Peery suggested:
> > 1. Use wget to retrieve the google news alert page to a file
> > 2. parse the file with PERL, gaining URLs
> > 3. wget those URLs, putting them into subdirectories based on the day 
> > your script is running
> 
> I don't think stage 2 is necessary: man wget suggests:
>        -i file
> 
> Take a good look at the options in the wget man page, especially the
> examples under --page-requisites. You may need --span-hosts.
> 
> (I must admit that I've never really tried using these options, so
> you'll need to experiment.)
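The three steps above can be sketched as a short shell script. The alert-page URL, the filenames, and the grep pattern standing in for the Perl step are all assumptions -- adjust them for the real Google alert page:

```shell
#!/bin/sh
# Sketch only: "http://example.com/alerts" is a placeholder for the
# real alert page, and the grep pattern is a rough URL matcher.
DAY=$(date +%Y-%m-%d)
mkdir -p "$DAY"

# 1. Retrieve the alert page to a file.
wget -q -O "$DAY/alerts.html" "http://example.com/alerts"

# 2. Extract the story URLs (grep stands in for the Perl parse).
grep -o 'http://[^"<> ]*' "$DAY/alerts.html" > "$DAY/urls.txt"

# 3. Fetch each URL, with page requisites, into the day's directory.
wget --page-requisites --span-hosts --convert-links \
     -P "$DAY" -i "$DAY/urls.txt"
```

The -i option reads the URL list from a file, as James noted, so no separate parsing pass is needed once the URLs are on their own lines.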

Note that many of the large sites of interest (Google, Yahoo, eBay,
etc.) have tricks to foil the automated slurping of data.

If you are not greedy, most wget tricks work, but if you trigger
their 'abuse' meter, things go sideways quickly. Beyond the bandwidth
abuse, there are also copyright issues. Perhaps not a problem for
your personal use, but do not 're-publish' inadvertently.
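To stay under an abuse meter, wget has options that space out and throttle requests; a minimal sketch (the rate and delay values are arbitrary choices, not recommendations from any site's policy):

```shell
# Throttled fetch: --wait pauses between retrievals, --random-wait
# varies that pause so the pattern looks less robotic, and
# --limit-rate caps bandwidth. Directory is named for the day.
wget --wait=5 --random-wait --limit-rate=50k \
     --page-requisites -P "$(date +%Y-%m-%d)" -i urls.txt
```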

-- 
	T o m  M i t c h e l l 
	May your cup runneth over with goodness and mercy
	and may your buffers never overflow.

