Todd Zullinger writes:
> Thank you for the detailed reply Sam. :) Excuse me for asking more questions (potentially dumb questions at that).
>
> Sam Varshavchik wrote:
>> I don't know if you've ever upgraded Fedora from one release to the next. The upgrade process is as slow as molasses, even though all the metadata is right there.
>
> No, I avoid upgrades. I always do fresh installs as a matter of habit. Point taken, though. I have read a lot of complaints about slow upgrades at the dependency resolution stage.
>
>> A few years ago the base distro was much smaller than it is now. The size of a typical Linux distro has really ballooned. Some of the algorithms in rpm scale horribly. It wasn't such a big deal when a typical Linux distro was only a few hundred packages, but now it's a few thousand packages, with dependencies that are much more complicated, and rpm is now really blowing apart at the seams.
>
> I haven't looked at the code, but is it rpm or yum that's really bogging down? Or aren't you making much of a distinction when you say rpm?
I'd break it down as about 70% yum vs. 30% rpm. Yum is really taking its sweet time figuring out what it needs to do. But even after it's done that, and downloaded everything, rpm still tends to spin its wheels if it has a large list of packages to chew through.
>> Furthermore, rpm, as is, does not implement remote repositories.
>
> Does it need to? Does dpkg do this?
>
>> With a large repository, like Fedora, even a compressed XML file is going to end up being rather huge. Then, you have to uncompress it and parse it. And, XML parsing is also not exactly a light task.
>
> But somehow or another you need to deal with a sizable chunk of data to make reasonable decisions regarding dependencies. The tough part about rpm development is trying to be backward compatible and still make forward progress. I don't envy the guys hacking on rpm.
You do /not/ need that much info in the first step. All you need is just a list of the names of the packages available on the remote repository. You reconcile that against the list of packages you already have downloaded the metadata for, and you then know what's new.
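Something along these lines -- a rough sketch only; the file names and the plain newline-delimited list format here are made up for illustration, not anything yum actually publishes:

    # Rough sketch of the first-step reconciliation. The file names and
    # the newline-delimited list format are made up for illustration.

    def load_list(path):
        # One package name+version per line, read into a set.
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())

    remote = load_list("remote-pkglist.txt")   # small, freshly downloaded
    local = load_list("cached-pkglist.txt")    # left over from the last sync

    new_or_updated = remote - local    # only these need their metadata fetched
    removed = local - remote           # cached metadata that can be discarded

    print("fetch metadata for %d packages" % len(new_or_updated))
    print("drop metadata for %d packages" % len(removed))

Two set differences, and you know exactly which packages' metadata to fetch and which cached entries to throw away.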
Meanwhile, primary.xml.gz is actually a voluminous XML file that contains not just each package's name and version, but also all sorts of extra info. And you have to download the whole thing every time. The current, sqlite-based version of yum does not help, either: I see that primary.sqlite.bz2 is about twice as large as primary.xml.gz.
So, all this talk of a database-based yum, and it turns out that you end up having to download /twice/ as much data as you used to? Someone explain to me what we're supposed to be gaining here.
Let's look at repodata. Right now, for Fedora updates (7/i386/repodata), we have this:
total 16904
drwxr-xr-x 2 root root    4096 2007-07-30 12:23 .
drwxr-xr-x 3 root root  159744 2007-07-30 12:23 ..
-rw-r--r-- 1 root root 2676161 2007-07-30 12:23 filelists.sqlite.bz2
-rw-r--r-- 1 root root 2703076 2007-07-30 12:22 filelists.xml.gz
-rw-r--r-- 1 root root 4603154 2007-07-30 12:23 other.sqlite.bz2
-rw-r--r-- 1 root root 5249048 2007-07-30 12:22 other.xml.gz
-rw-r--r-- 1 root root 1122990 2007-07-30 12:23 primary.sqlite.bz2
-rw-r--r-- 1 root root  732021 2007-07-30 12:22 primary.xml.gz
-rw-r--r-- 1 root root    1953 2007-07-30 12:23 repomd.xml
From what I see yum doing, it downloads the primary file, the other file, and possibly filelists, /every/ time a single package gets added to the repository. Even though 99% of the content is the same as before.
This does not look like an optimal design to me. You should /not/ have to download /everything/ every time a single package changes.
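For what it's worth, repomd.xml is already tiny and carries one checksum per metadata file, so a client can cheaply tell /that/ something changed -- but the granularity is the whole file, which is exactly the problem. A rough sketch of reading it (this parses the actual repomd.xml format; the cached/fresh file paths are made up):

    # Rough sketch: compare the per-file checksums in two repomd.xml
    # snapshots. The cached/fresh file paths are made up.
    import xml.etree.ElementTree as ET

    NS = "{http://linux.duke.edu/metadata/repo}"

    def checksums(path):
        # Map each metadata type (primary, filelists, other...) to its checksum.
        root = ET.parse(path).getroot()
        return dict((d.get("type"), d.find(NS + "checksum").text)
                    for d in root.findall(NS + "data"))

    old = checksums("cached-repomd.xml")   # kept from the previous sync
    new = checksums("fresh-repomd.xml")    # just downloaded, ~2K

    for mdtype, csum in new.items():
        if old.get(mdtype) == csum:
            print("%s unchanged, keep the cached copy" % mdtype)
        else:
            print("%s changed, re-download the whole file" % mdtype)

Note what's missing: nothing in there tells you /which/ packages changed, so the smallest unit you can re-download is the entire primary file.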
>> Remote package repositories could've been implemented much better. When I had some free time some time ago, I quickly hacked up a package manager for some of my internally-developed software. I found that I could do a similar kind of package metadata synchronization much more efficiently than yum/rpm.
>
> Isn't the harder part doing this in a way that doesn't completely break backward compatibility though? And then you have to spend a bunch of years adding new code to deal with the odd sorts of deps that packagers come up with in the wild (versioned obsoletes on a multilib system sounds fun :). Someone posted to the fedora-devel list a month or so ago saying they'd created a super fast depsolver using php and mysql. Once all of the various cases they'd missed were explained, things didn't go much further. (And no, I'm not at all suggesting that applies to your work -- it's obvious that you know more than that and that you actually created a working system. :)
In my case, I had no intention of bending over backwards in order to stay compatible with rpm. The whole point was to do this better: have a clean start and a clean design, and then later provide a shim layer that imports rpm's dependencies. And my design handles dependencies in a far more sophisticated way than rpm does. All the extra hackery that's done now for kernel packages, which support third-party out-of-tree kernel modules using a yum plug-in -- all of that is broomed away, and the additional logic becomes incorporated in the overall design, rather than being an aftermarket add-on hack. Ditto for the epoch hack -- my solution fixes the original underlying reason for having an epoch in the first place.
And, of course, php+mysql will always have a lot of overhead. No matter what you do there, you will always be left in the dust by carefully-designed, compiled C++ code. No matter how you twist and turn, you'll always have to: compile the php code, interpret the php code, generate SQL, send the SQL over a communication channel to the mysql db engine, have the SQL parsed and a query plan formed, have it processed by the mysql engine, and finally get the resulting data back. The C++ equivalent: run already-compiled code. Done.
>> [...] metadata file you want to download, you can use HTTP 1.1 partial chunk request feature to download just the bits of the metadata file that you want.
>
> Perhaps you should bribe someone to implement this in yum as a proof of concept?
Well, I can point them to how HTTP 1.1 chunking works, how to autodetect whether the HTTP server supports it, the logic to gracefully fall back to "Plan B" if the repository's HTTP server is running an old Apache without HTTP 1.1 support, and what to do next. That's about all I can do. I won't write the code; I have plenty of other coding work that keeps me busy.
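What I'm describing is most naturally done with an HTTP 1.1 byte-range request: the client sends a Range header, a cooperating server answers 206 Partial Content, and an old server just sends the whole file with a plain 200 -- which is your "Plan B". A rough sketch (the URL is made up):

    # Rough sketch: ask for just the first 64K of a metadata file.
    # A server that supports ranges answers 206 Partial Content; one
    # that doesn't sends the whole file with a 200. The URL is made up.
    import urllib.request

    url = "http://repo.example.com/repodata/primary.xml.gz"
    req = urllib.request.Request(url, headers={"Range": "bytes=0-65535"})

    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.status == 206:
            print("server honored the range: got %d bytes" % len(body))
        else:
            print("no range support, fell back to a full download: %d bytes"
                  % len(body))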
>> But then, after all is said and done, no amount of tweaks to rpm can compensate for stupid and broken packaging. Right now, due to indirect dependencies, grub requires *GTK* runtime libraries to be installed. On my headless machine, I now have to plop down a crapload of x.org and GTK RPMs, because grub requires them, due to its intermediate dependencies.
>
> Yeah. This was caused by policy more than by incompetence. The folks at Red Hat's legal department asked that all of the trademarked logos be kept in one package, for easier tracking and removal by downstream users of Fedora's packages (or something like that).
It's not that trademarked logos must be kept in one package. It's just that the package, for some reason I still can't fathom, must depend on the gtk2 code libraries. Why would a package that supposedly contains nothing more than a bunch of logo image files have a hard dependency on a package that contains system libraries? That just does not compute.
I haven't really looked at it, but the probable story is that gtk2-engine or the gnome-themes package also includes some shell script that the logos package needs for some reason. So rather than separating it out into a subpackage, which would be the proper thing to do, you have to install the whole bloody thing; and because gtk2 requires all the xorg core libraries, those end up getting sucked down the drain as well.
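If anyone wants to trace such a chain on their own box, something like this rough sketch would do it -- it walks the installed Requires graph breadth-first by shelling out to rpm (assumes an rpm-based system; grub and gtk2 as in my example above):

    # Rough sketch: walk the installed Requires graph breadth-first to
    # find how one package (e.g. grub) ends up dragging in another
    # (e.g. gtk2). Shells out to rpm; assumes an rpm-based system.
    import subprocess
    from collections import deque

    def rpm_lines(args):
        out = subprocess.run(["rpm"] + args, capture_output=True, text=True)
        return [l for l in out.stdout.splitlines() if l.strip()]

    def requires(pkg):
        # Direct dependencies: rpm reports capabilities, which we map
        # back to the installed packages that provide them.
        deps = set()
        for line in rpm_lines(["-q", "--requires", pkg]):
            cap = line.split()[0]
            if cap.startswith("rpmlib("):
                continue   # rpm-internal capabilities, not real packages
            for provider in rpm_lines(["-q", "--whatprovides", cap]):
                if not provider.startswith("no package"):
                    deps.add(provider)
        return deps

    def chain(start, target_prefix):
        # Breadth-first search; returns the first dependency path found.
        parent = {start: None}
        queue = deque([start])
        while queue:
            pkg = queue.popleft()
            if pkg.startswith(target_prefix):
                path = []
                while pkg is not None:
                    path.append(pkg)
                    pkg = parent[pkg]
                return list(reversed(path))
            for dep in requires(pkg):
                if dep not in parent:
                    parent[dep] = pkg
                    queue.append(dep)
        return None

    print(chain("grub", "gtk2"))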
Although this does not have any direct relevance to the overall issue of rpm's design, it is demonstrative of the same kind of inefficient inattention to detail.