Rethinking how we make ready-to-use operating system images.

Unpacking RPM: package names

Have you ever noticed that Fedora keeps getting bigger and package names keep getting.. gnarlier? Let’s have a look!

Here’s some data I gathered about the number of RPMs in each Fedora release[1] and the average, median, and longest package names:

Version  | Total RPMs | avg / med | max | Longest package name(s)
Fedora  1:  1477 RPMs,  10.1 /  9,   31: XFree86-ISO8859-14-100dpi-fonts, redhat-config-securitylevel-tui
Fedora  2:  1647 RPMs,  10.4 / 10,   32: xorg-x11-ISO8859-14-100dpi-fonts
Fedora  3:  1883 RPMs,  10.3 / 10,   31: selinux-policy-targeted-sources, system-config-securitylevel-tui
Fedora  4:  1981 RPMs,  11.4 / 10,   35: jakarta-commons-collections-javadoc
Fedora  5:  2422 RPMs,  11.8 / 11,   35: jakarta-commons-collections-javadoc
Fedora  6:  2931 RPMs,  12.3 / 12,   49: jakarta-commons-collections-testframework-javadoc
Fedora  7:  9334 RPMs,  12.3 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora  8: 10657 RPMs,  12.4 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora  9: 12444 RPMs,  12.5 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 10: 14303 RPMs,  12.6 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 11: 16577 RPMs,  12.8 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 12: 19122 RPMs,  13.2 / 12,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 13: 20840 RPMs,  13.4 / 13,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 14: 22161 RPMs,  13.5 / 13,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 15: 24085 RPMs,  13.6 / 13,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 16: 25098 RPMs,  13.7 / 13,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 17: 27033 RPMs,  13.8 / 13,   50: php-pear-Structures-DataGrid-DataSource-DataObject
Fedora 18: 33868 RPMs,  14.6 / 14,   57: gnome-shell-extension-sustmi-historymanager-prefix-search
Fedora 19: 36253 RPMs,  14.7 / 14,   57: gnome-shell-extension-sustmi-historymanager-prefix-search
Fedora 20: 38597 RPMs,  14.9 / 14,   57: gnome-shell-extension-sustmi-historymanager-prefix-search
Fedora 21: 42816 RPMs,  15.0 / 15,   57: gnome-shell-extension-sustmi-historymanager-prefix-search
Fedora 22: 44762 RPMs,  15.2 / 15,   58: perl-Archive-Extract-tbz-Archive-Tar-IO-Uncompress-Bunzip2
Fedora 23: 46074 RPMs,  15.3 / 15,   58: perl-Archive-Extract-tbz-Archive-Tar-IO-Uncompress-Bunzip2
Fedora 24: 49722 RPMs,  15.6 / 15,   61: golang-github-matttproud-golang_protobuf_extensions-unit-test
Fedora 25: 51669 RPMs,  15.8 / 15,   61: golang-github-matttproud-golang_protobuf_extensions-unit-test
Fedora 26: 53912 RPMs,  16.0 / 15,   64: golang-github-cloudfoundry-incubator-candiedyaml-unit-test-devel
Fedora 27: 54801 RPMs,  16.1 / 15,   64: golang-github-cloudfoundry-incubator-candiedyaml-unit-test-devel

Sure enough:

  1. Fedora gets bigger with every release (especially F7, when we merged Core and Extras), and
  2. The average package name gets longer with every release.[2]
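If you want to reproduce the avg / med / max columns yourself, the arithmetic is simple. Here's a minimal Python sketch; the function name and the four sample names are just for illustration, and real input would be the full list of package names from a release's repo metadata:

```python
from statistics import mean, median

def name_length_stats(names):
    """Summarize package-name lengths: count, average, median, max, longest names."""
    lengths = [len(n) for n in names]
    longest = max(lengths)
    return {
        "total": len(names),
        "avg": round(mean(lengths), 1),
        "med": median(lengths),
        "max": longest,
        # There can be ties for "longest", as in the Fedora 1 and Fedora 3 rows.
        "longest": sorted(n for n in names if len(n) == longest),
    }

# Tiny hand-picked sample, not real repo data.
sample = ["bash", "glibc", "python3-requests", "jakarta-commons-collections-javadoc"]
stats = name_length_stats(sample)
print(stats)
```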

OK, great. But why do package names keep getting longer? Well, the proximate cause seems obvious: after FC4, the longest names are all things that have been repackaged from some other software packaging ecosystem: Jakarta/Apache Commons for Java, PEAR for PHP, Gnome Shell Extensions for gnome-shell, Perl’s CPAN, and golang’s builtin module system. So we’re mapping the other ecosystem’s module namespace into RPM’s (single, flat) package namespace, and then adding a prefix to indicate which ecosystem it came from.

Another thing that makes package names longer is subpackages, like -devel, -javadoc, and -unit-test above. In these cases we’re adding a suffix to the package name to indicate that this part of the package is only needed for a particular purpose. For example: -devel means “install this if you’re doing development”, -javadoc means “install this if you want to read Java documentation”, -unit-test means “install this if you want to run unit tests”, etc.

So names keep getting longer because we keep adding suffixes and prefixes and stuffing entire other software module namespaces into our package names. But why do we have to cram everything into the names? Well, because RPM only looks at package names[3]. It doesn’t consider any other metadata when choosing between packages, even though there’s plenty of other relevant data you might want to consider:

  • Source software ecosystem: “this is a rubygem”
  • Package build environment: “this was built using python3, not python2”
  • Major-version API changes: “this can be installed in parallel with the new version”
  • Package build options: “this was built with debugging turned on”
  • File-level metadata: “these files are only needed for development”

And so on. RPM doesn’t give us a way to add extended/new metadata, so we’ve resorted to encoding all that metadata in the package name, with prefixes and suffixes: rubygem-, python3-, compat-, -debug, -devel.
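To make the "metadata hiding in the name" idea concrete, here's a rough Python sketch that peels known markers off a package name and turns them back into key/value metadata. The marker tables are a tiny hand-picked subset for illustration, and the metadata keys (ecosystem, build_env, and so on) are made-up labels, not anything RPM defines:

```python
# Illustrative marker tables only: a small, hand-picked subset, not an official list.
PREFIXES = {
    "rubygem": ("ecosystem", "rubygem"),
    "python3": ("build_env", "python3"),
    "compat": ("variant", "compat"),
}
SUFFIXES = {
    "debug": ("build_option", "debug"),
    "devel": ("purpose", "devel"),
}

def decode_name(name):
    """Peel known prefix/suffix markers off an RPM name; return (base, metadata)."""
    words = name.split("-")
    meta = {}
    # Always leave at least one word behind as the base name.
    while len(words) > 1 and words[0] in PREFIXES:
        key, value = PREFIXES[words.pop(0)]
        meta[key] = value
    while len(words) > 1 and words[-1] in SUFFIXES:
        key, value = SUFFIXES[words.pop()]
        meta[key] = value
    return "-".join(words), meta

print(decode_name("python3-requests"))
print(decode_name("compat-openssl10-devel"))
```

Note that this only works because a human wrote the lookup tables. Nothing in the name itself says which words are markers and which are part of the project's actual name.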

So! Let’s take a swing at that first question from the intro:

What’s this actually trying to do?

To try to figure out what things people are actually trying to accomplish with ever-longer RPM names, let’s look at the most commonly-used “words”.

If you’re on a Fedora system and you’d like to play along, here’s a one-liner that’ll generate a list of words, sorted by number of occurrences:

dnf repoquery --qf '%{NAME}' --repo="fedora" --repo="updates" \
    | tr '-' '\n' | sort | uniq -c | sort -n | tac | less

I also wrote a little script to do slightly fancier analysis of the words in RPM names - keeping word pairs like “apache-commons” as a single word, counting each word’s use as a prefix, as a suffix, and per-specfile[4], and labeling the “meaning” of common words.
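The core of that kind of analysis fits in a few lines of Python. What follows is a rough approximation, not the actual script: the COMPOUND_WORDS list and the function names are invented for illustration, and the per-specfile counts and "meaning" labels are left out entirely.

```python
from collections import Counter

# Hypothetical list of word pairs to keep intact; the real list is not shown here.
COMPOUND_WORDS = ["apache-commons", "php-pear", "gnome-shell"]

def split_words(name, compounds=COMPOUND_WORDS):
    """Split an RPM name on '-', keeping known word pairs as a single word."""
    parts = name.split("-")
    words = []
    i = 0
    while i < len(parts):
        pair = "-".join(parts[i:i + 2])
        if pair in compounds:
            words.append(pair)
            i += 2
        else:
            words.append(parts[i])
            i += 1
    return words

def count_words(names):
    """Count each word's total use, its use as a prefix, and its use as a suffix."""
    total, prefix, suffix = Counter(), Counter(), Counter()
    for name in names:
        words = split_words(name)
        for word in dict.fromkeys(words):  # count each word once per package
            total[word] += 1
        if len(words) > 1:                 # a one-word name is neither prefix nor suffix
            prefix[words[0]] += 1
            suffix[words[-1]] += 1
    return total, prefix, suffix

# Small hand-made sample; real input would come from the repoquery above.
names = ["php-pear-Mail", "php-pear-Net-SMTP", "apache-commons-io", "foo-php", "php"]
total, prefix, suffix = count_words(names)
for word, n in total.most_common(3):
    print(word, n, prefix[word], suffix[word])
```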

Here are the results from my script:

Top 100 most common words/prefixes/suffixes in RPMs and specfiles (F27, x86_64)
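word                 | specfiles | rpms | as prefix | as suffix | meaning
perl                 |      2953 | 3140 |      3073 |        61 | package prefix
nodejs               |      1129 | 1399 |      1398 |         0 | package prefix
python               |      1069 | 1210 |      1043 |       125 | programming language
rubygem              |       635 | 1261 |      1261 |         0 | package prefix
php                  |       620 |  861 |       825 |        10 | programming language
golang               |       465 |  951 |       919 |         0 | package prefix
ghc                  |       433 | 1055 |      1044 |         1 | package prefix
java                 |       155 |  411 |        81 |       104 | programming language
R                    |       147 |  174 |       170 |         3 | programming language
go                   |       137 |  250 |         5 |         6 | programming language
rust                 |       132 |  139 |       137 |         0 | programming language
hunspell             |       126 |  135 |       128 |         2 | project name
horde                |       112 |  114 |         0 |         2 | project name
erlang               |       111 |  163 |       161 |         1 | programming language
ocaml                |       105 |  216 |       209 |         4 | programming language
drupal7              |        99 |  100 |        99 |         0 | project name
jboss                |        79 |  156 |       147 |         3 | project name
eclipse              |        76 |  213 |       210 |         1 | project name
XML                  |        71 |   72 |         0 |         6 | data format
HTML                 |        68 |   70 |         0 |         6 | data format
ruby                 |        65 |   99 |        55 |        25 | programming language
php-pear             |        62 |   62 |        61 |         0 | package prefix
js                   |        60 |   73 |        45 |        15 | programming language
globus               |        60 |  150 |       148 |         2 | project name
lua                  |        58 |   89 |        60 |        16 | programming language
json                 |        56 |   75 |        15 |        24 | data format
emacs                |        51 |   79 |        64 |        12 | project name
gap-pkg              |        51 |   53 |        53 |         0 | package prefix
trytond              |        49 |   53 |        52 |         0 | project name
php-pecl             |        48 |   57 |        57 |         0 | package prefix
aspell               |        46 |   47 |        44 |         2 | project name
xorg-x11             |        41 |   75 |        75 |         0 | project name
apache-commons       |        39 |   92 |        92 |         0 | project name
octave               |        37 |   39 |        34 |         3 | programming language
xml                  |        37 |   62 |        16 |        22 | data format
tcl                  |        37 |   55 |        32 |        20 | programming language
el                   |        35 |   42 |         0 |        34 | data format
mono                 |        33 |   68 |        40 |         2 | programming language
jenkins              |        33 |   70 |        64 |         3 | project name
vim                  |        33 |  106 |        93 |        10 | project name
gnome-shell          |        33 |   46 |        44 |         1 | project name
c                    |        29 |   67 |         3 |        22 | programming language
glassfish            |        28 |  101 |        98 |         0 | project name
gimp                 |        25 |   45 |        40 |         4 | project name
plexus               |        25 |   53 |        51 |         2 | project name
sharp                |        24 |   46 |         0 |        23 | programming language
felix                |        21 |   40 |        40 |         0 | project name
coin-or              |        21 |   61 |        61 |         0 | project name
jackson              |        21 |   51 |        47 |         1 | project name
glite                |        21 |   39 |        39 |         0 | project name
sblim                |        20 |   41 |        41 |         0 | project name
oslo                 |        19 |   97 |         0 |         0 | project name
libvirt              |        18 |   85 |        59 |        12 | project name
geronimo             |        17 |   35 |        35 |         0 | project name
springframework      |        15 |   50 |        49 |         0 | project name
nagios               |        15 |   82 |        74 |         3 | project name
NetworkManager       |        13 |   37 |        35 |         0 | project name
jetty                |        13 |   76 |        75 |         0 | project name
qpid                 |        10 |   76 |        59 |         3 | project name
yum                  |        10 |   41 |        38 |         1 | project name
gcc                  |         9 |   94 |        73 |         5 | project name
root                 |         7 |  110 |       102 |         7 | project name
pulp                 |         6 |   47 |        34 |         0 | project name
aws-sdk              |         6 |   80 |        71 |         2 | project name
libreoffice          |         5 |  153 |       152 |         0 | project name
geany                |         5 |   44 |        41 |         0 | project name
qemu                 |         4 |   59 |        55 |         3 | project name
shrinkwrap           |         4 |   50 |        48 |         1 | project name
google-noto          |         3 |  146 |       146 |         0 | project name
fence                |         3 |   66 |        66 |         0 | project name
collectd             |         3 |   74 |        66 |         0 | project name
glibc                |         2 |  201 |       200 |         0 | project name
asterisk             |         2 |   41 |        40 |         0 | project name
pcp                  |         2 |  100 |        95 |         4 | project name
tesseract            |         2 |  114 |       107 |         2 | project name
arquillian           |         2 |   49 |        38 |         0 | project name
uwsgi                |         1 |   98 |        96 |         1 | project name
gcompris             |         1 |   42 |        41 |         0 | project name
fawkes               |         1 |   73 |        72 |         0 | project name
fusionforge          |         1 |   38 |        37 |         0 | project name
soletta              |         1 |   36 |        35 |         0 | project name
asterisk-sounds-core |         1 |   90 |        90 |         0 | project name
gambas3              |         1 |   92 |        92 |         0 | programming language
gallery2             |         1 |   76 |        75 |         0 | project name
opensips             |         1 |   59 |        57 |         1 | project name
openrdf-sesame       |         1 |   74 |        74 |         0 | project name
texlive              |         1 | 5946 |      5928 |         0 | project name
vdsm                 |         1 |   44 |        43 |         0 | project name
autocorr             |         1 |   33 |        33 |         0 | project name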

(full data set: rpm-name-word-counts.csv)

Looking over this list, I’d say there are four main features that we’re trying to hack into RPM with all our name-mangling: namespaces, variant builds, new package relationships & metadata, and extended file-level metadata.

1. Language/project/vendor namespace prefixes

As a wise man once observed, “Namespaces are one honking great idea – let’s do more of those!”[5] Sure enough, the 10 most common prefixes are our ad-hoc namespace markers for modules repackaged from the native packaging systems of some popular programming languages: perl, python, nodejs, rubygem, ghc, golang, and php. Oh, and texlive, which is kind of a packaging system but also a 220,000-line rpmbuild stress test[6].

Anyway, RPM doesn’t actually have a way to create separate namespaces for things like that, so in reality every package gets crammed into one big heap and the user gets to figure out the rest - which is why “github” now shows up as the 15th most common word overall (thanks, golang!)

This also means that regardless of whether or not it uses texlive or Node.js or Ruby or Haskell or whatever, every Fedora system in the world still downloads complete metadata for all 10,000+ of those packages every time it runs DNF. Yikes.

One other note: python2 and python3 are actually pulling double duty. They’re kinda language prefixes, but they’re also variant-build markers!

2. Parallel-installable variant builds

There’s a lot of times that we want multiple variant builds of a project to be available and/or parallel-installable, but we can’t do that without modifying the RPM name.

Most commonly, we want to build the same source tarball more than once, using a different toolchain or build options. Since RPM doesn’t care about the build environment or build options when comparing packages, we have to change the name to make RPM consider them different builds - and that’s where we get -debug, -static, python2-/python3-, -qt4/-qt5, mingw64-/mingw32-, and so on.

Other times we’re building two different versions of the same sources - usually the newest one and an older one that’s required by some other package. RPM technically allows you to install multiple versions of the same package (as long as the package contents don’t overlap) but the default behavior (as enforced by yum and dnf) is to replace older versions of packages with newer ones. So rather than dealing with that, we add -compat or compat- to the older version to change its name, thus making RPM consider it a different package.

Interestingly, it seems like we’re not consistent in whether variant markers are prefixes (like python2- and mingw64-) or suffixes (like -debug or -static). Which isn’t surprising - we’re using these words in ways that are human-meaningful, not machine-parseable, so naturally we use them in ways that mirror human language.

In fact, as we see with python and friends: when a programming language name is used as a suffix, it typically has a different meaning: python-foo is probably “foo” (written in Python), but foo-python is probably Python bindings to “foo”. We’re using the package name to provide (informal) information about the relationship between two packages.

It turns out we do a lot of this!

3. New package relationships & metadata

Sure, we have soft dependencies now, but we still use a lot of unwritten conventions that imply certain relationships between projects. One interesting example is the different ways we use plugin/plugins - there’s different “phrasings” that can have slightly different meanings:

  • PROJ-plugin-NAME: A plugin for PROJ named NAME - yum-plugin-versionlock, gedit-plugin-commander, uwsgi-plugin-zergpool
  • PROJ-plugin-THING: A plugin for PROJ to handle/support THING - gedit-plugin-git, uwsgi-plugin-v8, abrt-plugin-bodhi
  • PROJ-THING-plugin: Java software seems to prefer this order - maven-stapler-plugin, jenkins-ldap-plugin
  • PROJ-plugins-GROUP: A named GROUP of plugins for PROJ - dnf-plugins-core, gstreamer-plugins-good, gedit-plugins-data

Sometimes you just get an opaque NAME that suggests something about the purpose of the plugin, sometimes the THING is a concept or protocol, like ldap, and sometimes it’s a specific piece of software, like git or stapler. These would all be useful pieces of data for packaging software to have! But instead we can only pick one of those pieces of data, and then we encode it into the RPM name in a way that’s not machine-readable. Wouldn’t it be nice if these relationships were formalized, and also maybe we could store that metadata somewhere other than the package name?
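Here's a small Python sketch of what "parsing" those plugin phrasings looks like in practice. The regexes and the style labels (group, named, trailing) are my own shorthand for the patterns listed above, not an established convention:

```python
import re

# One pattern per phrasing from the list above; the first match wins.
PLUGIN_PATTERNS = [
    ("group", re.compile(r"^(?P<proj>.+?)-plugins-(?P<what>.+)$")),
    ("named", re.compile(r"^(?P<proj>.+?)-plugin-(?P<what>.+)$")),
    ("trailing", re.compile(r"^(?P<proj>[^-]+)-(?P<what>.+)-plugin$")),
]

def parse_plugin_name(name):
    """Return (style, project, what) for a plugin-style RPM name, or None."""
    for style, pattern in PLUGIN_PATTERNS:
        m = pattern.match(name)
        if m:
            return style, m.group("proj"), m.group("what")
    return None

for name in ["yum-plugin-versionlock", "gedit-plugin-git",
             "maven-stapler-plugin", "dnf-plugins-core"]:
    print(name, "->", parse_plugin_name(name))
```

The interesting part is what this can't do: gedit-plugin-commander and gedit-plugin-git parse identically, so there's no way to tell from the name whether the last word is an opaque plugin NAME or a THING the plugin supports.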

You could argue that the informal metadata provided by plugin/plugins could be formalized using the Enhances: or Supplements: RPM tags, but there’s plenty of other examples where we’re using naming conventions to establish similar informal relationships between packages and higher-level concepts, like projects or languages - or basic concepts like web and gui.

I think this is one of the fundamental shortcomings of RPM’s dependency system: the only thing it lets you express easily is whether a given package requires another package[7] when installed[8]. It has no inherent concept of anything other than a package, or of different sub-parts of a package, or of any purpose for those parts other than installation.

And that’s what we use subpackages for!

4. File-level metadata / tags and purposes

So! If we want to talk about something other than the entire build output, we have to divide it into subpackages. If we need to talk to RPM about one specific file, we have to put it into its own subpackage[9].

When we break a build into subpackages, we’re usually doing it because part of the build is “optional” - that is, it’s not required for the default assumed “purpose”, which is basically “runtime”.

So to differentiate these “optional” parts from the “main” part, we once again turn to.. RPM name mangling!

Sometimes - most commonly - we mark the parts by what type of file they are: -devel, -doc, -javadoc, -help. Or we mark the purpose of those files: -tools, -utils, -unit-test.

Now, most systems probably don’t need unit tests installed. But what about documentation and help files? Or development headers? Alas, since RPM has no concept of file types or purposes, we have no way to tell it what kind of parts we might want - and so we have to manually install -doc and -help and -devel packages, or any other “optional” pieces.

And since this is all informal, it’s pretty inconsistent. If something has optional CLI tools, how do you find them? Is it under -cli, or -console, or -tools, or -utils, or -extras? Is an optional GUI tool written in GTK3 found under -gtk or -gtk3 or -gui?

We’ve also got various competing traditions for splitting up packages like git or vim or libreoffice that have a bunch of optional parts with some shared common code, but a typical “default” set of things that most people want:

  • LibreOffice: The libreoffice package itself is empty, but it Requires the standard set of apps: -calc, -draw, -impress, -writer, and -base, which isn’t the “base” set of apps or libraries - it’s a database frontend (ha ha!). They all depend on libreoffice-core, which has all the core libraries and tools and such.
  • Git: The git package only contains a few commonly-used utilities - git submodule, git am, git instaweb, and a couple others. git-core is the “minimal” core (which has everything else); other tools are in other git- packages.
  • vim: There is no default vim package, but you can pick vim-minimal or vim-enhanced, which both require vim-common and vim-filesystem.

It’s all kind of a mess, and you kind of just have to guess which pieces might be useful or relevant to you - would you be able to guess from the package names alone that gitk and gitg are git GUIs, and while gitweb is a web frontend, there’s already a web frontend (git instaweb) in the git package itself?

So what have we learned?

I think looking at all the weird stuff we’re doing with RPM names shows us a few things that our ideal software packaging ecosystem should handle:

  1. External packaging/module systems should probably have their own namespaces
  2. The build environment, build options, and build target are relevant pieces of metadata about a build, and should be part of its identity
  3. Packages should be able to declare different kinds of relationships between each other - formal and informal
  4. There are a lot of relationships other than “required to install/run” that people might want to know about
  5. It’s enormously helpful to be able to provide metadata at the level of individual files
  6. It’s also helpful if you can apply multiple tags to the same thing
  7. Using common type/purpose tags is a great idea, but users should definitely be able to define new tags when needed
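Purely as a thought experiment, here's what a few of those points might look like as a data structure. Everything in this Python sketch is hypothetical (the class names, the fields, the tag vocabulary, the file paths); it isn't a proposal for an actual format, just a way to see namespaces, build metadata, and multi-tag file-level metadata side by side:

```python
from dataclasses import dataclass, field

@dataclass
class FileEntry:
    path: str
    tags: set = field(default_factory=set)   # free-form tags, e.g. {"devel"} or {"doc", "help"}

@dataclass
class Build:
    namespace: str                            # e.g. "rubygem" or "python"; "" for plain packages
    name: str
    version: str
    build_env: dict = field(default_factory=dict)   # toolchain, options, target, ...
    files: list = field(default_factory=list)

    def files_for(self, *wanted):
        """Paths carrying any of the wanted tags; untagged files count as 'runtime'."""
        wanted = set(wanted)
        return [f.path for f in self.files if (f.tags or {"runtime"}) & wanted]

# Made-up example build with three files.
foo = Build(
    namespace="",
    name="foo",
    version="1.0",
    build_env={"debug": False},
    files=[
        FileEntry("/usr/lib64/libfoo.so.1"),
        FileEntry("/usr/include/foo.h", {"devel"}),
        FileEntry("/usr/share/doc/foo/README", {"doc"}),
    ],
)
print(foo.files_for("runtime"))
print(foo.files_for("devel", "doc"))
```

With something like this, "give me the runtime bits plus docs" is a query over tags instead of a guess about which -doc or -help subpackage exists.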

You can probably see a theme here: more flexible metadata, and more of it! But how do we do that without overloading or breaking the stuff we already have? Well, to answer that question, I think we need to take a closer look at the metadata RPM and DNF already use, and how exactly it’s stored and used.

So, coming soon: join me as I dig deep into the horrors of the RPM header format itself! And bring a stiff drink, ‘cuz we’re both gonna need it.

  1. The table data is only for the x86_64 ‘Everything’ repo. The counts are slightly different if you include updates, but you get the point. 

  2. Except FC3, mostly because we renamed all the xorg-x11-XXX-fonts packages to fonts-xorg-XXX.

  3. Technically it only cares about package Provides - see David’s excellent RPM Dependencies post if you want to know more. 

  4. The “per-specfile” count is: “how many source packages generate a subpackage with this word in it?” Two reasons for this: first, it cuts down on noise from packages like texlive or pmda that generate hundreds (or thousands!) of subpackages. More importantly, it’s a better proxy for the real question, which is: what are the users doing? What words do packagers use most commonly when describing the stuff that gets built? 

  5. python -c 'import this', also known as PEP 20

  6. Here’s a link to texlive’s RPM sources if you’re curious. Fun fact: each changelog entry is repeated across all subpackages, which means that about 160MB of the 2.5GB(!) of RPMs produced by each new texlive build is just 5,931 copies of the new changelog. That’s.. good, right? 

  7. Again, while technically RPM lets you do Requires: <FILENAME>, that just resolves to “whichever package(s) provide <FILENAME>”. You’ll still get the whole package installed, even if you literally only require that one file. 

  8. Okay, it also has BuildRequires:, because rpmbuild has to care about building packages so that rpm can install them. But that’s it! 

  9. There are currently 1,348 packages in F27 that contain exactly one file. Fun fact: the package payload is smaller than the RPM headers for about 2/3 of those (838/1348). 

Written by Will Woods on April 2, 2018