Tomas Mlcoch edited this page Jun 4, 2013 · 52 revisions

A set of tools that generate and apply the differences between an old and a new version of repodata.

Idea

  • Repodata can be large (tens or hundreds of megabytes).
  • Changes between two versions of repodata can be very small (e.g., a single deleted package).
  • Let's make one tool that detects the changes between two versions of repodata and generates their delta (diff), and another tool that applies this delta to the old repodata.

Design ideas

  • Be as compatible as possible.
  • Avoid significant changes in repomd.xml or other repodata files.
  • A repository delta is itself a repository.
  • Plugins - The generic DeltaRepo can create and apply delta files for primary, filelists and other. Deltas for other repodata files (groupfile, prestodelta, ...) could be handled by plugins.

DeltaRepo

The file structure of a single delta repo looks like this:

repodata/
  |-primary.xml.gz
  |-filelists.xml.gz
  |-other.xml.gz
  |-removed.xml.gz
  |-repomd.xml

Here primary, filelists and other are in the classical format, but they contain only the packages that were changed or added.

removed.xml is described below.
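The split between the delta's metadata files and removed.xml can be sketched as a set difference. This is only a minimal sketch, not the tool's implementation: `Pkg` and `split_packages` are hypothetical names, and identifying a package by (pkgId, location_href, location_base) is an assumption borrowed from the RepoId calculation.

```python
from collections import namedtuple

# Hypothetical minimal package record; real metadata carries many more fields.
Pkg = namedtuple("Pkg", ["pkgId", "location_href", "location_base"])


def _sort_key(pkg):
    # location_base may be None, so normalize it before comparing.
    return (pkg.pkgId, pkg.location_href, pkg.location_base or "")


def split_packages(old_repo, new_repo):
    """Return (added_or_changed, removed_or_changed) for two package lists."""
    old_set = set(old_repo)
    new_set = set(new_repo)
    # Only in the new repo: goes into the delta's primary/filelists/other.
    added_or_changed = sorted(new_set - old_set, key=_sort_key)
    # Only in the old repo: gets listed in removed.xml.
    removed_or_changed = sorted(old_set - new_set, key=_sort_key)
    return added_or_changed, removed_or_changed


old = [Pkg("aaa", "packages/f/foo.rpm", None),
       Pkg("bbb", "packages/b/bar.rpm", None)]
new = [Pkg("aaa", "packages/f/foo.rpm", None),
       Pkg("ccc", "packages/b/bar.rpm", None)]
added, removed = split_packages(old, new)
```

In this sample bar.rpm was rebuilt, so its new record lands in `added` and its old record in `removed`, while the unchanged foo.rpm appears in neither.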

removed.xml.gz

This file contains:

  • Packages that were changed or removed.
  • Repodata files (files listed in repomd.xml) that were removed.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<removed>
  <packages>
    <location href="packages/b/bar.rpm" base="http://barserver.com/repo/os" />
    <location href="packages/f/foo.rpm" />
  </packages>
  <files>
    <location href="repodata/pkgorigins.gz" />
  </files>
</removed>
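A consumer could read such a file with the standard library XML parser. The following is a minimal sketch under the assumption that the file has already been decompressed and follows the layout of the example; `parse_removed` is a hypothetical helper name, and no schema validation is done.

```python
import xml.etree.ElementTree as ET

# Sample content mirroring the example layout (well-formed variant).
REMOVED_XML = """<?xml version="1.0" encoding="UTF-8"?>
<removed>
  <packages>
    <location href="packages/b/bar.rpm" base="http://barserver.com/repo/os" />
    <location href="packages/f/foo.rpm" />
  </packages>
  <files>
    <location href="repodata/pkgorigins.gz" />
  </files>
</removed>
"""


def parse_removed(xml_text):
    """Return (packages, files) listed in a removed.xml document.

    packages is a list of (href, base) tuples; base is None when absent.
    files is a list of href strings.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    packages = [(loc.get("href"), loc.get("base"))
                for loc in root.findall("./packages/location")]
    files = [loc.get("href") for loc in root.findall("./files/location")]
    return packages, files


packages, files = parse_removed(REMOVED_XML)
```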

TODO: Relax NG schema

repomd.xml

A classical repomd.xml, but with a DeltaRepoId (where to store the RepoId/DeltaRepoId in repomd.xml is examined below).

DeltaRepoId

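DeltaRepoId is the identifier of a DeltaRepo. It consists of the RepoId of the old repository and the RepoId of the new repository, joined by a hyphen: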

repoid_of_old_repo-repoid_of_new_repo

E.g.: 5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d-b8d60e74c38b94f255c08c3fe5e10c166dcb52f2c4bfec6cae097a68fdd75e74
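Building and splitting such an identifier is straightforward. The sketch below uses hypothetical helper names and assumes that RepoIds are hex digests, so they never contain a hyphen themselves.

```python
def make_deltarepoid(old_repoid, new_repoid):
    """Join two RepoIds into a DeltaRepoId."""
    return "%s-%s" % (old_repoid, new_repoid)


def split_deltarepoid(deltarepoid):
    """Split a DeltaRepoId back into (old_repoid, new_repoid)."""
    old_repoid, sep, new_repoid = deltarepoid.partition("-")
    if not sep or not old_repoid or not new_repoid:
        raise ValueError("not a DeltaRepoId: %r" % deltarepoid)
    return old_repoid, new_repoid


OLD = "5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d"
NEW = "b8d60e74c38b94f255c08c3fe5e10c166dcb52f2c4bfec6cae097a68fdd75e74"
delta_id = make_deltarepoid(OLD, NEW)
```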

RepoId

RepoId is the identifier of every non-delta repository.

It is a SHA256 hash calculated from all packages in the repository.

Calculation algorithm:

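import hashlib

def calculate_repoid(repo):
    # One string per package: pkgId + location_href + location_base.
    pkgids = []
    for pkg in repo:
        pkgids.append("%s%s%s" % (pkg.pkgId, pkg.location_href, pkg.location_base))
    # Sort so that the result does not depend on the package order.
    pkgids.sort()
    repoid = hashlib.sha256()
    for pkgid in pkgids:
        repoid.update(pkgid.encode("utf-8"))
    return repoid.hexdigest()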

Extra repodata files

Extra repodata files are all repodata files other than:

  • primary.xml
  • filelists.xml
  • other.xml
  • primary.sqlite
  • filelists.sqlite
  • other.sqlite
  • repomd.xml

Examples of extra repodata files:

  • comps.xml
  • deltainfo
  • pkgorigins
  • prestodelta.xml
  • ...

For these files no delta is generated!

If an extra file was removed, it is listed in removed.xml.

If an extra file was added, it is included in the DeltaRepo.

If an extra file was changed, it is listed in removed.xml and its new version is included in the DeltaRepo.
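The three rules above can be sketched as a small classification function. This is an illustration only: `classify_extra_files` is a hypothetical name, and describing each extra file by a name-to-checksum mapping (and detecting a change by comparing checksums) is an assumption.

```python
def classify_extra_files(old_files, new_files):
    """Apply the removed/added/changed rules to extra repodata files.

    old_files and new_files map a file name to its checksum.
    Returns (listed_in_removed, included_in_delta), both sorted by name.
    """
    listed_in_removed = []
    included_in_delta = []
    for name in sorted(set(old_files) | set(new_files)):
        old_sum = old_files.get(name)
        new_sum = new_files.get(name)
        if new_sum is None:
            # Removed: only listed in removed.xml.
            listed_in_removed.append(name)
        elif old_sum is None:
            # Added: the whole file is shipped in the DeltaRepo.
            included_in_delta.append(name)
        elif old_sum != new_sum:
            # Changed: old version is listed as removed, new version is shipped.
            listed_in_removed.append(name)
            included_in_delta.append(name)
    return listed_in_removed, included_in_delta


old_extra = {"comps.xml": "c1", "pkgorigins": "p1", "prestodelta.xml": "d1"}
new_extra = {"comps.xml": "c2", "prestodelta.xml": "d1", "deltainfo": "i1"}
in_removed, in_delta = classify_extra_files(old_extra, new_extra)
```

Unchanged files (prestodelta.xml in this sample) appear in neither list.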

TODO: In the future, maybe use a diff tool and include only the diff instead of the full file.

Integration to the current repodata

Tree structure:

mirror/
  +-deltarepos
  |   +-ei7as764ly-043fds4red
  |   |   |-primary.xml
  |   |   |-filelists.xml
  |   |   |-other.xml
  |   |   |-removed.xml
  |   |   |-repomd.xml
  |   |
  |   +-0w78as1r9r-043fds4red
  |   |   |- ...
  |   |
  |   |-deltarepo.xml
  |
  +-Packages
  |   |- ...
  |
  +-repodata
      |-primary.sqlite.bz2
      |-primary.xml.gz
      |-filelists.sqlite.bz2
      |-filelists.xml.gz
      |-other.sqlite.bz2
      |-other.xml.gz
      |-repomd.xml

Notes:

  • The deltarepos/ dir is intentionally outside of repodata/. (The main reason is to prevent downloading of unwanted deltas, e.g., when someone uses wget to recursively download the current repodata.)

deltarepo.xml

<?xml version="1.0" encoding="UTF-8"?>
<deltarepos>
  <deltarepo id-type="sha256" from="ei7as764ly" to="043fds4red">
    <location href="deltarepos/ei7as764ly-043fds4red" />
    <size>15432</size>
  </deltarepo>
  <deltarepo id-type="sha256" from="0w78as1r9r" to="043fds4red">
    <location href="deltarepos/0w78as1r9r-043fds4red" />
    <size>7869</size>
  </deltarepo>
</deltarepos>
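A client that knows its current RepoId and the RepoId it wants to reach could look up a matching delta in this index. The sketch below is only an assumption about how such a lookup might work; `find_delta` is a hypothetical name, and only a direct (single-step) delta is searched for.

```python
import xml.etree.ElementTree as ET

# Sample index mirroring the example above.
DELTAREPO_XML = """<?xml version="1.0" encoding="UTF-8"?>
<deltarepos>
  <deltarepo id-type="sha256" from="ei7as764ly" to="043fds4red">
    <location href="deltarepos/ei7as764ly-043fds4red" />
    <size>15432</size>
  </deltarepo>
  <deltarepo id-type="sha256" from="0w78as1r9r" to="043fds4red">
    <location href="deltarepos/0w78as1r9r-043fds4red" />
    <size>7869</size>
  </deltarepo>
</deltarepos>
"""


def find_delta(xml_text, from_id, to_id):
    """Return (href, size) of the delta from from_id to to_id, or None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for node in root.findall("deltarepo"):
        if node.get("from") == from_id and node.get("to") == to_id:
            href = node.find("location").get("href")
            size = int(node.find("size").text)
            return href, size
    return None


match = find_delta(DELTAREPO_XML, "0w78as1r9r", "043fds4red")
```

When no entry matches, the function returns None and the client would fall back to downloading the full repodata.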

Needed changes in the current state of repodata

repomd.xml

We need to store the current RepoId/DeltaRepoId in repomd.xml.

Possibilities:

Add new element "repoid"

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <repoid type="sha256">5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</repoid>
  <data type="primary">
    ....

Add new tag "repoid"

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <tags>
    <content>binary-i386</content>
    <distro cpeid="cpe:/o:fedoraproject:fedora:17">r</distro>
    <repo>foorepotag</repo>
    <repoid type="sha256">5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</repoid>
  </tags>
  <data type="primary">
    ....

Use current "repo" tag

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <tags>
    <content>binary-i386</content>
    <distro cpeid="cpe:/o:fedoraproject:fedora:17">r</distro>
    <repo>repoid-sha256:5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</repo>
  </tags>
  <data type="primary">
    ....
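Whichever possibility is chosen, a reader has to locate the id in repomd.xml. The sketch below tries all three layouts in turn; it is an illustration under assumptions, not a proposed implementation. `find_repoid` is a hypothetical name, and the sample document omits the data elements that the snippets above elide.

```python
import xml.etree.ElementTree as ET

NS = {"repo": "http://linux.duke.edu/metadata/repo"}
PREFIX = "repoid-"

# Sample using the third layout (id stored in an existing "repo" tag).
REPOMD_XML = """<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1355393568</revision>
  <tags>
    <content>binary-i386</content>
    <repo>repoid-sha256:5a8e6bbb940b151103b3970a26e32b8965da9e90a798b1b80ee4325308149d8d</repo>
  </tags>
</repomd>
"""


def find_repoid(repomd_text):
    """Return (id_type, repoid) from repomd.xml text, or None if absent."""
    root = ET.fromstring(repomd_text.encode("utf-8"))
    # Layouts 1 and 2: a dedicated <repoid> element, top-level or in <tags>.
    for path in ("repo:repoid", "repo:tags/repo:repoid"):
        node = root.find(path, NS)
        if node is not None and node.text:
            return node.get("type"), node.text.strip()
    # Layout 3: a <repo> tag of the form "repoid-<type>:<value>".
    for node in root.findall("repo:tags/repo:repo", NS):
        text = (node.text or "").strip()
        if text.startswith(PREFIX) and ":" in text:
            id_type, _, value = text[len(PREFIX):].partition(":")
            return id_type, value
    return None


found = find_repoid(REPOMD_XML)
```

Ordinary repo tags such as foorepotag are skipped because they lack the "repoid-" prefix.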

Use cases

./deltarepo repo1 repo2

  • Creates a directory that is named after the DeltaRepoId and contains the repository delta information.

./deltarepo --apply delta repo

  • Applies the delta to the repository.

Options:

  • -h, --help - Show help.
  • --version - Show version.
  • -q, --quiet - Quiet mode.
  • -v, --verbose - Verbose mode.
  • -l, --list-datatypes - List datatypes for which delta is supported.
  • -o, --outputdir - Output directory.
  • Options for delta generation (default):
      • -s, --skip=DATATYPE - Don't generate a delta for the datatype. Can be specified multiple times. (E.g., --skip=comps)
      • -d, --do-only=DATATYPE - Generate a delta only for the listed datatypes. Can be specified multiple times. (E.g., --do-only=primary)
      • -t, --id-type=HASHTYPE - Specify the hash function for the ids (RepoId, DeltaRepoId). The default hash function is sha256. (E.g., --id-type=sha512)
  • Options for delta applying:
      • -a, --apply - Enable delta applying mode.
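The option list could be expressed with Python's argparse roughly as follows. This is a sketch of the proposed interface only, not the tool's actual code: `build_parser`, the `paths` positional, and the version string are placeholders, and argparse supplies -h/--help on its own.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="deltarepo",
        description="Generate or apply a repodata delta.")
    # Placeholder version string; the document does not specify one.
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version placeholder)")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Quiet mode.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Verbose mode.")
    parser.add_argument("-l", "--list-datatypes", action="store_true",
                        help="List datatypes for which delta is supported.")
    parser.add_argument("-o", "--outputdir", help="Output directory.")

    gen = parser.add_argument_group("Options for delta generation (default)")
    gen.add_argument("-s", "--skip", action="append", default=[],
                     metavar="DATATYPE",
                     help="Don't generate a delta for the datatype.")
    gen.add_argument("-d", "--do-only", action="append", default=[],
                     metavar="DATATYPE",
                     help="Generate a delta only for the listed datatypes.")
    gen.add_argument("-t", "--id-type", default="sha256", metavar="HASHTYPE",
                     help="Hash function for RepoId and DeltaRepoId.")

    app = parser.add_argument_group("Options for delta applying")
    app.add_argument("-a", "--apply", action="store_true",
                     help="Enable delta applying mode.")

    # Either "repo1 repo2" (generation) or "delta repo" (applying).
    parser.add_argument("paths", nargs="*", metavar="PATH")
    return parser


gen_args = build_parser().parse_args(
    ["--skip=comps", "-s", "pkgorigins", "repo1", "repo2"])
apply_args = build_parser().parse_args(["--apply", "delta", "repo"])
```

Using the "append" action lets --skip and --do-only be given multiple times, as the option descriptions require.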