Bug 850896 - [rfe] [metadata] smarter metadata format for quicker sync (delta/per-package/...)
Summary: [rfe] [metadata] smarter metadata format for quicker sync (delta/per-package/...)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: librepo
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Tomas Mlcoch
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 529358 1038824 1086288 1163988 1276093 1295669
Depends On:
Blocks:
 
Reported: 2012-08-22 16:47 UTC by David Jaša
Modified: 2019-03-03 11:26 UTC
CC List: 30 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-03 11:26:19 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 968006 0 unspecified CLOSED [rfe] download filelists only on demand 2022-05-16 11:32:56 UTC
Red Hat Bugzilla 1232342 0 unspecified CLOSED hawkey only works with XML repo metadata and not with sqlite databases 2022-05-16 11:32:56 UTC

Internal Links: 968006 1195036 1232342

Description David Jaša 2012-08-22 16:47:01 UTC
Change yum/dnf repository format to per-package one.

The current yum repository format, which stores each kind of information about all packages in a repo in a single file, has two scalability issues:
1) downloading the whole metadata file is required even for a tiny repo update - in practice, one package change in each of the fedora* repositories means the user downloads over 10 MB of "new" metadata
2) given 1), big repos don't include more than one version of a package, which rules out rollback and all the associated goodies (have you ever thought about a constantly-installable rawhide?).

The yum overhaul within the DNF project is a great chance to bring the yum repo format in line with current needs.


Package metadata reordering proposal
====================================

Actually, the current repo format isn't bad for one of the common use cases: system installation or large updates, where downloading all the metadata in just a few HTTP transfers is efficient. The current repo format also already seems quite ready to be split into per-package files, given its structure:
<metadata>
<package><name>pkg1</name><checksum>sum1</checksum> ... </package>
<package><name>pkg2</name><checksum>sum2</checksum> ... </package>
</metadata>

<otherdata>
<package pkgid="sum1"> ... </package>
<package pkgid="sum2"> ... </package>
</otherdata>

<filelists>
<package pkgid="sum1"> ... </package>
<package pkgid="sum2"> ... </package>
</filelists>

This can easily be split into one to three per-package files:

<package>... <pkgotherdata/> <pkgfilelist/> </package>
or into three files that more closely mimic the current structure:
<metadata><package><name/><checksum/> ... </package></metadata>
<otherdata><package pkgid="id"> ... </package></otherdata>
<filelists><package pkgid="id"> ... </package></filelists>

Yum/dnf can then easily recreate the current metadata tree structure from such files.
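
A minimal sketch of such a split (Python, standard library only; the element names and the flat repodata/ output layout follow the simplified structure above, not the real createrepo schema with its XML namespaces):

import os
import xml.etree.ElementTree as ET

# Split a combined <metadata> tree into one file per package, keyed by checksum.
os.makedirs("repodata", exist_ok=True)
tree = ET.parse("primary.xml")                    # assumption: already uncompressed
for pkg in tree.getroot().findall("package"):
    pkgid = pkg.findtext("checksum")
    with open("repodata/%s.xml" % pkgid, "wb") as out:
        out.write(ET.tostring(pkg))               # one self-contained per-package file

The same loop applied to the <otherdata> and <filelists> trees would give the three-file variant.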


Package index
=============

To let yum/dnf know what packages are in the repository, some package index has to be available. It seems to me that the best way to achieve this is an (xz-compressed) plaintext file with a format like this (for the one-metadata-file-per-package variant):
default_arch=arch1
pkg1 v1-r1 v2-r1 v2-r1.arch2
pkg2 v10-r123 v10-r124

that would translate into paths like:
repodata/<path>/pkg1-v1-r1.arch1.xml.xz
repodata/<path>/pkg1-v2-r1.arch1.xml.xz
repodata/<path>/pkg1-v2-r1.arch2.xml.xz
repodata/<path>/pkg2-v10-r123.arch1.xml.xz
repodata/<path>/pkg2-v10-r124.arch1.xml.xz

A file like this could be around 1 MB uncompressed or 150-300 kB xz-compressed (assuming 12,500 packages and an average of 80 chars per line). The other data in a typical update is much smaller (a few kB to tens of kB), so even this change alone would make "yum clean expire-cache ; yum makecache" download up to two orders of magnitude less metadata.
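
A minimal sketch of how a client could expand such an index into per-package metadata paths (Python; the index file name, the line format and the path layout are the hypothetical ones proposed above, the <path> component is left out, and releases are assumed not to contain dots):

default_arch = None
paths = []
for line in open("pkgindex.txt"):                 # hypothetical index file name
    line = line.strip()
    if line.startswith("default_arch="):
        default_arch = line.split("=", 1)[1]
        continue
    name, *versions = line.split()
    for v in versions:
        if "." in v:                              # explicit arch, e.g. v2-r1.arch2
            evr, arch = v.rsplit(".", 1)
        else:                                     # bare version-release -> default arch
            evr, arch = v, default_arch
        paths.append("repodata/%s-%s.%s.xml.xz" % (name, evr, arch))

The client would then fetch only the per-package files it does not already have cached.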


What if repository size outgrows per-package metadata?
======================================================

If the package count in Fedora ever reaches Debian's current count (almost 80,000 based on available info), even the structure above means MBs of data per incremental metadata update. That could be solved by keeping separate information about repository history, for example an update index (containing the timestamp and uuid of each update) and per-update info files (each containing its own timestamp and uuid, the timestamp and uuid of the repository state that the update supersedes, and a package addition/removal changelog).

A move to an update-based model could actually make yum cache updates practically transparent (no need to run 'yum clean something' to make sure the cache is really in sync with the repos).

Comment 1 Ales Kozumplik 2012-08-23 09:20:55 UTC
David,

this is a very interesting proposal and there's at least one more voice I've heard about a similar idea from. I have to tell you straight off there's a very low chance this happens in the nearest three or four fedora releases if ever.

Having said that: what is your reason for proposing this change? Do you mind the waiting for the metadata downloads before a DNF operation? Or is it more about the bandwidth used?

Comment 2 David Jaša 2012-08-23 09:46:32 UTC
(In reply to comment #1)
> David,
> 
> this is a very interesting proposal and there's at least one more voice I've
> heard about a similar idea from. I have to tell you straight off there's a
> very low chance this happens in the nearest three or four fedora releases if
> ever.
> 
> Having said that: what is your reason for proposing this change? Do you mind
> the waiting for the metadata downloads before a DNF operation? Or is it more
> about the bandwidth used?

there are several concerns:

1) when at a location with slow internet (mobile, rural wifi providers etc.), any use of yum can drive one nuts - you do 'yum install some-minor-package' and the cache refresh itself takes minutes

2) I'd like to see more versions of the same package, for rollback or for the possibility of keeping a "last non-broken package version set" in rawhide.
I'd really have appreciated the latter last winter when I needed to test gnome-shell software rendering and the only way to get it in the December-March period was to cherry-pick older RPMs (and in some cases to rebuild them manually).

3) given the default daily frequency of checking for updates and the number of actual updates per day, I wouldn't be surprised if metadata made up tens of percent of total traffic on the Fedora mirrors, thus using donated resources inefficiently and, in turn, behaving impolitely towards Fedora Project donors...

Comment 3 Ales Kozumplik 2012-08-23 10:44:35 UTC
(In reply to comment #2)
> (In reply to comment #1)
> > David,
> > 
> > this is a very interesting proposal and there's at least one more voice I've
> > heard about a similar idea from. I have to tell you straight off there's a
> > very low chance this happens in the nearest three or four fedora releases if
> > ever.
> > 
> > Having said that: what is your reason for proposing this change? Do you mind
> > the waiting for the metadata downloads before a DNF operation? Or is it more
> > about the bandwidth used?
> 
> there are several concerns:
> 
> 1) when at location with slow internet (mobile, rural wifi providers etc.),
> any use of yum can one drive nuts - you do 'yum install some-minor-package'
> and the cache refresh itself takes minutes

that is why dnf precaches everything (the check and downloads are done in an hourly cron job). have you experienced having to wait for the downloads with DNF?

> 
> 2) I'd like to see more versions of the same package for rollback or for
> possiblity to leave "last non-broken package version set" in rawhide.
> I'd really appreciate the latter last winter when I needed to test
> gnome-shell software rendering and the only way to get it in December-March
> period was to cherry pick older RPMs (and in some cases to rebuild them
> manually).

I agree here 100%. The fedora mirrors should hold many more old versions than they do now. At the same time there's nothing anybody in the packaging tools team can do about it; this is decided by Fedora infrastructure.

> 3) given default Daily frequency of check for updates and number of actual
> updates per day, I wouldn't be surprised if metadata would make tens of
> percent of total traffic on fedora mirrors, thus using donated resources
> inefficiently and in turn behaving impolitely to Fedora Project donors...

Yep, they do, and with DNF this will increase further. It is a very unfortunate setup they have, but nothing I or DNF can do anything about.

Comment 4 Ales Kozumplik 2012-08-29 11:06:44 UTC
I have doubts this metadata arrangement will be a big improvement over the status quo. It is a lot of work in many different places: createrepo, libsolv, hawkey, dnf, potentially yum, and fedora infrastructure (arguably the most painful). And at the same time we still have options to explore with the current format: e.g. better caching, or longer periods for which the mirrors hold obsoleted packages (unfortunately this last thing is out of our control).

Let's keep this open for up to one year. If there are more opinions that we should pursue this direction, we will discuss it with FESCo. Otherwise I'll close this then.

Comment 5 David Jaša 2012-08-29 12:21:07 UTC
(In reply to comment #3)
...
> that is why dnf precaches everything (the check and downloads are done in an
> hourly cron job). have you experienced having to wait for the downloads with
> DNF?
> 

This RFE is based on my experience with yum on RHEL 6 and Fedora <= 17, so I haven't laid my hands on DNF yet.

> I agree here 100%. The fedora mirrors should hold many more old versions
> than they do now. At the same time there's nothing nobody in the packaging
> tools team can do anything about, this is decided by Fedora infrastructure.
> 

IMHO you're overlooking the "communicating vessels" nature of this problem - Fedora infrastructure folks won't agree to make repos any larger as long as yum sucks at handling them.

> Yep, they do and it with DNF this will increase further. It is a very
> unfortunate setup they have, but nothing I or DNF can do anything about.

Don't kill the messenger (fedora infrastructure) here please.

The yum repo format is very inefficient. Take the example of a system with a lifetime of 6 months, 2 GB of RPMs downloaded over that lifetime (base install and updates combined), and a 3.0 MB <uuid>-primary.xml.gz downloaded on each refresh: with weekly refreshes the metadata is 3.66 % of total repo traffic, with daily refreshes it is 20.9 % - and this is pretty much the best case for yum, because the sqlite.bz2 that yum actually downloads is almost twice as big, and yum often downloads other similarly-sized metadata files as well.
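
For reference, the percentages above follow from a quick back-of-the-envelope calculation (a sketch in Python, using only the figures quoted in the paragraph: ~6 months of lifetime, 2 GB of RPMs, 3.0 MB per primary download):

rpms_mb = 2048.0                  # ~2 GB of RPMs over the system's lifetime
primary_mb = 3.0                  # one <uuid>-primary.xml.gz download
weekly_md = 26 * primary_mb       # ~26 weekly refreshes in 6 months
daily_md = 180 * primary_mb       # ~180 daily refreshes in 6 months
print(weekly_md / (weekly_md + rpms_mb))   # ~0.037 -> roughly the 3.66 % above
print(daily_md / (daily_md + rpms_mb))     # ~0.209 -> the 20.9 % above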


(In reply to comment #4)
> I have doubts this metadata arrangement will be a big improvement over the
> status quo. It is a lot of work at many different places: createrepo,
> libsolv,

Why libsolv and other internals? IMO it should be pretty straightforward to convert the new per-package metadata XML tree:
<package>
  ...
  <filelist/>
  <otherdataperpkg/>
</package>
to the set of current <metadata>, <filelists> and <otherdata> trees (as noted in the original report).

> hawkey, dnf, yum potentially, fedora infrastructure (arguably the
> most painful).

Tens of % of total traffic taken by metadata is a good enough reason to go through such pain IMO (fedora infrastructure team could have much more precise data than my rough estimates).

> And at the same time we have still options to explore with
> the current format: e.g. better caching or longer periods the mirrors hold
> obsoleted packages (unfortunately this last thing is out of our control).
> 

I can't see how anything can be better than using expire-cache by default.

> Let's keep this open for up to one year. If there are more opinions we
> should pursue this direction we will discuss this with fesco. Else I'll
> close this then.

FESCo should be contacted on this matter because, from my POV, yum's inefficiency translates directly into huge unnecessary traffic and prevents more desirable repository contents.

Comment 6 David Jaša 2012-08-29 12:38:49 UTC
Created FESCo ticket: https://fedorahosted.org/fesco/ticket/943

Comment 7 Ales Kozumplik 2012-08-29 12:41:07 UTC
(In reply to comment #5)
> The yum repo format is very inefficient. Take an example of system with
> lifetime of 6 months, 2 GB downloaded of total RPMs over its lifetime (base
> install and updates combined) and downloading just <uuid>-primary.xml.gz
> with 3.0 MB in average on a weekly and daily basis - at weekly basis, the
> metadata is 3.66 % of total repo traffic, at daily basis, it 20.9 % of total
> traffic - and this is pretty much best case for yum, because sqlite.bz2 that
> is actually downloaded by yum is almost twice bigger and yum often downloads
> other similarly-sized metadata files as well.

Don't get me wrong, I agree having the metadata take tens of percents of the mirrors' bandwidth is a terrible fact.

> Why libsolv and other internals? IMO it should be pretty straightforward to
> convert new metadata XML tree:
> <package>
>   ...
>   <filelist/>
>   <otherdataperpkg/>
> <package>
> to set of current <metadata>, <filelists> and <otherdata> trees (as noted in
> original report)

I'd like to avoid such an approach: first, it would result in either authoring a new tool or patching all the places that deal with the downloaded files (my original point). Second, having an intermediate format slows things down further.

> Tens of % of total traffic taken by metadata is a good enough reason to go
> through such pain IMO (fedora infrastructure team could have much more
> precise data than my rough estimates).

That can only be evaluated in relation to other requests the current packaging tools stack is facing:)

> I can't see how can be something better than use of expire-cache by default.

You mean handling this already on HTTP level? I don't know much about urlgrabber but I think they already do that with metalinks. But let's stay on the topic with this bug which is what format should we use to distribute the metadata. I'd like to note that the least bandwidth-intensive format (yours could be a candidate for such) is not necessarily the one fastest to work with from yum. And speed as perceived by the user is what I primarily target in DNF now. Again, this is the first report complaining about the bandwidth load.

Comment 8 seth vidal 2012-08-29 14:48:03 UTC
(In reply to comment #3)
> > 
> > 1) when at location with slow internet (mobile, rural wifi providers etc.),
> > any use of yum can one drive nuts - you do 'yum install some-minor-package'
> > and the cache refresh itself takes minutes
> 
> that is why dnf precaches everything (the check and downloads are done in an
> hourly cron job). have you experienced having to wait for the downloads with
> DNF?

Ales, really? Cmon. Yum caches the data too. If you run any of the yum cron jobs you get the same result.

DNF didn't invent caching.
 
> I agree here 100%. The fedora mirrors should hold many more old versions
> than they do now. At the same time there's nothing nobody in the packaging
> tools team can do anything about, this is decided by Fedora infrastructure.

Not by Fedora Infrastructure - by Fedora Release Engineering and by budgets.

But considering nobody in Releng or even in Fedora Infrastructure has been asked about this - I don't see how you know this.


> 
> > 3) given default Daily frequency of check for updates and number of actual
> > updates per day, I wouldn't be surprised if metadata would make tens of
> > percent of total traffic on fedora mirrors, thus using donated resources
> > inefficiently and in turn behaving impolitely to Fedora Project donors...
> 
> Yep, they do and it with DNF this will increase further. It is a very
> unfortunate setup they have, but nothing I or DNF can do anything about.

Again - ask about it - don't assume you know what's being used and why.

Comment 9 James Antill 2012-08-29 14:56:13 UTC
> if package count in Fedora ever reaches the current number of Debian
> (almost 80000 based on available info), even the structure above means MBs
> of data per incremental metadata update

Go read:

http://yum.baseurl.org/wiki/apt2yum#Generalpoints

...a new repo format can certainly help, but in some ways the problem is rpm vs. dpkg. And given the fact that we still ship with file requires (an optional huge optimisation), I'm not going to place any bets on getting rid of unneeded provides.

> FESCo should be contacted on this matter because from my POV, yum
> inefficiency translates directly into huge unneccessary traffic and
> prevention of more desirable repository contents.

Whenever I've spoken to rel-eng people the reason for "no old versions" was always "we don't want to make mirrors commit to storing N versions of a package".
Nobody ever said anything about bandwidth usage or the difference between a 10MB primary and a theoretical 1-5MB variant.
But, sure, yum is trash and was written by idiots. Generally accepted wisdom.

Comment 10 seth vidal 2012-08-29 15:06:53 UTC
From 2 yrs ago:

http://lists.baseurl.org/pipermail/rpm-metadata/2010-August/001209.html

If anyone is thinking about restructuring repodata - which is a fine idea, let's be clear - read up on some important considerations for new formats.

Comment 11 seth vidal 2012-08-29 15:10:30 UTC
and
http://yum.baseurl.org/wiki/dev/NewRepoDataIdeas

Comment 12 Ales Kozumplik 2012-08-29 15:32:55 UTC
Thanks for the links, Seth, we definitely need to explore all the things that people already proposed etc.

And I didn't mean to sound like DNF invented caching (or anything else); I was just saying it is there, and for someone who doesn't mind the bandwidth it should be a good enough solution.

Comment 13 David Jaša 2012-08-29 15:57:48 UTC
Thanks all for the input, I felt like I was reinventing the wheel but I didn't find prior art. :/


I feel there might be two changes to createrepo and yum/dnf, though, that could alleviate the problem somewhat:

1) move to xz compression - while it is noticeably more CPU-heavy for compression, it has a much better compression ratio (tested on the latest Fedora 17 updates primary xml.gz and sqlite bz2):
$ PRIMARY=<uuid>-primary.xml
$ du $PRIMARY.gz ; gunzip $PRIMARY.gz ; xz $PRIMARY ; du $PRIMARY.xz
3124	<uuid>-primary.xml.gz
1804	<uuid>-primary.xml.xz

$ PRIMARY=<uuid>-primary.sqlite
$ du $PRIMARY.bz2 ; bunzip2 $PRIMARY.bz2 ; xz $PRIMARY ; du $PRIMARY.xz
5660	<uuid>-primary.sqlite.bz2
4404	<uuid>-primary.sqlite.xz

2) make yum/dnf download the smaller of the sqlite/xml versions of each metadata file, provided they contain the same information.

As you can see from the comparison above, just these two changes would cut metadata traffic to around a third of the status quo (my yum defaults to downloading sqlite).
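
For reference, the "around a third" estimate follows directly from the du figures above (a quick sketch):

# what would be downloaded (primary.xml.xz) vs. what yum downloads by default
# today (primary.sqlite.bz2), using the sizes measured above (in kB)
print(1804.0 / 5660.0)            # ~0.32, i.e. roughly a third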

Comment 14 James Antill 2012-08-29 20:32:51 UTC
(In reply to comment #13)
> Thanks all for the input, I felt like I was reinventing the wheel but I
> didn't find prior art. :/
> 
> 
> I feel that there might be two changes to the createrepo and yum/dnf though
> that could alleviate the problem somewhat:
> 
> 1) move to xz compression - while it is noticeably more CPU-heavy for
> compression, it has much better compression ratio (tested on latest fedora
> 17 updates primary xml.gz and sqlite bz2):

 Yeh, we wanted to do this before ... the main problems were twofold:

1. decompression+reading was linked in a bunch of the code paths (Eg. libxml had support to open .xml.gz ... so we just passed the compressed files in).

2. No python APIs to do decompression, and we didn't want to just do system("unxz foo").

...#2 mostly got fixed a while ago, but #1 only got fixed for F18 (IIRC).
 I'm pretty sure we could turn xz compression on for everything in F18 (untested).

> $ PRIMARY=<uuid>-primary.xml
> $ du $PRIMARY.gz ; gunzip $PRIMARY.gz ; xz $PRIMARY ; du $PRIMARY.xz
> 3124	<uuid>-primary.xml.gz
> 1804	<uuid>-primary.xml.xz
> 
> $ PRIMARY=<uuid>-primary.sqlite
> $ du $PRIMARY.bz2 ; bunzip2 $PRIMARY.bz2 ; xz $PRIMARY ; du $PRIMARY.xz
> 5660	<uuid>-primary.sqlite.bz2
> 4404	<uuid>-primary.sqlite.xz

 Yeh, and IIRC primary gets the least benefit from it too.

> 2) make yum/dnf download the smaller of sqlite/xml version of each metadata,
> provided they contain equal information.

AFAIK .sqlite is always bigger, and you can already (maybe even in F17) tell yum to choose .xml. Set mddownloadpolicy=xml in yum.conf (I now realize this isn't documented ... my bad *goes to fix*).

Comment 15 Ales Kozumplik 2012-09-10 10:58:42 UTC
Our latest thought on this: let's start with basic xml diffs on the mirror side and patching on the dnf side, check how much this improves the situation and whether it even makes sense, and then decide about other tool and infrastructure changes.

Comment 16 Bill Nottingham 2013-02-23 12:11:12 UTC
FWIW, the 'only keep one version of packages' is mainly a function of:

1) data to push the mirrors
2) size of metadata created
3) *time to create the repositories*

createrepo's cache for untouched packages helps, but it's still scanning extremely large amounts of data across network-attached storage. (The NAS part is an infrastructure detail, but it needs to be kept in mind.)

Also, deltas make the baby kittens cry.

Comment 17 Ales Kozumplik 2013-03-25 11:44:38 UTC
(In reply to comment #16)
> 3) *time to create the repositories*
> 
> createrepo's cache for untouched packages helps, but it's still scanning
> extremely large amounts of data across network-attached storage. (The NAS
> part is an infrastructure detail, but it needs to be kept in mind.)

Perhaps the C reimplementation of createrepo (https://fedorahosted.org/createrepo_c/) can help with that?

Comment 18 Bill Nottingham 2013-03-25 14:15:50 UTC
Would be worth a try, but there are limits to how much you can speed up reading 20G from the NAS.

Comment 19 Jan Pazdziora 2013-04-03 08:56:30 UTC
If there is discussion of per-package metadata, could this bugzilla also be extended to address errata, namely the single updateinfo.xml file which needs to be parsed again and again during syncs to Spacewalk or Katello?

Comment 20 Ales Kozumplik 2013-04-03 11:13:38 UTC
(In reply to comment #19)
> If there is discussion of per-package metadata, could this bugzilla be also
> extended to address the erratas, namely having single updateinfo.xml file
> which needs to be parsed again and again, during synces to Spacewalk or
> Katello?

Yes, we should optimally arrive at a generic enough approach that covers the errata too.

(Note it's very probably not going to be the per-package format outlined in comment 0.)

Comment 21 Ales Kozumplik 2013-05-23 05:45:31 UTC
One thing we've failed to mention so far is that imposing a deterministic ordering on the package data within the xml files would allow rsyncing them, giving us a lot of speedup for little effort.

Comment 22 Jan Pazdziora 2013-05-23 09:39:18 UTC
(In reply to Ales Kozumplik from comment #21)
> one thing we've failed to mention so far is that imposing determinate
> ordering on the packages data within xml files would allow rsyncing them,
> giving us a lot of speedup for little effort.

But rsync is only used on some distribution paths, not all of them. We'd really need a solution that works reasonably well for all transfer protocols.

Comment 23 Ales Kozumplik 2013-08-13 15:05:47 UTC
(not working on this now, and not actively participating in the metadata improvement project. releasing this now)

Comment 24 Zdeněk Pavlas 2013-08-13 16:11:02 UTC
Hi guys watching this thread.. could you comment on this proposal?
http://lists.baseurl.org/pipermail/rpm-metadata/2013-July/001480.html

It's a simple old_repodata + delta_file => new_repodata idea, but it can be seen as a variation of the per-package repodata idea, since the delta file bundles repodata of recently added packages and hashes of repodata likely to be cached on clients.  There's no index, however.  The actual implementation:

1) unlike traditional diffs, the same delta file can be used to successfully update from a range of old repodata snapshots, so the number of deltas does not explode when repository rebuilds are much more frequent than client check-ins.
(the delta hash-references the common subset, and bundles everything else)

2) When deltas reference cached repodata, the shortest possible hash prefix is used. In the common case when packages are sorted and there is nothing to be skipped before package X in all old_repodata versions, the hash can be left out completely. This results in deltas whose size is proportional to the number of new packages only; there is no fixed "index" overhead.
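
A minimal sketch of the prefix-shortening idea from point 2 (Python; the checksum values are placeholders, not real repodata hashes):

def shortest_unique_prefix(full_hash, known_hashes):
    """Shortest prefix of full_hash matching exactly one entry in known_hashes."""
    for n in range(1, len(full_hash) + 1):
        prefix = full_hash[:n]
        if sum(1 for k in known_hashes if k.startswith(prefix)) == 1:
            return prefix
    return full_hash

cached = ["a1b2c3d4", "a1f0e9aa", "7c44d012"]      # placeholder checksums
print(shortest_unique_prefix("a1b2c3d4", cached))  # -> "a1b"

With sorted package lists and nothing to skip before a given package, even that short prefix can be omitted entirely, as described above.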

Comment 25 Hedayat Vatankhah 2014-08-16 08:59:43 UTC
Hi Zdenek,
What happened to this patch? It seems nice, but no discussion seems to have happened about it (not even against it). IMHO worrying too much about an 'exploding number of deltas' is not needed: there will always be far fewer deltas than packages, and I doubt that having e.g. 5000 deltas would bother mirrors or clients (they can retrieve deltas using HTTP keep-alive/pipelining so they won't hit servers with many TCP connections). IMHO, having a constant minimum size for delta files is enough (when creating new deltas, check the size of the last delta file: if it is less than the minimum, update that delta; if it is larger or equal, create a new delta file).

Anyway, even with your approach things would improve a lot. Especially, it would encourage frequent metadata updates (if you don't fetch new metadata early enough you'll need to download the full metadata, and the more frequently you fetch it, the less you download), which is the complete opposite of the current situation, where if you try to update frequently you are penalized by downloading the whole metadata again, which can (not rarely!) be larger than the actual update/install downloads. Also, it would play nicely with the default DNF policy of updating metadata automatically very often.

Comment 26 Honza Silhan 2014-08-18 09:16:33 UTC
Hi, Zdenek no longer works there. AFAIK some team is working on reducing metadata downloads in DNF now, but I don't know the details.

Comment 27 Jan Zeleny 2014-08-18 09:59:54 UTC
That would be Tomas Mlcoch, who is on the CC list of this bugzilla. The proof of concept is done IIUIC; however, integration into DNF is another matter. This feature is not imminently important - getting dnf into a state where it can completely replace yum before F22 is more important.

Comment 28 Daniel Mach 2014-10-03 13:14:40 UTC
Tomas has finished some proof-of-concept code, but we need new code to match the new specs. The difference is that we originally wanted deltas between 2 existing repos. That would speed things up on the client side, but createrepo would remain slow (creating a brand new repo + delta). The new use case is to boost createrepo as well. More details follow...


Create Repo Without Delta (server)
----------------------------------
* This is current (pre-delta) behaviour.
* Create brand new repodata without delta.
* Deltas could be used as --update-md-path when --update is enabled

Command:
$ createrepo_c [--update] [--no-delta]

Output:
repodata
repodata/repodelta is *removed*


Create Repo With Delta (server)
-------------------------------
* Preserve existing repodata, add delta describing changed packages.
* Delta is:
  * small repo with added packages
  * list of removed RPMs
  * new comps, updateinfo, etc.
* Requires repodata/primary.xml loading
  * ~10x faster than loading whole repo
  * ~30x faster than loading and saving it (load, save to xml and sqlite)

Command:
$ createrepo_c [--update] --delta

Output:
repodata
 +- <repodata_files>
 +- repodelta
    +- deltamd.xml
    +- $content_hash_from-$content_hash_to or $timestamp (or another ID)
       +- <repodata_files>
       +- removed.xml

$content_hash is a unique and deterministic checksum of the repo content (file names, locations, checksums, etc.)
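
One way such a content hash could be computed (a sketch only, not the actual createrepo_c implementation; the (name, location, checksum) triple is just an example of what "file names, locations, checksums, etc." might mean concretely):

import hashlib

def content_hash(packages):
    """Deterministic digest over (name, location, checksum) triples.
    Sorting makes the result independent of the order the packages are listed in."""
    h = hashlib.sha256()
    for name, location, checksum in sorted(packages):
        h.update(("%s\0%s\0%s\n" % (name, location, checksum)).encode("utf-8"))
    return h.hexdigest()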


deltamd.xml format
------------------
<repodelta>
  <latest>$content_hash</latest>
  # is the latest content hash sufficient or do we need to track more?
  <repo>
    <from>$content_hash_from</from>
    <to>$content_hash_to</to>
    <repodata_size>size of primary, filelists, others</repodata_size>
    # how to deal with xml/sqlite?
    <extras_size>size of comps, updateinfo and other non-repodata files</extras_size>
    <location href="$content_hash_from-$content_hash_to" />
  </repo>
</repodelta>


tmlcoch: comps changes, content_hash does not; need timestamp to identify new repos


Consume Repo Without Delta (client)
-----------------------------------
* This is current (pre-delta) behaviour.
* Download and use repomd.


Consume Repo With Delta (client)
--------------------------------
* Download repodata or use cached
* Download deltamd.xml and download all needed deltas
* Compute the upgrade path for the repo (see the sketch after this list)
* Apply deltas to repodata (in-memory mergerepo)
* Incompatible, requires updated client (yum, dnf/librepo, etc.) :/
* TODO: when a new remote repo is made, somehow reconstruct local repo so new deltas can be applied
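
A minimal sketch of the "compute upgrade path" step (Python; the delta table mirrors the <from>/<to>/<location> entries of the deltamd.xml format above, with placeholder hashes):

def upgrade_path(deltas, have, want):
    """deltas: dict mapping from_hash -> (to_hash, delta_location).
    Returns the delta locations to download to get from the locally cached
    content hash to the latest one, or None if no chain exists."""
    path, seen = [], set()
    while have != want:
        if have in seen or have not in deltas:
            return None                  # no usable chain: fall back to a full download
        seen.add(have)
        have, location = deltas[have]
        path.append(location)
    return path

deltas = {"aaa": ("bbb", "aaa-bbb"), "bbb": ("ccc", "bbb-ccc")}   # placeholder hashes
print(upgrade_path(deltas, "aaa", "ccc"))   # -> ['aaa-bbb', 'bbb-ccc']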

Comment 29 Honza Silhan 2014-11-19 12:14:55 UTC
*** Bug 1086288 has been marked as a duplicate of this bug. ***

Comment 30 Honza Silhan 2014-11-20 15:38:11 UTC
Tomas, Dan: please consider fetching just the list of security updates in the metadata if possible (Bug 1086288)

Comment 31 Honza Silhan 2014-11-24 12:33:41 UTC
*** Bug 1038824 has been marked as a duplicate of this bug. ***

Comment 32 Megh Parikh 2015-01-05 13:03:47 UTC
So what is the status of this bug?

I would like to suggest something, but if a format is already available there is no need to propose a new one.

Also, I'd like to learn more about the repo file contents. Could you point me to some links?

Comment 33 Jan Zeleny 2015-01-05 14:01:43 UTC
(In reply to Megh Parikh from comment #32)
> So what is status of this bug?
> 
> I would like to say something but if a format is available no need to
> propose a new format.

AFAICT there already is an implementation and we plan to propose this feature as a Fedora Change either for F22 or F23. If you have any comments / requirements, feel free to use this bugzilla to communicate them to us.

Comment 34 Niclas Moeslund Overby 2015-01-29 21:27:05 UTC
Is there any place to read about this implementation?

Comment 35 Megh Parikh 2015-02-22 11:48:22 UTC
As nobody was talking about an implementation that deals not exactly with deltas but with the huge size of the metadata, I proposed one of my own in bug 1195036. Please provide feedback. Also, if the delta format is ready, it may be used in conjunction with my proposal for system packages and library metadata.

Comment 36 Megh Parikh 2015-02-22 12:26:31 UTC
My other proposal is not to zip the metadata files but to serve them as a git repo. Also, we would have to use XML instead of the binary sqlite format.
AFAIK git can decide whether it is better to pull changes from the repo or pull the whole repo again. It also compresses the data before sending it to the user.
But git depends on diff, which can't properly diff XML files unless the XML file is formatted so that each line contains only one node. Even then it can't be as efficient as dedicated tools like XMLDiff, but I don't know if git can be used together with XML diff and patch tools, or whether any specialised VCS exists. If it can, we should use it.

Comment 37 Hedayat Vatankhah 2015-02-22 12:28:14 UTC
AFAIK, unlike yum, dnf already downloads xml metadata rather than sqlite ones.

Comment 38 Megh Parikh 2015-02-22 12:32:07 UTC
So is the above proposal valid? To me the solution could be using XML pretty-printing tools and then git.

Comment 39 Megh Parikh 2015-02-22 12:33:07 UTC
That would eliminate the need to modify createrepo

Comment 40 Hedayat Vatankhah 2015-02-22 18:21:43 UTC
Hmm... the only problem that comes to my mind right now is that git doesn't have any kind of 'resume' functionality. But it might be possible to provide the first 'clone' as a .tar.xz file to work around this limitation.

Comment 41 Megh Parikh 2015-02-23 05:20:25 UTC
The problem with using git is that it will keep unnecessary stuff. For example, a user will never need to revert and go back a commit.

Thus we should use a git shallow clone (`git clone --depth 1 remote-url`) for the first clone.

As far as I know, we can't use a .tar.xz for the first clone, as that doesn't make it a git repo and git pull can't be used later.

Also we should think about whether repos for different Fedora versions should be separate repos or separate branches. If the latter, only the needed branch should be pulled/cloned.

See http://blogs.atlassian.com/2014/05/handle-big-repositories-git/ for more info.

Also, I think the initial Fedora repos should be present by default, so there is no need to clone the Fedora repos, which are the biggest.

Most other repos are small enough that the resume functionality doesn't matter.

Comment 42 Megh Parikh 2015-02-23 05:33:11 UTC
Perhaps I should file a new bug for this?

Comment 43 Hedayat Vatankhah 2015-02-23 09:53:32 UTC
I think so!
BTW, a .tar.xz won't work if it is taken from the contents, but it will work if it is a .tar.xz of the repo itself. I've used this to work around git's resume problem myself. Currently, git itself has no solution for this problem. A shallow clone won't fix it, since the main Fedora repo doesn't grow large over time (it will probably have a single commit anyway, since AFAIK it is not updated). So, you can run `git clone --depth 1 repo-url` on the repo-building *server* (where you create the repo), create a .tar.xz from the resulting directory (including its .git dir), and offer it for download.

Clients can download this compressed git repo, and run git pull there.

Comment 44 Jan Pazdziora 2015-03-03 12:48:39 UTC
(In reply to Megh Parikh from comment #36)
> But git depends on diff which cant properly diff XML files unless the XML
> file is formatted properly such that each line contains only one node.

Neither git's dependency on diff nor the ability of diff to diff XML files seems to be a valid point.

Comment 45 Frank Ch. Eigler 2015-03-03 13:40:58 UTC
(See also [man gitattributes], specifically the ability to select custom "diff drivers".  But diffs aren't used in transport of repositories.)

Comment 46 Megh Parikh 2015-03-04 14:53:40 UTC
So, git does not depend on diffs for transport, as I understand from http://git-scm.com/book/en/v2/Git-Internals-Packfiles . I was pretty much confusing this with svn's methodology.

So please ignore my comments about XML diff

I am not so knowledgeable and don't understand much about VCS internals.

So can somebody research and find the VCS that best fulfils the following aim: when fetching deltas, it should decide whether fetching deltas would be too heavy, in which case it should redo a shallow clone instead.

From what I can find we can run `git pull --depth=1` too for the above result.

(I have exams, so I'm a little busy, hence the late reply.)

Comment 47 Frank Ch. Eigler 2015-03-04 15:05:25 UTC
A newly installed client could always -start- with a --depth=1 pull, i.e., stating that they are not interested in prior history, and always do a normal git-pull afterwards.

Comment 48 Megh Parikh 2015-03-04 16:42:38 UTC
(In reply to Frank Ch. Eigler from comment #47)
> A newly installed client could always -start- with a --depth=1 pull, i.e.,
> stating that they are not interested in prior history, and always do a
> normal git-pull afterwards.

I think you meant that a newly installed client should start with clone --depth=1

But I was asking about successive metadata updates, whether it is possible/appropriate to do git pull --depth=1 afterwards.

Comment 49 Frank Ch. Eigler 2015-03-04 16:46:58 UTC
I guess that depends on whether it's desirable to use git for history tracking at the client side (to go back in time), or using git only for its efficient file-change-distribution capabilities (and keep only near-tip history).

Comment 50 Megh Parikh 2015-03-04 16:55:27 UTC
Also, could this be made a plugin? From what I have read, we would put the code that synchronizes the metadata our way in __init__, but there is no API by which we can stop dnf from downloading the metadata itself. This would be needed by download manager plugins as well. (I am also not able to find a dnf fastestmirror plugin, otherwise I could monkeypatch it; my knowledge of Python, or any programming language, is very limited - I can just do some HTML+JS.)

We would enumerate all repos, look for git-based ones, and otherwise fall back to native downloading.

Comment 51 Megh Parikh 2015-03-04 16:57:08 UTC
(In reply to Frank Ch. Eigler from comment #49)
> I guess that depends on whether it's desirable to use git for history
> tracking at the client side (to go back in time), or using git only for its
> efficient file-change-distribution capabilities (and keep only near-tip
> history).

I guess normal users have no use for going back in time and we only want to use git's efficient file-change-distribution capabilities

Comment 52 Megh Parikh 2015-03-04 17:02:12 UTC
(In reply to Megh Parikh from comment #50)
> Download manager plugins as well. (I am not able to find a dnf fastestmirror
> plugin too :
Oh, dnf has fastestmirror support natively, not as a plugin, so there is no need to monkeypatch the plugin code.

Comment 53 Penelope Fudd 2015-05-12 05:26:21 UTC
I hope I'm not intruding too much into this conversation, but I'd like to make a suggestion or three:

Instead of downloading the metadata file every time, try rsync, as it sends only the changes between the two files.  It seems that everyone reinvents rsync when they decide they don't want to download a fairly static file every time.  Skip the lineup, take the shortcut to rsync now.

Dnf loads sqlite3 modules, but doesn't appear to use them for regular operations. /var/lib/rpm/* will live forever because dnf wants /usr/bin/rpm compatibility, but Yum (and Smart) had its own database outside of rpm (/var/lib/yum/yumdb/*/*).

Also, wouldn't it be nice if 'dnf provides */foobar' were as fast as 'locate */foobar'?  In fact, you could reuse the code from locate to get that speed; just replace 'find' with 'cat', and give it a list of all the filenames in the repository with the package name prepended: 'findutils-4.5.14-3.fc22.x86_64: /usr/bin/find'.  You could even use it for 'rpm -qf' and 'rpm -ql' queries.

Cheers!
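
A minimal sketch of the locate-style list suggested above (Python; the input dict and the output file name are hypothetical - the real data would come from the repository filelists):

filelists = {                                     # hypothetical: package NEVRA -> files
    "findutils-4.5.14-3.fc22.x86_64": ["/usr/bin/find", "/usr/bin/xargs"],
}
with open("repo-files.txt", "w") as out:
    for nevra, files in sorted(filelists.items()):
        for path in files:
            out.write("%s: %s\n" % (nevra, path))
# 'dnf provides */foobar' could then be a simple fnmatch scan over repo-files.txt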

Comment 54 Penelope Fudd 2015-05-12 05:52:00 UTC
After posting, I went back and re-read the older comments again.

Apparently rsync was considered already and rejected back in 2013 (comment 22).  Well, revisit it again, because it can't make things any worse.  If it doesn't work for a particular distribution method, fall back to copying, but don't let "perfect" be the enemy of "good".

If changes to the repository insert data at the start of the file, rsync will slow down to the speed of scp.  If that's the case, use diff or xdelta to record changes to a monotonically-increasing patch file, and use rsync to copy *that*.  Apply the patches on the client side.  Start a new patch file whenever the old one grows beyond a certain size.  Clients can then download the current master file or patch files going back as far as is desired.  You could even publish the name and size of the current patch file as a DNS TXT record (gpg signed), to save wasted HTTP connections.
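
A minimal sketch of the patch-file rotation rule described above (Python; the size limit and the naming scheme are hypothetical):

import os

MAX_PATCH_BYTES = 512 * 1024                      # hypothetical rotation threshold

def patch_target(directory, index):
    """Return (path, index) of the patch file the next change set should be
    appended to, starting a new file once the current one exceeds the limit."""
    path = os.path.join(directory, "changes-%06d.patch" % index)
    if os.path.exists(path) and os.path.getsize(path) >= MAX_PATCH_BYTES:
        index += 1
        path = os.path.join(directory, "changes-%06d.patch" % index)
    return path, index

Clients that already have the older patch files then only rsync the tail of the current one plus any newer ones.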

Comment 55 Honza Silhan 2015-07-20 08:44:00 UTC
I am changing this report to the librepo component, if you don't mind. In the best case this should be done transparently in librepo.

Comment 56 Hedayat Vatankhah 2015-07-20 09:45:50 UTC
While librepo should certainly support it, I'm pretty sure this can't happen transparently, especially if the change is something more than delta metadata. Even now, while (IIRC) librepo supports lazy downloading of the filelists metadata, DNF still always downloads it even when it is not required. I hope DNF stops always downloading filelists, which has made it worse than yum in terms of bandwidth utilization.

Comment 57 Susi Lehtola 2015-09-07 05:51:00 UTC
*** Bug 1163988 has been marked as a duplicate of this bug. ***

Comment 58 Susi Lehtola 2015-09-07 05:51:48 UTC
*** Bug 529358 has been marked as a duplicate of this bug. ***

Comment 59 Michal Luscon 2015-11-02 15:28:09 UTC
*** Bug 1276093 has been marked as a duplicate of this bug. ***

Comment 60 Honza Silhan 2016-01-11 12:43:21 UTC
*** Bug 1295669 has been marked as a duplicate of this bug. ***

Comment 61 Daniel Mach 2019-03-03 11:26:19 UTC
I believe zchunk covers this scenario.

