
Bug 520189

Summary: yum should use LOW_SPEED_{LIMIT,TIMEOUT} for timeout
Product: Fedora
Component: python-urlgrabber
Version: 12
Hardware: All
OS: Linux
Status: CLOSED RAWHIDE
Severity: medium
Priority: low
Reporter: Mads Kiilerich <mads>
Assignee: James Antill <james.antill>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: ackistler, dant, ffesti, james.antill, jason, kdudka, khchanel, martin.nad89, maxamillion, pmatilai, tim.lauridsen, wolfgang.rupprecht
Doc Type: Bug Fix
Last Closed: 2010-04-29 20:59:12 UTC

Attachments:
  fixed several problems with the transfer progress meter (upstream patch)

Description Mads Kiilerich 2009-08-28 20:00:00 UTC
Description of problem:

I have seen a couple of times that F12 yum hangs when a download is almost complete and some unrealistic ETAs are shown:

(17/45): glib2-debuginfo-2.21.5-1.fc12.i686.rpm (12%) 99% [====================================-]  0.0 B/s | 1.9 MB 617899121847317640429317521174432251629594092066:08 ETA 

Ctrl-c and retrying works.


Version-Release number of selected component (if applicable):

yum-3.2.23-14.fc12.noarch

Comment 1 seth vidal 2009-08-28 20:06:37 UTC
what ver of python-urlgrabber?

Comment 2 Mads Kiilerich 2009-08-28 20:15:32 UTC
python-urlgrabber-3.9.0-8.fc12.noarch

Comment 3 seth vidal 2009-09-03 16:04:59 UTC
Heh. Well, how do you know it wasn't going to take a nearly infinite amount of time? :)

I'll see what I can do to make it less nutty but it's not a super high priority.

Comment 4 Allen Kistler 2009-09-06 01:45:59 UTC
I've seen this bug, too.

yum-3.2.24-2
python-urlgrabber-3.9.0-8

It's more than a nutty ETA.  yum stops downloading.
It can happen at any point in a file, not just at the end.
From that point on, the ETA just gets more and more spectacular.

I've verified that there's no network traffic upstream of yum.
I'm not convinced that it's the server's or network's fault.

As the reporter noted, the only recovery is to kill yum, try again, and hope for something better.

Comment 5 seth vidal 2009-09-06 11:33:33 UTC
So the download is slowing down and eventually stopping but never aborting and you're thinking that urlgrabber is doing it and not your network connection?

Comment 6 Allen Kistler 2009-09-07 02:12:43 UTC
(In reply to comment #5)
> So the download is slowing down and eventually stopping but never aborting and
> you're thinking that urlgrabber is doing it and not your network connection?  

It's more accurate to say I haven't observed the occurrences carefully enough to exclude anything yet.  So far the only data I have is that a ridiculous number of ETA digits means downloading has stopped.

Comment 7 Mads Kiilerich 2009-09-07 11:07:52 UTC
Re comment #5:
Ok. There are two problems: the hanging download and the wild estimates.

Right now the wild estimates clobber the progress meter, making it harder to tell when the download is hanging.

The estimates are so high that there must be either an almost-division-by-zero error or some other bug in the estimate calculation.

For now I am fine with blaming the network connection.

Comment 8 Mads Kiilerich 2009-10-31 00:45:40 UTC
I got it again:
(6/11): empathy-debuginfo-2.28.1.1-3.fc12.i686.rpm                                                                                                   | 1.3 MB     00:06     
(7/11): gcc-debuginfo-4.4.2-7.fc12.i686.rpm                (19%) 12% [======                                           ]  0.0 B/s | 9.4 MB 2017959278923735059258845:52 ETA 


(7/11): gcc-debuginfo-4.4.2-7.fc12.i686.rpm           (19%) 12% [=====                                       ]  0.0 B/s | 9.4 MB 76257225264984816572947599078439867:44 ETA 

But this time it was triggered by a restart of NetworkManager. I assume that caused the TCP connection to break, and apparently curl or yum handled that wrongly somehow.

Comment 9 seth vidal 2009-11-10 15:53:00 UTC
Can anyone here routinely make this happen? If so, let me know - I need someone to test a minimum rate patch that will help.

Comment 10 Bug Zapper 2009-11-16 11:49:16 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 11 Nelson Chan 2009-11-18 04:34:52 UTC
It seems to me that this bug is triggered when the network goes down during a download. That explains comment #8.

Comment 12 Mads Kiilerich 2009-11-28 00:16:39 UTC
I think I tracked it down in libcurl-7.19.7-1.fc12.i686. 

The download with the broken TCP socket keeps spinning in libcurl every 1000 ms in the while loop at transfer.c line 1887, and the only way out is at line 1948, if Curl_socket_ready returns -1. But when the poll in select.c line 218 returns POLLERR, Curl_socket_ready doesn't return -1 but something with CURL_CSELECT_ERR set.

Another issue discussed in http://www.mail-archive.com/curl-library@cool.haxx.se/msg02450.html references http://lists.danga.com/pipermail/memcached/2003-October/000336.html, which can perhaps also explain this issue.

I can imagine that something like the following could solve it - but it is completely untested and I neither know nor understand the libcurl code.

--- /usr/src/debug/curl-7.19.7/lib/transfer.c	2009-09-27 23:37:24.000000000 +0200
+++ transfer.c	2009-11-28 01:02:54.000000000 +0100
@@ -1945,7 +1945,7 @@
     else
       timeout_ms = 1000;
 
-    switch (Curl_socket_ready(fd_read, fd_write, timeout_ms)) {
+    switch ((res = Curl_socket_ready(fd_read, fd_write, timeout_ms))) {
     case -1: /* select() error, stop reading */
 #ifdef EINTR
       /* The EINTR is not serious, and it seems you might get this more
@@ -1955,14 +1955,16 @@
 #endif
       return CURLE_RECV_ERROR;  /* indicate a network problem */
     case 0:  /* timeout */
+      break; /* loop to allow throttle fds to be selectable again */
     default: /* readable descriptors */
-
+      if (res & CURL_CSELECT_ERR)
+          return CURLE_RECV_ERROR;  /* indicate a network problem */
       result = Curl_readwrite(conn, &done);
+      if(result)
+        return result;
       /* "done" signals to us if the transfer(s) are ready */
       break;
     }
-    if(result)
-      return result;
 
     first = FALSE; /* not the first lap anymore */
   }


Does this make sense? Should this issue be reassigned to curl?

Still, I don't understand how this can cause the crazy estimates, but I think it is related.

Comment 13 Kamil Dudka 2009-12-01 17:46:10 UTC
Created attachment 375131 [details]
fixed several problems with the transfer progress meter (upstream patch)

Attached is a patch for the progress meter written by Daniel Stenberg. Please give it a try:

+Daniel Stenberg (4 Nov 2009)
+- I fixed several problems with the transfer progress meter. It showed the
+  wrong percentage for small files, most notable for <1000 bytes and could
+  easily end up showing more than 100% at the end. It also didn't show any
+  percentage, transfer size or estimated transfer times when transferring
+  less than 100 bytes.

Is that actually the case?

Anyway, I don't understand what exactly this bug is about:

1) Is it about the broken progress meter, as the summary says?
2) Or is it about a hanging transfer as stated in comment #0?
3) Do we have any curl based minimal example?

Once I am able to reliably reproduce the behavior I am happy to review the patch and/or write another one.

Thanks in advance for shedding some light on this!

Comment 14 Mads Kiilerich 2009-12-01 23:09:29 UTC
> +Daniel Stenberg (4 Nov 2009)
> +- I fixed several problems with the transfer progress meter. It showed the
> +  wrong percentage for small files, most notable for<1000 bytes and could
> +  easily end up showing more than 100% at the end. It also didn't show any
> +  percentage, transfer size or estimated transfer times when transferring
> +  less than 100 bytes.
> 
> Is that actually the case?

No. The percentage stays OK, and looking at the patch I cannot imagine that it makes any difference, so I haven't tried it. OK?

> Anyway, I don't understand what exactly this bug is about:
> 
> 1) Is it about the broken progress meter, as the summary says?

It is only the ETA that is crazy, as the summary says. The percentage and progress meter are fine - but frozen, because ...

> 2) Or is it about a hanging transfer as stated in comment #0?

yes, the download has apparently stopped, but the ETA keeps increasing. (Obviously the ETA should neither decrease nor stay, but getting so high doesn't make sense.)

> 3) Do we have any curl based minimal example?

No, sorry. I am just a random yum user who noticed the problem - and attached gdb to a failing yum and tried to draw conclusions without knowing anything.

> Once I am able to reliably reproduce the behavior I am happy to review the
> patch and/or write another one.

I can't reproduce what I saw the other day. But please try to follow my reasoning: select.c Curl_socket_ready() can return CURL_CSELECT_ERR according to the docstring, and that is an error situation which should stop the download, but that error situation is not handled by transfer.c Transfer() and it keeps spinning forever. I am pretty sure that is what happened, but I cannot rule out that I might have been tricked by gdb on optimized code...


However, here is something which seems to come pretty close and which might have happened when I have been using a flaky wireless network.
Start a yum download:
  yumdownloader wesnoth-data
and wait for it to start the download. 
While it is downloading, close the connection on the server side without notifying the client side:
  iptables -A INPUT -p tcp --sport 80 -j REJECT --reject-with tcp-reset
The client now sits waiting forever, and the ETA starts increasing and gets crazy while the download rate approaches 0.
Once the server has dropped the TCP connection, the reject rule can be removed:
  iptables -D INPUT -p tcp --sport 80 -j REJECT --reject-with tcp-reset

I can see that this may in some cases be how some would like curl to work, but in yum's case the connection should be dropped and another mirror tried. I don't know if libcurl (and whatever is in the path between yum and curl) has a good way to set an inactivity timeout for a download, or if the caller (yum) should detect the situation through the progress callback.
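
For illustration, here is a minimal sketch of the progress-callback approach, written against pycurl. The StallDetector class, the URL, the file name and the 30-second threshold are all made up for the example; this is not yum or urlgrabber code.

import time
import pycurl

STALL_SECONDS = 30   # illustrative threshold, not a yum/urlgrabber default

class StallDetector(object):
    # Abort the transfer if the downloaded byte count stops moving.
    def __init__(self):
        self.last_bytes = 0
        self.last_change = time.time()

    def __call__(self, dl_total, dl_now, ul_total, ul_now):
        if dl_now > self.last_bytes:
            self.last_bytes = dl_now
            self.last_change = time.time()
        elif time.time() - self.last_change > STALL_SECONDS:
            return 1   # any non-zero return makes libcurl abort the transfer
        return 0

out = open('wesnoth-data.rpm', 'wb')                          # placeholder file name
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.org/wesnoth-data.rpm')   # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, out.write)
c.setopt(pycurl.NOPROGRESS, 0)                                # enable the progress callback
c.setopt(pycurl.PROGRESSFUNCTION, StallDetector())
try:
    c.perform()   # a stall surfaces as pycurl.error with CURLE_ABORTED_BY_CALLBACK (42)
finally:
    c.close()
    out.close()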

Comment 15 Kamil Dudka 2009-12-01 23:34:13 UTC
(In reply to comment #14)
> yes, the download has apparently stopped, but the ETA keeps increasing.
> (Obviously the ETA should neither decrease nor stay, but getting so high
> doesn't make sense.)

If the transfer hangs, the ETA grows. That does not sound like a bug to me.

Or are you saying the transfer hangs when it should not?

> No, sorry. I am just a random yum user who noticed the problem - and attached
> gdb to a failing yum and tried to draw conclusions without knowing anything.

Great! Then attach the backtrace please.
 
> ... but I cannot
> rule out that I might have been tricked by gdb on optimized code...

Are you able to recompile libcurl without optimization? Should I prepare such a build for you?

> ...
>   iptables -D INPUT -p tcp --sport 80 -j REJECT --reject-with tcp-reset

Thanks! I'll try it myself. What do you actually expect curl to do in this case?

> I can see that that it in some cases can be how some would like curl to work,
> but in yums case the connection should be dropped and it should try another
> mirror. I don't know if libcurl (and whatever is in the path between yum and
> curl) has a good way to set a passiveness-timeout for a download, or if the
> caller (yum) should detect the situation through the progress callback?

If you don't want to wait indefinitely for the connection to become ready, I think setting a timeout is the way to go.

Comment 16 Mads Kiilerich 2009-12-02 00:43:28 UTC
While trying to understand and answer the questions I read more of the code, and now I see that the status from Curl_socket_ready is intentionally ignored and the real error handling happens in Curl_readwrite. So my observations _must_ have been wrong, and what I saw was probably just the case I reproduced in comment #14: curl was spinning (slowly, once a second) on a stalled connection, not a closed connection.

So I conclude that curl does what it is told to do, but that yum should either cancel a stalled download from the progress callback or set a timeout. This issue should thus be sent back to yum. Do you agree?

Comment 17 Nelson Chan 2009-12-02 03:12:27 UTC
(In reply to comment #13)

> Once I am able to reliably reproduce the behavior I am happy to review the
> patch and/or write another one.

Try unplugging the network during a transfer.

Comment 18 Kamil Dudka 2009-12-02 10:18:50 UTC
Now I hopefully see your point. It just hangs too long on a dead connection. That's exactly what CURLOPT_TIMEOUT is for. I completely agree this is a bug in yum; reassigning back. Let me know if you need any additional information.

Comment 19 Kamil Dudka 2009-12-02 10:20:15 UTC
Comment on attachment 375131 [details]
fixed several problems with the transfer progress meter (upstream patch)

The proposed patch does not fix the reported problem. The bug has to be fixed within yum.

Comment 20 seth vidal 2009-12-02 18:46:20 UTC
we USED to set curlopt_timeout - but when we do curl aborts ANY download which takes longer than curlopt_timeout.

any ideas?

Comment 21 Kamil Dudka 2009-12-02 18:58:29 UTC
(In reply to comment #20)
> we USED to set curlopt_timeout - but when we do curl aborts ANY download which
> takes longer than curlopt_timeout.
> 
> any ideas?  

No problem. You can of course resume the transfer, so you don't have to download the already-downloaded part again. That's IMO the most common way yum-like tools work. You can even download different parts from different mirrors and finally just check their size and hash, and, if needed, go back to step zero ;-)
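
For illustration, a rough pycurl sketch of the resume idea, assuming the mirror honours byte ranges; fetch_with_resume, the 300 s cap, the URL and the paths are invented for the example and are not urlgrabber code.

import os
import pycurl

def fetch_with_resume(url, path):
    # Restart an interrupted download from the bytes already on disk.
    already = os.path.getsize(path) if os.path.exists(path) else 0
    out = open(path, 'ab')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.RESUME_FROM, already)      # ask the server for a byte range
    c.setopt(pycurl.WRITEFUNCTION, out.write)
    c.setopt(pycurl.TIMEOUT, 300)              # hard cap per attempt, as discussed
    try:
        c.perform()
    finally:
        c.close()
        out.close()

# After a timeout, call fetch_with_resume() again, then verify size/checksum and
# only start over from scratch if the verification fails.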

Comment 22 seth vidal 2009-12-02 19:19:39 UTC
No, you don't understand.

If I set curlopt_timeout in python to, say, 300s, I would assume that means if the download stalls for more than 300s then it times out.

What happens is: if the download is actively downloading data, but the download takes > 300s to come down then the whole download  aborts.

which makes no sense at all.

Comment 23 seth vidal 2009-12-02 19:27:34 UTC
for a bit more info
https://bugzilla.redhat.com/show_bug.cgi?id=515497

Comment 24 Kamil Dudka 2009-12-02 19:52:51 UTC
(In reply to comment #22)
> If I set curlopt_timeout in python to, say, 300s, I would assume that means if
> the download stalls for more than 300s then it times out.

That's only your wrong assumption, not a curl bug. Please read the documentation properly:

http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTTIMEOUT

> What happens is: if the download is actively downloading data, but the download
> takes > 300s to come down then the whole download  aborts.
> 
> which makes no sense at all.

This ^^^ is the documented behavior.

You are free to implement your own heuristic to abort (or not to abort) the transfer at the application level, but I can't see your point.

We have well-tested (and widely used) network protocols, and you are going to come up with something tricky which solves all the problems caused by an unreliable network? This drives me crazy :-D

(In reply to comment #23)
> for a bit more info
> https://bugzilla.redhat.com/show_bug.cgi?id=515497

IMO the approach described in comment #21 solves it better than what you have (probably) done.

Comment 25 seth vidal 2009-12-02 20:01:59 UTC
I understand the docs. The problem is that what curl calls a timeout and what python socket used as a timeout are not the same thing. python socket was saying 'if the socket is open but nothing is going on after N seconds, abort'

curl is saying 'if the socket is open, AT ALL for more than N seconds, abort'.

I had not yet, but was planning on implementing a minimum speed using:
http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTLOWSPEEDLIMIT
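
For comparison, a minimal pycurl sketch of the two options being discussed here; the URL, the file name and the numeric values are only illustrative.

import pycurl

out = open('big.rpm', 'wb')                           # placeholder file name
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.org/big.rpm')    # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, out.write)

# CURLOPT_TIMEOUT: the *whole* transfer must finish within 300 seconds,
# even if data is still flowing - this is what aborted long, healthy downloads.
#c.setopt(pycurl.TIMEOUT, 300)

# CURLOPT_LOW_SPEED_LIMIT / CURLOPT_LOW_SPEED_TIME: abort only if the rate
# stays below 1 byte/s for 300 seconds in a row, i.e. a stall timeout.
c.setopt(pycurl.LOW_SPEED_LIMIT, 1)
c.setopt(pycurl.LOW_SPEED_TIME, 300)

c.perform()
c.close()
out.close()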

Comment 26 Kamil Dudka 2009-12-02 20:25:18 UTC
(In reply to comment #25)
> curl is saying 'if the socket is open, AT ALL for more than N seconds, abort'.

+ you can set the connection timeout separately, which usually makes sense. It has been broken for a long time because of migration to NSS, but it's slowly starting to work ;-)

> I had not yet, but was planning on implementing a minimum speed using:
> http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTLOWSPEEDLIMIT  

Sure, go ahead and try to set it. That may be what you are looking for, though I've never used the option myself.

Nevertheless, also consider transfer resuming if it is not implemented already. It can be pretty annoying to download an RPM the size of e.g. OpenOffice several times on a broken network...

Comment 27 seth vidal 2009-12-02 20:29:19 UTC
connection restarting is already implemented.

urlgrabber has done it since just about forever.

Comment 28 James Antill 2009-12-02 20:31:21 UTC
Seth, I doubt that'll work well as a replacement for a timeout.

Can we reset the timeout for the curl object in the middle of a callback? That seems like the best fix, if it works.

Comment 29 seth vidal 2009-12-02 20:37:51 UTC
No, you can't touch curlopts after perform() has been called.

Comment 30 Kamil Dudka 2009-12-02 20:49:51 UTC
(In reply to comment #27)
> connection restarting is already implemented.

However I am talking about transfer *resuming*, not connection restarting.


(In reply to comment #28)
> Can we reset the timeout for the curl object in the middle of a callback?

http://permalink.gmane.org/gmane.comp.web.curl.library/24861

Generally you can't rely on anything beyond the documented cURL API:

http://curl.haxx.se/libcurl/c

Comment 31 seth vidal 2009-12-02 21:06:19 UTC
(In reply to comment #30)
> (In reply to comment #27)
> > connection restarting is already implemented.
> 
> However I am talking about transfer *resuming*, not connection restarting.
> 

I am too, I misspoke.
We resume using byte-ranges.

 
> (In reply to comment #28)
> > Can we reset the timeout for the curl object in the middle of a callback?
> 
> http://permalink.gmane.org/gmane.comp.web.curl.library/24861
> 

Curiously, I ran into the items mentioned in that email when porting urlgrabber to pycurl. Specifically, there was no way to hand back a sensible progress callback that included a total expected size unless you parsed/accessed the header yourself.

if there are better ways of doing this I'm all ears. I've found the python bindings somewhat frustrating. They aren't hard to understand but hard to know which way is 'better' or 'suggested'.
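
As an illustration of the header-parsing workaround described above - not the actual urlgrabber code - here is a small pycurl sketch; the SizeFromHeaders class, the URL and the file name are made up for the example.

import pycurl

class SizeFromHeaders(object):
    # Collect Content-Length from the response headers so a progress meter
    # can be initialised before the body starts arriving.
    def __init__(self):
        self.length = None

    def __call__(self, line):
        # pycurl hands each raw header line to this callback
        if line.lower().startswith(b'content-length:'):
            self.length = int(line.split(b':', 1)[1].strip().decode('ascii'))

hdr = SizeFromHeaders()
out = open('pkg.rpm', 'wb')                           # placeholder file name

c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.org/pkg.rpm')    # placeholder URL
c.setopt(pycurl.HEADERFUNCTION, hdr)
c.setopt(pycurl.WRITEFUNCTION, out.write)
c.perform()
c.close()
out.close()

print('expected total size was %s bytes' % hdr.length)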

Comment 32 Kamil Dudka 2009-12-02 21:41:11 UTC
(In reply to comment #31)
> We resume using byte-ranges.

Then it might work fairly well even with the fixed timeout set.

> Curiously, I ran into the items mentioned in that email when porting urlgrabber
> to pycurl. Specifically, there was no way to hand back a sensible progress
> callback that included a total expected size unless you parsed/accessed the
> header yourself.
> 
> if there are better ways of doing this I'm all ears. I've found the python
> bindings somewhat frustrating. They aren't hard to understand but hard to know
> which way is 'better' or 'suggested'.

http://curl.haxx.se/libcurl/c/curlgtk.html

Are you saying it's not possible to write the same using pycurl?

Comment 33 seth vidal 2009-12-02 21:50:25 UTC
Part of the requirement of the port from urllib to pycurl for urlgrabber was to do so w/o changing the urlgrabber interface.

With urllib I could urlopen the URL and get back the header info to do urlgrabber's progress object setup.

http://yum.baseurl.org/gitweb?p=urlgrabber.git;a=blob;f=urlgrabber/grabber.py;h=0023fedbd99c8b90147c58204a9b9d9fcdf35c8f;hb=e86d27a4a7a72a8832ad4e1e63996ed8ac616621#l1039

that's where the urlgrabber code using pycurl starts. If you have the time I'd be happy to get some feedback on ways to improve things. (but maybe off this bug)

Comment 34 Kamil Dudka 2009-12-02 22:03:13 UTC
(In reply to comment #33)
> that's where the urlgrabber code using pycurl starts. If you have the time I'd
> be happy to get some feedback on ways to improve things. (but maybe off this
> bug)  

Sure. The best place to discuss this is the curl-library mailing list:

http://cool.haxx.se/mailman/listinfo/curl-library

Most of the libcurl hackers hang around there, and response time is mostly close to zero. pycurl (probably) has its own community, but I don't think your problem is somehow python specific.

Comment 35 seth vidal 2009-12-02 22:10:54 UTC
I will subscribe to the list again. I unsubscribed after a couple of days of extremely disgusting spam.

Comment 36 Mads Kiilerich 2009-12-04 00:27:09 UTC
FWIW this seems to do what I would like:

--- grabber.py.org	2009-12-04 01:13:16.000000000 +0100
+++ /usr/lib/python2.6/site-packages/urlgrabber/grabber.py	2009-12-04 01:20:44.000000000 +0100
@@ -1170,10 +1170,11 @@
         self.curl_obj.setopt(pycurl.MAXREDIRS, 5)
         
         # timeouts
-        timeout = 300
         if opts.timeout:
             timeout = int(opts.timeout)
             self.curl_obj.setopt(pycurl.CONNECTTIMEOUT, timeout)
+            self.curl_obj.setopt(pycurl.LOW_SPEED_LIMIT, 1)
+            self.curl_obj.setopt(pycurl.LOW_SPEED_TIME, timeout)
 
         # ssl options
         if self.scheme == 'https':


Even more FWIW, I think that the 30 s used by default is a bit high - I think 10 s would be more appropriate.

BTW, I noticed, and wonder if it is intentional, that many of the initial downloads made by yum don't use a timeout at all.

Comment 37 Kamil Dudka 2009-12-22 22:51:36 UTC
*** Bug 539563 has been marked as a duplicate of this bug. ***

Comment 38 Dan Thurman 2010-02-27 20:07:56 UTC
I am having this problem.  I have updated to the
latest and this problem is still occurring for me.

I find that every time I run yum, the network
connection gets dropped, and it is more than random.
I have tried two different NICs and it has no effect;
the problem still remains.  It is very difficult to
do updates and installs; success comes down to luck
after repeated retries.

# uname -r
2.6.31.12-174.2.22.fc12.i686

# rpm -qa | grep yum
anaconda-yum-plugins-1.0-5.fc12.noarch
PackageKit-yum-0.5.6-1.fc12.i686
yum-3.2.25-1.fc12.noarch
PackageKit-yum-plugin-0.5.6-1.fc12.i686
yum-plugin-fastestmirror-1.1.26-1.fc12.noarch
yum-utils-1.1.26-1.fc12.noarch
yum-presto-0.6.2-1.fc12.noarch
yum-metadata-parser-1.1.2-14.fc12.i686

I have tried removing each of the yum plugins
and it seems to have no effect.

If there is any information I can provide to help
resolve this issue, please let me know.

Comment 39 Kamil Dudka 2010-02-27 20:34:11 UTC
(In reply to comment #38)
> I am having this problem.  I have updated to the
> latest and this problem is still occurring for me.

Thank you for the heads up!  Please define 'this problem'.  AFAIK this bug is only about the missing timeout in yum downloads.  Is your problem somehow dedicated to yum?  Do other network transfers work fine?  Have you tried the curl(1) tool?

Comment 40 Dan Thurman 2010-02-27 21:45:55 UTC
It seems to be dedicated only to Yum (and its components)
and the associated 'Update Software' & 'Add/New software'
sort of thing.

What I noticed is that on my latest, minimally installed
OS, yum installs/updates are disconnecting quite often,
so I gave up trying. Note, however, that I previously did
a full F12 install several weeks ago (in a different
partition), and I don't recall yum acting up this badly,
but I do recall it was not smooth - normally I use
'Software Update' and 'Add/Remove Software', and they did
hang, so I finished off the rest by using yum directly,
but not without hang problems. At the time it seemed like
just a simple annoyance, it wasn't THAT bad, but somehow
my mind was set on getting a working F12!

I have gkrellm installed, and I can see the network
connection drop immediately, after which yum spins
its wheels.  Most of the time the ETA spins up fast
and ends up as "Infinite", and other times it simply shows
--:--.  But in all cases the transfer rate incrementally
drops to 0b.  It seems to be random, but it always breaks
given a long enough file list. Strangely enough, I have
seen a hang when doing a 'yum clean all' followed by
'yum update', with the hang occurring while downloading
the repo databases!

I seem to recall that at least with previous releases of
yum (F9/10/11) that there was a built-in network timeout
mechanism that would drop the mirror and try another mirror
and not once have I seen this behaviour with F12's yum program.
It seems like "robustness code" was removed or is prevented
from kicking in?

I have pulled the network cable out to see how yum
responds, and sure enough - it hangs. Dunno, this is
just an observation.

I have no idea what curl(1) is, but perhaps you
can tell me what I can do to nail this problem down?

Comment 41 Kamil Dudka 2010-02-27 22:06:03 UTC
(In reply to comment #40)
> I seem to recall that at least with previous releases of
> yum (F9/10/11) that there was a built-in network timeout
> mechanism that would drop the mirror and try another mirror
> and not once have I seen this behaviour with F12's yum program.
> It seems like "robustness code" was removed or is prevented
> from kicking in?

+1 for allowing the timeout in yum.  From the comments above you can see it's already on my wish-list.  Has anybody at least considered the solution from comment #36?

> I have pulled the network cable out to see how yum
> responds, and sure enough - it hangs. Dunno, this is
> just an observation.

That's unfortunate.

> I have no idea what curl(1) is, but perhaps you
> can tell me what I can do to nail this problem down?

It's a tool for downloading/uploading content using various network protocols.  It uses libcurl, as yum indirectly does.  So you can try to use it to download the remote stuff directly and compare the behavior.

Comment 42 Dan Thurman 2010-02-28 18:15:51 UTC
OK, I have done what you asked me to do
with curl, and soon after running curl the
network connection is dropped. curl behaves
in a similar way to yum.

# curl -LO 'http://mirror.uoregon.edu/fedora/linux/releases/12/Everything/i386/os/Packages/kdelibs-apidocs-4.3.2-4.fc12.noarch.rpm'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  3  242M    3 9027k    0     0   103k      0  0:39:56  0:01:27  0:38:29     0

In the above the 'Average Dload' keeps dropping until
0 is reached.  Seems clear to me that somehow there
is no recovery when the network is dropped at least
from curl or yum standpoint.  I mean, my network is
working for everything else as far as I can tell.

Comment 43 Dan Thurman 2010-02-28 18:27:58 UTC
Additionally, I tried using wget to copy all of the
Fedora packages over sequentially, and the network
connection gets dropped.

# wget -nc -r 'http://mirror.uoregon.edu/fedora/linux/releases/12/Everything/i386/os/Packages'

[...]

--2010-02-28 10:23:15--  http://mirror.uoregon.edu/fedora/linux/releases/12/Everything/i386/os/Packages/CodeAnalyst-gui-2.8.54-19.fc12.i686.rpm
Connecting to mirror.uoregon.edu|128.223.157.9|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7852468 (7.5M) [application/x-rpm]
Saving to: “mirror.uoregon.edu/fedora/linux/releases/12/Everything/i386/os/Packages/CodeAnalyst-gui-2.8.54-19.fc12.i686.rpm”

67% [=========================>             ] 5,270,454   --.-K/s  eta 45s

Comment 44 Kamil Dudka 2010-02-28 18:38:55 UTC
(In reply to comment #42)
> In the above the 'Average Dload' keeps dropping until
> 0 is reached.  Seems clear to me that somehow there
> is no recovery when the network is dropped at least
> from curl or yum standpoint.  I mean, my network is
> working for everything else as far as I can tell.    

Well, curl probably can't fix your unreliable network :-)  But as for the "recovery" you wanted, something like that is indeed there.  You want to play with --max-time, --retry, --retry-delay, --retry-max-time, etc.

Comment 45 Jason Merrill 2010-02-28 19:46:00 UTC
...but the existence of curl command-line arguments doesn't help with yum.

Comment 46 Kamil Dudka 2010-02-28 19:57:52 UTC
I am only saying libcurl has the ability.  It's the yum/urlgrabber team's turn now to apply the mentioned 3-line patch or so :-)  Seth said that urlgrabber even supports transfer resuming.  It would be great to bring it to reality and enable it in yum.

Comment 47 Dan Thurman 2010-02-28 20:13:36 UTC

Interesting.  Thanks for that comment above.

I have switched over to F11 (on the same system) and noticed
that with the wget command line the network connection is
dropped, but after a delay of several seconds to several
minutes the connection is retried, picks up where it was
dropped (resumes), and continues on.  This was the behaviour
I was expecting.

I am not seeing this with F12.  With F12, the dropped network
connection is not timing out, nor is it retried, so it hangs.

Keep in mind, that since I have tried curl/wget and see the
same "hang" problem, I wonder if it is more than just a
yum issue?

Comment 48 Kamil Dudka 2010-02-28 20:24:57 UTC
(In reply to comment #47)
> Keep in mind, that since I have tried curl/wget and see the
> same "hang" problem, I wonder if it is more than just a
> yum issue?    

The networking problem itself can't be a bug in yum.  Nevertheless we may improve yum to work better in the case of an unreliable network.

Comment 49 seth vidal 2010-03-01 15:44:48 UTC
For downloading packages yum already uses regets. If it is not regetting then it is possible the mirror we're talking to doesn't support byte-ranges.

I'm not sure what bug we're dealing with here anymore with all the noise of the last few days.

Comment 50 Mads Kiilerich 2010-03-01 16:13:33 UTC
Noise is noise.

The core of the issue is IMHO (and with some local authority due to being the reporter) that the yum/rpm mirroring system builds on the sound "don't scale up - scale out" mantra and utilizes a lot of unreliable servers with limited bandwidth instead of one central resource. Yum as the client thus has to fail over seamlessly (if possible) whenever any download fails, hangs, or misbehaves in any way. The user experience is currently that yum isn't good enough at that.

One specific problem and solution has been pointed out: if an rpm download stalls (for example because of temporary network problems) then it sometimes hangs forever and neither fails nor fails over. Setting LOW_SPEED_LIMIT and LOW_SPEED_TIME seems to solve this specific problem.

Comment 51 James Antill 2010-03-01 16:52:59 UTC
Ok, I just put the patch from comment #36 into upstream.

Comment 52 Mads Kiilerich 2010-03-01 17:00:00 UTC
Note that it seemed to me that yum calls curl from several places (probably for downloading different kinds of metadata) and uses different timeout settings in different places. The other places probably also need fixing - or a general abstraction layer.

Comment 53 Dan Thurman 2010-03-01 17:45:35 UTC
Well... then there is the question of why
yum/rpm behaves correctly (w/ failover) on F11,
but not on F12, on the same hardware?  My HW
network setup has not changed in the last couple
of years...

This is what stymies me...

While the patch is good practice, it does not
explain why F11 works and F12 does not, unless the
timeout code was dropped?

On F11, the latest I have installed:

# rpm -qa | grep yum
anaconda-yum-plugins-1.0-4.fc11.noarch
PackageKit-yum-plugin-0.4.9-1.fc11.i586
yum-presto-0.6.2-1.fc11.noarch
yum-utils-1.1.23-1.fc11.noarch
yum-3.2.24-2.fc11.noarch
yum-arch-2.2.2-8.fc11.noarch
yum-metadata-parser-1.1.2-12.fc11.i586
PackageKit-yum-0.4.9-1.fc11.i586
yum-plugin-protect-packages-1.1.23-1.fc11.noarch
yum-plugin-fastestmirror-1.1.23-1.fc11.noarch
yum-updatesd-0.9-2.fc11.noarch

# yum whatprovides */grabber.py
PyQt4-devel-4.7-1.fc11.i586 : Files needed to build other bindings based on Qt4
Repo        : updates
Matched from:
Filename    : /usr/share/doc/PyQt4-devel-4.7/examples/opengl/grabber.py

yum-arch-2.2.2-8.fc11.noarch : Extract headers from rpm in a old yum repository
Repo        : updates
Matched from:
Filename    : /usr/share/yum-arch/urlgrabber/grabber.py

python-urlgrabber-3.0.0-15.fc11.noarch : A high-level cross-protocol url-grabber
Repo        : installed
Matched from:
Filename    : /usr/lib/python2.6/site-packages/urlgrabber/grabber.py

Comment 54 Mads Kiilerich 2010-03-01 17:54:56 UTC
(In reply to comment #53)
> While the patch is good practice, it does not
> explain why F11 works and F12 does not, unless the
> timeout code was dropped?

$ rpm -q --changelog python-urlgrabber-3.9.1-4.fc12.noarch|grep -A1 3.9.0-1
* Thu Jul 30 2009 Seth Vidal <skvidal at fedoraproject.org> - 3.9.0-1
- new version - curl-based

Comment 55 James Antill 2010-03-01 18:54:16 UTC
In reply to comment #52, I think where the change is done now (in urlgrabber) should affect all code paths from yum. If you want to test it, and have F12, I think this should apply cleanly:

http://yum.baseurl.org/gitweb?p=urlgrabber.git;a=commitdiff;h=8e57ad3fbf14c55434eab5c04c4e00ba4f5986f9

Comment 56 Dan Thurman 2010-03-02 17:39:53 UTC
Ok, thanks for supplying the code fix for testing.  I have
manually added the changes to grabber.py and I have
checked out:

+ rpm
+ yum
+ Add/Remove Software
+ Update Software

In all the above apps there is at least one
network disconnect per app, but in every case
the disconnect retry works, although it can take
seconds to minutes between drops and retries.
I have not seen a complete retry failure.

In yum, a visually "frozen state" appears for
seconds to minutes, followed by a message indicating
a "mirror switch", followed by a retry of the file
download and the curses text status.

Overall, connection retries work much better, at
least for the above apps.

However, I would like to mention, that with F8/9/11, I have
never seen these network disconnects/retries.  I suspect
that there is an underlying problem causing these disconnects
in the first place. This ought to be looked into.

Outside of scope of this bug, but to comment:

+ `Add/Remove Software' could be improved to show more status/
   activity than a simple "bouncing download icon".  When hung,
   the "bouncing download icon" implied it was still working.
   Perhaps additional info/status should be shown as to the file
   count/total being worked on or something similar to the app
   below.  Need more visual feedback so that one can estimate
   how long downloads might take, if it is working at all.

+ `Update Software' has better statistics reporting, however,
   it has strange activity "jumping around" as to the file being
   worked on instead of displaying sequential activity? As it is,
   I was starting to get vertigo just by watching it. :P

Comment 57 Mads Kiilerich 2010-03-02 18:19:53 UTC
Dan: It looks like you have done some impressive testing. I haven't. Thank you!

I think that your test supports my unsubstantiated claim that other code paths don't apply any limits. IIRC there was a 5 minute timeout in some places.

I would suggest configuring a timeout value of 10 seconds in yum.conf. 30 seconds is a long time to wait for a failover. IMHO it would make sense to change the default.

Comment 58 Dan Thurman 2010-03-02 18:46:51 UTC
So, are you suggesting that I put the
following under [main] in /etc/yum.conf:

timeout=10

Please advise.

Comment 59 Mads Kiilerich 2010-03-03 17:43:28 UTC
Yes, I think timeout=10 is better. But it is a matter of personal taste and preference - not something that will make a huge difference or make things work. So use whatever you want - and if you think it makes a big difference then suggest to the maintainers that the default be changed ;-)

Comment 60 James Antill 2010-04-29 20:59:12 UTC
This should be fixed upstream and in rawhide.