nixguix: fails to finish as downloading artifacts step hangs
Another issue: sometimes the worker just hangs forever. For example, a nixguix process (running on worker0.internal.staging.swh.network) is currently hanging on a download connection [1].
The only solution I see is to kill the process, which will leave the visit unfinished (stuck in the ongoing state). This lends credit to the origin visit reaper proposal, by the way (T2310#43199).
Adding a timeout to the download connection sounds sensible [2] to avoid this kind of pitfall [3]. Quoting the requests documentation [2]: "Failure to do so can cause your program to hang indefinitely". Well, we had been warned :D
Note: This is probably shared with other package loaders. Right now it's more obvious with this one, as it processes a lot of artifacts in one round.
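To make the timeout suggestion concrete, here is a minimal sketch of a streamed download with an explicit timeout. It is not the loader's actual code: the `download` helper, its `get` parameter and the `DEFAULT_TIMEOUT` values are hypothetical, and only `requests.get(..., stream=True, timeout=...)`, `raise_for_status()` and `iter_content()` are real requests calls.

```python
# (connect timeout, read timeout) in seconds. Hypothetical values, to be tuned.
DEFAULT_TIMEOUT = (10, 60)


def download(url, dest, timeout=DEFAULT_TIMEOUT, get=None):
    """Stream `url` to `dest`, failing instead of hanging if the server goes silent.

    `get` is a hypothetical injection point (defaults to requests.get) so the
    helper can be exercised with a stub.
    """
    if get is None:
        import requests  # imported lazily so a stub can be used without requests

        get = requests.get
    # Without `timeout`, requests waits on the socket indefinitely.
    with get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in response.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return dest
```

Note that the read timeout applies to each wait on the socket, not to the whole download, so a slow but steadily progressing transfer is not interrupted. A stalled connection like the one in [1] would raise a requests exception, which the loader could then treat as a failed artifact.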
- [1]
Last log entry as of now:
Apr 09 17:37:09 worker0 python3[1914]: [2020-04-09 17:37:09,838: DEBUG/ForkPoolWorker-1] package_info: {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'raw': {'url': 'http://ftp.ebi.ac.uk/pub/software/vertebrategenomics/exonerate/exonerate-2.4.0.tar.gz', 'integrity': 'sha256-+EkmHcfJfvHxXyIulVsNPa+ZTsE8nbd2bxrH53uqQEI='}}
Stracing the issue, it's currently waiting on file descriptor 95:
# strace -p 1914
strace: Process 1914 attached
recvfrom(95,
Which points to a socket:
# file /proc/1914/fd/95
/proc/1914/fd/95: symbolic link to socket:[74794390]
Indeed, it's stuck on the HTTP connection:
root@worker0:~# lsof -p 1914 | grep 74794390
python3 1914 swhworker 95u IPv4 74794390 0t0 TCP worker0.internal.staging.swh.network:58952->hx-xfer-prod.ebi.ac.uk:http (ESTABLISHED)
- [2] https://2.python-requests.org/en/master/user/quickstart/#timeouts
- [3] Also, related to downloading, we discussed with @lewo the possibility of improving the download process by doing it in parallel.
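As a rough illustration of the parallel download idea in [3], here is a sketch using a thread pool from the standard library. Nothing here reflects an agreed design: `download_all`, `fetch` and `max_workers` are hypothetical names, and `fetch` stands for whatever function downloads a single artifact (ideally one that sets a timeout).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=4):
    """Call fetch(url) concurrently for each URL; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failed artifact should not abort the others.
                errors[url] = exc
    return results, errors
```

Parallelism alone would not fix the hang: a worker thread blocked on a socket without a timeout would still keep the pool from shutting down, so the two changes go together.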
Migrated from T2357 (view on Phabricator)