scheduler/origin: Add cli to schedule origins from file/stdin to celery
It lived in the snippet repository for years (2017 and prior) and has been used regularly since then. It was recently reused for the osdn origins scheduling and, as usual, improvements to it ensued. It's time to make it an official cli.
It's now migrated under the 'swh scheduler origin' subcommand. Its name is
'send-origins-from-file-to-celery'. The file to read is either a local file or the
standard input. It expects a list of origins (urls), which are pushed directly to the
proper queue according to the task-type argument.
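As an illustration of the input handling described above, here is a minimal sketch of reading origins from either a local file or the standard input. The helper names (`open_input`, `iter_origins`) are hypothetical and not taken from the actual cli; the sketch also assumes one url per line and that '-' (or no file at all) means stdin.

```python
import io
import sys
from typing import Iterator, TextIO


def open_input(file_input: str = "-") -> TextIO:
    """Return a text stream: stdin for '-', otherwise the named local file.

    Assumption: '-' is the stdin marker, as stated in the help message.
    """
    if file_input == "-":
        return sys.stdin
    return open(file_input, "r", encoding="utf-8")


def iter_origins(stream: TextIO) -> Iterator[str]:
    """Yield one origin url per non-empty line (assumed input format)."""
    for line in stream:
        url = line.strip()
        if url:
            yield url


if __name__ == "__main__":
    # Self-contained demo on an in-memory stream instead of a real file.
    sample = io.StringIO("https://example.org/repo-a\n\nhttps://example.org/repo-b\n")
    print(list(iter_origins(sample)))
```

Working on a stream rather than a path keeps the same code usable for both a file and a pipe.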
Help message:
Usage: swh scheduler origin send-origins-from-file-to-celery
           [OPTIONS] TASK_TYPE [FILE_INPUT] [OPTIONS]...

  Send origins directly from file/stdin to celery, filling the queue according
  to its standard configuration (and some optional adjustments).

  Arguments:

      TASK_TYPE: Scheduler task type (e.g. load-git, load-svn, ...)

      INPUT: Dataset file of origins to schedule, use '-' when piping to the
      cli.

      OPTIONS: Extra options (in the key=value form, e.g. base_git_url=<foo>)
      passed directly to the task to be scheduled

Options:
  --queue-name-prefix TEXT  Prefix to add to the default queue name (if
                            needed). Usually needed to treat special origins
                            (e.g. large repositories, fill-in-the-hole
                            datasets, ... ) in another dedicated queue.
  --threshold INTEGER       Threshold override for the queue.
  --limit INTEGER           Number of origins to send. Usually to limit to a
                            small number for debug purposes.
  --waiting-period INTEGER  Waiting time between checks
  --dry-run                 Print only messages to send to celery
  --debug                   Print extra messages
  -h, --help                Show this message and exit.

For example:
# --queue-name-prefix is optional
export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.yml; \
head -20 /tmp/20230509-1539-priority.list.github | \
  shuf | \
  swh scheduler -C "$SWH_CONFIG_FILENAME" origin send-origins-from-file-to-celery \
    --queue-name-prefix large-repository \
    --debug \
    --dry-run \
    load-git
or directly:
# --queue-name-prefix is optional; note that here the whole file is passed
# as-is, without shuffling it
swh scheduler -C "$SWH_CONFIG_FILENAME" origin send-origins-from-file-to-celery \
  --queue-name-prefix large-repository \
  --debug \
  --dry-run \
  --limit 10 \
  load-git \
  /tmp/20230509-1539-priority.list.github
Origins can be routed to an extra queue with the help of the --queue-name-prefix flag.
This uses the standard queue name (as configured in the scheduler) with the dedicated
prefix (queue-name-prefix:standard-queue-name). This also expects the destination
queue to be consumed on the infra side (ping sysadm for it if not).
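The naming rule above can be sketched as follows. This is only an illustration of the prefix:standard-queue-name convention, not the actual implementation; the function name `destination_queue` is hypothetical.

```python
from typing import Optional


def destination_queue(
    standard_queue_name: str, queue_name_prefix: Optional[str] = None
) -> str:
    """Compose the destination queue name.

    Without a prefix, the standard queue name is used unchanged (assumption);
    with a prefix, the result is 'prefix:standard-queue-name'.
    """
    if queue_name_prefix:
        return f"{queue_name_prefix}:{standard_queue_name}"
    return standard_queue_name


if __name__ == "__main__":
    print(
        destination_queue(
            "swh.loader.git.tasks.UpdateGitRepository", "add_forge_now"
        )
    )
```

With the 'add_forge_now' prefix, this yields the same queue name as the one visible in the dry-run messages of the manual test.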
The cli can limit the number of messages sent with the --limit flag. It can also be
tested with --dry-run, which does nothing but print the actions. Some extra logging
can be triggered with the --debug flag.
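A rough sketch of how a limit and a dry-run mode can be combined is shown below. The function names (`build_message`, `send_origins`) and the `send` callable are hypothetical; in real use `send` would presumably be something like celery's `app.send_task`, which is an assumption here. The message keys mirror the ones printed by the dry-run in the manual test.

```python
import itertools
import uuid
from typing import Any, Callable, Dict, Iterable, Optional


def build_message(url: str, task_name: str, queue: str) -> Dict[str, Any]:
    """Build the keyword arguments for one task message (hypothetical helper)."""
    return {
        "name": task_name,
        "task_id": str(uuid.uuid4()),
        "args": (),
        "kwargs": {"url": url},
        "queue": queue,
    }


def send_origins(
    origins: Iterable[str],
    task_name: str,
    queue: str,
    send: Callable[..., Any],
    limit: Optional[int] = None,
    dry_run: bool = False,
) -> int:
    """Send at most `limit` origins; in dry-run mode only print the messages."""
    count = 0
    # islice with a stop of None iterates over everything, so no limit is needed
    for url in itertools.islice(origins, limit):
        message = build_message(url, task_name, queue)
        if dry_run:
            print(f"** DRY-RUN ** call app.send_task with: {message}")
        else:
            send(**message)
        count += 1
    return count


if __name__ == "__main__":
    urls = ["https://example.org/a", "https://example.org/b"]
    # Dry-run: nothing is sent, the `send` callable is never invoked.
    send_origins(urls, "some.task.Name", "some-queue", send=print, limit=1, dry_run=True)
```

Passing the sender as a parameter keeps the sketch runnable without a broker and makes the dry-run path easy to check.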
Manual test shot (debug & dry-run mode, limited to 3 messages):
export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.yml; head -10 ~/downloads/20230509-1539-priority.list.github | shuf | my-swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery --queue-name-prefix add_forge_now --debug --dry-run --limit 3 load-git
{'type': 'load-git', 'description': 'Update an origin of type git', 'backend_name': 'swh.loader.git.tasks.UpdateGitRepository', 'default_interval': datetime.timedelta(days=64), 'min_interval': datetime.timedelta(seconds=43200), 'max_interval': datetime.timedelta(days=64), 'backoff_factor': 2.0, 'max_queue_length': 1000, 'num_retries': 3, 'retry_delay': None}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': '9ef7bef0-2391-4804-99ba-6df54d5e0d01', 'args': (), 'kwargs': {'url': 'https://github.com/0xADE1A1DE/MeasureSuite'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': '371c0d06-63b9-466a-8202-dee33edf7da5', 'args': (), 'kwargs': {'url': 'https://github.com/0ssamaak0/labelmm'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}
** DRY-RUN ** call app.send_task with: {'name': 'swh.loader.git.tasks.UpdateGitRepository', 'task_id': 'fecffc04-7992-445a-8d92-1dee5db12dcc', 'args': (), 'kwargs': {'url': 'https://github.com/0xJepsen/CRC_Research'}, 'queue': 'add_forge_now:swh.loader.git.tasks.UpdateGitRepository'}
Manual test shot with an unknown task type (this fails):
export SWH_CONFIG_FILENAME=~/.config/swh/scheduler.production.yml; cat ~/downloads/only-20-20230509-1539-priority.list.github | shuf | my-swh scheduler -C $SWH_CONFIG_FILENAME origin send-origins-from-file-to-celery --queue-name-prefix large-repository --debug --dry-run --limit 10 unknown-stuff
Usage: swh scheduler origin send-origins-from-file-to-celery
[OPTIONS] TASK_TYPE [FILE_INPUT] [OPTIONS]...
Try 'swh scheduler origin send-origins-from-file-to-celery -h' for help.
Error: Could not find scheduler <unknown-stuff> task type
TODO:
- documentation
- tests
- Find acceptable name for the new subcommand