Redesign list-by-earliest-revision to be parallel
The old algorithm was:
- Read all revisions from
.orc
and sort them by date (SortRevrelByDate) - Initialize an array of timestamps, one for each node
- for each revision in increasing timestamp order, traverse all contents and
directories from that revision.
If their timestamp is unset, set it and print the
(date, revrel, cntdir)
triple - dump timestamps
This was intrinsically sequential, and took 2 days (1 for SortRevrelByDate, 1 for the others).
The new algorithm takes 1.5 hours. Here is how it works:
- Initialize an array of [
AtomicOccurrence
], one for each node, to the maximum timestamp and an undefined node id - For each revision/release (in parallel):
a. Get its author date (if none, skip the revision/release)
b. traverse all contents and directories contained by that revision/release.
For each content/directory, atomically set the [
AtomicOccurrence
] to the current rev/rel and its timestamp if the [AtomicOccurrence
] has a higher timestamp than the revision/release - For each content/directory (in parallel):
a. Get the
(timestamp, revrel)
pair represented by its [AtomicOccurrence
]. b. If the timestamp is not the maximum (ie. if it was set as step 2.b.), then printthe(timestamp, revrel, cntdir)
triple - Convert the array of [
AtomicOccurrence
] to an array of timestamps and dump it.