Replace orcxx with datafusion-orc + Arrow + ar_row
orcxx builds and links with the Apache ORC C++ library, which is a recurring source of issues:
- linking failures we do not understand
- dependencies (libsnappy, libzstd, ...) need to either be installed on the system or built as part of Apache ORC, the latter being another source of linking failures
- Starting with v2.0, Apache ORC C++ will start downloading orc-format as part of its build process, from a location (archive.apache.org) that bans our CI; v2.1 will only partially mitigate this by first downloading from another location (dlcdn.apache.org), where the file will disappear as soon as the next version of orc-format is released
This change replaces orcxx with three components:
- datafusion-orc, a library to parse ORC files into Arrow structures
- Arrow, an in-memory columnar format
- ar_row, a new crate I forked from orcxx by removing all the ORC-parsing code, keeping only the "columnar arrays -> vector of row structures" deserialization, and adapting it to Arrow. ar_row is pure Rust, with much less unsafe code than orcxx.
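To illustrate what "columnar arrays -> vector of row structures" means, here is a minimal sketch in plain Rust. It deliberately does not use the Arrow or ar_row APIs; `PersonColumns`, `PersonRow` and `columns_to_rows` are hypothetical names made up for this example, and `Option` stands in for Arrow's null handling.

```rust
/// Hypothetical columnar batch: one vector per field, all expected to have
/// the same length. `None` marks a null value.
pub struct PersonColumns {
    pub ids: Vec<i64>,
    pub names: Vec<Option<String>>,
}

/// Hypothetical row structure produced by the deserialization.
#[derive(Debug, PartialEq)]
pub struct PersonRow {
    pub id: i64,
    pub name: Option<String>,
}

/// Converts a columnar batch into a vector of rows.
/// Returns an error if the columns do not all have the same length.
pub fn columns_to_rows(batch: PersonColumns) -> Result<Vec<PersonRow>, String> {
    if batch.ids.len() != batch.names.len() {
        return Err(format!(
            "column length mismatch: ids={}, names={}",
            batch.ids.len(),
            batch.names.len()
        ));
    }
    // Walk both columns in lockstep and build one row per index.
    Ok(batch
        .ids
        .into_iter()
        .zip(batch.names)
        .map(|(id, name)| PersonRow { id, name })
        .collect())
}
```

Calling `columns_to_rows` on a batch with `ids = [1, 2]` and `names = [Some("a"), None]` yields two `PersonRow` values, one per index. The real crate presumably derives this kind of conversion for arbitrary structs rather than hand-writing it per type.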
In terms of functionality, this is mostly equivalent, with a ~2% performance penalty on the first step of extract-swhids and a 0.7 -> 17GB increase in memory usage of the Rust part of that first step. Given that this first step includes 96 instances of sort, each with a 100MB in-memory buffer, and is followed by a second step (merging sorted lists) that takes even longer, these extra resources should not matter in practice.
Edited by vlorentz