It would be great if someone could implement the schema discovery algorithm from the DB research GOAT, Thomas Neumann, and add it to Apache Arrow: https://db.in.tum.de/~durner/papers/json-tiles-sigmod21.pdf
Given that the schema is known, it should be able to avoid general JSON parsing. That would be much faster.
The schema is only known at runtime, which prevents ahead-of-time code generation. JITing would probably be an improvement, though (at least in our use case, stream processing, where we're basically always willing to pay higher upfront costs for higher long-term performance).
(Actually, the original version of Arroyo was purely based around ahead-of-time code generation, and used serde_json for deserialization. I wrote at length about why we decided to move away from that approach here: https://www.arroyo.dev/blog/why-arrow-and-datafusion).
How does it compare with serde, which AFAIK uses the same approach?
I'm not very familiar with the internals of serde_json, but it's largely solving a different problem: serde_json deserializes individual records (rows), typically using compile-time code generation. (Alternatively, it can also deserialize into fully dynamic structures like serde_json::Value.)
Arrow-json, by contrast, is doing columnar deserialization against a schema that's only known at runtime. To me, the interesting aspects of the design are how it does that performantly.
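To make the contrast concrete, here's a minimal sketch (not from the article; the struct, field names, and data are mine, and the exact ReaderBuilder signature is an assumption based on recent arrow-rs): serde_json deserializes each row into a struct whose shape is fixed at compile time, while arrow-json takes a Schema built at runtime and produces columnar RecordBatches.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;
use serde::Deserialize;

// serde_json: row-oriented, shape known at compile time via derive.
#[derive(Debug, Deserialize)]
struct Event {
    id: i64,
    name: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let line = r#"{"id": 1, "name": "a"}"#;
    let event: Event = serde_json::from_str(line)?;
    println!("row-oriented: {:?}", event);

    // arrow-json: columnar, schema constructed at runtime.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("name", DataType::Utf8, true),
    ]));
    let ndjson = "{\"id\": 1, \"name\": \"a\"}\n{\"id\": 2, \"name\": \"b\"}\n";
    let mut reader = ReaderBuilder::new(schema).build(Cursor::new(ndjson))?;
    let batch = reader.next().unwrap()?; // one RecordBatch of column arrays
    println!("columnar: {} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```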
The benchmark section ("But is it fast?") contains a common error when trying to represent ratios as percentages.
For the "Tweets" case, it reports a speedup of 229%. The old value is 11.73 and the new is 5.108. That is a speedup of 2.293 (i.e. the new measurement is 2.293 times faster), but that is a difference of -56%, not 229%, so it's 129% faster, if you really want to use a comparative percentage.
Because using percentages to express a ratio of change can be confusing or misleading, I always recommend using speedup instead, which is a simple ratio. A speedup of 2 is twice as fast, a speedup of 1 is the same, and a speedup of 0.5 is half as fast.
Formulas:

speedup(old, new) = old / new
relativePercent(old, new) = ((new / old) - 1) * 100
differenceInPercent(old, new) = (new - old) / old * 100
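For illustration, here's a quick sketch of those formulas in Rust with the "Tweets" numbers plugged in (not from the article; the function names are just transliterations of the formulas above):

```rust
// speedup as a ratio of old time to new time
fn speedup(old: f64, new: f64) -> f64 {
    old / new
}

// percentage change relative to the old value
fn relative_percent(old: f64, new: f64) -> f64 {
    ((new / old) - 1.0) * 100.0
}

// difference expressed as a percentage of the old value
// (algebraically the same as relative_percent)
fn difference_in_percent(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

fn main() {
    let (old, new) = (11.73, 5.108);
    println!("speedup: {:.3}x", speedup(old, new));                    // ~2.296x
    println!("relative: {:+.1}%", relative_percent(old, new));         // ~-56.5%
    println!("difference: {:+.1}%", difference_in_percent(old, new));  // ~-56.5%
}
```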
Thanks for pointing that out! I've updated the table to use speedups rather than (incorrectly computed) percentages.