PyArrow Strings in Dask DataFrames - Improved PyArrow data type support for faster processing and more compact memory use

PyArrow strings became a major part of the pandas 2.0 release because they are faster and more compact in memory than Python object strings. Dask DataFrame users stand to benefit, since PyArrow strings allow more data to be processed on less hardware, which can significantly reduce overall cost. Some operations still require upstream fixes before they fully support PyArrow strings, but most operations already work. To learn more about PyArrow strings, the PyData community can refer to resources such as the pandas 2.0.0 release notes and an Uber/Lyft dataset tutorial. Early adopters are encouraged to report their experience, both positive and negative, to help improve PyArrow strings for the PyData community. Contributors to this effort include Matthew Roeschke (NVIDIA), Patrick Hoefler (Coiled), and Rick Zamora (NVIDIA).