Cassandra performance in Python: Avoid namedtuple
Sun, Dec 10, 2017
Companion code for this post is available on GitHub.
TL;DR: Use `dict_factory` instead of `named_tuple_factory` for the Python Cassandra driver.
Late last year, `namedtuple` was fingered as a culprit causing slow startup time in larger Python applications. After some back and forth (covered by LWN), it appears that some improvements will come in later versions of Python 3, but `namedtuple` creation remains expensive.
For applications with a fixed number of named tuples, this is a startup penalty but not a drag on operations while running - once the tuples are made, there is no need to remake them. Unfortunately, some libraries do repeatedly create named tuples at runtime, including, as I discovered while looking for something else entirely, the DataStax Python driver for Cassandra.
If you have used Cassandra from Python before, you’re likely familiar with the fact that for each select query, you are returned `Row` objects with a field for each column. These `Row` objects are in fact named tuples, generated by the library after it receives the raw results from the Cassandra nodes, according to the following factory implementation (somewhat edited for brevity):

    def named_tuple_factory(colnames, rows):
        """
        Returns each row as a namedtuple
        https://docs.python.org/2/library/collections.html#collections.namedtuple
        This is the default row factory.
        [...]
        """
        clean_column_names = map(_clean_column_name, colnames)
        try:
            Row = namedtuple('Row', clean_column_names)
        except Exception:
            # Create list because py3 map object will be consumed by first attempt
            clean_column_names = list(map(_clean_column_name, colnames))
            # [...]
            Row = namedtuple('Row', _sanitize_identifiers(clean_column_names))

        return [Row(*row) for row in rows]

As we can see, a new named tuple class is created every time you make a query returning a result set. If your application makes a large number of small queries, the named tuple processing time can quickly add up. Even if you are reading the same table repeatedly, there is no cache mechanism for re-using the previously generated tuple type. For the program I was profiling, a staggering 34% of worker run time was spent in named tuple creation.
Luckily, the row factory for the Cassandra client can be easily changed, and others are available: raw tuples, dictionaries, and ordered dictionaries. I consider ordinary tuples to be non-ideal since the returned data will be position-dependent instead of keyed, so that leaves the dictionary factory as the best alternative on paper. If we take a look at it, it’s rather simple:

    def dict_factory(colnames, rows):
        """
        Returns each row as a dict.
        [...]
        """
        return [dict(zip(colnames, row)) for row in rows]

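
One practical consequence of switching factories is that rows are read by key rather than by attribute. The sketch below re-declares `dict_factory` locally so that it runs without the driver installed; the column names and row values are made up for illustration.

```python
def dict_factory(colnames, rows):
    """Local copy of the dict factory shown above, so no driver install is needed."""
    return [dict(zip(colnames, row)) for row in rows]


# Hypothetical column names and raw rows, standing in for a query result.
colnames = ["user_id", "name"]
raw_rows = [(1, "ada"), (2, "grace")]

result = dict_factory(colnames, raw_rows)

# With named tuple rows this would have been `row.name`;
# with dict rows the same value is read by key.
names = [row["name"] for row in result]
print(names)
```

Any existing code that relies on attribute access (`row.name`) would need to move to key access (`row["name"]`) when the factory changes.
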
Based on our assumptions about the relative performance of dictionaries and named tuple creation, we can expect `dict_factory` to be more performant than `named_tuple_factory`. But by how much? In order to get a ballpark figure, I’ve constructed a small benchmark that repeatedly queries a random subset of data from Cassandra, using different row factories. The source code can be found on GitHub if you wish to test for yourself.
Run against all of the built-in row factories, here are my local results (some
output removed for clarity):

    (venv) ross@mjolnir:/h/r/P/cass_speedtest$ python --version
    Python 2.7.13
    (venv) ross@mjolnir:/h/r/P/cass_speedtest$ python main.py
    Loaded 10000 rows of test data.
    Warming cassandra up a bit first...done
    --------------------------------------------------------------------------------
    Beginning test for row factory <cyfunction tuple_factory at 0x7ff9b6e44a10>
    Benchmark complete.
    Runtime avg: 0.884321 seconds (stddev: 0.000252)
    QPS avg: 1131.533536 seconds (stddev: 405.529567)
    --------------------------------------------------------------------------------
    Beginning test for row factory <cyfunction named_tuple_factory at 0x7ff9b6e44ad0>
    Benchmark complete.
    Runtime avg: 1.480597 seconds (stddev: 0.000065)
    QPS avg: 675.442898 seconds (stddev: 13.412463)
    --------------------------------------------------------------------------------
    Beginning test for row factory <cyfunction dict_factory at 0x7ff9b6e44b90>
    Benchmark complete.
    Runtime avg: 0.876114 seconds (stddev: 0.000070)
    QPS avg: 1141.611256 seconds (stddev: 118.118469)
    --------------------------------------------------------------------------------
    Beginning test for row factory <cyfunction ordered_dict_factory at 0x7ff9b6e44c50>
    Benchmark complete.
    Runtime avg: 0.945361 seconds (stddev: 0.000033)
    QPS avg: 1057.873886 seconds (stddev: 40.724691)

Even before we table that up, we can see the named tuple clearly lags behind the others by a significant margin:
Row Factory | Run Time (Seconds) | Queries Per Second |
---|---|---|
Tuple | 0.884321 | 1131.533536 |
Named Tuple | 1.480597 | 675.442898 |
Dict | 0.876114 | 1141.611256 |
Ordered Dict | 0.945361 | 1057.873886 |
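
The gap can also be explored without a Cassandra cluster by timing only the row-construction step. The following is a rough sketch, not the benchmark from the companion repository: the two functions are simplified stand-ins for the driver's factories (no column-name cleaning), and the column names and sample row are invented. Absolute numbers will vary by machine and Python version.

```python
import timeit
from collections import namedtuple

# Invented sample data standing in for one small query result.
COLNAMES = ("id", "name", "email")
ROWS = [(1, "ada", "ada@example.com")]


def rows_as_namedtuples(colnames, rows):
    # Builds a brand-new class on every call, as the default factory does.
    Row = namedtuple("Row", colnames)
    return [Row(*row) for row in rows]


def rows_as_dicts(colnames, rows):
    # Same shape as dict_factory: one plain dict per row.
    return [dict(zip(colnames, row)) for row in rows]


def time_factory(factory, calls=1000):
    """Return total seconds spent on `calls` invocations of `factory`."""
    return timeit.timeit(lambda: factory(COLNAMES, ROWS), number=calls)


namedtuple_seconds = time_factory(rows_as_namedtuples)
dict_seconds = time_factory(rows_as_dicts)
print("namedtuple rows: {:.4f}s".format(namedtuple_seconds))
print("dict rows:       {:.4f}s".format(dict_seconds))
```

Because each call handles a single row, the measurement is dominated by the per-query setup cost, which is the part the small-query workload described above pays over and over.
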
A real-world application will likely see smaller gains (as there is presumably other business logic vying for CPU time), but they should still be appreciable when switching from `namedtuple` to `dict` rows. Changing it in your application is simple - just set the `row_factory` property on your Cassandra session instance, as detailed in the Cassandra docs.
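
A minimal sketch of that change might look like the following. `Cluster`, `row_factory` and `dict_factory` come from the DataStax driver (`cassandra-driver` on PyPI); the function names, contact point, keyspace, table and column are placeholders. The imports sit inside the function only so that the snippet can be loaded without the driver installed.

```python
def connect_with_dict_rows(contact_points, keyspace):
    """Open a session whose queries return dicts instead of named tuples."""
    # Local imports: the `cassandra-driver` package is needed at call time.
    from cassandra.cluster import Cluster
    from cassandra.query import dict_factory

    cluster = Cluster(contact_points)
    session = cluster.connect(keyspace)
    # The one-line change: every result row is now a plain dict.
    session.row_factory = dict_factory
    return cluster, session


def example():
    # Placeholder address, keyspace, table and column; adjust for your cluster.
    cluster, session = connect_with_dict_rows(["127.0.0.1"], "my_keyspace")
    try:
        for row in session.execute("SELECT name FROM users LIMIT 5"):
            print(row["name"])
    finally:
        cluster.shutdown()
```

Newer driver versions can also configure the row factory through execution profiles, so it is worth checking the documentation for the version you have installed.
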