Cassandra performance in Python: Avoid namedtuple

Sun, Dec 10, 2017

Companion code for this post is available on GitHub.

TLDR: Use dict_factory instead of named_tuple_factory with the Python Cassandra driver.

Late last year, namedtuple was fingered as a culprit behind slow startup times in larger Python applications. After some back and forth (covered by LWN), it appears that some improvements will come in later versions of Python 3, but namedtuple creation remains expensive.

For applications with a fixed number of named tuples, this is a startup penalty but not a drag on operations while running - once the tuple classes are made, there is no need to remake them. Unfortunately, some libraries do repeatedly create named tuple classes at runtime, including, as I discovered while looking for something else entirely, the DataStax Python driver for Cassandra.

If you have used Cassandra from Python before, you're likely familiar with the fact that each select query returns Row objects with a field for each column. These Row objects are in fact named tuples, generated by the library after it receives the raw results from the Cassandra nodes, according to the following factory implementation (somewhat edited for brevity):

def named_tuple_factory(colnames, rows):
    """
    Returns each row as a namedtuple
    https://docs.python.org/2/library/collections.html#collections.namedtuple
    This is the default row factory.
    [...]
    """
    clean_column_names = map(_clean_column_name, colnames)
    try:
        Row = namedtuple('Row', clean_column_names)
    except Exception:
        # Create list because py3 map object will be consumed by first attempt
        clean_column_names = list(map(_clean_column_name, colnames))
        # [...]
        Row = namedtuple('Row', _sanitize_identifiers(clean_column_names))

    return [Row(*row) for row in rows]

As we can see, a new named tuple class is created every time you make a query returning a result set. If your application makes a large number of small queries, the named tuple processing time can quickly add up. Even if you are reading the same table repeatedly, there is no cache mechanism for re-using the previously generated tuple type. For the program I was profiling, a staggering 34% of worker run time was spent in named tuple creation.

Luckily, the row factory for the Cassandra client can be easily changed, and others are available - raw tuples, dictionaries, and ordered dictionaries. I consider ordinary tuples to be non-ideal since the returned data will be position-dependent instead of keyed, so that leaves the dictionary factory as the best alternative on paper. If we take a look at it, it's rather simple:

def dict_factory(colnames, rows):
    """
    Returns each row as a dict.
    [...]
    """
    return [dict(zip(colnames, row)) for row in rows]
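
To get a feel for the difference without a Cassandra cluster, here is a minimal sketch that mimics the two approaches using only the standard library. The column names, the sample row, and the helper names (rows_as_namedtuples, rows_as_dicts, time_factories) are made up for illustration and are not part of the driver; the functions only imitate the shape of the factories quoted above.

```python
import timeit
from collections import namedtuple

COLNAMES = ["id", "name", "email"]          # hypothetical column names
ROWS = [(1, "alice", "alice@example.com")]  # a small result set, as in a point query


def rows_as_namedtuples(colnames, rows):
    # Builds a brand-new Row class on every call, like named_tuple_factory
    Row = namedtuple("Row", colnames)
    return [Row(*row) for row in rows]


def rows_as_dicts(colnames, rows):
    # Same shape as dict_factory: no class creation, just dict construction
    return [dict(zip(colnames, row)) for row in rows]


def time_factories(number=1000):
    """Return total seconds spent in `number` calls of each approach."""
    return {
        "namedtuple": timeit.timeit(
            lambda: rows_as_namedtuples(COLNAMES, ROWS), number=number),
        "dict": timeit.timeit(
            lambda: rows_as_dicts(COLNAMES, ROWS), number=number),
    }


if __name__ == "__main__":
    for name, seconds in time_factories().items():
        print("%-10s %.4f s" % (name, seconds))
```

The absolute numbers will vary by machine and Python version, so treat the output as a rough indication only. Note also that two calls to rows_as_namedtuples return rows of two distinct classes, even when the column names are identical.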

Since building a dictionary should be much cheaper than creating a new named tuple class, we can expect dict_factory to outperform named_tuple_factory. But by how much? In order to get a ballpark, I've constructed a small benchmark that repeatedly queries a random subset of data from Cassandra, using different row factories. The source code can be found on GitHub if you wish to test for yourself. Run against all of the built-in row factories, here are my local results (some output removed for clarity):

(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python --version
Python 2.7.13
(venv) ross@mjolnir:/h/r/P/cass_speedtest$ python main.py
Loaded 10000 rows of test data.
Warming cassandra up a bit first...done
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction tuple_factory at 0x7ff9b6e44a10>
Benchmark complete.
Runtime avg: 0.884321 seconds (stddev: 0.000252)
QPS avg: 1131.533536 seconds (stddev: 405.529567)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction named_tuple_factory at 0x7ff9b6e44ad0>
Benchmark complete.
Runtime avg: 1.480597 seconds (stddev: 0.000065)
QPS avg: 675.442898 seconds (stddev: 13.412463)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction dict_factory at 0x7ff9b6e44b90>
Benchmark complete.
Runtime avg: 0.876114 seconds (stddev: 0.000070)
QPS avg: 1141.611256 seconds (stddev: 118.118469)
--------------------------------------------------------------------------------
Beginning test for row factory <cyfunction ordered_dict_factory at 0x7ff9b6e44c50>
Benchmark complete.
Runtime avg: 0.945361 seconds (stddev: 0.000033)
QPS avg: 1057.873886 seconds (stddev: 40.724691)

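Even before we table that up, we can see that the named tuple factory clearly lags behind the others: its runs take about 69% longer than the dict factory's (1.48 versus 0.88 seconds), which works out to roughly 41% fewer queries per second: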

Row Factory	Run Time (Seconds)	Queries Per Second
Tuple	0.884321	1131.533536
Named Tuple	1.480597	675.442898
Dict	0.876114	1141.611256
Ordered Dict	0.945361	1057.873886

A real-world application will likely see smaller gains from switching from namedtuple to dict rows, as there is presumably other business logic vying for CPU time, but they should still be appreciable. Changing it in your application is simple - just set the row_factory attribute on your Cassandra session instance, as detailed in the Cassandra docs.
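
As a rough sketch of that change: the helper name use_dict_rows is hypothetical, and the fallback definition exists only so the snippet runs without the driver installed (it mirrors the dict_factory listing quoted earlier). The commented usage assumes a cluster reachable at the driver's default contact point.

```python
try:
    # Ships with the DataStax driver
    from cassandra.query import dict_factory
except ImportError:
    # Fallback mirroring the driver's implementation quoted earlier,
    # so this sketch still runs where the driver is not installed
    def dict_factory(colnames, rows):
        return [dict(zip(colnames, row)) for row in rows]


def use_dict_rows(session):
    """Hypothetical helper: make `session` return each row as a dict."""
    session.row_factory = dict_factory
    return session


# Typical usage, assuming a reachable cluster:
#   from cassandra.cluster import Cluster
#   session = use_dict_rows(Cluster().connect())
#   for row in session.execute(query):
#       print(row["some_column"])
```

Keep in mind that code reading rows by attribute (row.some_column) has to switch to key access (row["some_column"]) once the factory is changed.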
