Persistance / Caching

Once you have a few queries chained together, you may start to get annoyed by how long one or two big things at the start take, and wonder if there’s a way to stop them being executed each time.

For example,

q1 = Q(f'select id, date, rank() over (partition by name order by date) as rank from customers')
q2 = Q(f'select date, count(*) from {q1}')
print(q2.preview_pd(con))
# takes 2 mins becuase q1 is so slow
print(q2.preview_pd(con))
# same thing again, also takes 2 mins
q3 = Q(f'select max(date) from {q2}')
print(q3.preview_pd(con))
# also takes 2 mins because q1 is so slow

The solution is to use csql.Query.persist() on the slow query you want to re-use. Above, we could either do this on q1 or q2, depending on what works best with our database. I’ll demonstrate q2:

q1 = Q(f'select id, date, rank() over (partition by name order by date) as rank from customers')
cache = TempTableCacher(con)
q2 = Q(f'select date, count(*) from {q1}').persist(cache) # <--- !!
print(q2.preview_pd(con))
# still takes 2 mins
print(q2.preview_pd(con))
# now this is fast!
q3 = Q(f'select max(date) from {q2}')
print(q3.preview_pd(con))
# now this is fast as well!

The only general builtin caching method is csql.contrib.persist.TempTableCacher, however it’s straightforward to write your own. You may want to also see csql.contrib.persist as there is a Snowflake-specific example in there as well.

How it Works

The way this works is that when a query marked with .persist(cache) is built/rendered, it is passed to the cache to save, and the cache returns a replacement retrieval query to access its result. The retrieval query is then used whenever the query is rendered, on its own or downstream.

For the above example:

>>> con = some_connection()
>>> # define cache
>>> cache = csql.contrib.persist.TempTableCacher(con)
>>> # define query, no execution yet
>>> q1 = Q(f'select id, date, rank() over (partition by name order by date) as rank from customers')
>>> q2 = Q(f'select date, count(*) from {q1}').persist(cache, 'q2') # note this time we gave a tag.
>>> # reference it, still no execution
>>> q3 = Q(f'select count(*) from {q2}')
>>> # build it in some way:
>>> q3.preview_pd(con) # or q3.db, q3.build(), etc. 

Now the fun happens:

  1. we are rendering SQL of q3

  2. when the q3 renderer gets to the reference to q2, it calls cache.persist(q2)

    1. which runs, roughly,

      temp_name = f'csql_cache_{tag}_{key}'
      con.execute(f'create temp table {temp_name} as {q2.sql}')
      

      which is the slow query we’re wanting to save the results of to avoid re-execution.

    2. and returns, roughly return Q(f'select * from {temp_name}')

  3. so instead of the q3 renderer seeing the original q2, it sees Q(f'select * from csql_cache_q2_asdf1234')

  4. and so gets rendered into

    with _subQuery0 as (
       select * from csql_cache_q2_asdf1234
    )
    select count(*) from _subQuery0
    

Additionally, queries are keyed by their content and parameter values, so previously cached queries can be detected and re-used by the cacher where possible.

csql.persist

class csql.persist.Cacher[source]

Abstract Base Class to represent a persistence/caching method.

To define your own persistance method, you can create a concrete subclass of this. Your implementation only needs to define a single method, _persist().

For example, you might want to write a cacher for SAP HANA - HANA has special syntax for temp tables, so the builtin csql.contrib.persist.TempTableCacher won’t work.

class MyCoolHanaCacher(Cacher):
    def __init__(self, con):
        self.con = con
    def _persist(self, rq: RenderedQuery, key: str, tag: Optional[str]):
        # name our temp table - arbitrary, but if we make it some stable function of `key`
        #   then we can avoid re-computing in future executions as well.
        # Additionally, we put `tag` in there too to be nice to the user, but this isn't
        # strictly needed.
        table_name = f'csql_cache_{tag}_{key}'
        with con.cursor() as c:
            try:
                c.execute(
                    f'create local temporary table #{table_name} as {rq.sql}',
                    rq.params
                )
            except Exception as e:
                if 'existing table' in e: pass # hana has no 'if exists' clause
                else: raise
        return Q(f'select * from #{table_name}')
abstract _persist(rq, key, tag)[source]

This should take a RenderedQuery, save it (keyed by the given key), and return a Query that returns the saved data.

Parameters
  • rq (csql.RenderedQuery) – the csql.RenderedQuery you need to save. Remember, RenderedQuery already has its SQL and parameters prepared and ready to go to a database.

  • key (csql.persist.Key) – a unique key to identify the query you’ve been given.

  • tag (Optional[str]) – a tag supplied by the user, e.g. if they call q.persist(your_cacher, 'some_tag'). If you want, you can include this in the name of your cached data, to make it easy for curious users e.g. to poke around and see what query resulted in what table.

Return type

csql.Query

key is a query content hash that is stable across sessions, so you can avoid re-executing expensive queries if the given key has already been saved in the database (e.g. create table if not exists my_table_{key}).

csql already maintains a record of key-s saved in the current process, but this won’t persist if the python process is restarted - however your tables potentially could, which is where using key becomes helpful.

For example, you could write

def _persist(self, rq, key, tag):
    table_name = f'csql_cache_{tag}_{key}'
    with self.con.cursor() as c:
        c.execute(f'create table if not exists my_user.{table_name} as {rq.sql}', params)
    return Q(f'select * from my_user.{table_name}')

if you were comfortable with leaving permanent tables around in your database.

class csql.persist.Key

A cache key. Type alias of str.