Basic Usage

csql is a Python library to help you write more manageable SQL queries. You can write your queries as small, self-contained chunks, preview the results without pulling a whole result-set down from your database, then refer to them in future queries.

There are also useful features for handling database parameters properly.

The intended use-case is for data analysis and exploration.

Full documentation is available at https://csql.readthedocs.io/en/latest/api.html .

Example

from csql import Q, Parameters
import pandas as pd
from datetime import date

con = your_database_connection()

Start with a straightforward query:

p = Parameters(
	created_on = date(2020,1,1)
)
q1 = Q(f"""
select
	customers.id,
	first(customers.name) as name,
	first(created_on) as created_on,
	sum(sales.value) as sales
from
	customers
	join sales on customers.id = sales.customer_id
where created_on > {p['created_on']}}
group by customers.id
""")

print(q1.preview_pd(con))

	id	name	created_on	sales
1	111	John Smith	2020-02-05	32.0
2	112	Amy Zhang	2020-05-01	101.5
3	115	Tran Thanh	2020-03-02	100000.0

The preview will pull down 10 rows to a) sanity-check the result of what you’ve just written, and b) validate your sql.

Now, try building some new queries that build on your previous queries:

q2 = Q(f"""
select
	ntile(100) over (order by sales)
		as ntile_100,
	name,
	sales
from {q1}
""")

print(q2.preview_pd(con))

	ntile_100	name	sales
1	29	John Smith	32.0
2	50	Amy Zhang	101.5
3	99	Tran Thanh	100000.0

q3 = Q(f"""
select
	ntile_100,
	min(sales),
	max(sales)
from {q2}
group by ntile_100
order by ntile_100
""")

# this time, we'll pull the whole result instead of just previewing:
result = pd.read_sql(**q3.pd(), con=con)
print(result)

	ntile_100	min(sales)	max(sales)
28	29	25	33.3
49	50	98	120
98	99	5004	100000.0

Cool! But, how does it work?

The basic idea is to turn your queries into a CTE by keeping track of what builds on top of what. For example, for the last query shown, q3, what actually gets sent to the database is:

with _subQuery0 as (
	select
		customers.id,
		first(customers.name) as name,
		first(created_on) as created_on,
		sum(sales.value) as sales
	from
		customers
		join sales on customers.id = sales.customer_id
	where created_on > :1
	group by customers.id
),
_subQuery1 as (
	select
		ntile(100) over (order by sales)
			as ntile_100,
		name,
		sales
	from _subQuery0
)
select
	ntile_100,
	min(sales),
	max(sales)
from _subQuery1
group by ntile_100
order by ntile_100

which is exactly the sort of unmaintainable and undebuggable monstrosity that this library is designed to help you avoid.

Easy Parameters

Using proper SQL prepared statements is great to do, but can be annoying to maintain. Additionally, it can be incredibly annoying when you are trying to use a list from Python:

con = my_connection()
ids_i_want = [1, 2, 3]
with con.cursor() as c:
	# uh oh, you can't do this
	c.execute('select * from customers where id in :1', (ids_i_want,))

	# you need to do something like this instead
	c.execute('select * from customers where id in (:1, :2, :3), (ids_i_want[0], ids_i_want[1], ids_i_want[2],))

csql makes this much easier - you can embed your parameters naturally with string interpolation, and they will still be sent as proper parameterized statements.

p = Parameters(ids_i_want=[1, 2, 3], name="Jarrad")

get_customers = Q(f"""
	select * from customers
	where
		ids in {p["ids_i_want"]}
		or name = {p["name"]}
""")

with con.cursor() as c:
	c.execute(*get_customers.db)

That final statement is actually equivalent to:

with con.cursor() as c:
	c.execute(
		"""
		select * from customers
		where
			ids in (:1, :2, :3)
			or name = :4
	""",
		[1, 2, 3, "Jarrad"],
	)

Changing Parameter Values

Parameters aren’t super useful if they are set in stone, but csql wants you to give values at the query definition time! How can you pass different values later?

This is achieved by passing newParams to csql.Query.build():

p = Parameters(start=datetime.now() - timedelta(days=3), end=datetime.now())
q = Q(f"select count(*) from events where start <= date and date < end")
pd.read_sql(**q.pd, con=con)
# 42 # 3 days ago to now, as per `p`.
newParams = {"start": date(2010, 1, 1)}
pd.read_sql(**q.build(newParams=newParams).pd, con=con)
# 42000 # 2010 to now, with new value for `start` provided.

`csql`

Q()

csql.Q(sql, dialect=InferOrDefault(dialect=SQLDialect(paramstyle=ParamStyle.numeric, limit=Limit.limit)), overrides=InferOrDefault(overrides=None))[source]

Create a csql.Query.

Usage:

>>> p = Parameters(created_on=date(2020,1,1))
>>> q_cust  = Q(f'''select name, customer_type from customers where created_on > {p['created_on']}''')
>>> q_count = Q(f'select customer_type, count(*) from {q_cust} group by rollup(type)')

See: Basic Usage

Parameters:

sql (str) – A string with a SQL query. The string is designed to be built with an f'f-string', so you can interpolate Parameters and other Queries inside in a natural way.
dialect (csql.dialect.SQLDialect | csql.dialect.InferOrDefault) – A default dialect to use when building this Query. By default, if this Query references another Query, the references Query’s dialects will be used.
overrides (csql.overrides.Overrides | None | csql.overrides.InferOrDefault) – A default set of overrides to use when building this Query. By default, if this Query references another Query, the references Query’s overrides will be used.

Return type:

csql.Query

Parameters()

class csql.Parameters(**kwargs)[source]

Parameters let you quickly initialize a bunch of params to pass into your queries.

Once parameters have been added in the Parameters constructor or with add(), they can be pulled out by their p['parameter name'], for use in a Query.

Usage:

>>> p = Parameters(
...   start=date(2019,1,1),
...   end=date(2020,1,1)
... )
>>> q = Q(f"select * from customers where {p['start']} <= date and date < {p['end']}")

See: Changing Parameter Values

Parameters:: kwargs (ParameterValue)

add(value='_csql_novalue', /, **kwargs)[source]

Adds a single parameter into this Parameters, and returns it. You don’t normally need this (just add them directly when building Parameters), but it can be useful in loops where you need to build a query based on an unknown number of params.

Can be called as

>>> p.add('value')
... # which will add a single parameter with an autogenerated name.

Can also be called as

>>> p.add(key='value')
... # which will add a named parameter.

Useful in loops:

>>> p = Parameters()
>>> licence_cancellations = [
...   ('Shazza', date(2019, 1,  1)),
...   ('Bazza',  date(2019, 1, 26)),
...   ('Azza',   date(2022, 1,  3))
... ]
>>> where_clause = ' or '.join(
...   f'(name = {p.add(name)} and timestamp > {p.add(date)})'
...  for name, date in licence_cancellations
... )
>>> query = Q(f'select * from frankston_traffic_log where {where_clause}')

Parameters:

value (csql.ParameterValue) – A single parameter to add: add(123). Cannot be used with kwargs.
kwargs (csql.ParameterValue) – A single key and parameter to add: add(my_fav_number=123). Cannot be used with value.

Return type:

csql.ParameterPlaceholder

Query

class csql.Query[source]

A Query is CSQL’s structured concept of a SQL query. You should not create these directly, instead you should use csql.Q().

Parameters:

queryParts (tuple[str | QueryBit, ...])
default_dialect (csql.dialect.SQLDialect | csql.dialect.InferOrDefault)
default_overrides (Overrides | csql.overrides.InferOrDefault | None)
_extensions (frozenset[QueryExtension])

preview_pd(con, rows=10, dialect=None, newParams=None, overrides=None)[source]

Return a small dataframe to preview the results of this query.

Usage:

>>> c = my_connection()
>>> q = Q(f'''select 123 as val''')
>>> print(q.preview_pd(c))
   val
0  123

Parameters:

con (Any) – A DBAPI-compliant connection, passed directly to con arg of pandas.read_sql().
rows (int | None) – The number of rows to pull.
dialect (csql.dialect.SQLDialect | None)
newParams (Mapping[str, ParameterValue] | None)
overrides (csql.overrides.Overrides | None)

Return type:

pandas.DataFrame

preview_pl(con, rows=10, dialect=None, newParams=None, overrides=None)[source]

Return a small polars DataFrame to preview the results of this query.

Return type:

polars.DataFrame

Parameters:

con (Any)
rows (int | None)
dialect (csql.dialect.SQLDialect | None)
newParams (Mapping[str, ParameterValue] | None)
overrides (csql.overrides.Overrides | None)

build(*, dialect=None, newParams=None, overrides=None)[source]

Build this csql.Query into a csql.RenderedQuery.

While you can specify paramters to manually override how this Query is rendered, it’s normally better to just supply these as defaults when you create your Queries in the first place. See: SQL Dialects.

Parameters:

dialect (csql.dialect.SQLDialect | None) – An optional csql.dialect.SQLDialect to render as. See SQL Dialects.
newParams (Mapping[str, ParameterValue] | None) – A dictionary of {'key': value} to override any parameters. See: Changing Parameter Values.
overrides (csql.overrides.Overrides | None) – An optional csql.overrides.Overrides to override how rendering workd. See: Advanced Customization.

Return type:

csql.RenderedQuery

property pd: dict[str, Any]

Convenience wrapper for Query.build().pd.

Returns a dict of {'sql':sql, 'params':params}, for usage like:

>>> import pandas as pd
>>> con = my_connection()
>>> q = Q('select 123')
>>> pd.read_sql(**q.pd, con=con)

property db: tuple[str, tuple[Hashable, ...]]

Convenience wrapper for Query.build().db.

Returns a tuple of (sql, params), for usage like:

>>> con = my_connection()
>>> q = Q('select 123')
>>> con.cursor().execute(*q.db)

property pl: PolarsQueryArgs

Convenience wrapper for Query.build().db.

Returns a dict of {'query':query, 'execute_options':{'parameters':params}}, for usage like:

>>> import polars as pl
>>> con = my_connection()
>>> q = Q('select 123')
>>> pl.read_database(**q.pl, connection=con)

property ddb: DuckDBQueryArgs

Convenience wrapper for Query.build().db.

Returns a dict of {'query':query, 'params':params}, for usage like:

>>> import duckdb
>>> con = duckdb.connect()
>>> q = Q('select 123')
>>> con.query(**q.ddb).pl

property ch: ClickhouseQueryArgs

Convenience wrapper for Query.build().db.

Returns a dict of {'query':query, 'parameters':params}, for usage like:

>>>
>>> import clickhouse_connect
>>> ch = clickhouse_connect.create_client(password='asdf')
>>> q = Q('select 123')
>>> ch.query_df_arrow(**q.ch, dataframe_library='polars')

persist(cacher, tag=None)[source]

Marks this query for persistance with the given csql.persist.Cacher.

See: Persistance / Caching

Usage:

>>> con = some_connection()
>>> cache = csql.contrib.persist.TempTableCacher(con)
>>> q = Q(f'select 123 from something_slow').persist(cache)
>>> q.preview_pd(con) # slow
>>> q.preview_pd(con) # fast
>>> q2 = Q(f'select count(*) from {q}')
>>> q2.preview_pd(con) # also fast

Parameters:

cacher (csql.persist.Cacher)
tag (str | None)

Return type:

csql.Query

RenderedQuery

class csql.RenderedQuery[source]

A RenderedQuery is a pair of (sql, parameters), ready to be passed directly to a database.

They are obtained by using Query.build().

Parameters:

sql (str)
parameters (tuple[Hashable, ...])
parameter_names (tuple[str | None, ...])

sql: str: The rendered SQL, ready to be passed to a database.

parameters: tuple[Hashable, ...]: A tuple of parameters, to go along with the SQL.

parameter_names: tuple[str | None, ...]: A tuple of parameter names that the parameters were passed as.

property pd: dict[str, Any]

Gives dict of {'sql':sql, 'params':params}, for usage like:

>>> con = my_connection()
>>> q = Q('select 123')
>>> pd.read_sql(**q.build().pd, con=con)

property db: tuple[str, tuple[Hashable, ...]]

Returns a tuple of (sql, params), for usage like:

>>> con = my_connection()
>>> q = Q('select 123')
>>> con.cursor().execute(*q.build().db)

property params_dict: dict[str, Hashable]

property ch: ClickhouseQueryArgs

property ddb: DuckDBQueryArgs

property pl: PolarsQueryArgs

Other

class csql.ParameterValue

Valid parameter value. You can use any hashable value here (so str, int, float, date, .. ) all fine. You can also use a Sequence of the above.

Type alias for Hashable | Sequence[Hashable].

class csql.ParameterPlaceholder[source]

A ParameterPlaceholder is what you get when you get an individual parameter by name from a Parameters object, like p[‘param_you_want’]. The only thing you should need to do with it is interpolate it into a query:

>>> p = Parameters(param_you_want=123)
>>> q = Q(f'select {p["param_you_want"]}')
>>> q.db
('select :1', (123,))

Parameters:

key (str | AutoKey)
value (csql.ParameterValue)
_key_context (int | None)
fmt (str)

fmt: str

class csql.ClickhouseQueryArgs[source]

query: str

parameters: dict[str, Hashable]

class csql.DuckDBQueryArgs[source]

query: str

params: tuple[Hashable, ...]

class csql.PolarsQueryArgs[source]

query: str

execute_options: dict[str, Any]

class csql.QueryBit[source]

class csql.QueryExtension[source]

Basic Usage

Example

Cool! But, how does it work?

Easy Parameters

Changing Parameter Values

csql

Q()

Parameters()

Query

RenderedQuery

Other

`csql`