Do you have large datasets that you simply cannot load into memory to analyse with Pandas? Or do you feel more comfortable expressing operations in SQL instead of Python?

Fret not, for you have DuckDB now! ✨🦆✨
Introduction
Most of us data scientists are familiar with Pandas, the de-facto data processing library in Python. It’s simple, powerful, able to handle a myriad of data formats & sizes and, for the most part, very intuitive. And those familiar with SQL know how powerful it is too, with its ability to handle complex data transformations in a concise syntax. But sometimes you have to trade off execution speed in favor of intuitiveness/expressiveness, or vice versa. This becomes especially painful when you have large (out-of-memory) datasets or need to perform complex wrangling on your data. At times, it makes more sense to handle that in SQL rather than in Pandas.
DuckDB aims to bring these together with a fast & efficient SQL query execution engine that can run complex queries on large sets of data. It integrates tightly with Pandas DataFrames and allows us to run these queries directly on top of them, without having to move data in and out of them.
In this tutorial, we shall see how we can leverage DuckDB in various analytical workloads.
Setup
DuckDB is very lightweight, has no external dependencies, and runs within the host process itself. Simply install it with:
pip install duckdb==0.3.4
To initialize it, run:

import duckdb

dbcon = duckdb.connect()

That’s it! Now you can test it by running:
dbcon.execute('select 1, 2, 3').fetchall()
[(1, 2, 3)]
The next step is to run pip install pyarrow to add support for reading/writing parquet data.
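To quickly check that parquet support works, you can round-trip a tiny table through a parquet file. This is only a minimal sketch, and the answer.parquet file name is just an example:

# write a tiny one-row table out to parquet and read it straight back in
dbcon.execute("COPY (SELECT 42 AS answer) TO 'answer.parquet' (FORMAT 'parquet')")
dbcon.execute("SELECT * FROM 'answer.parquet'").fetchall()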
Jupyter Notebook setup
In case you wish to explore it in Jupyter Notebooks, install a few additional libraries for a better experience:
pip install ipython-sql SQLAlchemy duckdb-engine
Import them once installed:
import pandas as pd
import sqlalchemy
%load_ext sql
Set a few config options to prettify the output and return it as a Pandas DataFrame:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False
DuckDB is primarily designed as an in-process, in-memory DB. You can, however, persist your data to disk.
%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db
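You can open a file-backed database through the Python API as well. A small sketch, where the file path is just a placeholder:

# a persistent, file-backed database instead of the default in-memory one
filecon = duckdb.connect('path/to/file.db')
filecon.execute("CREATE TABLE numbers AS SELECT 1 AS a")
filecon.close()  # the table survives on disk and can be re-opened later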
#collapse-output
# we can also access the current config & settings of DuckDB by running the following:
%sql SELECT * FROM duckdb_settings();
 | name | value | description | input_type |
---|---|---|---|---|
0 | access_mode | automatic | Access mode of the database (AUTOMATIC, READ_O... | VARCHAR |
1 | checkpoint_threshold | 16.7MB | The WAL size threshold at which to automatical... | VARCHAR |
2 | debug_checkpoint_abort | NULL | DEBUG SETTING: trigger an abort while checkpoi... | VARCHAR |
3 | debug_force_external | False | DEBUG SETTING: force out-of-core computation f... | BOOLEAN |
4 | debug_many_free_list_blocks | False | DEBUG SETTING: add additional blocks to the fr... | BOOLEAN |
5 | debug_window_mode | NULL | DEBUG SETTING: switch window mode to use | VARCHAR |
6 | default_collation | | The collation setting used when none is specified | VARCHAR |
7 | default_order | asc | The order type used when none is specified (AS... | VARCHAR |
8 | default_null_order | nulls_first | Null ordering used when none is specified (NUL... | VARCHAR |
9 | disabled_optimizers | | DEBUG SETTING: disable a specific set of optim... | VARCHAR |
10 | enable_external_access | True | Allow the database to access external state (t... | BOOLEAN |
11 | enable_object_cache | False | Whether or not object cache is used to cache e... | BOOLEAN |
12 | enable_profiling | NULL | Enables profiling, and sets the output format ... | VARCHAR |
13 | enable_progress_bar | False | Enables the progress bar, printing progress to... | BOOLEAN |
14 | explain_output | physical_only | Output of EXPLAIN statements (ALL, OPTIMIZED_O... | VARCHAR |
15 | force_compression | NULL | DEBUG SETTING: forces a specific compression m... | VARCHAR |
16 | log_query_path | NULL | Specifies the path to which queries should be ... | VARCHAR |
17 | max_memory | 26.9GB | The maximum memory of the system (e.g. 1GB) | VARCHAR |
18 | memory_limit | 26.9GB | The maximum memory of the system (e.g. 1GB) | VARCHAR |
19 | null_order | nulls_first | Null ordering used when none is specified (NUL... | VARCHAR |
20 | perfect_ht_threshold | 12 | Threshold in bytes for when to use a perfect h... | BIGINT |
21 | preserve_identifier_case | True | Whether or not to preserve the identifier case... | BOOLEAN |
22 | profiler_history_size | NULL | Sets the profiler history size | BIGINT |
23 | profile_output | | The file to which profile output should be sav... | VARCHAR |
24 | profiling_mode | NULL | The profiling mode (STANDARD or DETAILED) | VARCHAR |
25 | profiling_output | | The file to which profile output should be sav... | VARCHAR |
26 | progress_bar_time | 2000 | Sets the time (in milliseconds) how long a que... | BIGINT |
27 | schema | | Sets the default search schema. Equivalent to ... | VARCHAR |
28 | search_path | | Sets the default search search path as a comma... | VARCHAR |
29 | temp_directory | .tmp | Set the directory to which to write temp files | VARCHAR |
30 | threads | 12 | The number of total threads used by the system. | BIGINT |
31 | wal_autocheckpoint | 16.7MB | The WAL size threshold at which to automatical... | VARCHAR |
32 | worker_threads | 12 | The number of total threads used by the system. | BIGINT |
33 | binary_as_string | | In Parquet files, interpret binary data as a s... | BOOLEAN |
34 | Calendar | gregorian | The current calendar | VARCHAR |
35 | TimeZone | Europe/Berlin | The current time zone | VARCHAR |
From now on, you can run SQL directly by prefixing %sql (or %%sql for multiline statements) to your code cell and get your output returned as a pandas DataFrame :man_dancing:.
%sql select 1 as a;
 | a |
---|---|
0 | 1 |
DuckDB vs traditional Databases
With the pandas.read_sql command, one can already run SQL queries on an existing DB connection, read tables and load data as pandas DataFrames in memory for processing in Python. While this is fine for lightweight operations, it is not optimized for heavy data processing. Traditional RDBMSs such as Postgres, MySQL, etc. process each row sequentially, which, apart from taking a long time to execute, also induces a lot of CPU overhead. DuckDB, on the other hand, is built with OLAP in mind and is column-vectorized. This helps massively parallelize disk I/O and query execution.
DuckDB uses the Postgres SQL parser under the hood, and offers many of the same SQL features as Postgres 1
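To make the contrast concrete, here is a sketch of both routes; the Postgres connection string and table name below are made up purely for illustration:

# the traditional route: pull rows out of an RDBMS and finish the work in pandas
engine = sqlalchemy.create_engine('postgresql://user:password@localhost/mydb')
pg_df = pd.read_sql('select * from some_table', engine)

# with duckdb, the aggregation runs in-process over columnar vectors,
# directly against the DataFrame we just loaded
dbcon.query('select count(*) from pg_df').to_df()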
Accessing DataFrames with DuckDB
One of the major limitations of Pandas is that it’s single-threaded, so we can’t take advantage of the multi-core CPUs that every computer ships with these days. DuckDB supports multi-threading natively and can therefore use all the cores of your CPU.

There exist a few external libraries that enable parallel processing in pandas, but only for specific operations.

While multi-threading is turned on by default, we can also explicitly set the number of threads:
#collapse-output
# specify no. of threads for multi-threading support
# dbcon.execute("PRAGMA threads=32");
# you can also set this up with the `%sql` line magic
%sql SET THREADS to 32;
We can also write a short snippet to run an SQL query, and return the result as a Pandas DataFrame in a pythonic way:
def dbrun(query: str) -> pd.DataFrame:
    """run the query with duckdb"""
    result = dbcon.query(query).to_df()
    # you can also use `fetch` instead
    # result = dbcon.execute(query).fetch_df()
    # or just simply `.df()`
    # result = dbcon.execute(query).df()
    return result
Now create a DataFrame & run a simple query:
df = pd.DataFrame(range(10), columns=['a'])

# querying with python function
first_half_df = dbrun("""
    SELECT *
    FROM df
    WHERE a < 5
""")
Alternatively, you can also compute the result with the SQL cell magic and store it as a Python variable that you can access elsewhere:
#collapse-output
%%sql
second_half_df <<
SELECT *
FROM df
WHERE a >= 5
Returning data to local variable second_half_df
Notice that we didn’t declare df as a view, nor did we create any table. If a table with that name doesn’t exist, DuckDB simply looks for a Python variable of that name declared earlier.
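If you prefer to be explicit (for instance when the DataFrame lives somewhere this lookup can’t see it), you can also register it as a named view yourself. A small sketch:

# expose the DataFrame to duckdb under a name of our choosing
dbcon.register('df_view', df)
dbcon.query('select count(*) from df_view').to_df()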
Accessing files with DuckDB
Oftentimes we have data stored either in a DB or in parquet/csv formats, and it can be huge and barely (or not at all) fit in memory. Loading it with pandas incurs significant memory overhead and possibly creates multiple copies when performing certain operations. With DuckDB, such files can be queried directly without any copying, thanks to the direct integration with Apache Arrow.
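For instance, a single line is enough to run SQL against a file sitting on disk. A sketch, where the path is just a placeholder:

# nothing is read into pandas here; duckdb scans the file itself
dbcon.query("select count(*) from 'path/to/some_file.parquet'").to_df()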
To demonstrate this, we’ll download an airline delay & cancellation dataset from kaggle (and place it under the data/airline_data/ folder). The compressed archive is about 2GB and, when deflated, creates a set of CSVs totalling ~8GB of data.
Let’s have a quick look at the contents by loading a single csv file:
%%sql
select *
from './data/airline_data/2009.csv'
limit 5
 | FL_DATE | OP_CARRIER | OP_CARRIER_FL_NUM | ORIGIN | DEST | CRS_DEP_TIME | DEP_TIME | DEP_DELAY | TAXI_OUT | WHEELS_OFF | ... | CRS_ELAPSED_TIME | ACTUAL_ELAPSED_TIME | AIR_TIME | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | Unnamed: 27 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009-01-01 | XE | 1204 | DCA | EWR | 1100 | 1058.0 | -2.0 | 18.0 | 1116.0 | ... | 62.0 | 68.0 | 42.0 | 199.0 | None | None | None | None | None | None |
1 | 2009-01-01 | XE | 1206 | EWR | IAD | 1510 | 1509.0 | -1.0 | 28.0 | 1537.0 | ... | 82.0 | 75.0 | 43.0 | 213.0 | None | None | None | None | None | None |
2 | 2009-01-01 | XE | 1207 | EWR | DCA | 1100 | 1059.0 | -1.0 | 20.0 | 1119.0 | ... | 70.0 | 62.0 | 36.0 | 199.0 | None | None | None | None | None | None |
3 | 2009-01-01 | XE | 1208 | DCA | EWR | 1240 | 1249.0 | 9.0 | 10.0 | 1259.0 | ... | 77.0 | 56.0 | 37.0 | 199.0 | None | None | None | None | None | None |
4 | 2009-01-01 | XE | 1209 | IAD | EWR | 1715 | 1705.0 | -10.0 | 24.0 | 1729.0 | ... | 105.0 | 77.0 | 40.0 | 213.0 | None | None | None | None | None | None |
5 rows × 28 columns
We can load this efficiently with pandas too, by specifying the nrows parameter:
pd.read_csv('./data/airline_data/2009.csv', nrows=5)
 | FL_DATE | OP_CARRIER | OP_CARRIER_FL_NUM | ORIGIN | DEST | CRS_DEP_TIME | DEP_TIME | DEP_DELAY | TAXI_OUT | WHEELS_OFF | ... | CRS_ELAPSED_TIME | ACTUAL_ELAPSED_TIME | AIR_TIME | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | Unnamed: 27 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2009-01-01 | XE | 1204 | DCA | EWR | 1100 | 1058.0 | -2.0 | 18.0 | 1116.0 | ... | 62.0 | 68.0 | 42.0 | 199.0 | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2009-01-01 | XE | 1206 | EWR | IAD | 1510 | 1509.0 | -1.0 | 28.0 | 1537.0 | ... | 82.0 | 75.0 | 43.0 | 213.0 | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2009-01-01 | XE | 1207 | EWR | DCA | 1100 | 1059.0 | -1.0 | 20.0 | 1119.0 | ... | 70.0 | 62.0 | 36.0 | 199.0 | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2009-01-01 | XE | 1208 | DCA | EWR | 1240 | 1249.0 | 9.0 | 10.0 | 1259.0 | ... | 77.0 | 56.0 | 37.0 | 199.0 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2009-01-01 | XE | 1209 | IAD | EWR | 1715 | 1705.0 | -10.0 | 24.0 | 1729.0 | ... | 105.0 | 77.0 | 40.0 | 213.0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 28 columns
While the first operation with DuckDB took ~250ms, the second one above took only ~50ms. This is possibly because such operations are optimized in pandas, and for computing simple metrics across rows (such as mean, min, max, etc.) pandas can make use of highly optimized numpy routines under the hood. But, as we’ll see later, these only go so far.
Loading all these CSVs into one dataframe with pandas is very inefficient and takes a lot of time. It’ll most likely fail if your computer doesn’t have a beefy amount of RAM. If you still have to do it, you’d iterate through all the CSVs and concatenate them like so:
from glob import glob
csvs = glob('data/airline_data/*.csv')
df = pd.concat(pd.read_csv(csv) for csv in csvs)
A better alternative would be to merge all those CSVs into a single parquet file. Unlike CSVs, parquet files are structured and as such are unambiguous to read.2 They are compressed columnar files and hence take much less space than individual CSVs combined because of the efficient Arrow data format.
%sql copy 'data/airline_data/*.csv' to 'data/airline_data/data.parquet'
 | Count |
---|---|
0 | 61556964 |
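As an aside, if you’d rather skip the %sql magic, the same conversion can be issued through the connection object. A sketch, where the explicit SELECT keeps the source and the output format unambiguous:

dbcon.execute("""
    copy (select * from 'data/airline_data/*.csv')
    to 'data/airline_data/data.parquet' (format 'parquet')
""")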
Notice that we copied all the CSVs directly to a single parquet file using a glob pattern, without ever loading the data into pandas. We went from ~7.5GB of combined CSV files to a 2.5GB parquet file, which now easily fits into memory. So we can load it with pandas as:
df = pd.read_parquet('data/airline_data/data.parquet')
df.shape
(61556964, 28)
Because of the efficient parquet format, we could load the entire 61 Million rows of data in mere seconds.
Projection & Filter Pushdowns
Now let’s do a simple filter operation on our dataset: count the total number of rows that satisfy the condition TAXI_OUT > 10. We’ll try it with both pandas & duckdb.
df[df['TAXI_OUT'] > 10].shape
(45209245, 28)
%%sql
select count(*) as count
from df
where TAXI_OUT > 10
 | count |
---|---|
0 | 45209245 |
While the earlier operation took ~9.5s, the latter took just ~250ms :zap:. There’s just no comparison.

This is because duckdb automatically optimizes the query by selecting only the required column(s) (aka projection pushdown) and then applies the filtering to get a subset of data (aka filter pushdown). Pandas instead reads through all the columns. We can optimize this in pandas by doing these pushdowns ourselves.
projection_pushdown_df = df[['TAXI_OUT']]
filter_pushdown_df = projection_pushdown_df[projection_pushdown_df['TAXI_OUT'] > 10]
filter_pushdown_df.shape
(45209245, 1)
We managed to bring this down from several seconds to almost a second. But using duckdb is still about 70-90%
faster than this.
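For the curious, duckdb’s relational API can express the same pushdowns explicitly from Python. A small sketch; the relation stays lazy until we materialize it with .to_df():

# build a relation over the DataFrame, push the filter & projection into duckdb
rel = dbcon.from_df(df)
rel.filter('TAXI_OUT > 10').project('TAXI_OUT').to_df().shape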
Using Groupby
Now let’s calculate a few aggregates using groupby
with projection & filter pushdowns combined.
Here, we compute a few simple metrics for a certain airline carrier between two airports, grouped by origin & destination, and finally sort the results by the origin airport.
projection_df = df[['ORIGIN', 'DEST', 'TAXI_OUT',
                    'AIR_TIME', 'DISTANCE', 'OP_CARRIER']]
origin_df = projection_df[
    (projection_df['ORIGIN'].isin(('DCA', 'EWR'))) &
    (projection_df['DEST'].isin(('DCA', 'EWR'))) &
    (projection_df['OP_CARRIER'] == 'XE')]

(origin_df
 .groupby(['ORIGIN', 'DEST'])
 .agg(
     avg_taxi_out=('TAXI_OUT', 'mean'),
     max_air_time=('AIR_TIME', 'max'),
     total_distance=('DISTANCE', 'sum'))
 .sort_index(level=0)
)
ORIGIN | DEST | avg_taxi_out | max_air_time | total_distance |
---|---|---|---|---|
DCA | EWR | 22.116009 | 87.0 | 828835.0 |
EWR | DCA | 23.675481 | 93.0 | 831024.0 |
We can make it a bit more concise by using .query for the filter pushdown.
(df
 .query('OP_CARRIER == "XE" and ORIGIN in ("DCA", "EWR") and DEST in ("DCA", "EWR")')
 .groupby(['ORIGIN', 'DEST'])
 .agg(
     avg_taxi_out=('TAXI_OUT', 'mean'),
     max_air_time=('AIR_TIME', 'max'),
     total_distance=('DISTANCE', 'sum'))
)
ORIGIN | DEST | avg_taxi_out | max_air_time | total_distance |
---|---|---|---|---|
DCA | EWR | 22.116009 | 87.0 | 828835.0 |
EWR | DCA | 23.675481 | 93.0 | 831024.0 |
This approach took only about half the time (~3s) compared to our earlier one, because .query uses a modified Python syntax along with indexing, resulting in more efficient evaluation. We can now compare that to our SQL counterpart…
%%sql
select
    ORIGIN,
    DEST,
    AVG(TAXI_OUT) as avg_taxi_out,
    MAX(AIR_TIME) as max_air_time,
    SUM(DISTANCE) as total_distance
from df
where
    OP_CARRIER = 'XE' and
    ORIGIN in ('DCA', 'EWR') and
    DEST in ('DCA', 'EWR')
group by ORIGIN, DEST
order by ORIGIN
 | ORIGIN | DEST | avg_taxi_out | max_air_time | total_distance |
---|---|---|---|---|---|
0 | DCA | EWR | 22.116009 | 87.0 | 828835.0 |
1 | EWR | DCA | 23.675481 | 93.0 | 831024.0 |
This ~400ms execution with duckdb above is around an order of magnitude faster and also a lot cleaner, I’d say. :wink:
Notice that the data is already loaded under df
and hence we don’t need to read from the source parquet file.
In the same way, we can also improve the performance of our queries drastically when using joins across multiple tables. I leave this as an exercise to the reader.
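To give a flavor of it, here is a small sketch joining our flights against a tiny lookup table; the airports DataFrame below is invented purely for illustration:

# a made-up lookup table to join against the flights data
airports = pd.DataFrame({'code': ['DCA', 'EWR'],
                         'city': ['Washington', 'Newark']})

dbrun("""
    select a.city, count(*) as departures
    from df d
    join airports a on d.ORIGIN = a.code
    group by a.city
""")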
But why load data into memory in the first place, when we can process it more efficiently with it sitting just on disk? Oftentimes the data is too big to load into memory anyway.
To do that, we just need to create a VIEW over our data, which lets us query it directly without loading it into memory, and point our queries at the newly created view instead of the dataframe df.3
Accessing data with Views
%%sql
create view airlinedata as select * from 'data/airline_data/data.parquet';

select
    ORIGIN,
    DEST,
    AVG(TAXI_OUT) as avg_taxi_out,
    MAX(AIR_TIME) as max_air_time,
    SUM(DISTANCE) as total_distance
from airlinedata
where
    OP_CARRIER = 'XE' and
    ORIGIN in ('DCA', 'EWR') and
    DEST in ('DCA', 'EWR')
-- you can also use `ALL` instead of each column
group by ALL
order by ORIGIN
 | ORIGIN | DEST | avg_taxi_out | max_air_time | total_distance |
---|---|---|---|---|---|
0 | DCA | EWR | 22.116009 | 87.0 | 828835.0 |
1 | EWR | DCA | 23.675481 | 93.0 | 831024.0 |
This is such a powerful feature and allows us to explore a lot more data than traditionally possible with pandas.

We can also check the datatypes and other metadata with describe <view/table name>:
#collapse-output
%sql describe airlinedata
 | column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|---|
0 | FL_DATE | DATE | YES | None | None | None |
1 | OP_CARRIER | VARCHAR | YES | None | None | None |
2 | OP_CARRIER_FL_NUM | INTEGER | YES | None | None | None |
3 | ORIGIN | VARCHAR | YES | None | None | None |
4 | DEST | VARCHAR | YES | None | None | None |
5 | CRS_DEP_TIME | INTEGER | YES | None | None | None |
6 | DEP_TIME | DOUBLE | YES | None | None | None |
7 | DEP_DELAY | DOUBLE | YES | None | None | None |
8 | TAXI_OUT | DOUBLE | YES | None | None | None |
9 | WHEELS_OFF | DOUBLE | YES | None | None | None |
10 | WHEELS_ON | DOUBLE | YES | None | None | None |
11 | TAXI_IN | DOUBLE | YES | None | None | None |
12 | CRS_ARR_TIME | INTEGER | YES | None | None | None |
13 | ARR_TIME | DOUBLE | YES | None | None | None |
14 | ARR_DELAY | DOUBLE | YES | None | None | None |
15 | CANCELLED | DOUBLE | YES | None | None | None |
16 | CANCELLATION_CODE | VARCHAR | YES | None | None | None |
17 | DIVERTED | DOUBLE | YES | None | None | None |
18 | CRS_ELAPSED_TIME | DOUBLE | YES | None | None | None |
19 | ACTUAL_ELAPSED_TIME | DOUBLE | YES | None | None | None |
20 | AIR_TIME | DOUBLE | YES | None | None | None |
21 | DISTANCE | DOUBLE | YES | None | None | None |
22 | CARRIER_DELAY | DOUBLE | YES | None | None | None |
23 | WEATHER_DELAY | DOUBLE | YES | None | None | None |
24 | NAS_DELAY | DOUBLE | YES | None | None | None |
25 | SECURITY_DELAY | DOUBLE | YES | None | None | None |
26 | LATE_AIRCRAFT_DELAY | DOUBLE | YES | None | None | None |
27 | Unnamed: 27 | VARCHAR | YES | None | None | None |
Using approximations
At times, it suffices just to get an estimate over the data rather than a precise answer. Using approximations helps us do just that.
%%sql
select
    OP_CARRIER,
    approx_count_distinct(DEST) as approx_num_unique_destinations
from airlinedata
group by 1
order by 1
limit 10
 | OP_CARRIER | approx_num_unique_destinations |
---|---|---|
0 | 9E | 186 |
1 | AA | 116 |
2 | AS | 77 |
3 | B6 | 73 |
4 | CO | 85 |
5 | DL | 171 |
6 | EV | 205 |
7 | F9 | 130 |
8 | FL | 75 |
9 | G4 | 126 |
%%sql
select
    OP_CARRIER,
    -- takes more time to compute
    count(distinct DEST) as num_unique_destinations
from airlinedata
group by 1
order by 1
limit 10
 | OP_CARRIER | num_unique_destinations |
---|---|---|
0 | 9E | 185 |
1 | AA | 116 |
2 | AS | 77 |
3 | B6 | 73 |
4 | CO | 85 |
5 | DL | 170 |
6 | EV | 205 |
7 | F9 | 129 |
8 | FL | 75 |
9 | G4 | 126 |
Our approximation query earlier ran about 3-4 times faster than the precise one in this case. This is crucial when responsiveness is more important than precision (esp. for larger datasets).
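Other approximate aggregates work the same way. For example, an approximate median of departure delays per carrier; a sketch that goes through our earlier dbrun helper and reads the parquet file directly:

dbrun("""
    select
        OP_CARRIER,
        approx_quantile(DEP_DELAY, 0.5) as approx_median_dep_delay
    from 'data/airline_data/data.parquet'
    group by 1
    order by 1
    limit 10
""")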
Using Window functions
Finally, let’s wrap our analysis by showing off a bit more of what duckdb can do using some advanced SQL operations.
We create two CTEs (Common Table Expressions) to calculate a couple of features. We do filter & projection pushdowns in one CTE and compute our desired features in the other. The first feature is a simple demo to showcase if-else support. The second is a bit more advanced: we find the last destination a given air carrier has flown to, ordered by flight date, and when it doesn’t exist we replace it with NA. We then take a sample from the final result set.
%%sql
with limited_data as (
    select
        FL_DATE,
        ORIGIN,
        DEST,
        DISTANCE,
        OP_CARRIER
    from airlinedata
    where FL_DATE >= '2015-01-01'
),

last_destination_data as (
    select
        *,
        case
            when DISTANCE*1.60934 > 500 then 'yes'
            else 'no'
        end as distance_more_than_500_km,
        coalesce(last_value(DEST) over (
            partition by OP_CARRIER
            order by FL_DATE
            rows between unbounded preceding and 1 preceding
        ), 'NA') as last_destination_flown_with_this_carrier
    from limited_data
)

select *
from last_destination_data
using sample 10;
 | FL_DATE | ORIGIN | DEST | DISTANCE | OP_CARRIER | distance_more_than_500_km | last_destination_flown_with_this_carrier |
---|---|---|---|---|---|---|---|
0 | 2018-07-10 | DCA | LGA | 214.0 | YX | no | DCA |
1 | 2015-05-08 | DAL | BWI | 1209.0 | WN | yes | BWI |
2 | 2018-03-30 | LAS | SJC | 386.0 | WN | yes | SJC |
3 | 2015-07-10 | BOS | MSP | 1124.0 | DL | yes | DTW |
4 | 2016-06-01 | DTW | BWI | 409.0 | DL | yes | DTW |
5 | 2017-07-26 | GEG | MSP | 1175.0 | DL | yes | SAN |
6 | 2017-01-10 | DFW | ACT | 89.0 | EV | no | DFW |
7 | 2015-01-01 | PHX | DRO | 351.0 | OO | yes | BFL |
8 | 2018-05-06 | DFW | MCO | 985.0 | AA | yes | IAH |
9 | 2018-04-30 | MSY | LAX | 1670.0 | WN | yes | LAX |
Nice, isn’t it?! The same operation is unimaginably complex (for me, at least) in pandas. 🤯
With DuckDB, we can combine many such complex operations and execute them in one go, without worrying much about manual optimizations.
Conclusion
We have seen how performant DuckDB is and how it brings the whole SQL ecosystem into Pandas. Its simple installation and light footprint mean that we can integrate it into our existing workflows with minimal effort and achieve maximum gains in execution speed. We can also continue using pandas on larger datasets without loading them into memory or jumping onto a full-blown distributed computing setup (to a reasonable extent).
Thanks so much for reading this far :heart:. Now you know most of the important stuff about DuckDB to get yourself started. Bear in mind that we’ve hardly scratched the surface; DuckDB offers so much more. You can use correlated subqueries, nested types, etc., apart from its many user-friendly features such as column aliasing in group by/having, auto-incrementing duplicate columns, better string slicing and so on. Their documentation is very clean & beginner friendly, and the blog has very informative posts as well. I encourage you to check those out.
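As a tiny taste of those friendly features, here is a sketch where a column alias defined in the select list is reused directly in group by:

dbrun("""
    select substr(ORIGIN, 1, 1) as first_letter, count(*) as num_flights
    from 'data/airline_data/data.parquet'
    group by first_letter
    order by num_flights desc
    limit 5
""")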
Sidenote
- Modin tries to parallelize pandas workflows by distributing the workload across multiple CPU cores. Have a look; I’m guessing we could also combine Modin & DuckDB for even faster runtimes.
- Fugue is a new framework that provides a unified interface so that users can execute their Python, Pandas, and SQL code on various distributed computing platforms such as Spark and Dask without rewrites. Please check here for a nice introduction from its maintainer Kevin, and also check out this article from Khuyen to get a feel for using it with DuckDB.
I hope you have enjoyed this post and learnt something from it. Please let me know your thoughts/suggestions (or any mistakes) in the comments below. :)
Happy Ducking! 💚🦆💚
Footnotes
The exact execution times might vary a bit depending on the load & build of your computer. I also noticed that the operations are cached: the first computation takes a bit of time, but running it again, or after changing the values of the columns in the WHERE clause, only takes a couple of ms later on.↩︎