from PIL import Image
from IPython.display import display
img_f = "./docs/architecture.png"
img = Image.open(img_f)
display(img)
End-to-end Data Engineering with DuckDB & Python
This is a follow-along walkthrough based on DuckDB material. We will source data from PyPI, a repository of Python packages that provides logs of how given libraries have been used across platforms, transform it into relevant tables, and feed it into a dashboard. The tools are SQL; dbt, a tool for building modular, maintainable data pipelines to power analytics; DuckDB, an in-process SQL online analytical processing (OLAP) relational database management system; and Evidence, which we use to create the dashboard with SQL and Markdown.
Part 1: Ingestion pipeline
Setup & requirements
- Python 3.12
- Poetry for dependency management
- Make to run Makefile commands
- A Google Cloud account to fetch the source data. The free tier covers the compute cost.
- A host of libraries:
poetry
duckdb
google-cloud-bigquery
google-auth
google-cloud-bigquery-storage
pyarrow
pandas
fire
loguru
pydantic
pytest
ruff
The architecture is shown in the diagram (./docs/architecture.png).
Exploring the data
Head to your Google Cloud console, then under BigQuery search for “file_downloads” and click SEARCH ALL PROJECTS.
img_f = "./docs/big_query_search.png"
img = Image.open(img_f)
display(img)
Limit your query to a specific timestamp range, as the table is very large and an unfiltered query would scan far more data than needed.
img_f = "./docs/query.png"
img = Image.open(img_f)
display(img)
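As a rough illustration of limiting the query by timestamp, the sketch below builds the SQL text for a bounded date range. It assumes the public table is `bigquery-public-data.pypi.file_downloads` with `timestamp` and `project` columns. The function name `build_pypi_query` and the extra filter on a single project are hypothetical and not taken from the repository.

```python
import re
from datetime import date

# Assumed fully qualified name of the public PyPI downloads table.
PYPI_TABLE = "bigquery-public-data.pypi.file_downloads"


def build_pypi_query(project: str, start: date, end: date) -> str:
    """Build a query limited to [start, end) for one project (hypothetical helper)."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    # Keep the interpolated value to characters that are safe in a SQL string literal.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return (
        "SELECT *\n"
        f"FROM `{PYPI_TABLE}`\n"
        f"WHERE project = '{project}'\n"
        f"  AND timestamp >= TIMESTAMP('{start.isoformat()}')\n"
        f"  AND timestamp < TIMESTAMP('{end.isoformat()}')"
    )


query = build_pypi_query("duckdb", date(2023, 4, 1), date(2023, 4, 2))
print(query)
```

A single day is usually enough for exploration. In a real pipeline, BigQuery query parameters would be a safer choice than string interpolation.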
The ingestion part is bundled in the /ingestion directory with the following primary files:
!tree ./ingestion
./ingestion
├── bigquery.py
├── duck.py
├── models.py
├── pipeline.py
└── __pycache__
    ├── bigquery.cpython-312.pyc
    └── pipeline.cpython-312.pyc

2 directories, 6 files
- bigquery.py has BigQuery helpers to get the client, get query results, and query the public dataset
- duck.py contains DuckDB helpers to create a table from a dataframe and write to a local or MotherDuck database
- models.py defines the table schema
- pipeline.py runs the ingestion pipeline, using pydantic syntax
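To make the role of pydantic more concrete, here is a minimal sketch of what parameter and schema models could look like. The class names `PypiJobParameters` and `FileDownloadRow`, and all of their fields and defaults, are assumptions made for illustration. The actual contents of models.py are not shown in this walkthrough.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, ValidationError


class PypiJobParameters(BaseModel):
    """Hypothetical run parameters for the ingestion pipeline."""

    start_date: date
    end_date: date
    pypi_project: str
    table_name: str = "pypi_file_downloads"  # illustrative default


class FileDownloadRow(BaseModel):
    """Hypothetical subset of columns kept from the source table."""

    project: str
    country_code: Optional[str] = None
    version: Optional[str] = None


# Strings are validated and converted to date objects on construction.
params = PypiJobParameters(
    start_date="2023-04-01", end_date="2023-04-02", pypi_project="duckdb"
)

# Invalid input is rejected before any query is sent.
try:
    PypiJobParameters(
        start_date="not-a-date", end_date="2023-04-02", pypi_project="duckdb"
    )
    rejected = False
except ValidationError:
    rejected = True
```

Validating parameters up front means a malformed date fails fast on the local machine rather than after a BigQuery job has started.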
The pipeline was executed successfully, and the data was written both locally and to MotherDuck:
img_f = "./docs/motherduck_db.png"
img = Image.open(img_f)
display(img)