Portfolio
R package development
Taking on package development instils best practices in programming: working in an isolated development environment made possible by the {renv} package and RStudio’s project initialization functionality, documenting code and shared data, and publishing user guides to get people started with the software. The package development workflow relied on {fusen}, {roxygen2}, and {devtools}.
The package is named pbwrangler; its goal is to document functions used for reading, processing, and writing field experimental data in potato breeding.
Shiny web app
{shiny} is undoubtedly a go-to tool for building a web app, whether it runs in production or simply presents a proof of concept. I have been reading about using {golem} to package and modularize Shiny applications for better project-level management (being a package, the app lets you document, check, and test every function) and other reproducibility benefits (for instance, Docker images). I was able to modularize parts of a COVID-19 dashboard initially developed by a team from LSHTM into a golem-framework application called COVID19Dash. It was challenging to navigate a large codebase and break it into parts, and to troubleshoot lines of code that would not run in an isolated environment, the kind we work with during package development and Docker image building/serving.
Prior to this, I had also developed a small app as proof that I understand the underlying framework well enough to build an app that can be used in production.
Reproducible analytical pipelines in R
This was a ‘do-it-yourself’ too, as I read an online version of Rodrigues’ reproducible pipelines text. It reinforced my understanding of package development powered by {fusen}, reproducibility of package versions using {renv}, reproducible pipelines with {targets}, building and sharing Docker containers on Docker Hub and GitHub, and continuous integration/deployment (CI/CD) using GitHub Actions. A branch with CI/CD running a Docker container can be found here.
End-to-end machine learning (MLflow + Docker + Google Cloud)
This is an account of my learning journey, aided by this tutorial, to grasp the nitty-gritty of building, logging, saving, and serving machine learning models. The code is available at this repo, and a write-up is here.
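To give a flavour of that workflow, below is a minimal sketch of training a model, logging its parameters and a metric, and saving the fitted model with MLflow. The dataset, model, metric, and experiment name are illustrative placeholders rather than the tutorial’s actual code.

```python
# Minimal MLflow sketch: train a model, log parameters and a metric,
# and save the fitted model as a run artifact. All names are placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("demo-regression")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestRegressor(**params, random_state=42)
    model.fit(X_train, y_train)

    mlflow.log_params(params)  # hyperparameters
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)  # evaluation metric
    mlflow.sklearn.log_model(model, artifact_path="model")  # saved model artifact
```

A model logged this way can later be loaded back by its run URI or served over HTTP with MLflow’s CLI, which is the piece that gets wrapped in a Docker image for deployment.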
Web scraping: getting data from the internet
I set out to understand how to scrape data using Python. I explored BeautifulSoup, requests, the ScrapingBee API, and Scrapy to scrape data from Google News and a product listing page. I wrote a piece about it too. I have also scheduled this to run daily using GitHub Actions.
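As a minimal illustration of the requests + BeautifulSoup side of this, the sketch below fetches a page and pulls out item titles and links. The URL, headers, and CSS selectors are placeholders, not the ones used in the actual scrapers.

```python
# Minimal requests + BeautifulSoup sketch; URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # placeholder product listing page
HEADERS = {"User-Agent": "Mozilla/5.0 (portfolio-demo)"}

response = requests.get(URL, headers=HEADERS, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Collect the title and link of every product card (selectors are illustrative).
items = []
for card in soup.select("div.product"):
    title = card.select_one("h2")
    link = card.select_one("a")
    if title and link:
        items.append({"title": title.get_text(strip=True), "url": link.get("href")})

print(items)
```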
Project-based data engineering
This is an accompanying “do-it-yourself” as I go through data engineering material from DuckDB. It is my initiative to learn how to build data pipelines with Python, SQL, and DuckDB. The first part is about ingestion: reading public data from Google Cloud and writing it locally as .csv files or to a MotherDuck database. GitHub repo for project materials, and a post.
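The sketch below shows the general shape of that ingestion step with the DuckDB Python client: read a remote CSV into a table and write a local copy back out. The source URL, table name, and file paths are placeholders; the actual project reads from Google Cloud and can target MotherDuck (an `md:` connection string) instead of a local database file.

```python
# Ingestion sketch with DuckDB: pull a public CSV into a table and keep a
# local .csv copy. The URL, table name, and paths are placeholders.
import duckdb

SOURCE_URL = "https://example.com/public/trips.csv"  # placeholder public dataset

con = duckdb.connect("local_warehouse.duckdb")  # file-backed local database

# httpfs lets DuckDB read files over HTTP(S), e.g. from a public bucket.
con.execute("INSTALL httpfs; LOAD httpfs;")

# Ingest the remote CSV, letting DuckDB infer the column types.
con.execute(
    "CREATE OR REPLACE TABLE raw_trips AS SELECT * FROM read_csv_auto(?)",
    [SOURCE_URL],
)

# Also write a local .csv copy of the ingested table.
con.execute("COPY raw_trips TO 'raw_trips.csv' (HEADER, DELIMITER ',')")

print(con.execute("SELECT count(*) FROM raw_trips").fetchone())
```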
ELT pipeline with dbt, Snowflake, and Dagster
I created an ELT pipeline that uses a dbt project to read from and write tables to a Snowflake warehouse, then orchestrated the workflow with Dagster. GitHub repo.
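As a rough sketch of how those pieces fit together, the snippet below uses the dagster-dbt integration to expose a dbt project’s models as Dagster assets and run `dbt build` from the orchestrator. The project path is a placeholder, the dbt manifest is assumed to have been compiled already (for example with `dbt parse`), and the Snowflake connection details live in the dbt profile rather than in Dagster.

```python
# Sketch of orchestrating a dbt project with Dagster via dagster-dbt.
# The project directory is a placeholder; Snowflake credentials are assumed
# to be configured in the dbt profiles, not here.
from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("my_dbt_project")  # placeholder path to the dbt project
DBT_MANIFEST = DBT_PROJECT_DIR / "target" / "manifest.json"  # built by dbt parse/compile


@dbt_assets(manifest=DBT_MANIFEST)
def warehouse_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream its events back so every dbt model shows up
    # as an asset materialization in Dagster.
    yield from dbt.cli(["build"], context=context).stream()


defs = Definitions(
    assets=[warehouse_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)
```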