Overview

This track restructures your FinTech data engineering roadmap and keeps every topic. It runs from foundations through core tools, advanced platforms, and expert integration, ending with certifications. The primary cloud is AWS and the primary lakehouse is Databricks with Delta Lake, with PySpark as the processing engine. Each level ends with a checkpoint project.

Level 0: Foundations (weeks 1 to 4)

Week 1: Linux and the terminal

Filesystem navigation: ls, cd, find, grep
Permissions and users: chmod, chown, sudo, umask
Process management: ps, top, kill, nohup, systemctl, tmux
Shell scripting: variables, arrays, conditionals, loops, functions, cron, trap
Text processing: awk, sed, cut, sort, uniq, xargs, pipes

Build

A bash script that processes a CSV, filters rows, and schedules itself with cron

Week 2: SQL foundations

Core queries: SELECT, WHERE, JOIN types, GROUP BY, HAVING, CASE, COALESCE
Advanced querying: CTEs, recursive CTEs, correlated subqueries, set operations, EXISTS and IN
Window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, frames
DDL, DML, and optimization: CREATE, views, MERGE, EXPLAIN, B-tree and composite indexes

Build

A query joining three tables with window functions to compute a running account balance
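
One possible shape for this build, sketched with Python's built-in sqlite3 module so it runs without a database server. The table names (customers, accounts, transactions), the columns, and the sample rows are illustrative assumptions, and the sketch assumes SQLite 3.25 or later, which is when window functions were added.

```python
import sqlite3

# Hypothetical schema and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    account_id INTEGER,
    booked_at TEXT,
    amount_cents INTEGER
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO accounts VALUES (10, 1), (20, 2);
INSERT INTO transactions VALUES
    (1, 10, '2024-01-01', 10000),
    (2, 10, '2024-01-03', -2500),
    (3, 20, '2024-01-02', 5000),
    (4, 10, '2024-01-05', 4000);
""")

# Join three tables, then sum amounts per account in booking order.
# The explicit frame makes the running total unambiguous when dates tie.
RUNNING_BALANCE_SQL = """
SELECT c.name,
       a.account_id,
       t.booked_at,
       t.amount_cents,
       SUM(t.amount_cents) OVER (
           PARTITION BY a.account_id
           ORDER BY t.booked_at, t.txn_id
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_balance_cents
FROM transactions AS t
JOIN accounts AS a ON a.account_id = t.account_id
JOIN customers AS c ON c.customer_id = a.customer_id
ORDER BY a.account_id, t.booked_at, t.txn_id
"""

rows = conn.execute(RUNNING_BALANCE_SQL).fetchall()
for row in rows:
    print(row)
```

Amounts are stored as integer cents here to avoid floating point drift in balances. That is a design choice of this sketch, not a requirement of the build.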

Weeks 3 to 4: Git and GitLab

Core git: init, clone, commit, branch, merge, rebase, cherry-pick, stash, reflog
GitLab platform: projects, merge requests, issues, branch protection, code owners
GitLab CI and CD: stages, runners, artifacts, cache, environments, masked variables, schedules

Build

Push a Python script with a GitLab pipeline that lints, tests, and deploys to staging

Level 1: Core tools (weeks 5 to 12)

Weeks 5 to 12: PySpark

Architecture: driver, executors, the DAG, SparkSession, RDD versus DataFrame, cluster managers
Core transforms: select, filter, withColumn, groupBy, agg, cast, when and otherwise, read and write Parquet and Delta, null handling
Intermediate: join strategies (broadcast, shuffle hash, sort merge), window functions, UDFs and pandas UDFs with Arrow, Spark SQL, Delta Lake basics
Advanced: structured streaming, watermarks, adaptive query execution, partitioning and bucketing, cache and persist, explain plans, the S3A connector

Build

A structured streaming pipeline reading financial events, transforming, and writing a Delta table on object storage
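
A minimal sketch of how this build might start, assuming JSON event files land in object storage and that the Delta Lake package is available on the cluster. The column names, the 10 minute watermark, the 5 minute window, and the function name start_spend_stream are all assumptions made for illustration. PySpark is imported inside the function so the file still loads on a machine without Spark installed.

```python
# Assumed event layout; adjust to the real feed.
EVENT_SCHEMA = (
    "event_id STRING, account_id STRING, amount DOUBLE, event_time TIMESTAMP"
)


def start_spend_stream(spark, source_path, target_path, checkpoint_path):
    """Start a query that totals spend per account in 5 minute windows."""
    # Lazy import: only needed when a SparkSession is actually passed in.
    from pyspark.sql import functions as F

    events = (
        spark.readStream
        .format("json")
        .schema(EVENT_SCHEMA)  # streaming file sources need an explicit schema
        .load(source_path)
    )

    windowed = (
        events
        # Accept events up to 10 minutes late, then finalize the window.
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"), F.col("account_id"))
        .agg(
            F.sum("amount").alias("total_amount"),
            F.count("*").alias("event_count"),
        )
    )

    return (
        windowed.writeStream
        .format("delta")
        # Append mode emits a window once the watermark has passed its end.
        .outputMode("append")
        # The checkpoint lets the query resume after a restart.
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )
```

On a cluster, the function would be called with an existing SparkSession and three storage paths, for example S3 locations reached through the S3A connector.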

Weeks 5 to 7: AWS tier one, storage and serverless

S3: buckets, lifecycle policies, versioning, encryption, S3 Select, event notifications
Lambda: handlers, layers, memory and timeout tuning, triggers, container images, cold starts
CloudWatch: log groups, Insights queries, alarms, dashboards, SNS
IAM: roles, policies, trust relationships, least privilege, resource based versus identity based policies
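
To make least privilege concrete, here is a small sketch that builds an identity based policy document granting read access to a single S3 prefix and nothing else. The bucket name, the prefix, and the helper name read_only_prefix_policy are placeholders chosen for this example.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build a policy that allows listing and reading one S3 prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, narrowed by a condition.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, narrowed by the ARN.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


# Placeholder bucket and prefix.
policy = read_only_prefix_policy("example-raw-events", "landing/transactions/")
print(json.dumps(policy, indent=2))
```

There is no write or delete action and no wildcard resource, so a role carrying only this policy cannot modify the data it reads.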

Weeks 8 to 10: AWS tier two, ETL and querying

Glue: crawlers, the data catalog, PySpark ETL jobs, workflows, bookmarks, data quality
Athena: serverless SQL on S3, Parquet and ORC optimization, partitioned tables, partition projection, CTAS, federated queries
DynamoDB: partition and sort key design, GSI and LSI, streams to Lambda, TTL, on demand capacity, transactions, point in time recovery, DAX
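
Partition and sort key design is easier to reason about with a concrete item. The sketch below assumes one common single table layout: the partition key groups all transactions of an account, and the sort key starts with a UTC timestamp so a key prefix selects a time range. The attribute names PK, SK, and expires_at and the 90 day retention are assumptions for illustration. The TTL value is stored as epoch seconds, which is the format the DynamoDB TTL feature reads.

```python
from datetime import datetime, timedelta, timezone


def transaction_item(account_id, txn_id, booked_at, amount_cents, ttl_days=90):
    """Build an item whose keys keep one account's transactions in time order."""
    if booked_at.tzinfo is None:
        raise ValueError("booked_at must be timezone-aware")
    booked_utc = booked_at.astimezone(timezone.utc)
    iso = booked_utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "PK": f"ACCOUNT#{account_id}",
        # ISO timestamps sort lexicographically in chronological order.
        "SK": f"TXN#{iso}#{txn_id}",
        "amount_cents": amount_cents,
        # TTL attribute as epoch seconds.
        "expires_at": int((booked_utc + timedelta(days=ttl_days)).timestamp()),
    }


def month_range(account_id, year, month):
    """Return the partition key and the sort key prefix for one month."""
    return f"ACCOUNT#{account_id}", f"TXN#{year:04d}-{month:02d}"


booked = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
item = transaction_item("42", "t-1", booked, 1999)
pk, sk_prefix = month_range("42", 2024, 1)
print(item["PK"], item["SK"], sk_prefix)
```

A query for one month of an account would then pair equality on PK with a begins_with condition on SK using the prefix from month_range.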

Weeks 11 to 12: AWS tier three, orchestration and events

Step Functions: standard versus express, all state types, input and output paths, catch and retry, nested workflows
EventBridge: event buses, pattern rules, cron schedules, pipes, streams integration, the schema registry
Service Catalog: portfolios, products, launch constraints, CloudFormation integration, cross account sharing
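
The catch and retry item under Step Functions can be illustrated with a small state machine definition in Amazon States Language, built here as a Python dictionary and serialized to JSON. The Lambda ARN, the state names, and the retry settings are placeholders. The helper retry_delays is a hypothetical convenience that shows how the interval grows with the backoff rate.

```python
import json

# Placeholder ARN, not a real function.
VALIDATE_FN_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:validate-file"

state_machine = {
    "Comment": "Validate an uploaded file, retrying transient failures",
    "StartAt": "ValidateFile",
    "States": {
        "ValidateFile": {
            "Type": "Task",
            "Resource": VALIDATE_FN_ARN,
            # Retry transient errors with exponential backoff.
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing is routed to a terminal failure state.
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",
                    "Next": "MarkFailed",
                }
            ],
            "Next": "Done",
        },
        "MarkFailed": {
            "Type": "Fail",
            "Error": "ValidationFailed",
            "Cause": "File validation failed after retries",
        },
        "Done": {"Type": "Succeed"},
    },
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt for one Retry entry."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


definition = json.dumps(state_machine, indent=2)
delays = retry_delays(state_machine["States"]["ValidateFile"]["Retry"][0])
print(delays)
```

The JSON string in definition is what would be supplied as the state machine definition, for example through infrastructure as code.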

Build

A flow: S3 upload to EventBridge to Step Functions to Lambda to a Glue job to an Athena query to a DynamoDB write, monitored by CloudWatch
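
As a starting point for the Lambda step in this flow, the sketch below reads the bucket and key from an S3 "Object Created" event as delivered through EventBridge and prepares arguments for a downstream Glue job. The argument name --source_path, the returned field names, and the sample event values are assumptions. The object key is passed through unchanged, so check whether your event source URL-encodes keys before relying on it.

```python
def handler(event, context):
    """Turn an EventBridge S3 'Object Created' event into Glue job arguments."""
    if event.get("source") != "aws.s3":
        raise ValueError("expected an event from aws.s3")
    if event.get("detail-type") != "Object Created":
        raise ValueError("expected an Object Created event")

    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]  # passed through unchanged

    return {
        "bucket": bucket,
        "key": key,
        "size_bytes": detail["object"].get("size", 0),
        # Hypothetical argument name for the Glue job that runs next.
        "glue_args": {"--source_path": f"s3://{bucket}/{key}"},
    }


# Trimmed sample event with placeholder values.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "example-raw-events"},
        "object": {"key": "landing/transactions/2024-01-15.csv", "size": 2048},
    },
}

result = handler(sample_event, None)
print(result["glue_args"])
```

Returning a plain dictionary keeps the function easy to unit test and lets Step Functions pass the result to the next state.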

Level 2: Advanced (weeks 13 to 20)

Databricks

Platform basics: the workspace, clusters, notebooks, DBFS, Unity Catalog volumes, GitLab repo sync
Delta Lake deep dive: ACID and optimistic concurrency, time travel, OPTIMIZE, Z order, VACUUM, MERGE INTO, change data feed, liquid clustering
Pipelines: Delta Live Tables with expectations and quarantine, Auto Loader, Databricks workflows with repair and retries
Analytics and governance: Databricks SQL warehouses, dashboards and alerts, Unity Catalog hierarchy, column masking, row filters, MLflow
FinTech use cases: fraud detection with streaming and Delta, credit decisioning with a feature store, GDPR deletion, regulatory reporting

Build

A Delta Live Tables pipeline ingesting raw transactions through Auto Loader, applying data quality expectations, writing a gold table, and exposing a Databricks SQL dashboard, tracked in GitLab

Advanced SQL for FinTech

Warehouse modeling: star and snowflake schemas, Data Vault 2.0, the one big table pattern
Slowly changing dimensions: types zero through three and MERGE INTO for type two
Optimization: execution plans, index tuning, partition pruning, ANALYZE TABLE, materialized views
Procedures and UDFs: stored procedures, functions, cursors, exception handling, SQL and Python UDFs
FinTech patterns: running balance, deduplication with ROW_NUMBER, cohort analysis, Basel III and IFRS9 aggregations
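
The type two slowly changing dimension listed above can be sketched locally before moving to MERGE INTO on Delta Lake. SQLite has no MERGE statement, so this example swaps in an UPDATE followed by an INSERT inside one transaction, which has the same effect: close the current row when a tracked attribute changes, then add a new current row. The table names dim_customer and stg_customer, the tracked column risk_rating, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER,
    risk_rating TEXT,
    valid_from TEXT,
    valid_to TEXT,
    is_current INTEGER
);
INSERT INTO dim_customer VALUES
    (1, 'LOW',  '2024-01-01', NULL, 1),
    (2, 'HIGH', '2024-01-01', NULL, 1);
CREATE TABLE stg_customer (customer_id INTEGER, risk_rating TEXT);
INSERT INTO stg_customer VALUES (1, 'MEDIUM'), (2, 'HIGH'), (3, 'LOW');
""")


def apply_scd2(conn, load_date):
    """Apply staged changes to the dimension as type two history."""
    with conn:  # both statements commit together or not at all
        # Step 1: close current rows whose tracked attribute changed.
        conn.execute(
            """
            UPDATE dim_customer
            SET valid_to = ?, is_current = 0
            WHERE is_current = 1
              AND EXISTS (
                  SELECT 1 FROM stg_customer AS s
                  WHERE s.customer_id = dim_customer.customer_id
                    AND s.risk_rating <> dim_customer.risk_rating
              )
            """,
            (load_date,),
        )
        # Step 2: insert a current row for every staged customer that now
        # has none, which covers both changed and brand-new customers.
        conn.execute(
            """
            INSERT INTO dim_customer
            SELECT s.customer_id, s.risk_rating, ?, NULL, 1
            FROM stg_customer AS s
            LEFT JOIN dim_customer AS d
              ON d.customer_id = s.customer_id AND d.is_current = 1
            WHERE d.customer_id IS NULL
            """,
            (load_date,),
        )


apply_scd2(conn, "2024-02-01")
history = conn.execute(
    "SELECT * FROM dim_customer ORDER BY customer_id, valid_from"
).fetchall()
for row in history:
    print(row)
```

Customer 1 ends up with a closed LOW row and a current MEDIUM row, customer 2 is untouched, and customer 3 is inserted as new. The comparison ignores NULL ratings, which a production version would need to handle.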

Level 3: Expert and production (weeks 21 to 36)

Expert integration architecture

The full pipeline pattern: EventBridge to Step Functions to Lambda to Glue and PySpark to S3 and Delta Lake to Athena and Databricks SQL
GitLab CI and CD: lint, unit test with pytest and moto, integration test, staging deploy through infrastructure as code, a manual production gate, semantic versioning
Monitoring: CloudWatch alarms to SNS to alerting, Databricks job webhooks, data freshness alerts, cost anomaly detection
Data quality: Glue data quality, Great Expectations, Delta Live Tables expectations with quarantine, Lake Formation column masking, Unity Catalog row level security
GDPR compliance: Delta VACUUM and purge, CloudTrail audit shipping, Databricks audit logs
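
The expectations with quarantine idea from the data quality item can be shown without any platform. This plain Python stand-in is not the Delta Live Tables or Great Expectations API. It only demonstrates the pattern: each record is checked against named rules, passing records continue, and failing records are set aside with the names of the rules they broke. The rule names and record fields are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Expectation = Callable[[Record], bool]

# Hypothetical rules for a transaction feed.
EXPECTATIONS: Dict[str, Expectation] = {
    "account_id_present": lambda r: bool(r.get("account_id")),
    "amount_is_positive": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] > 0
    ),
    "currency_is_three_letters": lambda r: (
        isinstance(r.get("currency"), str) and len(r["currency"]) == 3
    ),
}


def split_by_expectations(
    records: List[Record],
) -> Tuple[List[Record], List[Record]]:
    """Return (valid, quarantined); quarantined rows list their failed rules."""
    valid: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        failed = [
            name for name, check in EXPECTATIONS.items() if not check(record)
        ]
        if failed:
            # Keep the original fields so the row can be repaired and replayed.
            quarantined.append({**record, "_failed_expectations": failed})
        else:
            valid.append(record)
    return valid, quarantined


batch = [
    {"account_id": "A1", "amount": 25.0, "currency": "EUR"},
    {"account_id": "", "amount": 10.0, "currency": "EUR"},
    {"account_id": "A2", "amount": -5.0, "currency": "EURO"},
]
valid, quarantined = split_by_expectations(batch)
print(len(valid), len(quarantined))
```

Recording which rule failed, instead of silently dropping rows, is what makes the quarantine table useful for audits and for reprocessing.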

Expert hands on projects

Build

Fraud detection: PySpark structured streaming, Databricks ML, a DynamoDB lookup, CloudWatch alerts
Regulatory reporting: Glue and Athena for Basel III and IFRS9, Step Functions, GitLab CI and CD, a Databricks SQL dashboard
Customer 360: EventBridge, DynamoDB streams, Lambda enrichment, Glue change data capture, an Athena federated query
A full data lake: bronze, silver, and gold medallion on S3, Auto Loader, PySpark data quality, Unity Catalog, infrastructure as code

Certifications in order

AWS Certified Data Engineer Associate at the end of level one

Databricks Certified Data Engineer Associate at the end of level two

AWS Certified Solutions Architect Associate during level three

Databricks Data Engineer Professional as an optional expert step

HashiCorp Terraform Associate for infrastructure as code mastery

Resource master reference

Tools master list

Linux, bash, SQL, git, GitLab CI, PySpark, Delta Lake, S3, Lambda, CloudWatch, IAM, Glue, Athena, DynamoDB, Step Functions, EventBridge, Databricks, Unity Catalog, Auto Loader, Delta Live Tables, Great Expectations, Lake Formation, Terraform

Interview focus

Design a batch and a streaming pipeline for financial events

Explain slowly changing dimensions and implement type two with MERGE

Optimize a slow Spark job and read its explain plan

Design a medallion lakehouse with data quality gates

How do you handle GDPR deletion in a data lake?

Partitioning and bucketing trade offs in Spark and Athena