Data Engineering (FinTech)
From Linux and SQL to production data lakes on the cloud, with a FinTech focus. Duration: thirty six weeks across four levels. Target outcome: build and operate batch and streaming pipelines, lakehouses, and regulatory grade data systems.
Overview
This track restructures your FinTech data engineering roadmap and keeps every topic. It runs from foundations through core tools, advanced platforms, and expert integration, ending with certifications. The primary cloud is AWS and the primary lakehouse is Databricks with Delta Lake, with PySpark as the processing engine. Each level ends with a checkpoint project.
Level 0: Foundations (weeks 1 to 4)
Week 1: Linux and the terminal
- A bash script that processes a CSV, filters rows, and schedules itself with cron
Week 2: SQL foundations
- A query joining three tables with window functions to compute a running account balance
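As a rough illustration of the kind of query this checkpoint asks for, here is a minimal sketch that runs against an in-memory SQLite database from Python. The three tables (customers, accounts, transactions), their columns, and the sample rows are all invented for this example and are not part of the track material. It assumes the SQLite bundled with your Python is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

# Hypothetical schema and sample rows, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    account_id INTEGER,
    txn_date TEXT,
    amount_cents INTEGER  -- signed: deposits positive, withdrawals negative
);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO accounts VALUES (10, 1), (20, 2);
INSERT INTO transactions VALUES
    (1, 10, '2024-01-01', 10000),
    (2, 10, '2024-01-03', -2500),
    (3, 20, '2024-01-02', 5000),
    (4, 10, '2024-01-05', 4000);
""")

# Join the three tables, then sum amounts per account in date order.
# txn_id is a tie-breaker so same-day transactions get a stable order.
RUNNING_BALANCE_SQL = """
SELECT c.name,
       a.account_id,
       t.txn_date,
       t.amount_cents,
       SUM(t.amount_cents) OVER (
           PARTITION BY a.account_id
           ORDER BY t.txn_date, t.txn_id
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_balance_cents
FROM transactions AS t
JOIN accounts AS a ON a.account_id = t.account_id
JOIN customers AS c ON c.customer_id = a.customer_id
ORDER BY a.account_id, t.txn_date, t.txn_id
"""

rows = conn.execute(RUNNING_BALANCE_SQL).fetchall()
for row in rows:
    print(row)
```

Amounts are stored as integer cents here to avoid floating point rounding in balances. That is a design choice of this sketch, not a requirement of the exercise.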
Weeks 3 to 4: Git and GitLab
- Push a Python script with a GitLab pipeline that lints, tests, and deploys to staging
Level 1: Core tools (weeks 5 to 12)
Weeks 5 to 12: PySpark
- A structured streaming pipeline reading financial events, transforming, and writing a Delta table on object storage
Weeks 5 to 7: AWS tier one, storage and serverless
Weeks 8 to 10: AWS tier two, ETL and querying
Weeks 11 to 12: AWS tier three, orchestration and events
- A flow: S3 upload to EventBridge to Step Functions to Lambda to a Glue job to an Athena query to a DynamoDB write, monitored by CloudWatch
Level 2: Advanced (weeks 13 to 20)
Databricks
- A Delta Live Tables pipeline ingesting raw transactions through Auto Loader, applying data quality expectations, writing a gold table, and exposing a Databricks SQL dashboard, tracked in GitLab
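To make the idea of data quality expectations concrete, here is a minimal sketch in plain Python. It does not use the Delta Live Tables API. It only mimics the underlying pattern of naming a rule, testing each record against it, and routing failures to a quarantine set instead of the clean table. The Expectation class, the rule names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    """A named rule that a record must satisfy (hypothetical helper)."""
    name: str
    predicate: Callable[[dict], bool]


# Illustrative rules only; real rules depend on the actual transaction schema.
EXPECTATIONS = [
    Expectation(
        "amount_is_positive",
        lambda r: isinstance(r.get("amount_cents"), int) and r["amount_cents"] > 0,
    ),
    Expectation(
        "currency_is_iso_code",
        lambda r: isinstance(r.get("currency"), str) and len(r["currency"]) == 3,
    ),
    Expectation("account_id_present", lambda r: bool(r.get("account_id"))),
]


def apply_expectations(records, expectations):
    """Split records into those passing every rule and those failing any."""
    passed, quarantined = [], []
    for record in records:
        failures = [e.name for e in expectations if not e.predicate(record)]
        if failures:
            # Keep the failed rule names so the rejection can be audited later.
            quarantined.append({**record, "failed_expectations": failures})
        else:
            passed.append(record)
    return passed, quarantined


# Invented sample of raw (bronze) transactions.
raw = [
    {"txn_id": "t1", "account_id": "a1", "amount_cents": 1500, "currency": "EUR"},
    {"txn_id": "t2", "account_id": "", "amount_cents": 900, "currency": "EUR"},
    {"txn_id": "t3", "account_id": "a2", "amount_cents": -10, "currency": "EURO"},
]

silver, quarantine = apply_expectations(raw, EXPECTATIONS)
print("passed:", [r["txn_id"] for r in silver])
print("quarantined:", [(r["txn_id"], r["failed_expectations"]) for r in quarantine])
```

Recording which rules failed, rather than silently dropping rows, tends to matter in regulated settings where rejected records must be explainable.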
Advanced SQL for FinTech
Level 3: Expert and production (weeks 21 to 36)
Expert integration architecture
Expert hands on projects
- Fraud detection: PySpark structured streaming, Databricks ML, a DynamoDB lookup, CloudWatch alerts
- Regulatory reporting: Glue and Athena for Basel III and IFRS 9, Step Functions, GitLab CI and CD, a Databricks SQL dashboard
- Customer 360: EventBridge, DynamoDB streams, Lambda enrichment, Glue change data capture, an Athena federated query
- A full data lake: bronze, silver, and gold medallion on S3, Auto Loader, PySpark data quality, Unity Catalog, infrastructure as code
Certifications in order
AWS Certified Data Engineer Associate at the end of level one
Databricks Certified Data Engineer Associate at the end of level two
AWS Certified Solutions Architect Associate during level three
Databricks Data Engineer Professional as an optional expert step
HashiCorp Terraform Associate for infrastructure as code mastery
Resource master reference
Tools master list
Linux, bash, SQL, git, GitLab CI, PySpark, Delta Lake, S3, Lambda, CloudWatch, IAM, Glue, Athena, DynamoDB, Step Functions, EventBridge, Databricks, Unity Catalog, Auto Loader, Delta Live Tables, Great Expectations, Lake Formation, Terraform
Interview focus
Design a batch and a streaming pipeline for financial events
Explain slowly changing dimensions and implement type two with MERGE
Optimize a slow Spark job and read its explain plan
Design a medallion lakehouse with data quality gates
Handle GDPR deletion requests in a data lake
Partitioning and bucketing trade offs in Spark and Athena
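For the slowly changing dimension question above, it can help to see the type two logic without any engine syntax. The sketch below is plain Python, not a MERGE statement. It shows the three cases a type two MERGE typically has to cover: an unchanged row is left alone, a changed row is closed and a new current version is added, and an unseen key is inserted. The column names (customer_id, address, valid_from, valid_to, is_current) and the sample data are assumptions made for this example.

```python
from datetime import date


def apply_scd2(dimension, updates, effective):
    """Return a new dimension list with type two history applied.

    dimension: rows with customer_id, address, valid_from, valid_to, is_current
    updates:   rows with customer_id and address (the latest source snapshot)
    effective: date on which the changes take effect
    """
    result = [dict(row) for row in dimension]  # copy so the input is untouched
    current = {row["customer_id"]: row for row in result if row["is_current"]}

    for update in updates:
        key = update["customer_id"]
        existing = current.get(key)

        # Case 1: tracked attribute unchanged, keep the current row as is.
        if existing is not None and existing["address"] == update["address"]:
            continue

        # Case 2: attribute changed, close the old version first.
        if existing is not None:
            existing["valid_to"] = effective
            existing["is_current"] = False

        # Cases 2 and 3: add a new current version (changed or brand new key).
        new_row = {
            "customer_id": key,
            "address": update["address"],
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
        result.append(new_row)
        current[key] = new_row

    return result


# Invented sample data.
dimension = [
    {"customer_id": 1, "address": "Old Street 1", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
    {"customer_id": 2, "address": "Main Road 5", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]
updates = [
    {"customer_id": 1, "address": "New Avenue 9"},    # changed
    {"customer_id": 2, "address": "Main Road 5"},     # unchanged
    {"customer_id": 3, "address": "Harbour Lane 2"},  # new
]

result = apply_scd2(dimension, updates, effective=date(2024, 6, 1))
for row in result:
    print(row)
```

In an interview answer, the same three cases would map onto the matched and not matched branches of a MERGE. The Python version is only meant to make the branching explicit.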