Service
Python and SQL
Excel is great until it is not. With 10 million rows, seven source systems, or API-driven pipelines, it is time to go a layer deeper. I build Python ETL and SQL stages that do exactly what they are supposed to, with logging, tests, versioning, without turning a concrete need into a DWH project.
The problem
What I usually find: a Python script written by a former intern. 600 lines, no tests, no logging, no type hints. Pandas operations with copy() warnings, no error handling. It has been running for three years and nobody dares touch it. Alongside a SQL file with 40 subqueries that worked exactly once, when the person wrote it.
Ad-hoc analyses whose results drift by 3% on re-run. Nobody knows if that is a bug or legitimate data drift. SELECT statements without index strategy that create locks on the production database.
Python and SQL are not the problem. Python and SQL without engineering are.
The solution
For Python: typed functions with type hints, clear separation into extract.py / transform.py / load.py, structured logging (loguru or standard logging), pytest tests for critical transformations, pyproject.toml with pinned dependencies, optional containerization. For larger pipelines Airflow, Dagster, or lean custom schedulers depending on setup.
For SQL: analysis logic as views or stored procedures instead of scattered ad-hoc snippets. Index strategy based on execution plans, not gut. CTEs for readability, window functions for ranked analyses, correct transactional boundaries. Where DWH-near demand exists, a lean stage between sources and reporting (PostgreSQL or SQL Server) as source of truth.
Deliverables: code in your Git, README, test suite, sample data for local development, deployment guide for your target environment (cron, Windows Task, container, Airflow DAG). Never 'runs somewhere'.
Typical use cases
- Python ETL: consolidate 5-20 source systems (CSV, APIs, databases), validate, load to SQL stage or data warehouse
- SQL stage between operational systems and reporting: PostgreSQL/SQL Server with star schema, materialized views, snapshot logic
- Migrating an unmaintained Python pipeline to testable, typed code with logging and CI integration
- API-driven data pulls (Salesforce, HubSpot, Microsoft Graph) with pagination, rate-limit handling, resumability on abort
- Performance tuning of existing SQL queries via execution-plan analysis, index strategy, partitioning
- Small tool for an internal team: CLI or lean FastAPI backend with defined endpoints, no frontend overhead
- Data quality monitoring: daily plausibility checks (ranges, aggregates, distributions) with Slack or email alerts
Concrete benefits
- Testability: critical transformations covered by tests, regressions caught before deployment
- Maintainability: type hints, clear module boundaries, linting. A second developer takes over in 2-4 days instead of 6 weeks
- Performance: SQL queries often 10-100x faster after index and execution-plan optimization
- Scalability: what starts as a small cron job can grow to Airflow or a proper DWH without complete rebuild
- Versionability: code in Git, schema migrations as code (Alembic or Flyway), reproducible deployments
- Security: credentials in vault or env vars, no .env files on desktops, no SQL strings with f-strings
How we work together
Clarify data flow and requirements
Sources, targets, frequency, data volume, SLA, data-protection constraints. Output is an architecture sketch with three options.
Architecture decision
Is a Python script with cron enough, or do we need Airflow/Dagster? SQL stage or direct to DWH? Self-hosted or cloud? Trade-offs in writing.
MVP pipeline with tests and logging
First runnable version with the 1-2 most important sources, tests on the critical transformations, logging schema defined.
Incremental expansion
More sources, validation layer, monitoring, alerts. Each step deployed, tested, documented.
Handover or retainer
README, architecture docs, test setup, deployment guide. Optional retainer for extensions and maintenance.
Frequently asked questions
Does Python replace our Power Query pipelines?
Not necessarily. Power Query is the better choice for most Excel/Power-BI-centric pipelines because it stays in the stack and needs no extra infrastructure. Python pays off with large data volumes, API complexity, ML/NLP needs, or pipelines outside the Microsoft stack. Decision in the discovery workshop.
Where does the code run? Cloud, on-prem, or local?
Depends on your data. Sensitive data (customer PII, HR, tax): on-prem or in a private cloud, never hobby serverless. Generic data without personal information: any clean environment works. I follow your IT policy, not hype.
Can you work directly on our production databases?
Read-only yes, with read replica or a dedicated reporting user. Write only on stage databases designed for it, never directly in operational tables without an explicit approval procedure. Locks on production tables are an avoidable catastrophe.
Which Python libraries do you use?
Standard stack: pandas and polars for data manipulation, SQLAlchemy for database abstraction, httpx or requests for APIs, pydantic for schema validation, pytest for tests, loguru or standard logging. For larger pipelines Airflow or Dagster. No library added without clear reason.
We have a SQL query that runs 8 minutes. Can you make it faster?
Almost always. Typical improvement 10-100x through index strategy, subquery refactor, window functions instead of self-joins. If the query is fundamentally hard (large aggregation over billions of rows), I suggest materialized view or partitioning.
What does a typical Python/SQL project cost?
MVP pipeline (2-3 sources, one SQL stage, logging and tests): 6-12 working days, around €8000-18000 net at €140-180/h. More complex DWH-near work proportionally longer and quotable after the discovery workshop.
Related services
Audit your Python or SQL pipeline?
Name sources, data volumes, frequency, and critical pain points. Within two working days you get a read with three options and realistic effort.