Service

Power Query and data preparation

Power Query is the most honest ETL layer that ships with every Office license. I build pipelines that do not break query folding, route parameters cleanly through the model, and fail with clear errors instead of silently producing data garbage.

The problem

What I often find: 30 steps per query, 14 of them named 'Changed Type1' through 'Changed Type14'. Each step breaks query folding and pulls the full table into local memory. Refresh takes 18 minutes even though the data source is a SQL database that could answer in 200 milliseconds.

Sources hardcoded to C:\Users\Maria\Desktop\Export.csv. If Maria is sick, nothing runs. Plausibility checks are absent: if the export has stale data or the 'Revenue' column is suddenly text instead of numbers, nobody notices before send.

Power Query can resolve these issues layer by layer. But only if you actually know what M does and where query folding still or no longer applies.

The solution

I build Power Query pipelines with the goal that every source is processed server-side as far as possible: filters, joins, aggregations placed so query folding propagates back to the source. For large SQL sources this often reduces refresh time by 10-30x.

Paths and filters move into a parameter table, not hardcoded into M. Custom functions for recurring cleansing steps (date parsing, trim/clean, column normalization) live in a library query. Schema validation via Try/Otherwise with clear error triggers.

Deliverables: documented M queries with clear step names, parameter table, validation layer, optional versioning of queries as .pq files in Git. Plus a README with connector requirements, OAuth setup if relevant, and maintenance conventions.

Typical use cases

  • Consolidating 5-20 CSV/XLSX from a folder with schema validation and clear error logs on structural mismatches
  • Connecting to SQL Server, PostgreSQL, MySQL, Oracle with query folding back to source, instead of naive full-table imports
  • API connections to REST services (Salesforce, Microsoft Graph, custom APIs) with pagination, throttling, OAuth handling
  • Migrating a copy-paste ETL trail from VBA to clean Power Query steps with clear performance improvement
  • Multi-tenant consolidation with subsidiary-specific elimination logic and auditable transformation log
  • Building a Power Query function library (date helpers, column normalization, error handlers) for team-wide reuse

Concrete benefits

  • Refresh time often 5-30x faster because query folding works again
  • Reproducibility: every transformation is M code, versionable, code-reviewable
  • Robustness: structural source changes fail with clear errors, not silently
  • Maintainability: a second analyst takes a thought-through PQ pipeline over in 1-2 days
  • Power BI scalability: the same queries run in Power BI Desktop and Service without rebuild

How we work together

  1. Audit existing queries (1-2 days)

    Which steps break query folding? Where are inefficient joins? Which sources are hardcoded? Output is a prioritized refactor plan with expected performance gains.

  2. Refactor in 1-week increments

    Pipeline by pipeline: reorder steps, extract custom functions, set up parameter table. Each step verified against the old version.

  3. Add validation layer

    Schema checks, range validation, consistency tests before handoff to Excel/Power BI. On errors, pipeline fails with a meaningful message instead of producing garbage.

  4. Handover with docs and walkthrough

    README with connector requirements, maintenance conventions, extension patterns. 60-90 min live walkthrough with recording.

Frequently asked questions

Power Query vs SQL vs Python: when which?

Power Query when the source is Excel/Power BI compatible and the data fits the model (usually <10M rows). SQL when multiple reports share transformed data or you have a DWH. Python when you need libraries (statistics, NLP, ML) or the pipeline must run outside the Microsoft stack. Decision in the audit.

What if columns change in the source?

With a correctly set-up pipeline, refresh fails with a clear error: 'Column XY not found, expected type Number.' Instead of silent garbage production. Schema validation as the first pipeline step is mandatory, not optional.

We have queries with 18-minute refresh. Can you make it faster?

Almost always. 18-minute PQ queries are usually query-folding problems or inefficient joins. Typical improvement to 30-90 seconds. If the data volume is fundamentally too large for Excel, I will say so and propose SQL stage or Power BI.

Does this work with our Power BI license or do we need more?

Power Query is included in Excel 365 and Power BI Free. You need Power BI Pro only to share reports in the service. For pure Excel-based pipelines: no extra cost.

Can we extend the pipelines ourselves?

Yes. Clear step names, inline-documented custom functions, parameter table, README. A team member with basic Power Query skills extends in 1-2 days. For deeper M-language questions (complex list functions, dynamic columns) a retainer makes sense.

Related services

Audit your Power Query pipelines?

Name sources, refresh time, and critical pain points. Within two working days you get a read: where query folding breaks, what is realistically speedable, in what sequence I would proceed.