SQL that doesn’t break: 5 scalable patterns I use in pipelines
SQL is invisible infrastructure. If you write it well, no one notices. If you don’t, everybody gets mail.
For data analysis, SQL is often the first thing we learn and the last thing we master (at least, that was the case for me). I’ve worked with large and small databases across production systems, ad hoc dashboards, and machine learning pipelines, and one pattern keeps repeating:
Your SQL queries should scale effectively. They should be fast and communicate intent.
These days, as a data/business analyst, memorizing syntax and joins is secondary.
AI writes better syntax than most analysts and, with the right prompt, better queries.
Writing SQL that can survive growing datasets, onboarding new teammates, and long-term maintenance is an art.
I don’t use SQL daily, but over time, I’ve learned a few patterns that make my queries more productive, readable, testable, and scalable.
Here are 5 I keep coming back to, and the common traps they help me avoid.
1. Write joins that can handle schema changes
Subqueries are fine… until someone changes a field name and buries the error deep in the code. Clean SQL joins keep logic visible and auditable.
What I use:
SELECT t.transaction_id, t.user_id, u.signup_channel
FROM transactions t
LEFT JOIN users u ON t.user_id = u.user_id
WHERE t.created_at >= CURRENT_DATE - INTERVAL '30 days';
This tells a clear story: pull the last 30 days of transactions along with user metadata. Easy to read, easy to debug.
What I avoid:
SELECT transaction_id, user_id
FROM transactions
WHERE user_id IN (SELECT user_id FROM users WHERE signup_channel = 'email');
This will run, but it hides assumptions. Does the users table have duplicates? What happens when a column name changes?
Scale tip: joins encode relationships. Make them explicit.
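One assumption the join version makes is that users has exactly one row per user_id. If it might not, a guard I sometimes add is deduplicating in a CTE before joining. A minimal sketch, reusing the tables above and assuming a hypothetical updated_at column to pick the latest row (Postgres-style DISTINCT ON):
-- Keep one row per user before joining, so duplicates can't fan out transactions.
-- users.updated_at is a hypothetical column used to pick the most recent row.
WITH dedup_users AS (
    SELECT DISTINCT ON (user_id) user_id, signup_channel
    FROM users
    ORDER BY user_id, updated_at DESC
)
SELECT t.transaction_id, t.user_id, u.signup_channel
FROM transactions t
LEFT JOIN dedup_users u ON t.user_id = u.user_id
WHERE t.created_at >= CURRENT_DATE - INTERVAL '30 days';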
2. Filter early, not later
One of my favorite SQL tips is moving filters before joins and aggregations.
I’ve seen many queries that operate on the whole table or multiple tables. If your filters only apply after multiple joins or calculations, you’re pushing more data through the query than necessary, and the cost is high.
What I use:
WITH filtered_events AS (
    SELECT user_id, event_time, product_id
    FROM events
    WHERE event_name = 'view_product'
      AND event_time >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT product_id, COUNT(DISTINCT user_id) AS viewers
FROM filtered_events
GROUP BY product_id;
I filter with a CTE before aggregating. The CTE trims the dataset early, which means less memory, less compute, and less room for error.
What I avoid:
SELECT product_id, COUNT(DISTINCT user_id)
FROM events
WHERE event_name = 'view_product'
AND event_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY product_id;
Tip: Don’t throw the entire table at your SQL logic; it will run, but the cost often outweighs the benefit. Pre-filtering improves speed, especially when hundreds of rows scale into millions.
3. Have some boundaries with window functions (WFs)
Window functions are among the most powerful SQL features. They power analytics, rankings, rolling totals, and more. But they can be dangerous on unbounded datasets. Don’t apply them to your entire dataset by default; filter first, typically with a CTE.
What I use:
WITH day_zero_users AS (
    SELECT user_id, MIN(event_date) AS signup_date
    FROM events
    WHERE event_name = 'signup'
    GROUP BY user_id
)
SELECT e.user_id, e.event_date, d.signup_date,
       (e.event_date - d.signup_date) AS days_since_signup,
       COUNT(*) OVER (PARTITION BY d.signup_date ORDER BY e.event_date) AS daily_active
FROM events e
JOIN day_zero_users d ON e.user_id = d.user_id
WHERE e.event_name = 'app_open' AND e.event_date <= d.signup_date + INTERVAL '7 days';
Again, a CTE does the filtering. It limits the rows before the window function kicks in, so you’re not computing windows over the entire table. Filter and partition first; window functions behave better on bounded data.
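You can also put a boundary on the window itself with an explicit frame clause. A minimal sketch of a rolling 7-day count, assuming a hypothetical pre-aggregated daily_signups table with signup_date and signups columns:
-- Rolling 7-day signup total with an explicit frame,
-- so the window never silently grows beyond what you intend.
SELECT signup_date,
       SUM(signups) OVER (
           ORDER BY signup_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS signups_7d
FROM daily_signups
WHERE signup_date >= CURRENT_DATE - INTERVAL '90 days';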
4. Be specific with SELECT columns. Avoid SELECT *
I have a separate article on why this is generally bad practice.
Why you should not use SELECT * FROM TABLE in SQL
SELECT * is fine when exploring. But, in shared code or production, it can be dangerous. It pulls unnecessary columns, bloats pipelines, slows joins, and breaks when schema changes.
What I use: a snapshot of marketing performance:
SELECT campaign_id, channel, spend, conversions
FROM marketing_snapshots
WHERE snapshot_date = CURRENT_DATE - INTERVAL '1 day';
Readable, clear, and able to handle a schema change.
What I avoid:
SELECT * FROM marketing_snapshots;
This pulls every column, including archived fields, debug flags, and PII.
Scaling tip: SELECT * is fine when you’re exploring or working alone. Explicit selects can protect performance, privacy, and code stability.
5. Batch inserts and updates
Always. Whether it’s inserts, updates, or reports, performing row-by-row operations is not advisable.
Databases are designed for set operations, and you should let them do that. If you’re loading data or updating values, use batch processing or automate it with a merge/upsert pattern.
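For the merge/upsert side of that, here is a minimal sketch using Postgres-style INSERT ... ON CONFLICT (the marketing_spend and daily_spend_staging tables are hypothetical; other engines express the same idea with MERGE):
-- Upsert a whole daily batch in one statement instead of row-by-row checks.
-- Assumes a unique constraint on (campaign_id, spend_date).
INSERT INTO marketing_spend (campaign_id, spend_date, spend)
SELECT campaign_id, spend_date, spend
FROM daily_spend_staging
ON CONFLICT (campaign_id, spend_date)
DO UPDATE SET spend = EXCLUDED.spend;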
Example: a monthly plan price adjustment:
UPDATE subscriptions
SET price = price * 1.1
WHERE plan_type = 'legacy' AND last_updated < CURRENT_DATE - INTERVAL '12 months';
Filtering on plan_type and last_updated updates thousands of rows in one statement. It’s readable. And reversible, if needed.
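One way I keep a bulk update like that reversible is to run it inside a transaction and sanity-check the affected row count before committing. A minimal sketch, assuming a Postgres-style session:
-- Wrap the bulk update so it can be rolled back if the row count looks wrong.
BEGIN;

UPDATE subscriptions
SET price = price * 1.1
WHERE plan_type = 'legacy'
  AND last_updated < CURRENT_DATE - INTERVAL '12 months';

-- Check the reported row count (and spot-check a few rows), then:
COMMIT;   -- or ROLLBACK; if something looks off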
- Don’t do this:
-- Loop through each row in app code or stored proc
FOR EACH user_id IN (SELECT id FROM subscriptions WHERE plan_type = 'legacy')
UPDATE subscriptions SET price = price * 1.1 WHERE id = user_id;
A simpler and more common example is this:
INSERT INTO employees (name, department_id)
VALUES
('Alice', 1),
('Bob', 2),
('Charlie', 3);
Batching the rows into a single INSERT reduces round-trips to the database and minimizes lock contention.
Avoid this:
INSERT INTO employees (name, department_id) VALUES ('Alice', 1);
INSERT INTO employees (name, department_id) VALUES ('Bob', 2);
Multiple calls = slower scripts, more I/O.
Tip: Let the database do the heavy lifting. Batching reduces database load and keeps pipelines fast under pressure.
Conclusion
SQL is invisible infrastructure. If you write it well, no one notices. If you don’t, everybody gets mail.
In fast-moving teams with dashboards and models that trigger emails, your SQL should be functional, fast, readable, and resilient.
Prioritize:
Readability over cleverness
Stability over brevity
Intent over syntax
These patterns don’t necessarily come from books, and you won’t learn them all at once. Broken dashboards, performance debugging, Stack Overflow, broken pipelines, and lots of googling are how you get to scalable queries.
So, if you’re writing SQL for more than yourself, make it scalable. Write SQL like people and systems depend on it.
Enjoyed this? Share it with your team, or keep it in your back pocket for your next query review.
I write about scalable analytics, production-ready SQL, and the invisible work of good data teams.
Be data-informed, data-driven, but not data-obsessed — Amy
🔗 Connect with me on LinkedIn and GitHub for more data analytics insights.
#dataanalysis #dataScience #SQLPipelines #MicrosoftExcel #SQL
Data analyst ⬩ Spreadsheet advocate ⬩ Freelancer ⬩ Turning data into insights