Published on
17/02/2020

SQL Classical Mistakes and How Not to Shoot Yourself in the Foot

There are many excellent resources that teach different flavours of SQL. I assume you have some basic knowledge of the syntax, we shall focus on what can go wrong while writing SQL. We shall cover grouping aggregations, having clause, joins, NULLs, Subqueries, and common table expressions. In the upcoming blog, we shall conclude the SQL series with what can go wrong with window aggregations, views, Indexes.

1. Grouping aggregations

The most basic grouping aggregation function is the GROUP BY. Its used to divide the rows returned from the SELECT into groups and eliminates duplicate values from the results set, similar to how DISTINCT. For each group, one can apply an aggregate functions like COUNT, SUM, MIN, MAX, etc.

Group by syntax is as follows:

SELECT column_name1, aggregate_function(column_name)
FROM table_name
GROUP BY column_name1;
  • Mistake 1: Not putting every column in the SELECT statement in either the GROUP BY clause or aggregate_function.

The rule of thumb is that every column in the SELECT statement should be in the GROUP BY clause or an aggregate function. If not, you will get an error. for example, this is not a valid SQL query:

SELECT column_name1, column_name2, aggregate_function(column_name3)
FROM table_name
GROUP BY column_name1; --- column_name2 is not either in the GROUP BY or with the aggregate function
  • Mistake 2: Using aggregate functions in the GROUP BY clause.

The GROUP BY clause should only contain the column names, not the aggregate functions. For example, the following query is invalid:

SELECT aggregate_function(column_name2)
FROM table_name
GROUP BY aggregate_function(column_name2);

The solution to this is using the HAVING clause, which we shall discuss in the next section.

2. HAVING Clause

We use the HAVING clause to search/filter a group or aggregate after the GROUP BY or aggregate functions has been performed. Remember the WHERE clause cannot be used with aggregate functions. The syntax for HAVING is as follows:

SELECT column_name1, aggregate_function(column_name2)
FROM table_name
GROUP BY column_name1
HAVING aggregate_function(column_name2) condition;

or

SELECT column_name1, aggregate_function(column_name2)
FROM table_name
GROUP BY column_name1
HAVING column_name2 condition;
  • Mistake 1: - mixing up the WHERE and HAVING clauses(I have made this mistake a number of times).

The key difference is that, the Having clause works for groups resulting from the GROUP BY clause or aggregate functions, while the WHERE clause works for individual rows before the GROUP BY clause.

  • Mistake 2: - Using aliases in the HAVING clause.

You can not use aliases in the HAVING clause, as HAVING clause is evaluated before the SELECT clause. Here is the SQL oreder of execution if you are not familiar with it, its starts with the FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT/OFFSET. My undergrad data warehousing lecturer used to call it FROGS WITH GREEN HATS SECRETIVELY DANCE OUTSIDE LOVELY (Though the sentence was meaningless it was useful for me).

3. NULLs

NULLs are probably the most tricky part of SQL, when a column has NULLs, treat it with caution and seriousness it deserves. Even experienced SQL developers can overlook this. Most aggregate functions will ignore NULL values completely, though COUNT(*) is a notable exception as it counts every row regardless of NULL values. If you try to evaluate an aggregate function on only NULL values, the result will be NULL. Performing any mathematical operation (addition, subtraction, multiplication, or division) between NULL and non NULL values will always result in NULL and we can not use typical comparison operators (=, !=, etc ) with NULL values.

  • Mistake 1 using "=" instead of IS NULL or IS NOT NULL operators.

When checking for NULL values, you should use the IS NULL or IS NOT NULL operators. The following query is invalid:

SELECT column_name1
FROM table_name
WHERE column_name1 = NULL;
  • Mistake 2: Not checking for NULLs before division and Not thinking of the consequences.

This is typically common with sales and marketing data; for example, one has to divide the discount by the total sales to get the percentage discount. If one doesn't check for NULLs, you will get NULL. A common strategy is to use the COALESCE function to replace NULLs with 0 to indicate no discount.

SELECT product_id,
COALESCE(AVG(discount/NULLIF(price_colume, 0)::float), 0) as avg_discount_rate
FROM sales_table
GROUP BY product_id;

4. Subqueries and Common Table Expressions(CTEs)

We use Subqueries to combine many queries into one or perform analysis that can not be achieved using a single query. The inner query can either return a scalar value or a results table. For scaler values, one can use them in a SELECT or WHERE clause.

  • Mistake 1: Using the "=" with WHERE Clause when the inner query returns multiple results.

That will be so ambiguous to evaluate. To succeed, one should use the right quantifiers such as IN, EXISTS, NOT IN, NOT EXISTS, and ANY

For example, the following query is invalid:

SELECT column_name1
FROM table_name
WHERE column_name1 = (SELECT
column_name1 FROM table_name); --- Invalid if the inner query returns multiple results

The solution is to use the appropriate quantifiers for example the IN operator as follows:

SELECT column_name1
FROM table_name
WHERE column_name1 IN (SELECT column_name1 FROM table_name);

Common Table Expressions (CTEs) are temporary result sets that are defined within the execution of a single SQL statement. This means that a CTE can only live within one statement; for example, the following query is invalid. The CTE no longer exists after the first statement execution. However, you can use the CTE multiple times with a single SQL statement.

WITH cte AS (
SELECT column_name1
FROM table_name
)

SELECT column_name1
FROM cte; --- This is okay and works

SELECT column_name1
FROM cte; -- Invalid as the CTE no longer exists. Once a cte has been used its gone!

They are similar to a derived table in that they are not stored as an object and are only available for the duration of a single query. They are also similar to a subquery, except that they are separate from the main query and defined by the WITH clause at the beginning of a query.

  • Mistake 2: Not clearly understanding the difference between temporal tables, CTEs, and Views.

CTEs, views, and temporary tables have distinct lifespans. A CTE is available only for the duration of a single query and must be used immediately after its creation. Once defined, you must utilize the CTE within that statement, as it won't persist for subsequent queries. In contrast, a view is a permanent virtual table stored within the database, you can access it across multiple sessions until explicitly dropped(view expensive for you storage!). A temporary table is a middle ground between the two - it exists only during a database session. It is created and used within a single database session and is automatically dropped when that session ends, making it ideal for scenarios where you must reference the same results set multiple times across different queries within your working session.

  • Mistake 3: Brackets are a must when using CTEs.

    This is how you define a CTE:
WITH cte AS (
    SELECT column_name1
    FROM table_name
)
SELECT column_name1
FROM cte;

The only brackets needed are those enclosing the CTE's definition after the AS keyword. You don't need additional brackets between the CTE and the main query, as shown below:

WITH name AS (subquery)
main_query

If there is a need to define multiple CTEs, you separate them with commas as shown below:

WITH cte1 AS (
    SELECT column_name1
    FROM table_name1
),
cte2 AS (
    SELECT column_name2
    FROM table_name2
)
SELECT *
FROM cte1
JOIN cte2 ON cte1.column_name1 = cte2.column_name2;

Those are some of the common mistakes usually made while writing or executing SQL queries that are some times hard to debug and can lead to wrong results when dealing with NULLs, grouping aggregations, HAVING clause, subqueries, and common table expressions.

Some other classic mistakes worth mentioning are:

  • Classic Mistake 1: Mixing and misussing UNION and UNION ALL operators.

This is a classic mistake that is always asked in interviews. The UNION operator combines the result-set of two or more SELECT statements and return distinct results, while the UNION ALL combines the result-set of two or more SELECT statements, and includes all the data with duplicate values.

  • Classic Mistake 2: Not paying attention to timezones when dealing with dates.

    This is also an interview classic. When unsure of the timezone, do not truncate. Without explicitly setting the time zone, you might get different results, depending on the application back-end. For example, the following query is guaranteed to give different results:
SELECT created_at::date, COUNT(*)
FROM table_name
GROUP BY created_at::date;

The solution is to always set the time zone explicitly as follows:

SELECT (created_at AT TIME ZONE 'Africa/Kampala')::date AS new_date, COUNT(*) ---- AT TIME ZONE  is the keyword
FROM table_name
GROUP BY new_date;
  • Classic Mistake 3: Not understanding the BETWEEN inclusive nature.

    The BETWEEN operator is inclusive, meaning it includes the start and end values. For example, the following query would include the start and end values, which is not what you might want. This is usually common when dealing with dates.
SELECT column_name1
FROM table_name
WHERE column_name1 BETWEEN '2020-01-01' AND '2021-01-01'; ---- This will include 2021-01-01 data
  • Classic Mistake 4: Double Quotes vs Single Quotes

    In SQL, single quotes (' ') are used to denote string literals, while double quotes (" ")are used to denote identifiers such as column names or table names. For example, the following query is invalid:
SELECT "column_name1"
FROM table_name
WHERE "column_name1" = 'some_value'; --- Invalid if column_name1 is not the actual column name
  • Classic Mistake 5: Not understanding the different types of JOINs and when to use them.

    There are different types of JOINs: INNER JOIN, LEFT JOIN (or LEFT OUTER JOIN), RIGHT JOIN (or RIGHT OUTER JOIN), FULL JOIN (or FULL OUTER JOIN), CROSS JOIN, and SELF JOIN. Each type of join serves a different purpose and is used in different scenarios. Not understanding the differences can lead to incorrect results or inefficient queries.

For comments, please send me an email.