PostgreSQL

Complete PostgreSQL cheat sheet covering SQL fundamentals, joins, indexes, transactions, window functions, JSON support, and query optimization.

12 sections25 cards

How PostgreSQL thinks

PostgreSQL is a relational database — data lives in tables with rows and columns. Relationships between tables are expressed via foreign keys. Unlike MongoDB, schema is enforced at the database level.

Key concepts: Database → Schema (namespace, default is public) → Tables → Rows.

PostgreSQL is ACID-compliant — Atomicity, Consistency, Isolation, Durability. Every transaction either fully succeeds or fully rolls back. No partial writes.

Strong typing, constraints, foreign keys, triggers, full-text search, JSON support, and window functions make it one of the most capable open-source databases available.

Common types

Numeric: integer (4 bytes), bigint (8 bytes), numeric(p,s) (exact decimal, for money), real / double precision (floating point).

String: varchar(n) (variable, max n), text (unlimited — prefer this over varchar in Postgres), char(n) (fixed length, padded).

Boolean: boolean — true/false/null.

Date/Time: date, time, timestamp, timestamptz (with timezone — always prefer this), interval.

Other: uuid, json, jsonb (binary JSON, indexable — prefer over json), array, enum, serial / bigserial (auto-increment).

Use timestamptz over timestamp — stores in UTC, displays in session timezone. Avoids timezone bugs.

Serial vs Identity

serial — shorthand that creates a sequence and sets default. Legacy but still common.

id serial PRIMARY KEY — equivalent to creating a sequence manually.

GENERATED ALWAYS AS IDENTITY — modern SQL standard approach. Recommended over serial.

id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY

GENERATED BY DEFAULT AS IDENTITY — allows manual override of the value (useful for data migration).

Tables & constraints

CREATE TABLE users ( id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, email TEXT NOT NULL UNIQUE, name TEXT NOT NULL, created_at TIMESTAMPTZ DEFAULT now() )

Constraints:

NOT NULL — column cannot be null.

UNIQUE — all values must be distinct.

PRIMARY KEY — NOT NULL + UNIQUE. One per table.

CHECK (age >= 0) — custom condition.

DEFAULT value — value used if not provided.

FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE — referential integrity. ON DELETE options: CASCADE, SET NULL, RESTRICT, NO ACTION.

Altering tables

ALTER TABLE users ADD COLUMN age INTEGER

ALTER TABLE users DROP COLUMN age

ALTER TABLE users ALTER COLUMN name SET NOT NULL

ALTER TABLE users ALTER COLUMN name TYPE VARCHAR(100)

ALTER TABLE users RENAME COLUMN name TO full_name

ALTER TABLE users ADD CONSTRAINT chk_age CHECK (age >= 0)

ALTER TABLE users DROP CONSTRAINT chk_age

DROP TABLE users — delete table and data. DROP TABLE IF EXISTS users CASCADE — also drops dependent objects.

TRUNCATE users — delete all rows, keep table. Faster than DELETE with no WHERE.

SELECT

SELECT * FROM users — avoid * in production, list columns explicitly.

SELECT name, email FROM users WHERE active = true

ORDER BY created_at DESC

LIMIT 10 OFFSET 20 — pagination. OFFSET gets slow on large tables — use cursor-based pagination for scale.

DISTINCT — deduplicate rows: SELECT DISTINCT city FROM users

AS — alias: SELECT name AS full_name

Operators: =, !=, <, >, BETWEEN, IN (...), NOT IN, IS NULL, IS NOT NULL, LIKE '%pattern%', ILIKE (case-insensitive LIKE).

INSERT, UPDATE, DELETE

INSERT INTO users (name, email) VALUES ('Alice', 'a@b.com')

INSERT INTO users (name, email) VALUES (...), (...) — multi-row insert.

INSERT INTO users ... RETURNING id, name — return inserted row(s). Very useful.

UPDATE users SET name = 'Bob' WHERE id = 1

UPDATE users SET name = 'Bob', active = false WHERE id = 1 RETURNING *

DELETE FROM users WHERE id = 1

DELETE FROM users WHERE created_at < now() - interval '1 year' RETURNING id

Always include a WHERE clause in UPDATE/DELETE unless you intentionally want all rows affected. No WHERE = affects everything.

Upsert

INSERT INTO users (email, name) VALUES ('a@b.com', 'Alice') ON CONFLICT (email) DO UPDATE SET name = EXCLUDED.name

EXCLUDED — refers to the row that failed to insert. Use it to reference the incoming values in the DO UPDATE clause.

ON CONFLICT (email) DO NOTHING — ignore duplicates silently.

The column(s) in ON CONFLICT must have a UNIQUE or PRIMARY KEY constraint.

Join types

INNER JOIN — rows where condition matches in both tables. Most common.

LEFT JOIN — all rows from left table, matched rows from right. NULL for unmatched right rows.

RIGHT JOIN — all rows from right table. Rarely used — just swap tables and use LEFT JOIN.

FULL OUTER JOIN — all rows from both tables. NULLs where no match.

CROSS JOIN — cartesian product. Every row from left x every row from right. Rarely intentional.

SELF JOIN — table joined to itself. Used for hierarchical data (employees and their managers).

Join syntax & tips

SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id

Use table aliases — keeps queries readable.

Join on multiple conditions: ON a.id = b.id AND a.type = b.type

USING (column) — shorthand when both tables have the same column name: JOIN orders USING (user_id)

Join order matters for readability, not for the query planner — Postgres reorders joins for optimal execution.

LEFT JOIN then filter on right table in WHERE is equivalent to INNER JOIN. If you want true LEFT JOIN semantics, put the condition in ON, not WHERE.

Aggregate functions

COUNT(*) — count all rows. COUNT(col) — count non-null values.

SUM(col), AVG(col), MIN(col), MAX(col)

STRING_AGG(col, ', ') — concatenate strings. ARRAY_AGG(col) — aggregate into array.

GROUP BY col — group rows with same value. All non-aggregate SELECT columns must be in GROUP BY.

HAVING COUNT(*) > 5 — filter groups after aggregation. WHERE filters before, HAVING filters after.

GROUPING SETS, ROLLUP, CUBE — advanced grouping for multiple aggregation levels in one query.

Window functions

Perform calculations across rows related to the current row without collapsing them into one row. The most powerful feature in SQL.

ROW_NUMBER() OVER (ORDER BY salary DESC) — sequential row number.

RANK() OVER (ORDER BY salary DESC) — rank with gaps for ties.

DENSE_RANK() — rank without gaps.

LAG(col, 1) OVER (ORDER BY date) — value from previous row.

LEAD(col, 1) OVER (ORDER BY date) — value from next row.

SUM(amount) OVER (PARTITION BY user_id ORDER BY date) — running total per user.

PARTITION BY — reset window calculation per group. Without it, window spans all rows.

FIRST_VALUE(col) OVER (...) / LAST_VALUE(col) / NTH_VALUE(col, n)

CTEs (Common Table Expressions)

WITH active_users AS ( SELECT * FROM users WHERE active = true ) SELECT * FROM active_users WHERE created_at > '2024-01-01'

CTEs improve readability. Can define multiple CTEs separated by commas.

Recursive CTE — query hierarchical data (org charts, category trees):

WITH RECURSIVE tree AS ( SELECT id, name, parent_id FROM categories WHERE parent_id IS NULL UNION ALL SELECT c.id, c.name, c.parent_id FROM categories c JOIN tree t ON c.parent_id = t.id ) SELECT * FROM tree

In Postgres, CTEs are by default optimization fences — the planner cannot push conditions into them. Add MATERIALIZED or NOT MATERIALIZED to control this explicitly (Postgres 12+).

Subqueries

Subquery in WHERE: SELECT * FROM orders WHERE user_id IN (SELECT id FROM users WHERE active = true)

Correlated subquery — references outer query. Runs once per outer row (can be slow):

SELECT *, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count FROM users u

EXISTS — check if subquery returns any rows. Stops at first match (faster than IN for large sets):

SELECT * FROM users u WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)

Scalar subquery — returns exactly one value: SELECT name, (SELECT MAX(total) FROM orders) AS max_order FROM users

Index types

CREATE INDEX idx_users_email ON users(email) — B-tree index (default). Good for equality and range queries.

CREATE UNIQUE INDEX idx_users_email ON users(email) — enforces uniqueness.

CREATE INDEX CONCURRENTLY — build index without locking the table. Use in production.

Index types:

B-tree — default. Handles =, <, >, BETWEEN, LIKE 'prefix%'.

Hash — only equality. Slightly faster than B-tree for equality, but rarely worth it.

GIN — inverted index. Good for arrays, JSONB, full-text search.

GiST — geometric/geographic data. Used with PostGIS.

BRIN — block range index. Very small. Good for naturally ordered large tables (timestamps, auto-increment IDs).

Index strategy

Partial index — index only a subset of rows: CREATE INDEX ON orders(user_id) WHERE status = 'pending'. Smaller and faster if you always query with that condition.

Composite index — CREATE INDEX ON orders(user_id, created_at). Column order matters — most selective or most frequently filtered first. Can satisfy queries on first column alone.

Covering index — INCLUDE (col) adds columns to the index without being part of the key: CREATE INDEX ON orders(user_id) INCLUDE (total, status). Allows index-only scans.

EXPLAIN ANALYZE SELECT ... — see the actual execution plan. Look for: Seq Scan (bad on large tables), Index Scan (good), Index Only Scan (best).

Drop unused indexes — they slow down writes and waste space. Check pg_stat_user_indexes for usage stats.

Transaction basics

BEGIN — start transaction.

COMMIT — save all changes.

ROLLBACK — discard all changes since BEGIN.

Every statement is automatically wrapped in a transaction if not explicitly started.

Savepoints — partial rollback within a transaction:

SAVEPOINT my_save

ROLLBACK TO my_save — roll back to savepoint, transaction still open.

RELEASE my_save — remove savepoint.

Isolation levels

Controls what concurrent transactions can see. Higher isolation = safer but slower.

READ COMMITTED (default) — sees only committed data. May see different data on repeated reads in same transaction.

REPEATABLE READ — same data on repeated reads. Prevents non-repeatable reads. May see phantom rows from inserts.

SERIALIZABLE — full isolation. Transactions behave as if run sequentially. Safest but slowest. Postgres uses SSI (Serializable Snapshot Isolation) — very efficient implementation.

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE — set for current transaction.

For most web apps, READ COMMITTED is fine. Use SERIALIZABLE for financial or inventory systems where concurrent updates must be exact.

Locking

Row-level locks — automatically acquired on UPDATE/DELETE.

SELECT ... FOR UPDATE — lock selected rows. Other transactions block until lock released. Use when you read then write based on the value.

SELECT ... FOR UPDATE SKIP LOCKED — skip already-locked rows. Great for job queues — multiple workers can pull tasks without blocking each other.

SELECT ... FOR SHARE — shared lock. Others can read but not update.

Deadlocks — two transactions each waiting for the other's lock. Postgres detects and kills one. Avoid by always acquiring locks in the same order.

JSONB

jsonb stores JSON in binary format — faster to query, supports indexing. Always use jsonb over json.

Access operators: data->'key' (returns jsonb), data->>'key' (returns text).

Nested: data->'address'->>'city'

#> and #>> — path operators: data #>> '{address,city}'

Containment: data @> '{"role":"admin"}'::jsonb — does JSON contain this key-value?

Key exists: data ? 'key'

Any key exists: data ?| array['key1','key2']

All keys exist: data ?& array['key1','key2']

Update JSONB: UPDATE t SET data = jsonb_set(data, '{key}', '"newval"')

GIN index on JSONB: CREATE INDEX ON t USING GIN(data) — enables fast containment and key queries.

Arrays

tags TEXT[] — array column.

Insert: INSERT INTO t (tags) VALUES (ARRAY['a','b'])

Access: tags[1] — 1-indexed in Postgres.

Contains: tags @> ARRAY['a'] — array contains element.

Overlap: tags && ARRAY['a','b'] — arrays share any element.

UNNEST(tags) — expand array to rows.

ARRAY_LENGTH(tags, 1) — length of first dimension.

ARRAY_APPEND(tags, 'c'), ARRAY_REMOVE(tags, 'a')

GIN index: CREATE INDEX ON t USING GIN(tags) — fast array containment queries.

String & date functions

CONCAT(a, b) or a || b — string concatenation.

LOWER(s), UPPER(s), TRIM(s), LENGTH(s)

SUBSTRING(s FROM 1 FOR 5), REPLACE(s, 'a', 'b')

SPLIT_PART(s, ',', 1) — split by delimiter, get nth part.

REGEXP_REPLACE(s, pattern, replacement)

NOW() / CURRENT_TIMESTAMP — current timestamp with timezone.

CURRENT_DATE, CURRENT_TIME

DATE_TRUNC('month', ts) — truncate to start of month/day/hour etc.

EXTRACT(YEAR FROM ts) — extract part of date.

ts + INTERVAL '7 days' — date arithmetic.

AGE(ts1, ts2) — difference as interval.

Conditional & type functions

COALESCE(a, b, c) — first non-null value.

NULLIF(a, b) — returns null if a = b, else a. Avoid division by zero: val / NULLIF(divisor, 0)

CASE WHEN condition THEN value ELSE other END

CAST(val AS integer) or val::integer — type cast. :: is Postgres shorthand.

TO_CHAR(ts, 'YYYY-MM-DD') — format timestamp as string.

TO_TIMESTAMP('2024-01-01', 'YYYY-MM-DD') — parse string to timestamp.

GEN_RANDOM_UUID() — generate UUID v4. Built-in since Postgres 13.

Common traps

NULL comparisons — NULL = NULL is NULL, not true. Always use IS NULL / IS NOT NULL. NULL in IN list does not match anything.

LEFT JOIN + WHERE on right table = INNER JOIN. Put the right table condition in the ON clause instead.

LIKE with leading wildcard '%pattern' cannot use a B-tree index — full table scan. Use full-text search for prefix-independent matching.

COUNT(*) vs COUNT(col) — COUNT(*) counts all rows including nulls. COUNT(col) skips nulls.

GROUP BY — every non-aggregate column in SELECT must be in GROUP BY.

OFFSET pagination is slow — for page 1000 with LIMIT 20, Postgres still scans 20,020 rows. Use keyset pagination: WHERE id > last_seen_id ORDER BY id LIMIT 20.

Things to know cold

EXPLAIN ANALYZE — always use this to debug slow queries. Look at actual vs estimated row counts. Big discrepancies mean stale stats — run ANALYZE table.

Vacuum — Postgres uses MVCC (multi-version concurrency control). Old row versions accumulate. AUTOVACUUM reclaims them. Do not disable it.

Connection pooling — Postgres forks one process per connection. Use PgBouncer or your ORM pool. Default max_connections = 100.

pg_stat_statements extension — tracks query performance over time. Best tool for finding slow queries in production.

Partial vs full table scan threshold — Postgres chooses sequential scan over index scan when retrieving more than roughly 5-10% of rows. This is correct behavior, not a bug.

PostgreSQL

1 · mental model

How PostgreSQL thinks

2 · data types