Concurrency Control
Concurrency Control is a mechanism that maintains atomicity and isolation, which are two properties of the ACID, when several transactions run concurrently in the database.
Transaction Isolation Level in PostgreSQL
Isolation Level | Dirty Reads | Non-repeatable Read | Phantom Read | Serialization Anomaly |
---|---|---|---|---|
READ UNCOMMITTED | Allowed, but not in PG | Possible | Possible | Possible |
READ COMMITTED | Not possible | Possible | Possible | Possible |
REPEATABLE READ | Not possible | Not possible | Allowed, but not in PG | Possible |
SERIALIZABLE | Not possible | Not possible | Not possible | Not possible |
In PostgreSQL, you can request any of the four standard transaction isolation levels, but internally only three distinct isolation levels are implemented, i.e., PostgreSQL's Read Uncommitted
mode behaves like Read Committed
.
Transaction ID
Whenever a transaction begins, a unique identifier, referred to as a transaction id (txid), is assigned by the transaction manager. PostgreSQL's txid is a 32-bit unsigned integer, approximately 4.2 billion (thousand millions).
PostgreSQL reserves the following three special txids:
- 0 means Invalid txid.
- 1 means Bootstrap txid, which is only used in the initialization of the database cluster.
- 2 means Frozen txid
Since the txid space is insufficient in practical systems, PostgreSQL treats the txid space as a circle. The previous 2.1 billion txids are in the past
, and the next 2.1 billion txids are in the future
.
Transaction wraparound problem
Assume that tuple Tuple_1
is inserted with a txid of 100
, i.e. the t_xmin of Tuple_1
is 100
. The server has been running for a very long period and Tuple_1
has not been modified.
The current txid is 2.1 billion + 100 and a SELECT command is executed. At this time, Tuple_1
is visible because txid 100
is in the past. Then, the same SELECT command is executed; thus, the current txid is 2.1 billion + 101. However, Tuple_1
is no longer visible because txid 100
is in the future. This is the so called transaction wraparound problem in PostgreSQL.
To deal with this problem, PostgreSQL introduced a concept called frozen txid, and implemented a process called FREEZE.
In PostgreSQL, a frozen txid, which is a special reserved txid 2
, is defined such that it is always older than all other txids. In other words, the frozen txid is always inactive and visible.
The freeze process is invoked by the vacuum process. The freeze process scans all table files and rewrites the t_xmin of tuples to the frozen txid if the t_xmin value is older than the current txid minus the vacuum_freeze_min_age (the default is 50 million).
Tuple Structure
A heap tuple comprises three parts, i.e. the HeapTupleHeaderData structure, NULL bitmap, and user data.
While the HeapTupleHeaderData structure contains seven fields, four fields are required in the subsequent sections.
- t_xmin holds the txid of the transaction that inserted this tuple.
- t_xmax holds the txid of the transaction that deleted or updated this tuple. If this tuple has not been deleted or updated, t_xmax is set to 0, which means INVALID.
- t_cid holds the command id (cid), which means how many SQL commands were executed before this command was executed within the current transaction beginning from 0. For example, assume that we execute three INSERT commands within a single transaction:
BEGIN; INSERT; INSERT; INSERT; COMMIT;
. If the first command inserts this tuple, t_cid is set to 0. If the second command inserts this, t_cid is set to 1, and so on. - t_ctid holds the tuple identifier (tid) that points to itself or a new tuple tid, is used to identify a tuple within a table. When this tuple is updated, the t_ctid of this tuple points to the new tuple; otherwise, the t_ctid points to itself.
The image below shows an example of how tuples are represented.
Inserting, Deleting and Updating Tuples
Insertion
Suppose that a tuple is inserted in a page by a transaction whose txid is 99
. In this case, the header fields of the inserted tuple are set as follows.
- t_xmin is set to
99
because this tuple is inserted by txid 99. - t_xmax is set to
0
because this tuple has not been deleted or updated. - t_cid is set to
0
because this tuple is the first tuple inserted by txid 99. - t_ctid is set to
(0,1)
, which points to itself, because this is the latest tuple.
Deletion
Suppose that Tuple_1 is deleted by txid 111. In this case, the header fields of Tuple_1 are set as follows.
- t_xmax is set to 111.
Dead tuples should eventually be removed from pages. Cleaning dead tuples is referred to as VACUUM processing,
Update
Suppose that the row, which has been inserted by txid 99
, is updated twice by txid 100
.
When the first UPDATE
command is executed, Tuple_1
is logically deleted by setting txid 100
to the t_xmax, and then Tuple_2
is inserted. Then, the t_ctid of Tuple_1
is rewritten to point to Tuple_2
. The header fields of both Tuple_1
and Tuple_2
are as follows.
Tuple_1:
- t_xmax is set to 100.
- t_ctid is rewritten from (0, 1) to (0, 2).
Tuple_2:
- t_xmin is set to 100.
- t_xmax is set to 0.
- t_cid is set to 0.
- t_ctid is set to (0,2).
When the second UPDATE
command is executed, as in the first UPDATE
command, Tuple_2
is logically deleted and Tuple_3
is inserted. The header fields of both Tuple_2
and Tuple_3
are as follows.
Tuple_2:
- t_xmax is set to 100.
- t_ctid is rewritten from (0, 2) to (0, 3).
Tuple_3:
t_xmin is set to 100. t_xmax is set to 0. t_cid is set to 1. t_ctid is set to (0,3).
As with the delete operation, if txid 100 is committed, Tuple_1
and Tuple_2
will be dead tuples, and, if txid 100 is aborted, Tuple_2
and Tuple_3
will be dead tuples.
Commit Log (clog)
PostgreSQL holds the statuses of transactions in the Commit Log. The Commit Log, often called the clog, is allocated to the shared memory, and is used throughout transaction processing.
Transaction Status
PostgreSQL defines four transaction statuses
- IN_PROGRESS
- COMMITTED
- ABORTED
- SUB_COMMITTED (for sub-transaction)
How Clog Performs
The clog comprises one or more 8 KB pages in shared memory. The clog logically forms an array. The indices of the array correspond to the respective transaction ids, and each item in the array holds the status of the corresponding transaction id.
Maintenance of the Clog
When PostgreSQL shuts down or whenever the checkpoint process runs, the data of the clog are written into files stored under the pg_xact subdirectory. These files are named 0000, 0001, etc. The maximum file size is 256 KB.
When PostgreSQL starts up, the data stored in the pg_xact's files are loaded to initialize the clog.
The size of the clog continuously increases because a new page is appended whenever the clog is filled up. However, not all data in the clog are necessary. Vacuum processing, regularly removes such old data (both the clog pages and files).
Transaction Snapshot
A transaction snapshot is a dataset that stored information about whether all transactions are active, at a certain point in time for an individual transaction.
The textual representation of the txid_current_snapshot is xmin:xmax:xip_list
, and the components are described as follows.
- xmin earliest txid that is still active. All earlier transactions will either be committed and visible, or rolled back and dead.
- xmax first as-yet-unassigned txid. All txids greater than or equal to this are not yet started as of the time of the snapshot, and thus invisible.
- xip_list the list includes only active txids between xmin and xmax
For example, in the snapshot 100:104:100,102
, xmin is 100
, xmax 104
, and xip_list 100,102
.
Transaction snapshots are provided by the transaction manager.
- In the READ COMMITTED isolation level, the transaction obtains a snapshot whenever an SQL command is executed
- In REPEATABLE READ or SERIALIZABLE, the transaction only gets a snapshot when the first SQL command is executed.
Preventing Lost Updates
A Lost Update, also known as a ww-conflict, is an anomaly that occurs when concurrent transactions update the same rows, and it must be prevented in both the REPEATABLE READ and SERIALIZABLE levels.
The target row is being updated
Being updated
means that the row is updated by another concurrent transaction and its transaction has not terminated. In this case, the current transaction must wait for termination of the transaction that updated the target row because PostgreSQL's SI uses the first-updater-win scheme. For example, assume that transactions Tx_A and Tx_B run concurrently, and Tx_B attempts to update a row; however, Tx_A has updated it and is still in progress. In this case, Tx_B waits for the termination of Tx_A.
After the transaction that updated the target row commits, the update operation of the current transaction proceeds. If the current transaction is in the READ COMMITTED level, the target row will be updated; otherwise (REPEATABLE READ or SERIALIZABLE), the current transaction is aborted immediately to prevent lost updates.
The target row has been updated by the concurrent transaction
The current transaction attempts to update the target tuple; however, the other concurrent transaction has updated the target row and has already been committed. In this case, if the current transaction is in the READ COMMITTED level, the target row will be updated; otherwise, the current transaction is aborted immediately to prevent lost updates.
There is no conflict
When there is no conflict, the current transaction can update the target row.
Serializable Snapshot Isolation
Write skew problem
Write skew is the problem when two concurrent transactions each determine what they are writing based on reading a data set which overlaps what the other is writing, you can get a state which could not occur if either had run before the other.
PostgreSQL takes the following strategy for the SSI implementation to prevent write skew:
- Record all objects (tuples, pages, relations) accessed by transactions as SIREAD locks.
- Detect rw-conflicts using SIREAD locks whenever any heap or index tuple is written.
- Abort the transaction if a serialization anomaly is detected by checking detected rw-conflicts.
Implementing SSI in PostgreSQL
PostgreSQL has implemented many functions and data structures. However, here we uses only two data structures to describe the SSI mechanism
- SIREAD locks
- rw-conflicts,
Predicate locks (SIREAD lock)
An SIREAD lock, internally called a predicate lock, is a pair of an object and (virtual) txids that store information about who has accessed which object.
For example, if txid 100 reads Tuple_1 of the given table, an SIREAD lock {Tuple_1, {100}}
is created. If another transaction, e.g. txid 101, reads Tuple_1, the SIREAD lock is updated to {Tuple_1, {100,101}}
.
Note that a SIREAD lock is also created when an index page is read.
SIREAD lock has three levels: tuple
, page
, and relation
.
If the SIREAD locks of all tuples within a single page are created, they are aggregated into a single SIREAD lock for that page, and all SIREAD locks of the associated tuples are released (removed), to reduce memory space. The same is true for all pages that are read.
When creating an SIREAD lock for an index, the page level SIREAD lock is created from the beginning.
When using sequential scan, a relation level SIREAD lock is created from the beginning regardless of the presence of indexes and/or WHERE clauses. Note that, in certain situations, this implementation can cause false-positive detections of serialization anomalies.
Predicate locks are not always released immediately on completion of the transaction. It only releases until all concurrent transactions commit
rw-conflicts
A rw-conflict is a triplet of an SIREAD lock and two txids that reads and writes the SIREAD lock. The CheckTargetForConflictsIn function is invoked whenever either an INSERT, UPDATE or DELETE command is executed in SERIALIZABLE mode, and it creates rw-conflicts when detecting conflicts by checking SIREAD locks. For example, assume that txid 100 reads Tuple_1 and then txid 101 updates Tuple_1. In this case, the CheckTargetForConflictsIn function, invoked by the UPDATE command in txid 101, detects a rw-conflict with Tuple_1 between txid 100 and 101 then creates a rw-conflict {r=100, w=101, {Tuple_1}}
How SSI Performs
Transactions Tx_A and Tx_B execute the following commands:
Assume that all commands use index scan. Therefore, when the commands are executed, they read both heap tuples and index pages.
The image below shows how PostgreSQL detects and resolves the Write-Skew anomaly described in the above scenario.
False-Positive Serialization Anomalies
Scenario where false-positive serialization anomaly occurs.
When using sequential scan, PostgreSQL creates a relation level SIREAD lock. In this case, rw-conflicts C1 and C2, which are associated with the tbl's SIREAD lock, are created, and they create a cycle in the precedence graph. Thus, a false-positive Write-Skew anomaly is detected.
Even when using index scan, if both transactions Tx_A and Tx_B get the same index SIREAD lock, PostgreSQL detects a false-positive anomaly.