Buffer Manager
The buffer manager (primarily configured by shared_buffers) is the part of Postgres that caches on-disk data in memory.
The PostgreSQL buffer manager comprises a buffer table, buffer descriptors, and a buffer pool.
Buffer Tag
In PostgreSQL, each page of every data file can be assigned a unique tag, the buffer_tag. When the buffer manager receives a request, it uses the buffer_tag of the desired page to identify that page.
The buffer_tag comprises three values:
- The RelFileNode of the relation to which the page belongs (the tablespace OID, database OID, and relation OID)
- The fork number of the relation to which the page belongs
  The fork numbers of tables, free space maps, and visibility maps are defined as 0, 1, and 2, respectively.
- The block number of the page
For example, the buffer_tag {(16821, 16384, 37721), 0, 7} identifies the page at block number 7 of fork number 0 (the table itself) of the relation whose OID is 37721, in the database whose OID is 16384, under the tablespace whose OID is 16821.
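As a rough illustration, the three components of a buffer_tag can be modeled as nested tuples. This is a Python sketch, not PostgreSQL's C definition; the field names (`spc_oid`, `db_oid`, `rel_oid`, `fork_num`, `block_num`) are invented here for readability.

```python
from typing import NamedTuple

class RelFileNode(NamedTuple):
    spc_oid: int   # tablespace OID
    db_oid: int    # database OID
    rel_oid: int   # relation OID

class BufferTag(NamedTuple):
    rnode: RelFileNode
    fork_num: int    # 0 = table, 1 = free space map, 2 = visibility map
    block_num: int   # block number of the page within the fork

# The example tag from the text.
tag = BufferTag(RelFileNode(16821, 16384, 37721), 0, 7)
# A second tag built from the same values compares and hashes equal,
# which is what lets a tag serve as a lookup key.
same = BufferTag(RelFileNode(16821, 16384, 37721), 0, 7)
```

Because the tag is an immutable value, two requests for the same page always produce the same key.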
How a Backend Process Reads Pages
- When reading a table or index page, a backend process sends a request that includes the page's buffer_tag to the buffer manager.
- The buffer manager returns the buffer_ID of the slot that stores the requested page. If the requested page is not stored in the buffer pool, the buffer manager loads the page from persistent storage into one of the buffer pool slots and then returns the buffer_ID of that slot.
- The backend process accesses the slot identified by the buffer_ID (to read the desired page).
Page Replacement Algorithm
When all buffer pool slots are occupied but the requested page is not stored, the buffer manager must select one page in the buffer pool that will be replaced by the requested page.
Since version 8.1, PostgreSQL has used the clock sweep algorithm, because it is simpler and more efficient than the LRU algorithm used in previous versions.
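A minimal sketch of clock-sweep victim selection, assuming each buffer descriptor carries a `refcount` and a `usage_count` as described later in this section. The names `Desc`, `clock_sweep`, and `next_victim` are illustrative only, and locking is left out entirely.

```python
from dataclasses import dataclass

@dataclass
class Desc:
    refcount: int = 0      # number of processes currently pinning the buffer
    usage_count: int = 0   # how recently/often the buffer has been used

def clock_sweep(descs, next_victim):
    """Return (victim_id, new_next_victim); raise if every buffer is pinned."""
    n = len(descs)
    pinned_in_a_row = 0
    while pinned_in_a_row < n:
        buf_id = next_victim
        next_victim = (next_victim + 1) % n   # advance the clock hand
        d = descs[buf_id]
        if d.refcount > 0:
            pinned_in_a_row += 1              # pinned: cannot be replaced
            continue
        pinned_in_a_row = 0
        if d.usage_count == 0:
            return buf_id, next_victim        # unpinned and cold: the victim
        d.usage_count -= 1                    # unpinned but warm: age it
    raise RuntimeError("no unpinned buffers available")

descs = [Desc(1, 2), Desc(0, 1), Desc(0, 2)]
victim, hand = clock_sweep(descs, 0)
```

In this run the hand skips the pinned buffer 0, ages buffers 1 and 2 on the first pass, and selects buffer 1 on the second pass because its usage_count has reached 0.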
Buffer Manager Structure
The PostgreSQL buffer manager comprises three layers:
- The buffer pool is an array. Each slot stores one data file page. The indices of the array slots are referred to as buffer_ids.
- The buffer descriptors layer is an array of buffer descriptors. Each descriptor has a one-to-one correspondence with a buffer pool slot and holds the metadata of the page stored in that slot.
- The buffer table is a hash table that maps the buffer_tags of stored pages to the buffer_ids of the descriptors that hold the respective pages' metadata.
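The three layers can be pictured with plain Python containers. This is a toy model under stated assumptions: a pool of only four slots, a dict standing in for the hash table, and a hypothetical helper `store_page` that fills all three layers at once.

```python
PAGE_SIZE = 8192   # 8 KB slots, matching the page size
NBUFFERS = 4       # tiny pool, for illustration only

# Buffer pool: one page-sized slot per buffer_id.
buffer_pool = [bytearray(PAGE_SIZE) for _ in range(NBUFFERS)]
# Buffer descriptors: metadata for the slot with the same index.
buffer_descriptors = [
    {"tag": None, "refcount": 0, "usage_count": 0} for _ in range(NBUFFERS)
]
# Buffer table: buffer_tag -> buffer_id.
buffer_table = {}

def store_page(buffer_id, tag, data):
    """Place page data in a slot and record it in the other two layers."""
    buffer_pool[buffer_id][:len(data)] = data
    buffer_descriptors[buffer_id]["tag"] = tag
    buffer_table[tag] = buffer_id

tag_a = ((16821, 16384, 37721), 0, 7)
store_page(1, tag_a, b"page contents")
```

The same index (here 1) identifies both the descriptor and the pool slot, which is the one-to-one correspondence described above.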
Buffer Table
A buffer table can be logically divided into three parts: a hash function, hash bucket slots, and data entries.
A data entry comprises two values: the buffer_tag of a page, and the buffer_id of the descriptor that holds the page's metadata. For example, a data entry 'Tag_A, id=1' means that the buffer descriptor with buffer_id 1 stores the metadata of the page tagged with Tag_A.
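The three logical parts can be sketched as a small chained hash table. The bucket count, the use of Python's built-in `hash`, and the function names are assumptions made for this illustration; they are not PostgreSQL's actual hash function or table layout.

```python
NUM_BUCKETS = 8   # illustrative size only

# Hash bucket slots: each slot holds a chain of data entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_of(tag):
    """Hash function: map a buffer_tag to a hash bucket slot."""
    return hash(tag) % NUM_BUCKETS

def insert_entry(tag, buffer_id):
    """Add a data entry (buffer_tag, buffer_id) to the tag's bucket."""
    buckets[bucket_of(tag)].append((tag, buffer_id))

def lookup(tag):
    """Return the buffer_id stored for tag, or None if there is no entry."""
    for entry_tag, buffer_id in buckets[bucket_of(tag)]:
        if entry_tag == tag:
            return buffer_id
    return None

tag_a = ((16821, 16384, 37721), 0, 7)
insert_entry(tag_a, 1)
```

Entries whose tags hash to the same slot share a chain, so a lookup compares tags within one bucket rather than scanning the whole table.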
Buffer Descriptor
A buffer descriptor holds the metadata of the page stored in the corresponding buffer pool slot. Its layout is given by the structure BufferDesc, which is defined in src/include/storage/buf_internals.h.
To simplify the following descriptions, three descriptor states are defined:
- Empty: the corresponding buffer pool slot does not store a page (i.e. refcount and usage_count are both 0)
- Pinned: the corresponding buffer pool slot stores a page and one or more PostgreSQL processes are accessing it (i.e. refcount and usage_count are both greater than or equal to 1)
- Unpinned: the corresponding buffer pool slot stores a page but no PostgreSQL process is accessing it (i.e. usage_count is greater than or equal to 1, but refcount is 0)
When the PostgreSQL server starts, all buffer descriptors are in the empty state. In PostgreSQL, those empty descriptors form a linked list called the freelist.
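The three states follow directly from the two counters, which the sketch below makes explicit. The `Desc` class and `state` function are hypothetical, and a `deque` of buffer_ids stands in for the linked freelist.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Desc:
    buffer_id: int
    refcount: int = 0      # processes currently accessing the page
    usage_count: int = 0   # accesses since the page was loaded

def state(d):
    """Classify a descriptor using the definitions in the text."""
    if d.refcount == 0 and d.usage_count == 0:
        return "empty"
    if d.refcount >= 1:
        return "pinned"     # pinning raises both counters, so usage_count >= 1 too
    return "unpinned"       # usage_count >= 1 but refcount == 0

# At server start every descriptor is empty and sits on the freelist.
descs = [Desc(i) for i in range(4)]
freelist = deque(d.buffer_id for d in descs)
```

Taking a descriptor off the freelist and pinning it moves it from empty to pinned; releasing the pin later leaves it unpinned, not empty, because usage_count stays above 0.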
Buffer Pool
The buffer pool is a simple array that stores data file pages, such as those of tables and indexes. The indices of the buffer pool array are referred to as buffer_ids. Each buffer pool slot is 8 KB, which is equal to the size of a page, so each slot can store an entire page.
How the Buffer Manager Works
Accessing a Page Stored in the Buffer Pool
- Create the buffer_tag of the desired page (in this example, 'Tag_C') and compute the hash bucket slot.
- Acquire the BufMappingLock partition that covers the obtained hash bucket slot in shared mode (this lock is released in the fifth step).
- Look up the entry whose tag is 'Tag_C' and obtain the buffer_id from the entry. In this example, the buffer_id is 2.
- Pin the buffer descriptor for buffer_id 2, i.e. increase the refcount and usage_count of the descriptor by 1.
- Release the BufMappingLock.
- Access the buffer pool slot with buffer_id 2.
Then, when reading rows from the page in the buffer pool slot, the PostgreSQL process acquires the shared content_lock of the corresponding buffer descriptor. Thus, buffer pool slots can be read by multiple processes simultaneously.
When inserting, updating, or deleting rows in the page, a PostgreSQL process acquires the exclusive content_lock of the corresponding buffer descriptor (note that the dirty bit of the page must be set to '1').
After accessing the pages, the refcount values of the corresponding buffer descriptors are decreased by 1.
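The lookup-and-pin sequence above can be sketched as follows. This is a simplified model: a single `threading.Lock` stands in for the partitioned BufMappingLock (Python's standard library has no shared/exclusive lock), the content_lock is omitted, and the function names are hypothetical.

```python
import threading
from dataclasses import dataclass

@dataclass
class Desc:
    refcount: int = 0
    usage_count: int = 0

buffer_table = {"Tag_C": 2}               # the page is already in the pool
descs = [Desc() for _ in range(4)]
buf_mapping_lock = threading.Lock()       # stand-in for one BufMappingLock partition

def pin_existing(tag):
    """Look up tag and pin its descriptor; return the buffer_id or None."""
    with buf_mapping_lock:                # held across the lookup and the pin
        buffer_id = buffer_table.get(tag)
        if buffer_id is None:
            return None                   # not in the pool: must be loaded first
        descs[buffer_id].refcount += 1    # pinning raises both counters
        descs[buffer_id].usage_count += 1
    return buffer_id                      # lock released; the slot can now be accessed

def unpin(buffer_id):
    """Called after the access finishes: only refcount goes back down."""
    descs[buffer_id].refcount -= 1

bid = pin_existing("Tag_C")
```

Pinning before the mapping lock is released is the key design point: once refcount is above 0, the page cannot be chosen for replacement, so the backend can safely use the slot without holding the mapping lock.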
Loading a Page from Storage to Empty Slot
- Look up the buffer table (we assume that the desired page is not found).
- Obtain the empty buffer descriptor from the freelist, and pin it. In this example, the buffer_id of the obtained descriptor is 4.
- Acquire the BufMappingLock partition in exclusive mode (this lock is released in the sixth step).
- Create a new data entry that comprises the buffer_tag 'Tag_E' and buffer_id 4; insert the created entry to the buffer table.
- Load the desired page data from storage to the buffer pool slot with buffer_id 4.
- Release the BufMappingLock.
- Access the buffer pool slot with buffer_id 4.
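The steps above can be sketched in the same toy model. Everything here is an assumption made for illustration: `storage` is a dict standing in for persistent storage, the freelist contains only buffer_id 4 (the other slots are taken to be occupied), and the BufMappingLock handling is reduced to comments.

```python
from collections import deque

PAGE_SIZE = 8192
NBUFFERS = 5

buffer_pool = [bytearray(PAGE_SIZE) for _ in range(NBUFFERS)]
descs = [{"tag": None, "refcount": 0, "usage_count": 0} for _ in range(NBUFFERS)]
buffer_table = {}
freelist = deque([4])                        # only descriptor 4 is empty here
storage = {"Tag_E": b"E" * PAGE_SIZE}        # hypothetical on-disk page

def load_into_empty_slot(tag):
    """Load the page for tag into an empty slot and return its buffer_id."""
    if tag in buffer_table:                  # look up the buffer table (expected to miss)
        raise ValueError("page is already in the buffer pool")
    buffer_id = freelist.popleft()           # obtain an empty descriptor ...
    d = descs[buffer_id]
    d["refcount"] += 1                       # ... and pin it
    d["usage_count"] += 1
    # (the exclusive BufMappingLock would be acquired here)
    buffer_table[tag] = buffer_id            # insert the new data entry
    d["tag"] = tag
    buffer_pool[buffer_id][:] = storage[tag] # load the page data from storage
    # (the BufMappingLock would be released here)
    return buffer_id                         # the caller can now access the slot

bid = load_into_empty_slot("Tag_E")
```

The descriptor is pinned before the entry becomes visible in the buffer table, so no other process can evict the slot while the page is still being loaded.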
Dirty Pages
When data is modified in memory, the page that stores this data is called a dirty page for as long as the modifications have not been written to disk. In other words, a page whose contents in the shared buffer differ from its contents on disk is a dirty page. The buffer manager flushes dirty pages to storage with the assistance of two subsystems: the checkpointer and the background writer.
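The dirty-bit bookkeeping can be illustrated with a very small model. It only shows when a page becomes dirty and when it becomes clean again; it does not reflect how or when the checkpointer and background writer actually choose pages to write. All names are hypothetical.

```python
storage = {"Tag_A": b"old row"}   # stand-in for the on-disk copy
buffers = [{"tag": "Tag_A", "data": b"old row", "dirty": False}]

def modify(buffer_id, new_data):
    """Change a page in memory; it now differs from storage, so mark it dirty."""
    buf = buffers[buffer_id]
    buf["data"] = new_data
    buf["dirty"] = True

def flush_dirty_pages():
    """Write every dirty page to storage, clear its dirty bit, return the count."""
    written = 0
    for buf in buffers:
        if buf["dirty"]:
            storage[buf["tag"]] = buf["data"]
            buf["dirty"] = False
            written += 1
    return written

modify(0, b"new row")
```

After `modify`, the in-memory page and the stored copy disagree until `flush_dirty_pages` runs, which is exactly the window in which the page counts as dirty.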