TOAST Details: Data Types, Representations, and Compression

<productname>PostgreSQL</productname> uses a fixed page size (commonly 8 kB) and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only a small impact on most of the backend code. The technique is affectionately known as <acronym>TOAST</acronym> (or <quote>the best thing since sliced bread</quote>). The <acronym>TOAST</acronym> infrastructure is also used to improve handling of large data values in memory.
</para>

<para>
Only certain data types support <acronym>TOAST</acronym>, since there is no need to impose the overhead on data types that cannot produce large field values. To support <acronym>TOAST</acronym>, a data type must have a variable-length (<firstterm>varlena</firstterm>) representation, in which, ordinarily, the first four-byte word of any stored value contains the total length of the value in bytes (including itself). <acronym>TOAST</acronym> does not constrain the rest of the data type's representation. The special representations collectively called <firstterm><acronym>TOAST</acronym>ed values</firstterm> work by modifying or reinterpreting this initial length word. Therefore, the C-level functions supporting a <acronym>TOAST</acronym>-able data type must be careful about how they handle potentially <acronym>TOAST</acronym>ed input values: an input might not actually consist of a four-byte length word and contents until after it has been <firstterm>detoasted</firstterm>. (This is normally done by invoking <function>PG_DETOAST_DATUM</function> before doing anything with an input value, but in some cases more efficient approaches are possible. See <xref linkend="xtypes-toast"/> for more detail.)
</para>

<para>
<acronym>TOAST</acronym> usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size of any value of a <acronym>TOAST</acronym>-able data type to 1 GB (2<superscript>30</superscript> - 1 bytes). When both bits are zero, the value is an ordinary un-<acronym>TOAST</acronym>ed value of the data type, and the remaining bits of the length word give the total datum size (including length word) in bytes. When the highest-order or lowest-order bit is set, the value has only a single-byte header instead of the normal four-byte header, and the remaining bits of that byte give the total datum size (including length byte) in bytes. This alternative supports space-efficient storage of values shorter than 127 bytes, while still allowing the data type to grow to 1 GB at need. Values with single-byte headers aren't aligned on any particular boundary, whereas values with four-byte headers are aligned on at least a four-byte boundary; this omission of alignment padding provides additional space savings that are significant relative to the size of short values. As a special case, if the remaining bits of a single-byte header are all zero (which would be impossible for a self-inclusive length), the value is a pointer to out-of-line data, with several possible alternatives as described below. The type and size of such a <firstterm>TOAST pointer</firstterm> are determined by a code stored in the second byte of the datum. Lastly, when the highest-order or lowest-order bit is clear but the adjacent bit is set, the content of the datum has been compressed and must be decompressed before use. In this case the remaining bits of the four-byte length word give the total size of the compressed datum, not the original data.
Note that compression is also possible for out-of-line data, but the varlena header does not tell whether it has occurred; the content of the <acronym>TOAST</acronym> pointer tells that instead.
</para>

<para>
The compression technique used for either in-line or out-of-line compressed

This section delves into the specifics of TOAST, explaining that only certain data types with variable-length (varlena) representations support it. It describes how TOAST modifies the initial length word of stored values, using two bits to indicate whether the value is un-TOASTed, has a single-byte header for space efficiency, or is a pointer to out-of-line data. It also covers the use of compression, noting that the varlena header indicates whether in-line data is compressed, while the TOAST pointer indicates compression for out-of-line data. The logical size of a TOAST-able data type is limited to 1 GB due to the length word modification.
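
The header encodings described in this section can be illustrated with a small, self-contained C sketch. It is not PostgreSQL's own code: the names HeaderKind, classify, and stored_size are hypothetical, and the sketch assumes the big-endian convention only (flag bits in the high-order bits of the first byte), reading the header byte by byte so that it behaves the same on any host. It does not interpret the contents of a TOAST pointer.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not PostgreSQL identifiers. */
typedef enum {
    HDR_PLAIN_4B,      /* both flag bits zero: ordinary value, four-byte header */
    HDR_COMPRESSED_4B, /* top bit clear, adjacent bit set: compressed in line   */
    HDR_SHORT_1B,      /* top bit set: single-byte header                       */
    HDR_TOAST_POINTER  /* single-byte header whose remaining bits are all zero  */
} HeaderKind;

/* Classify a datum by its first byte, assuming the big-endian layout. */
static HeaderKind classify(const uint8_t *p)
{
    if (p[0] == 0x80)
        return HDR_TOAST_POINTER;   /* length 0 is impossible, so it is a marker */
    if (p[0] & 0x80)
        return HDR_SHORT_1B;
    if (p[0] & 0x40)
        return HDR_COMPRESSED_4B;
    return HDR_PLAIN_4B;
}

/*
 * Total stored size in bytes, including the header itself.
 * For a compressed datum this is the compressed size, not the original size.
 * Returns 0 for a TOAST pointer, whose size depends on the code in its
 * second byte (not modeled in this sketch).
 */
static uint32_t stored_size(const uint8_t *p)
{
    switch (classify(p)) {
    case HDR_SHORT_1B:
        return p[0] & 0x7F;         /* remaining 7 bits of the single byte */
    case HDR_PLAIN_4B:
    case HDR_COMPRESSED_4B: {
        uint32_t word = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                        ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        return word & 0x3FFFFFFF;   /* remaining 30 bits: at most 2^30 - 1 */
    }
    default:
        return 0;
    }
}
```

The 30-bit mask in stored_size is where the 1 GB limit comes from: with two of the 32 bits reserved as flags, the largest length a four-byte header can express is 2^30 - 1 bytes. A little-endian variant would test the low-order bits of the first byte instead and shift the length down rather than masking it.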