B-Tree Deduplication

sweep</quote> by a <command>VACUUM</command> operation (typically run in an autovacuum worker process) will eventually be required as a part of <emphasis>collective</emphasis> cleanup of the table and each of its indexes. </para> <para> Unlike <command>VACUUM</command>, bottom-up index deletion does not provide any strong guarantees about how old the oldest garbage index tuple may be. No index can be permitted to retain <quote>floating garbage</quote> index tuples that became dead prior to a conservative cutoff point shared by the table and all of its indexes collectively. This fundamental table-level invariant makes it safe to recycle table <acronym>TID</acronym>s. This is how it is possible for distinct logical rows to reuse the same table <acronym>TID</acronym> over time (though this can never happen with two logical rows whose lifetimes span the same <command>VACUUM</command> cycle). </para> </sect3> <sect3 id="btree-deduplication"> <title>Deduplication</title> <para> A duplicate is a leaf page tuple (a tuple that points to a table row) where <emphasis>all</emphasis> indexed key columns have values that match corresponding column values from at least one other leaf page tuple in the same index. Duplicate tuples are quite common in practice. B-Tree indexes can use a special, space-efficient representation for duplicates when an optional technique is enabled: <firstterm>deduplication</firstterm>. </para> <para> Deduplication works by periodically merging groups of duplicate tuples together, forming a single <firstterm>posting list</firstterm> tuple for each group. The column key value(s) only appear once in this representation. This is followed by a sorted array of <acronym>TID</acronym>s that point to rows in the table. This significantly reduces the storage size of indexes where each value (or each distinct combination of column values) appears several times on average. The latency of queries can be reduced significantly. Overall query throughput may increase significantly. The overhead of routine index vacuuming may also be reduced significantly. </para> <note> <para> B-Tree deduplication is just as effective with <quote>duplicates</quote> that contain a NULL value, even though NULL values are never equal to each other according to the <literal>=</literal> member of any B-Tree operator class. As far as any part of the implementation that understands the on-disk B-Tree structure is concerned, NULL is just another value from the domain of indexed values. </para> </note> <para> The deduplication process occurs lazily, when a new item is inserted that cannot fit on an existing leaf page, though only when index tuple deletion could not free sufficient space for the new item (typically deletion is briefly considered and then skipped over). Unlike GIN posting list tuples, B-Tree posting list tuples do not need to expand every time a new duplicate is inserted; they are merely an alternative physical representation of the original logical contents of the leaf page. This design prioritizes consistent performance with mixed read-write workloads. Most client applications will at least see a moderate performance benefit from using deduplication. Deduplication is enabled by default. </para> <para> <command>CREATE INDEX</command> and <command>REINDEX</command> apply deduplication to create posting list tuples, though the strategy they use is slightly different. Each group of duplicate ordinary tuples encountered in the sorted input taken from the table is merged into a posting list tuple <emphasis>before</emphasis> being added to the current pending leaf page. Individual posting list tuples are packed with as many <acronym>TID</acronym>s as possible. Leaf pages are written out in the usual way, without

This section discusses B-Tree deduplication, a technique to optimize storage and performance by merging duplicate leaf page tuples into a single posting list tuple containing the key value(s) and an array of TIDs. Deduplication is triggered when a new item cannot fit on a leaf page after index tuple deletion. It's effective with NULL values and prioritizes consistent performance with mixed read-write workloads, generally providing a performance benefit. CREATE INDEX and REINDEX also use deduplication when building indexes.