CSV Quirks and Binary Format for COPY

<literal>CSV</literal> lines with white space out to some fixed width. If such a situation arises you might need to preprocess the <literal>CSV</literal> file to remove the trailing white space, before importing the data into <productname>PostgreSQL</productname>. </para> </note> <note> <para> <literal>CSV</literal> format will both recognize and produce <literal>CSV</literal> files with quoted values containing embedded carriage returns and line feeds. Thus the files are not strictly one line per table row like text-format files. </para> </note> <note> <para> Many programs produce strange and occasionally perverse <literal>CSV</literal> files, so the file format is more a convention than a standard. Thus you might encounter some files that cannot be imported using this mechanism, and <command>COPY</command> might produce files that other programs cannot process. </para> </note> </refsect2> <refsect2 id="sql-copy-binary-format" xreflabel="Binary Format"> <title>Binary Format</title> <para> The <literal>binary</literal> format option causes all data to be stored/read as binary format rather than as text. It is somewhat faster than the text and <literal>CSV</literal> formats, but a binary-format file is less portable across machine architectures and <productname>PostgreSQL</productname> versions. Also, the binary format is very data type specific; for example it will not work to output binary data from a <type>smallint</type> column and read it into an <type>integer</type> column, even though that would work fine in text format. </para> <para> The <literal>binary</literal> file format consists of a file header, zero or more tuples containing the row data, and a file trailer. Headers and data are in network byte order. </para> <note> <para> <productname>PostgreSQL</productname> releases before 7.4 used a different binary file format. </para> </note> <refsect3> <title>File Header</title> <para> The file header consists of 15 bytes of fixed fields, followed by a variable-length header extension area. The fixed fields are: <variablelist> <varlistentry> <term>Signature</term> <listitem> <para> 11-byte sequence <literal>PGCOPY\n\377\r\n\0</literal> — note that the zero byte is a required part of the signature. (The signature is designed to allow easy identification of files that have been munged by a non-8-bit-clean transfer. This signature will be changed by end-of-line-translation filters, dropped zero bytes, dropped high bits, or parity changes.) </para> </listitem> </varlistentry> <varlistentry> <term>Flags field</term> <listitem> <para> 32-bit integer bit mask to denote important aspects of the file format. Bits are numbered from 0 (<acronym>LSB</acronym>) to 31 (<acronym>MSB</acronym>). Note that this field is stored in network byte order (most significant byte first), as are all the integer fields used in the file format. Bits 16–31 are reserved to denote critical file format issues; a reader should abort if it finds an unexpected bit set in this range. Bits 0–15 are reserved to signal backwards-compatible format issues; a reader should simply ignore any unexpected bits set in this range. Currently only one flag bit is defined, and the rest must be zero: <variablelist> <varlistentry> <term>Bit 16</term> <listitem> <para> If 1, OIDs are included in the data; if 0, not. Oid system columns are not supported in <productname>PostgreSQL</productname> anymore, but the format still contains the indicator. </para> </listitem> </varlistentry> </variablelist></para> </listitem> </varlistentry>

This section contains notes on the CSV format, highlighting issues with whitespace padding, carriage returns, line feeds, and the lack of a strict standard. It then transitions to the binary format for COPY, describing it as faster but less portable than text or CSV, and very data-type specific. The binary format consists of a file header, tuples of row data, and a file trailer, all in network byte order. A compatibility note mentions that PostgreSQL releases before 7.4 used a different binary file format. The section details the structure of the file header, including a signature for file identification and a flags field used to denote aspects of the file format, especially the inclusion of OIDs.