 sentinel A[0] = "" will be created if needed
            }

COMPATIBILITY ISSUES
   MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
       The POSIX 1003.2 (draft 11.3) definition of the AWK language is AWK as described in the AWK book, with a few extensions that appeared in System V Release 4 nawk.  The extensions are:

          •   New functions: toupper() and tolower().

          •   New variables: ENVIRON[] and CONVFMT.

          •   ANSI C conversion specifications for printf() and sprintf().

          •   New command options:  -v var=value, multiple -f options and implementation options as arguments to -W.

          •   For systems (MS‐DOS or Windows) that provide a setmode function, an environment variable MAWKBINMODE and a built‐in variable BINMODE.  The bits of the BINMODE value tell mawk how to modify the RS and ORS
              variables:

              0  set standard input to binary mode, and if BIT‐2 is unset, set RS to "\r\n" (CR/LF) rather than "\n" (LF).

              1  set standard output to binary mode, and if BIT‐2 is unset, set ORS to "\r\n" (CR/LF) rather than "\n" (LF).

              2  suppress the assignment to RS and ORS of CR/LF, making it possible to run scripts and generate output compatible with Unix line‐endings.
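
       The first four extensions in the list above can be exercised together in a short sketch.  The variable names (GREETING, name, pi, str) and the CONVFMT value are illustrative assumptions, not taken from the manual; the BINMODE item is not
       covered here.

```sh
# Sketch: -v assignment, ENVIRON[], CONVFMT, toupper() and tolower() in one program.
greeting=$(GREETING=hello awk -v name=World 'BEGIN {
    CONVFMT = "%.3g"        # used when a number is converted to a string
    pi = 3.14159
    str = pi ""             # concatenation forces the number-to-string conversion
    printf "%s, %s! pi is about %s\n", toupper(ENVIRON["GREETING"]), tolower(name), str
}')
printf '%s\n' "$greeting"
```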

       POSIX AWK is oriented to operate on files a line at a time.  RS can be changed from "\n" to another single character, but it is hard to find any use for this; there are no examples in the AWK book.  By convention,
       RS = "" makes one or more blank lines separate records, allowing multi‐line records.  When RS = "", "\n" is always a field separator regardless of the value in FS.
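
       A minimal sketch of the RS = "" convention follows.  The sample data is invented for illustration, and FS is set to "\n" explicitly so that each line of a record is one field in any implementation.

```sh
# Sketch: blank lines separate records; each line of a record becomes one field.
records=$(printf 'Ann\nParis\n\n\nBob\nOslo\n' |
awk 'BEGIN { RS = ""; FS = "\n" }
     { print NR ": " $1 " lives in " $2 }')
printf '%s\n' "$records"
```

       The two consecutive blank lines in the input still yield only two records, since one or more blank lines count as a single separator.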

       mawk, on the other hand, allows RS to be a regular expression.  When "\n" appears in records, it is treated as space, and FS always determines fields.

       Removing the line‐at‐a‐time paradigm can make some programs simpler and can often improve performance.  For example, redoing example 3 from above,

            BEGIN { RS = "[^A-Za-z]+" }

            { word[ $0 ] = "" }

            END { delete  word[ "" ]
              for( i in word )  cnt++
              print cnt
            }

       counts the number of unique words by making each word a record.  On moderate‐size files, mawk executes this version twice as fast as the line‐at‐a‐time version, because of the simplified inner loop.
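
       To see the two styles side by side, the sketch below runs the record‐per‐word program next to a line‐at‐a‐time variant built on split().  The variant is a hypothetical stand‐in written for this comparison, not a copy of example 3, and the
       sample text is invented.  The first program assumes an awk that accepts a regular expression in RS.

```sh
text='the cat and the hat
The end.'

# Record-per-word version from the text (needs regular-expression RS).
by_record=$(printf '%s\n' "$text" | awk '
BEGIN { RS = "[^A-Za-z]+" }
{ word[$0] = "" }
END { delete word[""]; for (i in word) cnt++; print cnt + 0 }')

# Hypothetical line-at-a-time equivalent: split each line, skip empty pieces.
by_line=$(printf '%s\n' "$text" | awk '
{
    n = split($0, w, /[^A-Za-z]+/)
    for (i = 1; i <= n; i++)
        if (w[i] != "") word[w[i]] = ""
}
END { for (k in word) cnt++; print cnt + 0 }')

printf 'by record: %s, by line: %s\n' "$by_record" "$by_line"
```

       Both programs should report the same count; the second needs an explicit inner loop and an empty‐string test that the first avoids.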

       The following program replaces each comment in a C program file with a single space,

            BEGIN {
              RS = "/\*([^*]|\*+[^/*])*\*+/"
                 # comment is record separator
              ORS = " "
              getline  hold
            }

            { print hold ; hold = $0 }

            END { printf "%s" , hold }

       Buffering one record is needed to avoid terminating the last record with a space.
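
       As a hedged usage sketch, the same program can be run on a small invented input.  The backslashes in the RS string are doubled here, an assumption made so that awks which scan "\*" in strings differently still see the same regular
       expression; an awk that accepts a regular expression in RS is still required.

```sh
# Sketch: strip C comments, leaving one space (ORS) where each comment was.
stripped=$(printf 'int a; /* one */ int b; /* two\nlines */ int c;\n' |
awk '
BEGIN {
    RS = "/\\*([^*]|\\*+[^/*])*\\*+/"   # a C comment is the record separator
    ORS = " "
    getline hold                        # buffer the first record
}
{ print hold; hold = $0 }               # emit the previous record, keep the current one
END { printf "%s", hold }               # last record is written without a trailing ORS
')
printf '%s\n' "$stripped"
```

       Each comment, including the one that spans two lines, is replaced by the single ORS space; the spaces that surrounded the comment in the input are kept.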

       With mawk, the following are all equivalent,

            x ~ /a\+b/    x ~ "a\+b"     x ~ "a\\+b"

       The strings get scanned twice, once as a string and once as a regular expression.  On the string scan, mawk ignores the escape on non‐escape characters, while the AWK book advocates that \c be recognized as c, which necessitates
       the double escaping of meta‐characters in strings.  POSIX explicitly declines to define the behavior, which passively forces programs that must run under a variety of awks to use the more portable but less readable
       double escape.
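
       The two portable spellings can be checked with a small sketch.  The single‐escaped string form "a\+b" is left out on purpose, because its meaning depends on how a given awk performs the string scan.  The input lines are invented.

```sh
# Sketch: a regex literal and a double-escaped string both match a literal "+".
matches=$(printf 'a+b\naab\n' | awk '
$0 ~ /a\+b/  { print "regex literal matches: " $0 }
$0 ~ "a\\+b" { print "double-escaped string matches: " $0 }
')
printf '%s\n' "$matches"
```

       Only the line "a+b" is reported, once by each pattern; "aab" is not, because the "+" is matched literally rather than as a repetition operator.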

       POSIX AWK does not recognize "/dev/std{in,out,err}".  Some systems provide an actual device for this, allowing AWKs which do not implement the feature directly to support it.
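
       Where neither the awk nor the system provides "/dev/stderr", a common workaround is to pipe diagnostics through a command that writes to standard error.  The sketch below assumes a POSIX shell and cat are available; the message text is
       invented.

```sh
# Sketch: send a diagnostic to stderr without relying on "/dev/stderr".
data=$(awk 'BEGIN {
    print "warning: something odd" | "cat 1>&2"   # diagnostic goes to stderr via cat
    close("cat 1>&2")                             # flush and finish the pipe
    print "data line"                             # ordinary output stays on stdout
}' 2>/dev/null)
printf '%s\n' "$data"
```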

       POSIX AWK does not recognize \x hex escape sequences in strings.  Unlike ANSI C, mawk limits the number of digits that follow \x to two, as the current implementation only supports 8‐bit characters.
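
       Programs that must avoid \x can fall back on octal escapes or on the %c conversion with a numeric argument, as in this small sketch (the character codes are chosen only for illustration).

```sh
# Sketch: produce characters without \x: octal escape "\101" and %c with code 66.
letters=$(awk 'BEGIN { printf "%s%c\n", "\101", 66 }')
printf '%s\n' "$letters"
```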

       POSIX explicitly leaves the behavior of FS = "" undefined, and mentions splitting the record into characters as a possible interpretation, but currently this use is not portable across implementations.
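
       A portable way to visit the characters of a record, without relying on FS = "", is a substr() loop.  The sketch below is one possible spelling; the comma‐separated output format is an arbitrary choice.

```sh
# Sketch: split a record into characters with substr() instead of FS = "".
chars=$(printf 'abc\n' | awk '{
    n = length($0)
    for (i = 1; i <= n; i++)
        printf "%s%s", substr($0, i, 1), (i < n ? "," : "\n")
}')
printf '%s\n' "$chars"
```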

       Some features were not part of the POSIX standard

Title: MAWK Compatibility Issues and Differences from POSIX AWK
Summary
This section details the compatibility issues between MAWK 1.3.3 and the POSIX 1003.2 Draft 11.3 standard for AWK. It lists the extensions in POSIX AWK: new functions (toupper(), tolower()), new variables (ENVIRON[], CONVFMT), ANSI C conversion specifications for printf() and sprintf(), and new command options. It also describes the MAWKBINMODE environment variable and the BINMODE variable for MS-DOS/Windows. The section contrasts MAWK's handling of RS (record separator) as a regular expression with POSIX AWK's line-at-a-time paradigm, illustrating the difference with two examples: counting unique words and replacing comments in C code. It addresses discrepancies in string scanning, escape sequences, and the recognition of special device files. Finally, it mentions the undefined behavior of FS = "" in POSIX and notes that some features were not part of the POSIX standard.