sentinel A[0] = "" will be created if needed
}
COMPATIBILITY ISSUES
MAWK 1.3.3 versus POSIX 1003.2 Draft 11.3
The POSIX 1003.2 (draft 11.3) definition of the AWK language is AWK as described in the AWK book, with a few extensions that appeared in System V Release 4 nawk. The extensions are:
• New functions: toupper() and tolower().
• New variables: ENVIRON[] and CONVFMT.
• ANSI C conversion specifications for printf() and sprintf().
• New command options: -v var=value, multiple -f options and implementation options as arguments to -W.
• For systems (MS-DOS or Windows) which provide a setmode function, an environment variable MAWKBINMODE and a built-in variable BINMODE. The bits of the BINMODE value tell mawk how to modify the RS and ORS variables:
     0  set standard input to binary mode, and if bit 2 is unset, set RS to "\r\n" (CR/LF) rather than "\n" (LF).
     1  set standard output to binary mode, and if bit 2 is unset, set ORS to "\r\n" (CR/LF) rather than "\n" (LF).
     2  suppress the assignment to RS and ORS of CR/LF, making it possible to run scripts and generate output compatible with Unix line-endings.
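As an illustration only, the bit meanings listed above can be decoded with ordinary arithmetic. The helper name describe_binmode and its output format are hypothetical; the sketch merely restates the three rules and does not touch any real BINMODE setting.

```shell
# Hypothetical helper: explains a BINMODE value using the bit meanings above.
describe_binmode() {
    awk -v v="$1" 'BEGIN {
        in_bin  = int(v) % 2          # bit 0: standard input in binary mode
        out_bin = int(v / 2) % 2      # bit 1: standard output in binary mode
        keep    = int(v / 4) % 2      # bit 2: suppress the CR/LF assignment
        rs  = (in_bin  && !keep) ? "CRLF" : "unchanged"
        ors = (out_bin && !keep) ? "CRLF" : "unchanged"
        printf "stdin=%s RS=%s stdout=%s ORS=%s\n", (in_bin ? "binary" : "text"), rs, (out_bin ? "binary" : "text"), ors
    }'
}

describe_binmode 3    # bits 0 and 1 set
describe_binmode 7    # bits 0, 1 and 2 set
```

With bit 2 set (a value of 4 or more), both streams can still be binary while RS and ORS keep their Unix values.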
POSIX AWK is oriented to operate on files a line at a time. RS can be changed from "\n" to another single character, but it is hard to find any use for this; there are no examples in the AWK book. By convention, RS = "" makes one or more blank lines separate records, allowing multi-line records. When RS = "", "\n" is always a field separator regardless of the value in FS.
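A minimal sketch of the RS = "" convention follows. The function name paragraph_summary and the sample input are made up for illustration. FS is left at its default, under which a newline counts as white space in any case, so the sketch does not depend on how a particular awk treats "\n" when FS has been changed.

```shell
# Hypothetical helper: one output line per blank-line-separated record.
paragraph_summary() {
    awk 'BEGIN { RS = "" } { print NR ": " NF " fields, last is " $NF }'
}

printf 'name alice\ncity paris\n\n\nname bob\n' | paragraph_summary
```

The first record spans two input lines and has four fields; the run of blank lines counts as a single separator.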
mawk, on the other hand, allows RS to be a regular expression. When "\n" appears in records, it is treated as space, and FS always determines fields.
Removing the line-at-a-time paradigm can make some programs simpler and can often improve performance. For example, redoing example 3 from above,
     BEGIN { RS = "[^A-Za-z]+" }
           { word[ $0 ] = "" }
     END   { delete word[ "" ]
             for( i in word ) cnt++
             print cnt
           }
counts the number of unique words by making each word a record. On moderate-size files, mawk executes this version twice as fast as the line-at-a-time version, because of the simplified inner loop.
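The word-counting program can be tried from a shell as sketched here. The wrapper name count_unique_words and the sample sentence are assumptions, and the sketch needs an awk that accepts a regular expression in RS (mawk and gawk do). The only change to the program is "cnt + 0", added so that empty input prints 0.

```shell
# Hypothetical wrapper around the unique-word counter.
count_unique_words() {
    awk '
    BEGIN { RS = "[^A-Za-z]+" }      # any run of non-letters ends a record
          { word[$0] = "" }          # one array element per distinct word
    END   { delete word[""]          # drop the empty record from a leading separator
            for (w in word) cnt++
            print cnt + 0 }          # + 0 prints 0 rather than an empty line
    '
}

printf '"Hello," said the cat; the cat said hello.\n' | count_unique_words
```

The comparison is case-sensitive, so "Hello" and "hello" count as two different words.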
The following program replaces each comment by a single space in a C program file,
     BEGIN {
         RS = "/\*([^*]|\*+[^/*])*\*+/"
             # comment is record separator
         ORS = " "
         getline hold
     }
     { print hold ; hold = $0 }
     END { printf "%s" , hold }
Buffering one record is needed to avoid terminating the last record with a space.
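A sketch of the comment-stripping program run on a small input follows. The wrapper name strip_c_comments and the sample text are hypothetical. The backslashes in RS are doubled here, an assumption made so that awks which turn \c in a string into c still hand \* to the regular expression scanner.

```shell
# Hypothetical wrapper: replace each C comment on standard input by one space.
strip_c_comments() {
    awk '
    BEGIN { RS = "/\\*([^*]|\\*+[^/*])*\\*+/"   # a C comment, backslashes doubled
            ORS = " "
            getline hold }                       # buffer the first record
          { print hold; hold = $0 }              # emit the previous record, keep this one
    END   { printf "%s", hold }                  # last record goes out without ORS
    '
}

printf 'a/*x*/b/* y\nz */c\n' | strip_c_comments
```

Because the record separator is a regular expression, the second comment is removed as a unit even though it spans two lines.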
With mawk, the following are all equivalent,
     x ~ /a\+b/     x ~ "a\+b"     x ~ "a\\+b"
The strings get scanned twice, once as a string and once as a regular expression. On the string scan, mawk ignores the escape on non-escape characters, while the AWK book advocates that \c be recognized as c, which necessitates double escaping of meta-characters in strings. POSIX explicitly declines to define the behavior, which passively forces programs that must run under a variety of awks to use the more portable but less readable double escape.
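The two forms that behave the same under every awk can be compared directly. In this sketch the function name literal_plus_match is hypothetical; the single-escape string form is left out on purpose, since its meaning is the part that varies between implementations.

```shell
# Hypothetical helper: test a string against a literal "+" in both portable forms.
literal_plus_match() {
    awk -v s="$1" 'BEGIN {
        lit = (s ~ /a\+b/)       # regular expression literal: one backslash
        str = (s ~ "a\\+b")      # string used as a regular expression: doubled backslash
        print lit, str
    }'
}

literal_plus_match 'a+b'
literal_plus_match 'aab'
```

Both forms match the text a+b and reject aab, which shows that the + is being taken literally rather than as a repetition operator.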
POSIX AWK does not recognize "/dev/std{in,out,err}". Some systems provide an actual device for this, allowing AWKs which do not implement the feature directly to support it.
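Where neither the awk nor the system supplies "/dev/stderr", one commonly used workaround is to pipe the message through a command that copies it to standard error. The function name warn below is hypothetical, and the sketch assumes a POSIX shell and cat are available to the awk process.

```shell
# Hypothetical helper: write a diagnostic to standard error without /dev/stderr.
warn() {
    awk -v msg="$1" 'BEGIN {
        print "warning: " msg | "cat 1>&2"   # cat copies its input to standard error
        close("cat 1>&2")                     # flush and wait for the command
    }'
}

warn 'disk almost full'
```

Nothing is written to standard output, so the diagnostic does not mix with the normal results of the program.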
POSIX AWK does not recognize \x hex escape sequences in strings. Unlike ANSI C, mawk limits the number of digits that follow \x to two, as the current implementation only supports 8-bit characters.
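A program that must also run under awks without \x can build the character from its numeric code instead. This is a minimal sketch with a hypothetical function name, char_from_code; it relies only on the %c conversion applied to a number.

```shell
# Hypothetical helper: print the character with the given decimal code.
char_from_code() {
    awk -v code="$1" 'BEGIN { printf "%c\n", code + 0 }'   # + 0 forces a numeric argument
}

char_from_code 65
```

Code 65 is hexadecimal 41, so this prints the same letter A that "\x41" denotes in mawk.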
POSIX explicitly leaves the behavior of FS = "" undefined, and mentions splitting the record into characters as a possible interpretation, but currently this use is not portable across implementations.
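When a program needs the individual characters of a string and must stay portable, a loop over substr() avoids any reliance on FS = "". The function name split_chars is hypothetical, and the space-separated output is only one way to present the result.

```shell
# Hypothetical helper: print the characters of a string separated by spaces.
split_chars() {
    awk -v s="$1" 'BEGIN {
        n = length(s)
        for (i = 1; i <= n; i++)      # substr is defined by POSIX, unlike FS = ""
            out = out (i > 1 ? " " : "") substr(s, i, 1)
        print out
    }'
}

split_chars 'abc'
```

The loop visits each position once, so the cost grows linearly with the length of the string.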
Some features were not part of the POSIX standard