Controlling Greediness and Case Sensitivity in Regular Expressions

non-greedy subexpressions, the total match length is either as long as possible or as short as possible, according to the attribute assigned to the whole RE. The attributes assigned to the subexpressions only affect how much of that match they are allowed to <quote>eat</quote> relative to each other.
</para>

<para>
The quantifiers <literal>{1,1}</literal> and <literal>{1,1}?</literal> can be used to force greediness or non-greediness, respectively, on a subexpression or a whole RE. This is useful when you need the whole RE to have a greediness attribute different from what's deduced from its elements. As an example, suppose that we are trying to separate a string containing some digits into the digits and the parts before and after them. We might try to do that like this:
<screen>
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
<lineannotation>Result: </lineannotation><computeroutput>{abc0123,4,xyz}</computeroutput>
</screen>
That didn't work: the first <literal>.*</literal> is greedy so it <quote>eats</quote> as much as it can, leaving the <literal>\d+</literal> to match at the last possible place, the last digit. We might try to fix that by making it non-greedy:
<screen>
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
<lineannotation>Result: </lineannotation><computeroutput>{abc,0,""}</computeroutput>
</screen>
That didn't work either, because now the RE as a whole is non-greedy and so it ends the overall match as soon as possible. We can get what we want by forcing the RE as a whole to be greedy:
<screen>
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
<lineannotation>Result: </lineannotation><computeroutput>{abc,01234,xyz}</computeroutput>
</screen>
Controlling the RE's overall greediness separately from its components' greediness allows great flexibility in handling variable-length patterns.
</para>

<para>
When deciding what is a longer or shorter match, match lengths are measured in characters, not collating elements.
An empty string is considered longer than no match at all. For example: <literal>bb*</literal> matches the three middle characters of <literal>abbbc</literal>; <literal>(week|wee)(night|knights)</literal> matches all ten characters of <literal>weeknights</literal>; when <literal>(.*).*</literal> is matched against <literal>abc</literal> the parenthesized subexpression matches all three characters; and when <literal>(a*)*</literal> is matched against <literal>bc</literal> both the whole RE and the parenthesized subexpression match an empty string.
</para>

<para>
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket expression, it is effectively transformed into a bracket expression containing both cases, e.g., <literal>x</literal> becomes <literal>[xX]</literal>. When it appears inside a bracket expression, all case counterparts of it are added to the bracket expression, e.g., <literal>[x]</literal> becomes <literal>[xX]</literal> and <literal>[^x]</literal> becomes <literal>[^xX]</literal>.
</para>

<para>
If newline-sensitive matching is specified, <literal>.</literal> and bracket expressions using <literal>^</literal> will never match the newline character (so that matches will not cross lines unless the RE explicitly includes a newline) and <literal>^</literal> and <literal>$</literal> will match the empty string after and before a newline respectively, in addition to matching at beginning and end of string respectively. But the ARE escapes <literal>\A</literal> and <literal>\Z</literal> continue to match beginning or end of string <emphasis>only</emphasis>.
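
The case-independent and newline-sensitive rules above can be tried out with a few queries. This is a minimal sketch, not part of the original text. It assumes a PostgreSQL server recent enough to provide <literal>regexp_match</literal>, the default <literal>standard_conforming_strings = on</literal> (so that <literal>\A</literal> in an ordinary string literal reaches the regex engine unchanged), and it selects the two modes with the <literal>~*</literal> operator, the embedded <literal>(?i)</literal> option, and the <literal>n</literal> flag. The sample strings and column aliases are illustrative only.

```sql
-- Case-independent matching: the ~* operator and the embedded (?i) option.
SELECT 'X' ~* 'x'         AS ci_operator,  -- true: x behaves like [xX]
       'X' ~  '(?i)x'     AS ci_embedded,  -- true
       'X' ~  '(?i)[^x]'  AS ci_negated;   -- false: [^x] behaves like [^xX]

-- Newline-sensitive matching, requested with the 'n' flag.
-- E'foo\nbar' is an escape string containing a real newline character.
SELECT regexp_match(E'foo\nbar', 'foo.bar')      AS dot_default,   -- matches: . crosses the newline
       regexp_match(E'foo\nbar', 'foo.bar', 'n') AS dot_nl,        -- NULL: . no longer matches newline
       regexp_match(E'foo\nbar', '^bar')         AS caret_default, -- NULL: ^ only matches at string start
       regexp_match(E'foo\nbar', '^bar', 'n')    AS caret_nl,      -- {bar}: ^ also matches after a newline
       regexp_match(E'foo\nbar', '\Abar', 'n')   AS a_escape;      -- NULL: \A still means string start only
```

Comparing the default and <literal>n</literal> variants in one result row makes it easy to see which constructs change meaning and which, like <literal>\A</literal>, do not.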

This section provides a detailed explanation of how to control greediness in regular expressions using quantifiers like `{1,1}` and `{1,1}?`. It demonstrates scenarios where adjusting the greediness of the entire RE separately from its components is necessary to achieve the desired matching behavior, especially when dealing with variable-length patterns. The section also defines how match lengths are determined. Furthermore, it covers case-independent and newline-sensitive matching, explaining how these options affect the behavior of character matching, anchor points, and bracket expressions.
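
As a quick check of the greediness and match-length rules summarized above, the following queries may help. This is a sketch under assumptions rather than part of the original section: it assumes a PostgreSQL server that provides `regexp_match` and the default `standard_conforming_strings = on`. The `XY1234Z` pair is taken from the same chapter of the PostgreSQL manual, the remaining patterns are the ones named in the text, and the column aliases are illustrative.

```sql
-- Whole-RE greediness: the first quantified atom (Y* vs. Y*?) sets the
-- attribute for the whole RE, so the overall match is longest or shortest.
SELECT substring('XY1234Z' from 'Y*([0-9]{1,3})')  AS greedy_re,      -- 123
       substring('XY1234Z' from 'Y*?([0-9]{1,3})') AS non_greedy_re;  -- 1

-- Match-length rules, using the patterns quoted in the text.
SELECT substring('abbbc' from 'bb*')                           AS middle_bs,   -- bbb
       regexp_match('weeknights', '(week|wee)(night|knights)') AS longest,     -- {wee,knights}: the ten-character match
       regexp_match('abc', '(.*).*')                           AS first_group, -- {abc}
       regexp_match('bc', '(a*)*')                             AS empty_match; -- per the text, the group matches an empty string
```

The `weeknights` row is the most instructive: the longest overall match wins even though the first alternative, `week`, would also have matched.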