Java regex predefined character classes

12/31/2023

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides predefined character classes as these shortcuts. Use them to simplify your regexes and minimize syntax errors.

Several categories of predefined character classes are provided: standard, POSIX, java.lang.Character, and Unicode script/block/category/binary property. The standard category includes \d (a digit), \D (a nondigit), \s (a whitespace character), \S (a nonwhitespace character), \w (a word character), and \W (a nonword character).

This example uses the \w predefined character class to identify all word characters in the input text:

java RegexDemo \w "aZ.8 _"

You should observe the following output, which shows that the period and space characters are not considered word characters:

regex = \w
...
Found starting at 5 and ending at 5

Line terminators

Pattern's SDK documentation refers to the period metacharacter as a predefined character class that matches any character except for a line terminator (a one- or two-character sequence identifying the end of a text line). The exception is dotall mode (discussed later), in which the period also matches line terminators. Pattern recognizes the following line terminators:

The new-line (line feed) character ( \n)
The carriage-return character immediately followed by the new-line character ( \r\n)
The standalone carriage-return character ( \r)
The next-line character ( \u0085)
The line-separator character ( \u2028)
The paragraph-separator character ( \u2029)

A capturing group saves a match's characters for later recall during pattern matching; this construct is a character sequence surrounded by parentheses metacharacters ( ( ) ). All characters within the capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters J, a, v, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in the input text. Each match replaces the previous match's saved Java characters with the next match's Java characters.

Capturing groups can be nested inside other capturing groups. For example, in the (Java( language)) regex, ( language) nests inside the group that starts with (Java. Each nested or non-nested capturing group receives its own number, numbering starts at 1, and capturing groups are numbered from left to right. In the example, (Java( language)) belongs to capturing group number 1, and ( language) belongs to capturing group number 2. In (a)(b), (a) belongs to capturing group number 1, and (b) belongs to capturing group number 2.

Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group's captured text characters. The presence of a back reference causes a matcher to use the back reference's capturing group number to recall the capturing group's saved match, and then use that match's characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:

java RegexDemo "(Java( language)\2)" "The Java language language"
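The back-reference behavior described above can be tried without the article's RegexDemo tool. What follows is only a minimal sketch: the class name BackReferenceDemo and the helper firstMatch are hypothetical names made up for this illustration, and only the standard java.util.regex API is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the article's RegexDemo.
class BackReferenceDemo {
    // Returns the first span of input matched by regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \2 recalls whatever group 2, ( language), captured, so the regex
        // only matches when " language" appears twice in a row.
        String regex = "(Java( language)\\2)";
        Matcher m = Pattern.compile(regex).matcher("The Java language language");
        while (m.find()) {
            System.out.println("match   = [" + m.group() + "]");
            System.out.println("group 1 = [" + m.group(1) + "]");
            System.out.println("group 2 = [" + m.group(2) + "]");
        }
        // No doubled word here, so the back reference cannot be satisfied.
        System.out.println(firstMatch(regex, "The Java language"));
    }
}
```

Group 1 is the outer group and group 2 the nested one, which follows the left-to-right numbering rule from the text.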

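The \w example and the period's treatment of line terminators can also be checked programmatically. This is a rough sketch rather than the article's own tool: PredefinedClassDemo and matchStarts are hypothetical names, and Pattern.DOTALL is the standard flag that turns on dotall mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the article's RegexDemo.
class PredefinedClassDemo {
    // Collects the start index of every match of the pattern in input.
    static List<Integer> matchStarts(Pattern p, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The period (index 2) and the space (index 4) are not word characters.
        System.out.println(matchStarts(Pattern.compile("\\w"), "aZ.8 _"));

        // By default the period skips the \n line terminator at index 1.
        System.out.println(matchStarts(Pattern.compile("."), "a\nb"));

        // In dotall mode the period matches the line terminator as well.
        System.out.println(matchStarts(Pattern.compile(".", Pattern.DOTALL), "a\nb"));
    }
}
```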
The subtraction character class consists of all characters except for those indicated in nested negation character classes and matches the remaining characters. For example, [a-z&&[^m-p]] matches characters a through l and q through z. The following command applies a subtraction character class to abcdefg:

java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg

This example matches d and f with their counterparts in abcdefg:

regex = [a-f&&[^a-c]&&[^e]]
...
Found starting at 5 and ending at 5

Some background about the grave accent is provided in the Wikipedia page for the backtick.
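Returning to the subtraction character class described earlier in this section, here is a small sketch of how it behaves in code. SubtractionClassDemo and matchedChars are hypothetical names, and the second regex is an assumption: it is one subtraction class that produces the d and f result mentioned in the text, not necessarily the exact regex the original command used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the article's RegexDemo.
class SubtractionClassDemo {
    // Concatenates every match of regex found in input.
    static String matchedChars(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // a through z minus m through p: only k, l, q, and r survive.
        System.out.println(matchedChars("[a-z&&[^m-p]]", "klmnopqr"));

        // Assumed regex: a through f, minus a through c, minus e.
        System.out.println(matchedChars("[a-f&&[^a-c]&&[^e]]", "abcdefg"));
    }
}
```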

When you use Pattern p = Pattern.compile("\\p… list, shown above). For that we can see that the general category assignment is Po.

The grave is assigned to a different general category from the punctuation category we are using in our regex. That explains why the grave is no longer matched when we add the Pattern.UNICODE_CHARACTER_CLASS flag to our original pattern.

The obvious next question is why did the grave character not get included in the Unicode Po general category? Why is it in Sk instead? I do not have a good answer for that - I'm sure there are "historical reasons".

It's worth noting, however, that the Sk category includes characters such as the acute accent, the cedilla, the diaeresis, and so on - and (as already noted) our grave accent. All these are diacritics - typically used in combination with a base letter to alter the pronunciation. The grave is a bit of an oddity, perhaps, given it has a historical usage outside of being used as a diacritic. It may be more relevant to ask how the grave ended up as part of the original ASCII character set, in the first place.
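The effect of the flag on the grave accent can be reproduced with a short sketch. The pattern in the text above is truncated, so the use of \p{Punct} here is an assumption, and GraveAccentDemo and isPunct are hypothetical names. Character.getType and its category constants are standard Java API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; \p{Punct} is an assumed stand-in for the truncated pattern.
class GraveAccentDemo {
    // Tests whether s is matched in full by \p{Punct}, with or without the Unicode flag.
    static boolean isPunct(String s, boolean unicodeMode) {
        int flags = unicodeMode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\p{Punct}", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        // Without the flag, \p{Punct} is the ASCII punctuation list, which includes the grave.
        System.out.println(isPunct("`", false));

        // With the flag, \p{Punct} follows Unicode general categories instead.
        System.out.println(isPunct("`", true));

        // The exclamation mark is in Po, the grave in Sk.
        System.out.println(Character.getType('!') == Character.OTHER_PUNCTUATION);
        System.out.println(Character.getType('`') == Character.MODIFIER_SYMBOL);
    }
}
```

Comparing the two getType results shows the category difference (Po versus Sk) that the explanation relies on.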
