String Patterns

How string patterns work.

by Halalaluyafail3

Pattern Matching Functions

Lua provides functions for matching strings through patterns.

string.find(string s,string pattern [, number init [, boolean plain]])

Searches the string s for the first match of pattern. The optional init controls where the search starts (default = 1). The optional plain controls whether the pattern matching facilities are disabled: when it is true, pattern is treated as plain text (init must be supplied as well in that case). On success, returns the starting and ending positions of the entire match, plus all captures (explained later); if there is no match, returns nil. This is the function to use when only checking whether s contains pattern. If pattern matching isn't desired, pass true for plain.
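As a rough illustration of the return values described above, the following sketch runs string.find on a made-up subject string (the string and the variable names are only for demonstration):

```lua
local s = "hello world (v1.0)"

-- Plain substring-like search: start and end positions of the match.
local startPos, endPos = string.find(s, "wor")
print(startPos, endPos) --> 7 9

-- Captures are returned after the two positions.
local capStart, capEnd, digits = string.find(s, "v(%d+)")
print(capStart, capEnd, digits) --> 14 15 1

-- init = 6 starts the search after the first "o" (position 5).
local secondO = string.find(s, "o", 6)
print(secondO) --> 8

-- As a pattern, "(" ")" form a capture and "." matches any character.
local pStart, pEnd, pCap = string.find(s, "(v1.0)")
print(pStart, pEnd, pCap) --> 14 17 v1.0

-- With plain = true, every character of the pattern is literal.
local lStart, lEnd = string.find(s, "(v1.0)", 1, true)
print(lStart, lEnd) --> 13 18

-- No match: nil.
local missing = string.find(s, "xyz")
print(missing) --> nil
```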

string.match(string s,string pattern [, number init])

Looks for the first match of pattern in s. The optional init controls where the search starts (default = 1). If the pattern has at least one capture, all captures are returned; otherwise the whole match is returned. If no match is found, nil is returned.
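A small sketch of the cases above, using an invented "key = value" string (variable names are illustrative only):

```lua
local s = "key = value"

-- No captures in the pattern: the whole match is returned.
local word = string.match(s, "%a+")
print(word) --> key

-- With captures: one return value per capture.
local k, v = string.match(s, "(%a+)%s*=%s*(%a+)")
print(k, v) --> key value

-- init = 5 starts at the "=", so "key" is skipped.
local later = string.match(s, "%a+", 5)
print(later) --> value

-- No match: nil.
local none = string.match(s, "%d+")
print(none) --> nil
```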

string.gsub(string s,string pattern,repl [,number n])

Returns a copy of string s in which all (or the first n, if provided) occurrences of pattern have been replaced according to repl, followed by the total number of matches that occurred. repl can be a string, a table, or a function. If repl is a string, then its value is used for replacement. In that string % has a special meaning: %d, where d is between 1 and 9, represents the dth capture, %0 represents the whole match, and %% represents the % character itself. If repl is a table, the table is indexed with the first capture for each match (or with the whole match if there are no captures). If repl is a function, it is called for every match with all captures as arguments (or with the whole match if there are no captures). If the value returned by the table index or function call is a string or number, it is used as the replacement; otherwise, if it is false or nil, there is no replacement and the original match is kept in the string.
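The three kinds of repl can be sketched as follows. The subject strings and the lookup table are invented for the example:

```lua
-- String repl: %1 and %2 refer to the captures, so this swaps two words.
local swapped, swapCount = string.gsub("hello world", "(%a+) (%a+)", "%2 %1")
print(swapped, swapCount) --> world hello 1

-- %0 is the whole match and %% is a literal percent sign.
local wrapped = string.gsub("50 and 75", "%d+", "[%0%%]")
print(wrapped) --> [50%] and [75%]

-- n limits how many replacements are made.
local limited, limitCount = string.gsub("a a a", "a", "b", 2)
print(limited, limitCount) --> b b a 2

-- Table repl: the first capture is the key; a nil result keeps the match.
local values = { user = "Ann", item = "sword" }
local filled = string.gsub("$user bought $item for $price", "%$(%a+)", values)
print(filled) --> Ann bought sword for $price

-- Function repl: called with the whole match because there are no captures.
local doubled = string.gsub("1 2 3", "%d+", function(d)
  return tonumber(d) * 2
end)
print(doubled) --> 2 4 6
```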

string.gmatch(string s,string pattern)

Returns an iterator function that, each time it is called, returns the next captures from pattern over s. If the pattern has no captures, the whole match is returned, as with string.match. The anchor ^ (explained later) can't be used in a gmatch pattern, as anchoring every match to the start of s would prevent iteration. The $ anchor does still work, although it only allows matches that end at the end of s, so it also leaves little to iterate over.
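A minimal sketch of iterating with gmatch, once without captures and once with two captures (the input strings are made up):

```lua
-- No captures: each iteration yields the whole match.
local words = {}
for word in string.gmatch("one two  three", "%a+") do
  words[#words + 1] = word
end
print(table.concat(words, ",")) --> one,two,three

-- Two captures: each iteration yields both captured values.
local settings = {}
for key, value in string.gmatch("width=10, height=20", "(%w+)=(%w+)") do
  settings[key] = tonumber(value)
end
print(settings.width, settings.height) --> 10 20
```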

Patterns

Patterns match a certain part of the string according to the characters of the pattern.

Character Class

x - (where x isn't one of the magic characters ^$()%.[]*+-?) represents the character x itself.

. - represents all characters.

%a - represents all letters.

%c - represents all control characters.

%d - represents all digits.

%g - represents all printable characters except space characters.

%l - represents all lowercase letters.

%p - represents all punctuation characters.

%s - represents all space characters.

%u - represents all uppercase letters.

%w - represents all alphanumeric characters.

%x - represents all hexadecimal digits.

%z - represents the null character (a literal null character in the pattern works fine, so this class can be ignored).

%x - (where x is any non-alphanumeric character) represents the character x itself. This is the standard way to escape the magic characters. Any non-alphanumeric character (including all punctuation characters, even the non-magical ones) can be preceded by a % to represent itself in a pattern.

[set] - represents the class which is the union of all characters in set. A range of characters can be specified by separating the end characters of the range, in ascending order, with a -. All classes %x described above can also be used as components in set. All other characters in set represent themselves. For example, [%d%u_] represents all digits, uppercase letters, and _ (the order of the components inside the set doesn't matter), [0-7] represents the octal digits, and [0-7%l%-] represents the octal digits, lowercase letters, and -.

The interaction between ranges and classes is not defined, so patterns like [%a-z] or [a-%%] have no meaning.

Ranges are based on the ASCII values of the characters, so [A-z] represents all characters between ASCII 65 (A) and ASCII 122 (z), which includes the six non-letter characters between Z and a. The first character of a range should have a lower value than the second: [a-z] represents all lowercase letters, but [z-a] represents nothing.

A closing bracket can be included in a set by positioning it at the start of the set, and a hyphen can be included by positioning it at the start or end of the set (or by using an escape sequence in either case). Because of the special meaning of certain characters, some of them cannot be used as range bounds; for example, there is no way to have ] as the upper bound of a range, as it would be interpreted as closing the set.

[^set] - represents the complement of set, where set is interpreted as described above.

For all classes represented by single letters (%a, %c, etc.), the corresponding uppercase letter represents the complement of the class. For instance, %S represents all non space characters.
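To make the classes and sets above concrete, the sketch below counts how many characters of a sample string belong to each class. The helper countMatches and the sample string are invented for this example; the helper relies on the second return value of string.gsub.

```lua
-- Hypothetical helper: counts how many times class matches in s.
local function countMatches(s, class)
  local _, n = string.gsub(s, class, "")
  return n
end

local s = "Tag_7f: 50% off!"

print(countMatches(s, "%d"))        --> 3   (7, 5, 0)
print(countMatches(s, "%u"))        --> 1   (T)
print(countMatches(s, "%p"))        --> 4   (_ : % !)
print(countMatches(s, "%%"))        --> 1   (escaped magic character)
print(countMatches(s, "[%d%u_]"))   --> 5   (digits, uppercase letters, _)
print(countMatches(s, "[0-7]"))     --> 3   (octal digits)
print(countMatches(s, "%S"))        --> 14  (everything except the 2 spaces)
print(countMatches(s, "[^%w%s]"))   --> 4   (complement of a set)

-- Range and bracket edge cases.
print(countMatches("abc", "[z-a]")) --> 0   (descending range matches nothing)
print(countMatches("_", "[A-z]"))   --> 1   (_ lies between Z and a)
print(countMatches("a]b", "[]]"))   --> 1   (closing bracket placed first)
print(countMatches("a-b", "[a-]"))  --> 2   (hyphen placed last)
```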

Pattern Item

A pattern item can be:

A single character class, which represents any character in the character class.

A single character class followed by a quantifier (explained later), which represents a run of characters from the character class whose length is determined by the quantifier.

%n, for n between 1 and 9; such item matches a substring equivalent to the n-th captured substring.

%bxy, where x and y are two distinct characters; such item matches strings that start with x and end with y, where x and y are balanced. This means that, if one reads the string from left to right, counting +1 for an x and -1 for a y, the ending y is the first y where the count reaches 0. For instance, %b() matches expressions with balanced parentheses. If x and y are the same character, nothing is balanced: the item simply matches from one occurrence to the next. For instance, %b|| matches the characters between two vertical bars, along with the vertical bars themselves.

%f[set], a frontier pattern; such item matches an empty string at any position such that the next character belongs to set and the previous character does not belong to set. set is interpreted as previously described (so something like %f[^set] will represent the complement of set). The beginning and end of the subject are handled as if they were the null character (ascii 0).
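The three special items above (%n, %bxy and %f[set]) can be tried out with a short sketch like this one. The subject strings are made up, and the word-replacement use of %f is just one possible application:

```lua
-- %1 must match the same text as capture 1: find a doubled letter.
local dStart, dEnd, letter = string.find("hello", "(%a)%1")
print(dStart, dEnd, letter) --> 3 4 l

-- %b() matches up to the parenthesis that balances the first one.
local balanced = string.match("f(a(b)c) + g(d)", "%b()")
print(balanced) --> (a(b)c)

-- Same character for x and y: matches from one bar to the next.
local bars = string.match("a|b|c", "%b||")
print(bars) --> |b|

-- %f[%a] marks a non-letter to letter transition, %f[%A] the reverse,
-- so only the standalone word "cat" is replaced.
local replaced, count = string.gsub("cat concat cat", "%f[%a]cat%f[%A]", "dog")
print(replaced, count) --> dog concat dog 2
```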

Quantifiers

Adding a quantifier after a character class changes how many times the character class is matched. There are 4 quantifiers.

'*' - The asterisk quantifier makes the character class match as many times as possible, giving back as needed. (greedy) The character class will be matched 0+ times.

'+' - The plus quantifier makes the character class match as many times as possible, giving back as needed. (greedy) The character class will be matched 1+ times.

'-' - The minus quantifier makes the character class match as few times as possible, expanding as needed. (lazy) The character class will be matched 0+ times.

'?' - The question mark quantifier makes the character class optional, matching 0-1 times. The question mark quantifier is also greedy, only giving back as needed.
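The difference between the four quantifiers is easiest to see side by side. A brief sketch with invented inputs:

```lua
local s = "<b>bold</b>"

-- '*' is greedy: it runs to the last ">" it can reach.
local greedy = string.match(s, "<.*>")
print(greedy) --> <b>bold</b>

-- '-' is lazy: it stops at the first ">" that lets the match succeed.
local lazy = string.match(s, "<.->")
print(lazy) --> <b>

-- '+' needs at least one digit, so the match starts at the digits.
local plus = string.match("abc123", "%d+")
print(plus) --> 123

-- '*' accepts zero digits, so it matches the empty string at position 1.
local star = string.match("abc123", "%d*")
print(#star) --> 0

-- '?' makes the "u" optional.
local american = string.match("color", "colou?r")
local british = string.match("colour", "colou?r")
print(american, british) --> color colour
```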

Pattern

A pattern is a sequence of pattern items. A caret '^' at the beginning of the pattern anchors the match at the beginning of the subject string. A '$' at the end of the pattern anchors the match at the end of the subject string. Using both anchors the pattern to the whole subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
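A short sketch of how the anchors change a search. The helper isInteger is a hypothetical example of whole-string validation, not something from the string library:

```lua
-- Unanchored: the match may start anywhere.
local anyStart = string.find("say hello", "hello")
print(anyStart) --> 5

-- '^' requires the match to begin at the start of the subject.
local anchoredMiss = string.find("say hello", "^hello")
local anchoredHit = string.find("hello there", "^hello")
print(anchoredMiss, anchoredHit) --> nil 1

-- '$' requires the match to finish at the end of the subject.
local endHit = string.find("say hello", "hello$")
local endMiss = string.find("hello there", "hello$")
print(endHit, endMiss) --> 5 nil

-- Both anchors: the pattern has to cover the whole subject.
local function isInteger(str)
  return string.find(str, "^%-?%d+$") ~= nil
end
print(isInteger("-42"), isInteger("42abc")) --> true false

-- In the middle of a pattern, '^' and '$' are ordinary characters.
local litStart, litEnd = string.find("a^b$c", "a^b$c")
print(litStart, litEnd) --> 1 5
```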

Captures

A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parenthesis. For instance, in the pattern (a?(%a+).(%s+)), the part of the string matching a?(%a+).(%s+) is stored at capture index 1, the part of the string matching %a+ is stored at capture index 2, and the part of the string matching %s+ is stored at capture index 3.

As a special case, an empty capture () will capture the current string position (a number). For instance, the pattern ()i() will capture 2 and 3 when applied to the string 'hi'.
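The capture rules above can be checked with a few calls to string.match. The date and file name are invented sample data:

```lua
-- One return value per capture, in order.
local year, month, day = string.match("2024-03-15", "(%d+)%-(%d+)%-(%d+)")
print(year, month, day) --> 2024 03 15

-- Nested captures are numbered by their left parenthesis.
local whole, name, ext = string.match("report.txt", "((%a+)%.(%a+))")
print(whole, name, ext) --> report.txt report txt

-- Empty captures return positions (numbers) instead of substrings.
local before, after = string.match("hi", "()i()")
print(before, after) --> 2 3
```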
