Lexical Analysis

This lesson will teach you the basics of how to perform lexical analysis!

by IProgram_CPlusPlus

Hello! Welcome to this lesson. I will be teaching you how to perform lexical analysis in Lua!


Introduction

Lexical analysis is the process of taking an input string and breaking it into tokens. For example, given the input string "2+2", the lexer produces the output "number : 2 | operator : + | number : 2".
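To make the "2+2" example concrete, here is a tiny sketch I added (it is not part of the lesson's script) that classifies the input one character at a time:

```lua
-- Tokenize "2+2" one character at a time
local input = "2+2"
local tokens = {}
for ch in input:gmatch(".") do
    if tonumber(ch) then
        tokens[#tokens + 1] = "number : " .. ch
    else
        tokens[#tokens + 1] = "operator : " .. ch
    end
end
print(table.concat(tokens, " | "))
-- number : 2 | operator : + | number : 2
```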

Lexical analysis can be used for a number of things. Say you had a GUI input field and wanted all of the numbers people enter to be colored blue: you could write a lexer to find every number the user has entered, then recolor those numbers (for example, with rich text color tags, since Lua itself has no built-in string coloring function). The Lua Learning game, yes, the game you're playing right now, most likely uses a lexer and then a parser to check the Lua code you wrote.
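As a hedged sketch of that coloring idea: Roblox text objects with RichText enabled accept `<font color="...">` tags, so one way to color every number blue is to wrap each run of digits in such a tag. The `colorNumbersBlue` function below is a name I made up for illustration, not something from a library:

```lua
-- Hypothetical sketch: wrap every run of digits in a blue rich text tag.
-- Assumes the result is shown in a Roblox TextLabel with RichText enabled.
local function colorNumbersBlue(text)
    -- %d+ matches each run of digits; %0 stands for the whole match
    return (text:gsub("%d+", '<font color="#0000FF">%0</font>'))
end

print(colorNumbersBlue("I have 3 cats and 12 dogs"))
-- I have <font color="#0000FF">3</font> cats and <font color="#0000FF">12</font> dogs
```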


Write a lexer

I composed a simple lexer script in ServerScriptService as an example and will explain everything in the program. Here is the script:

-- The three token categories this lexer knows about.
local token_types = {"operator", "word", "number"}
local category -- the category of the most recent token
local token    -- the most recent token itself

-- Decide which category a string belongs to.
local function categorize_string(string_to_cat)
    local is_number = tonumber(string_to_cat) ~= nil
    if is_number then
        category = token_types[3] -- number
    elseif string_to_cat == "+" or string_to_cat == "-" or string_to_cat == "/" or string_to_cat == "*" then
        category = token_types[1] -- operator
    else
        category = token_types[2] -- word
    end
end

-- Look for string_to_find inside string_to_search; if it is there,
-- record it as the current token and categorize it.
local function find_tokens(string_to_search, string_to_find)
    -- the last two arguments make this a plain-text search, so pattern
    -- characters such as "+" are matched literally
    if string.find(string_to_search, string_to_find, 1, true) then
        token = string_to_find
        categorize_string(string_to_find)
        return true
    else
        return false
    end
end

local example_string = "Hello there, welcome to this lesson! 1"

find_tokens(example_string, "welcome")

print(category .. " : " .. token)

Alright, time to explain what's going on here. I made a list of three token types: operator, word, and number. An operator is a mathematical operator, such as +, -, *, and /. A number is a number, pretty self explanatory, and a word is everything else, like "hello" or "goodbye". category stores the category the token belongs to (operator, word, or number), and token stores the current token.

Then I made a function that categorizes a string based on its content. For example, if my string was "hello" it would be categorized as a word; if it was "+" it would be an operator. Then I made a function that finds the token and calls the categorize function to categorize that token.

NOTE: if you wanted to make a more advanced and practical lexer, you could use the string.split function to split the input at every space, loop through the pieces and categorize each one, then loop through the remaining characters and categorize character tokens like "!" as punctuation marks. End of note.

Finally, I made a string variable with some text in it, called find_tokens to find the string "welcome" and categorize it, and printed the result.

Output:

word : welcome
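For readers who want to try the note's idea, here is a hypothetical sketch (not the lesson's script) of that more practical lexer: it splits the input on spaces, categorizes every piece, and gives trailing punctuation its own token. I use string.gmatch("%S+") so the sketch runs in plain Lua; in Roblox you could use string.split(input, " ") instead.

```lua
-- Hypothetical sketch of the more practical lexer described in the note.
local function categorize(s)
    if tonumber(s) then
        return "number"
    elseif s == "+" or s == "-" or s == "*" or s == "/" then
        return "operator"
    elseif s:match("^%p+$") then
        return "punctuation"
    else
        return "word"
    end
end

local function lex(input)
    local tokens = {}
    -- %S+ matches each run of non-space characters, playing the role
    -- of Roblox's string.split(input, " ")
    for piece in input:gmatch("%S+") do
        -- split off any trailing punctuation, e.g. "lesson!" -> "lesson", "!"
        local word, punct = piece:match("^(.-)(%p*)$")
        if word ~= "" then
            tokens[#tokens + 1] = { type = categorize(word), value = word }
        end
        if punct ~= "" then
            tokens[#tokens + 1] = { type = categorize(punct), value = punct }
        end
    end
    return tokens
end

for _, t in ipairs(lex("Hello there, welcome to this lesson! 1")) do
    print(t.type .. " : " .. t.value)
end
```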

Thanks for following along, hope this helped!