Compiler Weekly: Truth

This week I took yet another look at improving my parser logic. In the past I used parser combinators to build up my parsers, with the idea that they might be smaller, simpler, or better. This time I decided to simply write the parser from scratch.

The program for this week is a small logic language. You can give it facts or ask it queries. Facts take the form "SOMEONE is FACT" and queries take the form "who is QUERY?". The program keeps track of all the facts and performs the corresponding set operations to answer each query. You can use "and", "or" and "not", as well as parentheses, to build up complex queries.
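
Informally, the grammar looks something like this (my own EBNF-style sketch of what the parser below accepts, not anything the program itself reads):

```
statement  := IDENT "is" IDENT              (a fact)
            | "who" "is" expression "?"     (a query)
expression := unary (("and" | "or") expression)?
unary      := "not" expression
            | "(" expression ")"
            | IDENT
```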

Lexer

The lexer is very simple. Using logos again, we define a handful of tokens; nothing unexpected here:

use logos::Logos;

#[derive(Logos, PartialEq, Copy, Clone, Debug)]
pub enum Token<'a> {
    #[regex(r"\w+")]
    Identifier(&'a str),

    #[token("is")] Is,
    #[token("or")] Or,
    #[token("and")] And,
    #[token("not")] Not,
    #[token("who")] Who,

    #[token("?")] QuestionMark,
    #[token("(")] OpenBracket,
    #[token(")")] CloseBracket,

    #[regex(r"\s+", logos::skip)]
    #[error]
    Error,
}

Parser

The parser is quite different from my previous attempts.

First off, the AST is split into multiple enums rather than a single one. I found this to be a cleaner way of representing the different levels of the program. At the top we have Statement. A statement actually does something within the program: it changes state by either recording a fact or printing the results of a query. Below that we have Expression. Expressions are used to build up queries and can be nested to form something more complex.

The parser itself takes the lexer and starts reading tokens. It is a recursive descent parser. That means it starts at the top and looks at the next token that comes in. If it's a Token::Who, it calls the query parser and goes down that path. If it's a Token::Identifier, it calls the fact parser and goes down that path. Each path is very similar to the parser combinators I built before, but rather than having a data structure that puts the parsers together, we are just using normal function calls.

One thing I would also like to point out is that I discovered the ? operator in Rust. It makes it much easier to work with Result and Option in your code. Rather than having to use unwrap or check is_ok, you can just add a ? and, if there was an error, Rust will return that error from the current function.
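
As a minimal standalone sketch of how ? propagates errors (the function name here is invented for illustration, not part of the parser):

```rust
// A hypothetical fallible step: parse an integer from a string and double it.
fn parse_doubled(s: &str) -> Result<i64, std::num::ParseIntError> {
    // On Err, ? returns the error to the caller immediately;
    // on Ok, it unwraps the value.
    let n: i64 = s.trim().parse()?;
    Ok(n * 2)
}

fn main() {
    assert_eq!(parse_doubled(" 21 ").unwrap(), 42);
    assert!(parse_doubled("oops").is_err());
}
```

The read! macro in the parser below relies on this: each `read!(tokens, ...)?` either consumes the expected token or bubbles the Error straight out of the parsing function.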

use std::fmt::Debug;
use std::ops::Range;
use logos::Lexer;
use crate::token::Token;

#[derive(Debug)]
pub struct Error {
    pub message: String,
    pub span: Range<usize>,
}

#[derive(Clone, Debug, PartialEq)]
pub enum Statement<'a> {
    Fact(&'a str, &'a str),
    Query(Expression<'a>),
}

#[derive(Clone, Debug, PartialEq)]
pub enum Expression<'a> {
    Not(Box<Expression<'a>>),
    And(Box<Expression<'a>>, Box<Expression<'a>>),
    Or(Box<Expression<'a>>, Box<Expression<'a>>),
    Identifier(&'a str),
}

type Reader<'a> = Lexer<'a, Token<'a>>;

macro_rules! read {
    ($tokens: ident, $pattern: pat => $expr: expr, $name: expr) => {
        match $tokens.next() {
            Some($pattern) => Ok($expr),
            _ => Err(Error { message: format!("expected {}", $name), span: $tokens.span() }),
        }
    };
}

pub fn parse<'a>(tokens: &mut Reader<'a>) -> Result<Statement<'a>, Error> {
    match tokens.next() {
        Some(Token::Who) => query(tokens),
        Some(Token::Identifier(name)) => fact(name, tokens),
        _ => Err(Error { message: String::from("expected statement"), span: tokens.span() }),
    }
}

fn query<'a>(tokens: &mut Reader<'a>) -> Result<Statement<'a>, Error> {
    read!(tokens, Token::Is => (), "is")?;
    let t = expression(tokens).map(Statement::Query)?;
    read!(tokens, Token::QuestionMark => (), "?")?;
    Ok(t)
}

fn expression<'a>(tokens: &mut Reader<'a>) -> Result<Expression<'a>, Error> {
    let lhs = unary(tokens)?;
    // Peek at the next token by cloning the lexer; only consume it on a match.
    match tokens.clone().next() {
        Some(Token::And) => {
            tokens.next();
            let rhs = expression(tokens)?;
            Ok(Expression::And(Box::new(lhs), Box::new(rhs)))
        }
        Some(Token::Or) => {
            tokens.next();
            let rhs = expression(tokens)?;
            Ok(Expression::Or(Box::new(lhs), Box::new(rhs)))
        }
        _ => Ok(lhs),
    }
}

fn unary<'a>(tokens: &mut Reader<'a>) -> Result<Expression<'a>, Error> {
    match tokens.next() {
        Some(Token::Not) => Ok(Expression::Not(Box::new(expression(tokens)?))),
        Some(Token::OpenBracket) => bracket(tokens),
        Some(Token::Identifier(value)) => Ok(Expression::Identifier(value)),
        _ => Err(Error { message: String::from("expected expression"), span: tokens.span() }),
    }
}

fn bracket<'a>(tokens: &mut Reader<'a>) -> Result<Expression<'a>, Error> {
    let expr = expression(tokens)?;
    read!(tokens, Token::CloseBracket => (), ")")?;
    Ok(expr)
}

fn fact<'a>(owner: &'a str, tokens: &mut Reader<'a>) -> Result<Statement<'a>, Error> {
    read!(tokens, Token::Is => (), "is")?;
    Ok(Statement::Fact(owner, identifier(tokens)?))
}

fn identifier<'a>(tokens: &mut Reader<'a>) -> Result<&'a str, Error> {
    read!(tokens, Token::Identifier(name) => name, "identifier")
}

The final parser is about the same size as before. However, error handling is a lot better, and it's easy to debug and step through what is happening. Overall I am happy with this approach and will probably continue using it.

Runtime

The runtime for this program is fairly simple but interesting nonetheless. It keeps a HashMap of all the facts it knows. So if you tell it rust is cool, it stores cool -> ["rust"]. Then a query like who is cool? is just a quick lookup into the table. More complex queries become set operations: intersection, union and complement implement and, or and not respectively.
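
Before the engine itself, here is a standalone sketch of those set operations on Rust's HashSet (the names and facts are invented for illustration):

```rust
use std::collections::HashSet;

fn main() {
    // A toy fact table: fact -> the names it applies to.
    let cool: HashSet<&str> = HashSet::from(["rust", "go"]);
    let old: HashSet<&str> = HashSet::from(["java", "rust"]);
    // Everything we have ever been told about, needed to answer "not".
    let all: HashSet<&str> = HashSet::from(["rust", "go", "java"]);

    // "who is cool and old?" -> intersection
    let both: HashSet<&str> = cool.intersection(&old).copied().collect();
    assert_eq!(both, HashSet::from(["rust"]));

    // "who is cool or old?" -> union
    let either: HashSet<&str> = cool.union(&old).copied().collect();
    assert_eq!(either, all);

    // "who is not old?" -> complement against everything known
    let not_old: HashSet<&str> = all.difference(&old).copied().collect();
    assert_eq!(not_old, HashSet::from(["go"]));
}
```

The engine below does exactly this, except the sets come out of its fact table at query time.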

use std::collections::{HashMap, HashSet};
use crate::parser::{Expression, Statement};

pub struct FactEngine {
    // fact -> the set of names that fact applies to, e.g. "cool" -> {"rust"}
    facts: HashMap<String, HashSet<String>>,
    // every name we have ever seen, used to answer "not" queries
    all: HashSet<String>,
}

impl FactEngine {
    pub fn new() -> Self {
        FactEngine {
            facts: HashMap::new(),
            all: HashSet::new(),
        }
    }

    pub fn exec(&mut self, statement: Statement) -> Option<HashSet<&str>> {
        match statement {
            Statement::Fact(name, value) => {
                self.remember(name, value);
                None
            }
            Statement::Query(query) => Some(self.find(&query)),
        }
    }

    fn find(&self, query: &Expression) -> HashSet<&str> {
        match query {
            Expression::Not(expr) => self.not(expr),
            Expression::And(lhs, rhs) => self.and(lhs, rhs),
            Expression::Or(lhs, rhs) => self.or(lhs, rhs),
            Expression::Identifier(fact) => self.names_for(fact),
        }
    }

    fn names_for(&self, fact: &str) -> HashSet<&str> {
        // An unknown fact yields the empty set instead of panicking on a missing key.
        self.facts
            .get(fact)
            .map(|names| names.iter().map(String::as_str).collect())
            .unwrap_or_default()
    }

    fn and(&self, lhs: &Expression, rhs: &Expression) -> HashSet<&str> {
        self.find(lhs).intersection(&self.find(rhs)).copied().collect()
    }

    fn or(&self, lhs: &Expression, rhs: &Expression) -> HashSet<&str> {
        self.find(lhs).union(&self.find(rhs)).copied().collect()
    }

    fn not(&self, expr: &Expression) -> HashSet<&str> {
        self.all().difference(&self.find(expr)).copied().collect()
    }

    fn all(&self) -> HashSet<&str> {
        self.all.iter().map(String::as_str).collect()
    }

    fn remember(&mut self, name: &str, fact: &str) {
        self.all.insert(String::from(name));
        self.facts
            .entry(String::from(fact))
            .or_insert_with(HashSet::new)
            .insert(String::from(name));
    }
}

Finally, the REPL is very simple as well. It reads the input line by line; if parsing fails it prints the error, otherwise it runs the parsed statement in the engine and prints any results.

mod token;
mod parser;
mod fact_engine;

use std::collections::HashSet;
use std::io::stdin;
use crate::fact_engine::FactEngine;
use crate::parser::parse;
use crate::token::Token;
use logos::Logos;

fn main() {
    let mut buffer = String::new();
    let mut engine = FactEngine::new();

    while stdin().read_line(&mut buffer).is_ok() {
        let mut tokens = Token::lexer(&buffer);
        match parse(&mut tokens) {
            Err(e) => eprintln!("error: {}", e.message),
            Ok(statement) => print(engine.exec(statement)),
        }
        buffer.clear();
    }
}

fn print(results: Option<HashSet<&str>>) {
    if let Some(names) = results {
        // The set's IntoIter has a Debug impl that prints the names as a
        // list, e.g. ["rust"], rather than as a set.
        println!("> {:?}", names.into_iter());
    }
}

Here is some example output from running the program:

rust is cool
java is old
who is cool?
> ["rust"]
who is not cool?
> ["java"]

Conclusion

After this week I am pretty happy with the technology stack I am using for my compiler frontends. Logos plus a hand-written recursive descent parser seems ideal for me. Next week I will probably look into writing a compiler backend and see if I can get native code compilation working using LLVM.

Written 26/10/2021