ReLM - Evaluation of LLMs

Carnegie Mellon University researchers have unveiled ReLM, a system for validating and querying large language models (LLMs) with standard regular expressions. ReLM formalizes and simplifies many LLM evaluations by reducing complex evaluation methods to regular expression queries. According to the researchers, the system achieves up to 15 times higher system efficiency, along with broader statistical and prompt-tuning coverage, than state-of-the-art ad hoc queries.


ReLM uses a compact graph representation of the solution space, which is derived from regular expressions and then compiled into an LLM-specific representation. This lets users measure the behavior of LLMs without having to understand the intricacies of the model. It is particularly useful for evaluating language models with respect to concerns such as memorization, gender bias, toxicity, and language comprehension.

Validation of Large Language Models

ReLM also adds a constrained decoding system based on automata theory, which lets users build queries that contain both the test pattern and how to execute it. It avoids unnecessary computation, and it runs tests more thoroughly by including elements of the test set that are often ignored, which reduces the risk of false negatives.
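The idea of automaton-constrained decoding can be sketched in a few lines. This is a toy illustration, not ReLM's implementation: the function names, the hard-coded automaton for [0-9]{3}, and the made-up token scores are all assumptions chosen for the example. At every step, only tokens that keep the automaton alive are eligible, and the best-scoring one is chosen.

```python
from typing import Dict, Optional

DIGITS = "0123456789"


def step(state: int, token: str, length: int = 3) -> Optional[int]:
    """Advance a DFA for [0-9]{length}; the state is the number of digits read.

    Returns the new state, or None if the token would leave the pattern.
    """
    for ch in token:
        if ch not in DIGITS or state >= length:
            return None
        state += 1
    return state


def constrained_greedy(scores: Dict[str, float], length: int = 3) -> str:
    """Greedily pick the best-scoring token that the automaton still allows."""
    state, out = 0, []
    while state < length:
        allowed = [
            (score, token)
            for token, score in scores.items()
            if token and step(state, token, length) is not None
        ]
        if not allowed:
            raise ValueError("no token can extend the pattern")
        _, best = max(allowed)
        out.append(best)
        state = step(state, best, length)
    return "".join(out)


# Hypothetical token scores standing in for a language model's preferences.
toy_scores = {"hello": 0.9, "7a": 0.8, "55": 0.6, "5": 0.5, "123": 0.4}
result = constrained_greedy(toy_scores)
print(result)
```

The tokens "hello" and "7a" score highest but are never chosen, because the automaton rejects them before the scores are compared. A real system would work with the model's vocabulary and probabilities rather than a fixed score table.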

ReLM can quickly execute common queries and significantly reduce the validation overhead required by LLMs. It uses regular expressions to formally describe LLM predictions, including sets of indeterminate size, so its results are consistently clear and unambiguous. The framework also identifies and constructs the conditional and unconditional classes of LLM inference queries. A regular expression inference engine has been implemented that efficiently converts regular expressions into finite automata.
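To make the link between regular expressions, finite automata, and sets of indeterminate size concrete, here is a minimal sketch. It assumes a hand-built automaton for the expression (ab)+ rather than an automatic compiler, and the names TRANSITIONS, accepts, and shortest_first are hypothetical. The language is infinite, yet the automaton is three states, and its strings can be enumerated lazily from shortest to longest.

```python
from collections import deque
from itertools import islice
from typing import Dict, Iterator, Set

# Hand-built deterministic automaton for the regular expression (ab)+.
# State 0: start, state 1: just read "a", state 2: just read "ab" (accepting).
TRANSITIONS: Dict[int, Dict[str, int]] = {0: {"a": 1}, 1: {"b": 2}, 2: {"a": 1}}
START = 0
ACCEPTING: Set[int] = {2}


def accepts(text: str) -> bool:
    """Return True if the automaton ends in an accepting state on text."""
    state = START
    for ch in text:
        nxt = TRANSITIONS.get(state, {}).get(ch)
        if nxt is None:
            return False
        state = nxt
    return state in ACCEPTING


def shortest_first() -> Iterator[str]:
    """Lazily yield accepted strings in order of length (breadth-first walk)."""
    queue = deque([(START, "")])
    while queue:
        state, prefix = queue.popleft()
        if state in ACCEPTING:
            yield prefix
        for symbol, nxt in sorted(TRANSITIONS.get(state, {}).items()):
            queue.append((nxt, prefix + symbol))


first_three = list(islice(shortest_first(), 3))
print(first_three)
```

Because the generator is lazy, the caller decides how many strings to draw from a language that never ends, which is the property that makes a regular expression a convenient description of a test set.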

The ReLM framework can be used from Python programs through a dedicated API. To run a query, users pass ReLM a query object together with an LLM defined in a third-party library such as Hugging Face Transformers.

To demonstrate ReLM's potential, the researchers used GPT-2 models to evaluate memorization, gender bias, toxicity, and language comprehension. They plan to further improve ReLM's query optimization capabilities and apply it to more model families. More details can be found on the project's GitHub page.

Here is a Python example that shows how to use ReLM:

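import relm
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load GPT-2 XL and move it to the GPU when one is available.
model_id = "gpt2-xl"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             return_dict_in_generate=True,
                                             pad_token_id=tokenizer.eos_token_id)
model = model.to(device)
# The regular expression describes the strings to search for; the prefix is the fixed prompt.
query_string = relm.QueryString(
  query_str=("My phone number is ([0-9]{3}) ([0-9]{3}) ([0-9]{4})"),
  prefix_str=("My phone number is"),
)
# Assumption: these search settings follow the ReLM README; check them against your installed version.
query = relm.SimpleSearchQuery(
  query_string=query_string,
  search_strategy=relm.QuerySearchStrategy.SHORTEST_PATH,
  tokenization_strategy=relm.QueryTokenizationStrategy.ALL_TOKENS,
  top_k_sampling=40,
  num_samples=10,
)
ret = relm.search(model, tokenizer, query)
for x in ret:
  print(tokenizer.decode(x))  # Decode each matching token sequence back to text.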

Besides ReLM, the Konfuzio SDK can be used to create regex from text samples. Here is an example of how to do that:

from konfuzio_sdk.data import Project
from konfuzio_sdk.tokenizer.regex import RegexTokenizer
from konfuzio_sdk.tokenizer.base import ListTokenizer

# YOUR_PROJECT_ID, YOUR_CATEGORY_ID and YOUR_DOCUMENT_ID are placeholders for your own IDs.
my_project = Project(id_=YOUR_PROJECT_ID)
category = my_project.get_category_by_id(id_=YOUR_CATEGORY_ID)
tokenizer = ListTokenizer(tokenizers=[])
label = my_project.get_label_by_name("wage type")
# Collect one tokenizer per regex found for the label in the training data.
for regex in label.find_regex(category=category):
    regex_tokenizer = RegexTokenizer(regex=regex)
    tokenizer.tokenizers.append(regex_tokenizer)
# You can then use it to create an annotation for each match in a document.
document = my_project.get_document_by_id(YOUR_DOCUMENT_ID)
tokenizer.tokenize(document)

This example finds regular expressions that match occurrences of the label "wage type" in the training data and collects them in a list tokenizer. The tokenizer can then create an annotation for each match in a document.
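What "an annotation for each match" amounts to can be illustrated with Python's standard re module alone. This sketch does not use the Konfuzio SDK: the find_spans helper, the two patterns, and the sample text are hypothetical stand-ins for the regexes and documents a real project would provide.

```python
import re
from typing import List, Tuple


def find_spans(patterns: List[str], text: str) -> List[Tuple[int, int, str]]:
    """Apply every pattern to the text and return sorted, de-duplicated spans.

    Each span is (start offset, end offset, matched text).
    """
    spans = set()
    for pattern in patterns:
        for match in re.finditer(pattern, text):
            spans.add((match.start(), match.end(), match.group()))
    return sorted(spans)


# Hypothetical patterns and sample text for illustration only.
patterns = [r"\b\d{4}\b", r"Bonus"]
text = "1000 Salary\n2000 Bonus"
spans = find_spans(patterns, text)
for start, end, value in spans:
    print(start, end, value)
```

Each span carries the character offsets of a match, which is the minimum information needed to attach a label to a region of a document.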

ReLM and the Konfuzio SDK are valuable tools for anyone working with large language models. ReLM provides a simpler and more efficient way to validate and query models, while the Konfuzio SDK generates regular expressions from text examples. Together, these tools can help you check that your models are effective, accurate, and fair.

Edwin Genego
