Linguistic Benchmarking
Applying linguistic theory to evaluate and optimize LLM performance
Python, LangChain, LangSmith, OpenAI API
June, 2025

In the world of conversational UX design, we often lean on linguistic theory to craft human-centered interactions. Lately, I've wondered how the same approach could apply to LLM evaluation.
As a starting point, I looked to Paul Grice’s Maxims of Conversation. This theoretical framework, which elaborates what Grice called the Cooperative Principle, gives us a way to label the expectations and unspoken rules we carry into conversations, whether with another person or with an AI system.
To explore this idea in more detail, I built an evaluation system using Python, OpenAI's API, LangChain, and LangSmith. This system measures LLM responses against these Gricean criteria. 
I also took this chance to bring in a recent hobby of mine: indoor gardening. I can't stop myself from sharing how this year's crop of Thai basil is doing, so I didn't pass up that opportunity here. 
Let's start with an overview of the four maxims that make up the Cooperative Principle.
The Maxim of Quantity: Be as informative as is required for the current purposes of the exchange, but not more informative than is needed.
The Maxim of Quality: Make your contribution one that is true. Specifically, do not say what you believe to be false or anything for which you lack adequate evidence.
The Maxim of Relation: Be relevant; make your contributions relevant to the conversation.
The Maxim of Manner: Be clear; avoid using words or phrases that could be confusing or misunderstood. Use language that is easily understood by your audience.
This evaluation process included three types of prompts:
User Prompts: These are the questions and requests that a user would send to the model. These typically carry the user’s intent, context, and any other relevant details that the user wants to include. Since they are a natural language input, these prompts are also highly variable.
System Prompts: These define the model’s persona and shape the tone, length, and level of detail of the LLM's responses. From a UX perspective, these prompts wouldn't reflect typical user input and are set on the backend. 
Evaluator Prompt: After the model produces its answer (using the system prompts, in response to the user prompts), a separate evaluation LLM, called the judge model, is driven by its own prompt. In this case, the evaluator prompt instructs the judge model to score the generated response against Grice’s Maxims, provide rationale for the score, and suggest improvements.
I chose GPT-3.5-turbo for this project. While it's an older model, I found it has a good cost-to-performance ratio. I accessed this model using OpenAI's API.
LangChain is an open-source framework; here, it chains the model calls and parses the OpenAI API responses.
I also used LangSmith, a platform offered by LangChain, to capture every request and response, then to aggregate and visualize the evaluation results. 
To coordinate all of this, I wrote a custom Python script referencing a tutorial by LangSmith. This script handles dataset management, LLM invocation, prompt setup, and connection to the LangSmith UI. 

My first step was to simulate how different users would prompt an LLM with questions about indoor gardening.
This question set was divided into four groups based on the utterance type, along with the user's context and writing style. Each group contained three user prompts with callouts for any defining linguistic features and anticipated system behavior.




The idea for this was to compile a list of realistic sample utterances that highlight key linguistic features such as: 
Certainty markers: "...I'm not really sure" vs straightforward "wh-" questions.
Identity markers: "I've never grown anything before..." vs "I want to impress my guests..."
Granularity: "What's the best way to grow culinary herbs like basil or thyme indoors?"
Speech Acts: Advice seeking "Any tips...?", permission/validation seeking "Is it realistic...?"
The result was 12 user prompts that highlighted these features. While it's a small dataset, I found it useful to keep the scope relatively small for these tests. This ensured that I could iterate quickly and not worry as much about constraints like latency, token use, and overall cost.
The next step was to create a separate set of system prompts. In this workflow, system prompts act like personas. These prompts establish the personality that the LLM adopts when answering a user's questions, each with its own set of goals, instructions, and constraints. 
Using a range of system prompts ensures that the core questions are answered through multiple lenses, revealing how the model's initial framing can shape the final result. Treating system prompts as personas also mirrors real-world use cases, where LLM developers tailor an assistant’s identity to match a target audience or adhere to a unique set of brand voice or safety considerations.

For this series of tests, I created three system prompts at varying levels of detail: 
The basic system prompt includes a generic statement about the model's role and basic directions, giving minimal guidance. 
Next, the structured system prompt gives the model more explicit direction by including a clearer persona, instructions for its communication style, and basic conditional logic.
Finally, the comprehensive system prompt provides the model with deeper context about its role in the interaction, instructions for handling a range of requests at different levels of complexity, and more robust safety guidelines.
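To make the size of the test matrix concrete, here is a minimal sketch of how user prompts and system prompt styles combine. The `GROUPS` layout and the `build_test_cases` helper are hypothetical names used only for illustration; the questions are copied from the full script at the end of this post, and only two of the four groups are shown to keep it short.

```python
from itertools import product

# Two of the four prompt groups, copied from the full script at the end of
# this post. The dictionary layout itself is an assumption for illustration.
GROUPS = {
    1: [
        "What herbs are easiest to grow indoors?",
        "I've been thinking about trying to grow herbs inside, but I'm not really sure where to begin. Any advice?",
        "Help! My basil is dying, what went wrong?",
    ],
    4: [
        "What are the main challenges of growing herbs indoors?",
        "Any tips for someone who keeps killing houseplants but wants to try herbs?",
        "Is it realistic to grow enough herbs indoors to actually use in cooking?",
    ],
}
STYLES = ["basic", "structured", "comprehensive"]


def build_test_cases(groups, styles):
    """Pair every user prompt with every system prompt style."""
    cases = []
    for group, questions in groups.items():
        for question, style in product(questions, styles):
            cases.append({"group": group, "style": style, "question": question})
    return cases


cases = build_test_cases(GROUPS, STYLES)
print(len(cases))  # 6 user prompts x 3 styles = 18
```

With all four groups filled in, the same function would produce 12 x 3 = 36 test cases, one per combination of user prompt and system prompt.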


The evaluation prompt comes into play at the end of this process. It drives a separate instance of GPT-3.5-turbo, which assesses each response from the earlier steps.
This evaluation prompt sits above the question-answering layers and ensures that variations in content or tone are measured in a systematic, repeatable way. Treating the evaluation prompt as its own role highlights not just what the model says, but how well it aligns with an independent, explicit set of criteria.
In this case, I structured the evaluation prompt by first defining the judge model's role, then providing clear instructions for how to score the earlier system's responses against the Gricean maxims. 
I provided descriptions for each maxim with instructions for scoring against them, then finished with instructions on how to format its own responses.
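Because the judge model replies in free text, it can help to validate its output before trusting the scores. The sketch below is one possible way to do that, not the approach used in the full script: `parse_judge_output` is a hypothetical helper, the sample reply is invented, and the 0 to 10 bounds are an assumption chosen to cover both score ranges mentioned in the evaluator prompt.

```python
import json

MAXIMS = ("quantity", "quality", "relation", "manner")


def parse_judge_output(raw):
    """Extract the JSON object from the judge's reply and validate its shape."""
    # Keep only the outermost {...} in case the judge adds text around the JSON.
    start, end = raw.index("{"), raw.rindex("}") + 1
    data = json.loads(raw[start:end])

    parsed = {}
    for maxim in MAXIMS:
        entry = data[maxim]  # raises KeyError if the judge skipped a maxim
        score = float(entry["score"])
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"{maxim} score out of range: {score}")
        parsed[maxim] = {
            "score": score,
            "rationale": str(entry["rationale"]),
            "suggestion": str(entry["suggestion"]),
        }
    return parsed


# Invented reply for illustration; not output from the actual judge model.
sample_reply = """Here is my evaluation:
{
  "quantity": {"score": 8.5, "rationale": "r1", "suggestion": "s1"},
  "quality":  {"score": 9.0, "rationale": "r2", "suggestion": "s2"},
  "relation": {"score": 7.25, "rationale": "r3", "suggestion": "s3"},
  "manner":   {"score": 6.0, "rationale": "r4", "suggestion": "s4"}
}"""

scores = parse_judge_output(sample_reply)
print(scores["relation"]["score"])  # 7.25
```

Failing loudly on a missing maxim or an out-of-range score keeps malformed judge replies from silently skewing the averages.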

Once I completed the prompt sets and configured the Python script to connect to LangChain, LangSmith, and the OpenAI API, the system was ready to run. In LangSmith, the test results populated a customizable UI that included columns for the test-level and aggregate scores for each Gricean category. 

Using the evaluator prompt's instruction to leave comments and suggestions for each test case, I was able to see the rationale behind the judge model's scoring decisions. This provided a mix of quantitative and qualitative data that was ready to export for a more in-depth visual comparison.

After exporting the dataset as a CSV, I used Google Sheets to see the relationship between user and system prompts per maxim.




Zooming out from this, I pulled the average scores for each system prompt and plotted them against each maxim.
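I did this aggregation in Google Sheets, but the same averaging can be sketched in Python with the standard library. The column names (`style`, `group`, `maxim`, `score`) and the `average_scores` helper are assumptions, since the layout of a LangSmith export may differ, and the rows below are made-up numbers rather than the project's results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; these are not the project's results.
SAMPLE_CSV = """style,group,maxim,score
basic,1,quantity,8.0
basic,1,manner,9.0
basic,2,quantity,9.0
structured,1,quantity,7.0
structured,2,quantity,8.0
"""


def average_scores(csv_text):
    """Return the mean score for each (system prompt style, maxim) pair."""
    buckets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        buckets[(row["style"], row["maxim"])].append(float(row["score"]))
    return {key: round(mean(values), 2) for key, values in buckets.items()}


averages = average_scores(SAMPLE_CSV)
print(averages[("basic", "quantity")])       # 8.5
print(averages[("structured", "quantity")])  # 7.5
```

Keying the result on the (style, maxim) pair mirrors the chart described above, with one average per system prompt for each maxim.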

From these two views, I could see:
1. The Quality and Relation scores had the tightest distribution across the three system prompts, while the Quantity and Manner scores showed more fluctuation.
2. The Basic system prompt performed well for all four maxims and tended to receive consistent scores across user prompt groups.
 
3. The Structured system prompt consistently underperformed when measuring for Quantity and Manner. However, it tended to be on par with the other prompts for Quality and Relation. 
4. The Comprehensive system prompt received its highest scores when measuring for Quantity, Relation, and Manner, but received lower scores for Quality.
Pulling all of this together, I narrowed down a key insight:
More detail ≠ better system prompts. While the Comprehensive system prompt excelled in a few areas, such as earning the highest average score for Manner (including a 9.3 in group 4), it was on par with the Basic prompt on average. 
These results challenged my initial assumption that a more detailed, explicit system prompt would be the highest performer overall. At this point, it seems to me that the quality of direction is more important than the quantity of directions given.
One way I can improve this is to replace the Comprehensive system prompt's Response Structure, Communication Approach, and Special Instructions sections with few-shot examples, such as: 
User: 'Help! My basil is dying, what went wrong?'
PlantMentor: 'Oh no! Let's figure this out quickly. The two most common culprits are too much water or not enough light. First, stick your finger into the soil - if it's soggy, that's your problem. If the soil feels okay, check if your basil is getting at least 6 hours of bright light each day. What's happening exactly - are the leaves turning yellow, getting brown spots, or just looking droopy?'
It's worth noting that this approach wouldn't provide the model with ready-made responses. Few-shot examples are intended to show, rather than tell, the model how to respond.
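As a rough sketch of what this could look like in code, few-shot examples can be passed as prior user and assistant turns placed between the system prompt and the real question. The `build_messages` helper is a hypothetical name, the system prompt is cut down to its first clause, and the assistant reply is a shortened version of the example above; the message format is the same role/content dictionary style the full script already uses.

```python
def build_messages(system_prompt, few_shot_pairs, question):
    """Place few-shot turns between the system prompt and the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in few_shot_pairs:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": question})
    return messages


# One example pair, shortened from the PlantMentor exchange above.
FEW_SHOT = [
    (
        "Help! My basil is dying, what went wrong?",
        "Oh no! Let's figure this out quickly. The two most common culprits "
        "are too much water or not enough light.",
    ),
]

messages = build_messages(
    "You are PlantMentor, an expert indoor gardening specialist.",
    FEW_SHOT,
    "What herbs can I start growing indoors this winter?",
)
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user']
```

The model sees the example exchange as conversation history, so it can pick up the tone and structure without being handed an answer to the actual question.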
After reflecting on all of this, I identified two dependencies which double as the starting points for the next iteration of this project:
1. Qualitative + quantitative data > quantitative data alone. While I instructed the evaluation prompt to include a 1-2 sentence rationale for its scores, I was running up against a (self-imposed) deadline and decided to focus only on the quantitative aspect. Naturally, this limited how deeply I could analyze this dataset. There are likely other patterns that I've missed, all of which are worth exploring.
2. Introducing a different model may improve the evaluation prompt. Using a separate instance of GPT-3.5-turbo for the evaluation prompt made sense for this iteration. However, at the moment I'm wondering if an LLM would be partial to its own outputs compared to those of a different model. For example, it'd be interesting to see how Claude Sonnet 4 performs as the judge model.
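One way to make the second experiment cheap to try is to treat the judge as a plain callable, so a different model can be swapped in without touching the evaluator logic. This is only a sketch under that assumption: `make_evaluator`, `stub_judge`, and the short template are hypothetical, the stub returns a fixed made-up score instead of calling any API, and "Try basil." is a placeholder response.

```python
import json
from typing import Callable


def make_evaluator(judge: Callable[[str], str], template: str):
    """Build an evaluator around any callable that maps a prompt to a reply."""
    def evaluate_response(question: str, response: str) -> dict:
        prompt = template.format(question=question, response=response)
        return json.loads(judge(prompt))
    return evaluate_response


def stub_judge(prompt: str) -> str:
    """Stand-in judge that returns a fixed, made-up score; no API call."""
    return json.dumps(
        {"manner": {"score": 7.0, "rationale": "stub", "suggestion": "stub"}}
    )


TEMPLATE = "Question: {question}\nResponse: {response}"
evaluator = make_evaluator(stub_judge, TEMPLATE)
result = evaluator("What herbs are easiest to grow indoors?", "Try basil.")
print(result["manner"]["score"])  # 7.0
```

In the real script, the callable could wrap the existing chain for GPT-3.5-turbo or an equivalent call to another provider, which would let the same responses be scored by two judges and compared.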
As a conversation designer, I'm defined by my ability to anticipate the messy, unpredictable ways that people use conversational interfaces. From there, my job is to guide the user in the direction that they ultimately want to go. All of that said, I tend to design for a fixed set of inputs and outputs. 
 
Working hands-on with LLMs like this was a departure from that, but I was surprised at how naturally the process came to me. The same instincts are at play, just applied to a different process. 
There is still plenty to learn, and I'm eager to dive into this more deeply in the future. I plan to document the next iteration of this project in the months to come.
Until then, I have plants to water.
- Oliver
 
Gricean Prompt Variation Dataset
LangChain Documentation
LangSmith Documentation
Full script: gricean_prompt_eval.py
from langsmith import Client
from langsmith.evaluation import evaluate
from langsmith.evaluation.evaluator import DynamicRunEvaluator
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langsmith.utils import LangSmithConflictError
#   Constants  
DATASET_NAME = "Gricean Prompt Variation Dataset"
MODEL = "gpt-3.5-turbo"
#   Dataset client setup  
client = Client()
try:
    dataset = client.create_dataset(dataset_name=DATASET_NAME)
    print(f"Created new dataset: {DATASET_NAME}")
except LangSmithConflictError:
    dataset = client.read_dataset(dataset_name=DATASET_NAME)
    print(f"Reusing existing dataset: {DATASET_NAME}")
#   System prompt variants (personas)  
def prompt_style_basic(q):
    return [
        {"role": "system", "content": '''
         - You are a helpful AI assistant for indoor gardening. 
         - Be nice and help people grow plants inside. 
         - Answer questions about plants accurately and be creative when needed, but don't be rude. 
         - Help users with whatever plant problems they have. 
         - Be friendly but also knowledgeable. Make sure your advice is good.
         '''
         },
        {"role": "user", "content": q},
    ]
def prompt_style_structured(q):
    return [
        {"role": "system", "content": '''
         You are GreenThumb, an AI assistant specializing in indoor gardening and houseplant care. 
         Role: Act as an experienced indoor gardener who enjoys helping others succeed with houseplants.
            
            Communication Style: 
            - Use clear, beginner-friendly explanations for plant care 
            - Provide specific care instructions with practical examples 
            - Ask about growing conditions when diagnosing plant problems 
            - Maintain an encouraging, supportive tone for plant struggles 
            Response Format:
            - For plant problems: Provide step-by-step diagnosis and solutions 
            - For care questions: Include specific watering, lighting, and feeding schedules 
            - Structure advice with clear sections (immediate actions, long-term care) 
            - Keep responses focused on the specific plant or issue 
            Boundaries: 
            - Focus on common houseplants and indoor growing methods 
            - Don't recommend plants that are toxic if the user mentions pets/children 
            - If unsure about plant identification, ask for more details or photos 
            - Redirect outdoor gardening questions to indoor alternatives
         '''
         },
        {"role": "user", "content": q},
    ]
def prompt_style_comprehensive(q):
    return [
        {"role": "system", "content": '''
         You are PlantMentor, an expert indoor gardening specialist with 15+ years of experience helping people transform their homes into thriving green spaces. 
            Core Identity 
            - Expert in houseplant care, indoor growing systems, and plant troubleshooting 
            - Patient educator who adapts advice to user's experience and living situation 
            - Practical problem-solver focused on sustainable, successful plant care 
            Communication Approach
            - For beginners: Start with easy-care plants, explain basic concepts like "bright indirect light," provide simple care schedules 
            - For intermediate growers: Discuss plant-specific needs, seasonal adjustments, propagation techniques, pest management 
            - For advanced enthusiasts: Cover specialized growing methods, rare plants, environmental optimization, advanced troubleshooting 
            Response Structure 
            - Plant problems: Ask about symptoms, environment, and recent care changes, then provide diagnosis with confidence level and treatment steps
            - Care questions: Cover light, water, humidity, feeding, and repotting needs specific to the plant and user's conditions
            - Plant selection: Consider user's light conditions, experience level, lifestyle, pets/children, and aesthetic preferences 
            - Setup advice: Factor in space constraints, budget, and long-term goals 
            Handling Uncertainty 
            - If plant identification is unclear, ask for specific details (leaf shape, growth pattern, current size) 
            - When multiple issues could cause symptoms, rank possibilities by likelihood
            - For unusual problems, recommend consulting local plant experts or extension services
            - Always mention if certain treatments might stress plants further 
            Safety and Boundaries 
            - Always ask about pets and children before recommending plants - provide safe alternatives for toxic species 
            - Don't recommend plants requiring conditions most homes can't provide without significant investment 
            - Avoid suggesting treatments involving potentially harmful chemicals without proper safety warnings 
            - Redirect questions about outdoor gardening, large-scale growing, or plant breeding to indoor alternatives when possible 
            Special Instructions 
            - When a user asks about plant care, first understand their experience level and current growing setup. 
            - Always ask about current light conditions when giving plant recommendations 
            - When diagnosing problems, inquire about watering frequency, recent changes, and seasonal timing 
            - Emphasize that plant care is learned through experience - normalize some plant loss as part of learning
         '''
         },
        {"role": "user", "content": q},
    ]
prompt_variants = {
    "basic": prompt_style_basic,
    "structured": prompt_style_structured,
    "comprehensive": prompt_style_comprehensive,
}
#   User prompts (utterances)
questions = [
    # #group 1
    # "What herbs are easiest to grow indoors?",
    # "I've been thinking about trying to grow herbs inside, but I'm not really sure where to begin. Any advice?",
    # "Help! My basil is dying, what went wrong?",
    
    # #group 2
    # "I've never grown anything before — how do I even begin an herb garden inside?",
    # "I want to impress guests with fresh herbs in my cooking",
    # "What's the optimal lighting setup for growing Mediterranean herbs indoors?",
    # #group 3
    # "What's the best way to grow culinary herbs like basil or thyme indoors?",
    # "Which herbs can I grow indoors for making fresh pasta sauce?",
    # "What herbs can I start growing indoors this winter?",
    #group 4
    "What are the main challenges of growing herbs indoors?",
    "Any tips for someone who keeps killing houseplants but wants to try herbs?",
    "Is it realistic to grow enough herbs indoors to actually use in cooking?",
]
#   Prepare examples: clear old and add fresh  
new_examples = []
for q in questions:
    for style_name, formatter in prompt_variants.items():
        new_examples.append({
            "input": {
                "style": style_name,
                "messages": formatter(q),
                "question": q,
            }
        })
# Delete all existing examples
for ex in client.list_examples(dataset_id=dataset.id):
    client.delete_example(example_id=ex.id)
print("Cleared old examples")
# Batch-create new examples
client.create_examples(
    dataset_id=dataset.id,
    inputs=[ex["input"] for ex in new_examples],
)
print(f"Added {len(new_examples)} fresh examples")
# Reload Example objects for evaluation
examples = list(client.list_examples(dataset_id=dataset.id))
print(f"Loaded {len(examples)} dataset Example objects.")
#   Target function: LLM call  
llm = ChatOpenAI(model=MODEL, temperature=0.7, max_tokens=512)
chain = llm | StrOutputParser()
def wrapped_target(inputs, langsmith_extra=None):
    messages = inputs["messages"]
    return {"response": chain.invoke(messages)}
print("Starting Gricean Prompt Evaluation...")
#   Evaluator function: Grice's Maxims scoring  
# `run` and `example` are LangSmith objects (read via .outputs and .inputs), not plain dicts
def gricean_maxims_eval(run, example, langsmith_extra=None):
    question = example.inputs.get("question", "")
    response = run.outputs.get("response", "")
    prompt = f"""
    You are a senior language evaluation specialist. 
    Your task is to assess how well a given LLM response adheres to Grice's Maxims of Conversation. 
   
    Score the response using a 0-10 scale for each maxim, along with rationale behind your score.
    Evaluate each maxim critically and specifically.
        Grice's Maxims (with scoring criteria):
        - Quantity: Does the response supply all necessary information to fully address the question without including irrelevant or redundant details?
        - Quality: Is the content factually accurate, well-supported by evidence or sound reasoning, and presented with appropriate confidence or qualifiers?
        - Relation: Does every part of the response directly relate to the user's query and context, avoiding tangents or off-topic content?
        - Manner: Is the answer clearly organized, unambiguous, and stylistically appropriate in tone and register for the intended audience?
        QUANTITY - Is the information sufficient but not excessive?
        10: Perfectly scoped to the prompt, with no fluff or repetition
        7-9: Adequate, but may be verbose or overly concise in places
        4-6: Either lacking key details or includes unnecessary padding
        0-3: Mostly irrelevant or extremely underdeveloped
        QUALITY - Is the information accurate and supported?
        10: Factual, precise, and grounded in known best practices
        7-9: Generally correct but may lack citation, precision, or hedge unnecessarily
        4-6: Some inaccuracies, exaggerations, or questionable framing
        0-3: Misleading, vague, or outright false
        RELATION - Does the content stay relevant to the prompt?
        10: Entirely focused and logically connected to the topic
        7-9: Mostly relevant but includes minor tangents or assumptions
        4-6: Strays from topic or includes unrelated filler
        0-3: Off-topic or disorganized
        MANNER - Is the expression clear, structured, and free from ambiguity?
        10: Highly readable, logically sequenced, and accessible
        7-9: Mostly clear, may include minor clunky phrasing or awkward transitions
        4-6: Difficult to follow or overly technical for the audience
        0-3: Ambiguous, rambling, or poorly structured
    Instructions
        - Evaluate the given response against Grice's Maxims: Quantity, Quality, Relation, and Manner.
        - Score on a scale from 1.00 to 10.00 using exactly two decimal places (e.g. 8.97, 9.22).
        - Always format numbers like `X.XX`.
        - Reserve a score of 10 for exceptional responses, making it very difficult to achieve. 
        - Opt for granularity where possible, using decimals to the hundredth-place to get as much detail as possible.
        - Consider the user's context and implied meaning behind their request and assess the response accordingly.
        
        Example output:
            {{
            "quantity":   {{"score": 8.97, "rationale": "…", "suggestion": "…"}},
            "quality":    {{"score": 9.22, "rationale": "…", "suggestion": "…"}},
            "relation":   {{"score": 7.45, "rationale": "…", "suggestion": "…"}},
            "manner":     {{"score": 6.88, "rationale": "…", "suggestion": "…"}}
            }}
        Question: {question}
        Response: {response}
    For each maxim, return:
    - A score from 1.00 (poor) to 10.00 (exceptional),
    - A 1-2 sentence rationale explaining your reasoning,
    - One specific suggestion to improve the response with respect to that maxim.
    Respond in JSON format:
        {{
        "quantity": {{"score": float, "rationale": str, "suggestion": str}},
        "quality":  {{"score": float, "rationale": str, "suggestion": str}},
        "relation": {{"score": float, "rationale": str, "suggestion": str}},
        "manner":   {{"score": float, "rationale": str, "suggestion": str}}
        }}
    """
    import json
    # Send evaluation prompt as a single user message
    eval_response = chain.invoke([
        {"role": "user", "content": prompt}
    ])
    try:
        parsed = json.loads(eval_response)
    except json.JSONDecodeError:
        return {
            "key": "gricean_maxims_eval",
            "score": 1,
            "comment": "Failed to parse JSON from LLM",
            "suggestion": f"Raw output: {eval_response}"
        }
    return [
        {
            "key": "grice_quantity",
            "score": parsed["quantity"]["score"],
            "comment": parsed["quantity"]["rationale"],
            "suggestion": parsed["quantity"]["suggestion"],
        },
        {
            "key": "grice_quality",
            "score": parsed["quality"]["score"],
            "comment": parsed["quality"]["rationale"],
            "suggestion": parsed["quality"]["suggestion"],
        },
        {
            "key": "grice_relation",
            "score": parsed["relation"]["score"],
            "comment": parsed["relation"]["rationale"],
            "suggestion": parsed["relation"]["suggestion"],
        },
        {
            "key": "grice_manner",
            "score": parsed["manner"]["score"],
            "comment": parsed["manner"]["rationale"],
            "suggestion": parsed["manner"]["suggestion"],
        },
    ]
gricean_evaluator = DynamicRunEvaluator(func=gricean_maxims_eval)
#   Run evaluation  
# Run evaluation using the Example objects directly
evaluate(
    wrapped_target,
    examples,
    evaluators=[gricean_evaluator],
)