Stanford CS547 HCI Seminar | Autumn 2025 | Going Beyond Linear Conversation

Key Concepts

  • AI Assistant Programming: Using AI, specifically large language models (LLMs), to assist in the programming process.
  • Mutual Grounding: Establishing a common understanding between human programmers and LLMs, where both parties understand each other's intent and mental states.
  • Vibe Coding: A term for using natural language to instruct an AI to write code.
  • Natural Language Programming: The concept of using natural language to instruct computers, with historical examples like LUNAR and SHRDLU.
  • Ambiguity of Natural Language: The inherent challenge of natural language being open to multiple interpretations, making precise programming difficult.
  • Massive Program Space: The virtually infinite complexity and length of possible programs, making exhaustive search infeasible.
  • Domain-Specific Languages (DSLs): Creating specialized languages for specific domains to reduce the program space.
  • Input Modality Enrichment: Combining natural language with examples (input-output pairs, demonstrations) to clarify intent.
  • Semantic Parsers: Systems that translate natural language into formal queries or code.
  • Grounding Theory in Communication: A theory explaining how humans establish shared understanding through dialogue.
  • Step-by-Step Explanation: LLMs explaining their generated code or reasoning process in natural language.
  • Fine-grained Feedback: Human programmers providing precise feedback on specific parts of the LLM's output.
  • Clarification Questions: LLMs asking users for more information to resolve ambiguities.
  • Recognition over Recall Principle: Designing clarification questions as multiple-choice options so users can recognize the intended interpretation rather than recall it unaided.
  • Selective Prompt Anchoring (SPA): Allowing users to highlight and emphasize specific parts of a prompt to guide the LLM's attention.
  • Attention Scores (Transformers): Mechanisms within transformer models that determine the importance of different input tokens.
  • Logits: The raw, unnormalized scores produced by a model's final layer, which a softmax converts into a probability distribution over output tokens.
  • Cross-Cutting Concerns: Aspects of software development like security, maintainability, and efficiency.

AI Assistant Programming and Mutual Grounding

The presentation focuses on AI assistant programming, specifically on establishing mutual grounding between human programmers and large language models (LLMs). This involves not only aligning LLMs to understand human intent but also enabling humans to understand LLM "mental states" – why they get stuck or make errors – to provide precise feedback. The concept of vibe coding, using natural language to instruct AI for code generation, is highlighted as a popular trend.

Historical Context of Natural Language Programming

The idea of programming via natural language is not new. Early systems like LUNAR (early 1970s) allowed geologists to query lunar rock experiment data in natural language, which was translated into formal predicate calculus queries. Another example is Terry Winograd's SHRDLU system. However, figures like Edsger Dijkstra viewed natural language programming as a "foolish idea" due to the inherent tension between the ambiguity of natural language and the precision required for programming; Dijkstra advocated for high-level formal languages instead. Historically, structured programming languages like Java and Python have been the primary means of harnessing computational power.

Fundamental Challenges in Natural Language Programming

Two core challenges remain relevant even with LLMs:

  1. Inherent Ambiguity of Natural Language: A single natural language description can lead to multiple interpretations and program candidates. This lack of a clear "oracle" makes it difficult to assess correctness.
  2. Massive Program Space: The space of possible programs is virtually infinite due to grammar, complexity, and the availability of numerous libraries and APIs.

Pre-LLM Approaches to Natural Language Programming

Before LLMs, three main strategies were employed:

  1. Reducing Program Space: Focusing on specific domains to create Domain-Specific Languages (DSLs) with smaller, context-free grammars, making navigation more efficient.
  2. Enriching Input Modality: Pairing natural language with input-output examples, user demonstrations, or code skeletons to provide concrete details and ground ambiguous intent.
  3. Building Better Models: Advancements in semantic parsers, including early statistical models, RNNs, and later Transformers, leading up to LLMs.

The LLM Revolution in Code Generation

The release of Codex and GitHub Copilot in 2021 marked a significant shift, enabling AI to generate substantial chunks of code from natural language across many domains and prompting researchers to pivot toward this new approach. Modern AI agents can now use external tools (browsers, shells), plan, ingest large code repositories, and perform self-reflection and self-correction.

The Problem of LLM Misunderstandings in Vibe Coding

Despite advancements, LLMs often fail to understand instructions, leading to incorrect code. Giving more prompts can sometimes worsen the output. Research is actively investigating the limitations of coding agents.

A New Interaction Paradigm: Mutual Grounding

The speaker's research group at Purdue focuses on the interaction paradigm between human programmers and LLMs, observing that current interactions are often linear and human-driven. In contrast, human conversations involve turn-taking, repetition for confirmation, clarification questions, and adjustments in tone and detail to establish shared understanding, a process termed grounding by Clark and Brennan.

Three Methods for Establishing Common Ground

The research group has developed three methods inspired by grounding theory:

Method 1: Enabling LLMs to Explain Back Understanding

This method aims to have LLMs explain their understanding of a programming task back to the human. Since directly measuring an LLM's internal "mental model" is difficult, the approach leverages the principle that "actions speak louder than words." The generated program is considered a faithful reflection of the LLM's mental model.

Process:

  1. The LLM generates a program based on its initial understanding.
  2. The LLM explains the generated program step-by-step in natural language to the human programmer.
  3. This step-by-step explanation acts as a scaffold for fine-grained feedback. Humans can pinpoint misunderstandings or errors within specific steps.
  4. Humans can directly edit the natural language explanation, providing precise feedback on the correct thinking process.
  5. This precise feedback allows the LLM to regenerate only the incorrect step, rather than the entire program (see the sketch after this list).
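
A minimal sketch of this explain-and-repair loop, where the llm_* and get_user_edits callables are hypothetical stand-ins for the model calls and editing UI (assumptions for illustration, not APIs from the talk):

```python
# Hypothetical sketch of Method 1's feedback loop. All callables are
# stand-ins: llm_generate produces code, llm_explain_steps produces a
# step-by-step natural language explanation, get_user_edits returns the
# user's edits as {step_index: corrected_text}, and llm_regenerate_step
# repairs only the code behind one corrected step.

def explain_back_loop(task, llm_generate, llm_explain_steps,
                      llm_regenerate_step, get_user_edits):
    program = llm_generate(task)            # 1. initial program
    steps = llm_explain_steps(program)      # 2. step-by-step NL explanation
    while True:
        edits = get_user_edits(steps)       # 3-4. user edits specific steps
        if not edits:                       # no edits: user accepts the code
            return program
        for i, corrected in edits.items():
            steps[i] = corrected
            # 5. regenerate only the step that was corrected
            program = llm_regenerate_step(program, steps, i)
```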

Applications and Findings:

  • Applied to SQL generation, web automation, data wrangling, and Python code generation.
  • User studies show improved task completion and reduced task completion time.
  • In domains like web automation and SQL generation, this method narrows the performance gap between novices and experts by translating the LLM's mental model into natural language, which novices can understand.
  • Explanations can serve as a basis for further interaction, such as setting breakpoints in the explanation to debug the natural language description or visualizing the correspondence between prompt words and computational results.

Example Tool (SQL Generation): A chatbot interface where users ask natural language questions, the LLM synthesizes SQL, and then explains the query step-by-step. Users can see visual correspondences, intermediate execution results, and edit the explanation.

Challenge: Generating faithful and readable explanations. Early GPT models were unreliable for this, so the team developed a symbolic method: decomposing SQL queries according to their grammar and translating each clause with fixed templates, which guarantees accuracy and avoids the randomness of LLM-generated text.
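
A toy illustration of this symbolic idea, assuming a deliberately tiny SQL grammar and hand-written templates (the actual system covers far more of SQL):

```python
# Decompose a (very restricted) SQL query by grammar and translate each
# clause with a fixed template, so the explanation is deterministic.
import re

TEMPLATES = {
    "FROM":   "Step {n}: Read rows from the table `{arg}`.",
    "WHERE":  "Step {n}: Keep only rows where {arg}.",
    "SELECT": "Step {n}: Return the column(s) {arg}.",
}

def explain_sql(query: str) -> list[str]:
    m = re.match(r"SELECT (.+) FROM (\w+)(?: WHERE (.+))?;?$", query, re.I)
    if not m:
        raise ValueError("query not covered by this toy grammar")
    select, table, where = m.groups()
    clauses = [("FROM", table)]
    if where:
        clauses.append(("WHERE", where))
    clauses.append(("SELECT", select))
    return [TEMPLATES[kind].format(n=i + 1, arg=arg)
            for i, (kind, arg) in enumerate(clauses)]

print("\n".join(explain_sql("SELECT name, gpa FROM students WHERE gpa > 3.5")))
# Step 1: Read rows from the table `students`.
# Step 2: Keep only rows where gpa > 3.5.
# Step 3: Return the column(s) name, gpa.
```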

Method 2: Enabling LLMs to Ask Clarification Questions

This method focuses on LLMs proactively identifying and resolving ambiguities by asking clarification questions.

Process:

  1. The LLM is prompted to summarize and reinterpret the user's intent.
  2. The LLM determines if there is ambiguity based on its interpretation.
  3. If ambiguity exists, the LLM generates a clarification question.
  4. Key Design Choice: Following the recognition over recall principle, the LLM formulates clarification questions as multiple-choice options that present possible interpretations of the user's intent; users can also supply their own interpretation (see the sketch below).
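
A hedged sketch of this flow, where ask_llm and ask_user are hypothetical stand-ins and the prompt wording is invented for illustration:

```python
import json

# Invented prompt wording; not the paper's actual prompt.
CLARIFY_PROMPT = """Restate the user's request, then decide whether it is ambiguous.
If it is, list 2-4 plausible interpretations.
Answer in JSON: {{"ambiguous": true|false, "options": ["...", "..."]}}

Request: {request}"""

def clarify_or_proceed(request, ask_llm, ask_user):
    reply = json.loads(ask_llm(CLARIFY_PROMPT.format(request=request)))
    if not reply["ambiguous"]:
        return request
    # Recognition over recall: present interpretations as multiple choice,
    # plus a free-form option for the user's own interpretation.
    options = reply["options"] + ["(describe your intent yourself)"]
    choice = ask_user(options)
    return f"{request}\nClarified intent: {choice}"
```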

Integration and Application (Dango System):

  • This method is combined with the explanation method in a system called Dango for automating data wrangling tasks.
  • Demo: Users import spreadsheets and issue natural language commands. For example, asking to "perform some correlation analysis" triggers the LLM to suggest options such as Pearson correlation, Spearman correlation, or a t-test. After a choice is made, the LLM synthesizes code and provides an explanation. If the user then asks to "delete one column," the LLM identifies the ambiguity (which column?) and presents options. The LLM then synthesizes a program that performs both steps (illustrated below).
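
For illustration, the kind of two-step pandas program such a system might synthesize after the user picks "Pearson correlation" and then names a column to delete (file and column names are hypothetical; this is not Dango's actual output):

```python
import pandas as pd

df = pd.read_csv("spreadsheet.csv")  # hypothetical imported spreadsheet

# Step 1: Pearson correlation between two numeric columns.
r = df["height"].corr(df["weight"], method="pearson")
print(f"Pearson r = {r:.3f}")

# Step 2: Drop the column the user selected in the clarification dialog.
df = df.drop(columns=["notes"])
```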

User Study Findings (Dango):

  • A study with 33 students across diverse backgrounds compared three conditions:
    1. Chatbot with GPT-generated explanations.
    2. Chatbot with symbolic method explanations.
    3. Chatbot with clarification questions enabled.
  • The condition with clarification questions significantly improved user performance, reduced task completion time, and decreased LLM hallucination. This is attributed to the LLM having a better understanding of the user's task due to the clarifications.

Method 3: Enabling Human Steering of Model Attention

This method allows human programmers to steer the LLM's attention and emphasize important parts of the prompt.

Concept: Instead of just sending a plain text prompt, users can highlight and anchor specific parts of the prompt they deem important, communicating this importance to the LLM.

Example: Given a prompt to "count the number of uppercase vowels at even indices," a regular LLM might overlook the "uppercase" constraint and count all vowels. By anchoring "uppercase vowels," the LLM correctly applies the constraint; the intended behavior is sketched below.
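
The behavior the anchored prompt asks for, as a short Python sketch (not code shown in the talk):

```python
# Count uppercase vowels at even string indices.
def count_uppercase_vowels_at_even_indices(s: str) -> int:
    return sum(1 for i, ch in enumerate(s) if i % 2 == 0 and ch in "AEIOU")

assert count_uppercase_vowels_at_even_indices("AbcdEfgh") == 2  # 'A' and 'E'
```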

Implementation (Selective Prompt Anchoring - SPA):

  • Intuitive Approach: Directly manipulating attention scores in transformer blocks. However, this is complex due to the large number of attention heads.
  • Alternative (PASTA): Profiling attention heads to identify the important ones and manipulating only those. This profiling is computationally expensive.
  • Proposed Approach (SPA): Approximating attention steering by manipulating the logits of the last layer.
    • The process compares logits from two decoding passes: one over the original prompt and one with the anchored text masked out.
    • The difference between the two sets of logits reveals how much the anchored text influences each candidate output token.
    • This difference is amplified or attenuated by an anchoring-strength hyperparameter ω to reshape the output distribution.

SPA Framework:

  • Final Logits = Masked Logits + ω × (Original Logits − Masked Logits), where "Masked" denotes the decoding pass with the anchored text masked out.
  • ω > 1: Amplifies the influence of the anchored text.
  • ω = 1: No amplification (recovers the original prompt).
  • 0 < ω < 1: Attenuates the influence of the anchored text.
  • ω = 0: Ignores the anchored text (equivalent to the masked prompt).
  • ω < 0: Reverses the effect (e.g., anchoring "uppercase" pushes the model toward "lowercase").
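
A minimal numpy sketch of this combination, with the two model passes mocked by random vectors (only the combination step reflects the formula above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 32000
logits_original = rng.normal(size=vocab_size)  # pass over the full prompt
logits_masked = rng.normal(size=vocab_size)    # pass with anchored text masked

def spa_logits(masked, original, omega):
    # omega = 1 recovers the original prompt, omega = 0 the masked prompt;
    # omega > 1 extrapolates, amplifying the anchored text's influence.
    return masked + omega * (original - masked)

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

next_token_probs = softmax(spa_logits(logits_masked, logits_original, omega=1.5))
```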

Evaluation:

  • SPA was evaluated on various coding models and benchmarks.
  • It consistently improved code generation and outperformed baselines such as PASTA and advanced prompting methods (ReAct, self-debugging).
  • SPA has a low overhead compared to other methods that require extensive profiling or long reasoning chains.

Relevance of These Methods in the Future

The speaker addresses the question of whether these methods will remain relevant given rapid LLM advancements. The answer is yes, unless coding agents achieve 100% accuracy.

  • Analogy to Compilers: Compilers are trusted due to their 100% accuracy, allowing programmers to ignore low-level code. Coding agents are not yet at this level.
  • Unpredictability: LLMs can fail unpredictably, even in simple cases, requiring human double-checking.
  • Key Question: The more pertinent question is whether current LLM/RL approaches will lead to 100% accurate coding agents. If not, methods for program comprehension, validation, knowing when to interrupt, building trust, and establishing shared understanding will remain crucial.

Discussion and Q&A Highlights

  • Automating Prompt Anchoring: While currently manual, future work could involve retrieval-based methods (like RAG) to automatically identify relevant instructions or context for anchoring in agentic workflows.
  • Cost and Effort Shift: The shift from manual coding to debugging natural language prompts changes the cost structure. For experts, it can be an amplification effect, saving keystrokes. For novices, without interaction support, it can be costly due to the difficulty of understanding and debugging generated code.
  • CS Education Implications: This shift presents an opportunity to move away from rote memorization of syntax towards teaching code comprehension, debugging, good coding style, and building correct mental models of code (e.g., call graphs, data flow).
  • Anchoring Unstated Assumptions: LLMs may anchor on unstated assumptions due to pre-training data biases (e.g., defaulting to Python/Java) or backend prompt optimizations. This highlights the potential for agents to expand on ambiguous user intent and reveal these anchors for user modification.
  • Prompt Verbosity for Anchoring: Longer, more detailed prompts offer more opportunities for effective anchoring. Short prompts may require manual annotation of adjectives, verbs, or nouns, but controlling anchoring strength can be brittle. Combining anchoring with methods like ReAct, which generate reasoning chains, can provide more anchoring points.
  • Grounding in Other Domains: Grounding principles are applicable beyond coding (e.g., email writing, research), but challenges arise with more dynamic user intents and the difficulty of generating structured clarification questions for open-ended tasks.
  • Cross-Cutting Concerns (Efficiency, Security): Current work focuses on task completion. Future research could augment explanations to include discussions on security, maintainability, and efficiency, potentially showing alternative code examples. Natural language is a good vehicle for communicating these concerns.
  • Engineering vs. Training Data: The gap between spoken and written natural language, and LLMs trained primarily on text, is a significant factor. The structure of pre-training data (e.g., instruction-following formats, injecting reasoning chains) is crucial. Incorporating real interaction structures and multimodal data (tone, facial expressions) could lead to better-aligned models.
  • The Nature of Anchoring: The speaker acknowledges that for a spoken sentence like "show me the students who earned a D," context and non-verbal cues are vital, and these are not captured in text-based training data. This underscores the need for models that better understand natural human dialogue.
