Large Language Models and Conversational User Interfaces for Interactive Fiction and other Videogames

Introduction

This forum post is about implementation details pertaining to conversational user interfaces (CUIs) for interactive fiction (IF) and other videogames, and to bidirectional synchronizations between game engines and large language models (LLMs).

Natural Language Understanding and Semantic Frames

How can LLMs be utilized as text parsers for existing or new works of IF?

One approach involves mapping spoken or written commands to semantic frames with slots that could be filled by nouns or by embedding vectors which represent those nouns. Perhaps the zero vector could be utilized to signify an empty slot (null or undefined).

Consider that commands like “take lamp” and “pick up the bronze lamp” could both utilize the typed semantic frame for “taking” (https://framenet2.icsi.berkeley.edu/fnReports/data/frame/Taking.xml).
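As a minimal sketch of the idea, the toy parser below stands in for an LLM-based NLU component and maps both phrasings onto the same typed frame. The `TakingFrame` class, its slot names, and the naive word filtering are all hypothetical illustrations, not part of FrameNet or any particular library.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "Taking" semantic frame with slots an NLU component could fill.
@dataclass
class TakingFrame:
    agent: str = "player"        # who performs the taking
    theme: Optional[str] = None  # what is taken; None signifies an empty slot

def parse_command(command: str) -> Optional[TakingFrame]:
    """Toy rule-based stand-in for an LLM parser: maps synonymous
    phrasings such as 'take lamp' and 'pick up the bronze lamp'
    onto the same typed frame."""
    words = command.lower().replace("pick up", "take").split()
    if words and words[0] == "take":
        # Naively drop articles and adjectives; a real parser would do better.
        nouns = [w for w in words[1:] if w not in {"the", "a", "an", "bronze"}]
        return TakingFrame(theme=nouns[-1] if nouns else None)
    return None

print(parse_command("take lamp"))
print(parse_command("pick up the bronze lamp"))
```

Both commands yield a frame whose `theme` slot is filled with "lamp"; in the embedding-vector variant described above, that slot would instead hold the vector representing the noun.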

A command like “take it” or “pick it up” could be interpreted by LLMs using dialogue context, e.g., after a command like “inspect lamp”.

Disjunction support for semantic frames’ slots could be useful for reporting multiple candidate nouns. An NLU component might report that, for “pick it up”, the pronoun resolves to “lamp” with 90% probability and to “treasure chest” with 10%. With disjunctive and potentially probabilistic outputs, CUIs for IF or other videogames could ask players whether they meant “the lamp” or “the treasure chest” in a previous command.
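The clarification behavior could be sketched as follows. The confidence threshold and the dialogue wording are assumptions for illustration; a real CUI would tune both.

```python
# Hypothetical sketch: an NLU component returns a probability distribution
# over candidate referents for a pronoun, and the CUI asks a clarifying
# question when no candidate is confident enough.
CONFIDENCE_THRESHOLD = 0.75  # assumed tunable cutoff

def resolve_or_clarify(candidates: dict) -> str:
    """candidates maps each noun to the probability that it resolves the pronoun."""
    best, p = max(candidates.items(), key=lambda kv: kv[1])
    if p >= CONFIDENCE_THRESHOLD:
        return best
    options = " or ".join(f'"the {noun}"' for noun in candidates)
    return f"Did you mean {options}?"

print(resolve_or_clarify({"lamp": 0.9, "treasure chest": 0.1}))
print(resolve_or_clarify({"lamp": 0.55, "treasure chest": 0.45}))
```

With the 90%/10% distribution, the pronoun resolves directly to “lamp”; with a closer split, the CUI asks the player which noun was meant.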

Bidirectional Synchronizations between Game Engines and Large Language Models

Envisioned here are bidirectional synchronizations between game engines and LLMs. Toward this end, consider that game engines could manage and maintain dynamic documents, transcripts, and logs, and that these could be components of larger prompts to LLMs.

Consider, for example, an animate creature arriving on screen that a player should be able to refer to via the CUI. How would the LLM know that the creature, e.g., an “orc”, was on screen, i.e., that it had entered the dialogue context?

By managing dynamic documents, transcripts, or logs, game engines could provide synchronized contexts as components of prompts to LLMs.
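One way to sketch this is a scene log that the engine appends world events to and that is prepended to each player command to form the LLM prompt. The class and prompt format below are illustrative assumptions, not an API of any particular engine.

```python
# Hypothetical sketch: a game engine appends world events to a dynamic log,
# and the log is prepended to each player command to form the LLM prompt,
# so the model appears to "see" the current scene.
class SceneLog:
    def __init__(self):
        self.events = []

    def record(self, event: str) -> None:
        """Called by the game engine whenever the world state changes visibly."""
        self.events.append(event)

    def build_prompt(self, command: str) -> str:
        """Compose the synchronized context plus the player's command."""
        context = "\n".join(self.events)
        return f"Scene so far:\n{context}\n\nPlayer: {command}"

log = SceneLog()
log.record("An orc enters from the eastern door.")
prompt = log.build_prompt("attack it")
# The prompt now contains the orc's arrival, so the LLM can resolve "it".
```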

This would be towards providing an illusion that the CUI AI also sees or understands the contexts of IF or other videogames.

Next, that creature, e.g., an “orc”, might enter view and then exit view. How would an LLM interpret a delayed command from the player and respond that the creature was no longer in view? This suggests features of a dynamic transcript or log.

That is, a fuller illusion would be one in which the AI sees or understands both the present and the recent-past contexts of IF and other videogames.
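The present-versus-recent-past distinction could be sketched by timestamping enter/exit events, so that a delayed command can be answered appropriately. Again, the class, method names, and response wording are hypothetical.

```python
# Hypothetical sketch: timestamped enter/exit events let the system
# distinguish present context from recent-past context, so a delayed
# command can be answered with "no longer in view".
class TimedSceneLog:
    def __init__(self):
        self.turn = 0
        self.visible = set()   # entities currently on screen
        self.events = []       # (turn, event-text) pairs

    def tick(self) -> None:
        self.turn += 1

    def enters(self, entity: str) -> None:
        self.visible.add(entity)
        self.events.append((self.turn, f"{entity} enters view"))

    def exits(self, entity: str) -> None:
        self.visible.discard(entity)
        self.events.append((self.turn, f"{entity} exits view"))

    def describe(self, entity: str) -> str:
        if entity in self.visible:
            return f"The {entity} is in view."
        if any(entity in text for _, text in self.events):
            return f"The {entity} is no longer in view."
        return f"No {entity} has been seen."

log = TimedSceneLog()
log.enters("orc")
log.tick()
log.exits("orc")
log.tick()
print(log.describe("orc"))  # The orc is no longer in view.
```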

Game engines, e.g., Unity and Unreal, could eventually come to support interoperation with LLMs’ dialogue contexts via features for the maintenance of dynamic documents, transcripts, or logs. These engines would then be of general use for creating CUI-enhanced IF and other videogames.

Also possible are uses of multimodal LLMs.

Transmission Efficiency

Instead of transmitting the entirety of dynamic documents, transcripts, logs, or prompts for each spoken or written command to be interpreted by the LLM CUI, “deltas” or “diffs” could be transmitted to synchronize client-side and server-side copies of larger prompts or portions thereof.
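For the common case of append-only transcripts, the delta reduces to the unsent suffix, as in this sketch; for prompts edited in place, a general diff algorithm (e.g., Python's `difflib`) could produce the delta instead. The function names here are assumptions for illustration.

```python
# Hypothetical sketch: for an append-only transcript, the "delta" is simply
# the new suffix since the last synchronization; the server reconstructs
# its copy by concatenation.
def make_delta(synced_len: int, transcript: str):
    """Return the unsent tail of the transcript and the new synced length."""
    return transcript[synced_len:], len(transcript)

def apply_delta(server_copy: str, delta: str) -> str:
    return server_copy + delta

client = "An orc enters.\n"
server = ""
delta, synced = make_delta(0, client)
server = apply_delta(server, delta)      # only 15 bytes sent, not resent later

client += "The orc exits.\n"
delta, synced = make_delta(synced, client)
server = apply_delta(server, delta)      # only the new line is transmitted
assert server == client
```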

Conclusion

Thank you. I hope that I expressed these ideas clearly. I look forward to discussing these ideas with you. Is anyone else thinking about or working on these or similar challenges?


Thank you for raising this important question! While both large language models (LLMs) and foundation models share similarities, they aren’t exactly the same. A foundation model refers to a broad class of AI models trained on vast datasets for various tasks, including natural language understanding, computer vision, and more. LLMs are a subset of foundation models specifically designed for natural language processing, focusing on generating and understanding human-like text.

LLMs have demonstrated remarkable versatility in areas like customer service, fraud detection, and personalized financial recommendations within the fintech sector. Their ability to analyze and generate large volumes of financial data enhances decision-making and improves user experiences. However, concerns remain about accuracy and trustworthiness when applied to sensitive financial tasks.

Thanks again for this insightful topic! For those interested in further reading, check out this LLM in Fintech Service.

Thanks @morrisjones. I’m still interested in these topics of bridging modern AI assistants to visuospatial contents such as: (1) intricate charts, diagrams, and schematics, (2) CAD/CAE content, (3) scientific and educational computer simulations, and (4) interactive fiction and other videogames.

Beyond computer vision techniques (processing 2D imagery and/or interoperating with 3D virtual cameras), semantics could play a role, describing and interrelating things and their parts.

One idea is that approaches for enhancing accessibility, such as man-machine Q&A and dialogue about documents’ visual components (e.g., charts), can generalize to the other scenarios indicated above.

More recently, I’m exploring these topics, in greater detail, here: https://github.com/WICG/proposals/issues/168.

Also of interest: https://intfiction.org/t/arxiv-can-language-models-serve-as-text-based-world-simulators/69159.