Hey everyone!
I’m currently working on a project that involves extracting questions and answers from an exam PDF and storing each question (along with its options, answer, and explanation) in JSON format. To do this, I use regular expressions to identify the questions in the text, and then use the OpenAI GPT-3.5 Turbo model to generate structured JSON output for each question.
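To make the first step concrete, here is a minimal sketch of regex-based question splitting. It is not the actual code from the project: the function name `splitQuestions` and the assumption that every question starts on its own line with a number followed by `.` or `)` are hypothetical and would need adjusting to the real PDF text.

```typescript
// Hypothetical shape of one question block cut out of the PDF text.
interface QuestionBlock {
  number: number;
  text: string;
}

// Assumption: each question begins at the start of a line as "12. " or "12) ".
// Numbered lines inside an explanation would also match and need extra handling.
function splitQuestions(raw: string): QuestionBlock[] {
  const pattern = /^(\d+)[.)]\s+/gm;
  const starts: { index: number; number: number }[] = [];
  let m: RegExpExecArray | null;
  // Collect the offset of every question start.
  while ((m = pattern.exec(raw)) !== null) {
    starts.push({ index: m.index, number: Number(m[1]) });
  }
  // Each block runs from its own start to the next start (or the end of the text).
  return starts.map((s, i) => {
    const end = i + 1 < starts.length ? starts[i + 1].index : raw.length;
    return { number: s.number, text: raw.slice(s.index, end).trim() };
  });
}
```

Each returned `text` would then be the value substituted for `{inputText}` in the prompt.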
However, I’m running into a specific issue with the model. Even though the prompt clearly instructs it to extract options only if they are explicitly present in the text, the model still generates options from the explanation section. I want to skip any question whose options are not in the proper text format (some questions have options that exist only as images).
Here’s the prompt I’m using: “Extract information from text. {format_instructions} The response should be presented in a markdown JSON codeblock. Question description: {inputText}.Please remember that if options are not explicitly present in the prompt text in the form of ‘A. option_a_text’, ‘B. option_b_text’ and so on, do not extract ‘options’ from answer/explanation and set the ‘options’ field as an empty object and provide ‘result’ field as ‘failed’. If options are present, provide the correct ‘ans’ field with valid options (a, b, c, d, e) and provide ‘result’ field as ‘success’. Do not make up options or answers or explanation.”
I would greatly appreciate your help in understanding why the model generates options in violation of these instructions, and whether there are any solutions or alternative approaches that would improve the accuracy of option extraction.
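One alternative that might be worth trying is to decide in code, before calling the model, whether a question has explicit options at all, so that the "failed" case never depends on the model following the instruction. The sketch below is only an illustration under assumptions: the names `extractExplicitOptions` and `hasUsableOptions` are hypothetical, options are assumed to sit on their own lines as "A. text" or "A) text", and the answer/explanation section is assumed to start with a line beginning with "Answer", "Ans", or "Explanation".

```typescript
// Matches option lines such as "A. some text" or "b) some text".
const OPTION_LINE = /^\s*([A-Ea-e])[.)]\s+(\S.*)$/;
// Assumed markers for the start of the answer/explanation section.
const ANSWER_MARKER = /^\s*(?:answer|ans|explanation)\b/i;

// Collects options that appear before the answer/explanation section.
function extractExplicitOptions(questionText: string): Record<string, string> {
  const options: Record<string, string> = {};
  for (const line of questionText.split(/\r?\n/)) {
    // Stop at the answer section so nothing is read from the explanation.
    if (ANSWER_MARKER.test(line)) break;
    const m = OPTION_LINE.exec(line);
    if (m) options[m[1].toLowerCase()] = m[2].trim();
  }
  return options;
}

// True when enough explicit options were found to send the question to the model.
function hasUsableOptions(questionText: string, minOptions = 2): boolean {
  return Object.keys(extractExplicitOptions(questionText)).length >= minOptions;
}
```

Questions for which `hasUsableOptions` returns false could be marked as "failed" directly and never sent to the model; for the rest, the options found here could be passed along or used to check the model's output.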
PS: I am using zod for schema validation and StructuredOutputParser from the langchain/output_parsers module to parse the output generated by the GPT-3.5 model.
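Since the output is already parsed into an object, another option might be a grounding check that runs on that object after parsing. The sketch below is plain TypeScript and does not use zod or LangChain; the interface `ExtractedQuestion` only mirrors the `options`, `ans`, and `result` fields named in the prompt, and the function name `enforceGroundedOptions` and the answer-marker pattern are assumptions. It downgrades a result to "failed" when any option text cannot be found in the part of the source text that precedes the answer/explanation.

```typescript
// Mirrors the fields named in the prompt; the real schema may contain more.
interface ExtractedQuestion {
  options: Record<string, string>;
  ans: string;
  result: "success" | "failed";
}

// Collapse whitespace and case so PDF line breaks do not break the comparison.
function normalizeText(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Rejects options that do not literally occur before the answer/explanation section.
function enforceGroundedOptions(
  parsed: ExtractedQuestion,
  sourceText: string,
  answerMarker: RegExp = /^\s*(?:answer|ans|explanation)\b/im // assumed section marker
): ExtractedQuestion {
  const cut = sourceText.search(answerMarker);
  const questionPart = normalizeText(cut >= 0 ? sourceText.slice(0, cut) : sourceText);
  const texts = Object.keys(parsed.options).map((k) => parsed.options[k]);
  const grounded =
    texts.length > 0 &&
    texts.every((t) => questionPart.indexOf(normalizeText(t)) !== -1);
  // Keep the parsed object when grounded; otherwise apply the "failed" convention.
  return grounded ? parsed : { ...parsed, options: {}, result: "failed" };
}
```

The substring test is deliberately simple and will accept very short option texts too easily, so it is a safety net rather than a full validator.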