BERT-integrated web app with Markdown as input passage

I’m new to this, so let me know if I don’t have this right. I’m trying to integrate AI Q&A functionality into an existing Flask app. As I understand it, I can use a BERT model such as Google’s bert-large-uncased-whole-word-masking-finetuned-squad to answer questions given a passage file. I want the passage to be the text contained in a single GitHub-flavored README Markdown file. What format does the passage file need to be in? Is there a process for cleaning the Markdown file? Can I just convert the Markdown to text and use it?


BERT cannot handle Markdown, so you need to convert it to plain text.


by Hugging Chat:

You’re on the right track! Let’s break this down step-by-step for clarity:


:white_check_mark: 1. Required Format for the Passage

The BERT model requires plain text input. You cannot use raw Markdown directly because:

  • BERT’s tokenizer has no notion of Markdown syntax; headers, lists, links, etc. just become extra tokens
  • Formatting characters (#, *, -, etc.) add noise to the model’s input
  • Code blocks and tables in Markdown can confuse the model

:white_check_mark: 2. Cleaning Markdown: Process

You need to convert Markdown to clean text while preserving semantic meaning. Here’s how:

Steps to Clean Markdown

  1. Remove code blocks (e.g., triple-backtick sections)
  2. Strip Markdown syntax:
    • Headers (#, ##, etc.)
    • Lists (-, *, 1.)
    • Links ([text](url)) → keep the text only
    • Images (![](url)) → optional: remove or add placeholder
  3. Preserve paragraphs (keep line breaks for context)
  4. Remove special characters (e.g., >, ---, | for tables)
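
If you want to see what steps 1 and 4 look like in code, here is a rough regex sketch (the patterns are assumptions and not exhaustive; the library-based approach in the next section is usually more robust):

import re

def strip_markdown_noise(md):
    # Step 1: remove fenced (triple-backtick) code blocks entirely
    md = re.sub(r"```.*?```", "", md, flags=re.DOTALL)
    # Step 4: drop blockquote markers, horizontal rules, and table pipes
    md = re.sub(r"^>\s?", "", md, flags=re.MULTILINE)
    md = re.sub(r"^-{3,}\s*$", "", md, flags=re.MULTILINE)
    md = md.replace("|", " ")
    return md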

Tools for Cleaning

Use Python libraries like:

  • markdown2 (converts Markdown to HTML, then strip tags)
  • mistune (fast Markdown parser)
  • BeautifulSoup (for HTML tag cleanup)
Example Code
import markdown2
from bs4 import BeautifulSoup

def markdown_to_text(markdown_content):
    # Convert Markdown to HTML
    html = markdown2.markdown(markdown_content)
    # Strip HTML tags and extra whitespace
    text = BeautifulSoup(html, "html.parser").get_text()
    return " ".join(text.split())  # Remove extra spaces

:white_check_mark: 3. BERT Integration

You’re correct: use Hugging Face’s bert-large-uncased-whole-word-masking-finetuned-squad with their question-answering pipeline.

Key Considerations

  • Token Limit: BERT accepts at most 512 tokens per input (roughly 300-400 words, and the question itself counts toward that limit). If your README is longer:
    • Split the cleaned text into overlapping chunks (e.g., 384 tokens with 128-token overlap).
    • Use the stride parameter in the tokenizer to handle this.
  • Confidence Scores: The model returns answers with confidence scores. Filter answers with low scores (e.g., <0.1) to avoid nonsense.
Example QA Pipeline
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad"
)

# Example usage
result = qa_pipeline(question="What is the project license?", context=cleaned_text)
print(result["answer"])  # Output: "MIT"

:white_check_mark: 4. Flask App Integration

Workflow

  1. Upload Markdown (POST /upload):
    • Accept .md files
    • Clean Markdown → store cleaned text in memory/session
  2. Answer Questions (POST /ask):
    • Accept user questions
    • Run BERT QA on stored cleaned text
    • Return answer + confidence score

Example Flask Code

from flask import Flask, request, jsonify

# markdown_to_text and qa_pipeline are the helpers defined in the snippets above

app = Flask(__name__)
stored_context = None  # Or use a session/per-user store for multi-user apps

@app.route("/upload", methods=["POST"])
def upload_markdown():
    global stored_context
    file = request.files["file"]
    if file and file.filename.endswith(".md"):
        markdown_content = file.read().decode("utf-8")
        stored_context = markdown_to_text(markdown_content)
        return jsonify({"status": "success"})
    return jsonify({"error": "Invalid file"}), 400

@app.route("/ask", methods=["POST"])
def ask_question():
    global stored_context
    data = request.json
    if not stored_context:
        return jsonify({"error": "No passage uploaded"}), 400
    result = qa_pipeline(question=data["question"], context=stored_context)
    return jsonify(result)
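
Once the app is running locally (e.g., via flask run), a minimal sketch for exercising both endpoints with the requests library; the host and port are Flask’s defaults and an assumption here:

import requests

BASE = "http://127.0.0.1:5000"  # assumed local dev server

# Upload the README
with open("README.md", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"file": ("README.md", f)})
print(resp.json())

# Ask a question about it
resp = requests.post(f"{BASE}/ask", json={"question": "What is the project license?"})
print(resp.json())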

:white_check_mark: 5. Testing & Edge Cases

  • Test with a sample README.md (e.g., a GitHub project’s README)
  • Handle long documents:
    • If the cleaned text exceeds BERT’s token limit, use chunking (the pipeline can also handle this for you; see the sketch after this list):
      from transformers import AutoTokenizer
      
      # Use the same checkpoint as the QA model so the tokenizer matches it
      tokenizer = AutoTokenizer.from_pretrained(
          "bert-large-uncased-whole-word-masking-finetuned-squad"
      )
      inputs = tokenizer(
          cleaned_text,
          max_length=384,
          padding="max_length",
          truncation=True,
          return_overflowing_tokens=True,
          stride=128
      )
      
  • Validate inputs: Ensure users upload .md files and ask meaningful questions.
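
As an alternative to chunking by hand, the question-answering pipeline can split a long context into overlapping windows itself; a minimal sketch reusing qa_pipeline and cleaned_text from the earlier snippets, assuming a recent transformers version where these call parameters are available:

# The pipeline windows the context internally and returns the best span
result = qa_pipeline(
    question="What is the project license?",
    context=cleaned_text,
    max_seq_len=384,   # window size in tokens
    doc_stride=128,    # overlap between windows
    top_k=1,
)
print(result["answer"], result["score"])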

:white_check_mark: 6. Deployment Tips

  • Optimize for speed: Pass device=0 to the pipeline (or load the model with device_map="auto") for GPU acceleration.
  • Security: Sanitize the uploaded Markdown (and any HTML generated from it) so embedded HTML or scripts can’t be injected; a parser with an escape option, such as mistune, helps here.
  • Scalability: For production, store cleaned text in a database or cache (e.g., Redis) instead of memory.
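
For the scalability point, a minimal sketch of swapping the global variable for Redis, reusing cleaned_text from earlier (the key name and the local Redis instance are assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance

# In /upload: store the cleaned text
r.set("readme_context", cleaned_text)

# In /ask: fetch it back
stored = r.get("readme_context")
context = stored.decode("utf-8") if stored else None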

:green_circle: Final Answer

Yes, you can:

  1. Convert Markdown to plain text using libraries like markdown2 + BeautifulSoup.
  2. Clean the text by removing syntax and special characters.
  3. Use the cleaned text as input to the BERT QA pipeline.
  4. Integrate everything into Flask with /upload and /ask endpoints.

This approach will let users upload a README.md and ask questions about its content! Let me know if you need code for specific steps.

@John6666, great write-up. This is very helpful. Thanks.


@John6666, I’ve integrated your changes into my code. Specifically, I’ve added the markdown-to-text function instead of just checking whether a character is alphanumeric or a space. Viewing the cleaned text, all pound signs, TOC brackets, etc. are removed. Double dashes and the triple backticks around code blocks are not removed. I’m not sure if this is causing a problem. The model mostly just returns the [CLS] token instead of an answer, or it gives only fragmentary words for the answer and no context. Regarding your statement above, “Code blocks and tables in Markdown can confuse the model”, I’d like the answers to include code blocks, as that text really qualifies as an answer. This is all running in the Flask app, so posting the entire site would be prohibitive.


The code generated by AI is only an example, so don’t use it as is…:sweat_smile:
Well, it may work in some cases.

You need to decide whether to remove all the Markdown information and convert it to text, or leave some of it and code it yourself, or have AI generate the code.
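
For example, if you want to keep the text inside code blocks as possible answers (as mentioned above) while still stripping the rest of the markup, one rough sketch (the regex patterns are assumptions, not a complete cleaner):

import re

def clean_but_keep_code(md):
    # Drop only the ``` fence lines; keep the code inside them as plain text
    md = re.sub(r"^```[^\n]*$", "", md, flags=re.MULTILINE)
    # Strip header and blockquote markers
    md = re.sub(r"^#{1,6}\s*", "", md, flags=re.MULTILINE)
    md = re.sub(r"^>\s?", "", md, flags=re.MULTILINE)
    # Keep link text, drop the URL
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)
    return md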

https://stackoverflow.com/questions/60220842/how-should-properly-formatted-data-for-ner-in-bert-look-like

I think I understand all this, but perhaps I don’t fully understand it.

I’m trying to use the model bert-large-uncased-whole-word-masking-finetuned-squad to answer questions posed against a passage that is a single README.md file cleaned of certain text.

The model has been trained for question answering on datasets which include Wikipedia and BookCorpus.

As I understand it, the model has advanced logic to pick out answers to questions from the given passage (the cleaned README.md file).

What I don’t quite understand is:

  1. Depending on how clean the text (the README file) is, the model either can or cannot give an answer, is this correct? If so, how would I debug this?

  2. Am I using the correct model? Is it possible I should be using a model that has been trained on README.md files themselves?

  3. Would I have to build my own dataset to make this work? If I want the answers to include code snippets from the README, how would I tell it that the code is the right thing to return?


Hmm… I’m not an expert on NLP either, but BERT isn’t a model meant for Markdown structure analysis; if you use it, it would be for interpreting or processing the content. For document structure analysis, it would be more efficient to implement that part in Python code and analyze only the necessary parts with BERT, or to use a larger, more general-purpose LLM.
Changing the model itself is also an option, but I think the approach of letting the model do only what it is capable of will save computing resources.
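
A rough sketch of that idea: split the cleaned README into sections by heading in plain Python, pick the section that overlaps the question most, and only send that part to BERT (it reuses markdown_content and qa_pipeline from the earlier snippets; the overlap scoring is a naive assumption):

import re

def pick_relevant_section(markdown_content, question):
    # Split on first- and second-level headings
    sections = re.split(r"^#{1,2}\s+", markdown_content, flags=re.MULTILINE)
    q_words = set(question.lower().split())
    # Naive relevance score: word overlap with the question
    def overlap(section):
        return len(q_words & set(section.lower().split()))
    return max(sections, key=overlap)

question = "What is the project license?"
best_section = pick_relevant_section(markdown_content, question)
result = qa_pipeline(question=question, context=best_section)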