BERT-integrated web app with Markdown as input passage

I’m new to this, so let me know if I don’t have this right. I’m trying to integrate AI Q&A functionality into an existing Flask app. As I understand it, I can use a BERT model such as Google’s bert-large-uncased-whole-word-masking-finetuned-squad to answer questions given a passage file. I want the passage to be the text contained in a GitHub-flavored README Markdown file (just one file). What format does the passage file need to be in? Is there a process for cleaning the Markdown file? Can I just convert the Markdown to text and use it?


BERT cannot handle Markdown, so you need to convert it to plain text.


by HuggingChat:

You’re on the right track! Let’s break this down step-by-step for clarity:


:white_check_mark: 1. Required Format for the Passage

The BERT model requires plain text input. You cannot use raw Markdown directly because:

  • BERT tokenization ignores Markdown syntax (headers, lists, links, etc.)
  • Formatting characters (#, *, -, etc.) add noise to the model’s input
  • Code blocks and tables in Markdown can confuse the model

:white_check_mark: 2. Cleaning Markdown: Process

You need to convert Markdown to clean text while preserving semantic meaning. Here’s how:

Steps to Clean Markdown

  1. Remove code blocks (e.g., triple-backtick sections; see the regex sketch after this list)
  2. Strip Markdown syntax:
    • Headers (#, ##, etc.)
    • Lists (-, *, 1.)
    • Links ([text](url)) → keep the text only
    • Images (![](url)) → optional: remove or add placeholder
  3. Preserve paragraphs (keep line breaks for context)
  4. Remove special characters (e.g., >, ---, | for tables)
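
A minimal sketch of step 1, stripping triple-backtick code blocks with a regex before conversion (real READMEs may also use tilde fences or indented code blocks, which this does not cover):

import re

def strip_code_blocks(markdown_content):
    # Drop fenced code blocks delimited by triple backticks,
    # including any language tag on the opening fence
    return re.sub(r"```.*?```", "", markdown_content, flags=re.DOTALL)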

Tools for Cleaning

Use Python libraries like:

  • markdown2 (convert Markdown to HTML, then strip the tags)
  • mistune (fast Markdown parser)
  • BeautifulSoup (for HTML tag cleanup)

Example Code

import markdown2
from bs4 import BeautifulSoup

def markdown_to_text(markdown_content):
    # Convert Markdown to HTML
    html = markdown2.markdown(markdown_content)
    # Strip HTML tags and extra whitespace
    text = BeautifulSoup(html, "html.parser").get_text()
    return " ".join(text.split())  # Remove extra spaces

:white_check_mark: 3. BERT Integration

You’re correct: use Hugging Face’s bert-large-uncased-whole-word-masking-finetuned-squad with their question-answering pipeline.

Key Considerations

  • Token Limit: BERT accepts at most 512 tokens (roughly 350-400 English words). If your README is longer:
    • Split the cleaned text into overlapping chunks (e.g., 384 tokens with 128-token overlap).
    • Use the stride parameter in the tokenizer to handle this.
  • Confidence Scores: The model returns an answer together with a confidence score. Filter out low-scoring answers (e.g., <0.1) to avoid nonsense; see the sketch after the pipeline example.

Example QA Pipeline
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad"
)

# Example usage
result = qa_pipeline(question="What is the project license?", context=cleaned_text)
print(result["answer"])  # Output: "MIT"

:white_check_mark: 4. Flask App Integration

Workflow

  1. Upload Markdown (POST /upload):
    • Accept .md files
    • Clean Markdown → store cleaned text in memory/session
  2. Answer Questions (POST /ask):
    • Accept user questions
    • Run BERT QA on stored cleaned text
    • Return answer + confidence score

Example Flask Code

from flask import Flask, request, jsonify

# markdown_to_text and qa_pipeline are the function and pipeline defined above

app = Flask(__name__)
stored_context = None  # Or use a session/cache for multi-user setups

@app.route("/upload", methods=["POST"])
def upload_markdown():
    global stored_context
    file = request.files["file"]
    if file and file.filename.endswith(".md"):
        markdown_content = file.read().decode("utf-8")
        stored_context = markdown_to_text(markdown_content)
        return jsonify({"status": "success"})
    return jsonify({"error": "Invalid file"}), 400

@app.route("/ask", methods=["POST"])
def ask_question():
    global stored_context
    data = request.json
    if not stored_context:
        return jsonify({"error": "No passage uploaded"}), 400
    result = qa_pipeline(question=data["question"], context=stored_context)
    return jsonify(result)
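
To exercise these endpoints, a quick client-side test (a sketch assuming the app runs on http://localhost:5000 and a README.md exists locally):

import requests

# Upload the Markdown file
with open("README.md", "rb") as f:
    r = requests.post("http://localhost:5000/upload", files={"file": ("README.md", f)})
print(r.json())

# Ask a question about it
r = requests.post("http://localhost:5000/ask", json={"question": "What is the project license?"})
print(r.json())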

:white_check_mark: 5. Testing & Edge Cases

  • Test with a sample README.md (e.g., a GitHub project’s README)
  • Handle long documents:
    • If the cleaned text exceeds BERT’s token limit, chunk it manually as below, or let the pipeline handle it (see the sketch after this list):
      from transformers import AutoTokenizer
      
      # Use the QA model's own tokenizer so chunk boundaries match
      tokenizer = AutoTokenizer.from_pretrained(
          "bert-large-uncased-whole-word-masking-finetuned-squad"
      )
      inputs = tokenizer(
          cleaned_text,
          max_length=384,
          padding="max_length",
          truncation=True,
          return_overflowing_tokens=True,
          stride=128
      )
      
  • Validate inputs: Ensure users upload .md files and ask meaningful questions.
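
Note that the question-answering pipeline can also chunk long contexts for you: its call accepts max_seq_len and doc_stride parameters (available in recent transformers versions), so manual tokenization is often unnecessary:

result = qa_pipeline(
    question="What is the project license?",
    context=cleaned_text,
    max_seq_len=384,  # window size per chunk
    doc_stride=128,   # overlap between consecutive chunks
)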

:white_check_mark: 6. Deployment Tips

  • Optimize for speed: Pass device=0 to pipeline() for GPU acceleration (see the sketch after this list).
  • Security: Treat uploaded Markdown as untrusted input. Since you only extract plain text and never render the generated HTML, injection risk is limited, but still validate file type and size; if you ever do render the HTML, use a parser that escapes raw HTML (mistune supports this).
  • Scalability: For production, store cleaned text in a database or cache (e.g., Redis) instead of memory.
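
For the GPU tip above, a minimal sketch (assuming a CUDA-capable GPU at index 0):

from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
    device=0,  # -1 (default) runs on CPU; 0 selects the first CUDA GPU
)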

:green_circle: Final Answer

Yes, you can:

  1. Convert Markdown to plain text using libraries like markdown2 + BeautifulSoup.
  2. Clean the text by removing syntax and special characters.
  3. Use the cleaned text as input to the BERT QA pipeline.
  4. Integrate everything into Flask with /upload and /ask endpoints.

This approach will let users upload a README.md and ask questions about its content! Let me know if you need code for specific steps.

@John6666, great write-up. This is very helpful. Thanks.
