BERT-integrated web app with Markdown as input passage

I’m new to this, so let me know if I don’t have this right. I’m trying to integrate AI Q&A functionality into an existing Flask app. As I understand it, I can use a BERT model such as Google’s bert-large-uncased-whole-word-masking-finetuned-squad to answer questions given a passage file. I want the passage to be the text contained in a single GitHub-flavored README Markdown file. What format does the passage file need to be in? Is there a process for cleaning the Markdown file? Can I just convert the Markdown to text and use it?


BERT cannot handle Markdown, so you need to convert it to plain text.


by Hugging Chat:

You’re on the right track! Let’s break this down step-by-step for clarity:


:white_check_mark: 1. Required Format for the Passage

The BERT model requires plain text input. You cannot use raw Markdown directly because:

  • BERT’s tokenizer has no notion of Markdown syntax; headers, lists, links, etc. just become extra tokens
  • Formatting characters (#, *, -, etc.) add noise to the model’s input
  • Code blocks and tables in Markdown can confuse the model

:white_check_mark: 2. Cleaning Markdown: Process

You need to convert Markdown to clean text while preserving semantic meaning. Here’s how:

Steps to Clean Markdown

  1. Remove code blocks (e.g., triple-backtick sections)
  2. Strip Markdown syntax:
    • Headers (#, ##, etc.)
    • Lists (-, *, 1.)
    • Links ([text](url)) → keep the text only
    • Images (![](url)) → optional: remove or add placeholder
  3. Preserve paragraphs (keep line breaks for context)
  4. Remove special characters (e.g., >, ---, | for tables)
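
If you want to see what steps 1 and 4 look like in code, here is a rough regex sketch (the patterns are assumptions and not exhaustive; the library-based approach in the next section is usually more robust):

import re

def strip_markdown_noise(md):
    # Step 1: remove fenced (triple-backtick) code blocks entirely
    md = re.sub(r"```.*?```", "", md, flags=re.DOTALL)
    # Step 4: drop blockquote markers, horizontal rules, and table pipes
    md = re.sub(r"^>\s?", "", md, flags=re.MULTILINE)
    md = re.sub(r"^-{3,}\s*$", "", md, flags=re.MULTILINE)
    md = md.replace("|", " ")
    return md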

Tools for Cleaning

Use Python libraries like:

  • markdown2 (converts Markdown to HTML, then strip tags)
  • mistune (fast Markdown parser)
  • BeautifulSoup (for HTML tag cleanup)
Example Code
import markdown2
from bs4 import BeautifulSoup

def markdown_to_text(markdown_content):
    # Convert Markdown to HTML
    html = markdown2.markdown(markdown_content)
    # Strip HTML tags and extra whitespace
    text = BeautifulSoup(html, "html.parser").get_text()
    return " ".join(text.split())  # Remove extra spaces

:white_check_mark: 3. BERT Integration

You’re correct: use Hugging Face’s bert-large-uncased-whole-word-masking-finetuned-squad with their question-answering pipeline.

Key Considerations

  • Token Limit: BERT accepts at most 512 tokens per input (roughly 300-400 words, and the question itself counts toward that limit). If your README is longer:
    • Split the cleaned text into overlapping chunks (e.g., 384 tokens with 128-token overlap).
    • Use the stride parameter in the tokenizer to handle this.
  • Confidence Scores: The model returns answers with confidence scores. Filter answers with low scores (e.g., <0.1) to avoid nonsense.
Example QA Pipeline
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad"
)

# Example usage
result = qa_pipeline(question="What is the project license?", context=cleaned_text)
print(result["answer"])  # Output: "MIT"

:white_check_mark: 4. Flask App Integration

Workflow

  1. Upload Markdown (POST /upload):
    • Accept .md files
    • Clean Markdown → store cleaned text in memory/session
  2. Answer Questions (POST /ask):
    • Accept user questions
    • Run BERT QA on stored cleaned text
    • Return answer + confidence score

Example Flask Code

from flask import Flask, request, jsonify

# markdown_to_text and qa_pipeline are the helpers defined in the snippets above

app = Flask(__name__)
stored_context = None  # Or use a session/per-user store for multi-user apps

@app.route("/upload", methods=["POST"])
def upload_markdown():
    global stored_context
    file = request.files["file"]
    if file and file.filename.endswith(".md"):
        markdown_content = file.read().decode("utf-8")
        stored_context = markdown_to_text(markdown_content)
        return jsonify({"status": "success"})
    return jsonify({"error": "Invalid file"}), 400

@app.route("/ask", methods=["POST"])
def ask_question():
    global stored_context
    data = request.json
    if not stored_context:
        return jsonify({"error": "No passage uploaded"}), 400
    result = qa_pipeline(question=data["question"], context=stored_context)
    return jsonify(result)
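
Once the app is running locally (e.g., via flask run), a minimal sketch for exercising both endpoints with the requests library; the host and port are Flask’s defaults and an assumption here:

import requests

BASE = "http://127.0.0.1:5000"  # assumed local dev server

# Upload the README
with open("README.md", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"file": ("README.md", f)})
print(resp.json())

# Ask a question about it
resp = requests.post(f"{BASE}/ask", json={"question": "What is the project license?"})
print(resp.json())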

:white_check_mark: 5. Testing & Edge Cases

  • Test with a sample README.md (e.g., a GitHub project’s README)
  • Handle long documents:
    • If the cleaned text exceeds BERT’s token limit, use chunking (the pipeline can also handle this for you; see the sketch after this list):
      from transformers import AutoTokenizer
      
      # Use the same checkpoint as the QA model so the tokenizer matches it
      tokenizer = AutoTokenizer.from_pretrained(
          "bert-large-uncased-whole-word-masking-finetuned-squad"
      )
      inputs = tokenizer(
          cleaned_text,
          max_length=384,
          padding="max_length",
          truncation=True,
          return_overflowing_tokens=True,
          stride=128
      )
      
  • Validate inputs: Ensure users upload .md files and ask meaningful questions.
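
As an alternative to chunking by hand, the question-answering pipeline can split a long context into overlapping windows itself; a minimal sketch reusing qa_pipeline and cleaned_text from the earlier snippets, assuming a recent transformers version where these call parameters are available:

# The pipeline windows the context internally and returns the best span
result = qa_pipeline(
    question="What is the project license?",
    context=cleaned_text,
    max_seq_len=384,   # window size in tokens
    doc_stride=128,    # overlap between windows
    top_k=1,
)
print(result["answer"], result["score"])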

:white_check_mark: 6. Deployment Tips

  • Optimize for speed: Pass device=0 to the pipeline (or load the model with device_map="auto") for GPU acceleration.
  • Security: Sanitize the uploaded Markdown (and any HTML generated from it) so embedded HTML or scripts can’t be injected; a parser with an escape option, such as mistune, helps here.
  • Scalability: For production, store cleaned text in a database or cache (e.g., Redis) instead of memory.
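
For the scalability point, a minimal sketch of swapping the global variable for Redis, reusing cleaned_text from earlier (the key name and the local Redis instance are assumptions):

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance

# In /upload: store the cleaned text
r.set("readme_context", cleaned_text)

# In /ask: fetch it back
stored = r.get("readme_context")
context = stored.decode("utf-8") if stored else None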

:green_circle: Final Answer

Yes, you can:

  1. Convert Markdown to plain text using libraries like markdown2 + BeautifulSoup.
  2. Clean the text by removing syntax and special characters.
  3. Use the cleaned text as input to the BERT QA pipeline.
  4. Integrate everything into Flask with /upload and /ask endpoints.

This approach will let users upload a README.md and ask questions about its content! Let me know if you need code for specific steps.

@John6666, great write-up. This is very helpful. Thanks.


@John6666, I’ve integrated your changes into my code. Specifically, I’ve added the markdown-to-text function instead of just checking whether a character is alphanumeric or a space. Viewing the cleaned text, all pound signs, TOC brackets, etc. are removed. Double dashes and the triple backticks around code blocks are not removed. I’m not sure if this is causing a problem. The model mostly just returns the [CLS] token instead of an answer, or it gives only fragmentary words for the answer and no context. Regarding your statement above, “Code blocks and tables in Markdown can confuse the model”, I’d like the answers to include code blocks, as that text really qualifies as an answer. This is all running in the Flask app, so posting the entire site would be prohibitive.


The code generated by AI is only an example, so don’t use it as is…:sweat_smile:
Well, it may work in some cases.

You need to decide whether to remove all the Markdown information and convert it to text, or leave some of it and code it yourself, or have AI generate the code.
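
For example, if you want to keep the text inside code blocks as possible answers (as mentioned above) while still stripping the rest of the markup, one rough sketch (the regex patterns are assumptions, not a complete cleaner):

import re

def clean_but_keep_code(md):
    # Drop only the ``` fence lines; keep the code inside them as plain text
    md = re.sub(r"^```[^\n]*$", "", md, flags=re.MULTILINE)
    # Strip header and blockquote markers
    md = re.sub(r"^#{1,6}\s*", "", md, flags=re.MULTILINE)
    md = re.sub(r"^>\s?", "", md, flags=re.MULTILINE)
    # Keep link text, drop the URL
    md = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", md)
    return md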

https://stackoverflow.com/questions/60220842/how-should-properly-formatted-data-for-ner-in-bert-look-like

I think I understand all this, but perhaps I don’t fully understand it.

I’m trying to use the model bert-large-uncased-whole-word-masking-finetuned-squad to answer questions posed against a passage that is a single README.md file cleaned of certain text.

The model has been trained for question answering on datasets which include Wikipedia and BookCorpus.

As I understand it, the model has advanced logic to pick out answers to questions from the given passage (the cleaned README.md file).

What I don’t quite understand is:

  1. Depending on how clean the text (the README file) is, the model either can or cannot give an answer, is this correct? If so, how would I debug this?

  2. Am I using the correct model? Is it possible I should be using a model that has been trained on README.md files themselves?

  3. Would I have to build my own dataset to make this work? If I want the answers to include code snippets from the README, how would I tell it that the code is the right thing to return?


Hmm… I’m not an expert on NLP either, but BERT isn’t a model meant for Markdown structure analysis; if you use it, it would be for interpreting or processing the content. For document structure analysis, it would be more efficient to implement that part in Python code and analyze only the necessary parts with BERT, or to use a larger, more general-purpose LLM.
Changing the model itself is also an option, but I think the approach of letting the model do only what it is capable of will save computing resources.
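
A rough sketch of that idea: split the cleaned README into sections by heading in plain Python, pick the section that overlaps the question most, and only send that part to BERT (it reuses markdown_content and qa_pipeline from the earlier snippets; the overlap scoring is a naive assumption):

import re

def pick_relevant_section(markdown_content, question):
    # Split on first- and second-level headings
    sections = re.split(r"^#{1,2}\s+", markdown_content, flags=re.MULTILINE)
    q_words = set(question.lower().split())
    # Naive relevance score: word overlap with the question
    def overlap(section):
        return len(q_words & set(section.lower().split()))
    return max(sections, key=overlap)

question = "What is the project license?"
best_section = pick_relevant_section(markdown_content, question)
result = qa_pipeline(question=question, context=best_section)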