I’m new to this so let me know if I don’t have this right. I’m trying to integrate AI Q&A functionality into an existing Flask app. As I understand it, I can use a BERT model such as Google’s bert-large-uncased-whole-word-masking-finetuned-squad to answer questions given a passage file. I want the passage to be the text contained in a GitHub-flavored README Markdown file (just one file). What format do I need the passage file to be in? Is there a process for cleaning the Markdown file? Can I just convert the Markdown to text and use it?
BERT cannot handle Markdown, so you need to convert it to plain text.
Answer generated by HuggingChat:
You’re on the right track! Let’s break this down step-by-step for clarity:
1. Required Format for the Passage
The BERT model requires plain text input. You cannot use raw Markdown directly because:
- BERT tokenization ignores Markdown syntax (headers, lists, links, etc.)
- Formatting characters (`#`, `*`, `-`, etc.) add noise to the model’s input
- Code blocks and tables in Markdown can confuse the model
2. Cleaning Markdown: Process
You need to convert Markdown to clean text while preserving semantic meaning. Here’s how:
Steps to Clean Markdown
- Remove code blocks (e.g., triple-backtick sections)
- Strip Markdown syntax:
  - Headers (`#`, `##`, etc.)
  - List markers (`-`, `*`, `1.`)
  - Links (`[text](url)`) → keep the text only
  - Images (`![alt](url)`) → optional: remove or add a placeholder
- Preserve paragraphs (keep line breaks for context)
- Remove special characters (e.g., `>`, `---`, `|` for tables)
Tools for Cleaning
Use Python libraries like:
- `markdown2` (converts Markdown to HTML, then strip tags)
- `mistune` (fast Markdown parser)
- `BeautifulSoup` (for HTML tag cleanup)
Example Code
```python
import markdown2
from bs4 import BeautifulSoup

def markdown_to_text(markdown_content):
    # Convert Markdown to HTML
    html = markdown2.markdown(markdown_content)
    # Strip HTML tags and extra whitespace
    text = BeautifulSoup(html, "html.parser").get_text()
    return " ".join(text.split())  # Remove extra spaces
```
3. BERT Integration
You’re correct: use Hugging Face’s `bert-large-uncased-whole-word-masking-finetuned-squad` with their `question-answering` pipeline.
Key Considerations
- Token Limit: BERT has a 512-token limit (on the order of 300–400 English words, since words are often split into multiple subword tokens). If your README is longer:
  - Split the cleaned text into overlapping chunks (e.g., 384 tokens with 128-token overlap).
  - Use the `stride` parameter in the tokenizer to handle this.
- Confidence Scores: The model returns answers with confidence scores. Filter out answers with low scores (e.g., < 0.1) to avoid nonsense.
Example QA Pipeline
```python
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad"
)

# Example usage
result = qa_pipeline(question="What is the project license?", context=cleaned_text)
print(result["answer"])  # e.g., "MIT"
```
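Building on the confidence-score point above, a minimal filter might look like this (the 0.1 cutoff is an arbitrary starting point — tune it against your own README and questions):

```python
def filter_answer(result: dict, min_score: float = 0.1):
    """Return the answer text only if the pipeline's confidence clears the threshold.

    `result` is the dict the question-answering pipeline returns,
    containing at least "answer" and "score" keys.
    """
    if result.get("score", 0.0) >= min_score:
        return result["answer"]
    return None  # treat as "no reliable answer found"
```

In the Flask endpoint you can then return a "no answer found" message instead of a nonsense span when this returns `None`.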
4. Flask App Integration
Workflow
- Upload Markdown (`POST /upload`):
  - Accept `.md` files
  - Clean Markdown → store cleaned text in memory/session
- Answer Questions (`POST /ask`):
  - Accept user questions
  - Run BERT QA on stored cleaned text
  - Return answer + confidence score
Example Flask Code
```python
from flask import Flask, request, jsonify

app = Flask(__name__)
stored_context = None  # Or use session for multi-user

@app.route("/upload", methods=["POST"])
def upload_markdown():
    global stored_context
    file = request.files["file"]
    if file and file.filename.endswith(".md"):
        markdown_content = file.read().decode("utf-8")
        stored_context = markdown_to_text(markdown_content)
        return jsonify({"status": "success"})
    return jsonify({"error": "Invalid file"}), 400

@app.route("/ask", methods=["POST"])
def ask_question():
    data = request.json
    if not stored_context:
        return jsonify({"error": "No passage uploaded"}), 400
    result = qa_pipeline(question=data["question"], context=stored_context)
    return jsonify(result)
```
5. Testing & Edge Cases
- Test with a sample README.md (e.g., a GitHub project’s README)
- Handle long documents:
  - If the cleaned text exceeds BERT’s token limit, use chunking:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
inputs = tokenizer(
    cleaned_text,
    max_length=384,
    padding="max_length",
    truncation=True,
    return_overflowing_tokens=True,
    stride=128,
)
```

- Validate inputs: Ensure users upload `.md` files and ask meaningful questions.
6. Deployment Tips
- Optimize for speed: run the pipeline on GPU, e.g., by passing `device=0` (or `device_map="auto"` when loading the model) to `transformers`.
- Security: Sanitize Markdown input so untrusted HTML can’t slip through to your pages (e.g., use a parser with escaping enabled, such as `mistune`, instead of `markdown2`).
- Scalability: For production, store cleaned text in a database or cache (e.g., Redis) instead of memory.
Final Answer
Yes, you can:
- Convert Markdown to plain text using libraries like `markdown2` + `BeautifulSoup`.
- Clean the text by removing syntax and special characters.
- Use the cleaned text as input to the BERT QA pipeline.
- Integrate everything into Flask with `/upload` and `/ask` endpoints.
This approach will let users upload a README.md and ask questions about its content! Let me know if you need code for specific steps.
@John6666 , great write up. This is very helpful. Thanks.