Create your own search engine

abhibisht89 · November 24, 2021, 5:59am

Hi,

Thanks. We use the "simplewiki-2020-11-01.jsonl.gz" dataset because of memory constraints. It is much smaller than the complete Wikipedia dataset on Hugging Face.

This is the structure of the dataset:

{'id': '9824', 'title': 'Aileen Wuornos', 'paragraphs': ['Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.', 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.', 'The movie, "Monster" is about her life. Two documentaries were made about her.', 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14.']}
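For anyone who wants to read the file themselves, here is a minimal sketch, assuming each line of the .jsonl.gz file is one JSON record shaped like the one above. The helper name `iter_passages` is made up for illustration and only the standard library is used.

```python
import gzip
import json


def iter_passages(path):
    """Yield (title, paragraph) pairs from a gzipped JSON Lines file.

    Assumes one JSON object per line with 'title' and 'paragraphs' keys,
    as in the sample record shown in this post.
    """
    with gzip.open(path, "rt", encoding="utf8") as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            article = json.loads(line)
            for paragraph in article["paragraphs"]:
                yield article["title"], paragraph
```

Calling `list(iter_passages("simplewiki-2020-11-01.jsonl.gz"))` would give a flat list of passages. Keeping the title next to each paragraph makes it easy to show where a search hit came from.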

This dataset is much cleaner, I would say. We use the "paragraphs" field and create the embeddings from it.

Not much cleaning is needed, as the data is already pretty clean.
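To show the overall flow (embed every paragraph once, embed the query, rank by cosine similarity), here is a rough sketch. The post does not say which model is used, so `toy_embed` below is only a stand-in that builds normalised word-count vectors. In practice you would swap in a real sentence-embedding model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedder: L2-normalised bag-of-words counts (not a neural model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Dot product of two sparse unit vectors, which equals their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


def build_index(paragraphs, embed=toy_embed):
    """Embed every paragraph once, up front."""
    return [(p, embed(p)) for p in paragraphs]


def search(query, index, embed=toy_embed, top_k=3):
    """Return the top_k (score, paragraph) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, vec), p) for p, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Example with two paragraphs taken from the sample record
paragraphs = [
    "Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.",
    "The movie, \"Monster\" is about her life. Two documentaries were made about her.",
]
index = build_index(paragraphs)
for score, paragraph in search("movie about her life", index, top_k=1):
    print(round(score, 3), paragraph)
```

The point of the `embed` parameter is that the search code does not need to change when the toy embedder is replaced. A real model would return dense vectors, so `cosine` would need to be adapted to match.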

Hope this helps.
