Create your own search engine

lewtun · November 10, 2021, 4:50pm

Please read the topic category description to understand what this is all about

In Chapter 5 of the Course, you learned how to use FAISS to find the documents that are most semantically similar to a given query. The goal of this project is to extend this idea to build a retrieval and reranking system: the retriever returns possibly relevant results, and the reranker evaluates how relevant these hits are to the query.
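To make the two stages concrete, here is a rough sketch of how the pieces could fit together. The `retrieve_and_rerank` function and the two toy scorers are hypothetical names invented for this illustration; in a real system the first scorer would typically be a FAISS or bi-encoder similarity and the second a cross-encoder, neither of which is shown here.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # (query, passage) -> relevance score


def retrieve_and_rerank(
    query: str,
    passages: List[str],
    retrieve_score: Scorer,
    rerank_score: Scorer,
    retrieve_k: int = 100,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage search: a cheap scorer shortlists, a costlier scorer reorders."""
    # Stage 1: keep the retrieve_k passages the cheap scorer rates highest.
    shortlist = sorted(
        passages, key=lambda p: retrieve_score(query, p), reverse=True
    )[:retrieve_k]
    # Stage 2: rescore only the shortlist with the more expensive scorer.
    reranked = sorted(
        ((p, rerank_score(query, p)) for p in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_k]


# Toy stand-ins for real models (assumptions for illustration only).
def word_overlap(query: str, passage: str) -> float:
    """Number of distinct words shared by query and passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def overlap_ratio(query: str, passage: str) -> float:
    """Shared words divided by passage length, so shorter matches rank higher."""
    words = passage.lower().split()
    return word_overlap(query, passage) / len(words) if words else 0.0
```

The point of the split is cost: the expensive scorer only ever sees `retrieve_k` passages, however large the corpus is.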

An example of the architecture might look as follows (taken from the sentence-transformers library):

Model(s)

The sentence-transformers models on the Hub are great for the reranking task.

Datasets

Wikipedia is usually a good corpus to test retrieval systems on and you can find a dump in various languages here:

wikipedia

Challenges

Implementing the full retriever-reranker architecture might be a challenge, so a simpler place to start is with a single long document. You can then chunk that document into paragraphs and compute a relevance score for each paragraph.
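As a starting point, the single-document version could look something like the sketch below. It assumes paragraphs are separated by blank lines, and it uses a plain bag-of-words cosine similarity as a stand-in for a real embedding model, so it only matches exact words rather than meaning. All function names here are made up for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def split_paragraphs(document: str) -> List[str]:
    """Split on blank lines and drop empty chunks."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude placeholder for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def rank_paragraphs(query: str, document: str, top_k: int = 5) -> List[Tuple[float, str]]:
    """Return the top_k (score, paragraph) pairs, best first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(p)), p) for p in split_paragraphs(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

Swapping `bag_of_words` and `cosine` for sentence embeddings and their similarity would keep the rest of the structure unchanged.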

Desired project outcomes

Create a Streamlit or Gradio app on Spaces that allows a user to enter a search query about a document (or a whole corpus of documents), and returns the top 5 most relevant paragraphs.
Don’t forget to push all your models and datasets to the Hub so others can build on them!

Additional resources

https://www.sbert.net/examples/applications/retrieve_rerank/README.html#retrieve-re-rank

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

Follow the instructions on the #join-course channel
Join the neural-search-engine channel

Just make sure you comment here to indicate that you’ll be contributing to this project

abhibisht89 · November 15, 2021, 11:52am

Hi, I will be working on this cool project during the event.

lewtun · November 15, 2021, 1:47pm

Hey @abhibisht89, cool to hear that you’re tackling this project! I’ve created a Discord channel (see topic description) in case you and others want to use it

algomuffin · November 15, 2021, 5:44pm

Hi, I would like to work on this project during the event

Frasco996 · November 15, 2021, 6:06pm

Hi, I would like to work on this project. It’s very interesting.

wilmerags · November 15, 2021, 6:57pm

Hi I would like to be part of this project!

kzuri · November 16, 2021, 12:56pm

Hi I will be working on this project

lewtun · November 16, 2021, 1:10pm

Hi @kzuri it seems like this project already has 4 team members, so just double check on Discord if that’s the case.

If it is, you can either do the project by yourself using your own compute (we reserve the Amazon SageMaker compute for teams), or pick / propose another project in #course:course-event

kzuri · November 17, 2021, 4:37pm

Hi @lewtun. Yes, I have already joined the Discord channel. I thought I might as well update here that I am working on this project.

Matthieu · November 23, 2021, 5:33pm

Hi @algomuffin @kzuri @abhibisht89 very nice web app!

However, how did you deal with the wikipedia dataset and transform it into a list of passages?

When loading the Wikipedia dataset, the text field contains a lot of undesirable text such as titles, bibliographies, external links…

How did you filter them out?

Thanks!

abhibisht89 · November 24, 2021, 5:59am

Hi,

Thanks. We use the “simplewiki-2020-11-01.jsonl.gz” dataset due to memory constraints. It is way smaller than the complete Wikipedia dataset on Hugging Face.

This is the structure of the dataset:

{'id': '9824', 'title': 'Aileen Wuornos', 'paragraphs': ['Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.', 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.', 'The movie, "Monster" is about her life. Two documentaries were made about her.', 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14.']}

This dataset is much cleaner, I would say. We use the “paragraphs” field and create the embeddings from it.

Not much cleaning is needed in this dataset as things are pretty clean.

Hope this will help
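For anyone wanting to reproduce this step, a file with the structure shown above could be flattened into a list of passages roughly like this. It is only a sketch: it assumes one JSON object per line with `id`, `title` and `paragraphs` keys, and `load_passages` is a name made up for the example, not the code the team actually used.

```python
import gzip
import json
from typing import Dict, List


def load_passages(path: str) -> List[Dict[str, str]]:
    """Read a gzipped JSON Lines file and return one record per paragraph."""
    passages = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines in the file
            article = json.loads(line)
            for paragraph in article.get("paragraphs", []):
                text = paragraph.strip()
                if text:  # skip empty or whitespace-only paragraphs
                    passages.append(
                        {"id": article["id"], "title": article["title"], "text": text}
                    )
    return passages
```

Keeping the `id` and `title` next to each paragraph makes it easy to show which article a search hit came from.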

Fadela13 · December 29, 2024, 3:53pm

Hi, can I work on this project now?
