Please read the topic category description to understand what this is all about
Create your own search engine
In Chapter 5 of the Course, you learned how to use FAISS to find documents that are most semantically similar to a given query. The goal of this project is to extend this idea to build a retrieval and reranking system, where the retriever returns possibly relevant results, while the reranker evaluates how relevant these hits are to the query.
An example of the architecture might look as follows (taken from the sentence-transformers library):
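In code, the two stages might be sketched roughly like this. This is only a minimal illustration, not the sentence-transformers implementation: `retrieve_and_rerank`, `word_overlap` and `overlap_ratio` are hypothetical names, and the two toy scorers are stand-ins for a real retriever (for example embeddings searched with FAISS) and a real reranker model.

```python
from typing import Callable, List, Tuple

# A scorer maps (query, passage) to a relevance score; higher means more relevant.
Scorer = Callable[[str, str], float]


def retrieve_and_rerank(
    query: str,
    passages: List[str],
    retriever_score: Scorer,
    reranker_score: Scorer,
    retrieve_k: int = 32,
    top_k: int = 5,
) -> List[Tuple[float, str]]:
    # Stage 1: cheap scoring over every passage, keep the best retrieve_k candidates.
    candidates = sorted(
        passages, key=lambda p: retriever_score(query, p), reverse=True
    )[:retrieve_k]
    # Stage 2: more expensive scoring, applied only to the retrieved candidates.
    reranked = sorted(
        ((reranker_score(query, p), p) for p in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return reranked[:top_k]


# Toy stand-in for the retriever: number of query words found in the passage.
def word_overlap(query: str, passage: str) -> float:
    query_words = set(query.lower().split())
    return float(len(query_words & set(passage.lower().split())))


# Toy stand-in for the reranker: fraction of passage words that are query words.
def overlap_ratio(query: str, passage: str) -> float:
    query_words = set(query.lower().split())
    words = passage.lower().split()
    return sum(w in query_words for w in words) / max(len(words), 1)
```

Calling `retrieve_and_rerank(query, passages, word_overlap, overlap_ratio)` returns up to `top_k` `(score, passage)` pairs. Swapping in real models only means replacing the two scorer functions.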
Implementing the full retriever-reranking architecture might be a challenge, so a simpler place to start is with a single long document. You can then chunk that document into paragraphs and compute a relevance score for each paragraph with respect to the query.
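As a rough sketch of that simpler starting point, the snippet below splits one document on blank lines and scores every paragraph against a query. The assumptions are loud ones: paragraphs are taken to be separated by blank lines, and the bag-of-words cosine similarity is only a placeholder for a real embedding model, so the function names here (`split_paragraphs`, `bow`, `cosine`, `top_paragraphs`) are all hypothetical.

```python
import math
import re
from collections import Counter


def split_paragraphs(document):
    # Assumption: paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def bow(text):
    # Bag-of-words vector: lowercase token -> count.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_paragraphs(query, document, k=5):
    # Score every paragraph against the query and return the k best (score, paragraph) pairs.
    query_vec = bow(query)
    scored = [(cosine(query_vec, bow(p)), p) for p in split_paragraphs(document)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
```

With an embedding model, `bow` would be replaced by the model's encoding step and `cosine` by a similarity over dense vectors. The chunking and top-k logic would stay the same.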
Desired project outcomes
Create a Streamlit or Gradio app on Spaces that allows a user to enter a search query about a document (or a whole corpus of documents), and returns the top 5 most relevant paragraphs.
Don't forget to push all your models and datasets to the Hub so others can build on them!
Hey @abhibisht89, cool to hear that you're tackling this project! I've created a Discord channel (see topic description) in case you and others want to use it.
Hi @kzuri, it seems like this project already has 4 team members, so just double check on Discord if that's the case.
If it is, you can either do the project by yourself using your own compute (we reserve the Amazon SageMaker compute for teams), or pick / propose another project in #course:course-event
Thanks. We use the "simplewiki-2020-11-01.jsonl.gz" dataset due to memory constraints. It is much smaller than the complete Wikipedia dataset on Hugging Face.
This is the structure of the dataset:
{'id': '9824', 'title': 'Aileen Wuornos', 'paragraphs': ['Aileen Carol Wuornos Pralle (born Aileen Carol Pittman; February 29, 1956\xa0– October 9, 2002) was an American serial killer. She was born in Rochester, Michigan. She confessed to killing six men in Florida and was executed in Florida State Prison by lethal injection for the murders. Wuornos said that the men she killed had raped her or tried to rape her while she was working as a prostitute.', 'Wuornos was diagnosed with antisocial personality disorder and borderline personality disorder.', 'The movie, "Monster" is about her life. Two documentaries were made about her.', 'Wuornos was born Aileen Carol Pittman in Rochester, Michigan. She never met her father. Wuornos was adopted by her grandparents. When she was 13 she became pregnant. She started working as a prostitute when she was 14.']}
I would say this dataset is much cleaner. We use the "paragraphs" field and create the embeddings from it.
Not much cleaning is needed, since the text is already in good shape.
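For anyone who wants to reproduce this step, here is a minimal sketch of how the file could be read into a flat list of passages. It assumes one JSON object per line with the `id`, `title` and `paragraphs` fields shown in the example record above. `load_passages` is a hypothetical helper, not the code the team actually used.

```python
import gzip
import json


def load_passages(path):
    # Read a gzipped JSON Lines file and return one entry per paragraph.
    passages = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            article = json.loads(line)
            for paragraph in article["paragraphs"]:
                # Keep the article id and title so a hit can be traced back to its source.
                passages.append(
                    {"id": article["id"], "title": article["title"], "text": paragraph}
                )
    return passages
```

The `text` values are what would then be passed to the embedding model, for example `[p["text"] for p in load_passages("simplewiki-2020-11-01.jsonl.gz")]`.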