Exploring contexts of occurrence of particular words in large datasets

Hi everybody, how are you? I am currently working on a project where we would like to explore and retrieve the contexts of occurrence of particular words or n-grams in large datasets used to train language models, such as GitHub - josecannete/spanish-corpora: Unannotated Spanish 3 Billion Words Corpora.

As you can imagine, the problem is that when dealing with such large datasets, conventional strategies such as loading everything with pandas require a lot of RAM and computing power, so here are my questions:

  • Does the platform already have any tools for carrying out different types of searches on large datasets that would facilitate this task?
  • Is there some kind of server/service within the platform with enough RAM and computing power that we can access to load the full datasets and use an API to interact from our Space?

Thank you very much! Hernán

@nanom I’ll take a shot at providing some assistance. I am still a beginner with the Hugging Face suite, but I’ve been using various parts of it recently.

Does the platform already have any tools for carrying out different types of searches on large datasets that would facilitate this task?

Perhaps one thing to consider is the datasets library. From what I gather, it uses Apache Arrow under the hood to memory-map the data, so loading and processing do not require holding the whole corpus in RAM. Within datasets there is a map() function that I have used extensively with great success. If your dataset is somewhat customized, it might be worthwhile to build a loading script for the datasets object and then run map() over the data to perform your searches/calculations. I have done both of these recently and am happy to share my experience if you think it would benefit you.
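To make that concrete, here is a minimal sketch of the kind of batched function you could hand to map(). The function itself is plain Python, so the demo below runs standalone on a dict shaped like a batch; the column name "text", the target word, and the dataset identifier in the commented lines are all assumptions you would adapt to your corpus.

```python
def find_contexts(batch, target="modelo", window=5):
    """Collect a +/- `window`-token context around every occurrence
    of `target` in a batch of text lines (whitespace tokenization)."""
    contexts = []
    for line in batch["text"]:  # "text" column name is an assumption
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                contexts.append(" ".join(tokens[start:i + window + 1]))
    return {"context": contexts}

# Standalone demo on a plain dict shaped like a datasets batch:
demo = {"text": ["el modelo aprende rapido", "sin coincidencias aqui"]}
print(find_contexts(demo, target="modelo", window=2))

# With the real library it would look roughly like this (dataset name
# is a guess -- substitute whatever loading script/repo you end up using):
#
#   from datasets import load_dataset
#   ds = load_dataset("josecannete/spanish-corpora", split="train")
#   hits = ds.map(find_contexts, batched=True,
#                 remove_columns=ds.column_names)
```

Returning a fresh "context" column and dropping the originals via remove_columns matters here, because a batched map() is allowed to return a different number of rows than it received, which is exactly what a search produces.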

Is there some kind of server/service within the platform with enough RAM and computing power that we can access to load the full datasets and use an API to interact from our Space?

This one I’m not super certain of. If I read your question correctly, you’re asking about the possibility of loading some data, a model, and a training routine onto compute hardware on Hugging Face’s end that has a lot of RAM (and possibly GPUs) available to run the pipeline. If that is the case, then perhaps the hardware solutions and/or the HF services might be of interest.