Exploring contexts of occurrence of particular words in large datasets

Hi everybody, how are you? I am currently working on a project where we would like to explore and retrieve the contexts of occurrence of particular words or n-grams in large datasets used to train language models, such as GitHub - josecannete/spanish-corpora: Unannotated Spanish 3 Billion Words Corpora.

As you can imagine, the problem is that when dealing with such large datasets, conventional strategies such as loading everything with pandas require a lot of RAM and computing power, so here are my questions:

  • Does the platform already have any tools for carrying out different types of searches on large datasets that would facilitate this task?
  • Is there some kind of server/service within the platform with enough RAM and computing power that we can access to load the full datasets and use an API to interact from our Space?

Thank you very much! Hernán

@nanom I’ll take a shot at providing some assistance. I am still a beginner with the Hugging Face suite, but I’ve been using various parts of it recently.

Does the platform already have any tools for carrying out different types of searches on large datasets that would facilitate this task?

Perhaps one thing to consider is the datasets library. From what I gather, it uses Apache Arrow under the hood to memory-map the data, so loading and processing do not require holding the whole corpus in RAM. Within datasets there is a map() function that I have used extensively with great success. If your dataset is somewhat customized, it might be worthwhile to build a loading script for the datasets object and then run map() over the data to perform your searches/calculations. I have done both of these recently and am happy to share my experience if you think it would benefit you.
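To make that concrete, here is a minimal sketch of the kind of batched function you could hand to map(). The function itself is plain Python, so the demo below runs standalone on a dict shaped like a batch; the column name "text", the target word, and the dataset identifier in the commented lines are all assumptions you would adapt to your corpus.

```python
def find_contexts(batch, target="modelo", window=5):
    """Collect a +/- `window`-token context around every occurrence
    of `target` in a batch of text lines (whitespace tokenization)."""
    contexts = []
    for line in batch["text"]:  # "text" column name is an assumption
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                contexts.append(" ".join(tokens[start:i + window + 1]))
    return {"context": contexts}

# Standalone demo on a plain dict shaped like a datasets batch:
demo = {"text": ["el modelo aprende rapido", "sin coincidencias aqui"]}
print(find_contexts(demo, target="modelo", window=2))

# With the real library it would look roughly like this (dataset name
# is a guess -- substitute whatever loading script/repo you end up using):
#
#   from datasets import load_dataset
#   ds = load_dataset("josecannete/spanish-corpora", split="train")
#   hits = ds.map(find_contexts, batched=True,
#                 remove_columns=ds.column_names)
```

Returning a fresh "context" column and dropping the originals via remove_columns matters here, because a batched map() is allowed to return a different number of rows than it received, which is exactly what a search produces.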

Is there some kind of server/service within the platform with enough RAM and computing power that we can access to load the full datasets and use an API to interact from our Space?

This one I’m not super certain of. If I read your question correctly, you’re asking about the possibility of loading some data, a model, and a training routine onto compute hardware on Hugging Face’s end that has a lot of RAM (and possibly GPUs) available to run the pipeline. If that is the case, then perhaps the hardware solutions and/or the HF services might be of interest.