Seeking Guidance on Training a Natural Language Model with Large Social Media Dataset for Query-Base

Hi everyone,

I hope you’re all doing well. I’m relatively new here and I’ve got some ideas brewing that I could use some guidance on. I’ll do my best to explain them clearly, but apologies in advance if I misuse any terminology.

Problem/Idea:

I’ve got a sizable dataset of social media posts at my disposal. Let’s say I’m working with around 600 million Twitter posts (although for testing purposes, I’ll start with just 1 million). My goal is to train a model that functions somewhat like a database, but queried in natural language. For instance, I want to be able to ask questions like, “What are the top two time ranges for posting on Fridays at 5 PM, based on likes, impressions, etc.?” and have the model provide relevant insights based on the past posts it’s been trained on. Here are my specific questions:

Appropriateness of Approach: Am I on the right track with this approach, or is there a better tool/method for achieving what I’m aiming for? Essentially, I want to leverage a large dataset to enable the model to generate responses.

Training and Pretraining: While I’ve come across plenty of guides and tutorials on fine-tuning models, this task seems somewhat different. It feels more like training or pretraining from raw data. For instance, I have JSON data representing individual posts, but I’m unsure how to structure the input for the model. I don’t have a clear “question-answer” format; instead, I only have the raw posts. Could you recommend any resources or tutorials that focus on this aspect? I’ll include an example of one publication below.

This is an example of one publication from the dataset:


{'publication_id': '659186612f8c52522243709b',
 'social_profile_id': 3333333,
 'platform_media_type': 'Twitter Post',
 'description': 'Nice to see everyone!',
 'geographic_location': '',
 'url': 'https://www.twitter.com/samplepost',
 'comments': 1,
 'impressions': 0,
 'likes': 43,
 'saves': 0,
 'video_views': 0,
 'engagement_rate': 0.0006439244266877405,
 'engagements': 44,
 'audience_size': 68331,
 'mentions': ['mention1', 'mention2'],
 'hashtags': ['hashtag1']}
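
To make the input-structure question more concrete, here is a minimal sketch of one way I could imagine flattening a post into a single line of plain text for raw-text training. The `post_to_text` helper and the `key: value | key: value` layout are purely my own assumptions, not an established format, and I don't know whether a model would actually learn anything useful from text like this.

```python
def post_to_text(post: dict) -> str:
    """Flatten one post into a 'key: value | key: value' line (hypothetical layout)."""
    parts = []
    for key, value in post.items():
        # Join list fields such as mentions or hashtags into one comma-separated string.
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        # Skip empty fields (e.g. a blank geographic_location); a numeric 0 is kept.
        if value == "" or value is None:
            continue
        parts.append(f"{key}: {value}")
    return " | ".join(parts)


# A shortened version of the example publication above.
sample = {
    "platform_media_type": "Twitter Post",
    "description": "Nice to see everyone!",
    "geographic_location": "",
    "likes": 43,
    "hashtags": ["hashtag1"],
}

print(post_to_text(sample))
```

Each post would become one training line, so the model would only ever see individual posts, never aggregates across posts.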

Model Recommendations: Do you have any specific models in mind that would be well-suited for this task?

I have already thought about a different approach (and somebody in the Discord suggested the same idea): a model that translates natural-language questions into queries against the database. But I would first like to test whether I can train the model directly on the data.
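
To make the database idea concrete, here is a minimal sketch of how I understand it, using Python's built-in `sqlite3` module. The rows are made up (only the first one reuses numbers from my example publication), and `question_to_sql` is a hypothetical placeholder that returns a hard-coded query where the translation model would go.

```python
import sqlite3

# Made-up rows for illustration; the first reuses values from the example post.
posts = [
    {"publication_id": "a1", "likes": 43, "comments": 1, "audience_size": 68331},
    {"publication_id": "b2", "likes": 10, "comments": 4, "audience_size": 1200},
    {"publication_id": "c3", "likes": 97, "comments": 12, "audience_size": 50000},
]

# Load the posts into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts ("
    "publication_id TEXT PRIMARY KEY, likes INTEGER, "
    "comments INTEGER, audience_size INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (:publication_id, :likes, :comments, :audience_size)",
    posts,
)


def question_to_sql(question: str) -> str:
    """Hypothetical stand-in for a natural-language-to-SQL model.

    It ignores the question and returns a fixed query, just to show the flow.
    """
    return "SELECT publication_id, likes FROM posts ORDER BY likes DESC LIMIT 2"


# The database, not the model, computes the answer.
sql = question_to_sql("Which two posts have the most likes?")
rows = conn.execute(sql).fetchall()
print(rows)
```

In this setup the model never has to memorize the posts; it only has to produce the query, and the numbers come straight from the data.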

Thanks in advance for your help!