Hi everyone,
I hope you’re all doing well. I’m relatively new here and I’ve got some ideas brewing that I could use some guidance on. I’ll do my best to explain them clearly, but apologies in advance if I misuse any terminology.
Problem/Idea:
I’ve got a sizable dataset of social media posts at my disposal. Let’s say I’m working with around 600 million Twitter posts (although for testing purposes, I’ll start with just 1 million). My goal is to train a model to function somewhat like a database, but queried in natural language. For instance, I want to be able to ask questions like, “What are the top two time ranges for posting on Fridays at 5 PM, based on likes, impressions, etc.?” and have the model provide relevant insights based on the past posts it’s been trained on. Here are my specific questions:
Appropriateness of Approach: Am I on the right track with this approach, or is there a better tool/method for achieving what I’m aiming for? Essentially, I want to leverage a large dataset to enable the model to generate responses.
Training and Pretraining: While I’ve come across plenty of guides and tutorials on fine-tuning models, this task seems somewhat different. It feels more like training or pretraining. For instance, I have JSON data representing individual posts, but I’m unsure how to structure the input for the model. I don’t have a clear “question-answer” format; instead, I have raw data for training purposes. Could you recommend any resources or tutorials that focus on this aspect?
Here is an example of one publication from the dataset:
{'publication_id': '659186612f8c52522243709b',
'social_profile_id': 3333333,
'platform_media_type': 'Twitter Post',
'description': 'Nice to see everyone!',
'geographic_location': '',
'url': 'https://www.twitter.com/samplepost',
'comments': 1,
'impressions': 0,
'likes': 43,
'saves': 0,
'video_views': 0,
'engagement_rate': 0.0006439244266877405,
'engagements': 44,
'audience_size': 68331,
'mentions': ['mention1', 'mention2'],
'hashtags': ['hashtag1']}
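To make the input-structure question concrete, here is a minimal sketch of one way a record like this could be flattened into a single line of text for training. The function name `record_to_text` and the "key: value | key: value" template are my own assumptions for illustration, not an established format, and I don't know whether a model would actually learn useful aggregates from text like this.

```python
def record_to_text(record: dict) -> str:
    """Flatten one post record into a single 'key: value | key: value' line.

    Hypothetical helper: the template is an assumption, not a standard format.
    """
    parts = []
    for key, value in record.items():
        # Join list fields (mentions, hashtags) into a comma-separated string.
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        # Skip empty fields such as a blank geographic_location.
        if value == "" or value is None:
            continue
        parts.append(f"{key}: {value}")
    return " | ".join(parts)


# A shortened version of the record above.
sample = {
    "platform_media_type": "Twitter Post",
    "description": "Nice to see everyone!",
    "geographic_location": "",
    "likes": 43,
    "hashtags": ["hashtag1"],
}

print(record_to_text(sample))
# platform_media_type: Twitter Post | description: Nice to see everyone! | likes: 43 | hashtags: hashtag1
```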
Model Recommendations: Do you have any specific models in mind that would be well-suited for this task?
I have already considered a different approach (someone on the Discord suggested the same idea): a model that translates natural-language questions into database queries. However, I would first like to test whether I can train the model directly on the data.
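For reference, this is roughly how I picture the database side of that alternative: the model would only need to produce a query like the one below, and the database would do the aggregation. It is a minimal sketch using Python's built-in `sqlite3` module. The `posts` table, the `top_hours` helper, and especially the `posted_at` column are assumptions, since my sample record above does not include a timestamp, and the rows are made-up test data.

```python
import sqlite3

# In-memory database with a hypothetical, simplified posts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (publication_id TEXT, likes INTEGER, posted_at TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("a", 43, "2024-01-05 17:10:00"),  # Friday, 17h
        ("b", 10, "2024-01-05 09:30:00"),  # Friday, 09h
        ("c", 25, "2024-01-12 17:45:00"),  # Friday, 17h
        ("d", 99, "2024-01-06 17:00:00"),  # Saturday, ignored for Friday
        ("e", 30, "2024-01-12 12:00:00"),  # Friday, 12h
    ],
)


def top_hours(conn, weekday: str, limit: int = 2):
    """Return the hours with the highest average likes on a given weekday.

    weekday follows SQLite's strftime('%w') convention: '0' is Sunday,
    '5' is Friday.
    """
    query = """
        SELECT strftime('%H', posted_at) AS hour, AVG(likes) AS avg_likes
        FROM posts
        WHERE strftime('%w', posted_at) = ?
        GROUP BY hour
        ORDER BY avg_likes DESC
        LIMIT ?
    """
    return conn.execute(query, (weekday, limit)).fetchall()


print(top_hours(conn, "5"))
# [('17', 34.0), ('12', 30.0)]
```

In this setup the language model never sees the 600 million rows; it would only have to turn a question into something like the SQL string inside `top_hours`.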
Thanks in advance for your help!