Generate mock but realistic data using NLP

I want to generate realistic test data. It should support customizable fields, so that the structure of the data is defined by specifying field names and types. It should support a wide range of data types, including names, addresses, email addresses, electrical products, household products, etc.

I plan to use a small language model or a prefix-based LLM, because I don't want full sentences, just a word or a few words. The model's response should be in JSON format. I want to implement something similar to https://www.mockaroo.com/ using AI.

The main purpose of the model is to replace data in a production database.
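
A minimal sketch of the field specification and JSON response described above, assuming the field names, the `SCHEMA` mapping, and the `validate_record` helper (all hypothetical, not from any library). The model call is replaced by a hard-coded string, so this only shows how a response could be checked against the requested fields and types.

```python
import json

# Hypothetical field specification: field name -> expected Python type.
SCHEMA = {
    "full_name": str,
    "email": str,
    "household_product": str,
    "quantity": int,
}


def validate_record(raw_response: str, schema: dict) -> dict:
    """Parse a model response and check it against the field specification."""
    record = json.loads(raw_response)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise ValueError(
            f"missing fields: {sorted(missing)}, unexpected fields: {sorted(extra)}"
        )
    for name, expected_type in schema.items():
        value = record[name]
        # This sketch has no boolean fields, so bool is rejected outright
        # (bool is a subclass of int and would otherwise pass the int check).
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
    return record


# Stand-in for a model response; a real model call would go here.
fake_response = (
    '{"full_name": "Maria Keller", "email": "maria.keller@example.com", '
    '"household_product": "steam iron", "quantity": 2}'
)
record = validate_record(fake_response, SCHEMA)
print(record)
```

Rejecting any response that fails the check and asking the model again is one simple way to keep malformed records out of the generated data set.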


I can help you with that, but you would have to manage the resources, such as GPUs, and you would first have to give me about 5k samples of the kind of data you want to generate.


I play a lot with LLMs, and while they are creative, they are not very good at producing strictly formatted output.

So, what about using function calling? I have only used it a few times myself.
If you want LLMs to manipulate routine data, you can expect better-than-usual performance that way.

For specific code, it would be quicker to find a Space with a similar use case in Spaces and imitate it.
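
As a rough illustration of the function-calling idea, here is a sketch that turns a field list into a function description whose parameters are a JSON Schema object. The wrapper layout follows the OpenAI-style `tools` format; other providers and libraries use slightly different keys, so treat it as an assumption. The names `FIELDS`, `build_tool`, `parse_arguments`, and `emit_mock_record` are hypothetical, and the model's tool call is replaced by a hard-coded string.

```python
import json

# Hypothetical field specification: field name -> JSON Schema type.
FIELDS = {"full_name": "string", "email": "string", "electrical_product": "string"}


def build_tool(fields: dict) -> dict:
    """Describe one mock record as a function whose arguments are the fields."""
    return {
        "type": "function",
        "function": {
            "name": "emit_mock_record",  # hypothetical function name
            "description": "Return one realistic mock record.",
            "parameters": {
                "type": "object",
                "properties": {name: {"type": kind} for name, kind in fields.items()},
                "required": list(fields),
                "additionalProperties": False,
            },
        },
    }


def parse_arguments(arguments_json: str, tool: dict) -> dict:
    """Decode the arguments string of a tool call and check the required keys."""
    args = json.loads(arguments_json)
    required = tool["function"]["parameters"]["required"]
    missing = [name for name in required if name not in args]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return args


tool = build_tool(FIELDS)
# Stand-in for the arguments string a model would return in its tool call.
arguments = (
    '{"full_name": "Jonas Weber", "email": "jonas.weber@example.com", '
    '"electrical_product": "cordless drill"}'
)
print(parse_arguments(arguments, tool))
```

The point of this design is that the model fills in named arguments instead of writing free text, which tends to make the output easier to parse.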


I have created a template-based generation technique that can help with this, but first we would need to fine-tune a model in a specific way, on highly diverse data. After that, it can be used to generate high-quality data, as much of it as you want. We can use LLMs between 0.5B and 1.5B parameters for generation.
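
The post does not describe the template technique itself, so the following is only one possible reading of "template-based generation", not the author's method: a fixed prompt template filled in from the field specification, which a small fine-tuned model would then complete. The template wording and the names `PROMPT` and `render_prompt` are hypothetical.

```python
import json
from string import Template

# Hypothetical prompt template; the wording is illustrative only.
PROMPT = Template(
    "Generate one mock record as JSON.\n"
    "Fields: $fields\n"
    "Example: $example\n"
    "Record: "
)


def render_prompt(fields: dict, example: dict) -> str:
    """Fill the template with a field list and one example record."""
    field_list = ", ".join(f"{name} ({kind})" for name, kind in fields.items())
    return PROMPT.substitute(fields=field_list, example=json.dumps(example))


fields = {"full_name": "name", "city": "address part"}
example = {"full_name": "Ana Ruiz", "city": "Valencia"}
prompt = render_prompt(fields, example)
print(prompt)
```

Keeping the template fixed and varying only the field list and example is what would let the same fine-tuned model serve many different schemas.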


The hard part is gathering highly diverse data and cleaning it. I plan to use an existing model like Mistral or Llama. Would that work?
