Multiple-Token Input for Text Generation and PPLM?

Hello. I am trying to integrate the output of an LDA topic model, which is usually a set of keywords, with controlled text generation to produce readable sentences. I have read some relevant papers and tried the code at ‘transformers/examples/text-generation/pplm’ and ‘run_generation’, but I am still struggling to understand how to pass “a list of strings” as input instead of the single string the demo presents. Thank you!

1 Like

This may help:

https://towardsdatascience.com/data-to-text-generation-with-t5-building-a-simple-yet-advanced-nlg-model-b5cce5a6df45
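I haven’t checked the exact input format the article uses, but once you have a fine-tuned T5 data-to-text model along those lines, querying it looks roughly like this (the model path and prompt serialization below are placeholders, not the article’s code):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder path to a T5 model fine-tuned for data-to-text generation.
model_path = "path/to/fine-tuned-t5"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# Illustrative structured input; how keywords/triples are serialized depends
# entirely on how the model was trained.
input_text = "generate text: coronavirus | mismanagement | distrust of science"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

output = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))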

1 Like

Hey, thank you so much! But does this mean that the current transformers pipeline/implementation does not support the type of task I described?

1 Like

Honestly, I’m not sure. What I do know, however, is that the linked article seems to cover what you want to accomplish. If you check out my profile, you’ll see that I actually trained GPT-2 with keywords: ForceWords.

Here is an example of what an output may look like:

Input: [‘At the core’, ‘mismanagement of the Coronavirus’, ‘distrust of science.’]

Output: At the core of the United States’ mismanagement of the Coronavirus lies a distrust of science.

1 Like

Thank you. Did you write your generation function differently from the article? I am having a little trouble understanding the data type of the input in your example: is it a list of strings? Did you happen to post a demo notebook on your GitHub profile? Thanks!

1 Like

Sorry, I believe I didn’t articulate my thoughts clearly. Setting aside the linked article, I trained a GPT-2 model with keywords that may also suit your needs. My strategy deviates from the article because I opted for GPT-2 instead of T5. Here is a notebook: https://colab.research.google.com/drive/16ctmbD03DrFJCwNN45Chy1jYf9Cm9pTp?usp=sharing. Note that it does not work perfectly, so the keywords may not always be included in the output.
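Roughly, the generation side looks something like this (the model id below is a placeholder for the ForceWords model on my profile, and you may need to tweak the prompt format and generation settings to match the notebook):

from transformers import AutoModelWithLMHead, AutoTokenizer

# Placeholder id: substitute the actual ForceWords model from my profile.
model_name = "<username>/ForceWords"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# The prompt is a list of keywords/phrases plus an "anchor" word that the
# generated sentence should start with.
prompt = """['At the core', 'mismanagement of the Coronavirus', 'distrust of science.'] At"""
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.9,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))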

3 Likes

Thank you, this is very helpful! I think that under some circumstances your work is better than the approach presented in that article. For this type of problem it is not necessary to include every individual input keyword or multiword expression, as long as the model comprehends the input and produces a reasonable output based on tokens with similar meanings. I do have a few more questions if you don’t mind:

  1. Did you use the same web-nlg-2020 dataset as the article? If not, could you share the type/domain of the training dataset you used?

  2. What is the difference between the ForceWords, ForceWords2 and ForceWordArvix models under your profile? I tried all of them and prefer ForceWords for now.

  3. Is there a specific reason why the input in your shared demo is in the form of
    """[‘At the core’, ‘mismanagement of the Coronavirus’, ‘lies its distrust of science’] At""", with three sets of quotation marks? I understand that you are passing a list of keyword strings as input, with “At” as an “anchor” that starts the sentence. I am also not sure why the output contains an additional set of input that is different from, but somehow related to, the original input, along with a new output, even though I already set the number of returned sequences to 1.

  4. Is there anything similar to a “seed” that we can set for the output? When I ran your model with identical input, some outputs were outstanding and I wish I could have saved them.

Thank you!

2 Likes

Updated Google Colab:

No worries.

  1. If I remember correctly, I scraped Joe Biden’s Twitter account and used NLTK to extract keywords.
  2. I agree with your assessment. ForceWords2 (I can’t remember the dataset) and ForceWordArvix (trained on arXiv titles and abstracts) were trained on different datasets.
  3. I updated the Google Colab to extract only the first sentence. You don’t need the three sets of quotation marks; I just have a habit of always writing them.
  4. To be honest, I’m not sure, but one possible approach is sketched at the end of this reply.

If I find the time later, I will train another model with a larger dataset using this strategy.
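On question 4: I haven’t tested this myself, but transformers ships a set_seed helper that fixes the relevant random states, which should make sampled outputs reproducible for the same prompt and settings. A minimal sketch (the seed value is arbitrary):

from transformers import set_seed

# Fix the Python, NumPy and PyTorch RNGs so that sampling with the same
# prompt and generation settings reproduces the same output.
set_seed(42)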

2 Likes

Hey, thanks again. As a beginner, I am also wondering to what extent we can access models that are posted on https://huggingface.co/, like yours, through transformers.AutoModelWithLMHead. For example, if I really like what your model can accomplish, is it possible to fine-tune/retrain it on another domain-specific dataset? Thank you!

2 Likes

If you replace “GPT-2” with my model, that may work. I’m not completely sure, though. Here is the training code you can use. You’ll have to change a few things like the directories.

Hugging Face introduced a new way to upload models this week, and I haven’t yet checked whether it’s compatible with Google Colab, so I didn’t include the upload step in this snippet.

import random

from transformers import (
    AutoModelWithLMHead,
    AutoConfig,
    Trainer,
    AutoTokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    TrainingArguments)

def modelTrainer(batch_size=1, conf='gpt2'):  # replace 'gpt2' with my model
    config = AutoConfig.from_pretrained(conf)
    # from_config() builds a freshly initialized model; use
    # AutoModelWithLMHead.from_pretrained(conf) instead if you want to
    # fine-tune the existing weights rather than train from scratch.
    model = AutoModelWithLMHead.from_config(config)
    tokenizer = AutoTokenizer.from_pretrained(conf)

    # Causal language modeling, so no masked-LM collation.
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Plain-text training file, one example per line.
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        block_size=tokenizer.max_len,
        file_path="/content/Para.txt",
        cache_dir="/content/",
    )

    training_args = TrainingArguments(
        output_dir="/content/ParaphraseAgain",
        num_train_epochs=3.0,
        per_device_train_batch_size=batch_size,
        warmup_steps=500,
        logging_steps=100,
        save_steps=500,
        seed=random.randint(0, 2**32 - 1),
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        prediction_loss_only=True,
    )

    # trainer.train('/content/ParaphraseAgain/checkpoint-2000')  # resume from a checkpoint
    trainer.train()
    trainer.save_model()

modelTrainer()

2 Likes

I really appreciate your patience. You have been incredibly helpful!

1 Like

My pleasure. Feel free to reach out if you have any more questions.

Is it also possible for me to look at a snippet of your training and test set formats so I can format mine the same way? Thanks!

1 Like

[‘longest streak’, ‘job growth’, ‘80 years’, ‘history’, ‘facing’, ‘delivered’] -> Facing the worst financial crisis in 80 years, you delivered the longest streak of job growth in our history.

I only used a training set.
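If it helps, lines in that format can be assembled into a plain-text training file with something like this rough sketch (the example pair is taken from above, and the file path just matches the training snippet earlier in the thread):

# Hypothetical (keyword list, target sentence) pairs used to build the file.
pairs = [
    (['longest streak', 'job growth', '80 years', 'history', 'facing', 'delivered'],
     'Facing the worst financial crisis in 80 years, you delivered the longest streak of job growth in our history.'),
]

# Write one "keywords -> sentence" example per line, matching the format above.
with open('/content/Para.txt', 'w', encoding='utf-8') as f:
    for keywords, sentence in pairs:
        f.write(f"{keywords} -> {sentence}\n")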

1 Like