Multiple-Token Input for Text Generation and PPLM?

Hello. I am trying to integrate the output of an LDA topic model, which is usually a set of keywords, with controlled text generation to produce readable sentences. I have read some relevant papers and tried the code at ‘transformers/examples/text-generation/pplm’ and ‘run_generation’, but I am still struggling to understand how to pass “a list of strings” as input instead of the single string the demo presents. Thank you!

1 Like

This may help:

https://towardsdatascience.com/data-to-text-generation-with-t5-building-a-simple-yet-advanced-nlg-model-b5cce5a6df45
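I haven’t checked the exact input format the article uses, but once you have a fine-tuned T5 data-to-text model along those lines, querying it looks roughly like this (the model path and prompt serialization below are placeholders, not the article’s code):

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder path to a T5 model fine-tuned for data-to-text generation.
model_path = "path/to/fine-tuned-t5"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# Illustrative structured input; how keywords/triples are serialized depends
# entirely on how the model was trained.
input_text = "generate text: coronavirus | mismanagement | distrust of science"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

output = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))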

1 Like

Hey, thank you so much! But does this mean that the current transformers pipeline/implementation does not support the type of task I described?

1 Like

Honestly, I’m not sure. What I do know, however, is that the linked article seems to cover what you want to accomplish. If you check out my profile, you’ll see that I actually trained GPT-2 with keywords: ForceWords.

Here is an example of what an output may look like:

Input: [‘At the core’, ‘mismanagement of the Coronavirus’, ‘distrust of science.’]

Output: At the core of the United States’ mismanagement of the Coronavirus lies a distrust of science.

1 Like

Thank you. Did you write your generation function differently from the article? I am having a little trouble understanding the data type of the input in your example: is it a list of strings? Did you happen to post a demo notebook on your GitHub profile? Thanks!

1 Like

Sorry, I believe I didn’t articulate my thoughts clearly. Setting aside the linked article, I trained a GPT-2 model with keywords that may also suit your needs. My strategy deviates from the article because I opted for GPT-2 instead of T5. Here is a notebook: https://colab.research.google.com/drive/16ctmbD03DrFJCwNN45Chy1jYf9Cm9pTp?usp=sharing. Note that it does not work perfectly, so the keywords may not always be included in the output.
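Roughly, the generation side looks something like this (the model id below is a placeholder for the ForceWords model on my profile, and you may need to tweak the prompt format and generation settings to match the notebook):

from transformers import AutoModelWithLMHead, AutoTokenizer

# Placeholder id: substitute the actual ForceWords model from my profile.
model_name = "<username>/ForceWords"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# The prompt is a list of keywords/phrases plus an "anchor" word that the
# generated sentence should start with.
prompt = """['At the core', 'mismanagement of the Coronavirus', 'distrust of science.'] At"""
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.9,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))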

3 Likes

Thank you, this is very helpful! I think that under some circumstances your work is better than the approach presented in that article. For this type of problem it is not necessary to include every individual input keyword or multiword expression, as long as the model comprehends the input and produces a reasonable output based on tokens with similar meanings. I do have a few more questions if you don’t mind:

  1. Did you use the same web-nlg-2020 dataset as the article? If not, could you share the type/domain of the training dataset you used?

  2. What is the difference between the ForceWords, ForceWords2 and ForceWordArvix models under your profile? I tried all of them and prefer ForceWords for now.

  3. Is there a specific reason why the input in your shared demo is in the form of
    """[‘At the core’, ‘mismanagement of the Coronavirus’, ‘lies its distrust of science’] At""", with three sets of quotation marks? I understand that you are passing a list of keyword strings as input, with “At” as an “anchor” that starts the sentence. I am also not sure why the output contains an additional set of input that is different from, but somehow related to, the original input, along with a new output, even though I already set the number of returned sequences to 1.

  4. Is there anything similar to a “seed” that we can set for the output? When I ran your model with identical input, some outputs were outstanding and I wish I could have saved them.

Thank you!

2 Likes

Updated Google Colab:

No worries.

  1. If I remember correctly, I scraped Joe Biden’s Twitter account and used NLTK to extract keywords.
  2. I agree with your assessment. ForceWords2 (I can’t remember the dataset) and ForceWordArvix (trained on arXiv titles and abstracts) were trained on different datasets.
  3. I updated the Google Colab to extract only the first sentence. You don’t need the three sets of quotation marks; I just have a habit of always writing them.
  4. To be honest, I’m not sure, but one possible approach is sketched at the end of this reply.

If I find the time later, I will train another model with a larger dataset using this strategy.
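On question 4: I haven’t tested this myself, but transformers ships a set_seed helper that fixes the relevant random states, which should make sampled outputs reproducible for the same prompt and settings. A minimal sketch (the seed value is arbitrary):

from transformers import set_seed

# Fix the Python, NumPy and PyTorch RNGs so that sampling with the same
# prompt and generation settings reproduces the same output.
set_seed(42)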

2 Likes

Hey, thanks again. As a beginner, I am also wondering to what extent we can access models that are posted on https://huggingface.co/, like yours, through transformers.AutoModelWithLMHead. For example, if I really like what your model can accomplish, is it possible to fine-tune/retrain it on another domain-specific dataset? Thank you!

2 Likes

If you replace “GPT-2” with my model, that may work. I’m not completely sure, though. Here is the training code you can use. You’ll have to change a few things like the directories.

Hugging Face introduced a new way to upload models this week, and I haven’t yet checked whether it’s compatible with Google Colab, so I didn’t include the upload step in this snippet.

import random

from transformers import (
    AutoModelWithLMHead,
    AutoConfig,
    Trainer,
    AutoTokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    TrainingArguments)

def modelTrainer(batch_size=1, conf='gpt2'):  # replace 'gpt2' with my model
    config = AutoConfig.from_pretrained(conf)
    # from_config() builds a freshly initialized model; use
    # AutoModelWithLMHead.from_pretrained(conf) instead if you want to
    # fine-tune the existing weights rather than train from scratch.
    model = AutoModelWithLMHead.from_config(config)
    tokenizer = AutoTokenizer.from_pretrained(conf)

    # Causal language modeling, so no masked-LM collation.
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Plain-text training file, one example per line.
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        block_size=tokenizer.max_len,
        file_path="/content/Para.txt",
        cache_dir="/content/",
    )

    training_args = TrainingArguments(
        output_dir="/content/ParaphraseAgain",
        num_train_epochs=3.0,
        per_device_train_batch_size=batch_size,
        warmup_steps=500,
        logging_steps=100,
        save_steps=500,
        seed=random.randint(0, 2**32 - 1),
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        prediction_loss_only=True,
    )

    # trainer.train('/content/ParaphraseAgain/checkpoint-2000')  # resume from a checkpoint
    trainer.train()
    trainer.save_model()

modelTrainer()

2 Likes

I really appreciate your patience. You have been incredibly helpful!

1 Like

My pleasure. Feel free to reach out if you have any more questions.

Is it also possible for me to look at a snippet of your training and test set formats so I can format mine the same way? Thanks!

1 Like

[‘longest streak’, ‘job growth’, ‘80 years’, ‘history’, ‘facing’, ‘delivered’] -> Facing the worst financial crisis in 80 years, you delivered the longest streak of job growth in our history.

I only used a training set.
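If it helps, lines in that format can be assembled into a plain-text training file with something like this rough sketch (the example pair is taken from above, and the file path just matches the training snippet earlier in the thread):

# Hypothetical (keyword list, target sentence) pairs used to build the file.
pairs = [
    (['longest streak', 'job growth', '80 years', 'history', 'facing', 'delivered'],
     'Facing the worst financial crisis in 80 years, you delivered the longest streak of job growth in our history.'),
]

# Write one "keywords -> sentence" example per line, matching the format above.
with open('/content/Para.txt', 'w', encoding='utf-8') as f:
    for keywords, sentence in pairs:
        f.write(f"{keywords} -> {sentence}\n")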

1 Like