I'm using code that is 99% provided by Hugging Face, which is the main source of my confusion. I am attempting to summarize medical scientific documents. I am on transformers version 4.2.0.
My code comes from 3 locations and is, for the most part, unmodified.
https://huggingface.co/transformers/model_doc/bart.html#bartforconditionalgeneration
from transformers import BartForConditionalGeneration, BartTokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
inputs = tokenizer([text], max_length=1024, return_tensors='pt')
# Generate Summary
summary_ids = model.generate(inputs['input_ids'], num_beams=4, max_length=150, min_length = 40, early_stopping=True)
print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
https://huggingface.co/transformers/task_summary.html
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
#model = AutoModelWithLMHead.from_pretrained("")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
# Comment kept from the docs example: T5 uses a max_length of 512, so the article is cut to 512 tokens (the model here is BART).
inputs = tokenizer.encode(abstract, return_tensors="pt", max_length=512)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
print(tokenizer.decode(outputs[0]))
The main change here is that I replaced AutoModelWithLMHead with AutoModelForSeq2SeqLM, as recommended by the warning I get when I run it.
Pipelines (transformers 4.3.0 documentation)
This is the third method I'm currently using to run the code.
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large", tokenizer="facebook/bart-large", framework="pt")
summary = summarizer(text, min_length=40, max_length=150, length_penalty=2.0, num_beams=4, early_stopping=True)
print(summary)
I'll summarize some results below.
1 is the pipeline, 2 is AutoModelForSeq2SeqLM, 3 is BartForConditionalGeneration.
When using facebook/bart-large, 1 and 3 give the same results. However, the second one, AutoModelForSeq2SeqLM, gives different results, despite the documentation indicating (to me) that AutoModelForSeq2SeqLM should, in this case, be identical to BartForConditionalGeneration.
Results get stranger when using facebook/bart-large-xsum, which gives the same results for 2 and 3, while 1 actually comes back with a result that's nowhere to be found in the original text.
Using facebook/bart-large-cnn, all 3 results are the same.
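One thing that may be worth ruling out first: the three snippets above do not pass identical settings. Below is a minimal sketch that lists the settings that differ between the calls as posted, and then reruns all three methods with one shared set of generation arguments. The names CALLS, SHARED, differing_settings and summarize_three_ways are hypothetical helpers written for this post, not part of transformers, and it is only an assumption that these differences have anything to do with the mismatch.

```python
UNSET = "<not passed>"

# Settings copied from the three snippets as originally posted.
# (EDIT1 later added length_penalty=2.0 to the BartForConditionalGeneration call.)
CALLS = {
    "pipeline": {
        "min_length": 40, "max_length": 150, "length_penalty": 2.0,
        "num_beams": 4, "early_stopping": True,
    },
    "auto_model": {
        "tokenizer_max_length": 512,
        "min_length": 40, "max_length": 150, "length_penalty": 2.0,
        "num_beams": 4, "early_stopping": True,
    },
    "bart_conditional": {
        "tokenizer_max_length": 1024,
        "min_length": 40, "max_length": 150,
        "num_beams": 4, "early_stopping": True,
        "skip_special_tokens": True,
    },
}

# One shared set of generation arguments for a like-for-like rerun.
SHARED = dict(min_length=40, max_length=150, length_penalty=2.0,
              num_beams=4, early_stopping=True)


def differing_settings(calls):
    """Return {setting: {call_name: value}} for settings that are not identical in every call."""
    keys = sorted({key for kwargs in calls.values() for key in kwargs})
    diffs = {}
    for key in keys:
        values = {name: kwargs.get(key, UNSET) for name, kwargs in calls.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


def summarize_three_ways(text, checkpoint="facebook/bart-large", input_max_length=1024):
    """Run the three methods with SHARED and return {method_name: summary}."""
    # Imported here so the comparison above runs without transformers installed.
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              BartForConditionalGeneration, BartTokenizer, pipeline)

    results = {}
    summarizer = pipeline("summarization", model=checkpoint,
                          tokenizer=checkpoint, framework="pt")
    results["pipeline"] = summarizer(text, **SHARED)[0]["summary_text"]

    loaders = {
        "auto_model": (AutoModelForSeq2SeqLM, AutoTokenizer),
        "bart_conditional": (BartForConditionalGeneration, BartTokenizer),
    }
    for name, (model_cls, tokenizer_cls) in loaders.items():
        model = model_cls.from_pretrained(checkpoint)
        tokenizer = tokenizer_cls.from_pretrained(checkpoint)
        # Same input length limit and explicit truncation for both direct calls.
        inputs = tokenizer([text], max_length=input_max_length,
                           truncation=True, return_tensors="pt")
        ids = model.generate(inputs["input_ids"], **SHARED)
        results[name] = tokenizer.decode(ids[0], skip_special_tokens=True)
    return results


# Print only the settings that are not the same in all three calls.
for setting, values in differing_settings(CALLS).items():
    print(setting, values)
```

Calling summarize_three_ways(text) for each checkpoint would then compare the methods under the same arguments. If the outputs still differ, that would point away from simple user error. The pipeline may also pick up defaults from the model config, which this sketch does not control.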
I haven't tested more than this. I don't know if this is just major user error or something for GitHub.
Please let me know.
Input text is from a medical abstract, located below.
text = (
    "Oesophageal squamous cell carcinoma (ESCC) is an aggressive malignancy and a leading cause of cancer-related death worldwide. Lack of effective early diagnosis strategies and ensuing complications from tumour metastasis account for the majority of ESCC death. Thus, identification of key molecular targets involved in ESCC carcinogenesis and progression is crucial for ESCC prognosis. In this study, four pairs of ESCC tissues were used for mRNA sequencing to determine differentially expressed genes (DEGs). 347 genes were found to be upregulated whereas 255 genes downregulated. By screening DEGs plus bioinformatics analyses such as KEGG, PPI and IPA, we found that there were independent interactions between KRT family members. KRT17 upregulation was confirmed in ESCC and its relationship with clinicopathological features were analysed. KRT17 was significantly associated with ESCC histological grade, lymph node and distant metastasis, TNM stage and five-year survival rate. Upregulation of KRT17 promoted ESCC cell growth, migration, and lung metastasis. Mechanistically, we found that KRT17-promoted ESCC cell growth and migration was accompanied by activation of AKT signalling and induction of EMT. These findings suggested that KRT17 is significantly related to malignant progression and poor prognosis of ESCC patients, and it may serve as a new biological target for ESCC therapy. SIGNIFICANCE: Oesophageal cancer is one of the leading causes of cancer mortality worldwide and oesophageal squamous cell carcinoma (ESCC) is the major histological type of oesophageal cancer in Eastern Asia. However, the molecular basis for the development and progression of ESCC remains largely unknown. In this study, RNA sequencing was used to establish the whole-transcriptome profile in ESCC tissues versus the adjacent non-cancer tissues and the results were bioinformatically analysed to predict the roles of the identified differentially expressed genes. "
    "We found that upregulation of KRT17 was significantly associated with advanced clinical stage, lymph node and distant metastasis, TNM stage and poor clinical outcome. Keratin 17 (KRT17) upregulation in ESCC cells not only promoted cell proliferation but also increased invasion and metastasis accompanied with AKT activation and epithelial-mesenchymal transition (EMT). These data suggested that KRT17 played an important role in ESCC development and progression and may serve as a prognostic biomarker and therapeutic target in ESCC. "
)
EDIT1: I originally forgot "length_penalty = 2.0" in the BartForConditionalGeneration call (3). However, adding it had no effect on anything.