Hi, I’m using a pretrained model on my data. I iterate over a set of XML files, store the model output in a list, and want to save that whole list as a JSON file, because I need to manipulate it later on. The problem is that the output is written with single quotes. If I replace those with double quotes so the file can be reused, words like “isn’t” and “wouldn’t” break, and fixing them by hand is far too time-consuming. I’m not sure what the best way forward is. What do you suggest?
My code is below:
import json
import os
import xml.etree.ElementTree as ET

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

label_list = ['literal', "metaphoric"]
label_dict_relations = {i: l for i, l in enumerate(label_list)}
tokenizer = AutoTokenizer.from_pretrained("lwachowiak/Metaphor-Detection-XLMR")
model = AutoModelForTokenClassification.from_pretrained("lwachowiak/Metaphor-Detection-XLMR", id2label=label_dict_relations)
# Build the pipeline once instead of once per paragraph
nerpipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")

words_input_dir = "MYDIRECTORY"
os.chdir(words_input_dir)
resu = {}
sents = []
for filename in os.listdir(words_input_dir):
    if filename.endswith(".xml"):
        tree = ET.parse(filename)
        root = tree.getroot()
        node = root.findall("./body/sec/p")
        for x in node:
            if x is not None:
                coco = x.text
                data = nerpipeline(str(coco))
                sents.append(data)  # collect the output for every paragraph

# str(sents) is Python's repr of the list, which is where the single quotes come from
json_string = json.dumps(str(sents))
with open(r'MYPATH/RESULTS.json', "w") as outfile:
    outfile.write(json_string)
TL;DR: how can I convert the output of this pipeline to (valid) JSON?
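One possible direction, as a minimal sketch rather than a definitive fix: the single quotes most likely come from calling str() on the list before json.dumps, so the file ends up holding one big string containing Python's repr instead of a JSON array. Passing the list itself to json.dumps lets the json module handle quoting, apostrophes included. The sketch below assumes the pipeline returns lists of dicts whose scores may be NumPy scalars (which json cannot serialize directly); the helper name to_jsonable and the sample data are hypothetical and only there for illustration.

```python
import json


def to_jsonable(value):
    """Recursively convert nested pipeline output into plain Python types."""
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    # Assumption: non-native numbers (e.g. NumPy float32 scores) expose .item(),
    # which returns the equivalent native Python number.
    if callable(getattr(value, "item", None)):
        return value.item()
    return value


# Hypothetical sample shaped like one paragraph's aggregated output.
sample = [[{"entity_group": "literal", "score": 0.99, "word": "isn't", "start": 0, "end": 5}]]

# Serialize the list itself, not str(list); json takes care of the quoting.
json_text = json.dumps(to_jsonable(sample), ensure_ascii=False, indent=2)

# Round trip: the apostrophe survives and the structure is intact.
restored = json.loads(json_text)
```

Applied to the code above, that would mean replacing the json.dumps(str(sents)) step with json.dump(to_jsonable(sents), outfile, ensure_ascii=False) inside the with block, after which json.load should give back a list of lists of dicts.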