Parsing maritime location ranges

Hi folks

I’m attempting to train a model to parse maritime location ranges. These are strings that can be resolved into a geographical area or a list of shipping ports.

An example could be AG NSOBI JUBAIL EXCL I+I
This translates to: Arabian Gulf, not south of but including Jubail, excluding Iran and Iraq.

The ranges often include tons of abbreviations, acronyms, spelling mistakes, and different ways of representing the same thing, but essentially each range is a list of locations and operators (INCL, EXCL, NSOBI, etc.).

My goal with this model is to translate the ranges into a known and structured format. So the above would translate to [ARABIAN GULF|LOC] [NSOBI|OPR] [JUBAIL|LOC] [EXCL|OPR] [IRAN|LOC] [IRAQ|LOC], which I could then process with a deterministic program by looking up the locations and applying the operators.
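To show what I mean by deterministic post-processing, here is a rough sketch. The lookup table is a tiny placeholder, and the operator handling is simplified (NSOBI/NNOBI/NEOBI really imply geographic constraints rather than plain inclusion):

import re

# Placeholder lookup; the real table would map every known location to a
# canonical area/port entity.
LOCATIONS = {
    "ARABIAN GULF": "area:arabian_gulf",
    "JUBAIL": "port:jubail",
    "IRAN": "country:iran",
    "IRAQ": "country:iraq",
}

TOKEN_RE = re.compile(r"\[([^|\]]+)\|([A-Z]+)\]")

def parse_tagged(tagged):
    # Turn "[ARABIAN GULF|LOC] [NSOBI|OPR] ..." into (text, tag) pairs.
    return TOKEN_RE.findall(tagged)

def resolve(tagged):
    # Walk the tagged tokens, look up locations and apply the operators.
    included, excluded = [], []
    mode = "include"
    for text, tag in parse_tagged(tagged):
        if tag == "OPR":
            # Simplification: EXCL switches to exclusion; everything else
            # (INCL, NSOBI, NNOBI, NEOBI, ...) keeps adding to the inclusion.
            mode = "exclude" if text == "EXCL" else "include"
        else:
            (included if mode == "include" else excluded).append(
                LOCATIONS.get(text, "unknown:" + text))
    return {"include": included, "exclude": excluded}

print(resolve("[ARABIAN GULF|LOC] [NSOBI|OPR] [JUBAIL|LOC] [EXCL|OPR] [IRAN|LOC] [IRAQ|LOC]"))
# {'include': ['area:arabian_gulf', 'port:jubail'], 'exclude': ['country:iran', 'country:iraq']}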

I’ve created a few hundred training examples and fine-tuned the T5-small model, which initially looked good, but it’s like it’s struggling to learn any generalisations. If I take the above example (which is a well-known range) and just add something simple like “INCL FUJAIRAH” at the end, it fails, I assume because it has never seen that sequence before.
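For reference, my fine-tuning setup is roughly the sketch below (hyperparameters, column names and the output directory are placeholders, not my exact values):

from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# One illustrative pair; the real set is a few hundred of these.
pairs = [
    {"src": "AG NSOBI JUBAIL EXCL I+I",
     "tgt": "[ARABIAN GULF|LOC] [NSOBI|OPR] [JUBAIL|LOC] [EXCL|OPR] [IRAN|LOC] [IRAQ|LOC]"},
]

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    # Tokenize the raw range as the input and the tagged string as the target.
    model_inputs = tokenizer(batch["src"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["tgt"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset = Dataset.from_list(pairs).map(preprocess, batched=True, remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="t5-maritime-ranges",
        per_device_train_batch_size=8,
        learning_rate=3e-4,
        num_train_epochs=20,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()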

I’m looking for input on how to solve this problem. Other approaches/models I can try out?

I’ll add some more examples to explain the challenge:

  1. USG IF MISS RIVER NNOBIBR

    [US GULF|LOCATION] [IF|CONDITION] [MISSISSIPPI RIVER|LOCATION] [NNOBI|OPERATOR] [BATON ROUGE|LOCATION]

    This is an interesting example because NNOBI (not north of, but including) and BR (Baton Rouge) were concatenated. This is not a spelling mistake; shipping traders just use the range often enough that they know what it means. (See the tokenizer snippet after these examples.)

  2. EUROMED NEOBIG EXCL Y,FY,AL BUT INCL R+O

    [EUROMED|LOCATION] [NEOBI|OPERATOR] [GREECE|LOCATION] [EXCL|OPERATOR] [YUGOSLAVIA|LOCATION] [FORMER YUGOSLAVIA|LOCATION] [ALBANIA|LOCATION] [INCL|OPERATOR] [RIJEKA|LOCATION] [OMISALJ|LOCATION]

    Lots of stuff here that can only be understood in the context of the full sequence.
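Part of what makes example 1 tricky is how the subword tokenizer sees a fused token like NNOBIBR. A quick way to inspect that (I’m not relying on any particular split here, just showing how to check):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
for text in ["NNOBI BR", "NNOBIBR", "USG IF MISS RIVER NNOBIBR"]:
    print(text, "->", tokenizer.tokenize(text))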

Thanks!


The code below isn’t a practical solution by itself, but I feel that using two types of models in series like this would be easier to implement.

from transformers import pipeline

input_text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Model 1: a seq2seq "translation" model (EN->ES here; in your case it could rewrite a raw range into a normalised form)
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
# Model 2: a token-classification (NER) model that tags each word as PER/ORG/LOC etc.
classifier = pipeline("token-classification", model="huggingface-course/bert-finetuned-ner", aggregation_strategy="simple")

print(classifier(input_text))
# [{'entity_group': 'PER', 'score': 0.9988506, 'word': 'Sylvain', 'start': 11, 'end': 18},
#  {'entity_group': 'ORG', 'score': 0.96476245, 'word': 'Hugging Face', 'start': 33, 'end': 45},
#  {'entity_group': 'LOC', 'score': 0.9986118, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

print(translator(input_text))
# [{'translation_text': 'Me llamo Sylvain y trabajo en Hugging Face en Brooklyn.'}]

# Chaining the two models: classify the tokens of the translated text.
print(classifier(translator(input_text)[0]["translation_text"]))
# [{'entity_group': 'MISC', 'score': 0.4424469, 'word': '##yl', 'start': 10, 'end': 12},
#  {'entity_group': 'LOC', 'score': 0.99720335, 'word': 'Brooklyn', 'start': 46, 'end': 54}]