Using a geographical distance matrix in transformers

Hello everyone!

I am looking for a reference, guide, or teaching material that shows how to use a geographical distance matrix in a neural network. Roughly, I have the following in mind: I have n categories, each with multiple data points. For each pair of categories i and j, I compute the geographical distance between them, which gives an n×n matrix D with zeros on the diagonal (distance to itself) and off-diagonal entries D[i][j] equal to the distance between category i and category j. The idea is to use this matrix to make the model attend more to categories that are geographically close (in other words, to introduce spatial autocorrelation): if two categories i and j are a short distance apart, the model should be biased toward giving them more similar parameters.
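To make the idea concrete, here is a minimal NumPy sketch of what I have in mind. It is only one possible reading of the idea, not an established recipe: it builds the n×n matrix with the haversine (great-circle) formula and subtracts the scaled distances from the attention logits before the softmax. The coordinates, the function names, and the `scale_km` parameter are all made up for illustration, and in a real model the scale would probably be a learned parameter.

```python
import numpy as np

def haversine_matrix(lat_deg, lon_deg, radius_km=6371.0):
    """Pairwise great-circle distances in km between n points (n x n matrix)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point excursions outside [0, 1].
    return 2 * radius_km * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

def distance_biased_attention(q, k, v, dist, scale_km=500.0):
    """Scaled dot-product attention with an additive penalty -dist / scale_km.

    scale_km is a hypothetical hyperparameter: larger values weaken the bias.
    """
    d_k = q.shape[-1]
    logits = q @ k.T / np.sqrt(d_k)
    logits = logits - dist / scale_km                     # far apart => lower logit
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Hypothetical example: one coordinate pair per category (n = 3).
lat = [52.5, 48.9, 35.7]
lon = [13.4, 2.4, 139.7]
D = haversine_matrix(lat, lon)

# Zero queries/keys, so the attention weights depend on distance alone.
n, d_model = 3, 4
q = np.zeros((n, d_model))
k = np.zeros((n, d_model))
v = np.eye(n)
out, weights = distance_biased_attention(q, k, v, D)
```

With zero queries and keys, each row of `weights` is determined by distance only, so category 0 puts more weight on the nearby category 1 than on the distant category 2. With real queries and keys, the distance term would only shift the content-based scores rather than replace them.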

I kept the example a bit vague on purpose: the application is very field-specific (computational historical linguistics), and I expect the details of the actual problem would confuse more than they help.

Maybe there is an easier way of using geographical information than a pairwise distance matrix, and I would be happy to hear your takes on it. One note: I have seen people suggest computing the distance from each data point to some fixed reference point, e.g., the closest hospital or school when predicting housing prices. For my problem, distance to an independent, arbitrary point would be meaningless.

Thanks a lot in advance!