Feat: Fuzzy match
Using the Levenshtein distance (aka edit distance) to fuzzy-match misspelled départements in source data (e.g. "Haut-de-Seine" or "Réunion" instead of "Hauts-de-Seine" and "La Réunion").
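For readers unfamiliar with the technique, here is a minimal sketch of edit-distance matching. It is not the code from this PR: the names `levenshtein` and `fuzzy_match`, the case-folding step, and the inclusive threshold are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(value, candidates, max_distance=3):
    """Return the closest candidate within max_distance (inclusive), else None.

    Hypothetical helper: comparison is case-insensitive, which is an
    assumption, not something stated in the PR.
    """
    best, best_distance = None, max_distance + 1
    needle = value.casefold()
    for candidate in candidates:
        distance = levenshtein(needle, candidate.casefold())
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


departements = ["Hauts-de-Seine", "La Réunion", "Haute-Savoie"]
print(fuzzy_match("Haut-de-Seine", departements))
print(fuzzy_match("Réunion", departements))
```

With these candidates, "Haut-de-Seine" is one edit away from "Hauts-de-Seine" and "Réunion" is three edits away from "La Réunion", so both fall within the default threshold.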
The implementation in this PR is very basic, but it does the job as far as my use case is concerned. There's room for improvement, including:
- use the distance in proportion to the length of the longer string being compared, to avoid false positives such as "paca" and "aura", which have an edit distance of 3 despite being totally different regions. I am not too concerned about this, as we typically handle abbreviations as explicit cases in `insitu/importer/validators.py`.
- allow for specifying a per-dataset `max_distance` value in the config file (3 seems like a pretty arbitrary choice, although it does the job as far as my use cases are concerned)
- use n-grams as in addok: https://github.com/addok/addok/blob/master/addok/helpers/text.py#L164-L177
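
As a rough illustration of the first idea (distance relative to string length), here is a sketch. The function names and the `max_ratio` value of 0.35 are hypothetical choices for this example, not something that exists in the codebase or has been tuned on real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(a, b) / longest


def is_close(a: str, b: str, max_ratio: float = 0.35) -> bool:
    """Hypothetical check: accept only if the relative distance is small."""
    return normalized_distance(a.casefold(), b.casefold()) <= max_ratio


print(normalized_distance("paca", "aura"))          # 3 edits over 4 characters
print(is_close("paca", "aura"))
print(is_close("Haut-de-Seine", "Hauts-de-Seine"))
print(is_close("Réunion", "La Réunion"))
```

Under this assumed threshold, "paca" vs "aura" (3 edits out of 4 characters) is rejected, while both misspelled départements from the description are still accepted, even though "Réunion" vs "La Réunion" has the same absolute distance of 3.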
I'm thinking we figure it out as we go, unless one of you really has an issue with the current implementation?