Skip to content

Conversation

@missinglink
Copy link
Member

@missinglink missinglink commented Feb 17, 2021

I tried to do a /v1/autocomplete for a partial postalcode with ?layers=postalcode today and discovered that it doesn't return anything until the final character has been typed.

This DRAFT PR is a simple attempt at maybe fixing that

Would require more unit tests and some reasonable level of perf/recall/precision testing to get merged.

@missinglink
Copy link
Member Author

Screenshot 2021-02-17 at 21 35 51

@missinglink
Copy link
Member Author

Screenshot 2021-02-17 at 21 37 27

@missinglink
Copy link
Member Author

huh actually works pretty well @orangejulius @Joxit thoughts?

Copy link
Member

@Joxit Joxit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I did some acceptance-tests, no regression and this will improve UX on postalcode autocomplete !

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Feb 17, 2021
This adds autocomplete tests specifically around postalcode behavior.

Most of them were missed and I actually meant to commit them as part of
#544.

Additionally, I've added one more test for an incomplete postalcode, to
measure progress against pelias/pelias#676.

We may be able to improve results with
pelias/api#1514
@orangejulius
Copy link
Member

Oh, very nice. This has been really annoying for a long time. We've had the issue reported in the past going all the way back to 2017 pelias/pelias#676

I just added another acceptance test for this in pelias/acceptance-tests#545 and i'll follow up with a bit more testing.

@orangejulius
Copy link
Member

I can report that autocomplete queries with a focus.point (and no other params) can work really well now:
image

Postalcode queries are still pretty tough, there's a lot of potential for exact housenumber matches, but this looks like a great improvement overall.

More testing is needed, it looks like there might be some address autocomplete queries that got a bit worse. But this is exciting :)

@missinglink
Copy link
Member Author

it looks like there might be some address autocomplete queries that got a bit worse

Could you please post some examples.

@orangejulius
Copy link
Member

Ok, still investigating and making sure the differences I'm seeing aren't from differences in test settings/etc, but it looks like the differences aren't that major.

Here's one that's a little confusing: autocomplete takes a couple more characters to get 30 w 26 correct
Screenshot_2021-02-17_14-24-15

https://pelias.github.io/compare/#/v1/autocomplete?text=30+w+26
Screenshot_2021-02-17_14-30-07

It looks like this query has replaced the second of two 30 w 26th addresses with a bunch of addresses of the format 26XX route 30 west. If its just this one case and there's something weird with the 30 w 26th, new york address record, that's not a problem. But I wonder if it's indicative of other problems.

Very preliminary results so far, stay tuned for more :)

@missinglink
Copy link
Member Author

Hmm yeah that is weird, I'm on mobile ATM but I can't see the layers listed in the debug output.

The PR targets the clean.layers property which I assumed would always be there, and possibly is there under the hood.

@missinglink
Copy link
Member Author

Adding layers=coarse,address,street,venue,locality,neighbourhood,county,localadmin,region,macrocounty,country,macroregion,borough,postalcode (all of them) to the query mirrors the old behavior

@orangejulius
Copy link
Member

Ok, I figured out the difference in query.

It turns out all the records shown on the dev screenshot have the same score.

The baseline query has this in the must clause:
image

It will score records matching the phrase input in order the highest.

The query from this branch ends up looking like this:
image

All the documents shown in the screenshot above include 26 somewhere, and they have a perfect phrase match for 30 w, so that explains the scoring.

On the one hand, I don't like that this PR essentially adds another weird edge case to our tokenizing/parsing code, but on the other hand the underlying behavior is not the fault of this PR. It's just what our autocomplete queries already do.

I guess this could affect any address query ending in numbers, whether its a numeric street like in this case, or perhaps a "European style" street address, with the number written last?

@missinglink
Copy link
Member Author

I'm back at the keyboard again now and added this commit which seems to have done the trick 63c168c

@orangejulius does that fix the issues you were seeing?

Screenshot 2021-02-18 at 16 55 18

Screenshot 2021-02-18 at 16 56 12

@missinglink
Copy link
Member Author

missinglink commented Feb 18, 2021

I guess this could affect any address query ending in numbers, whether its a numeric street like in this case, or perhaps a "European style" street address, with the number written last?

It shouldn't affect any address queries because if the address layer is targeted then the existing behaviour remains unchanged.

The change of behaviour only applies when address is not present in the layers being queried.

@missinglink
Copy link
Member Author

missinglink commented Feb 18, 2021

This query confuses me, since it's explicitly targeting the street layer I would expect it to list all street documents starting with 3 but it doesn't...

https://pelias.github.io/compare/#/v1/autocomplete?boundary.gid=whosonfirst%3Alocality%3A85977539&layers=street&text=west+3&debug=1

It's not actually bad, just not what I was expecting and might indicate a bug or that I've missed something.

@missinglink
Copy link
Member Author

@missinglink missinglink force-pushed the autocomplete_postalcodes branch from 63c168c to 6b085e8 Compare April 9, 2021 01:10
@missinglink
Copy link
Member Author

rebased origin/master

@missinglink missinglink marked this pull request as ready for review April 9, 2021 01:11
@missinglink
Copy link
Member Author

missinglink commented Apr 9, 2021

After rebasing and manually reviewing the cases I mentioned above I don't see any regressions.

@missinglink
Copy link
Member Author

We can merge this once we review the differences for the tests in pelias/acceptance-tests#549 and confirm they are positive and no major regressions.

@missinglink missinglink force-pushed the autocomplete_postalcodes branch from 6b085e8 to 0778fd2 Compare April 28, 2021 01:54
@missinglink
Copy link
Member Author

@missinglink
Copy link
Member Author

missinglink commented Apr 28, 2021

For the vast majority of cases it's a marked improvement 🎉
The recall is much better with this PR but the precision for postcodes starting with a 0 have sorting issues.

The sorting is quite interesting:

  • looking at the first screenshot (the improvement), all 5 results are good but it's not clear why they are ordered the way they are 🤷
  • similarly the scoring for the second screenshot seems to be undefined, the postcodes are scored first and last with three other results in between
  • on the last screenshot it's including a bunch of "Country Road 4106" for the query "04106", I think this is undesirable. Once those are stripped away the sorting still seems to be undefined.

@missinglink
Copy link
Member Author

hmm... I think maybe I'm getting confused since I've been away from work for a while.

The problem with the second two queries is they don't specify layers=postalcode like the first one does, IIRC this PR is only supposed to change behaviour when the address layer isn't being targeted.

When adding layers=postalcode the results look great, so either the dev env and the prod env are different or this code is not honouring that address restriction.

@orangejulius orangejulius changed the title allow prefix matching numerals on non-address queries Enable autocomplete for postalcodes Jun 2, 2021
@orangejulius orangejulius force-pushed the autocomplete_postalcodes branch from 0778fd2 to 2ab2149 Compare June 2, 2021 18:08
@orangejulius
Copy link
Member

I just updated this PR with some new commit messages to provide a better "high level" context of the change, rather than the development-focused explanation that was originally present. The original commits are preserved in an archive branch.

Here's what I wrote for the explanation, hopefully it's clear and helpful as it will appear in the Pelias API release notes:

Enable support for autocompleting postalcode by relaxing constraints around when we allow partial matches on numeric inputs.

Historically, Pelias has not allowed the text parameter input to match a "partial" result when the entire input consists only of numbers. Considering there are often lots of results that could match a numeric input like 1, 10, or even 1014, and most of them are addresses that have little chance of being relevant, this approach isn't without merit.

However postalcodes are another possible match, and people want to search for them reasonably often. They also often consist purely of numbers

So this change allows those numeric partial matches when the address layer won't be queried. Thanks to some of our performance optimizations, this will always be the case for short inputs like text=1234 as long as the layers parameter is unset or doesn't explicitly contain the address layer.

In practice, what this means is that a query like text=9021 will now result in the 90210 US postalcode first, whereas previously it would return a bunch of irrelevant venue results scattered around the world.

@orangejulius orangejulius force-pushed the autocomplete_postalcodes branch from 2ab2149 to 32b6c39 Compare June 2, 2021 18:12
Enable support for autocompleting postalcode by relaxing constraints
around when we allow partial matches on numeric inputs.

Historically, Pelias has not allowed the `text` parameter input to match
a "partial" result when the entire input consists only of numbers.
Considering there are often _lots_ of results that could match a numeric
input like `1`, `10`, or even `1014`, and most of them are addresses
that have little chance of being relevant, this approach isn't without
merit.

However postalcodes are another possible match, and people want to
search for them reasonably often. They also often consist purely of
numbers.

So this change allows those numeric partial matches when the `address`
layer _won't_ be queried. Thanks to some of our performance
optimizations, this will always be the case for short inputs like
`text=1234` as long as the `layers` parameter is unset or doesn't
explicitly contain the address layer.

In practice, what this means is that a query like `text=9021` will now
result in the `90210` US postalcode first, whereas previously it would
return a bunch of irrelevant venue results scattered around the world.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants