optimize memory consumption of idnadata, take 3 #24
Conversation
Let me know if I can clarify, document, or change anything :-)

Thanks! I am at the ICANN meeting in Marrakech this week, so I won't have an opportunity to work on this until next week. I hope then I can push out a new release with this included.
Optimizations to reduce memory footprint
Thanks very much! Just wanted to confirm: the outcome of the discussion was to move to Unicode version 8.0?

That issue is not resolved, but it is possibly moot with respect to these particular patches: the classification of code points into scripts is theoretically purely additive, and this won't add any additional PVALID characters, which are still derived from 6.3. I'll confirm this assumption before pushing a new version to PyPI.
Optimizations to reduce memory footprint
v2.1 has been pushed that contains this. I made some minor changes that relate to how the data is encapsulated so it is easier to diff and debug, but the functionality is the same. Thanks again!
Thanks! The wheel uploaded here is for Python 2 only --- should it be for both 2 and 3? https://pypi.python.org/pypi/idna/2.1

There should be a universal one there now. I am new to using wheels...

Thanks, that worked. You might want to take down the 2-only wheel? Not critical, though.
I'm sorry I keep changing my mind here. I think I finally have an approach with no significant drawbacks:
- Drop the `DISALLOWED` codepoints, as well as any scripts that aren't special-cased by RFC 5892.
- The `PVALID` codepoint lists have long runs in them --- but instead of expanding those runs into a flat hash-set at runtime, leave them as a sequence of `(start, end)` tuples. These tuples can be binary-searched for integer membership with the `bisect` module. The binary search is faster than the one in "optimize memory consumption of idnadata, take 2" #23 because it costs O(log(# runs)), not O(log(# codepoints)).
- Memory consumption is down under 2 MB, the naive benchmark is not significantly affected (the increase is less than 10%, with overall time still dominated by punycode), and there's no need to package a large opaque binary datafile.
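The run-tuple membership test described above can be sketched roughly like this (a minimal illustration, not the actual patch: the ranges shown are made-up sample data, and `in_ranges` is a hypothetical helper name):

```python
import bisect

# Hypothetical sample data: sorted, non-overlapping (start, end) codepoint
# runs with inclusive ends. The real tables would hold the PVALID runs.
RANGES = [(0x61, 0x7A), (0xDF, 0xF6), (0xF8, 0xFF)]
STARTS = [start for start, _ in RANGES]  # parallel list for bisect

def in_ranges(codepoint):
    """Binary-search the run list instead of probing a flat per-codepoint set."""
    # Index of the rightmost run whose start is <= codepoint.
    i = bisect.bisect_right(STARTS, codepoint) - 1
    if i < 0:
        return False  # below the first run
    start, end = RANGES[i]
    return start <= codepoint <= end
```

This is why the cost is O(log(# runs)): the search walks the run boundaries, not the expanded codepoints, so long contiguous runs collapse to a single tuple.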
Caveats about the Unicode version still apply (again, the idnadata rebuild is an isolated commit and can be rebased out).