Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest.#31
Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest.#31sdm7g wants to merge 15 commits into
Conversation
bloomonkey
left a comment
There was a problem hiding this comment.
Hi
First, thanks for taking the time and trouble to code this up and submit a PR. This looks like a very useful addition.
That said, I'm uneasy about adding this entire module into oaiharvest as this brings with it the burden of maintaining this code (which I didn't write and don't understand). I'd much prefer it if pyoai could approve your pull request and mint a new release... (I'll go and +1 your PR there shortly)
There a few other changes I've suggested to remove code you're replacing rather than commenting it out. The git repo will keep history for us, so there's no need for it to be in the latest version.
Thanks again, and good luck getting the pyoai change merged... 🤞
John
| if isinstance(metadata, str) and metadata.startswith("b'"): | ||
| metadata = ast.literal_eval(metadata).decode("utf-8") | ||
| yield (header, metadata, about) | ||
| if client.XMLParser.error_log and len(client.XMLParser.error_log) > 0 : |
There was a problem hiding this comment.
Please remove space before : (PEP 8)
|
|
||
| ch = logging.StreamHandler() | ||
| format='%(levelname)-8s %(message)s', | ||
| # format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s', |
There was a problem hiding this comment.
Please remove old code instead of commenting it out
| datefmt='[%Y-%m-%d %H:%M:%S]', | ||
| filename=os.path.join(appdir, 'registry.log') | ||
| format='%(levelname)-8s %(message)s' | ||
| # format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s', |
There was a problem hiding this comment.
Please remove old code instead of commenting it out
| ) | ||
| logger = logging.getLogger(__name__) | ||
| ch = logging.StreamHandler() | ||
| # ch = logging.StreamHandler() |
There was a problem hiding this comment.
Please remove old code instead of commenting it out
| ch = logging.FileHandler( os.path.join( appdir, 'registry.log')) | ||
| ch.setLevel(logging.DEBUG) | ||
| formatter = logging.Formatter('%(levelname)-8s %(message)s') | ||
| # formatter = logging.Formatter('%(levelname)-8s %(message)s') |
There was a problem hiding this comment.
Please remove old code instead of commenting it out
| formatter = logging.Formatter( | ||
| '%(asctime)s %(name)-16s %(levelname)-8s %(message)s', | ||
| '[%Y-%m-%d %H:%M:%S]') | ||
| #formatter = logging.Formatter('%(levelname)-8s %(message)s') |
There was a problem hiding this comment.
Please remove old code instead of commenting it out
There was a problem hiding this comment.
OK: commented code removed. ( I had left them in originally, as I was unsure of the intent of the original code, and wanted to make clear what exactly what behavior I was changing. )
I also added the --recover option to the doc-string at the top, which I missed earlier.
|
On further use and testing, I have found a couple of cases where it does recover from parsing Mal-formed payload, but where the issue is unmatching tags, the errors cascade outward to the OAI container and cause the resumption token to not be found and parsed correctly, thus the harvest halts. ( At least that my initial diagnosis. ) It still gives plenty of parser warning messages before halting, however, I'ld like to figure out a better way to unambiguously identify this case. |
sometimes mismatched tag problems can propagate outwards to container and resumptionToken not found in exact path.
…ntities use unicode, not utf-8, so that it works under python 2.x as well as 3
…ATE_VERIFY_FAILED I don't know what needs to be fixed on the server ends (and I have no control over that end) but I found this fix googling the problem (several fixes suggested) https://www.howtouselinux.com/post/ssl-certificate_verify_failed-in-python
This adds XMLParser( recover=True ) option to pyoai/client and includes that module locally. #25
( I've had an outstanding pull request to fix code in that project. )
Adds an optional --recover / --no-recover option to command line args.
( default is --no-recover, to be conservative and keep the same behavior )
Adds code to log errors as warnings when using recover option, and to log which identifiers had parser errors, so that upstream feeds can be notified.
And fixed some issues with logger config that cropped up when testing this code:
harvest.py logging now goes to harvest.log, registry.py goes to registry.log #26 ;
and added some missing whitespace in help strings.