Skip to content

Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest.#31

Open
sdm7g wants to merge 15 commits into
bloomonkey:developfrom
sdm7g:fix-pyoai
Open

Add XMLParser( recover=True ) option to pyoai/client; add --recover option to harvest.#31
sdm7g wants to merge 15 commits into
bloomonkey:developfrom
sdm7g:fix-pyoai

Conversation

@sdm7g
Copy link
Copy Markdown

@sdm7g sdm7g commented Apr 19, 2019

This adds XMLParser( recover=True ) option to pyoai/client and includes that module locally. #25
( I've had an outstanding pull request to fix code in that project. )
Adds an optional --recover / --no-recover option to command line args.
( default is --no-recover, to be conservative and keep the same behavior )
Adds code to log errors as warnings when using recover option, and to log which identifiers had parser errors, so that upstream feeds can be notified.
And fixed some issues with logger config that cropped up when testing this code:
harvest.py logging now goes to harvest.log, registry.py goes to registry.log #26 ;
and added some missing whitespace in help strings.

Copy link
Copy Markdown
Owner

@bloomonkey bloomonkey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi

First, thanks for taking the time and trouble to code this up and submit a PR. This looks like a very useful addition.

That said, I'm uneasy about adding this entire module into oaiharvest as this brings with it the burden of maintaining this code (which I didn't write and don't understand). I'd much prefer it if pyoai could approve your pull request and mint a new release... (I'll go and +1 your PR there shortly)

There a few other changes I've suggested to remove code you're replacing rather than commenting it out. The git repo will keep history for us, so there's no need for it to be in the latest version.

Thanks again, and good luck getting the pyoai change merged... 🤞

John

Comment thread oaiharvest/harvest.py Outdated
if isinstance(metadata, str) and metadata.startswith("b'"):
metadata = ast.literal_eval(metadata).decode("utf-8")
yield (header, metadata, about)
if client.XMLParser.error_log and len(client.XMLParser.error_log) > 0 :
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove space before : (PEP 8)

Comment thread oaiharvest/harvest.py Outdated

ch = logging.StreamHandler()
format='%(levelname)-8s %(message)s',
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Comment thread oaiharvest/registry.py Outdated
datefmt='[%Y-%m-%d %H:%M:%S]',
filename=os.path.join(appdir, 'registry.log')
format='%(levelname)-8s %(message)s'
# format='%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Comment thread oaiharvest/registry.py Outdated
)
logger = logging.getLogger(__name__)
ch = logging.StreamHandler()
# ch = logging.StreamHandler()
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Comment thread oaiharvest/registry.py Outdated
ch = logging.FileHandler( os.path.join( appdir, 'registry.log'))
ch.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(levelname)-8s %(message)s')
# formatter = logging.Formatter('%(levelname)-8s %(message)s')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Comment thread oaiharvest/harvest.py Outdated
formatter = logging.Formatter(
'%(asctime)s %(name)-16s %(levelname)-8s %(message)s',
'[%Y-%m-%d %H:%M:%S]')
#formatter = logging.Formatter('%(levelname)-8s %(message)s')
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove old code instead of commenting it out

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK: commented code removed. ( I had left them in originally, as I was unsure of the intent of the original code, and wanted to make clear what exactly what behavior I was changing. )
I also added the --recover option to the doc-string at the top, which I missed earlier.

@sdm7g
Copy link
Copy Markdown
Author

sdm7g commented Apr 30, 2019

On further use and testing, I have found a couple of cases where it does recover from parsing Mal-formed payload, but where the issue is unmatching tags, the errors cascade outward to the OAI container and cause the resumption token to not be found and parsed correctly, thus the harvest halts. ( At least that my initial diagnosis. ) It still gives plenty of parser warning messages before halting, however, I'ld like to figure out a better way to unambiguously identify this case.
I still consider this behavior better than before, but perhaps it should be documented that --recover could have this side effect.

Steve Majewski added 5 commits May 2, 2019 15:17
sometimes mismatched tag problems can propagate outwards to container and resumptionToken not found in exact path.
…ntities

use unicode, not utf-8, so that it works under python 2.x as well as 3
…ATE_VERIFY_FAILED

I don't know what needs to be fixed on the server ends (and I have no control over that end)
but I found this fix googling the problem (several fixes suggested)
https://www.howtouselinux.com/post/ssl-certificate_verify_failed-in-python
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants