PDF to text parsing

As a final fallback, we extract the full text of the PDF and search it for DOI information. This occasionally produces errors; I see a few across my library of ~8000 files.

We might want to try something more sophisticated here. I'm not familiar with this space, but https://allenai.org/blog/olmocr-2 has come up.