Bug: Readability failure aborts archiving process with exception #847

@herrbischoff

Describe the bug

Attempting to archive https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702 aborts the entire process with an exception instead of logging the failure for that link and continuing. The traceback shows a TypeError raised by hints.split('\n') in log_archive_method_finished, which suggests the type of hints is not checked thoroughly enough before it is split.
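
To illustrate the behaviour expected here (one failing link is recorded and the run carries on), a minimal sketch follows. It is not ArchiveBox's code: archive_all and the archive_one callback are hypothetical names used only for this example.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def archive_all(
    links: Iterable[str],
    archive_one: Callable[[str], None],  # hypothetical per-link archiver
) -> Tuple[List[str], Dict[str, str]]:
    """Archive every link, collecting failures instead of aborting the run."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for url in links:
        try:
            archive_one(url)
        except Exception as err:
            # Record the error for this link and move on to the next one.
            failed[url] = f"{type(err).__name__}: {err}"
            continue
        succeeded.append(url)
    return succeeded, failed
```

With a loop of this shape, an extractor that raises for one URL would show up in the failed mapping while the remaining URLs are still processed.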

Steps to reproduce

  1. Ran ArchiveBox with the following config:
[SERVER_CONFIG]
SECRET_KEY = [REDACTED]

[ARCHIVE_METHOD_OPTIONS]
RESOLUTION = 1440,4320
YOUTUBEDL_BINARY = /usr/local/bin/yt-dlp

[GENERAL_CONFIG]
TIMEOUT = 1200

and the command

archivebox add https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702
  2. Relevant output:
[√] [2021-09-14 00:54:24] "Dead white man's clothes: How fast fashion is turning parts of Ghana into toxic landfill - ABC News"
    https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702
    √ ./archive/1631453820.320194
      > readability
    ! Failed to archive link: Exception: Exception in archive_methods.save_readability(Link(url=https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702))

Traceback (most recent call last):
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 114, in archive_link
    log_archive_method_finished(result)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/logging_util.py", line 435, in log_archive_method_finished
    hints = hints if isinstance(hints, (list, tuple)) else hints.split('\n')
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/archivebox/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/archivebox_update.py", line 119, in main
    update(
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/main.py", line 783, in update
    archive_links(to_archive, overwrite=overwrite, **archive_kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_readability(Link(url=https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702))
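
The TypeError in the first traceback is the one Python raises when split('\n') is called on a bytes value with a str separator, so hints most likely arrived as bytes rather than str. Below is a minimal sketch of a defensive normalization, assuming hints may be None, a str, bytes, or a list/tuple of either. normalize_hints is a hypothetical helper written for this example, not a function in ArchiveBox.

```python
from typing import List, Sequence, Union

Hint = Union[str, bytes]
Hints = Union[Hint, Sequence[Hint], None]


def _to_text(value: Hint) -> str:
    """Decode bytes to str; undecodable bytes are replaced, not fatal."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8", errors="replace")
    return str(value)


def normalize_hints(hints: Hints) -> List[str]:
    """Return hints as a list of str lines, whatever type came in."""
    if hints is None:
        return []
    if isinstance(hints, (str, bytes, bytearray)):
        # Decode first so that the str separator matches the value type.
        return _to_text(hints).split("\n")
    return [_to_text(item) for item in hints]
```

For example, normalize_hints(b"first\nsecond") gives ['first', 'second'], whereas b"first\nsecond".split('\n') raises the TypeError shown above.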

ArchiveBox version

ArchiveBox v0.6.2
Cpython FreeBSD FreeBSD-13.0-RELEASE-p4-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/home/archivebox/.local/bin/archivebox
 √  PYTHON_BINARY         v3.8.10         valid     /usr/local/bin/python3.8
 √  DJANGO_BINARY         v3.1.13         valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.78.0         valid     /usr/local/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget
 √  NODE_BINARY           v14.17.0        valid     /usr/local/bin/node
 √  SINGLEFILE_BINARY     v0.3.29         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.32.0         valid     /usr/local/bin/git
 √  YOUTUBEDL_BINARY      v2021.06.09     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v92.0.4515.159  valid     /usr/local/bin/chrome
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /var/db/archivebox
 √  SOURCES_DIR           48 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           1474 files      valid     ./archive
 √  CONFIG_FILE           861.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             13.3 MB         valid     ./index.sqlite3
