Anserini from document collection regressions need Pyserini: eliminate dependency #3190

@lintool

For these two:

  • wiki-all-6-3-tamber-bm25
  • wikipedia-dpr-100w-bm25

Both configs are in src/main/resources/reproduce/from-document-collection/configs/

The reproductions call out to Pyserini, which has been failing since we upgraded to Lucene 10: the Lucene version that Pyserini loads rejects the newly built index, e.g.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/eval/convert_trec_run_to_dpr_retrieval_run.py", line 48, in <module>
    searcher = LuceneSearcher(args.index)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tuna1/scratch/jimmylin/pyserini/pyserini/search/lucene/_searcher.py", line 50, in __init__
    self.object = JSimpleSearcher(index_dir)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "jnius/jnius_export_class.pxi", line 285, in jnius.JavaClass.__init__
  File "jnius/jnius_export_class.pxi", line 403, in jnius.JavaClass.call_constructor
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.IllegalArgumentException: indexCreatedVersionMajor is in the future: 10
java.lang.IllegalArgumentException: indexCreatedVersionMajor is in the future: 10
	org.apache.lucene.index.SegmentInfos.<init>(SegmentInfos.java:180)
	org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:363)
	org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:304)
	org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:88)
	org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77)
	org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:820)
	org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
	org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:67)
	org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:60)
	io.anserini.search.SimpleSearcher.<init>(SimpleSearcher.java:132)
	io.anserini.search.SimpleSearcher.<init>(SimpleSearcher.java:114)
Exception in thread "main" java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Command failed: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --index indexes/lucene-index.wiki-all-6-3-tamber/ --topics dpr-nq-test --input runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt --output runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt.json --combine-title-text
	at io.anserini.reproduce.ReproduceFromDocumentCollection.runCommandsInThreadPool(ReproduceFromDocumentCollection.java:753)
	at io.anserini.reproduce.ReproduceFromDocumentCollection.run(ReproduceFromDocumentCollection.java:320)
	at io.anserini.reproduce.ReproduceFromDocumentCollection.main(ReproduceFromDocumentCollection.java:195)
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Command failed: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --index indexes/lucene-index.wiki-all-6-3-tamber/ --topics dpr-nq-test --input runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt --output runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt.json --combine-title-text
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at io.anserini.reproduce.ReproduceFromDocumentCollection.runCommandsInThreadPool(ReproduceFromDocumentCollection.java:751)
	... 2 more
Caused by: java.lang.RuntimeException: Command failed: python -m pyserini.eval.convert_trec_run_to_dpr_retrieval_run --index indexes/lucene-index.wiki-all-6-3-tamber/ --topics dpr-nq-test --input runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt --output runs/run.index.wiki-all-6-3-tamber.dpr-nq-test.bm25.txt.json --combine-title-text
	at io.anserini.reproduce.ReproduceFromDocumentCollection.runCommand(ReproduceFromDocumentCollection.java:731)
	at io.anserini.reproduce.ReproduceFromDocumentCollection.lambda$runCommandsInThreadPool$0(ReproduceFromDocumentCollection.java:741)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

We would like to eliminate this dependency.
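
As a rough starting point for a Java-only replacement, the sketch below covers just the first step of the conversion: reading the TREC run file that the failing command takes as --input and grouping its hits by topic. It assumes the standard six-column run format (qid Q0 docid rank score tag). The class name TrecRunGrouper and its Hit type are hypothetical and not part of Anserini. Judging by the --index and --topics flags, the Pyserini script also consults the index and the topic set; that part, and the JSON output, are not shown here.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an existing Anserini class.
public class TrecRunGrouper {

  /** One ranked hit taken from a single run line. */
  public static final class Hit {
    public final String docid;
    public final int rank;
    public final double score;

    Hit(String docid, int rank, double score) {
      this.docid = docid;
      this.rank = rank;
      this.score = score;
    }
  }

  /**
   * Groups run lines by topic id, keeping topics and hits in file order.
   * Each non-empty line must have six whitespace-separated fields:
   * qid Q0 docid rank score tag
   */
  public static Map<String, List<Hit>> group(List<String> lines) {
    Map<String, List<Hit>> byTopic = new LinkedHashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate blank lines
      }
      String[] f = trimmed.split("\\s+");
      if (f.length != 6) {
        throw new IllegalArgumentException("Malformed run line: " + line);
      }
      Hit hit = new Hit(f[2], Integer.parseInt(f[3]), Double.parseDouble(f[4]));
      byTopic.computeIfAbsent(f[0], k -> new ArrayList<>()).add(hit);
    }
    return byTopic;
  }

  public static void main(String[] args) throws Exception {
    // Usage: java TrecRunGrouper <path-to-run-file>
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    group(lines).forEach((qid, hits) -> System.out.println(qid + "\t" + hits.size()));
  }
}
```

Run against a run file, main prints each topic id and its number of hits. Because this step reads only the run file, it never opens the index, so it does not depend on which Lucene version wrote it.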