Skip to content

Conversation

@meedstrom
Copy link
Collaborator

@meedstrom meedstrom commented Nov 3, 2025

Motivation for this change

This is an alternative implementation of: #2546

It's probably better.

Some differences:

  • Evalling (org-roam-list-files) now takes me 0.01-0.05s!
  • Evalling (org-roam-db-sync) now takes me 0.10s (when there is nothing to update)
    • Thus it becomes realistic to just throw this in as a preamble to any function or command that may need to ensure a fresh DB.
  • Does not use org-roam-file-p directly to filter the output of org-roam-list-files, because it is not necessary -- several facts, such as being a descendant of org-roam-directory, can be safely assumed! The org-roam-file-p predicate is more useful for checking any wild file name in other programming contexts.
    • As a consequence, there is no use of file-truename nor file-relative-name, so file-name-handler-alist would not affect the runtime much (the earlier PR discussed overriding that setting for perf gains).
  • Splits org-roam-file-exclude-regexp into 4 separate options, which are both more friendly to modification by users, and easier to optimize with.
  • Controversial: the performance overhead in org-roam-db-sync from comparing content hashes is eliminated, because it now just checks filesystem mtime. I do not know if that is sane (feedback please), but I can't see why not.

Content hashes

Note that the content hash is still written into the DB, because I haven't changed org-roam-db-update-file.

Likewise, it still won't update if it sees that the content hash has not changed.

So the difference is only at org-roam-db-sync time, where if hypothetically there exists some file whose contents have changed, but its mtime was held unchanged, then it would not update the DB for that file.

I imagine some rsync command that deliberately doesn't update mtime, but I think that sort of thing has to be the responsibility of the user...

Data points:

@meedstrom meedstrom force-pushed the faster-list-files2 branch 7 times, most recently from 3238434 to 1083827 Compare November 3, 2025 12:18
@akashpal-21
Copy link
Contributor

akashpal-21 commented Nov 3, 2025

Haven't checked the code, but regarding the hashing - I have had eliminated this much before from my system

;; (defun org-roam-db--file-hash (file-path)
;;   "Compute the hash of FILE-PATH."
;;   (with-temp-buffer
;;     (set-buffer-multibyte nil)
;;     (insert-file-contents-literally file-path)
;;     (secure-hash 'sha1 (current-buffer))))

(defun org-roam-db--file-hash (file-path)
  "Compute the hash of FILE-PATH concatenated with its modification time."
  (secure-hash 'sha1 (format "%s %s" file-path (nth 5 (file-attributes file-path)))))

This hasn't caused me any issues whatsoever.
Don't see any reason to check file content hash when there could be more cheap check for modification

Unless something constantly corrupts these attributes for some users!

@dmgerman
Copy link

dmgerman commented Nov 3, 2025

It looks great. One suggestion:

instead of using dolist and cl-loop use mapcar. This turns the code from mutable to immutable and makes it easier to read (in my opinion) especially if you extract the mapcar larger lambdas (like in org-roam--list-recursive-subdirs)

mapcan does mutate the original list to the result list. I think that is desirable in this case. Otherwise it should be replaced with mapcar. the advantage of mapcan is that it does not create a new list. See the documentation for each. The code below uses mapcan

For example:

 (defun org-roam--process-subdir-entry (name dir deny-re follow-symlinks)
   "Process a single directory entry NAME in DIR.
 Returns a list of the directory and its subdirectories if it matches the criteria.
 DENY-RE is the exclusion regex pattern.
 FOLLOW-SYMLINKS determines whether to follow symlinks."
   (when (and (directory-name-p name)
              (not (member name '("./" "../"))))
     (let ((full-name (concat dir name)))
       (when (and (not (string-match-p deny-re full-name))
                  (or follow-symlinks (not (file-symlink-p full-name))))
         (cons full-name
               (org-roam--list-recursive-subdirs full-name follow-symlinks))))))
 
 (defun org-roam--list-recursive-subdirs (dir &optional follow-symlinks)
   "Return a list of DIR, its subdirs, sub-subdirs and so on.
 FOLLOW-SYMLINKS means to include dirs that are symlinks and search them
 for subdirs too, but can result in an infinite loop.
 
 Affected by user options:
 - `org-roam-dir-exclude-literals'
 - `org-roam-dir-exclude-regexps'"
   (let ((deny-re (org-roam--dir-deny-re)))
     (mapcan
      (lambda (name)
        (org-roam--process-subdir-entry name dir deny-re follow-symlinks))
      (file-name-all-completions "" dir))))

@meedstrom
Copy link
Collaborator Author

meedstrom commented Nov 4, 2025

Thank you for your kind words. I don't think I agree on easier to read, and you'll have to explain what you mean by turning it from mutable to immutable. AFAIK these are identical in effect:

(cl-loop for x in '(1 2 3 4)
         collect (+ x 10))
(let (result)
  (dolist (x '(1 2 3 4))
    (push (+ x 10) result))
  (nreverse result))
(mapcar (lambda (x)
          (+ x 10))
        '(1 2 3 4))

I generally prefer cl-loop as my go-to tool, since it also has destructuring (the (file . attr) here):

;; return directories in current dir
(cl-loop for (file . attr) in (directory-files-and-attributes ".")
         if (eq t (file-attribute-type attr))
         collect file)

and many other things, but destructuring in particular is really annoying to lack when using mapcar and friends. Anyway, just a tangent :)

@meedstrom
Copy link
Collaborator Author

meedstrom commented Nov 4, 2025

Btw, the design of org-roam--list-recursive-subdirs is essentially copy-pasted from the built-in directory-files-recursively and simplified. That one uses dolist.

Not that the built-ins are well-known for ease of reading... but that's why it's how it is.

@akashpal-21
Copy link
Contributor

Unrelated but I too prefer cl-loop. I don't get why some library creators insist on excluding cl - for example the first thing that is written in dash library readme is that it excludes cl and so on.
Cl-loop is very simple and nonverbose.

I think a reader will find a cl-loop much easy to reason about than something built from first principles.

@dmgerman
Copy link

dmgerman commented Nov 4, 2025

you'll have to explain what you mean by turning it from mutable to immutable. AFAIK these are identical in effect:

I am of the opinion that mutating variables increases the probability of bugs (as one has to keep track of how the variable is mutated).

That is why I think it is not very good idea to define bindings equal to nil and then assign them. That means that until they are assigned, they cannot be used, which is what I mean as increasing the mental load on whoever is reading the code.

That is why I prefer do-loop and mapcar. But do-loop is a macro, so we might need to see if the macro creates garbage. mapcar is low level and very fast.

In this case, I suspect that mapcan will be slightly faster.

But I am happy with cl-loop. Much better than using do-list. cl-loop is basically for python people ;) as it mimics a comprehension.

@dmgerman
Copy link

dmgerman commented Nov 7, 2025

This might be relevant to speeding up org-roam

https://mahmoodsh.com/efficiently_parsing_org_files.html

@meedstrom
Copy link
Collaborator Author

This might be relevant to speeding up org-roam

https://mahmoodsh.com/efficiently_parsing_org_files.html

As yantar92 pointed out on the Reddit discussion, that article does one thing wrong: (let ((major-mode 'org-mode))). Alas, you should actually call (org-mode) to guarantee it's sane.

But yes, some of the other tricks are good. I said last month (#2547 (comment)) that it'd make sense to rework org-roam-db-sync to not use org-roam-with-file, but rather org-roam-with-temp-buffer.

@meedstrom meedstrom force-pushed the faster-list-files2 branch 4 times, most recently from 8a2fe7e to e9677d4 Compare November 10, 2025 08:34
@meedstrom
Copy link
Collaborator Author

Finally, the checks pass. The tests did good.

@meedstrom meedstrom force-pushed the faster-list-files2 branch 5 times, most recently from 3753313 to 4f42e97 Compare November 10, 2025 10:47
@meedstrom
Copy link
Collaborator Author

meedstrom commented Nov 10, 2025

Added debatable fifth commit that sets org-roam-list-files-commands to nil by default.

Since it is my impression these alternative code paths will be unnecessary -- they only ever existed because the elisp version was thought to be slow, right?

The bottleneck all along seems to have been org-roam-file-p, which the other code paths still apply -- and the elisp path no longer does, hence being faster.

@meedstrom meedstrom closed this Nov 12, 2025
@meedstrom meedstrom reopened this Nov 12, 2025
@meedstrom meedstrom force-pushed the faster-list-files2 branch 5 times, most recently from 0654094 to 715a2f3 Compare November 12, 2025 16:30
@meedstrom
Copy link
Collaborator Author

Tests pass.

@meedstrom meedstrom force-pushed the faster-list-files2 branch 5 times, most recently from ca7f88e to bed80a6 Compare November 13, 2025 08:32
@meedstrom meedstrom force-pushed the faster-list-files2 branch 3 times, most recently from d56fd8b to 86067a8 Compare November 16, 2025 15:35
@meedstrom
Copy link
Collaborator Author

Two things I can do after this is merged:

  • I can make a quick PR for simplifying org-roam-file-p, similar to cabb0ee.

  • I can start work on some optional async code-flow like https://github.com/meedstrom/org-roam-async.

    • It could remain an extension, but either way, I'd want to figure out how that would best work, before I want to touch org-roam-node-read--completions and all its hairy callees.

      • Especially because it seems like there should be some kind of after-sync-hook which just runs on after-save-hook by default -- along with DB sync -- but can be deferred for later if async is enabled. Such a hook would be useful for caching completions as soon as the DB is updated.

        I don't know if caching will be necessary, but anyway, first this PR and then some thinking about async and then we'll get to org-roam-node-read--completions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants