Datascience question #256

@thiswillbeyourgithub

Some interesting data science questions raised during an email exchange with the maintainer:

  • What is the distribution of how often a TinyIndex page gets updated, meaning the crawler uploaded a better data point? Some pages might get updated far more often than others and in fact never be stable. Tokens I'm thinking of are, for example, "fun" or "news". I don't remember how many URLs a single page can hold, but it would be interesting to know how long, on average, we estimate it takes for a page to be fully replaced (Ship of Theseus style). Replacement is obviously not uniform across the documents of a page, but it would still be interesting to estimate the "stability" of a page.

  • What is the distribution of TinyIndex pages that are full vs. not full?

  • What is the relationship between a page's size and its "stability"? This might show that some pages are full but unstable, which would mean we are missing content.
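
To make the three questions concrete, here is a rough Python sketch of how they could be measured. Everything in it is an assumption for illustration: the update log format (one `(page_id, timestamp)` entry per replaced document), the `page_sizes` mapping, and the `capacity` parameter are hypothetical and are not taken from the project's code. The turnover estimate is just the mean gap between updates multiplied by the page capacity, so it ignores the fact that replacement is not uniform across the documents of a page.

```python
from collections import Counter
from statistics import mean


def updates_per_page(update_log):
    """Count updates per page from a hypothetical log of (page_id, timestamp)."""
    return Counter(page_id for page_id, _ in update_log)


def turnover_time(update_times, capacity):
    """Rough time for `capacity` replacements at the observed update rate.

    Returns None when there are too few updates to estimate a rate.
    """
    if len(update_times) < 2:
        return None
    span = max(update_times) - min(update_times)
    mean_gap = span / (len(update_times) - 1)  # average time between updates
    return capacity * mean_gap


def full_fraction(page_sizes, capacity):
    """Fraction of pages holding at least `capacity` documents."""
    full = sum(1 for size in page_sizes.values() if size >= capacity)
    return full / len(page_sizes)


def mean_updates_by_fullness(update_counts, page_sizes, capacity):
    """Mean update count for full pages and for non-full pages."""
    full = [update_counts.get(p, 0) for p, s in page_sizes.items() if s >= capacity]
    partial = [update_counts.get(p, 0) for p, s in page_sizes.items() if s < capacity]
    return (mean(full) if full else None, mean(partial) if partial else None)
```

If full pages turn out to have a much higher mean update count than non-full ones, that would be a hint that they are both full and unstable. A histogram of the `updates_per_page` values would answer the first question directly.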
