buc.ci is a Fediverse instance that uses the ActivityPub protocol. In other words, users at this host can communicate with people who use software such as Mastodon, Pleroma, or Friendica all around the world.

This server runs the snac software and there is no automatic sign-up process.

Admin email: abucci@bucci.onl
Admin account: @abucci@buc.ci

Search results for tag #ml

Metin Seven 🎨 » 🌐
@metin@graphics.social

I've recently summed up my thoughts on generative "AI" on my homepage. Here's a screenshot of that section.

My thoughts on generative "AI"

I'm glad generative artificial "intelligence" was not a thing yet during the vast majority of my career. A number of realizations arose while exploring generative Large Language Models…

Generative AI is based on massive theft from creatives, without consent, credit or compensation. Using gen-AI is asking a chatbot to spit out the combined efforts of ripped-off creatives. It is industrializing and devaluing human expression, artistry and craftsmanship. Creatives are losing their jobs and motivation because tech corporations unscrupulously absorb and exploit their work. If you appreciate art, support the artists, not the thieves of their labor.

Tech corporations are building more and more huge data centers for AI processing, consuming lots of internet bandwidth, energy, water and more, increasing scarcity, prices and emissions, degrading the already fragile environment.

Unless you're using a fully local AI configuration, every bit of data you submit contributes to the power and reach of corporations and governments, decreasing your privacy and security.

Generative AI enables deepfakes that are widely used for abuse, deception, cybercrime, misinformation and propaganda, polluting justice, science advancement and news report credibility.

More text doesn't fit in this Alt text, but everything can be read over at https://metinseven.nl


    AodeRelay boosted

    Gary McGraw » 🌐
    @cigitalgem@sigmoid.social

    Silver Bullet Security Podcast episode 155 features Giovanni Vigna talking about and hacking. Timely.

    Please RT for reach.

    berryvilleiml.com/2026/04/01/s

      AodeRelay boosted

      Metin Seven 🎨 » 🌐
      @metin@graphics.social

      NVIDIA DLSS 5 be like…

      Two similar Mario game character heads placed next to each other. The left one is an actual 3D game head, the right one is a creepy realistic interpretation of the left head.


        AodeRelay boosted

        Metin Seven 🎨 » 🌐
        @metin@graphics.social

        😆

        Comparison between 3D game characters with and without DLSS 5 AI processing. The version with DLSS processing has turned a grey-haired man into a long-haired woman.


          AodeRelay boosted

          Gary McGraw » 🌐
          @cigitalgem@sigmoid.social

          What is "beigification" in AI, and is it good or bad?

          berryvilleiml.com/2026/03/12/o

            Nate Gaylinn » 🌐
            @ngaylinn@tech.lgbt

            PyTorch has such frustrating documentation.

            It's very thorough, but the API documents always fall into the trap of merely describing the code, without giving any sort of explanation for how you ought to use it. The tutorials similarly show you how to build a thing, but don't bother explaining why they wrote the code like they did.

            What's the point? How is this better than just reading the source code?

            Also, so much of PyTorch is just wrapper objects that let you hook up any sort of data with any sort of model and run it on any sort of hardware. This means knowing ML won't help you much, because PyTorch isn't about ML. It's about this system of abstractions the PyTorch devs came up with to glue together all the ML code in the world, then failed to document. 🙄
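
            For what it's worth, the layering being complained about can be sketched in a few lines (a minimal toy example of my own, not from the post; the shapes and learning rate are arbitrary): Dataset wraps the data, DataLoader wraps the Dataset, Module wraps the parameters, and .to(device) picks the hardware — none of it ML-specific.

```python
# A sketch of PyTorch's layering: Dataset wraps the data, DataLoader wraps
# the Dataset, Module wraps the parameters, and .to(device) picks hardware.
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(32, 4)                                   # any sort of data...
y = torch.randn(32, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=8)   # ...any batching

model = nn.Linear(4, 1)                                  # ...any sort of model
device = "cuda" if torch.cuda.is_available() else "cpu"  # ...any hardware
model = model.to(device)

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for xb, yb in loader:                                    # the training loop itself
    loss = nn.functional.mse_loss(model(xb.to(device)), yb.to(device))
    opt.zero_grad()
    loss.backward()                                      # autograd does credit assignment
    opt.step()
```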

              Gary McGraw » 🌐
              @cigitalgem@sigmoid.social

              and deeply impacting time to exploit. The zero day clock shows this.

              This is impacted by ML...not

              Guess we should have learned those lessons from 25 years ago

              zerodayclock.com/

                AodeRelay boosted

                Thinking Elixir » 🌐
                @ThinkingElixir@genserver.social

                ... [SENSITIVE CONTENT] News includes Expert LSP releasing its first RC, Elixir v1.20 compile-time improvements up to 20% faster, #Livebook Desktop moving to Tauri with #Linux support, a new erlang-python library for ML/AI integration, and more! @elixirlang #ElixirLang #AI #ML https://www.youtube.com/watch?v=Ed83ckXzHcQ

                  Metin Seven 🎨 » 🌐
                  @metin@graphics.social

                  😆😆😆

                  The Trending Mastodon bot account mentions that the "Microslop" hashtag is now trending across Mastodon.


                    Nate Gaylinn » 🌐
                    @ngaylinn@tech.lgbt

                    Backprop is an algorithm I run to optimize an ANN. It needs a top-down view of the network topology and the weights of all synapses. It solves the credit-assignment problem in a clever way, usually based on the error rate compared to a known target. Then it simultaneously updates all the link weights in the network based on how the ANN responded as a whole. First you train your network, then you can use it, but not both at once.

                    Rather than being tuned by some external actor, brain cells manage their own relationships with their neighbors. They grow, prune, and modulate their synapses, and they decide when and how to do that based on imperfect feedback, limited information, and evolved heuristics. Brains track and minimize errors, but the targets are internally generated. This is happening continuously, with fluid transitions between acting in the real world, imagining, thinking, and learning.

                    I'd argue what the brain does is much harder, and much more interesting.

                    (3/3)
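
                    For concreteness, the loop described above can be written out in a few lines (a toy two-layer network in plain numpy; the sizes, seed, and learning rate are arbitrary choices of mine, not from the post): a forward pass, an error against a known target, then one simultaneous update of every link weight.

```python
# A toy two-layer network in plain numpy: forward pass, error against a
# known target, then one simultaneous update of every link weight.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # input
t = np.array([1.0])                 # known target
W1 = rng.normal(size=(4, 3))        # input -> hidden link weights
W2 = rng.normal(size=(1, 4))        # hidden -> output link weights

h = np.tanh(W1 @ x)                 # forward pass
y = W2 @ h
err = y - t                         # error vs. the known target

dW2 = np.outer(err, h)              # credit assignment for the output links
dh = W2.T @ err                     # push the error back through the links
dW1 = np.outer(dh * (1 - h**2), x)  # chain rule through tanh

lr = 0.01
W1 -= lr * dW1                      # all link weights updated at once
W2 -= lr * dW2
```

Note the top-down view the post mentions: computing dW1 requires knowing W2, i.e. the full topology downstream of each link.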

                      Nate Gaylinn » 🌐
                      @ngaylinn@tech.lgbt

                      There are two big reasons I dislike saying brains do something "backprop-like." The first issue is how work is split between nodes and links.

                      In ANNs, the nodes themselves are trivial, and they're completely homogeneous across a full layer of the network, if not all the layers. Any deeper computation is about how the nodes are wired together. That is, the program is in the links (synapse weights), not the nodes.

                      By comparison, brain cells are both complex and diverse. We don't know how much of the computation happens within cells vs. between them. We're just starting to figure out what all the different kinds of cells are, but have little idea of what they're doing. It's clear that individual neurons do a lot, and that ensembles of cells manage each other in complex ways.

                      I worry saying the brain "does backprop" implies a network of trivial nodes, where tuning weight vectors is the place where learning happens. That's likely wrong, and it obscures other possibilities.

                      (2/3)

                        Nate Gaylinn » 🌐
                        @ngaylinn@tech.lgbt

                        I've been thinking about this article @UlrikeHahn shared with me recently. Apparently, I have strong opinions about why we shouldn't say that the brain is doing something "backprop-like" when we learn!

                        Before we start, the key thing to know is that computation in a neural network is distributed across many nodes connected by links. To tune the behavior of the network as a whole, you need to tune each of the nodes and links, but how do you know how any one node or link contributed to the final answer? It's complicated, and each one depends on many others. We call that "credit assignment."

                        I think both brains and artificial neural networks (ANNs) need to solve the credit assignment problem. For ANNs, an algorithm called "back propagation" or just "backprop" is the industry standard solution, and it works very well. I think what brains do is different.

                        (1/3)

                        EDIT: Free PDF

                          William Whitlow » 🌐
                          @wwhitlow@indieweb.social

                          Interesting paper, updated today, engaging Heidegger's philosophy with contemporary Machine Learning techniques. I hope to take more time to engage with this paper over the next couple of days. Still, it is encouraging to see such consideration connecting AGI concerns with philosophical principles, exploring how contemporary design principles lead more to tool use than to AGI.
                          arxiv.org/abs/2602.19028

                            AodeRelay boosted

                            noplasticshower » 🌐
                            @noplasticshower@infosec.exchange

                            @baldur no real experts will be replaced by . Lots of pretenders will be replaced by .

                              AodeRelay boosted

                              Metin Seven 🎨 » 🌐
                              @metin@graphics.social

                              How AI slop is causing a crisis in computer science…

                              Preprint repositories and conference organizers are having to counter a tide of ‘AI slop’ submissions.

                              nature.com/articles/d41586-025

                              ( No paywall: archive.is/VEh8d )

                                AodeRelay boosted

                                Gary McGraw » 🌐
                                @cigitalgem@sigmoid.social

                                I have some thoughts about the state of the practice in , mostly in my head because of my experience on the [un]prompted program committee.

                                berryvilleiml.com/2026/02/04/u

                                  William Whitlow » 🌐
                                  @wwhitlow@indieweb.social

                                  Can someone clarify: in academia and industry, are LLM hallucinations the result of overfitting, or simply a false positive?

                                  I'm beginning to think that hallucinations are evidence of overfitting. It seems surprising that there are few attempts to articulate the underlying cause of hallucinations. Also, if the issue is overfitting, then increasing training time and datasets may not be an appropriate solution to the problem of hallucinations.
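
                                  To make the overfitting hypothesis concrete, here is the classic generic demo (a toy of mine, nothing specific to LLMs): a model with too much capacity drives training error toward zero by memorizing noise, while generalizing worse than a simpler model on the same task.

```python
# A generic overfitting demo: given 10 noisy points from a linear signal,
# a degree-9 polynomial interpolates the noise (training error ~ 0) yet
# generalizes worse than a plain line.  numpy may warn that the degree-9
# fit is poorly conditioned; that is part of the point.
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = x_train + 0.3 * rng.normal(size=10)  # noisy samples of y = x
x_test = np.linspace(0.05, 0.95, 50)
y_test = x_test                                # the true underlying signal

def errors(degree):
    coef = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return train, test

train1, test1 = errors(1)   # simple model
train9, test9 = errors(9)   # memorizes the training set
```

Whether this mechanism is actually what drives LLM hallucinations is exactly the open question being asked; the demo only shows why more training on the same data need not help.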

                                    William Whitlow » 🌐
                                    @wwhitlow@indieweb.social

                                    @chris_e_simpson

                                    Is there a particular article you have in mind?

                                    I ask because the term ambiguity on this topic has really begun to fascinate me. As the topic has gone more and more mainstream, it has been difficult to sit back and listen to people talk about it with no idea what the underlying algorithms are doing. There seems to be a gap between how general users understand these systems and how developers understand them. My fear is that this gap is what is causing so many issues right now.

                                      AodeRelay boosted

                                      Gary McGraw » 🌐
                                      @cigitalgem@sigmoid.social

                                      Another kind of dangerous feedback loop involves the user.

                                      nytimes.com/2026/01/26/us/chat

                                        AodeRelay boosted

                                        Gary McGraw » 🌐
                                        @cigitalgem@sigmoid.social

                                        Just got a briefing on how OpenAI develops and secures code internally using Codex5.1. Also got another briefing on the ONE MLsec-ssg that any of us have ever seen in a major enterprise.

                                        The world is changing.

                                          AodeRelay boosted

                                          Gary McGraw » 🌐
                                          @cigitalgem@sigmoid.social

                                          Good solid coverage of ChatGPT-Health. Many of the risks BIML has been warning about are coming into play fast.

                                          darkreading.com/remote-workfor

                                            AodeRelay boosted

                                            Gary McGraw » 🌐
                                            @cigitalgem@sigmoid.social

                                            I have been reviewing submissions for [un]prompted. Some good stuff out there. Also some terrible stuff. Bwahaha.

                                            unpromptedcon.org/

                                              AodeRelay boosted

                                              Laurent Cimon » 🌐
                                              @clf@mastodon.bsd.cafe

                                              New blog post, first in a long time.

                                              Why I chose to go towards Machine Learning research

                                              nilio.ca/post?title=Why%20I%20

                                                AodeRelay boosted

                                                Joan of Contention 😷 » 🌐
                                                @clickhere@mastodon.ie

                                                "To reiterate first principles, my main problem with Artificial Intelligence, as its currently sold, is that it’s a lie."

                                                Seamas O'Reilly in @Tupp_ed's (Guest) Gist, today:

                                                thegist.ie/guest-gist-2026-our

                                                  AodeRelay boosted

                                                  Chi Kim » 🌐
                                                  @chikim@mastodon.social

                                                  Wow, Chatterbox-Turbo is pretty good! As a quick test, I let two local LLMs ramble about random topics of their choice and generated audio using zero-shot voice cloning with Chatterbox-Turbo.
                                                  resemble.ai/chatterbox-turbo/

                                                  @ZBennoui

                                                  Alt...Two LLMs rambling about random topics such as pizza, seagull, pigeon, etc.

                                                    AodeRelay boosted

                                                    Gary McGraw » 🌐
                                                    @cigitalgem@sigmoid.social

                                                    Psyched to serve on the conference committee and review board for [un]prompted, a new AI security practitioner conference, happening March 3/4 in SF's Salesforce Tower.

                                                    This is a community-focused event with a bead on what actually works in /#AI security, from simple tools that just work, through strategy, all the way to offense and defense.

                                                    Submit a talk. Check the conference out.

                                                    Let's see some real

                                                    unpromptedcon.org/

                                                      AodeRelay boosted

                                                      Chi Kim » 🌐
                                                      @chikim@mastodon.social

                                                      Meta released their Segment Anything Model for Audio that lets you separate audio with text prompts. It could be guitar, speech, bird sound, etc. ai.meta.com/samaudio/

                                                        AodeRelay boosted

                                                        JTI » 🌐
                                                        @jti42@infosec.exchange

                                                        I see a lot of blanket, outright rejection of LLMs in general, or coding LLMs in particular, here on the Fediverse.
                                                        Often, the actual impact of the AI in use is not even understood by those criticizing it, at times leading to tantrums about AI where there is... no AI involved.

                                                        The technology (LLM et al) in itself is not likely to go away for a few more years. The smaller variations that aren't being yapped about as much are going to remain here as they have been for the past decades.
                                                        I assume that what will indeed happen is a move from centralized cloud models to on-prem hardware as the hardware becomes more powerful and the models more efficient. Think of the migration from large mainframes to desktop PCs. We're seeing a start of this with devices such as the ASUS Ascent / .

                                                        Imagine having the power of under your desk, powered for free by cells on your roof with some nice solar powered AC to go with it.

                                                        Would it not be wise to accept the reality of the existence of this technology and find out how this can be used in a good way that would improve lives? And how smart, small regulation can be built and enforced that balances innovation and risks to get closer to (tm)?

                                                        Low-key reminds me of the Maschinenstürmer (machine wreckers) of past times...

                                                          AodeRelay boosted

                                                          Patrick :neocat_flag_bi: » 🌐
                                                          @patrick@hatoya.cafe

                                                          One Open-source Project Daily

                                                          ML.NET is an open source and cross-platform machine learning framework for .NET.

                                                          https://github.com/dotnet/machinelearning

                                                            3 ★ 4 ↺
                                                            Anais boosted

                                                            Anthony » 🌐
                                                            @abucci@buc.ci

                                                            The present perspective outlines how epistemically baseless and ethically pernicious paradigms are recycled back into the scientific literature via machine learning (ML) and explores connections between these two dimensions of failure. We hold up the renewed emergence of physiognomic methods, facilitated by ML, as a case study in the harmful repercussions of ML-laundered junk science. A summary and analysis of several such studies is delivered, with attention to the means by which unsound research lends itself to social harms. We explore some of the many factors contributing to poor practice in applied ML. In conclusion, we offer resources for research best practices to developers and practitioners.
                                                            From "The reanimation of pseudoscience in machine learning and its ethical repercussions", here: https://www.cell.com/patterns/fulltext/S2666-3899(24)00160-0. It's open access.

                                                            In other words ML--which includes generative AI--is smuggling long-disgraced pseudoscientific ideas back into "respectable" science, and rejuvenating the harms such ideas cause.


                                                              9 ★ 7 ↺

                                                              Anthony » 🌐
                                                              @abucci@buc.ci

                                                              R.A. Fisher wrote that the purpose of statisticians was "constructing a hypothetical infinite population of which the actual data are regarded as constituting a random sample" (p. 311 here). In "The Zeroth Problem", Colin Mallows wrote "As Fisher pointed out, statisticians earn their living by using two basic tricks--they regard data as being realizations of random variables, and they assume that they know an appropriate specification for these random variables."

                                                              Some of the pathological beliefs we attribute to techbros were already present in this view of statistics that started forming over a century ago. Our writing is just data; the real, important object is the “hypothetical infinite population” reflected in a large language model, which at base is a random variable. Stable Diffusion, the image generator, is called that because it is based on latent diffusion models, which are a way of representing complicated distribution functions--the hypothetical infinite populations--of things like digital images. Your art is just data; it’s the latent diffusion model that’s the real deal. The entities that are able to identify the distribution functions (in this case tech companies) are the ones who should be rewarded, not the data generators (you and me).

                                                              So much of the dysfunction in today’s machine learning and AI points to how problematic it is to give statistical methods a privileged place that they don’t merit. We really ought to be calling out Fisher for his trickery and seeing it as such.


                                                                David Gerard boosted

                                                                Kee Hinckley » 🌐
                                                                @nazgul@infosec.exchange

                                                                Something I’ve been thinking about a lot in the current battle over the future of (pseudo) AI is the cotton gin.

                                                                I live in a country where industrial progress is always considered a positive. It’s such a fundamental concept to the American exceptionalism claim that we are taught never to question it, let alone realize that it’s propaganda.

                                                                One such myth, taught early in grade school, is the story of Eli Whitney and the cotton gin. Here was a classic example of a labor-saving device that made millions of lives better. No more overworked people hand cleaning the cotton (slaves, though that was only mentioned much later, if at all). Better clothes and bedding for the world. Capitalism at its best.

                                                                But that’s only half the story of this great industrial time saver. Where did those cotton cleaners go? And what was the impact of speeding up the process?

                                                                Now that the cleaning bottleneck was gone, the focus was on picking cotton as fast as possible. Those cotton cleaners likely, and millions of other slaves definitely, were sent to the fields to pick cotton. There was an unprecedented explosion in the slave trade. Industrial time management and optimization methods were applied to human beings using elaborate rule-based systems written up in books. How hard to punish to get optimal productivity. How long their lifespans needed to be to get the most production per dollar. Those techniques, practiced on the backs and lives of slaves, became the basis of how to run the industrial mills in the North. They are the ancestors of the techniques that your manager uses now to improve productivity.

                                                                Millions of people were sold into slavery and worked to death *because* of the cotton gin. The advance it provided did not, in fact, save labor overall. Nor did it make life better overall. It made a very small set of people much, much richer; especially the investors around the world who funded the banks who funded the slave purchases. It made a larger set of consumers more comfortable at the cost of the lives of those poorer. Over a hundred years later this model is still the basis for our society.

                                                                Modern “AI” is a cotton gin. It makes a lot of painstaking things much easier and available to everyone. Writing, reading, drawing, summarizing, reviewing medical cases, hiring, firing, tracking productivity, driving, identifying people in a lineup…they all can now be done automatically. Put aside whether it’s actually capable of doing any of those things *well*; the investors don’t care if their products are good, they only care if they can make more money off of them. So long as they work enough to sell, the errors, and the human cost of those errors, are irrelevant. And like the cotton gin, AI has other side effects. When those jobs are gone, are the new jobs better? Or are we all working that much harder, with even more negative consequences to our life if we fall off the treadmill? One more fear to keep us “productive”.

                                                                The Luddites learned this lesson the hard way, and history demonizes them for it; because history isn’t written by the losers.

                                                                They’ve wrapped “AI” with a shiny ribbon to make it fun and appealing to the masses. How could something so fun to play with be dangerous? But like the story we are told about the cotton gin, the true costs are hidden.

                                                                  4 ★ 0 ↺

                                                                  Anthony » 🌐
                                                                  @abucci@buc.ci

                                                                  Speaking of machine learning, I once had a paper rejected from ICML (the International Conference on Machine Learning) in the early 2000s because it "wasn't about machine learning" (a minor paraphrase of comments in 2 of the 3 reviews, if I recall correctly). That field was consolidating--in a bad way, in my view--around a very small set of ideas even back then. My co-author and I wrote a rebuttal to the rejection, which we had the opportunity to do, arguing that our work was well within the scope of machine learning as set out by Arthur Samuel's pioneering work in the late 1950s/early 1960s that literally gave the field its name (Samuel 1959, Some studies in machine learning using the game of checkers). Their retort was that machine learning consisted of: learning probability distributions of data (unsupervised learning); learning discriminative or generative probabilistic models from data (supervised learning); or reinforcement learning. Nothing else. OK, maybe I'm missing one, but you get the idea.

                                                                  We later expanded this work and landed it as a chapter in the 2008 book Multiobjective Problem Solving from Nature, which is downloadable from https://link.springer.com/book/10.1007/978-3-540-72964-8 . You'll see the chapter starting on page 357 of that PDF (p. 361 in the PDF's pagination). We applied a technique from the theory of coevolutionary algorithms to examine small instances of the game of Nim, and were able to make several interesting statements about that game. Arthur Samuel's original papers on checkers were about learning by self-play, a particularly simple form of coevolutionary algorithm, as I argue in the introductory chapter of my PhD dissertation. Our technique is applicable to Samuel's work and any other work in that class--in other words, it's squarely "machine learning" in the sense Samuel meant the term.
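
                                                                  As background for readers unfamiliar with Nim (this is textbook game theory, not the chapter's coevolutionary technique): the game has a complete closed-form solution due to Bouton (1901), which makes small instances a convenient ground truth for checking any learned or coevolved player.

```python
# Bouton's theorem for Nim: the player to move loses under optimal play
# exactly when the XOR ("nim-sum") of the pile sizes is 0.
from functools import reduce
from operator import xor

def losing_for_mover(piles):
    """True if the player to move loses Nim with optimal play."""
    return reduce(xor, piles, 0) == 0

def winning_move(piles):
    """Return (pile_index, new_size) of an optimal move, or None."""
    s = reduce(xor, piles, 0)
    if s == 0:
        return None          # every move hands the opponent a winning position
    for i, p in enumerate(piles):
        if (p ^ s) < p:      # shrinking pile i to p ^ s makes the nim-sum 0
            return i, p ^ s
    return None
```

For example, piles (1, 2, 3) have nim-sum 0, so the mover loses; from (3, 4, 5) the optimal move takes the first pile down to 1.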

                                                                  Whatever you may think of this particular work of mine, it's bad news when a field forgets and rejects its own historical origins and throws away the early fruitful lines of work that led to its own birth. threatens to have a similar wilting effect on artificial intelligence and possibly on computer science more generally. The marketplace of ideas is monopolizing, the ecosystem of ideas collapsing. Not good.


                                                                    2 ★ 0 ↺

                                                                    Anthony » 🌐
                                                                    @abucci@buc.ci

                                                                    Haven't read this one yet, but I'm itching to:

                                                                    https://mastodon.world/@Mer__edith/113197090927589168

                                                                    Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

                                                                    With the growing attention and investment in recent AI approaches such as large language models, the narrative that the larger the AI system the more valuable, powerful and interesting it is is increasingly seen as common sense. But what is this assumption based on, and how are we measuring value, power, and performance? And what are the collateral consequences of this race to ever-increasing scale? Here, we scrutinize the current scaling trends and trade-offs across multiple axes and refute two common assumptions underlying the 'bigger-is-better' AI paradigm: 1) that improved performance is a product of increased scale, and 2) that all interesting problems addressed by AI require large-scale models. Rather, we argue that this approach is not only fragile scientifically, but comes with undesirable consequences. First, it is not sustainable, as its compute demands increase faster than model performance, leading to unreasonable economic requirements and a disproportionate environmental footprint. Second, it implies focusing on certain problems at the expense of others, leaving aside important applications, e.g. health, education, or the climate. Finally, it exacerbates a concentration of power, which centralizes decision-making in the hands of a few actors while threatening to disempower others in the context of shaping both AI research and its applications throughout society.
                                                                    Currently this is on arXiv which, if you've read any of my critiques, is a dubious source. I'd love to see this article appear in a peer-reviewed or otherwise vetted venue, given the importance of its subject.

                                                                    I've heard through the grapevine that US federal grantmaking agencies like the NSF (National Science Foundation) are also consolidating around generative AI. This trend is evident if you follow directorates like CISE (Computer and Information Science and Engineering). A friend told me there are several NSF programs that tacitly demand that LLMs of some form be used in project proposals, even when doing so is not obviously appropriate. A friend of a friend, who is a university professor, has said "if you're not doing LLMs, you're not doing machine learning".

                                                                    This is an absolutely devastating mindset. While it might be true at a certain cynical, pragmatic level, it's clearly indefensible at an intellectual, scholarly, scientific, and research level. Willingly throwing away the diversity of your own discipline is bizarre, foolish, and dangerous.


                                                                      3 ★ 1 ↺
                                                                      AI Channel boosted

                                                                      [?]Anthony » 🌐
                                                                      @abucci@buc.ci

                                                                      A Handy AI Glossary

                                                                      AI = Automated Immiseration
                                                                      GenAI = Generative Automated Immiseration
                                                                      AGI = Automated General Immiseration
                                                                      LLM = Large Labor-exploitation Model
                                                                      ML = Machine Labor-exploitation

                                                                        6 ★ 2 ↺

                                                                        [?]Anthony » 🌐
                                                                        @abucci@buc.ci

                                                                        "Data is the new oil" has never made any sense to me. I think I understand why people say this as a shorthand. But data about people, which is often what's being referred to, is more like perishable food. Data goes stale. It goes bad. In some applications it's stale moments after you collect it. Data can even be toxic (and models can be data-poisoned!).

                                                                        If you accept the viewpoint of ecological rationality, data (about people) is not nearly as useful for predictive purposes as it's made out to be. There is a "less is more" phenomenon in many applications, especially those that claim to predict behaviors or outcomes of some kind. See also this talk: https://www.cs.princeton.edu/news/how-recognize-ai-snake-oil .
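The "data goes stale" point can be made concrete with a toy simulation (the drift model and all numbers here are my own illustration, not from any real dataset): a predictor fit on observations collected at one time degrades as the underlying process drifts away from it.

```python
import random
import statistics

# A signal that drifts slowly over time, plus noise.
def drifting_signal(t, rng):
    return 0.05 * t + rng.gauss(0.0, 0.5)

def staleness_error(age, rng, n=200):
    """Mean absolute error when predicting values at time `age`
    using a mean fitted on observations collected at time 0."""
    train = [drifting_signal(0, rng) for _ in range(n)]
    model = statistics.mean(train)  # the simplest possible "model"
    test = [drifting_signal(age, rng) for _ in range(n)]
    return statistics.mean(abs(x - model) for x in test)

rng = random.Random(0)
fresh = staleness_error(0, rng)    # predict right after collection
stale = staleness_error(100, rng)  # same pipeline, 100 time steps later
# under this drift model, the stale error dwarfs the fresh one
```

Nothing about the toy depends on the model being this crude; any fixed model inherits the staleness of the data it was fit on.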

                                                                        There is a "less is more" effect with food, too. People need a baseline amount of food to maintain health, but having significantly more food than that doesn't confer significantly more health. Also food can spoil if one hoards it.

                                                                        If data is a liquid, it's more like milk than oil.

                                                                          4 ★ 4 ↺

                                                                          [?]Anthony » 🌐
                                                                          @abucci@buc.ci

                                                                          Just to clarify the point I was making yesterday about arXiv, below I've included a plot from arXiv's own stats page https://info.arxiv.org/help/stats/2021_by_area/index.html . The image contains two charts side-by-side. The chart on the left is a stacked area chart tracking the number of submissions to each of several arXiv categories through time, from 1991 to 2021. I obtained this screenshot today; arXiv's site, at time of writing, says the chart had been updated 3 January 2022. The caption to this plot on the arXiv page I linked has more detail about it.

                                                                          What you're seeing here is that for most categories, there is a linear increase in the number of submissions to the category year-over-year up until the end of the data series in 2021. Computer science is dramatically different: its increase looks exponential, and it looks like its rate of increase may have accelerated circa 2017. The chart on the right, which is the same data shown proportional instead of as raw counts, suggests computer science might be "eating" mathematics starting around 2017.

                                                                          2017 is around when generative AI papers started to appear in large quantities. There was a significant advance in machine learning, published around 2018 but known before then, that made deep learning significantly more effective. Tech companies were already pushing this technology: OpenAI (the ChatGPT / GPT maker) was founded in 2015, and GPT-2 was released in early 2019. arXiv's charts don't show this, but I suspect these factors play a role in the seeming phase shift in their CS submissions around 2017.

                                                                          We don't know what 2022 and 2023 would look like on a chart like this but I expect the exponential increase will have continued and possibly accelerated.
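The eyeball judgment "linear for most categories, exponential for CS" can be checked quantitatively: fit log(count) against year by least squares, and a good linear fit in log space means exponential growth, with ln(2)/slope giving the doubling time. A sketch, using hypothetical counts (NOT arXiv's actual submission numbers):

```python
import math

# Least-squares fit of log(count) ~ a + b * year.
# count ~= exp(a) * exp(b * year); doubling time = ln(2) / b.
def fit_exponential(years, counts):
    n = len(years)
    logs = [math.log(c) for c in counts]
    ybar = sum(years) / n
    lbar = sum(logs) / n
    b = sum((y - ybar) * (l - lbar) for y, l in zip(years, logs)) \
        / sum((y - ybar) ** 2 for y in years)
    a = lbar - b * ybar
    return a, b

# Hypothetical series: 40% year-over-year growth from an arbitrary base.
years = list(range(2013, 2022))
counts = [2000 * 1.4 ** (y - 2013) for y in years]
a, b = fit_exponential(years, counts)
doubling_time = math.log(2) / b  # about 2 years at 40% annual growth
```

Run on arXiv's real per-category counts, a fit like this would distinguish the steady linear categories from a CS series whose doubling time is a couple of years.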

                                                                          In any case, this trend is extremely concerning. An exponential increase in the number of submissions to what is supposed to be an academic pre-print service is not reasonable: there hasn't been an exponential increase in the number of computer scientists, nor in research funding, nor in research labs, nor in the output-per-person of each scientist. Furthermore, these new submissions threaten to completely swamp all other material: before long, computer science submissions will dwarf those of all other fields combined; since this chart stops at 2021, they may already. arXiv's graphs do not break down the CS submissions by subtopic, but I suspect most are in the machine learning/generative AI/LLM space, and that submissions on these topics dwarf the other subdisciplines of computer science. Finally, to the extent that arXiv has quality controls in place for its archive, these can't possibly keep up with an exponentially-increasing rate of submissions. They will eventually fail if they haven't already (as I suggested in a previous post, I think there are signs that their standards are slipping; perhaps that started circa 2017, and that's partly why the rate of submissions accelerated then?).


                                                                          Description is in the body of the post.
