buc.ci is a Fediverse instance that uses the ActivityPub protocol. In other words, users at this host can communicate with people who use software such as Mastodon, Pleroma, Friendica, and others all around the world.

This server runs the snac software and there is no automatic sign-up process.

Admin email
abucci@bucci.onl
Admin account
@abucci@buc.ci

Search results for tag #nlp

AodeRelay boosted

BSidesLuxembourg » 🌐
@BSidesLuxembourg@infosec.exchange

⚡⚡⚡ Lightning Talk! ⚡⚡⚡
🪦🔍 WHAT IS THE DARK WEB TALKING ABOUT? - DARK JARGON DETECTION AND IDENTIFICATION - Laura Bernardy 🔍🕵️‍♂️
The dark web hides in code, and its language is built to confuse. In this talk, Laura Bernardy shows how NLP can decode the slang, jargon, and encrypted phrases used by cybercriminals.

Laura Bernardy (lu.linkedin.com/in/laura-berna) is a PhD candidate at SnT Luxembourg, researching dark web content and cyber threat intelligence using natural language processing. She holds a master’s in computational linguistics and has worked on low-resource language NLP. Her work combines linguistics, cybersecurity, and AI to decode what’s being said and who’s saying it.

πŸ“… Conference Dates: 6–8 May 2026 | 09:00–18:00
πŸ“ 14, Porte de France, Esch-sur-Alzette, Luxembourg
🎟️ Tickets: 2026.bsides.lu/tickets/
πŸ“… Schedule Link: pretalx.com/bsidesluxembourg-2

    AodeRelay boosted

    pki » 🌐
    @pki@mastodon.bsd.cafe

    Built a cybersecurity NER model. 13 entity types. 1,500+ security entities. It's on HuggingFace.

    Spent months extracting and annotating cybersecurity entities from real job postings, threat reports, and compliance docs. Turning it into a tool anyone can use.

    What it extracts:
    - Security roles (CISO, SOC Analyst, Pen Tester)
    - Certifications (CISSP, OSCP, CEH)
    - Tools (Splunk, CrowdStrike, Metasploit)
    - Threats (APT, ransomware, phishing)
    - Attack techniques (SQLi, XSS, RCE)
    - CVEs, frameworks (MITRE ATT&CK, NIST), regulations (GDPR, PCI-DSS)
    - Technical skills, acronyms, compliance terms
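
    The entity schema above can be illustrated with a toy gazetteer lookup. This is not the RoBERTa model described in the post, just a hypothetical dictionary-matching sketch of what "extracting typed security entities from text" means; all names and labels here are made up for illustration.

```python
# Toy gazetteer matcher illustrating a cybersecurity entity schema.
# NOT the actual model -- a simple dictionary lookup for illustration.
GAZETTEER = {
    "SOC Analyst": "ROLE", "CISO": "ROLE",
    "CISSP": "CERT", "OSCP": "CERT",
    "Splunk": "TOOL", "Metasploit": "TOOL",
    "ransomware": "THREAT", "phishing": "THREAT",
    "SQLi": "TECHNIQUE", "XSS": "TECHNIQUE",
    "GDPR": "REGULATION", "NIST": "FRAMEWORK",
}

def tag_entities(text: str) -> list:
    """Return (surface form, label) pairs found in `text`."""
    return [(surface, label) for surface, label in GAZETTEER.items()
            if surface in text]

print(tag_entities("The SOC Analyst ran Splunk to triage a phishing wave."))
# [('SOC Analyst', 'ROLE'), ('Splunk', 'TOOL'), ('phishing', 'THREAT')]
```

    A real NER model generalizes beyond a fixed word list, which is exactly what the transformer approach described below buys you.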

    Built for:
    - Threat intel parsing
    - Security talent matching
    - Skills inventory extraction
    - Compliance doc analysis

    The tech:
    - RoBERTa transformer, domain-adapted on 40K security texts
    - spaCy pipeline for easy integration
    - 69% F1 score (and improving)
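
    For context on the 69% figure: NER is usually scored with strict entity-level F1, where a prediction counts only if both the span and the label match the gold annotation exactly. A minimal sketch of that metric, with made-up example annotations:

```python
# Strict entity-level F1: an entity is correct only if its span
# (start, end) AND its label both match the gold annotation.
def entity_f1(gold: set, pred: set) -> float:
    """gold/pred: sets of (start, end, label) tuples."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)                      # exact span+label matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 4, "CERT"), (10, 16, "TOOL"), (20, 23, "THREAT")}
pred = {(0, 4, "CERT"), (10, 16, "ROLE")}      # one hit, one label error, one miss
print(round(entity_f1(gold, pred), 2))         # precision 0.5, recall 1/3 -> 0.4
```

    The strictness is why scores in the 60s-70s are common for fine-grained, many-type schemas: a correct span with the wrong label earns no credit.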

    Where I need help:
    - More annotated security text (CVs, job posts, threat reports)
    - Edge cases the model misses
    - Ideas for entity types I haven't covered

    Model: huggingface.co/pki/cybersec-ne

      12 ★ 6 ↺
      planetscape boosted

      Anthony » 🌐
      @abucci@buc.ci

      This misguided trend has resulted, in our opinion, in an unfortunate state of affairs: an insistence on building NLP systems using β€˜large language models’ (LLM) that require massive computing power in a futile attempt at trying to approximate the infinite object we call natural language by trying to memorize massive amounts of data. In our opinion this pseudo-scientific method is not only a waste of time and resources, but it is corrupting a generation of young scientists by luring them into thinking that language is just data – a path that will only lead to disappointments and, worse yet, to hampering any real progress in natural language understanding (NLU). Instead, we argue that it is time to re-think our approach to NLU work since we are convinced that the β€˜big data’ approach to NLU is not only psychologically, cognitively, and even computationally implausible, but, and as we will show here, this blind data-driven approach to NLU is also theoretically and technically flawed.
      From Machine Learning Won't Solve Natural Language Understanding, https://thegradient.pub/machine-learning-wont-solve-the-natural-language-understanding-challenge/


        2 ★ 1 ↺

        Anthony » 🌐
        @abucci@buc.ci

        If I had the time, energy, and education to pull it off, I'd do some scholarship and writing elaborating on this juxtaposition:

        - Statistics, as a field of study, gained significant energy and support from eugenicists with the purpose of "scientizing" their prejudices. Some of the major early thinkers in modern statistics, like Galton, Pearson, and Fisher, were eugenicists out loud; see https://nautil.us/how-eugenics-shaped-statistics-238014/
        - Large language models and diffusion models rely on certain kinds of statistical methods, but discard any notion of confidence interval or validation that's grounded in reality. For instance, the LLM inside GPT outputs a probability distribution over the tokens (words) that could follow the input prompt. However, there is no way to even make sense of a probability distribution like this in real-world terms, let alone measure anything about how well it matches reality. See for instance https://aclanthology.org/2020.acl-main.463.pdf and Michael Reddy's The conduit metaphor: A case of frame conflict in our language about language
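
        To make the second point concrete: mechanically, a language model's final layer turns raw scores (logits) into a next-token distribution via softmax. The sketch below is generic, not GPT's actual internals, and the vocabulary and logits are invented for illustration; note that nothing in the construction ties the resulting probabilities to anything measurable about the world.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["data", "language", "noise", "meaning"]   # hypothetical token choices
logits = [2.0, 3.5, 0.1, 1.2]                      # hypothetical model scores

probs = softmax(logits)
assert abs(sum(probs) - 1.0) < 1e-9    # a valid distribution, by construction...
# ...but "valid" only means it sums to 1; there is no external quantity
# against which these probabilities could be validated or calibrated.
for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.3f}")
```
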

        Early on in this latest AI hype cycle I wrote a note to myself that this style of AI is necessarily biased. In other words, the bias coming out isn't primarily a function of biased input data (though of course that's a problem too); that would be a kind of contingent bias that could be addressed. Rather, the bias these systems exhibit is a function of how they are structured at their core, and no amount of data curation can overcome it. I can't prove this, so let's call it a hypothesis, but I believe it.