buc.ci is a Fediverse instance that uses the ActivityPub protocol. In other words, users at this host can communicate with people all around the world who use software such as Mastodon, Pleroma, or Friendica.
This server runs the snac software and there is no automatic sign-up process.
⚡⚡⚡ Lightning Talk! ⚡⚡⚡
🪦 WHAT IS THE DARK WEB TALKING ABOUT? - DARK JARGON DETECTION AND IDENTIFICATION - Laura Bernardy 🕵️‍♀️
The dark web hides in code, and its language is built to confuse. In this talk, Laura Bernardy shows how NLP can decode the slang, jargon, and encrypted phrases used by cybercriminals.
Laura Bernardy https://lu.linkedin.com/in/laura-bernardy-a95315177 is a PhD candidate at SnT Luxembourg, researching dark web content and cyber threat intelligence using natural language processing. She holds a master's in computational linguistics and has worked on low-resource language NLP. Her work combines linguistics, cybersecurity, and AI to decode what's being said and who's saying it.
📅 Conference Dates: 6–8 May 2026 | 09:00–18:00
📍 14, Porte de France, Esch-sur-Alzette, Luxembourg
🎟️ Tickets: https://2026.bsides.lu/tickets/
🔗 Schedule Link: https://pretalx.com/bsidesluxembourg-2026/schedule/
#BSidesLuxembourg #DarkWeb #NLP #CyberThreatIntelligence #OSINT #Linguistics
Built a cybersecurity NER model. 13 entity types. 1,500+ security entities. It's on HuggingFace.
Spent months extracting and annotating cybersecurity entities from real job postings, threat reports, and compliance docs. Turning it into a tool anyone can use.
What it extracts:
- Security roles (CISO, SOC Analyst, Pen Tester)
- Certifications (CISSP, OSCP, CEH)
- Tools (Splunk, CrowdStrike, Metasploit)
- Threats (APT, ransomware, phishing)
- Attack techniques (SQLi, XSS, RCE)
- CVEs, frameworks (MITRE ATT&CK, NIST), regulations (GDPR, PCI-DSS)
- Technical skills, acronyms, compliance terms
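To give a feel for the output, here's a minimal rule-based sketch in pure Python. This is my own illustration, not the released spaCy/RoBERTa pipeline: the real model learns entities statistically, and the label names and patterns below are invented for the example.

```python
import re

# Toy stand-in for the NER model: hand-written regexes covering a few
# of the 13 entity types, just to show the span/label output shape.
PATTERNS = {
    "CERTIFICATION": r"\b(?:CISSP|OSCP|CEH)\b",
    "ATTACK_TECHNIQUE": r"\b(?:SQLi|XSS|RCE)\b",
    "CVE": r"\bCVE-\d{4}-\d{4,7}\b",
}

def extract_entities(text):
    """Return (text, label, start, end) tuples, sorted by position."""
    hits = [
        (m.group(), label, m.start(), m.end())
        for label, pattern in PATTERNS.items()
        for m in re.finditer(pattern, text)
    ]
    return sorted(hits, key=lambda h: h[2])

print(extract_entities("OSCP holder found RCE via CVE-2024-3400."))
```

The trained model replaces the regex table with contextual predictions, but consumes and produces the same kind of text-in, labeled-spans-out interface.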
Built for:
- Threat intel parsing
- Security talent matching
- Skills inventory extraction
- Compliance doc analysis
The tech:
- RoBERTa transformer, domain-adapted on 40K security texts
- spaCy pipeline for easy integration
- 69% F1 score (and improving)
Where I need help:
- More annotated security text (CVs, job posts, threat reports)
- Edge cases the model misses
- Ideas for entity types I haven't covered
This misguided trend has resulted, in our opinion, in an unfortunate state of affairs: an insistence on building NLP systems using "large language models" (LLM) that require massive computing power in a futile attempt at trying to approximate the infinite object we call natural language by trying to memorize massive amounts of data. In our opinion this pseudo-scientific method is not only a waste of time and resources, but it is corrupting a generation of young scientists by luring them into thinking that language is just data – a path that will only lead to disappointments and, worse yet, to hampering any real progress in natural language understanding (NLU). Instead, we argue that it is time to re-think our approach to NLU work since we are convinced that the "big data" approach to NLU is not only psychologically, cognitively, and even computationally implausible, but, and as we will show here, this blind data-driven approach to NLU is also theoretically and technically flawed.
From "Machine Learning Won't Solve Natural Language Understanding", https://thegradient.pub/machine-learning-wont-solve-the-natural-language-understanding-challenge/
#AI #GenAI #GenerativeAI #LLM #LLMs #NLP #NLU #GPT #ChatGPT #Claude #Gemini #LLAMA
- Statistics, as a field of study, gained significant energy and support from eugenicists seeking to "scientize" their prejudices. Some of the major early thinkers in modern statistics, like Galton, Pearson, and Fisher, were outspoken eugenicists; see https://nautil.us/how-eugenics-shaped-statistics-238014/
- Large language models and diffusion models rely on certain kinds of statistical methods, but discard any notion of confidence interval or validation that's grounded in reality. For instance, the LLM inside GPT outputs a probability distribution over the tokens (words) that could follow the input prompt. However, there is no way to even make sense of a probability distribution like this in real-world terms, let alone measure anything about how well it matches reality. See for instance https://aclanthology.org/2020.acl-main.463.pdf and Michael Reddy's "The Conduit Metaphor: A Case of Frame Conflict in Our Language about Language".
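To make "probability distribution over the tokens" concrete, here is a toy softmax over made-up logits; a real LLM does the same thing over tens of thousands of vocabulary tokens at every generation step:

```python
import math

# A model's final layer emits one raw score (logit) per vocabulary token;
# softmax turns those scores into probabilities that sum to one.
# All numbers below are invented for illustration.
logits = {"password": 2.1, "exploit": 1.3, "banana": -0.5}

z = sum(math.exp(v) for v in logits.values())
next_token_probs = {tok: math.exp(v) / z for tok, v in logits.items()}

assert abs(sum(next_token_probs.values()) - 1.0) < 1e-9  # valid distribution
print(next_token_probs)
```

The numbers sum to one by construction, which is exactly the point above: the output is well-formed as mathematics regardless of whether it corresponds to anything measurable in the world.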
Early on in this latest AI hype cycle I wrote a note to myself that this style of AI is necessarily biased. In other words, the bias coming out isn't primarily a function of biased input data (though of course that's a problem too). That'd be a kind of contingent bias that could be addressed. Rather, the bias these systems exhibit is a function of how the things are structured at their core, and no amount of data curating can overcome it. I can't prove this, so let's call it a hypothesis, but I believe it.
#AI #GenAI #GenerativeAI #ChatGPT #GPT #Gemini #Claude #Llama #StableDiffusion #Midjourney #DallE #LLM #DiffusionModel #linguistics #NLP