A robust Selenium-based web scraper designed to collect news headlines from DuckDuckGo. This tool was originally developed for the study "Apocalypse now or later? Nuclear war risk perceptions mirroring media coverage and emotional tone shifts in Italian news" (Judgment and Decision Making, 2024).
- Clone the repository:
  git clone <repository_url>
  cd DuckDuckSelenium
- Install dependencies: Ensure you have Python 3.8+ and Google Chrome installed.
  pip install -r requirements.txt
- Configure Input Files: The scraper relies on three text files in the root directory:
  - Media.txt: List of news websites to search (e.g., repubblica.it).
  - Keywords.txt: Search terms (e.g., Ucraina AND guerra).
  - Date.txt: Date range for the search (format: YYYY-MM-DD to YYYY-MM-DD).
- Run the Scraper:
  python main.py
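
For orientation, the three input files described in the setup steps could be loaded along the following lines. This is a minimal sketch, not the code in main.py: the helper names `read_list` and `parse_date_range` are hypothetical, and it assumes one entry per line in Media.txt and Keywords.txt and a single `YYYY-MM-DD to YYYY-MM-DD` line in Date.txt.

```python
from datetime import date
from pathlib import Path


def read_list(path):
    """Return the non-empty, stripped lines of a text file (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_date_range(text):
    """Parse 'YYYY-MM-DD to YYYY-MM-DD' into a (start, end) pair of dates.

    Assumption: the two dates are separated by the word 'to', with or
    without surrounding spaces.
    """
    start_text, end_text = (part.strip() for part in text.split("to"))
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end


if __name__ == "__main__":
    # File names come from the README; their exact layout is assumed.
    media = read_list("Media.txt")
    keywords = read_list("Keywords.txt")
    start, end = parse_date_range(read_list("Date.txt")[0])
    print(f"{len(media)} outlets, {len(keywords)} keywords, {start} to {end}")
```

Validating the date range up front means a typo in Date.txt fails immediately rather than after a browser session has started.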
The script performs two main phases:
- Search Scraping: Queries DuckDuckGo for each media outlet and keyword, saving results to output/search_results.csv.
- Article Scraping: Visits the collected URLs to extract the full headline (H1), saving to output/articles_scraped.csv.
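
To illustrate the search phase, the sketch below pairs every outlet with every keyword and builds one DuckDuckGo search URL per pair. It assumes the query restricts results with DuckDuckGo's `site:` operator; whether main.py builds its queries this way is not confirmed by this README, and the function names are hypothetical. Date filtering is deliberately left out.

```python
from itertools import product
from urllib.parse import urlencode


def build_query(site, keyword):
    """Restrict a keyword search to one outlet via the site: operator."""
    return f"site:{site} {keyword}"


def build_search_urls(sites, keywords):
    """Yield (site, keyword, url) for every outlet/keyword combination."""
    for site, keyword in product(sites, keywords):
        # urlencode percent-encodes the colon and turns spaces into '+'
        params = urlencode({"q": build_query(site, keyword)})
        yield site, keyword, f"https://duckduckgo.com/?{params}"


if __name__ == "__main__":
    for site, keyword, url in build_search_urls(
        ["repubblica.it"], ["Ucraina AND guerra"]
    ):
        print(site, keyword, url)
```

Each generated URL could then be passed to Selenium's `driver.get(url)` before the result list is parsed.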
The scraper produces two output files:
- output/search_results.csv: Contains raw search results including URL, date, and snippet.
- output/articles_scraped.csv: Contains the final dataset with the extracted article titles.
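
The article titles in the final dataset come from each page's H1 element. As a rough illustration of that extraction step, the sketch below pulls the first `<h1>` out of an HTML string using Python's standard `html.parser` rather than Selenium's element lookup, so it can run without a browser. The names `FirstH1Parser` and `extract_headline` are hypothetical and do not come from the repository.

```python
from html.parser import HTMLParser


class FirstH1Parser(HTMLParser):
    """Collect the text of the first <h1> element in a document."""

    def __init__(self):
        super().__init__()
        self._inside = False  # currently between <h1> and </h1>
        self._done = False    # first <h1> already closed
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self._done:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    @property
    def headline(self):
        # Collapse whitespace runs left behind by nested markup
        return " ".join("".join(self._parts).split())


def extract_headline(html):
    """Return the first H1's text, or None if the page has no H1."""
    parser = FirstH1Parser()
    parser.feed(html)
    parser.close()
    return parser.headline or None
```

With Selenium, the HTML string would typically come from `driver.page_source`; calling `driver.find_element(By.TAG_NAME, "h1").text` is the more direct browser-side equivalent.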
Note: The tool supports incremental saving and can resume if interrupted.
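
One common way to get incremental saving and resume behaviour is to append each row to the CSV as soon as it is scraped and, on startup, skip any URL already present in the file. The sketch below shows that pattern under stated assumptions: the column names `url` and `title`, and the functions `load_done_urls`, `append_row`, and `scrape_all`, are hypothetical and may differ from what main.py actually does.

```python
import csv
from pathlib import Path

FIELDS = ["url", "title"]  # assumed column names, for illustration only


def load_done_urls(path):
    """Return the set of URLs already saved in the output CSV."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open(newline="", encoding="utf-8") as handle:
        return {row["url"] for row in csv.DictReader(handle)}


def append_row(path, row):
    """Append one row, writing the header only when the file is new or empty."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def scrape_all(urls, out_path, fetch_title):
    """Scrape each URL not yet in out_path, saving after every page."""
    done = load_done_urls(out_path)
    for url in urls:
        if url in done:
            continue  # already saved by an earlier, interrupted run
        append_row(out_path, {"url": url, "title": fetch_title(url)})
        done.add(url)
```

Because every row is flushed to disk before the next page is requested, an interruption loses at most the page in progress.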
If you use this tool in your research, please cite:
Lauriola, M., Di Cicco, G., & Savadori, L. (2024). Apocalypse now or later? Nuclear war risk perceptions mirroring media coverage and emotional tone shifts in Italian news. Judgment and Decision Making, 19(e7), 1–25. doi:10.1017/jdm.2024.2
Complete study materials are available at: https://osf.io/pduwq/overview
This tool is for educational and research purposes. Please ensure compliance with the Terms of Service of the websites you scrape.
MIT License