🐍 Zhihu Crawler - A tool for crawling Zhihu data

A command-line tool for crawling data from Zhihu, with a friendly CLI and support for several content types.

✨ Features

🔗 Crawl from a specific link: crawl directly from a Zhihu URL
🔍 Keyword search: search for a keyword and crawl the matching content
👤 User crawl: fetch a user's profile information and activity
🔥 Hot list crawl: fetch data from the hot ranking list
📚 Multiple content types:
- Stories/Novels
- Articles
- Questions & Answers
- Videos
💾 Data export: results are saved automatically to JSON/TXT files
⚙️ Flexible configuration: custom proxy and cookie support

🚀 Installation

System requirements

Python 3.13+
Node.js (for JavaScript execution)
Git

Install from source

# Clone the repository
git clone https://github.com/thucpru/zhihu-crawl.git
cd zhihu-crawl

# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# Windows:
.\venv\Scripts\activate
# Linux/Mac:
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

📖 Usage

Run the program

python zhihu_crawler_cli.py

Main menu (shown as the CLI prints it, in Vietnamese)

============================================================
🐍 ZHIHU CRAWLER CLI - CÔNG CỤ CRAWL ZHIHU
============================================================
📚 Hỗ trợ crawl: Truyện, Bài viết, Câu hỏi, Video
🔍 Tìm kiếm theo từ khóa, crawl từ link cụ thể
💾 Tự động lưu dữ liệu ra file JSON/TXT
============================================================

📋 MENU CHÍNH:
1. 🔗 Crawl từ link cụ thể
2. 🔍 Tìm kiếm và crawl theo từ khóa
3. 👤 Crawl thông tin người dùng
4. 🔥 Crawl hot list (bảng xếp hạng)
5. ⚙️  Cài đặt proxy/cookie
6. 📊 Xem thống kê crawl
0. 🚪 Thoát

Usage examples

1. Crawl from a specific link

# Enter any Zhihu URL
https://www.zhihu.com/question/123456789

2. Search by keyword

# Enter the keyword to search for
python programming

3. Crawl user information

# Enter a username or user ID
excited-vczh
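
The three kinds of input above (a URL, a keyword, a user ID) each map to a different crawl path. As a rough illustration of how a pasted link might be sorted by content type before crawling, here is a minimal sketch. The function `classify_zhihu_url` and the URL patterns are assumptions made for this example; they are not taken from the crawler's source and may not match how it actually routes links.

```python
import re
from urllib.parse import urlparse

# Hypothetical path patterns for www.zhihu.com; adjust to what the crawler accepts.
_PATTERNS = [
    ("answer", re.compile(r"^/question/(\d+)/answer/(\d+)/?$")),
    ("question", re.compile(r"^/question/(\d+)/?$")),
    ("user", re.compile(r"^/people/([\w-]+)/?$")),
    ("video", re.compile(r"^/zvideo/(\d+)/?$")),
]


def classify_zhihu_url(url: str) -> tuple[str, str]:
    """Return (kind, identifier) for a Zhihu URL, or raise ValueError."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host == "zhuanlan.zhihu.com":
        # Assumed article (column post) URL shape: /p/<id>
        match = re.match(r"^/p/(\d+)/?$", parsed.path)
        if match:
            return "article", match.group(1)
    elif host in ("www.zhihu.com", "zhihu.com"):
        for kind, pattern in _PATTERNS:
            match = pattern.match(parsed.path)
            if match:
                # The last captured group is the most specific identifier.
                return kind, match.group(match.lastindex)
    raise ValueError(f"Unrecognized Zhihu URL: {url}")
```

For instance, `classify_zhihu_url("https://www.zhihu.com/question/123456789")` yields the kind `"question"` together with the numeric ID, which a caller could then hand to the matching crawl routine.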

📦 Dependencies

gevent: Async networking
requests: HTTP requests
loguru: Logging
beautifulsoup4: HTML parsing
lxml: XML/HTML processing
requests-html: JavaScript rendering
PyExecJS: JavaScript execution
my_fake_useragent: User agent rotation

🔧 Configuration

Proxy

# Configure the proxy in the settings menu
proxy = {
    'http': 'http://proxy:port',
    'https': 'https://proxy:port'
}

Cookie

# Add the d_c0 cookie to improve the success rate
d_c0 = "your_cookie_value"
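
To make the two settings concrete, the sketch below shows how a proxy mapping and a `d_c0` cookie could be attached to outgoing requests. It uses only the standard library's `urllib.request` rather than `requests`, so it illustrates the idea and is not the crawler's own code. The helper name `make_opener` and the proxy address are placeholders chosen for this example.

```python
import urllib.request


def make_opener(proxy: dict[str, str], d_c0: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes through `proxy` and sends the d_c0 cookie."""
    handler = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(handler)
    # These headers are sent with every request made through this opener.
    opener.addheaders = [("Cookie", f"d_c0={d_c0}")]
    return opener


proxy = {
    "http": "http://127.0.0.1:8125",   # placeholder address
    "https": "http://127.0.0.1:8125",  # HTTPS traffic is often tunneled through an HTTP proxy
}
opener = make_opener(proxy, "your_cookie_value")
# opener.open("https://www.zhihu.com/") would now go through the proxy.
```

Building the opener does not open any connection, so the settings can be prepared up front and reused for every request.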

📁 Project structure

zhihu-crawl/
├── zhihu_crawler/          # Core crawler modules
│   ├── __init__.py
│   ├── zhihu_scraper.py    # Main scraper class
│   ├── page_iterators.py   # Page iteration logic
│   ├── extractors.py       # Data extraction
│   ├── constants.py        # Constants
│   ├── exceptions.py       # Custom exceptions
│   └── zhihu_types.py      # Type definitions
├── utils/                  # Utility functions
│   ├── __init__.py
│   └── zhihu_utils.py      # Helper functions
├── common/                 # Common modules
│   ├── __init__.py
│   ├── encrypt.py          # Encryption utilities
│   └── encrypt.js          # JavaScript encryption
├── docs/                   # Documentation
├── run/                    # Runtime data
├── zhihu_crawler_cli.py    # CLI interface
├── requirements.txt        # Dependencies
├── .gitignore              # Git ignore rules
└── README.md               # This file

💻 Programmatic API usage

```python
# Apply gevent monkey patching first so that blocking I/O becomes cooperative
from gevent import monkey
monkey.patch_all()
from zhihu_crawler import ZhiHuScraper

if __name__ == '__main__':
    # Create the scraper
    scraper = ZhiHuScraper()

    # Set a proxy (optional)
    scraper.set_proxy({
        'http': 'http://127.0.0.1:8125',
        'https': 'http://127.0.0.1:8125'
    })

    # Set the cookie (recommended)
    scraper.set_cookie({
        'd_c0': 'your_d_c0_cookie_value'
    })

    # Search by keyword and crawl the results
    for result in scraper.search_crawl(keyword='python programming', nums=10):
        print(result)

    # Crawl user information
    for user_info in scraper.user_crawler(
        user_id='excited-vczh',
        answer_nums=20,
        comment_nums=10
    ):
        print(user_info)

    # Crawl hot questions
    for hot_question in scraper.hot_questions_crawl(
        question_nums=10,
        drill_down_nums=5
    ):
        print(hot_question)
```
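
The example above only prints each result. To keep results on disk when using the API, one option is to write them out as JSON Lines, as in the sketch below. The helper `save_results` is hypothetical and not part of `zhihu_crawler`. It assumes each yielded result is a dict or another structure that `json` can serialize; anything else is stored as its string form.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def save_results(results: Iterable[Any], path: str) -> int:
    """Write each result as one JSON line and return how many were written."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out.open("w", encoding="utf-8") as fh:
        for item in results:
            # ensure_ascii=False keeps Chinese text readable in the file;
            # default=str stringifies values json cannot encode (e.g. datetimes).
            fh.write(json.dumps(item, ensure_ascii=False, default=str) + "\n")
            count += 1
    return count
```

Because the function accepts any iterable, a generator such as the one returned by `scraper.search_crawl(...)` could be passed in directly, and results would be written as they arrive.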

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is distributed under the MIT license. See the LICENSE file for details.

⚠️ Important notes

Use this tool for learning and research purposes only
Comply with Zhihu's Terms of Service
Do not spam or send too many requests in a short period
Add a delay between requests
You are responsible for how you use this tool
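
One simple way to follow the delay recommendation when scripting several crawls is to pause for a random interval between work items (keywords, URLs, user IDs). The sketch below is an illustration only: `throttled` is a hypothetical helper, and the default delays are arbitrary values, not limits published by Zhihu or used by the crawler.

```python
import random
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield each item, then pause for a random interval before the next one."""
    for item in items:
        yield item
        # Runs only when the caller asks for the next item, so the pause
        # falls between two consecutive pieces of work.
        sleep(random.uniform(min_delay, max_delay))
```

A loop such as `for keyword in throttled(["python", "rust"]): ...` would then wait between one and three seconds after handling each keyword. The `sleep` parameter exists so the pause can be replaced in tests.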

🔧 Troubleshooting

Common errors

ModuleNotFoundError: No module named 'execjs'
```
pip install PyExecJS
```
ModuleNotFoundError: No module named 'my_fake_useragent'
```
pip install my_fake_useragent
```
lxml build error on Windows
```
pip install "lxml>=4.9.3"
```
JavaScript execution error
- Make sure Node.js is installed
- Check that the file common/encrypt.js exists

📞 Contact

GitHub: @thucpru
Email: thucpru@gmail.com
Repository: zhihu-crawl

⭐ If you find this project useful, please give it a star! ⭐

📋 Changelog

v1.0.0: Initial release with the CLI
Added Python 3.13+ support
Updated dependencies and fixed environment errors
Added .gitignore and full documentation
