Special thanks to c3h3 for his great course materials and to tzangms for his great talk at PyCon APAC 2014.

Introduction

PTT is the largest terminal-based bulletin board system (BBS) based in Taiwan. The Beauty board is one of its most popular boards, and many users post images there.

This application tries to crawl all available image URLs from the Beauty board and stores them in a database. You can then build your own projects on top of this dataset, for example a website that displays the images.

Prerequisites

Download and install MongoDB.
Install the packages listed in requirements.txt:

$ pip install -r requirements.txt

Usage

There are two files: beauty_crawler.py and beauty_query.py. The first creates or updates the dataset; the second queries the dataset and outputs the result as HTML.

Create dataset

You can crawl your own dataset:

$ python beauty_crawler.py

or restore the existing dataset from the dump/ directory:

$ mongorestore

Update dataset

In the main section of beauty_crawler.py, change the update argument of save_all_articles_to_db(update=False) to True, then run the script:

$ python beauty_crawler.py

This updates the dataset with the latest articles posted within the past month.
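
The one-month window can be pictured as a simple date cutoff. The following is a minimal sketch, not the crawler's actual code: the helper name `is_recent` and the 30-day approximation of "one month" are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_recent(post_date, now, days=30):
    """Return True if post_date falls within the last `days` days of `now`.

    Hypothetical helper: the 30-day window approximates "one month".
    """
    cutoff = now - timedelta(days=days)
    return cutoff <= post_date <= now


now = datetime(2014, 6, 30)
print(is_recent(datetime(2014, 6, 15), now))  # inside the window
print(is_recent(datetime(2014, 4, 1), now))   # older than the window
```

An update pass built this way would only need to revisit articles for which the check is true and could stop paging once it reaches older posts.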

Query dataset

Running beauty_query.py outputs the query result in HTML format. You can redirect the output to a file and open that file in a browser:

$ python beauty_query.py > test.html
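
To give a rough idea of what turning query results into HTML can look like, here is a minimal sketch. It is not the code in beauty_query.py: the function name `render_html`, the page layout, and the assumption that `pic` holds a list of URL strings are all illustrative, and the sample URLs are placeholders.

```python
from html import escape


def render_html(articles):
    """Build a minimal HTML page showing each article's title and images."""
    parts = ["<html><body>"]
    for article in articles:
        # Link the title back to the original article.
        parts.append('<h2><a href="{}">{}</a></h2>'.format(
            escape(article["url"], quote=True), escape(article["title"])))
        # Assumption: "pic" is a list of image URL strings.
        for pic in article["pic"]:
            parts.append('<img src="{}">'.format(escape(pic, quote=True)))
    parts.append("</body></html>")
    return "\n".join(parts)


sample = [{
    "title": "Example post",
    "url": "https://example.com/article.html",
    "pic": ["https://example.com/a.jpg"],
}]
print(render_html(sample))
```

Escaping titles and URLs keeps user-supplied text from breaking the generated markup.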

Dataset Parameter

date: post date of the article
title: title of the article
author: author of the article
push: push (recommendation) count of the article
pic: image URLs found in the article
url: URL of the article
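
As an illustration of how these fields fit together, the sketch below builds one made-up document and filters on it in plain Python. The field names come from the list above; the value types (a datetime for `date`, an integer for `push`, a list of strings for `pic`), the sample values, and the helper `popular_pics` are assumptions, not something taken from the stored dataset.

```python
from datetime import datetime

# Hypothetical document: field names follow the list above, values are made up.
article = {
    "date": datetime(2014, 6, 15),
    "title": "Example post",
    "author": "example_user",
    "push": 42,
    "pic": ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "url": "https://example.com/article.html",
}


def popular_pics(articles, min_push):
    """Collect image URLs from articles with at least min_push pushes."""
    return [pic for a in articles if a["push"] >= min_push for pic in a["pic"]]


print(popular_pics([article], min_push=10))
```

The same kind of filter could be expressed as a database query; it is shown in plain Python here so it runs without a MongoDB instance.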
