Think carefully before collecting data with the following scripts. This is an old version that I used to collect data from Foody. Most of the APIs have probably been upgraded or changed since then, so you may need to customize the scripts to be compatible with the new APIs.
Once again: this is unauthorized data collection. "Open at Your Own Risk."
Crawl https://www.foody.vn/ using NodeJS with dependencies request-promise, cheerio, async and more.
Crawled collections:
- Provinces, Districts in `db/provinces.json`
- Cuisines in `db/cuisines.json`
- Categories in `db/categories.json`
- Facilities in `db/facilities.json`
- Restaurants: all restaurants in each district, located in the directory `db/restaurant` as chunk files in `.jl` (JSON Lines) format. Together these files are larger than 50 MB, so they are ignored in git. You can unzip `db/restaurant.zip` to get them.
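Since the restaurant chunks use the JSON Lines format (one JSON document per line), they can be read without loading a whole file into memory. The following is a minimal sketch using only Node's built-in `fs` and `readline` modules. The helper names `parseJsonLines` and `readJsonLinesFile` are hypothetical and not part of this repository, and the fields of each restaurant record are not documented here, so the sketch treats records as opaque objects.

```javascript
const fs = require('fs');
const readline = require('readline');

// Parse JSON Lines text: one JSON document per non-empty line.
// (Hypothetical helper, not part of this repository.)
function parseJsonLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Stream a .jl file line by line and pass each record to `onRecord`.
// Resolves with the number of records read.
async function readJsonLinesFile(filePath, onRecord) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath, { encoding: 'utf8' }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  let count = 0;
  for await (const line of rl) {
    if (line.trim() === '') continue; // skip blank lines
    onRecord(JSON.parse(line));
    count += 1;
  }
  return count;
}
```

For example, after unzipping, `readJsonLinesFile('db/restaurant/in_217_10.jl', (r) => console.log(r))` would print every record in that chunk.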
yarn
# or `npm install`

Start crawling all simple collections: province, district, cuisine, category, facility

node src/getMetadata.js

Before crawling all restaurants, log in to the site, because the APIs used for restaurants are blocked for unauthenticated (not logged-in) users.
- Create an account on https://www.foody.vn
- Create file `.env` similar to `.env.example`
- Change `FOODY_USERNAME` and `FOODY_PASSWORD` to your account
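The steps above put the credentials in `.env`. As a rough illustration of how such a file can be turned into a username and password, here is a minimal sketch. It is an assumption, not the repository's actual loading code (which may well rely on a package such as `dotenv`); `parseEnv` and `requireCredentials` are hypothetical names, and the parser only handles plain `KEY=VALUE` lines without quoting.

```javascript
// Minimal parser for KEY=VALUE lines.
// Assumption: no quoting, no multi-line values; lines starting with # are comments.
function parseEnv(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a KEY=VALUE line
    // Split on the first "=" only, so values may contain "=".
    vars[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return vars;
}

// Fail early, before any request is sent, if a credential is missing.
function requireCredentials(vars) {
  const { FOODY_USERNAME, FOODY_PASSWORD } = vars;
  if (!FOODY_USERNAME || !FOODY_PASSWORD) {
    throw new Error('FOODY_USERNAME and FOODY_PASSWORD must be set in .env');
  }
  return { username: FOODY_USERNAME, password: FOODY_PASSWORD };
}
```

Checking the credentials up front gives a clear error message instead of a failed login halfway through a crawl.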
Start crawling process
node src/getRestaurants.js

Output
https://www.foody.vn:443/account/validatetoken?token=C6E96F82-9F8C-4DBA-91AF-622E22D18627&isremember=True
done
===================================
Province 217
District 1
District 2
District 4
District 5
District 6
District 7
District 8
District 9
Done district 8
District 10
Done district 6
District 11
Done district 4
District 12
Done district 5
District 13
...
District 824
Done district 824
Done district 126
Done district 124
Done district 123
Done district 125
Done district 128
Done district 122
Done district 127
Finish province 230
===================================
Province 265
Finish province 265
OK
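In the log above, eight districts start before the first one reports "Done", and a new district starts each time one finishes. That pattern is consistent with districts being crawled concurrently under a fixed cap. The sketch below shows this technique in plain JavaScript. The cap of 8 is only inferred from the log, `mapWithLimit` is a hypothetical helper, and the real crawler may instead use a function from the `async` package listed in the dependencies.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the order of `items`, regardless of completion order.
// (Hypothetical helper, not part of this repository.)
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item that has not been started yet

  // Each runner repeatedly takes the next unstarted item until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i += 1) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

A call such as `mapWithLimit(districtIds, 8, crawlDistrict)` would then produce the same kind of interleaved "District" and "Done district" lines, where `districtIds` and `crawlDistrict` stand in for whatever the crawler actually uses.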
├── cookies.json <=== generated in crawling process, ignored in git
├── db
│   ├── categories.json
│   ├── cuisines.json
│   ├── facilities.json
│   ├── .gitignore
│   ├── provinces.json
│   ├── restaurant <=== ignored in git
│   │   ├── in_217_10.jl
│   │   ├── in_217_11.jl
│   │   ├── in_273_305.jl
│   │   ├── ...
│   │   ├── in_273_306.jl
│   │   ├── in_273_307.jl
│   │   ├── in_273_308.jl
│   │   └── in_273_309.jl
│   └── restaurant.zip <=== zipping directory `restaurant`
├── .editorconfig
├── .env <=== containing foody account, ignored in git
├── .env.example
├── .eslintignore
├── .eslintrc
├── .gitignore
├── package.json
├── README.md
├── src
│   ├── crawlers
│   │   ├── metadata.js <=== crawling simple collections
│   │   └── restaurant.js <=== crawling all restaurants
│   ├── getMetadata.js <=== file to execute the crawling process
│   ├── getRestaurants.js <=== file to execute the crawling process
│   └── modules
│       ├── request.js <=== managing global request and cookies jar
│       └── visit.js <=== visiting home page to get cookies and login
└── yarn.lock
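The chunk files in `db/restaurant` appear to be named `in_<province>_<district>.jl`; for instance, `in_217_10.jl` would belong to province 217, which also shows up in the crawl log. This reading of the names is an assumption based on the listing, not something the scripts document. Under that assumption, a small sketch for recovering the two ids from a file name and grouping chunks by province could look like this (`parseChunkName` and `groupByProvince` are hypothetical helpers):

```javascript
const path = require('path');

// Assumed naming scheme: in_<provinceId>_<districtId>.jl
// Returns null for names that do not match it.
function parseChunkName(filePath) {
  const match = /^in_(\d+)_(\d+)\.jl$/.exec(path.basename(filePath));
  if (!match) return null;
  return { provinceId: Number(match[1]), districtId: Number(match[2]) };
}

// Group chunk file names by province id, skipping names that do not match.
function groupByProvince(fileNames) {
  const groups = new Map();
  for (const name of fileNames) {
    const parsed = parseChunkName(name);
    if (!parsed) continue;
    if (!groups.has(parsed.provinceId)) groups.set(parsed.provinceId, []);
    groups.get(parsed.provinceId).push(name);
  }
  return groups;
}
```

Combined with `fs.readdirSync('db/restaurant')`, this gives a quick overview of which provinces have been crawled.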