An easy-to-use web page analyzer (scraper or crawler) with many useful features and properties.
## Installation

```
pip install ezweb
```
## Usage

```python
from ezweb import EzSoup

url = "https://www.techradar.com/reviews/google-pixel-5"
page = EzSoup.from_url(url)

print(page.json_summary)
```

Output:

```json
{
  "url": "https://www.techradar.com/reviews/google-pixel-5",
  "title": "Google Pixel 5 review",
  "description": "The Google Pixel 5 sheds a few features to become a more affordable and compact phone that still takes great photos at a competitive price.",
  "main_image": "https://cdn.mos.cms.futurecdn.net/EicnoxJ3tKYhTRqEauB6RU-1200-80.jpg",
  "main_content": "Two-minute review\nThe Google Pixel 5 represents a strategy change for the tech giant: the phone does", // [And more ...]
  "possible_topics": [
    "Mobile Phones",
    "Reviews"
  ]
}
```

```python
# You can use any of the properties and methods below instead of `a_tags_mp3`
page.a_tags_mp3
```
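For example (a hedged sketch: the return type of `a_tags_mp3` isn't stated here, so treating it as a list of BeautifulSoup `<a>` Tag objects linking to `.mp3` files is an assumption based on the name):

```python
# Assumption: a_tags_mp3 yields BeautifulSoup <a> Tag objects whose
# href points to an .mp3 file (inferred from the attribute name).
for a in page.a_tags_mp3:
    print(a.get("href"))
```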
## Properties

- Returns the `<article>` tag that has the greatest text length.
- Returns a list of `EzSoup` instances built from `self.important_hrefs`, using a `ThreadPoolExecutor` to crawl the children much faster than a normal `for` loop.
- Returns `<a>` tags that contain a header (`<h2>`, `<h3>`), `<a>` tags inside headers, or `<a>` tags inside elements with class `item` or `post`. I call these "important" because they are most likely to point to crawlable, contentful web pages. (A minimal sketch of this heuristic follows this list.)
- Returns possible topic/breadcrumb names of the web page. The values can be unreliable since they aren't generated with NLP methods yet.
- Usually the `<h1>` tag content of a web page is cleaner than the original `<title>` text, so if the `<h1>` or `<h2>` text is similar to the title, it is returned instead of the original title text.
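As an illustration of the first and third heuristics above (longest `<article>` tag, "important" `<a>` tags), here is a minimal sketch using BeautifulSoup directly. It is not ezweb's actual implementation; the function names and the exact selection rules are assumptions based on the descriptions in this list.

```python
from bs4 import BeautifulSoup


def longest_article(soup: BeautifulSoup):
    """Pick the <article> tag with the greatest text length (sketch)."""
    articles = soup.find_all("article")
    if not articles:
        return None
    return max(articles, key=lambda tag: len(tag.get_text(strip=True)))


def important_a_tags(soup: BeautifulSoup):
    """Collect <a> tags that look like links to contentful pages (sketch).

    Roughly follows the description above: anchors that contain an <h2>/<h3>,
    anchors inside headers, or anchors inside elements with class "item" or "post".
    """
    result = []
    for a in soup.find_all("a", href=True):
        has_header_inside = a.find(["h2", "h3"]) is not None
        inside_header = a.find_parent(["h2", "h3"]) is not None
        inside_item_or_post = a.find_parent(class_=["item", "post"]) is not None
        if has_header_inside or inside_header or inside_item_or_post:
            result.append(a)
    return result
```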
## Methods

`from_url(url: str)`

Creates an `EzSoup` instance from the given URL.

`get_important_children_soups(multithread: bool = True, limit: int = None)`

Returns a list of `EzSoup` instances built from `self.important_hrefs`.

Parameters:

- `multithread`: `True` by default; uses a `ThreadPoolExecutor` to crawl the children much faster.
- `limit`: limits how many children will be crawled.
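A short usage sketch for crawling children, based only on the signature documented above; the `limit` of 10, the output filenames, and calling `save_content_summary_json` (listed just below) on each child are illustrative choices, not prescribed by these docs.

```python
from ezweb import EzSoup

page = EzSoup.from_url("https://www.techradar.com/reviews/google-pixel-5")

# Crawl up to 10 "important" child pages concurrently (multithread=True is the default).
children = page.get_important_children_soups(multithread=True, limit=10)

# Each child is itself an EzSoup instance, so the same properties and methods
# are available on it (see the save_content_summary_* methods below).
for index, child in enumerate(children):
    child.save_content_summary_json(f"child_{index}.json")
```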
`save_content_summary_html(path: str = None)`

`save_content_summary_json(path: str = None)`

`save_content_summary_txt(path: str = None)`

_This file was automatically generated via [lazydocs](https://github.com/ml-tooling/lazydocs)._