An easy-to-use web page analyzer (scraper or crawler) with many useful features and properties.
## Installation

```
pip install ezweb
```
## Usage

```python
from ezweb import EzSoup

url = "https://www.techradar.com/reviews/google-pixel-5"
page = EzSoup.from_url(url)

print(page.json_summary)
```

Output:

```json
{
  "url": "https://www.techradar.com/reviews/google-pixel-5",
  "title": "Google Pixel 5 review",
  "description": "The Google Pixel 5 sheds a few features to become a more affordable and compact phone that still takes great photos at a competitive price.",
  "main_image": "https://cdn.mos.cms.futurecdn.net/EicnoxJ3tKYhTRqEauB6RU-1200-80.jpg",
  "main_content": "Two-minute review\nThe Google Pixel 5 represents a strategy change for the tech giant: the phone does", // [And more ...]
  "possible_topics": [
    "Mobile Phones",
    "Reviews"
  ]
}
```

```python
# You can use any of the properties and methods below instead of `a_tags_mp3`
page.a_tags_mp3
```
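For example (a hedged sketch: the return type of `a_tags_mp3` isn't stated here, so treating it as a list of BeautifulSoup `<a>` Tag objects linking to `.mp3` files is an assumption based on the name):

```python
# Assumption: a_tags_mp3 yields BeautifulSoup <a> Tag objects whose
# href points to an .mp3 file (inferred from the attribute name).
for a in page.a_tags_mp3:
    print(a.get("href"))
```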
## Properties

- Returns the `<article>` tag that has the greatest text length.
- Returns a list of `EzSoup` instances built from `self.important_hrefs`, using a `ThreadPoolExecutor` to crawl the children much faster than a normal `for` loop.
- Returns `<a>` tags that contain a header (`<h2>`, `<h3>`), `<a>` tags inside headers, or `<a>` tags inside elements with class `item` or `post`. I call these "important" because they are most likely to point to crawlable, contentful web pages. (A minimal sketch of this heuristic follows this list.)
- Returns possible topic/breadcrumb names of the web page. The values can be unreliable since they aren't generated with NLP methods yet.
- Usually the `<h1>` tag content of a web page is cleaner than the original `<title>` text, so if the `<h1>` or `<h2>` text is similar to the title, it is returned instead of the original title text.
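As an illustration of the first and third heuristics above (longest `<article>` tag, "important" `<a>` tags), here is a minimal sketch using BeautifulSoup directly. It is not ezweb's actual implementation; the function names and the exact selection rules are assumptions based on the descriptions in this list.

```python
from bs4 import BeautifulSoup


def longest_article(soup: BeautifulSoup):
    """Pick the <article> tag with the greatest text length (sketch)."""
    articles = soup.find_all("article")
    if not articles:
        return None
    return max(articles, key=lambda tag: len(tag.get_text(strip=True)))


def important_a_tags(soup: BeautifulSoup):
    """Collect <a> tags that look like links to contentful pages (sketch).

    Roughly follows the description above: anchors that contain an <h2>/<h3>,
    anchors inside headers, or anchors inside elements with class "item" or "post".
    """
    result = []
    for a in soup.find_all("a", href=True):
        has_header_inside = a.find(["h2", "h3"]) is not None
        inside_header = a.find_parent(["h2", "h3"]) is not None
        inside_item_or_post = a.find_parent(class_=["item", "post"]) is not None
        if has_header_inside or inside_header or inside_item_or_post:
            result.append(a)
    return result
```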
## Methods

`from_url(url: str)`

Creates an `EzSoup` instance from the given URL.

`get_important_children_soups(multithread: bool = True, limit: int = None)`

Returns a list of `EzSoup` instances built from `self.important_hrefs`.

Parameters:

- `multithread`: `True` by default; uses a `ThreadPoolExecutor` to crawl the children much faster.
- `limit`: limits how many children will be crawled.
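A short usage sketch for crawling children, based only on the signature documented above; the `limit` of 10, the output filenames, and calling `save_content_summary_json` (listed just below) on each child are illustrative choices, not prescribed by these docs.

```python
from ezweb import EzSoup

page = EzSoup.from_url("https://www.techradar.com/reviews/google-pixel-5")

# Crawl up to 10 "important" child pages concurrently (multithread=True is the default).
children = page.get_important_children_soups(multithread=True, limit=10)

# Each child is itself an EzSoup instance, so the same properties and methods
# are available on it (see the save_content_summary_* methods below).
for index, child in enumerate(children):
    child.save_content_summary_json(f"child_{index}.json")
```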
`save_content_summary_html(path: str = None)`

`save_content_summary_json(path: str = None)`

`save_content_summary_txt(path: str = None)`

_This file was automatically generated via [lazydocs](https://github.com/ml-tooling/lazydocs)._