moehmeni/ezweb

EzWeb

An easy-to-use web page analyzer (scraper or crawler) with many useful features and properties.

Installation

pip install ezweb

Basic Usage

from ezweb import EzSoup

url = "https://www.techradar.com/reviews/google-pixel-5"

page = EzSoup.from_url(url)

print(page.json_summary)

Output:

{
    "url": "https://www.techradar.com/reviews/google-pixel-5",
    "title": "Google Pixel 5 review",
    "description": "The Google Pixel 5 sheds a few features to become a more affordable and compact phone that still takes great photos at a competitive price.",
    "main_image": "https://cdn.mos.cms.futurecdn.net/EicnoxJ3tKYhTRqEauB6RU-1200-80.jpg",
    "main_content": "Two-minute review\nThe Google Pixel 5 represents a strategy change for the tech giant: the phone does", // [And more ...]
    "possible_topics": [
        "Mobile Phones",
        "Reviews"
    ]
}
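The summary can be consumed like any other JSON document. The sketch below assumes that `json_summary` yields a JSON string shaped like the output above; the README does not state the return type, so that is an assumption. To stay runnable without network access, it uses a hard-coded sample string (a trimmed copy of the output above) in place of a live `page.json_summary` value.

```python
import json

# Stand-in for `page.json_summary`; assumed to be a JSON string.
# The values are copied from the sample output in this README.
summary_json = """
{
    "url": "https://www.techradar.com/reviews/google-pixel-5",
    "title": "Google Pixel 5 review",
    "possible_topics": ["Mobile Phones", "Reviews"]
}
"""

summary = json.loads(summary_json)

# Use .get() with a default, since a page may lack some fields.
title = summary.get("title", "")
topics = summary.get("possible_topics", [])

print(f"{title} ({', '.join(topics)})")
```

If a dictionary is preferred over a string, the `summary_dict` property listed below is probably the more direct route.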

Available properties and methods

# You can use any of the properties and methods below instead of `a_tags_mp3`
page.a_tags_mp3

property a_tag_hrefs


property a_tag_texts


property a_tags_mp3


property a_tags_rar


property a_tags_with_href


property article_tag

Returns the article tag with the longest text.


property children

Returns a list of EzSoup instances built from self.important_hrefs. Uses ThreadPoolExecutor to crawl the children much faster than a plain for loop.


property favicon_href


property important_a_tags

Returns a tags that contain a header (h2, h3), a tags inside headers, or elements with the class item or post. These are called important because they most likely point to crawlable, content-rich web pages.
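The heuristic described above can be sketched with the standard library's `html.parser`. This is not the library's implementation: the class `ImportantLinkParser` and the function `find_important_hrefs` are hypothetical names, and the sketch assumes that "class item or post" refers to the class attribute of the a tag itself, which the description leaves open.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h2", "h3"}
IMPORTANT_CLASSES = {"item", "post"}  # assumed to apply to the <a> tag itself


class ImportantLinkParser(HTMLParser):
    """Collects hrefs of <a> tags that look like links to content pages."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._header_depth = 0   # > 0 while inside an <h2> or <h3>
        self._open_links = []    # stack of [href, is_important] for open <a> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HEADER_TAGS:
            self._header_depth += 1
            # A header nested inside an open <a> makes that link important.
            for link in self._open_links:
                link[1] = True
        elif tag == "a":
            classes = set((attrs.get("class") or "").split())
            important = self._header_depth > 0 or bool(classes & IMPORTANT_CLASSES)
            self._open_links.append([attrs.get("href"), important])

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._header_depth > 0:
            self._header_depth -= 1
        elif tag == "a" and self._open_links:
            href, important = self._open_links.pop()
            # Keep each href once, in document order of the closing tag.
            if href and important and href not in self.hrefs:
                self.hrefs.append(href)


def find_important_hrefs(html):
    parser = ImportantLinkParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = """
<h2><a href="/a">Inside a header</a></h2>
<a href="/b"><h3>Header inside the link</h3></a>
<a class="post" href="/c">Link with class post</a>
<a href="/d">Plain navigation link</a>
"""
important = find_important_hrefs(sample)
```

In this sample, the plain link `/d` is skipped because it has no header context and no matching class.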


property important_hrefs


property json_summary


property main_html


property main_image_src


property main_text


property meta_article_modified_time


property meta_article_published_time


property meta_description


property meta_image_src


property possible_topic_names

Returns possible topic/breadcrumb names of the web page. The values can be unreliable, since they are not generated with NLP methods yet.


property summary_dict


property text


property title

The <h1> content of a web page is usually cleaner than the original page <title> text, so if the h1 or h2 text is similar to the title, it is returned instead of the original title text.
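One way to express this rule is a similarity check between the `<title>` text and the heading texts. The sketch below is an illustration only: `pick_title`, the `threshold` value of 0.6, and the use of `difflib.SequenceMatcher` are assumptions, since the README does not say how similarity is measured. The title string with the site suffix is a made-up example.

```python
from difflib import SequenceMatcher


def pick_title(title_text, heading_texts, threshold=0.6):
    """Return the heading most similar to the <title> text, or the
    <title> text itself when no heading is similar enough."""
    best_heading = None
    best_score = 0.0
    for heading in heading_texts:
        # ratio() is in [0, 1]; 1.0 means the strings are identical.
        score = SequenceMatcher(None, heading.lower(), title_text.lower()).ratio()
        if score > best_score:
            best_heading, best_score = heading, score
    if best_heading is not None and best_score >= threshold:
        return best_heading
    return title_text


# Hypothetical <title> text with a site suffix, plus the page's h1 text.
clean = pick_title("Google Pixel 5 review | TechRadar", ["Google Pixel 5 review"])

# A heading unrelated to the title is ignored.
fallback = pick_title("Home", ["Completely unrelated heading"])
```

A lower threshold accepts more headings at the risk of picking an unrelated one; the right value depends on how noisy the site's titles are.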


property title_tag_text


method from_url

from_url(url: str)

method get_important_children_soups

get_important_children_soups(multithread: bool = True, limit: int = None)

Returns a list of EzSoup instances from self.important_hrefs.

Parameters:

multithread: True by default; uses ThreadPoolExecutor to crawl children much faster.

limit: limits the number of children that will be crawled.
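The effect of `multithread` and `limit` can be illustrated with a small standalone sketch. It is not the library's code: `crawl_children` and the `fetch` callable are hypothetical, with `fetch` standing in for whatever builds an EzSoup instance from a link (for example `EzSoup.from_url`), and the cap of 8 worker threads is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_children(hrefs, fetch, multithread=True, limit=None):
    """Apply `fetch` to each href, optionally in parallel.

    hrefs:       iterable of links to crawl
    fetch:       callable taking one href and returning a page object
    multithread: use a thread pool instead of a sequential loop
    limit:       maximum number of hrefs to crawl (None means all)
    """
    # Slicing with None keeps the whole list.
    targets = list(hrefs)[:limit]
    if not targets:
        return []
    if multithread:
        # Threads suit this workload because fetching is I/O-bound.
        with ThreadPoolExecutor(max_workers=min(8, len(targets))) as pool:
            # map() returns results in the same order as `targets`.
            return list(pool.map(fetch, targets))
    return [fetch(href) for href in targets]


# A fake fetch keeps the example runnable offline.
def fake_fetch(href):
    return f"page:{href}"


links = ["/a", "/b", "/c", "/d"]
parallel = crawl_children(links, fake_fetch, limit=2)
sequential = crawl_children(links, fake_fetch, multithread=False)
```

Both modes return the same results in the same order; only the wall-clock time differs when `fetch` waits on the network.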


method save_content_summary_html

save_content_summary_html(path: str = None)

method save_content_summary_json

save_content_summary_json(path: str = None)

method save_content_summary_txt

save_content_summary_txt(path: str = None)

_This file was automatically generated via [lazydocs](https://github.com/ml-tooling/lazydocs)._