URL Genie is a Python package, based on research involving over 2 million URLs, designed to handle URLs flexibly in data-driven projects.
It checks the given URL input, validates it against the URL regex, identifies each component of the URL and processes it according to the set flags.
- Handles both encoded and decoded URLs.
- Handles comma separated URLs using recursion.
- Filters out valid URLs, invalid URLs and bad socials with ease.
- Extracts emails and socials from a given text and validates them.
- Validates URLs using regex and over 1400 TLDs (off by default).
- Reduces duplicates by minimizing the general and social URL patterns.
- Detects domain mismatches by extracting the domain along with the TLD and matching it against the email domain.
- Uses researched and refined social regexes to recognize different social patterns and generalize them to a standardized format.
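To make the "encoded URLs" and "comma separated URLs using recursion" items more concrete, here is a minimal standard-library sketch of the idea. It is not URL Genie's implementation: `split_urls` is a hypothetical helper, and it assumes that every comma is a separator, which is not true for URLs that legitimately contain commas.

```python
from urllib.parse import unquote


def split_urls(raw):
    """Recursively split a comma separated string into decoded URL candidates.

    Hypothetical helper for illustration only; not part of URL Genie.
    """
    head, sep, tail = raw.partition(",")
    if not sep:
        # Base case: a single candidate. Decode percent-encoding such as %2F.
        candidate = unquote(raw.strip())
        return [candidate] if candidate else []
    # Recursive case: handle the first item, then the rest of the string.
    return split_urls(head) + split_urls(tail)


print(split_urls("example.com/a%20b, https%3A%2F%2Fexample.org"))
# ['example.com/a b', 'https://example.org']
```

Empty items (for example from a doubled comma) are dropped rather than reported as bad URLs, which is a simplification made for this sketch.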
First things first, you need to install URL Genie by running the following command in your terminal:
python -m pip install urlgenie
That's it! Now you can use URL Genie in your code.
Let's first import the package and create an object of it to access its features.
from urlgenie import UrlGenie
from pprint import pprint
genie = UrlGenie()
Let's give it a sample input URL and get it generalized.
url = ""
gen = genie.generalize(url)
Would return
as the output.
It detects that the scheme is missing and adds it. By default, it removes the query (which starts with ?) and the fragment (which starts with #).
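The behaviour described above (add a missing scheme, drop the query and fragment) can be approximated with the standard library. This is only a sketch of the concept, not how URL Genie works internally: `simple_generalize`, its `default_scheme` value and the `keep_query` / `keep_fragment` parameters are all assumptions made for this example and are not URL Genie flags.

```python
from urllib.parse import urlsplit, urlunsplit


def simple_generalize(url, default_scheme="https", keep_query=False, keep_fragment=False):
    """Hypothetical stand-in for illustration; not URL Genie's generalize()."""
    url = url.strip()
    if "://" not in url:
        # No scheme present, so prepend the assumed default.
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    query = parts.query if keep_query else ""           # text after "?"
    fragment = parts.fragment if keep_fragment else ""  # text after "#"
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


print(simple_generalize("example.com/path?x=1#top"))
# https://example.com/path
print(simple_generalize("example.com/path?x=1#top", keep_query=True))
# https://example.com/path?x=1
```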
As explained previously, URL Genie breaks down the URL, identifies the components and allows you to form the URL as per your needs.
This can be achieved using the flags (boolean parameters) and is explained here:
Below are the different use cases where URL Genie might come in handy.
Just provide a string text and URL Genie will extract a dict containing emails and socials for you.
text = """
This is a good email: and this is a bad email: sample@image.png.
Another would be an email with a custom domain:
Sample facebook, lets try with fb domain:
Lets add a bad facebook:
Lets add 2 twitter formats: and with same handles.
How about a linkedin pub?
Let's also add its in url:"""
result_dict = genie.extract_from_text(text)
This would return:
{'email': {'', ''},
'facebook': {'', '', ''},
'instagram': set(),
'linkedin': {'', ''},
'phone': set(),
'twitter': {'', ''}}
As you can see, it has strict regexes which prevented the bad email (sample@image.png) from being extracted.
But it has extracted which is not really a URL we want since it does not lead to any person / organization / page.
Also, the Twitter entries are duplicates that share the same handle, and they are not in a standardized format.
For that, we can validate the given extract to remove invalid data and standardize the valid ones.
result_dict = genie.extract_from_text(text)
validated_dict = genie.validate_result_dict(result_dict)
This would return:
{'email': {'', ''},
'facebook': {'', ''},
'instagram': set(),
'linkedin': {''},
'phone': set(),
'twitter': {''}}
With this, we have removed the duplicates and the invalid URLs, and generalized the remaining ones, for example LinkedIn PUB to IN.
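As a rough illustration of how duplicate reduction can work, the sketch below maps different spellings of the same Twitter profile to one canonical form, so that a set keeps only one entry per handle. The regex, the canonical `https://twitter.com/<handle>` format and the example handle are assumptions made for this sketch; URL Genie's own patterns and output format may differ.

```python
import re

# Hypothetical pattern: optional scheme, optional www./mobile. prefix, then the handle.
TWITTER_RE = re.compile(
    r"^(?:https?://)?(?:www\.|mobile\.)?twitter\.com/(?:#!/)?@?"
    r"(?P<handle>[A-Za-z0-9_]{1,15})/?$",
    re.IGNORECASE,
)


def standardize_twitter(urls):
    """Return one canonical URL per distinct handle (illustrative only)."""
    standardized = set()
    for url in urls:
        match = TWITTER_RE.match(url.strip())
        if match:
            # Handles are compared case-insensitively, so lowercase them.
            standardized.add(f"https://twitter.com/{match.group('handle').lower()}")
    return standardized


print(standardize_twitter({
    "twitter.com/Example_Handle",
    "https://mobile.twitter.com/example_handle/",
}))
# {'https://twitter.com/example_handle'}
```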
When you scrape websites for contact info, you might get a lot of emails, and not all of them would be related to the organization.
To filter out the ones which are not related to the organization, we can use the email validation.
result_dict = genie.extract_from_text(text)
validated_dict = genie.validate_result_dict(result_dict, url = "")
This would return:
{'email': {''},
'facebook': {'', ''},
'instagram': set(),
'linkedin': {''},
'phone': set(),
'twitter': {''}}
Now we have removed the email that is not related to the organization's URL we provided.
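The domain matching idea can be sketched as follows: take the domain of the organization's URL, take the domain part of each email, and keep only the emails where the two agree. The helper names are hypothetical, and the "last two labels" rule is a deliberate simplification that breaks for suffixes such as co.uk; it is not a description of URL Genie's actual logic.

```python
from urllib.parse import urlsplit


def registrable_domain(host):
    """Naive guess at domain + TLD: the last two dot-separated labels."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def filter_emails_by_url(emails, url):
    """Keep only the emails whose domain matches the URL's domain (sketch)."""
    if "://" not in url:
        url = "https://" + url  # urlsplit needs a scheme to find the host
    site = registrable_domain(urlsplit(url).hostname or "")
    return {
        email for email in emails
        if registrable_domain(email.rsplit("@", 1)[-1]) == site
    }


emails = {"info@example.com", "someone@gmail.com", "hr@mail.example.com"}
print(sorted(filter_emails_by_url(emails, "www.example.com/about")))
# ['hr@mail.example.com', 'info@example.com']
```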
This would prove to be helpful when making scrapers or processing and cleaning data.
You can filter out valid URLs, invalid URLs and invalid socials when you have data in bulk to deal with.
For this example, we would be using data stored in a CSV.
import pandas as pd
from pprint import pprint
from urlgenie import UrlGenie
#-Reading the CSV-#
df = pd.read_csv("test.csv", encoding = "utf-8")
#-Creating UrlGenie object with custom texts for Bad Url and Socials, and TLD validation-#
genie = UrlGenie(bad_url = "Bad Url", bad_social = "Bad Social", proper_tlds = True)
#-Applying the generalize function and creating a new column-#
df["gen"] = df["url"].apply(genie.generalize)
#-Printing the updated dataframe-#
print(df)
Would return:
url gen
0 badbadwebsite?! Bad Url
2 Bad Social
3 random.haz/somePath Bad Url
5 anotherbadwebsite??? Bad Url
As you can see, we got generalized URLs for the valid ones, and Bad Url or Bad Social for the invalid ones.
random.haz was deemed invalid because of the proper_tlds flag, which verified the TLD 'haz' against over 1400 TLDs.
As for the Twitter one, intent is not a valid Twitter page, so it is a valid URL but an invalid social.
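The two kinds of rejection described above can be sketched with the standard library. Everything here is assumed for illustration: `KNOWN_TLDS` is a tiny stand-in for the real list of over 1400 TLDs, `BAD_TWITTER_PATHS` is a guess at non-profile paths, and `classify` is a hypothetical function rather than anything URL Genie exposes.

```python
from urllib.parse import urlsplit

KNOWN_TLDS = {"com", "org", "net", "io"}   # stand-in for the full TLD list
BAD_TWITTER_PATHS = {"intent", "share"}    # assumed non-profile paths


def classify(url, bad_url="Bad Url", bad_social="Bad Social"):
    """Return the URL with a scheme, or a marker string if it is rejected."""
    candidate = url if "://" in url else "https://" + url
    parts = urlsplit(candidate)
    host = parts.hostname or ""
    labels = host.split(".")
    # A usable host needs at least "name.tld", and the TLD must be known.
    if len(labels) < 2 or labels[-1] not in KNOWN_TLDS:
        return bad_url
    if host in {"twitter.com", "www.twitter.com"}:
        first_segment = parts.path.strip("/").split("/")[0]
        if first_segment in BAD_TWITTER_PATHS:
            # Structurally a fine URL, but not a person or organization page.
            return bad_social
    return candidate


for sample in ["badbadwebsite?!", "random.haz/somePath", "twitter.com/intent/tweet"]:
    print(sample, "->", classify(sample))
# badbadwebsite?! -> Bad Url
# random.haz/somePath -> Bad Url
# twitter.com/intent/tweet -> Bad Social
```

Checking the TLD before the social rules mirrors the order in the explanation: a URL has to be valid before it can be judged as a social link.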
