Exp No : 9
Date:
Data Extraction and NLP
Problem Statement
The task is to develop a Python program that extracts textual data from
specified URLs, performs text analysis on the extracted content, and computes a
set of predefined variables. The assignment includes two major components,
Data Extraction and Data Analysis, and a sub-component, Visualization.
Objective
The objective of this assignment is to extract article text from the given
URLs and perform text analysis to compute the required variables and visualize them.
ABOUT
1. Importing libraries
2. Data Extraction
3. Data Analysis
4. Data Preprocessing
5. Visualization
Importing Libraries:
requests
BeautifulSoup
nltk
numpy
pandas
matplotlib
seaborn
Data Extraction:
Given an input Excel file, "Input.xlsx," containing URL IDs and their
corresponding article URLs, the program must extract the article text from
each URL. The extracted text is saved into a text file named with the URL_ID
as the file name.
import os
import requests
from bs4 import BeautifulSoup

# Function to extract and save article text
# (output_dir is assumed to be defined earlier in the script)
def extract_and_save_article(url, url_id):
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url)
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find and extract the article title
        article_title = soup.title.text.strip()
        # Find and extract the article text
        article_text = ' '.join(p.text for p in soup.find_all('p'))
        # Create a text file named after the URL_ID and save the article
        output_filename = os.path.join(output_dir, f'{url_id}.txt')
        with open(output_filename, 'w', encoding='utf-8') as file:
            file.write(f"{article_title}\n\n{article_text}")
        print(f'Article saved: {output_filename}')
    except Exception as e:
        print(f'Error extracting article from {url_id}: {e}')
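The report does not show the driver code that walks "Input.xlsx" and calls the extractor. A minimal sketch, assuming the workbook has URL_ID and URL columns (run_extraction is a hypothetical helper, not part of the original code):

```python
import pandas as pd

def run_extraction(input_df, extractor):
    # Call extractor(url, url_id) once per row of the input table.
    for _, row in input_df.iterrows():
        extractor(row['URL'], row['URL_ID'])
    return len(input_df)

# Usage: run_extraction(pd.read_excel('Input.xlsx'), extract_and_save_article)
```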
Data Analysis:
After extracting the text, the program should perform textual analysis
to compute various variables listed in the "Text Analysis.docx"
document.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Function for textual analysis
def text_analysis(article_text):
    stop_words = set(stopwords.words('english'))
    # Load positive and negative words from the master dictionary
    with open('MasterDictionary-20230918T180035Z-001/MasterDictionary/positive-words.txt') as f:
        positive_words = set(f.read().split())
    with open('MasterDictionary-20230918T180035Z-001/MasterDictionary/negative-words.txt') as f:
        negative_words = set(f.read().split())
    # Tokenize the text
    tokens = word_tokenize(article_text)
    # Clean tokens by removing stop words and non-alphabetic tokens
    clean_tokens = [token.lower() for token in tokens
                    if token.isalpha() and token.lower() not in stop_words]
    # Calculate positive and negative scores (both kept non-negative so the
    # polarity and subjectivity formulas below behave as intended)
    positive_score = sum(1 for word in clean_tokens if word in positive_words)
    negative_score = sum(1 for word in clean_tokens if word in negative_words)
    # Calculate polarity score
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)
    # Calculate subjectivity score
    subjectivity_score = (positive_score + negative_score) / (len(clean_tokens) + 0.000001)
    # Tokenize the text into sentences
    sentences = sent_tokenize(article_text)
    # Calculate average sentence length
    avg_sentence_length = len(word_tokenize(article_text)) / len(sentences)
    # Calculate percentage of complex words
    words = [word.lower() for word in word_tokenize(article_text) if word.isalpha()]
    # assuming words with more than 2 characters are complex
    complex_words = [word for word in words if len(word) > 2]
    percentage_complex_words = len(complex_words) / len(words)
    # Calculate fog index
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)
    # Calculate average number of words per sentence
    avg_words_per_sentence = len(words) / len(sentences)
    # Calculate complex word count
    complex_word_count = len(complex_words)
    # Calculate syllables per word
    syllable_per_word = syllable_count(words) / len(words)
    # Count personal pronouns
    pronoun_count = personal_pronouns_count(article_text)
    # Calculate average word length
    avg_word_length = average_word_length(article_text)
    return {
        'PositiveScore': positive_score,
        'NegativeScore': negative_score,
        'PolarityScore': polarity_score,
        'SubjectivityScore': subjectivity_score,
        'AvgSentenceLength': avg_sentence_length,
        'PercentageOfComplexWords': percentage_complex_words,
        'FogIndex': fog_index,
        'AvgNumberOfWordsPerSentence': avg_words_per_sentence,
        'ComplexWordCount': complex_word_count,
        'WordCount': len(words),
        'SyllablePerWord': syllable_per_word,
        'PersonalPronouns': pronoun_count,
        'AvgWordLength': avg_word_length
    }
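The analysis function calls three helpers (syllable_count, personal_pronouns_count, average_word_length) that the report does not list. A minimal sketch under simple heuristic assumptions (vowel-group syllable counting, a fixed pronoun list) might be:

```python
import re

def syllable_count(words):
    # Heuristic syllable total for a list of words: count vowel groups,
    # discounting common silent endings like "es"/"ed".
    total = 0
    for word in words:
        count = len(re.findall(r'[aeiouy]+', word.lower()))
        if word.lower().endswith(('es', 'ed')) and count > 1:
            count -= 1
        total += max(count, 1)
    return total

def personal_pronouns_count(text):
    # Count personal pronouns, excluding the country abbreviation "US".
    matches = re.findall(r'\b(I|we|my|ours|us)\b', text, flags=re.IGNORECASE)
    return len([m for m in matches if m != 'US'])

def average_word_length(text):
    # Average character length over alphabetic whitespace-split tokens.
    words = [w for w in text.split() if w.isalpha()]
    return sum(len(w) for w in words) / len(words) if words else 0.0
```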
The computed variables must be saved in the specified order in the "Output
Data Structure.xlsx" file.
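The saving step itself is not shown in the report. A minimal sketch, assuming each article's metrics are collected as the dict returned by text_analysis (build_output_table is a hypothetical helper):

```python
import pandas as pd

def build_output_table(url_ids, urls, results):
    # results: one dict of computed variables per article, in the
    # same key order as the text_analysis return value.
    df = pd.DataFrame(results)
    df.insert(0, 'URL', urls)
    df.insert(0, 'URL_ID', url_ids)
    return df

# Writing then takes one line:
# build_output_table(ids, urls, rows).to_excel('Output Data Structure.xlsx', index=False)
```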
Data Preprocessing
Open the file:
df = pd.read_excel('C:/Users/pushpasri/OneDrive/Desktop/Assignment/Output Data Structure.xlsx')
Data Cleaning:
df.head(10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 URL_ID 114 non-null float64
1 URL 114 non-null object
2 PositiveScore 114 non-null int64
3 NegativeScore 114 non-null int64
4 PolarityScore 114 non-null float64
5 SubjectivityScore 114 non-null float64
6 AvgSentenceLength 114 non-null float64
7 PercentageOfComplexWords 114 non-null float64
8 FogIndex 114 non-null float64
9 AvgNumberOfWordsPerSentence 114 non-null float64
10 ComplexWordCount 114 non-null int64
11 WordCount 114 non-null int64
12 SyllablePerWord 114 non-null float64
13 PersonalPronouns 114 non-null int64
14 AvgWordLength 114 non-null float64
dtypes: float64(9), int64(5), object(1)
memory usage: 13.5+ KB
df = df.dropna()  # dropna() returns a new DataFrame; reassign to keep the result
Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Visualization 1: Average Positive and Negative Scores
sns.barplot(x=['PositiveScore', 'NegativeScore'],
            y=[df['PositiveScore'].mean(), df['NegativeScore'].mean()])
plt.title('Average Positive and Negative Scores')
plt.show()
# Visualization 2: Polarity and Subjectivity Scores
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['PolarityScore'], bins=20, kde=True)
plt.title('Distribution of Polarity Scores')
plt.subplot(1, 2, 2)
sns.histplot(df['SubjectivityScore'], bins=20, kde=True)
plt.title('Distribution of Subjectivity Scores')
plt.tight_layout()
plt.show()
# Visualization 3: Average Sentence Length
sns.boxplot(y=df['AvgSentenceLength'])
plt.title('Distribution of Average Sentence Length')
plt.show()
# Visualization 4: Percentage of Complex Words
sns.histplot(df['PercentageOfComplexWords'], bins=20, kde=True)
plt.title('Distribution of Percentage of Complex Words')
plt.show()
# Visualization 5: FOG Index
sns.boxplot(y=df['FogIndex'])
plt.title('Distribution of FOG Index')
plt.show()
sns.histplot(df['AvgNumberOfWordsPerSentence'], bins=20, kde=True)
plt.title('Distribution of Average Number of Words Per Sentence')
plt.show()
sns.barplot(x=df['URL_ID'], y=df['ComplexWordCount'])
plt.title('Complex Word Count for Each Entry')
plt.show()
plt.scatter(df['URL_ID'], df['PersonalPronouns'])
plt.xlabel('URL_ID')
plt.ylabel('PersonalPronouns')
plt.title('Personal Pronouns per Article')
plt.show()
sns.histplot(df['WordCount'], bins=20, kde=True)
plt.title('Distribution of Word Count')
plt.show()