
Exp No: 9

Date:
Data Extraction and NLP

Problem Statement
The task is to develop a Python program that extracts textual data from
specified URLs, performs text analysis on the extracted content, and computes
a set of predefined variables. The assignment includes two major components,
Data Extraction and Data Analysis, with Visualization as a sub-component.

Objective
The objective of this assignment is to extract the text of articles from the
given URLs and perform text analysis to compute the variables and visualize them.

ABOUT
1. Importing libraries
2. Data Extraction
3. Data Analysis
4. Data Preprocessing
5. Visualization

Importing Libraries:
 BeautifulSoup
 requests
 os
 nltk (stopwords, word_tokenize, sent_tokenize)
 numpy
 pandas
 matplotlib
 seaborn
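Pulled together, the imports used across the program look roughly like this (a consolidated sketch; the nltk.download calls are one-time setup, needed only if the punkt and stopwords resources are not already installed):

import os
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# One-time setup: fetch the tokenizer models and stopword list used below
nltk.download('punkt')
nltk.download('stopwords')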
Data Extraction:
Given an input Excel file, "Input.xlsx," containing articles and their
corresponding URLs, the program must extract article text from each URL .
The extracted text should be saved into text files named with the URL_ID as the
file name.
# Function to extract and save article text
def extract_and_save_article(url, url_id):
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url)

        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find and extract the article title
        article_title = soup.title.text.strip()

        # Find and extract the article text (all paragraph tags)
        article_text = ' '.join([p.text for p in soup.find_all('p')])

        # Create a text file named after the URL_ID and save the article
        output_filename = os.path.join(output_dir, f'{url_id}.txt')
        with open(output_filename, 'w', encoding='utf-8') as file:
            file.write(f"{article_title}\n\n{article_text}")

        print(f'Article saved: {output_filename}')

    except Exception as e:
        print(f'Error extracting article from {url_id}: {str(e)}')
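A driver loop would call this function once per row of the input file (a minimal sketch; the column names URL_ID and URL are assumed from the output structure shown later, and output_dir is the folder the function writes into):

# Assumed layout: Input.xlsx has one row per article with URL_ID and URL columns
input_df = pd.read_excel('Input.xlsx')

# Folder the extraction function writes into (name is an assumption)
output_dir = 'extracted_articles'
os.makedirs(output_dir, exist_ok=True)

# Extract and save every article listed in the input file
for _, row in input_df.iterrows():
    extract_and_save_article(row['URL'], row['URL_ID'])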

Data Analysis:
After extracting the text, the program should perform textual analysis
to compute various variables listed in the "Text Analysis.docx"
document.
# Function for textual analysis
def text_analysis(article_text):
    stop_words = set(stopwords.words('english'))

    # Load positive and negative words from the master dictionary
    positive_words = set(open(
        'MasterDictionary-20230918T180035Z-001/MasterDictionary/positive-words.txt',
        'r').read().split())
    negative_words = set(open(
        'MasterDictionary-20230918T180035Z-001/MasterDictionary/negative-words.txt',
        'r').read().split())

    # Tokenize the text
    tokens = word_tokenize(article_text)

    # Clean tokens by removing stop words and non-alphabetic tokens
    clean_tokens = [token.lower() for token in tokens
                    if token.isalpha() and token.lower() not in stop_words]

    # Calculate positive and negative scores (both kept non-negative so the
    # polarity and subjectivity formulas below behave as intended)
    positive_score = sum(1 for word in clean_tokens if word in positive_words)
    negative_score = sum(1 for word in clean_tokens if word in negative_words)

    # Calculate polarity score
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)

    # Calculate subjectivity score
    subjectivity_score = (positive_score + negative_score) / (len(clean_tokens) + 0.000001)

    # Tokenize the text into sentences
    sentences = sent_tokenize(article_text)

    # Calculate average sentence length (words per sentence)
    avg_sentence_length = len(tokens) / len(sentences)

    # Calculate percentage of complex words; the heuristic here treats words
    # with more than 2 characters as complex (the Gunning Fog definition
    # normally uses words with three or more syllables)
    words = [word.lower() for word in tokens if word.isalpha()]
    complex_words = [word for word in words if len(word) > 2]
    percentage_complex_words = len(complex_words) / len(words)

    # Calculate fog index
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)

    # Calculate average number of words per sentence
    avg_words_per_sentence = len(words) / len(sentences)

    # Calculate complex word count
    complex_word_count = len(complex_words)

    # Calculate syllables per word (syllable_count is a helper, sketched below)
    syllable_per_word = syllable_count(words) / len(words)

    # Count personal pronouns (helper sketched below)
    pronoun_count = personal_pronouns_count(article_text)

    # Calculate average word length (helper sketched below)
    avg_word_length = average_word_length(article_text)

    return {
        'PositiveScore': positive_score,
        'NegativeScore': negative_score,
        'PolarityScore': polarity_score,
        'SubjectivityScore': subjectivity_score,
        'AvgSentenceLength': avg_sentence_length,
        'PercentageOfComplexWords': percentage_complex_words,
        'FogIndex': fog_index,
        'AvgNumberOfWordsPerSentence': avg_words_per_sentence,
        'ComplexWordCount': complex_word_count,
        'WordCount': len(words),
        'SyllablePerWord': syllable_per_word,
        'PersonalPronouns': pronoun_count,
        'AvgWordLength': avg_word_length
    }
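The helpers syllable_count, personal_pronouns_count, and average_word_length are called above but not shown in the document. A minimal sketch of plausible implementations, assuming a vowel-group syllable heuristic and the usual pronoun list (I, we, my, ours, us, with the country abbreviation "US" excluded):

import re

# Hypothetical helper: crude syllable count via groups of vowels
def syllable_count(words):
    total = 0
    for word in words:
        vowel_groups = re.findall(r'[aeiouy]+', word.lower())
        total += max(1, len(vowel_groups))
    return total

# Hypothetical helper: count personal pronouns, excluding the country "US"
def personal_pronouns_count(text):
    matches = re.findall(r'\b(I|we|my|ours|us)\b', text, flags=re.IGNORECASE)
    return sum(1 for m in matches if m != 'US')

# Hypothetical helper: average characters per alphabetic word
def average_word_length(text):
    words = [w for w in text.split() if w.isalpha()]
    return sum(len(w) for w in words) / len(words) if words else 0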
The computed variables must be saved in the specified order in the "Output
Data Structure.xlsx" file.

Data Preprocessing
Open the file:
df = pd.read_excel('C:/Users/pushpasri/OneDrive/Desktop/Assignment/Output Data Structure.xlsx')

Data Cleaning:
df.head(10)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   URL_ID                       114 non-null    float64
 1   URL                          114 non-null    object
 2   PositiveScore                114 non-null    int64
 3   NegativeScore                114 non-null    int64
 4   PolarityScore                114 non-null    float64
 5   SubjectivityScore            114 non-null    float64
 6   AvgSentenceLength            114 non-null    float64
 7   PercentageOfComplexWords     114 non-null    float64
 8   FogIndex                     114 non-null    float64
 9   AvgNumberOfWordsPerSentence  114 non-null    float64
 10  ComplexWordCount             114 non-null    int64
 11  WordCount                    114 non-null    int64
 12  SyllablePerWord              114 non-null    float64
 13  PersonalPronouns             114 non-null    int64
 14  AvgWordLength                114 non-null    float64
dtypes: float64(9), int64(5), object(1)
memory usage: 13.5+ KB

# Drop any rows with missing values (reassign so the result is kept)
df = df.dropna()
Visualization
# Visualization 1: Average Positive and Negative Scores
sns.barplot(x=['PositiveScore', 'NegativeScore'],
            y=[df['PositiveScore'].mean(), df['NegativeScore'].mean()])
plt.title('Average Positive and Negative Scores')
plt.show()

# Visualization 2: Polarity and Subjectivity Scores
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['PolarityScore'], bins=20, kde=True)
plt.title('Distribution of Polarity Scores')

plt.subplot(1, 2, 2)
sns.histplot(df['SubjectivityScore'], bins=20, kde=True)
plt.title('Distribution of Subjectivity Scores')

plt.tight_layout()
plt.show()

# Visualization 3: Average Sentence Length
sns.boxplot(y=df['AvgSentenceLength'])
plt.title('Distribution of Average Sentence Length')
plt.show()

# Visualization 4: Percentage of Complex Words
sns.histplot(df['PercentageOfComplexWords'], bins=20, kde=True)
plt.title('Distribution of Percentage of Complex Words')
plt.show()
# Visualization 5: FOG Index
sns.boxplot(y=df['FogIndex'])
plt.title('Distribution of FOG Index')
plt.show()

# Visualization 6: Average Number of Words Per Sentence
sns.histplot(df['AvgNumberOfWordsPerSentence'], bins=20, kde=True)
plt.title('Distribution of Average Number of Words Per Sentence')
plt.show()

# Visualization 7: Complex Word Count per Article
sns.barplot(x=df['URL_ID'], y=df['ComplexWordCount'])
plt.title('Complex Word Count for Each Entry')
plt.show()

# Visualization 8: Personal Pronouns per Article
plt.scatter(df['URL_ID'], df['PersonalPronouns'])
plt.xlabel('URL_ID')
plt.ylabel('PersonalPronouns')
plt.title('Personal Pronoun Count per URL_ID')
plt.show()

# Visualization 9: Word Count
sns.histplot(df['WordCount'], bins=20, kde=True)
plt.title('Distribution of Word Count')
plt.show()
