Exp No : 9
Date:
Data Extraction and NLP
Problem Statement
The task is to develop a Python program that extracts textual data from
specified URLs, performs text analysis on the extracted content, and computes a
set of predefined variables. The assignment includes two major components,
Data Extraction and Data Analysis, and a sub-component, Visualization.
Objective
The objective of this assignment is to extract article text from the given
URLs and perform text analysis to compute the required variables and visualize them.
ABOUT
1. Importing libraries
2. Data Extraction
3. Data Analysis
4. Data Preprocessing
5. Visualization
Importing Libraries:
requests
BeautifulSoup
nltk
numpy
pandas
matplotlib
seaborn
Data Extraction:
Given an input Excel file, "Input.xlsx," containing URL IDs and their
corresponding article URLs, the program must extract the article text from
each URL. The extracted text is saved into a text file named with the URL_ID
as the file name.
import os
import requests
from bs4 import BeautifulSoup

# Function to extract and save article text
# (output_dir is assumed to be defined earlier in the script)
def extract_and_save_article(url, url_id):
    try:
        # Send an HTTP GET request to the URL
        response = requests.get(url)
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find and extract the article title
        article_title = soup.title.text.strip()
        # Find and extract the article text
        article_text = ' '.join(p.text for p in soup.find_all('p'))
        # Create a text file named after the URL_ID and save the article
        output_filename = os.path.join(output_dir, f'{url_id}.txt')
        with open(output_filename, 'w', encoding='utf-8') as file:
            file.write(f"{article_title}\n\n{article_text}")
        print(f'Article saved: {output_filename}')
    except Exception as e:
        print(f'Error extracting article from {url_id}: {e}')
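The report does not show the driver code that walks "Input.xlsx" and calls the extractor. A minimal sketch, assuming the workbook has URL_ID and URL columns (run_extraction is a hypothetical helper, not part of the original code):

```python
import pandas as pd

def run_extraction(input_df, extractor):
    # Call extractor(url, url_id) once per row of the input table.
    for _, row in input_df.iterrows():
        extractor(row['URL'], row['URL_ID'])
    return len(input_df)

# Usage: run_extraction(pd.read_excel('Input.xlsx'), extract_and_save_article)
```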
Data Analysis:
After extracting the text, the program should perform textual analysis
to compute various variables listed in the "Text Analysis.docx"
document.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Function for textual analysis
def text_analysis(article_text):
    stop_words = set(stopwords.words('english'))
    # Load positive and negative words from the master dictionary
    with open('MasterDictionary-20230918T180035Z-001/MasterDictionary/positive-words.txt') as f:
        positive_words = set(f.read().split())
    with open('MasterDictionary-20230918T180035Z-001/MasterDictionary/negative-words.txt') as f:
        negative_words = set(f.read().split())
    # Tokenize the text
    tokens = word_tokenize(article_text)
    # Clean tokens by removing stop words and non-alphabetic tokens
    clean_tokens = [token.lower() for token in tokens
                    if token.isalpha() and token.lower() not in stop_words]
    # Calculate positive and negative scores (both kept non-negative so the
    # polarity and subjectivity formulas below behave as intended)
    positive_score = sum(1 for word in clean_tokens if word in positive_words)
    negative_score = sum(1 for word in clean_tokens if word in negative_words)
    # Calculate polarity score
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 0.000001)
    # Calculate subjectivity score
    subjectivity_score = (positive_score + negative_score) / (len(clean_tokens) + 0.000001)
    # Tokenize the text into sentences
    sentences = sent_tokenize(article_text)
    # Calculate average sentence length
    avg_sentence_length = len(word_tokenize(article_text)) / len(sentences)
    # Calculate percentage of complex words
    words = [word.lower() for word in word_tokenize(article_text) if word.isalpha()]
    # assuming words with more than 2 characters are complex
    complex_words = [word for word in words if len(word) > 2]
    percentage_complex_words = len(complex_words) / len(words)
    # Calculate fog index
    fog_index = 0.4 * (avg_sentence_length + percentage_complex_words)
    # Calculate average number of words per sentence
    avg_words_per_sentence = len(words) / len(sentences)
    # Calculate complex word count
    complex_word_count = len(complex_words)
    # Calculate syllables per word
    syllable_per_word = syllable_count(words) / len(words)
    # Count personal pronouns
    pronoun_count = personal_pronouns_count(article_text)
    # Calculate average word length
    avg_word_length = average_word_length(article_text)
    return {
        'PositiveScore': positive_score,
        'NegativeScore': negative_score,
        'PolarityScore': polarity_score,
        'SubjectivityScore': subjectivity_score,
        'AvgSentenceLength': avg_sentence_length,
        'PercentageOfComplexWords': percentage_complex_words,
        'FogIndex': fog_index,
        'AvgNumberOfWordsPerSentence': avg_words_per_sentence,
        'ComplexWordCount': complex_word_count,
        'WordCount': len(words),
        'SyllablePerWord': syllable_per_word,
        'PersonalPronouns': pronoun_count,
        'AvgWordLength': avg_word_length
    }
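The analysis function calls three helpers (syllable_count, personal_pronouns_count, average_word_length) that the report does not list. A minimal sketch under simple heuristic assumptions (vowel-group syllable counting, a fixed pronoun list) might be:

```python
import re

def syllable_count(words):
    # Heuristic syllable total for a list of words: count vowel groups,
    # discounting common silent endings like "es"/"ed".
    total = 0
    for word in words:
        count = len(re.findall(r'[aeiouy]+', word.lower()))
        if word.lower().endswith(('es', 'ed')) and count > 1:
            count -= 1
        total += max(count, 1)
    return total

def personal_pronouns_count(text):
    # Count personal pronouns, excluding the country abbreviation "US".
    matches = re.findall(r'\b(I|we|my|ours|us)\b', text, flags=re.IGNORECASE)
    return len([m for m in matches if m != 'US'])

def average_word_length(text):
    # Average character length over alphabetic whitespace-split tokens.
    words = [w for w in text.split() if w.isalpha()]
    return sum(len(w) for w in words) / len(words) if words else 0.0
```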
The computed variables must be saved in the specified order in the "Output
Data Structure.xlsx" file.
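The saving step itself is not shown in the report. A minimal sketch, assuming each article's metrics are collected as the dict returned by text_analysis (build_output_table is a hypothetical helper):

```python
import pandas as pd

def build_output_table(url_ids, urls, results):
    # results: one dict of computed variables per article, in the
    # same key order as the text_analysis return value.
    df = pd.DataFrame(results)
    df.insert(0, 'URL', urls)
    df.insert(0, 'URL_ID', url_ids)
    return df

# Writing then takes one line:
# build_output_table(ids, urls, rows).to_excel('Output Data Structure.xlsx', index=False)
```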
Data Preprocessing
Open the file:
df = pd.read_excel('C:/Users/pushpasri/OneDrive/Desktop/Assignment/Output Data Structure.xlsx')
Data Cleaning:
df.head(10)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 URL_ID 114 non-null float64
1 URL 114 non-null object
2 PositiveScore 114 non-null int64
3 NegativeScore 114 non-null int64
4 PolarityScore 114 non-null float64
5 SubjectivityScore 114 non-null float64
6 AvgSentenceLength 114 non-null float64
7 PercentageOfComplexWords 114 non-null float64
8 FogIndex 114 non-null float64
9 AvgNumberOfWordsPerSentence 114 non-null float64
10 ComplexWordCount 114 non-null int64
11 WordCount 114 non-null int64
12 SyllablePerWord 114 non-null float64
13 PersonalPronouns 114 non-null int64
14 AvgWordLength 114 non-null float64
dtypes: float64(9), int64(5), object(1)
memory usage: 13.5+ KB
df = df.dropna()  # dropna() returns a new DataFrame; reassign to keep the result
Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Visualization 1: Average Positive and Negative Scores
sns.barplot(x=['PositiveScore', 'NegativeScore'],
            y=[df['PositiveScore'].mean(), df['NegativeScore'].mean()])
plt.title('Average Positive and Negative Scores')
plt.show()
# Visualization 2: Polarity and Subjectivity Scores
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
sns.histplot(df['PolarityScore'], bins=20, kde=True)
plt.title('Distribution of Polarity Scores')
plt.subplot(1, 2, 2)
sns.histplot(df['SubjectivityScore'], bins=20, kde=True)
plt.title('Distribution of Subjectivity Scores')
plt.tight_layout()
plt.show()
# Visualization 3: Average Sentence Length
sns.boxplot(y=df['AvgSentenceLength'])
plt.title('Distribution of Average Sentence Length')
plt.show()
# Visualization 4: Percentage of Complex Words
sns.histplot(df['PercentageOfComplexWords'], bins=20, kde=True)
plt.title('Distribution of Percentage of Complex Words')
plt.show()
# Visualization 5: FOG Index
sns.boxplot(y=df['FogIndex'])
plt.title('Distribution of FOG Index')
plt.show()
sns.histplot(df['AvgNumberOfWordsPerSentence'], bins=20, kde=True)
plt.title('Distribution of Average Number of Words Per Sentence')
plt.show()
sns.barplot(x=df['URL_ID'], y=df['ComplexWordCount'])
plt.title('Complex Word Count for Each Entry')
plt.show()
plt.scatter(df['URL_ID'], df['PersonalPronouns'])
plt.xlabel('URL_ID')
plt.ylabel('PersonalPronouns')
plt.title('Personal Pronouns per Article')
plt.show()
sns.histplot(df['WordCount'], bins=20, kde=True)
plt.title('Distribution of Word Count')
plt.show()