script that uses web scraping to extract information from a website

vluzzy · 01-02-2024, 02:45 AM

Requirements:

Install the requests, beautifulsoup4, spacy, and matplotlib libraries:

pip install requests beautifulsoup4 spacy matplotlib

Note: Before running the script, you'll also need to download the spaCy model for English:

python -m spacy download en_core_web_sm
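
To confirm the installs worked before running the script, you can check that Python can locate each module. This is only a rough sketch: the helper name missing_modules is made up for illustration, and the list uses import names, which can differ from the pip names (beautifulsoup4 is imported as bs4).

```python
from importlib.util import find_spec

# Import names, which can differ from the pip package names
# (beautifulsoup4 is imported as bs4). The spaCy model installs
# as a regular package, so it can be checked the same way.
REQUIRED = ["requests", "bs4", "spacy", "matplotlib", "en_core_web_sm"]

def missing_modules(names):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported as missing, rerun the matching pip or spacy download command above.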

python

import requests
from bs4 import BeautifulSoup
import spacy
import matplotlib.pyplot as plt
from collections import Counter

def scrape_news_article(url):
    # The timeout keeps the script from hanging on an unresponsive server
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Join the text of every <p> tag into one string
        paragraphs = soup.find_all('p')
        article_text = ' '.join([p.get_text() for p in paragraphs])
        return article_text
    else:
        raise Exception(f"Failed to fetch the news article. Status Code: {response.status_code}")

def extract_entities(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # Keep only organizations, people, and geopolitical entities
    entities = [ent.text for ent in doc.ents if ent.label_ in ['ORG', 'PERSON', 'GPE']]
    return entities

def visualize_entities(entities):
    entity_counter = Counter(entities)
    top_entities = entity_counter.most_common(5)

    # zip(*[]) cannot be unpacked into two names, so skip the chart if nothing was found
    if not top_entities:
        print("No entities to plot.")
        return

    labels, counts = zip(*top_entities)

    plt.bar(labels, counts, color='skyblue')
    plt.xlabel('Entities')
    plt.ylabel('Mentions')
    plt.title('Top Mentioned Entities in the News Article')
    plt.show()

def main():
    article_url = input("Enter the URL of the news article: ")

    try:
        article_text = scrape_news_article(article_url)
        extracted_entities = extract_entities(article_text)

        # Print the same top five that the chart shows, with their counts
        print("\nTop Mentioned Entities:")
        for entity, count in Counter(extracted_entities).most_common(5):
            print(f"- {entity} ({count})")

        visualize_entities(extracted_entities)

    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()
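
The ranking step can be tried on its own, without spaCy or Matplotlib. Below is a minimal sketch: the helper name top_entities and the sample list are made up for illustration and are not part of the script above. It also strips stray whitespace, which is an extra assumption about how entity strings might differ.

```python
from collections import Counter

def top_entities(entities, n=5):
    """Return up to n (entity, count) pairs, most frequent first."""
    # Strip surrounding whitespace so "Paris" and "Paris " count as one entity,
    # and drop strings that are empty after stripping
    cleaned = [e.strip() for e in entities if e.strip()]
    return Counter(cleaned).most_common(n)

# Hypothetical sample standing in for the output of extract_entities()
sample = ["NASA", "Paris", "NASA", "Paris ", "Elon Musk", "NASA"]
for name, count in top_entities(sample, n=2):
    print(f"- {name}: {count}")
```

An empty input returns an empty list, so the caller can decide whether to skip the chart.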

Explanation:

The script fetches the article's HTML with the requests library and parses it with BeautifulSoup, joining the text of every <p> tag into one string.
It uses spaCy for named entity recognition (NER) to extract organizations (ORG), persons (PERSON), and geopolitical entities (GPE) from the article.
The visualize_entities function creates a bar chart with Matplotlib showing the five most frequently mentioned entities.
The main function prompts the user for a news article URL, scrapes the article, extracts entities, prints the top mentioned entities, and visualizes them.
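
Since main passes whatever the user types straight to requests, it may help to check the input first. A minimal sketch using only the standard library follows; the helper name is_http_url is hypothetical, and treating only http and https as acceptable is an assumption.

```python
from urllib.parse import urlparse

def is_http_url(url):
    """Return True if url has an http or https scheme and a host part."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Example inputs a user might type at the prompt
for candidate in ["https://example.com/news/story", "example.com/news", "ftp://example.com"]:
    print(candidate, is_http_url(candidate))
```

In main, this check could run right after input(), printing a message and returning early when it fails.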