script that uses web scraping to extract information from a website 01-02-2024, 02:45 AM
Install the requests, beautifulsoup4, spacy, and matplotlib libraries:
pip install requests beautifulsoup4 spacy matplotlib
Note: Before running the script, you'll also need to download the spaCy model for English:
python -m spacy download en_core_web_sm
import requests
from bs4 import BeautifulSoup
import spacy
import matplotlib.pyplot as plt
from collections import Counter


def scrape_news_article(url):
    # Fetch the page and join the text of all <p> tags into one string.
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        article_text = ' '.join([p.get_text() for p in paragraphs])
        return article_text
    raise Exception(f"Failed to fetch the news article. Status Code: {response.status_code}")


def extract_entities(text):
    # Run spaCy NER and keep organizations, persons, and geopolitical entities.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents if ent.label_ in ['ORG', 'PERSON', 'GPE']]
    return entities


def visualize_entities(entities):
    # Plot the five most frequent entities as a bar chart.
    entity_counter = Counter(entities)
    top_entities = entity_counter.most_common(5)
    if not top_entities:
        # zip(*[]) cannot be unpacked into two names, so stop early.
        print("No entities to plot.")
        return
    labels, counts = zip(*top_entities)
    plt.bar(labels, counts, color='skyblue')
    plt.title('Top Mentioned Entities in the News Article')
    plt.show()


def main():
    article_url = input("Enter the URL of the news article: ")
    try:
        article_text = scrape_news_article(article_url)
        extracted_entities = extract_entities(article_text)
        print("\nTop Mentioned Entities:")
        for entity in extracted_entities:
            print(f"- {entity}")
        visualize_entities(extracted_entities)
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
The script uses the requests library to fetch the HTML of a news article and BeautifulSoup to parse it and join the text of every <p> element.
It uses spaCy for named entity recognition (NER) and keeps only organizations (ORG), persons (PERSON), and geopolitical entities (GPE).
The visualize_entities function counts the entities and draws a Matplotlib bar chart of the five most frequently mentioned ones.
The main function prompts the user for a news article URL, scrapes the article, extracts the entities, prints them, and plots the most frequent ones.
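The counting step behind the bar chart can be tried on its own, without fetching a page or loading spaCy. The following is a minimal sketch under stated assumptions: top_entity_counts is a hypothetical helper name and the sample list is made-up data, neither is part of the script above. It shows how Counter.most_common and zip(*...) turn a list of entity strings into the labels and counts that a bar chart needs.

```python
from collections import Counter


def top_entity_counts(entities, n=5):
    """Return (labels, counts) for the n most frequent entity strings.

    Hypothetical helper for illustration; not part of the original script.
    """
    top = Counter(entities).most_common(n)
    if not top:
        # Nothing to unpack when no entities were found.
        return (), ()
    # zip(*pairs) turns [(label, count), ...] into (labels...), (counts...).
    labels, counts = zip(*top)
    return labels, counts


# Made-up sample data standing in for the output of extract_entities().
sample = ["Reuters", "Paris", "Reuters", "Ada Lovelace", "Paris", "Reuters"]
labels, counts = top_entity_counts(sample, n=2)
print(labels, counts)  # ('Reuters', 'Paris') (3, 2)
```

Returning empty tuples for an empty input is one possible choice; it lets the caller skip plotting instead of hitting an unpacking error when an article yields no matching entities.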