Script that uses web scraping to extract information from a website
Requirements:

    Install the requests, beautifulsoup4, spacy, and matplotlib libraries:

    pip install requests beautifulsoup4 spacy matplotlib

Note: Before running the script, you'll also need to download the spaCy model for English:

python -m spacy download en_core_web_sm
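
If you want to confirm the setup before running the script, the check below is one way to do it. It is only a sketch using the standard library: the missing_modules helper and the REQUIRED_MODULES list are names made up for this example, and it assumes the spaCy model was installed with the download command above, which makes it importable as en_core_web_sm.

```python
from importlib.util import find_spec

# Import names, not pip names: beautifulsoup4 is imported as "bs4".
# "en_core_web_sm" is the package created by `python -m spacy download`.
REQUIRED_MODULES = ["requests", "bs4", "spacy", "matplotlib", "en_core_web_sm"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

find_spec only looks the module up, it does not import it, so the check is quick even though spaCy itself is slow to load.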

python

import requests
from bs4 import BeautifulSoup
import spacy
import matplotlib.pyplot as plt
from collections import Counter

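def scrape_news_article(url):
    """Fetch a page and return the text of all its <p> elements as one string."""
    # A timeout keeps the script from hanging on an unresponsive server.
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        paragraphs = soup.find_all('p')
        # Reduce each paragraph to plain text and join them with spaces.
        article_text = ' '.join(p.get_text() for p in paragraphs)
        return article_text
    else:
        raise Exception(f"Failed to fetch the news article. Status Code: {response.status_code}")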

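def extract_entities(text):
    """Return the text of every ORG, PERSON, and GPE entity spaCy finds."""
    # Loading the model is slow, but it only happens once per run here.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # Strip stray whitespace so identical names are counted together later.
    entities = [ent.text.strip() for ent in doc.ents if ent.label_ in ('ORG', 'PERSON', 'GPE')]
    return entities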

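def visualize_entities(entities):
    """Draw a bar chart of the five most frequently mentioned entities."""
    entity_counter = Counter(entities)
    top_entities = entity_counter.most_common(5)

    # With no entities, the zip(*...) unpacking below would raise ValueError.
    if not top_entities:
        print("No entities to visualize.")
        return

    labels, counts = zip(*top_entities)

    plt.bar(labels, counts, color='skyblue')
    plt.xlabel('Entities')
    plt.ylabel('Mentions')
    plt.title('Top Mentioned Entities in the News Article')
    plt.show()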

def main():
    article_url = input("Enter the URL of the news article: ").strip()

    try:
        article_text = scrape_news_article(article_url)
        extracted_entities = extract_entities(article_text)

        # Print only the five most frequent entities, matching the chart.
        print("\nTop Mentioned Entities:")
        for entity, count in Counter(extracted_entities).most_common(5):
            print(f"- {entity} ({count})")

        visualize_entities(extracted_entities)

    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

Explanation:

    The script uses the requests library to fetch the article's HTML and BeautifulSoup to parse it, joining the text of every <p> element into one string.
    It uses spaCy for named entity recognition (NER), keeping only organizations (ORG), persons (PERSON), and geopolitical entities (GPE).
    The visualize_entities function uses Matplotlib to draw a bar chart of the five most frequently mentioned entities.
    The main function prompts the user for a news article URL, scrapes the article, extracts entities, prints the top mentioned entities, and visualizes them.
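
One thing to be aware of with the counting step: Counter treats "Apple" and "apple" as two different entities, so the same name can be split across several bars. The sketch below shows one possible way to merge such variants before charting. The top_entities function and the sample list are made up for this example and are not part of the script above.

```python
from collections import Counter


def top_entities(entities, n=5):
    """Count mentions case-insensitively and return the n most frequent.

    Each group is reported under the spelling seen most often in the input.
    """
    groups = {}
    for entity in entities:
        cleaned = " ".join(entity.split())  # collapse runs of whitespace
        if not cleaned:
            continue  # skip entries that were only whitespace
        # Track how often each spelling occurs under its lowercase key.
        groups.setdefault(cleaned.lower(), Counter())[cleaned] += 1

    totals = Counter({key: sum(spellings.values()) for key, spellings in groups.items()})
    return [
        (groups[key].most_common(1)[0][0], count)
        for key, count in totals.most_common(n)
    ]


# Hypothetical entity list, standing in for the output of extract_entities().
sample = ["Apple", "apple", "Tim Cook", "Apple", "Cupertino", "Tim  Cook", " "]
print(top_entities(sample, n=2))  # [('Apple', 3), ('Tim Cook', 2)]
```

The result has the same shape as Counter.most_common(), so it could be unpacked with zip(*...) and passed to plt.bar in the same way the script already does.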

[+] 1 user Likes vluzzy's post