Building a Web Scraper with Python and BeautifulSoup

If you’ve ever wanted to extract useful data from websites — like article titles, product prices, or research summaries — you’ve touched the world of web scraping.
Python makes this process remarkably easy with two powerful libraries: Requests and BeautifulSoup.
In this tutorial, we’ll build a simple yet functional scraper that can pull article titles and links from any website that allows scraping.


🧩 Step 1: Setting Up Your Environment

Before you start, make sure you’ve installed the required libraries:

pip install requests beautifulsoup4

These two libraries are all you need. requests handles the network calls to fetch HTML pages, while BeautifulSoup parses that HTML so you can extract the exact data you want.
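To see the division of labor, here's a tiny sketch of BeautifulSoup parsing a hard-coded HTML snippet on its own — no network involved; requests enters the picture in the next step:

```python
from bs4 import BeautifulSoup

# A toy HTML document standing in for a fetched page
html = "<html><body><h1>Hello, scraper!</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())  # Hello, scraper!
```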


⚙️ Step 2: Fetching the Web Page

Our first step is to grab the raw HTML from a given URL. Here’s the function that does it:

import requests

def get_html(url):
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

Adding a User-Agent header makes your request look like a normal browser visit; some sites reject requests that don't send one. The timeout keeps your script from hanging indefinitely if the server never responds, and raise_for_status() turns HTTP error codes (like 404 or 500) into exceptions that the except block catches.
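Since we only want to scrape sites that allow it, it's also good etiquette to respect a site's robots.txt rules. Here's a minimal sketch using the standard library's urllib.robotparser, parsing an example rules string directly so no network call is needed (in practice you'd fetch the site's own /robots.txt):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Check whether a URL may be fetched, given robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt that blocks /private/ for all crawlers
rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "https://example.com/articles"))   # allowed
print(is_allowed(rules, "https://example.com/private/x"))  # blocked
```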


🔍 Step 3: Parsing and Extracting Data

Once we have the HTML, we’ll extract the article titles and links using BeautifulSoup.

from bs4 import BeautifulSoup

def parse_articles(html):
    soup = BeautifulSoup(html, "html.parser")
    articles = []

    for article in soup.find_all("article"):
        title_tag = article.find("h2")
        if not title_tag:
            continue
        link_tag = title_tag.find("a")

        title = link_tag.get_text(strip=True) if link_tag else title_tag.get_text(strip=True)
        link = link_tag["href"] if link_tag and link_tag.has_attr("href") else None

        articles.append({"title": title, "link": link})

    return articles

This function looks for <article> tags and extracts each article’s title and link.
You can easily modify the tag names to match the structure of the website you’re scraping.
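For example, if a site doesn't wrap posts in <article> tags, you can target elements with a CSS selector via soup.select instead (the "h2 a" selector below is just an illustration; inspect the real page to find the right one). This sketch also resolves relative hrefs like /blog/post into absolute URLs with urljoin:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_with_selectors(html, base_url, selector="h2 a"):
    """Extract titles and links with a CSS selector, resolving relative hrefs."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for link_tag in soup.select(selector):
        href = link_tag.get("href")
        articles.append({
            "title": link_tag.get_text(strip=True),
            # Relative links like /blog/post-1 become absolute URLs
            "link": urljoin(base_url, href) if href else None,
        })
    return articles

sample = '<article><h2><a href="/blog/post-1">First Post</a></h2></article>'
print(parse_with_selectors(sample, "https://example.com"))
# [{'title': 'First Post', 'link': 'https://example.com/blog/post-1'}]
```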


🖨️ Step 4: Displaying the Results

Let’s make our scraper print what it finds in a clean, readable format.

def display_results(articles):
    if not articles:
        print("No articles found.")
        return

    print(f"\nFound {len(articles)} articles:\n")
    for i, article in enumerate(articles, 1):
        print(f"{i}. {article['title']}")
        print(f"   {article['link']}\n")

When you run the program, you’ll see a neat list of titles and links, ready to use in research, analysis, or automation tasks.


🚀 Step 5: Putting It All Together

Finally, let’s tie it all up with a main function that defines the target URL and orchestrates the scraping.

def main():
    url = "https://realpython.com/"
    print(f"Scraping articles from: {url}")

    html = get_html(url)
    if not html:
        print("Failed to retrieve HTML content.")
        return

    articles = parse_articles(html)
    display_results(articles)

if __name__ == "__main__":
    main()

Run the program and, provided the target site structures its posts with <article> tags and <h2> headings, you’ll see real article titles printed right in your terminal.


🧠 Final Thoughts

Web scraping is one of the most practical and exciting skills in Python.
Once you’ve mastered the basics, you can take it further — save data into CSV files, schedule scrapes automatically, or feed that data into AI models for training and analysis.
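The CSV step, for instance, takes only a few lines with the standard library's csv module. Here's a minimal sketch that writes the article dictionaries our scraper produces to a file (the articles.csv filename is just a default):

```python
import csv

def save_to_csv(articles, filename="articles.csv"):
    """Write a list of {'title', 'link'} dicts to a CSV file."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link"])
        writer.writeheader()
        writer.writerows(articles)

articles = [
    {"title": "Intro to Scraping", "link": "https://example.com/intro"},
    {"title": "Parsing HTML", "link": "https://example.com/parsing"},
]
save_to_csv(articles)
```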

At ProjectPy.com, we explore exactly this kind of synergy between Python and AI — building small, powerful tools that automate the web, extract insights, and make your projects smarter.

If you enjoyed this tutorial, consider subscribing to our newsletter for weekly Python + AI coding projects!
