Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information—and misinformation—on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.
In this tutorial, you’ll learn how to:
- Parse website data using string methods and regular expressions
- Parse website data using an HTML parser
- Interact with forms and other website components
Note: This tutorial is adapted from the chapter “Interacting With the Web” in Python Basics: A Practical Introduction to Python 3.
The book uses Python’s built-in IDLE editor to create and edit Python files and interact with the Python shell, so you will see occasional references to IDLE throughout this tutorial. However, you should have no problems running the example code from the editor and environment of your choice.
Free Bonus: Click here to get our free Python Cheat Sheet that shows you the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.
Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:
- The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
- Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
Important: Before using your Python skills for web scraping, you should always check your target website’s acceptable use policy to see if accessing the website with automated tools is a violation of its terms of use. Legally, web scraping against the wishes of a website is very much a gray area.
Please be aware that the following techniques may be illegal when used on websites that prohibit web scraping.
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.
Your First Web Scraper
One useful package for web scraping that you can find in Python’s standard library is urllib
, which contains tools for working with URLs. In particular, the urllib.request
module contains a function called urlopen()
that can be used to open a URL within a program.
In IDLE’s interactive window, type the following to import urlopen()
: