Web Scraping with Selenium and Python (2024)

Imagine that you can easily get all the data you need from the Internet without having to manually browse the web or copy and paste. That's the beauty of web scraping. Whether you are a data analyst, market researcher, or developer, web scraping opens up a whole new world of automated data collection.

In this data-driven era, information is power. However, manually extracting information from hundreds or even thousands of web pages is not only time-consuming but also error-prone. Fortunately, web scraping offers an efficient and accurate alternative: it automates the process of extracting the data you need from the Internet, which greatly improves both efficiency and data quality.

Table of Contents

  1. What is Web Scraping?
    • The process of Web scraping usually includes the following steps:
  2. Getting Started with Selenium
    • Preparation
    • Import Libraries
    • Accessing a Page
    • Startup Parameters
    • Example: Running in Headless Mode
    • Locating Page Elements
    • Element Interaction
    • Data Extraction
    • Waiting for Elements to Load
  3. Get Around Anti-Scraping Protections
  4. Conclusion

What is Web Scraping?

Web scraping is a technique for automatically extracting information from web pages by writing programs. It has a wide range of applications, including data analysis, market research, competitive intelligence, and content aggregation. With web scraping, you can collect and consolidate data from a large number of web pages in a short period of time instead of relying on manual work.

The process of Web scraping usually includes the following steps:

  • Send an HTTP request: Programmatically send a request to the target website to get the HTML source code of the web page. Common tools such as Python's requests library make this easy.
  • Parse the HTML content: After obtaining the HTML source code, parse it so that the required data can be extracted. HTML parsing libraries such as BeautifulSoup or lxml can process the HTML structure.
  • Extract the data: Based on the parsed HTML structure, locate and extract specific content, such as article titles, prices, or image links. Common methods include XPath and CSS selectors.
  • Store the data: Save the extracted data to a suitable storage medium, such as a database, a CSV file, or a JSON file, for subsequent analysis and processing.
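To make the steps above concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions, not the only way to do it: the HTML snippet and its class names are invented for the example, the network request in step 1 is replaced by an inline string, and html.parser stands in for BeautifulSoup or lxml.

```python
import csv
import io
from html.parser import HTMLParser

# Stand-in for step 1: in a real scraper this string would come from an HTTP
# response body. The markup and class names below are invented for the example.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Blue Shirt</span><span class="price">$12.99</span></li>
  <li class="product"><span class="name">Red Scarf</span><span class="price">$7.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Steps 2 and 3: walk the HTML and collect name/price pairs."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished rows, one dict per product
        self._field = None   # field the next text node belongs to, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.products[-1][self._field] = data.strip()
            self._field = None


def parse_products(html):
    """Feed the HTML to the parser and return the collected rows."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products


def to_csv(rows):
    """Step 4: serialise the rows as CSV text (write it to a file to persist it)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


products = parse_products(SAMPLE_HTML)
print(to_csv(products))
```

In a real project the inline string would be replaced by the body of an HTTP response, and a dedicated parsing library would usually be more convenient than a hand-written HTMLParser subclass.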

Tools such as Selenium go a step further: they drive a real browser and simulate a user's actions, which helps get past some anti-crawler mechanisms and makes it possible to complete web scraping tasks that a plain HTTP client cannot handle.

Struggling with the repeated failure to completely solve the irritating captcha?

Discover seamless automatic captcha solving with Capsolver AI-powered Auto Web Unblock technology!

Claim your bonus code for top captcha solutions at CapSolver: WEBS. After redeeming it, you get an extra 5% bonus on every recharge, with no limit.

Getting Started with Selenium

Let's take ScrapingClub as an example and use Selenium to complete exercise one.

Preparation

First, you need to ensure that Python is installed on your local machine. You can check the Python version by entering the following command in your terminal:

python --version

Make sure you are running Python 3. If Python is not installed or the version is too old, download the latest version from the official Python website. Next, install the selenium library with the following command:

pip install selenium
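If you want to confirm the setup from Python itself, a small check like the following can help. This is only a sketch: installed_version is a helper name made up for this example, and it relies on importlib.metadata from the standard library.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python:", ".".join(str(part) for part in sys.version_info[:3]))
print("selenium:", installed_version("selenium") or "not installed")
```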

Import Libraries

from selenium import webdriver

Accessing a Page

Using Selenium to drive Google Chrome to access a page is not complicated. After initializing a ChromeOptions object and creating the driver, you can use the get() method to access the target page:

import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
time.sleep(5)
driver.quit()
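One thing the short script above does not handle is failure: if an exception is raised between creating the driver and calling quit(), the browser process can stay open. A possible way to guard against that is a small context manager. The names browser_session and fetch_title are invented for this sketch, and create_driver is assumed to be any callable that returns an object with get(), title, and quit(), such as webdriver.Chrome.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(create_driver):
    """Yield a driver and always call quit() on it, even if scraping fails."""
    driver = create_driver()
    try:
        yield driver
    finally:
        driver.quit()


def fetch_title(create_driver, url):
    """Open a page and return its title, closing the browser afterwards."""
    with browser_session(create_driver) as driver:
        driver.get(url)
        return driver.title
```

With Selenium installed, calling fetch_title(webdriver.Chrome, 'https://scrapingclub.com/') would open the page, read its title, and close Chrome whether or not an error occurs along the way.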

Startup Parameters

Chrome Options can add many startup parameters that help improve the efficiency of data retrieval. You can view the complete list of parameters on the official website: List of Chromium Command Line Switches. Some commonly used parameters are listed in the table below:

Parameter              Purpose
--user-agent=""        Set the User-Agent in the request header
--window-size=xxx,xxx  Set the browser window size
--start-maximized      Start with the browser window maximized
--headless             Run in headless mode
--incognito            Run in incognito mode
--disable-gpu          Disable GPU hardware acceleration

Example: Running in Headless Mode

import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
time.sleep(5)
driver.quit()
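Several switches can be combined in the same way. The sketch below keeps the list of switches separate from the Selenium call so that the configuration is easy to read and reuse. The function names, the default window size, and the example User-Agent string are assumptions made for this illustration; only add_argument() and the switches from the table come from the tutorial.

```python
def build_chrome_arguments(headless=True, window_size=(1920, 1080),
                           user_agent=None, incognito=False):
    """Return a list of Chrome command-line switches for the given settings."""
    arguments = []
    if headless:
        arguments.append("--headless")
    if window_size is not None:
        width, height = window_size
        arguments.append(f"--window-size={width},{height}")
    if user_agent:
        arguments.append(f"--user-agent={user_agent}")
    if incognito:
        arguments.append("--incognito")
    return arguments


def make_chrome_options(arguments):
    """Copy the switches into a ChromeOptions object (requires selenium)."""
    # Imported here so that build_chrome_arguments works without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in arguments:
        options.add_argument(argument)
    return options


print(build_chrome_arguments(user_agent="my-scraper/1.0"))
```

The resulting object can then be passed to webdriver.Chrome(options=...) exactly as in the headless example.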

Locating Page Elements

A necessary step in scraping data is to find the corresponding HTML elements in the DOM. Selenium provides two main methods to locate elements on the page:

  • find_element: Finds a single element that meets the criteria.
  • find_elements: Finds all elements that meet the criteria.

Both methods support eight different ways to locate HTML elements:

Method                Meaning                           HTML Example                           Selenium Example
By.ID                 Locate by element ID              <form id="loginForm">...</form>        driver.find_element(By.ID, 'loginForm')
By.NAME               Locate by element name            <input name="username" type="text" />  driver.find_element(By.NAME, 'username')
By.XPATH              Locate by XPath                   <p><code>My code</code></p>            driver.find_element(By.XPATH, "//p/code")
By.LINK_TEXT          Locate hyperlink by text          <a href="continue.html">Continue</a>   driver.find_element(By.LINK_TEXT, 'Continue')
By.PARTIAL_LINK_TEXT  Locate hyperlink by partial text  <a href="continue.html">Continue</a>   driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
By.TAG_NAME           Locate by tag name                <h1>Welcome</h1>                       driver.find_element(By.TAG_NAME, 'h1')
By.CLASS_NAME         Locate by class name              <p class="content">Welcome</p>         driver.find_element(By.CLASS_NAME, 'content')
By.CSS_SELECTOR       Locate by CSS selector            <p class="content">Welcome</p>         driver.find_element(By.CSS_SELECTOR, 'p.content')
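The two methods also differ when nothing matches: find_element raises NoSuchElementException, while find_elements simply returns an empty list. The helpers below are a small sketch built on that difference. find_optional and texts_of are names invented for this example, and driver is assumed to be a Selenium WebDriver (or any object with the same find_elements method).

```python
def find_optional(driver, by, value):
    """Return the first matching element, or None when nothing matches.

    find_elements returns an empty list when there is no match, whereas
    find_element raises NoSuchElementException.
    """
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


def texts_of(driver, by, value):
    """Return the visible text of every matching element."""
    return [element.text for element in driver.find_elements(by, value)]
```

For example, find_optional(driver, By.CSS_SELECTOR, '.card-title') would return the element if it is present and None otherwise, which avoids wrapping optional fields in try/except blocks.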

Let's return to the ScrapingClub page and write the following code to find the "Get Started" button element for exercise one:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
get_started_button = driver.find_element(By.XPATH, "//div[@class='w-full rounded border'][1]/div[3]")
time.sleep(5)
driver.quit()

Element Interaction

Once we have found the "Get Started" button element, we need to click the button to enter the next page. This involves element interaction. Selenium provides several methods to simulate actions:

  • click(): Click the element;
  • clear(): Clear the content of the element;
  • send_keys(*value: str): Simulate keyboard input;
  • submit(): Submit a form;
  • screenshot(filename): Save a screenshot of the element to a file.

For more interactions, refer to the official documentation: WebDriver API. Let's continue to improve the ScrapingClub exercise code by adding click interaction:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
get_started_button = driver.find_element(By.XPATH, "//div[@class='w-full rounded border'][1]/div[3]")
get_started_button.click()
time.sleep(5)
driver.quit()
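The other interaction methods listed above follow the same pattern as click(). As an illustration, the sketch below combines clear(), send_keys(), and submit() to fill in a text field and submit its form. The ScrapingClub exercise does not need this, so the helper names and any locator you pass in are hypothetical.

```python
def fill_field(element, text):
    """Replace whatever is in an input element with the given text."""
    element.clear()          # remove any pre-filled value
    element.send_keys(text)  # type the new value


def fill_and_submit(driver, by, value, text):
    """Locate an input, type into it, and submit the form it belongs to."""
    field = driver.find_element(by, value)
    fill_field(field, text)
    field.submit()
    return field
```

A call such as fill_and_submit(driver, By.NAME, 'q', 'jersey') would work on a page whose search box is named q; adjust the locator to the page you are scraping.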

Data Extraction

When we arrive at the first exercise page, we need to collect the product's image, name, price, and description information. We can use different methods to find these elements and extract them:

from selenium import webdriver
from selenium.webdriver.common.by import By

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
get_started_button = driver.find_element(By.XPATH, "//div[@class='w-full rounded border'][1]/div[3]")
get_started_button.click()

product_name = driver.find_element(By.CLASS_NAME, 'card-title').text
product_image = driver.find_element(By.CSS_SELECTOR, '.card-img-top').get_attribute('src')
product_price = driver.find_element(By.XPATH, '//h4').text
product_description = driver.find_element(By.CSS_SELECTOR, '.card-description').text

print(f'Product name: {product_name}')
print(f'Product image: {product_image}')
print(f'Product price: {product_price}')
print(f'Product description: {product_description}')
driver.quit()

The code will output the following content:

Product name: Long-sleeved Jersey Top
Product image: https://scrapingclub.com/static/img/73840-Q.jpg
Product price: $12.99
Product description: CONSCIOUS. Fitted, long-sleeved top in stretch jersey made from organic cotton with a round neckline. 92% cotton, 3% spandex, 3% rayon, 2% polyester.
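The values Selenium returns are plain strings, so a little cleanup is usually needed before storing them. The following sketch turns the price into a number and bundles the fields into a JSON-ready record. parse_price and build_record are names chosen for this example, the price format is assumed to be a dollar sign followed by digits, and the description is shortened here for brevity.

```python
import json
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '$12.99' into a Decimal."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return Decimal(cleaned)


def build_record(name, image, price_text, description):
    """Bundle the scraped strings into a dictionary with a numeric price."""
    return {
        "name": name.strip(),
        "image": image,
        "price": parse_price(price_text),
        "description": description.strip(),
    }


record = build_record(
    "Long-sleeved Jersey Top",
    "https://scrapingclub.com/static/img/73840-Q.jpg",
    "$12.99",
    "CONSCIOUS. Fitted, long-sleeved top in stretch jersey.",
)
# Decimal is not JSON-serialisable by default, so convert it with str().
print(json.dumps(record, default=str, indent=2))
```

Decimal is used instead of float so that prices are not affected by binary rounding; other currencies or formats would need a more careful parser.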

Waiting for Elements to Load

Sometimes, due to network latency or other reasons, elements have not finished loading when Selenium tries to locate them, which causes data collection to fail. To solve this problem, we can tell Selenium to wait until a certain element has loaded before proceeding with data extraction. Here is an example:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://scrapingclub.com/')
get_started_button = driver.find_element(By.XPATH, "//div[@class='w-full rounded border'][1]/div[3]")
get_started_button.click()

# waiting for the product image elements to load completely
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.card-img-top')))

product_name = driver.find_element(By.CLASS_NAME, 'card-title').text
product_image = driver.find_element(By.CSS_SELECTOR, '.card-img-top').get_attribute('src')
product_price = driver.find_element(By.XPATH, '//h4').text
product_description = driver.find_element(By.CSS_SELECTOR, '.card-description').text

print(f'Product name: {product_name}')
print(f'Product image: {product_image}')
print(f'Product price: {product_price}')
print(f'Product description: {product_description}')
driver.quit()
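Conceptually, an explicit wait is a polling loop: the condition is checked repeatedly until it returns something truthy or the timeout expires. The sketch below reproduces that idea with the standard library only, to show what is happening behind wait.until(). It is a simplified illustration, not Selenium's actual implementation, and wait_until and the fake condition are names invented for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Demonstration with a fake condition that succeeds on the third check.
attempts = []


def ready_on_third_try():
    attempts.append(1)
    return "loaded" if len(attempts) >= 3 else None


print(wait_until(ready_on_third_try, timeout=2.0, poll_interval=0.01))
```

In real scraping code, prefer WebDriverWait with expected_conditions as shown above. Unlike a fixed time.sleep(), this kind of wait returns as soon as the element is available.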

Get Around Anti-Scraping Protections

The ScrapingClub exercise is easy to complete. In real data collection scenarios, however, obtaining data is harder, because some websites employ anti-scraping techniques that can detect your script as a bot and block it. The most common obstacle is a captcha challenge, such as FunCaptcha, DataDome, reCAPTCHA, hCaptcha, or GeeTest.

Solving these captcha challenges requires extensive experience in machine learning, reverse engineering, and browser fingerprint countermeasures, which can take a lot of time. Fortunately, now you don't have to do all this work yourself. CapSolver provides a complete solution to help you easily bypass all challenges. CapSolver offers browser extensions that can automatically solve captcha challenges while using Selenium to collect data. Additionally, they provide API methods to solve captchas and obtain tokens, all of which can be completed in just a few seconds. Refer to the CapSolver Documentation for more information.

Conclusion

From extracting product details to navigating through complex anti-scraping measures, web scraping with Selenium opens doors to a vast realm of automated data collection. As we navigate the web's ever-evolving landscape, tools like CapSolver pave the way for smoother data extraction, making once-formidable challenges a thing of the past. So, whether you're a data enthusiast or a seasoned developer, harnessing these technologies not only enhances efficiency but also unlocks a world where data-driven insights are just a scrape away.

FAQs

Is Selenium good for web scraping?

Yes. Selenium drives a real browser and executes JavaScript, which makes it an excellent choice for scraping data from dynamic, JavaScript-heavy websites, often called Single-Page Applications (SPAs).

Is web scraping in Python hard?

It can be challenging if you have no coding experience: Python has a real learning curve, and getting comfortable with it takes time. With a clear, step-by-step guide on how to scrape data with Python, though, you can pick it up fairly quickly.

How to speed up web scraping with Python and Selenium?

By leveraging multiprocessing, you can significantly speed up your web scraping tasks, especially when dealing with a large number of URLs. The scraping workload is distributed among multiple processes, allowing for parallel execution and efficient utilization of system resources.
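As a rough sketch of that idea, the code below splits a list of URLs into chunks and hands each chunk to a separate process. Everything here is an assumption made for illustration: the function names are invented, and scrape_chunk is a placeholder where a real worker would start its own browser and visit each URL.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(items, chunk_count):
    """Distribute items round-robin into at most chunk_count non-empty lists."""
    chunks = [[] for _ in range(chunk_count)]
    for index, item in enumerate(items):
        chunks[index % chunk_count].append(item)
    return [chunk for chunk in chunks if chunk]


def scrape_chunk(urls):
    """Placeholder worker: a real one would start one browser and visit each URL."""
    return [f"scraped {url}" for url in urls]


def scrape_in_parallel(urls, workers=4):
    """Run scrape_chunk on each chunk in a separate process and merge the results."""
    chunks = split_into_chunks(urls, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(scrape_chunk, chunks))
    return [row for chunk_rows in results for row in chunk_rows]
```

When running this as a script, call scrape_in_parallel from inside an if __name__ == "__main__": block, which some platforms require for multiprocessing. Each browser instance uses a noticeable amount of memory, so the number of workers should stay modest.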

Is Selenium better than Beautiful Soup for web scraping?

Selenium is not as user-friendly as Beautiful Soup. Beautiful Soup provides an easy-to-use, straightforward API for beginners. However, because Selenium necessitates an understanding of programming concepts like web drivers and browser automation, it might be more difficult to set up and utilize.

Is Selenium web scraping legal?

There are no specific laws that ban web scraping, and many companies use it legally to gather valuable data with different web scraping tools. However, certain situations can make web scraping illegal. One example is a terms-of-service violation: logging into a website and scraping data behind the login can be a problem.

What is the most efficient language for web scraping?

Python. Our first choice for web scraping is Python, which is arguably the most popular programming language. It is versatile and easy to learn, which makes it a top choice for scraping work.

How much Python is required for web scraping?

Generally, it takes about one to six months to learn the fundamentals of Python. That means being able to work with variables, objects and data structures, flow control (conditions and loops), file I/O, functions, classes, and basic web scraping tools such as the requests library.

What are the disadvantages of web scraping in Python?

Using Python for web scraping can be time-consuming. Writing scraping scripts can be challenging, because you have to design and implement code that accesses data from websites and stores it properly.

How long does it take to learn Python web scraping?

Depending on your level of experience with programming and web development, it can take anywhere from a few weeks to several months to become proficient in web scraping.

What's faster than Selenium?

Playwright's execution speed is faster than Selenium's. The framework also supports auto-wait and performs relevant checks for elements. You can generate selectors by inspecting web pages, and generate a scenario by recording your actions. Playwright supports simultaneous execution and can also block unnecessary resource requests.

Should I learn Selenium or Scrapy?

Selenium can be used for light crawling, especially if the target website is JavaScript-rendered. However, you want to choose Scrapy over Selenium any time for simple to complex crawling due to its efficient multi-page crawling feature.

Can Selenium be used for web crawling?

Yes. Selenium's power is not restricted to testing web apps. It can also be used to crawl or scrape websites, in particular ones that don't provide an API and that load content lazily using JavaScript.

Article information

Author: Msgr. Refugio Daniel
