Web scraping has grown rapidly as a method for harvesting online data. Recent surveys indicate that nearly 90% of companies leverage web scraping to collect data for business intelligence, with Python being the most popular language for the job.
Selenium opens up additional possibilities thanks to its real browser automation capabilities, helping you handle obstacles such as CAPTCHAs and JavaScript rendering that trip up simpler HTTP-based scraping tools. This technical guide demonstrates advanced strategies and lesser-known use cases to help streamline your web scraping projects with Selenium in Python.
The Growth of Web Scraping
Let's briefly highlight the key web scraping and Selenium adoption trends driving this growth:
- Market spend on web scraping solutions predicted to reach $1.6 billion by 2026
- 89% of companies in a Spiceworks survey are scraping data for business intelligence
- Python dominates as the top language for web scraping by a large margin
- Selenium usage statistics show 70% market share for browser automation
Python leads for web scraping (source: The Next Web)
This confluence of trends around Python and Selenium underlines why mastering the duo can prove highly useful for web scraping practitioners.
Scraping Checkboxes, Radio Buttons and More
Interacting with HTML forms and controls via Selenium requires appreciating how browsers handle them. Let's dig into techniques for checking boxes, selecting radio options and submitting forms.
Checkboxes vs Click()
Checking boxes with a plain .click() often fails when the element is obscured, animated or outside the viewport; the exact behavior is browser dependent.
Instead, move the mouse cursor over the checkbox with ActionChains first:
```python
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Initialize browser
driver = webdriver.Chrome()

# Find checkbox
checkbox = driver.find_element(By.ID, 'check1')

# Move to the checkbox, then check it
actions = ActionChains(driver)
actions.move_to_element(checkbox)
actions.click(checkbox)
actions.perform()
```
Next, let's see how to check multiple checkboxes – say check1, check3 and check5 out of 5 total boxes:
```python
# Flags for which of the 5 boxes to check
checked = [False] * 5
checked[0] = True
checked[2] = True
checked[4] = True

# Check the required boxes
for i in range(len(checked)):
    if checked[i]:
        checkbox = driver.find_element(By.ID, f'check{i+1}')
        ActionChains(driver).move_to_element(checkbox).click().perform()
```
This iterates over the checked boolean list and ticks each box flagged True. The key point is that we avoid blindly calling .click() and instead move to each checkbox first.
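One refinement: WebElement exposes an is_selected() method, so you can skip boxes that are already ticked instead of toggling them back off. A minimal sketch:

```python
# Only click boxes that are not already selected,
# so re-running the loop never unchecks anything
for i in range(len(checked)):
    if checked[i]:
        checkbox = driver.find_element(By.ID, f'check{i+1}')
        if not checkbox.is_selected():
            ActionChains(driver).move_to_element(checkbox).click().perform()
```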
Radio Buttons
Radio buttons only allow one option to be selected out of many. Here's an example to pick a specific radio choice:
```python
# Get all radio choices sharing the same name
choices = driver.find_elements(By.NAME, 'choices')

# Select our choice (the third option)
our_choice = choices[2]
ActionChains(driver).move_to_element(our_choice).click().perform()
```
Tip: You can also use XPath, IDs or other locator strategies besides name.
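For example, the same radio button might be reached several ways; the selectors below are hypothetical stand-ins:

```python
# Equivalent lookups via different locator strategies (selectors are hypothetical)
by_name = driver.find_elements(By.NAME, 'choices')[2]
by_xpath = driver.find_element(By.XPATH, "//input[@type='radio' and @value='option3']")
by_css = driver.find_element(By.CSS_SELECTOR, "input[type='radio']#choice3")
```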
Submitting Forms
We may need to fill out and submit forms during scraping. Use the send_keys() method to populate fields:
```python
first_name = driver.find_element(By.ID, 'first-name')
first_name.clear()
first_name.send_keys('John')

email = driver.find_element(By.ID, 'email')
email.send_keys('jsmith@example.com')  # placeholder address

# Submit form
driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
```
This clears existing values, inputs our data and submits the form.
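As an alternative to hunting down the submit button, Selenium's WebElement also provides a submit() method, which submits the enclosing form from any field inside it:

```python
# Submit the enclosing form directly from one of its fields,
# without locating the submit button
email = driver.find_element(By.ID, 'email')
email.submit()
```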
Handling iframes
Websites can load content in iframes, or inline frames, embedded in the page. Selenium will fail to see elements inside an iframe until we switch context to the frame:
```python
# Store iframe web element
frame = driver.find_element(By.TAG_NAME, 'iframe')

# Switch to iframe
driver.switch_to.frame(frame)

# Now we can access elements inside the frame
element = driver.find_element(By.ID, 'elementInFrame')
```
Note: We may deal with multiple nested iframes. In that case, switch into the parent iframe first before accessing the child frames, and call driver.switch_to.default_content() to return to the top-level page.
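A minimal sketch of that parent-then-child pattern, assuming a hypothetical child frame with ID childFrame:

```python
# Enter the parent iframe first
parent = driver.find_element(By.TAG_NAME, 'iframe')
driver.switch_to.frame(parent)

# Then enter the child iframe nested inside it (hypothetical ID)
child = driver.find_element(By.ID, 'childFrame')
driver.switch_to.frame(child)

# ... interact with elements in the child frame ...

# Return all the way back to the top-level page
driver.switch_to.default_content()
```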
Let's automate iframe handling:
```python
# Function to switch frames by index, name, or web element
def switch_frame(loc):
    try:
        driver.switch_to.frame(loc)
        print(f"Switched to frame {loc}")
    except Exception:
        print(f"Can't switch to frame {loc}")

# Call function
switch_frame(1)                                      # Index
switch_frame('frameName')                            # Name
switch_frame(driver.find_element(By.ID, 'frameID'))  # Web element
```
This function handles all common iframe scenarios by taking locators as arguments.
Switching Tabs
Modern sites dynamically open new tabs on actions like clicks or form submits, so mastering tab handling techniques is essential:
Current Tab vs All Tabs Method
```python
from selenium.webdriver.common.keys import Keys

# Open link in a new tab (Ctrl+Enter keeps the current tab focused)
link = driver.find_element(By.LINK_TEXT, 'New tab link')
link.send_keys(Keys.CONTROL + Keys.RETURN)

# Store handles
main_tab = driver.current_window_handle
all_tabs = driver.window_handles

# Switch to the newly opened tab
for tab in all_tabs:
    if tab != main_tab:
        driver.switch_to.window(tab)
        # Tab actions here
        break

driver.close()                     # Close new tab
driver.switch_to.window(main_tab)  # Back to main
```
This differentiates the original main tab from newly opened ones. We access the additional tabs, perform our tasks and return cleanly to the first tab.
Tab Index Order Method
We can also track tabs in order of opening:
```python
original_tab = driver.current_window_handle

# Open new tabs via JavaScript
driver.execute_script("window.open('');")
second_tab = driver.window_handles[1]

driver.execute_script("window.open('');")
third_tab = driver.window_handles[2]

# Switch between tabs
driver.switch_to.window(original_tab)
driver.switch_to.window(third_tab)
# etc...
```
Think of window handles as an ordered list of tab references: new tabs are appended as they open, and we index into the list to switch between them.
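On Selenium 4 and later there is also a one-step way to open and focus a fresh tab:

```python
# Selenium 4+: open a new tab and switch to it immediately
driver.switch_to.new_window('tab')
```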
Warning: Balance tab usage to avoid overloading servers while scraping. Use delays, proxies and services like ScrapeStorm to prevent blocks.
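A simple way to pace your scraper is a randomized delay between actions; the bounds below are arbitrary:

```python
import random
import time

def polite_pause(min_s=2.0, max_s=6.0):
    # Sleep for a random interval to mimic human pacing (bounds are arbitrary)
    time.sleep(random.uniform(min_s, max_s))

# Call between page loads or tab switches
polite_pause()
```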
Now let's tackle some advanced real-world web scraping challenges.
Logging In to Sites
Many websites require logging in before accessing content. This involves multiple steps:
- Navigate to login page
- Locate username, password fields
- Input credentials with send_keys()
- Handle 2FA/CAPTCHA if needed
- Click submit button
- Confirm successful login
Let's automate login for a fictional site:
```python
# Import modules
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize Chrome browser and navigate to page
driver = webdriver.Chrome()
driver.get("http://www.example.com/login")

# Find user/pass fields
user_field = driver.find_element(By.ID, "userName")
pass_field = driver.find_element(By.ID, "password")

# Send login details
user_field.send_keys("jsmith")
pass_field.send_keys("377*EhD")

# Handle CAPTCHA here

# Submit form
driver.find_element(By.ID, "loginBtn").click()

# Verify successful login
time.sleep(5)
assert "Welcome" in driver.page_source
print("Logged in successfully!")

# Continue scraping workflow...
driver.quit()
```
This logs into our site, pauses to check whether the login worked, and continues if successful. Tweak the selectors and credentials for your target site.
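Rather than a fixed time.sleep(), an explicit wait is more robust: it polls until a post-login element appears or a timeout expires. A sketch, assuming a welcome banner with a hypothetical ID:

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for a post-login element (hypothetical ID)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'welcomeBanner'))
)
print("Logged in successfully!")
```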
Logging in Across Tabs
We can also perform login across multiple tabs:
```python
# Login tab
driver.execute_script('window.open("http://example.com/login", "_blank");')
driver.switch_to.window(driver.window_handles[1])

# Login page actions...

# Main tab
driver.switch_to.window(driver.window_handles[0])
```
This isolates the login flow in a separate tab while the original tab retains session access, since both tabs share the same cookies. Useful for sites that disallow multiple logins.
Handling CAPTCHAs and Tests
Sites may generate tests like CAPTCHAs to block bots. Here are some solutions:
- Use the Audio mode to get CAPTCHA as a sequence of numbers/letters
- Employ services like 2Captcha and Anti-Captcha to solve them
- Rotate IP addresses using proxy services to avoid triggering tests (see the sketch after this list)
- For A/B testing, locate and click/submit the desired version
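Pointing Chrome at a proxy takes one launch flag; rotating means launching each session with a different address. A minimal sketch with hypothetical proxy endpoints:

```python
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Hypothetical proxy pool; substitute your provider's endpoints
proxies = ['203.0.113.10:8080', '203.0.113.11:8080', '203.0.113.12:8080']

# Pick one proxy at random for this browser session
options = Options()
options.add_argument(f'--proxy-server=http://{random.choice(proxies)}')

driver = webdriver.Chrome(options=options)
```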
Let's automate solving a common CAPTCHA via its audio mode:
```python
import speech_recognition as sr

# Switch to the CAPTCHA frame
frames = driver.find_elements(By.TAG_NAME, 'iframe')
driver.switch_to.frame(frames[0])

# Click the audio icon to play the challenge aloud
driver.find_element(By.CSS_SELECTOR, '.audio').click()

# Initialize speech engine
rec = sr.Recognizer()

# Record the audio clip (the microphone must be able to hear the playback)
with sr.Microphone() as source:
    audio = rec.listen(source)

text = rec.recognize_google(audio)
print(text)  # Recognized code

# Enter CAPTCHA solution
driver.find_element(By.ID, 'code').send_keys(text)

# Press submit button
driver.find_element(By.ID, 'submitCaptcha').click()
```
We switch context, request the audio version of the CAPTCHA, run speech recognition to extract the code and submit it automatically. This can get you past simple audio CAPTCHAs, though success rates vary; for anything at scale, dedicated solver services are more dependable.
Scraping JavaScript-Loaded Sites
Another advantage of browser automation engines like Selenium is scraping pages that render their content with JavaScript.
Let's scrape a site using dynamic infinite scroll to load content.
Objective: Extract all dog images
Approach:
- Scroll to end of page
- Let JavaScript trigger and load images
- Extract image elements after loading finishes
- Repeat
Implementation:
```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://example.com/infiniteDogs'
driver = webdriver.Chrome()
driver.get(url)

# Scroll to the bottom and record the page height
len_of_page = driver.execute_script(
    "window.scrollTo(0, document.body.scrollHeight);"
    "return document.body.scrollHeight;"
)

match = False
while not match:
    last_count = len_of_page
    time.sleep(3)  # Give JavaScript time to load more content
    len_of_page = driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);"
        "return document.body.scrollHeight;"
    )
    if last_count == len_of_page:
        match = True  # Height stopped growing: everything is loaded

# Now extract images after the full load
images = driver.find_elements(By.TAG_NAME, 'img')
for image in images:
    print(image.get_attribute('src'))  # Image URLs/paths
```
This scrolls repeatedly, comparing page heights until the height stops growing, and only then extracts elements. The key difference from plain scraping is accounting for JavaScript-driven loading first.
There are many other combinations possible around AJAX-based sites, infinite scrolling, filtering and more.
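For example, many AJAX-driven listings hide content behind a "Load more" button instead of infinite scroll; you can click it until it disappears. A sketch with a hypothetical button selector:

```python
import time
from selenium.common.exceptions import NoSuchElementException

# Keep clicking a hypothetical "Load more" button until it is gone
while True:
    try:
        button = driver.find_element(By.CSS_SELECTOR, 'button.load-more')
    except NoSuchElementException:
        break  # No more content to load
    button.click()
    time.sleep(2)  # Let the AJAX request finish
```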
Conclusion
This guide should equip you to harness the true power of Selenium for robust web scraping using Python. The key is to replicate an end user closely while accounting for dynamic JavaScript behavior.
Some parting tips:
- Employ proxy rotation, multiple browser profiles and tools like ScrapeStorm to avoid blocks
- Consider commercial scraping platforms like ParseHub if needed
- Check legal compliance for your jurisdiction
I hope these techniques help you effectively gather and analyze new datasets. Let me know if you have any other questions!