Python is the go-to language for web scraping, and Amazon is one of the most frequently scraped sites on the web. In this hands-on tutorial, you'll learn how to extract Amazon product data using Python, from a simple first scraper to the building blocks of a production-ready extraction pipeline.
What You'll Learn
- Setting up your scraping environment
- Simple HTTP scraping with requests + BeautifulSoup
- Handling JavaScript-rendered content with Playwright
- Rotating proxies to avoid IP bans
- Parsing all key product fields (title, price, rating, reviews, BSR)
- Storing data as JSON or CSV
- Common errors and how to fix them
Prerequisites
- Python 3.9+
- Basic Python knowledge
- pip installed
Step 1 — Install Dependencies
pip install requests beautifulsoup4 lxml playwright pandas
playwright install chromium
Step 2 — Simple HTTP Scraper (Small Scale)
For low-volume scraping (under a few hundred requests), a basic requests + BeautifulSoup scraper works:
import requests
from bs4 import BeautifulSoup
import json
import time
import random

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    # 'br' is left out: requests only decodes Brotli responses when the
    # optional brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'DNT': '1',
    'Connection': 'keep-alive',
}


def scrape_product(asin: str, marketplace: str = 'amazon.com') -> dict:
    """Fetch one product page and return its basic fields ({} on failure)."""
    url = f'https://www.{marketplace}/dp/{asin}'
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f'Request failed for {asin}: {e}')
        return {}

    soup = BeautifulSoup(response.content, 'lxml')

    # Parse fields (each lookup returns None if the element is missing)
    title_el = soup.find('span', {'id': 'productTitle'})
    price_el = soup.find('span', {'class': 'a-price-whole'})
    rating_el = soup.find('span', {'class': 'a-icon-alt'})
    review_el = soup.find('span', {'id': 'acrCustomerReviewText'})

    return {
        'asin': asin,
        'url': url,
        'title': title_el.text.strip() if title_el else None,
        'price': price_el.text.strip() if price_el else None,
        'rating': rating_el.text.strip() if rating_el else None,
        'reviews': review_el.text.strip() if review_el else None,
    }


# Usage
asins = ['B09G3HRMVB', 'B08N5WRWNW', 'B07XJ8C8F5']
results = []
for asin in asins:
    data = scrape_product(asin)
    if data:  # skip failed requests so empty dicts don't end up in the output
        results.append(data)
    print(f'Scraped: {data.get("title", "Failed")}')
    time.sleep(random.uniform(2, 5))  # Random delay between requests

# Save to JSON
with open('amazon_products.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2)
Note: This basic approach works for testing, but Amazon blocks it heavily at scale. You'll see CAPTCHA pages or empty responses after ~50 requests without proxy rotation.
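The CAPTCHA pages and empty responses mentioned in the note often come back with a 200 status, so it helps to check the body before parsing it. Here is a minimal sketch of such a check. The marker strings, the BLOCK_MARKERS name, and the looks_blocked helper are illustrative assumptions, not a documented Amazon contract, so adjust them to what you actually see in blocked responses:

```python
# Marker strings are assumptions for illustration only; replace them with
# text you actually observe on blocked pages.
BLOCK_MARKERS = (
    'captcha',
    'enter the characters you see below',
    'automated access',
)


def looks_blocked(html: str) -> bool:
    """Heuristically decide whether a response body is a block or CAPTCHA page."""
    if not html or not html.strip():
        return True  # empty body: treat as blocked
    lowered = html.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return True
    # A product page is expected to contain the productTitle element.
    return 'producttitle' not in lowered
```

You could call looks_blocked(response.text) right after the request and retry later (or with another IP) when it returns True. It is a heuristic, so expect the occasional false positive.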
Step 3 — Handling JavaScript Content with Playwright
Many Amazon pages load pricing and availability via JavaScript after page load. For these, you need a real browser:
from playwright.sync_api import sync_playwright
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError


def scrape_with_playwright(asin: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=['--no-sandbox', '--disable-setuid-sandbox']
        )
        context = browser.new_context(
            user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/124.0.0.0 Safari/537.36',
            viewport={'width': 1920, 'height': 1080},
            locale='en-US',
        )
        page = context.new_page()

        # Block unnecessary resources for speed
        page.route('**/*.{png,jpg,gif,svg,woff,woff2}',
                   lambda route: route.abort())

        page.goto(f'https://www.amazon.com/dp/{asin}',
                  wait_until='domcontentloaded')

        # Wait for price element
        try:
            page.wait_for_selector('.a-price-whole', timeout=5000)
        except PlaywrightTimeoutError:
            pass  # Price might not exist

        title = page.query_selector('#productTitle')
        price = page.query_selector('.a-price-whole')
        rating = page.query_selector('.a-icon-alt')

        result = {
            'asin': asin,
            'title': title.inner_text().strip() if title else None,
            'price': price.inner_text().strip() if price else None,
            'rating': rating.inner_text().strip() if rating else None,
        }
        browser.close()
        return result
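The glob passed to page.route only matches URLs that end in one of the listed extensions, so assets served from other URL shapes still load. One possible alternative is to decide by resource type, which Playwright exposes as route.request.resource_type. The sketch below is hedged: BLOCKED_TYPES, should_block, and block_heavy_resources are hypothetical names, and the set of blocked types is an assumption to tune per project:

```python
# Resource types to skip. This set is an assumption; adjust it as needed.
BLOCKED_TYPES = {'image', 'media', 'font'}


def should_block(resource_type: str) -> bool:
    """Return True for resource types that are not needed for data extraction."""
    return resource_type in BLOCKED_TYPES


def block_heavy_resources(route) -> None:
    """Playwright route handler: abort heavy resources, let the rest through."""
    if should_block(route.request.resource_type):
        route.abort()
    else:
        route.continue_()
```

To try it, register the handler with page.route('**/*', block_heavy_resources) in place of the extension-based route.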
Step 4 — Proxy Rotation (Essential for Scale)
Without proxy rotation, Amazon typically starts blocking you after roughly 50–100 requests. Here's a simple proxy rotator:
from itertools import cycle
from typing import Optional

import requests

# Residential proxies work best (datacenter proxies get blocked faster)
PROXIES = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]
proxy_pool = cycle(PROXIES)


# Optional[...] instead of "X | None" keeps this compatible with Python 3.9
def get_with_proxy(url: str, retries: int = 3) -> Optional[requests.Response]:
    """Try the URL through successive proxies; return None if all attempts fail."""
    for attempt in range(retries):
        proxy = next(proxy_pool)
        try:
            response = requests.get(
                url,
                headers=HEADERS,  # the HEADERS dict defined in Step 2
                proxies={'http': proxy, 'https': proxy},
                timeout=15
            )
            if response.status_code == 200:
                return response
            elif response.status_code == 503:
                print(f'Got CAPTCHA on attempt {attempt + 1}, rotating proxy...')
        except requests.RequestException as e:
            print(f'Proxy failed: {e}')
    return None
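The rotator above keeps cycling through every proxy, including ones that have already failed several times. A possible refinement is a pool that retires a proxy after repeated failures and waits a little longer between retries. This is a sketch under assumptions: ProxyPool, max_failures, and backoff_delay are hypothetical names, and the thresholds are placeholders rather than recommended values:

```python
import random


class ProxyPool:
    """Random-choice proxy pool that retires proxies after repeated failures."""

    def __init__(self, proxies, max_failures=3):
        self.max_failures = max_failures
        self.failures = {proxy: 0 for proxy in proxies}

    def get(self):
        """Return a random healthy proxy, or None if all are retired."""
        healthy = [p for p, n in self.failures.items() if n < self.max_failures]
        return random.choice(healthy) if healthy else None

    def report_failure(self, proxy):
        """Count one more consecutive failure for this proxy."""
        self.failures[proxy] += 1

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        self.failures[proxy] = 0


def backoff_delay(attempt, base=2.0, cap=60.0):
    """Exponential backoff with full jitter: a random delay between 0 and
    min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Inside a retry loop you would call pool.get(), report the outcome with report_failure or report_success, and sleep for backoff_delay(attempt) before the next try.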
Step 5 — Parsing All Key Fields
Here's a comprehensive parser for all the major product fields:
import re

from bs4 import BeautifulSoup


def parse_product_page(soup: BeautifulSoup, asin: str) -> dict:
    def text(attr_id=None, attr_class=None):
        """Safe text extractor: look up by id if given, otherwise by class."""
        if attr_id:
            el = soup.find(attrs={'id': attr_id})
        else:
            el = soup.find(class_=attr_class)
        return el.get_text(strip=True) if el else None

    # Price: combine whole + fraction (the whole part typically ends with '.')
    price_whole = text(attr_class='a-price-whole')
    price_frac = text(attr_class='a-price-fraction')
    price = f"{price_whole}{price_frac or ''}" if price_whole else None

    # Images: guard against a wrapper that has no <img> inside
    img_wrap = soup.find('div', {'id': 'imgTagWrapperId'})
    img = img_wrap.find('img') if img_wrap else None
    img_url = img.get('src') if img else None

    # BSR: the rank text sits in the span following the label
    bsr_el = soup.find('span', string=re.compile(r'Best Sellers Rank'))
    bsr_next = bsr_el.find_next('span') if bsr_el else None
    bsr_txt = bsr_next.get_text(strip=True) if bsr_next else None

    return {
        'asin': asin,
        'title': text(attr_id='productTitle'),
        'brand': text(attr_id='bylineInfo'),
        'price': price,
        'rating': text(attr_class='a-icon-alt'),
        'review_count': text(attr_id='acrCustomerReviewText'),
        'bsr': bsr_txt,
        'availability': text(attr_id='availability'),
        'main_image': img_url,
    }
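The parser returns every field as raw text, such as '4.5 out of 5 stars' or '1,234 ratings'. If you want numbers for sorting or analysis, a small normalization step can follow. The sketch below assumes US-style formatting (comma as thousands separator, dot as decimal point), so it would need adapting for marketplaces that format numbers differently. The function names are illustrative:

```python
import re
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Extract the first number from a price string, e.g. '$1,299.99'."""
    if not raw:
        return None
    match = re.search(r'\d[\d,]*(?:\.\d+)?', raw)
    return float(match.group().replace(',', '')) if match else None


def parse_rating(raw: Optional[str]) -> Optional[float]:
    """Extract the leading score from text like '4.5 out of 5 stars'."""
    if not raw:
        return None
    match = re.search(r'\d+(?:\.\d+)?', raw)
    return float(match.group()) if match else None


def parse_count(raw: Optional[str]) -> Optional[int]:
    """Extract the first integer from text like '12,345 ratings'."""
    if not raw:
        return None
    match = re.search(r'\d[\d,]*', raw)
    return int(match.group().replace(',', '')) if match else None
```

For example, parse_price('$1,299.99') gives 1299.99 and parse_count('12,345 ratings') gives 12345. Each helper returns None when the field is missing.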
Step 6 — Save as CSV
import pandas as pd
# Assuming results is a list of product dicts
df = pd.DataFrame(results)
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')
print(f'Saved {len(df)} products to amazon_products.csv')
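Writing the file once at the end means a crash halfway through loses the whole run. One way around that is to append each record as soon as it is scraped, using the standard library's csv module. This is a sketch: append_row and FIELDNAMES are hypothetical names, and the column list is an assumption based on the fields returned in Step 2:

```python
import csv
import os

# Columns to write; assumed to match the dict keys produced by the scraper.
FIELDNAMES = ['asin', 'title', 'price', 'rating', 'reviews']


def append_row(path: str, row: dict) -> None:
    """Append one product dict to a CSV file, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        # extrasaction='ignore' drops keys that are not in FIELDNAMES;
        # missing keys are written as empty cells.
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, extrasaction='ignore')
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Calling append_row('amazon_products.csv', data) inside the scraping loop keeps everything collected so far on disk.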
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
| 503 Service Unavailable | IP blocked / CAPTCHA | Rotate proxy, add delays |
| Empty title element | JavaScript-rendered page | Use Playwright instead of requests |
| ConnectionError | Proxy failed | Add retry logic with fallback proxies |
| Price returns None | Different price selector | Check for .a-offscreen as fallback |
| Inconsistent data | Layout A/B test by Amazon | Use multiple selector fallbacks |
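The last two fixes in the table come down to the same idea: try several selectors in order and take the first one that yields text. A minimal sketch, assuming the soup object is a BeautifulSoup instance (or anything else with a select_one method). first_text and PRICE_SELECTORS are hypothetical names, and the selector list only contains the two price selectors named in this tutorial:

```python
from typing import Iterable, Optional

# Tried in order: the primary selector first, then the fallback.
PRICE_SELECTORS = ['.a-price-whole', '.a-offscreen']


def first_text(soup, selectors: Iterable[str]) -> Optional[str]:
    """Return the stripped text of the first selector that matches
    an element with non-empty text, or None if none do."""
    for selector in selectors:
        el = soup.select_one(selector)
        if el is not None:
            value = el.get_text(strip=True)
            if value:
                return value
    return None
```

With this in place, price = first_text(soup, PRICE_SELECTORS) replaces a single hard-coded lookup, and handling a new layout variant means adding one more selector to the list.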
Success Rate Expectations
| Approach | Expected Success Rate | Good For |
|---|---|---|
| requests alone | 20–40% | Testing only |
| requests + headers | 40–60% | Very small scale |
| requests + proxy rotation | 70–85% | Small–medium projects |
| Playwright + proxies | 85–95% | Medium projects |
| Professional service | 98–99.5% | Production / enterprise |
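These rates translate directly into a request budget. Assuming each attempt succeeds independently with probability p (a simplification, since blocks tend to cluster per IP), collecting N pages takes about N / p requests on average, and a page succeeds within k attempts with probability 1 - (1 - p)^k. The helper names below are illustrative:

```python
import math


def expected_requests(pages: int, success_rate: float) -> float:
    """Average number of requests needed to collect `pages` successful pages."""
    return pages / success_rate


def success_within(attempts: int, success_rate: float) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - success_rate) ** attempts


def attempts_needed(success_rate: float, target: float) -> int:
    """Smallest number of attempts k with 1 - (1 - p)**k >= target (target < 1)."""
    if success_rate >= 1:
        return 1
    return math.ceil(math.log(1 - target) / math.log(1 - success_rate))
```

At a 50% success rate, 10,000 products take about 20,000 requests. At 70%, three attempts per page reach roughly 97% coverage, and attempts_needed(0.7, 0.99) shows that four attempts are needed to pass 99%.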
When to Use a Professional Service Instead
Building and maintaining a Python scraper becomes impractical when:
- You need consistent 98%+ success rates (Amazon changes layout frequently)
- You're scraping millions of records per month
- You need data from multiple marketplaces simultaneously
- You want automatic maintenance when Amazon changes its structure
- Your team doesn't have scraping infrastructure expertise
At that point, the engineering cost of maintaining your own scraper exceeds the cost of a managed service.
Our team of senior data engineers and web scraping specialists has delivered over 500 million records across 12+ Amazon marketplaces. We write about scraping techniques, eCommerce data strategy, and Amazon market intelligence based on real-world project experience.