Web Scraping Proxies: Why Requests Work Locally but Get Blocked in the Cloud (A Diagnostic Framework)

Your scraper runs perfectly on your laptop. You deploy it to AWS, and it fails immediately. This is not a code bug—it is an environment detection problem. This article provides a diagnostic framework to identify why your web scraping proxy configuration works locally but fails in cloud or datacenter environments, with validation steps and defensive-only remediation paths.


Direct Answer: Why Local Works but Cloud Fails

The core mechanism: When a scraper runs correctly on your local machine but fails in production, the scraper itself has no issue—the website detects something about your running environment. Your local machine uses a residential IP with high trust, browser-native TLS signatures, and natural request timing. A cloud server uses a datacenter IP with low trust (often pre-blocked), HTTP library TLS fingerprints, and parallel request patterns. Websites detect these differences at multiple layers simultaneously.

Five detection layers cause local-vs-cloud blocking:

  1. IP Trust Score: Datacenter IPs from providers like AWS, GCP, and Azure are flagged before any request reaches the server. Cloud providers publish their IP subnet lists, which websites use for immediate blocking. An estimated 99% of traffic from traceable datacenter IPs is bot traffic.

  2. ASN Recognition: AWS WAF maintains a HostingProviderIPList containing all known hosting providers, with inclusion determined on an ASN basis. If your proxy provider's IP range falls within a known datacenter ASN, you may be blocked before sending a single request.

  3. TLS Fingerprint Mismatch: HTTP client libraries produce TLS Client Hello messages with parameters distinct from real browsers. The JA3 fingerprint algorithm hashes five fields (TLSVersion, Ciphers, Extensions, EllipticCurves, EllipticCurvePointFormats), and anti-scraping services maintain databases of whitelisted browser fingerprints versus blacklisted scraping tool fingerprints.

  4. Header and Behavioral Signals: Default library user-agents explicitly identify automation. Beyond headers, modern anti-bot systems use machine learning to analyze request patterns, timing, and behavioral fingerprints across dozens of signals.

  5. Browser Fingerprint Leakage: For browser automation, headless Chrome differs from standard Chrome in subtle fingerprint ways. The navigator.webdriver property returns true for automation-controlled browsers, and websites gather browser API information including audio/video devices and WebGL renderer data.

Key insight: Websites use a layered approach combining network-level signals (JA3/JA4, IP geolocation, ASN reputation) with application-level signals (missing fonts, unusual screen sizes, headless-detection scripts). Fixing one layer while ignoring others will not resolve blocking.


Diagnostic Flowchart: Identifying Your Blocking Cause

Use this decision tree to systematically identify why your scraper fails in cloud deployment. Start at the top and follow the branches based on your observed symptoms.

START: Scraper works locally but fails in cloud?
    │
    ▼
[1] Check if anti-bot is installed on target
    (Use Wappalyzer browser extension to detect protection)
    │
    ├─► Anti-bot detected → Refer to anti-bot specific diagnostic 
    │                        (protection-specific validation required)
    │
    └─► No anti-bot detected
            │
            ▼
[2] Does manual browser access work from same cloud IP?
    (SSH to server, curl or browser test)
    │
    ├─► NO: Browser also blocked
    │       │
    │       ▼
    │   CAUSE: IP/ASN-based blocking
    │   → Validate: Check if IP belongs to known datacenter ASN
    │   → Test: Try residential proxy or different datacenter provider
    │   → See: "IP-Based Blocking" row in Troubleshooting Matrix
    │
    └─► YES: Browser works, scraper fails
            │
            ▼
[3] Is your scraper browserless or browser automation?
    │
    ├─► BROWSERLESS (requests, httpx, aiohttp, etc.)
    │       │
    │       ▼
    │   [4] What HTTP error code is returned?
    │       │
    │       ├─► 403 on ALL requests immediately
    │       │   → Likely: TLS fingerprint or IP blocking
    │       │   → Test: Use TLS impersonation library (curl_cffi)
    │       │   → If still blocked: IP/ASN issue
    │       │
    │       ├─► 403 after some successful requests
    │       │   → Likely: Header mismatch or rate detection
    │       │   → Test: Copy exact browser headers from network tab
    │       │   → Validate header order and capitalization
    │       │
    │       ├─► 429 Too Many Requests
    │       │   → Likely: Rate limiting
    │       │   → Fix: Reduce threads, add random delays
    │       │   → Consider: Session-based IP rotation
    │       │
    │       └─► Timeout / Connection errors
    │           → Likely: Rate limiting or IP ban mid-session
    │           → Reduce parallel requests
    │           → Rotate IPs more frequently
    │
    └─► BROWSER AUTOMATION (Puppeteer, Playwright, Selenium)
            │
            ▼
        [5] Check browser fingerprint signals
            │
            ├─► navigator.webdriver = true?
            │   → Fix: Add --disable-blink-features=AutomationControlled
            │
            ├─► Missing plugins/fonts/WebGL anomalies?
            │   → Apply stealth plugin
            │   → Test at fingerprint validation sites
            │
            └─► Fingerprint appears normal but still blocked?
                → Likely: IP/ASN blocking compound issue
                → Test with residential proxy
                → Check for behavioral pattern detection
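
A quick way to separate the step [2] and step [4] outcomes is to run both a plain library request and a browser-impersonated request from the same cloud host. The sketch below is a minimal Python example (the target URL is a placeholder; it assumes the requests and curl_cffi packages are installed): if both calls are blocked, suspect IP/ASN blocking; if only the plain call is blocked, suspect the TLS layer.

import requests
from curl_cffi import requests as curl_requests

TARGET_URL = 'https://example.com'  # placeholder: your target page

# 1. Plain HTTP library request (library TLS fingerprint and default header handling)
plain = requests.get(TARGET_URL, timeout=15)
print(f'plain requests     -> {plain.status_code}')

# 2. Same request with a browser-impersonated TLS fingerprint
browserlike = curl_requests.get(TARGET_URL, impersonate='chrome110', timeout=15)
print(f'curl_cffi (chrome) -> {browserlike.status_code}')

# Both blocked: likely IP/ASN blocking. Only the plain call blocked: likely TLS fingerprinting.
# Both succeed here but the scraper still fails: look at headers, cookies, or rate limits.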

Terminal diagnosis mapping:

| Diagnosis | Primary Cause | Validation Method | Remediation Path |
| --- | --- | --- | --- |
| Blocked before any content loads | ASN/IP range pre-blocking | Check for error 1005 or immediate 403 | Switch to residential proxies or a different ASN |
| Blocked with library but not browser | TLS fingerprint mismatch | Compare JA3 hash of client vs browser | Use a TLS impersonation library or browser automation |
| Works locally, 403 in cloud with same code | Datacenter IP detection | Test same code with a residential proxy | Use rotating residential proxies |
| Initial success then failures | Rate limiting or behavioral detection | Monitor error rate over time | Reduce concurrency, add delays, implement session management |
| Geo-specific blocking (error 1009) | Country-based access control | Test with a geo-targeted proxy | Use a geo-targeted proxy with a country parameter |



Troubleshooting Matrix: Symptoms, Causes, Validation, and Fixes

This matrix maps observable symptoms to likely causes, provides validation steps to confirm the diagnosis, and offers defensive-only remediation approaches.

Row 1: IP-Based Blocking (Cloudflare 1005, Immediate 403)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| Cloudflare error 1005 or immediate 403 on first request | ASN/IP range is pre-blocked; datacenter IP recognized | Check if IP belongs to AWS/GCP published ranges; test with manual browser from same IP | Use a datacenter proxy from a different ASN, or switch to a residential proxy | If datacenter proxies from multiple ASNs fail within 50 requests, escalate to residential proxies |
| All requests blocked regardless of headers/timing | Target maintains HostingProviderIPList blocking known hosting ASNs | ASN lookup on proxy IP; compare against AWS/GCP published ranges | Use providers offering residential or ISP IP ranges | If residential is also blocked, target may use advanced fingerprinting |

Key mechanism: Cloud providers like AWS publish their IP subnet lists, and websites proactively block these ranges. Some targets block entire subnets instead of single IPs—this is the case for high-security platforms. The solution when facing subnet blocking is to use multiple proxy providers and distribute requests across geographic regions.
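
To validate this row quickly, the sketch below (a minimal example assuming the requests package and outbound internet access) downloads the IP ranges AWS publishes and checks whether a given address falls inside any of them; GCP and Azure publish comparable lists.

import ipaddress
import requests

AWS_RANGES_URL = 'https://ip-ranges.amazonaws.com/ip-ranges.json'

def is_aws_ip(ip: str) -> bool:
    """Return True if the address falls inside any IPv4 prefix AWS publishes."""
    prefixes = requests.get(AWS_RANGES_URL, timeout=15).json()['prefixes']
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(p['ip_prefix']) for p in prefixes)

# Substitute your server's or proxy's public exit IP address
print(is_aws_ip('3.5.140.2'))  # an address inside an AWS range at the time of writing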


Row 2: TLS Fingerprint Mismatch (Blocked Before Content Loads)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| 403 immediately with no page content; works with browser from same IP | HTTP library TLS fingerprint differs from browser | Compare JA3 hash of your client to known browser fingerprints | Use a TLS impersonation library (curl_cffi for Python, tls-client for Go) | If TLS impersonation fails, switch to full browser automation |
| Request rejected at TLS handshake level | Anti-scraping service maintains JA3/JA4 blacklist | Test with the curl-impersonate command-line tool | Browser automation tools (Puppeteer, Playwright) use authentic TLS fingerprints | Consider a managed web scraping proxy service with built-in TLS handling |

Technical detail: JA3 fingerprint uses five fields from the TLS Client Hello: TLSVersion, Ciphers, Extensions, EllipticCurves, EllipticCurvePointFormats. These are concatenated and hashed with MD5. JA4 is the successor with improved accuracy for TLS 1.3 and QUIC/HTTP3 traffic. Standard HTTP libraries produce fingerprints that differ from browsers and are catalogued by anti-scraping services.
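
To make the hashing step concrete, here is a minimal sketch with illustrative (not captured) field values; in practice the fields are extracted from the raw Client Hello by a packet capture or TLS-inspection tool.

import hashlib

# Illustrative Client Hello fields, written as decimal values per the JA3 convention
tls_version = '771'                         # 0x0303, i.e. TLS 1.2 on the wire
ciphers = '4865-4866-4867-49195-49199'      # offered cipher suites
extensions = '0-23-65281-10-11-35-16-5-13'  # extension IDs, in offer order
elliptic_curves = '29-23-24'                # supported groups
ec_point_formats = '0'                      # uncompressed only

# JA3 string: the five fields joined with commas, list items joined with dashes, then MD5-hashed
ja3_string = ','.join([tls_version, ciphers, extensions, elliptic_curves, ec_point_formats])
ja3_hash = hashlib.md5(ja3_string.encode()).hexdigest()

print(ja3_string)
print(ja3_hash)  # this 32-character hex digest is what fingerprint databases store and compare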

Row 3: Header/User-Agent Detection (403 with Partial Content)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| 403 returned with error page content (not blank) | Default library user-agent identifies automation | Check User-Agent header; compare against browser DevTools network tab | Configure spider to send a browser-like User-Agent; optimize all request headers | If the headers fix doesn't work, the IP is likely flagged and a rotating proxy pool is needed |
| Same headers work in one library but fail in another | Framework adds extra headers; header capitalization differs | Compare raw requests between libraries (Scrapy vs requests); check header case | Use a raw HTTP client or middleware to control exact header output | Test through your proxy configuration to isolate header issues |

Library-specific issue: Scrapy (built on Twisted) uses a different HTTP client than the requests library (which uses urllib3). This causes different HTTP requests even with the same configured headers. Some websites detect bots by looking for capitalized headers versus lowercase headers from real browsers.
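
One way to validate this row is to check what actually arrives on the wire. The sketch below (using the public httpbin.org echo endpoint as a stand-in for your own test server) prints the headers the server received, which exposes extra headers or casing changes added by the library.

import requests

browser_like_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# httpbin echoes back the headers it received, so you see the wire-level result,
# including anything the library added or re-cased on top of your configuration.
received = requests.get('https://httpbin.org/headers', headers=browser_like_headers).json()
for name, value in received['headers'].items():
    print(f'{name}: {value}')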


Row 4: Browser Fingerprint Detection (Headless/Automation Signals)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| Browser automation blocked; manual browser works | navigator.webdriver returns true; headless fingerprint differences | Check navigator.webdriver in console; test at fingerprint validation sites | Use the --disable-blink-features=AutomationControlled flag; apply a stealth plugin | If stealth patches fail, check for WebGL/canvas fingerprint leaks |
| Blocked despite stealth plugin | Websites gather browser API information: audio/video devices, WebGL renderer | Compare browser fingerprint between headless and headed mode | Use a full headed browser with virtual display; rotate fingerprint components | Sophisticated detection uses ML across multiple signals; may need a managed service |

Detection mechanism: Headless Chrome differs from standard Chrome in subtle fingerprint ways. Stealth plugins attempt to hide these differences, for example by clearing the navigator.webdriver flag (via the --disable-blink-features=AutomationControlled flag) and overriding navigator.plugins. Once a stealth plugin is paired with a realistic user agent, HTTP/TLS fingerprints become less useful for detection, and detection shifts to behavioral signals.
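
A quick check for the first signal in this row, using Playwright's Python API (a minimal sketch; it assumes the playwright package is installed and its browsers downloaded): launch the automated browser with and without the flag and read navigator.webdriver.

from playwright.sync_api import sync_playwright

def webdriver_flag(launch_args):
    with sync_playwright() as p:
        browser = p.chromium.launch(args=launch_args)
        page = browser.new_page()
        value = page.evaluate('navigator.webdriver')
        browser.close()
        return value

print(webdriver_flag([]))                                                  # typically True (detectable)
print(webdriver_flag(['--disable-blink-features=AutomationControlled']))  # typically False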


Row 5: Rate Limiting (429 After N Requests)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| 429 Too Many Requests after initial success | Request rate exceeds threshold; pattern detected | Monitor error rate over time; check if 429 correlates with request count | Randomize intervals, reduce request frequency, rotate IP addresses, vary user-agent | If rate limiting persists at low volume, behavioral fingerprinting is active |
| Timeout errors after initial success | Rate limiting manifests as timeouts instead of explicit 429 | Check if timeouts correlate with request volume or elapsed time | Reduce parallel threads and add request delays | Implement exponential backoff on failures |

Behavioral signals: Modern anti-bot systems use machine learning to spot automated traffic by examining request patterns, browser fingerprints, and dozens of other signals. Fixed intervals between requests are a strong automation signal.
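
A minimal sketch of the randomized-delay pattern (the URLs and delay bounds are placeholders to tune against your target):

import random
import time

import requests

urls = ['https://example.com/page/1', 'https://example.com/page/2']  # placeholders

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Random pause between requests: fixed intervals are a strong automation signal
    time.sleep(random.uniform(2.0, 6.0))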

Row 6: Session/Cookie Inconsistency (Works Initially Then Fails)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| Scraper works for first requests then fails | Session state lost between requests; cookies not maintained | Check if session cookies are persisted across requests | Implement session management with consistent cookies per IP | Use rotating proxies with session locking |
| Works with one IP, fails when rotating | Same cookie sent from multiple IPs (impossible for a real user) | Audit cookie handling across IP rotations | Never send a single cookie from multiple IP addresses; multiple cookies from a single IP are fine | Use proxy sessions to lock a specific IP for consistent use |

Critical rule: A single cookie arriving from multiple IPs never happens for a real user and is an immediate automation signal. Multiple cookies from a single IP, however, are normal, since many users share public IPs. Proxy sessions let you lock a specific IP, so you can pair one IP address with one set of human-like cookies and headers.
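
A minimal Python sketch of this pairing, reusing the session-locking proxy URL format shown in the Integration Snippets below (the session ID and password are placeholders): one requests.Session keeps one cookie jar, and its session-locked proxy keeps one exit IP, so cookies and IP stay consistent together.

import requests

# One logical "user": one cookie jar bound to one session-locked proxy IP
session_id = 'user_001'  # placeholder; a new ID gets a new exit IP
proxy_url = f'http://session-{session_id}:p455w0rd@proxy.apify.com:8000'

session = requests.Session()
session.proxies = {'http': proxy_url, 'https': proxy_url}
session.headers['User-Agent'] = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
)

# Cookies set by the first response are reused on later requests from the SAME exit IP
first = session.get('https://example.com/')
second = session.get('https://example.com/account')
print(first.status_code, second.status_code, session.cookies.get_dict())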


Row 7: Geo-Blocking (1009 Region-Based)

| Symptom | Likely Cause | Validation Step | Fix / Mitigation | Escalation Path |
| --- | --- | --- | --- | --- |
| Error 1009 or region-specific access denied | Target restricts access by country; proxy geo doesn't match | Check proxy IP geolocation against target's expected region | Use a geo-targeted proxy with a country parameter (e.g., country-US) | If a geo-targeted proxy still fails, target may use additional validation |
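
Before concluding the target is at fault, verify the proxy's exit geolocation. A minimal sketch (ipinfo.io is one of several public IP-information services; the credentials are placeholders):

import requests

proxy_url = 'http://country-US:p455w0rd@proxy.apify.com:8000'  # geo-targeted proxy (placeholder creds)
proxies = {'http': proxy_url, 'https': proxy_url}

# ipinfo.io reports the country it sees for the requesting IP (here, the proxy's exit IP)
info = requests.get('https://ipinfo.io/json', proxies=proxies, timeout=15).json()
print(info.get('ip'), info.get('country'))  # the country should match the target's expected region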



Measurement Plan Template

To diagnose blocking causes and validate fixes, establish baseline metrics and monitoring before deployment. This template provides the fields to track—actual thresholds depend on your target site and should be determined through testing.

Baseline Metrics to Establish

| Metric | Description | Measurement Method | Validation |
| --- | --- | --- | --- |
| Response Time (baseline) | Average response time under normal conditions | Log response times for first 100 successful requests | Deviation >2x baseline may indicate throttling |
| Error Rate (baseline) | Percentage of non-2xx responses | Track HTTP status codes over sample period | A spike in 403/429 signals detection has been triggered |
| Data Completeness | Percentage of expected fields retrieved | Compare output against expected schema | Partial data may indicate soft blocking |
| Success Rate per IP | Requests completed before block, per IP | Track block events correlated to IP lifecycle | Determine when to rotate IPs |
| Success Rate per Session | Requests completed before block, per session | Track block events correlated to session lifecycle | Determine session duration limits |
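
A minimal collection sketch for the first two baseline metrics (response time and error rate); the target URL is a placeholder and thresholds are left to your own testing:

import statistics
import time

import requests

url = 'https://example.com/'  # placeholder target
latencies, status_counts = [], {}

for _ in range(100):          # the first 100 requests form the baseline
    start = time.monotonic()
    response = requests.get(url)
    latencies.append(time.monotonic() - start)
    status_counts[response.status_code] = status_counts.get(response.status_code, 0) + 1
    time.sleep(1)             # modest pacing while measuring

total = sum(status_counts.values())
error_rate = sum(n for code, n in status_counts.items() if code >= 400) / total
print(f'baseline latency: mean {statistics.mean(latencies):.2f}s, median {statistics.median(latencies):.2f}s')
print(f'baseline error rate: {error_rate:.1%}  status breakdown: {status_counts}')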

Monitoring Fields for Production

| Field | What It Indicates | Action Threshold |
| --- | --- | --- |
| 403 Forbidden rate | Header/IP/TLS detection | Investigate at >YOUR_THRESHOLD_PERCENT |
| 429 Too Many Requests rate | Rate limiting triggered | Reduce concurrency or add delays |
| Timeout rate | Soft blocking or rate limiting | Check for pattern correlation with volume |
| Response body anomaly | Content served differs from expected (captcha, empty) | Fingerprint or behavioral detection |
| IP block correlation | Which IPs get blocked and when | Identify problematic ASNs or providers |

Acceptance Criteria Template

Before declaring your proxy web scraping configuration production-ready:

  • [ ] Error rate remains below YOUR_ACCEPTABLE_RATE over YOUR_SAMPLE_SIZE requests

  • [ ] Same configuration works from at least YOUR_SAMPLE_REGIONS geographic regions

  • [ ] Success rate per IP exceeds YOUR_MIN_REQUESTS_PER_IP before rotation needed

  • [ ] No 403/429 spikes correlated with deployment timing

  • [ ] Data completeness matches local testing baseline

Note: Specific thresholds (percentages, request counts, timeframes) must be determined through testing against your target site. This guide provides the measurement framework, not universal threshold values.



Integration Snippets: Validated Code Patterns

The following code examples are adapted from provider documentation and common scraping patterns. Use them as starting points, adapting credentials and endpoints to your environment.

Proxy Configuration with Session Locking

When implementing web scraping with proxy servers, maintaining consistent IP for session-based requests is critical:

http://session-my_session_id_123:p455w0rd@proxy.apify.com:8000


Proxy with Country Parameter for Geo-Targeting

To access region-restricted content, specify country in proxy URL:

http://country-US:p455w0rd@proxy.apify.com:8000


Python Requests with Proxy Server

Basic proxy server for web scraping configuration:

import requests

# Proxy credentials belong in the proxy URL itself; the requests `auth` parameter
# would send credentials to the target site, not to the proxy.
proxy_url = 'http://auto:p455w0rd@proxy.apify.com:8000'

proxy_servers = {
    'http': proxy_url,
    'https': proxy_url,
}

response = requests.get('https://example.com', proxies=proxy_servers)


Complete Browser Headers Configuration

To avoid header-based detection, include full browser header set:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.google.com/',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Cache-Control': 'max-age=0',
}


Disabling Automation Detection Flag (Browser Automation)

For Playwright or Puppeteer, disable the navigator.webdriver detection:

const { chromium } = require('playwright');  // Puppeteer's launch() accepts the same args option

const browser = await chromium.launch({
    args: ['--disable-blink-features=AutomationControlled']
});


Session Pool for Rotating Proxies

Managing multiple sessions with consistent fingerprints per session:

const { CookieJar } = require('tough-cookie');  // cookie-jar implementation (assumed dependency)

const user1 = {
    sessionId: 'user1',
    headers: {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    },
    cookieJar: new CookieJar(),
}


Error Handling with Exponential Backoff

Graceful retry logic for 403 errors:

import math
import time

import requests

url = 'https://example.com'  # placeholder target

for attempt in range(10):
    response = requests.get(url)
    if response.status_code == 403:
        retry_delay = math.pow(2, attempt)  # exponential backoff: 1, 2, 4, 8, ... seconds
        print(f'403! Retrying in {retry_delay:.0f} seconds...')
        time.sleep(retry_delay)
    else:
        break


TLS Impersonation (TEMPLATE)

For sites blocking standard HTTP library TLS fingerprints, use TLS impersonation. The curl_cffi library can simulate browser TLS/JA3 and HTTP/2 fingerprints, unlike standard requests or httpx.

# Standard example (not verbatim from sources)
# curl_cffi wraps curl-impersonate to mimic browser TLS handshakes

from curl_cffi import requests

response = requests.get(
    'YOUR_TARGET_URL',
    impersonate='YOUR_BROWSER_CHOICE',  # e.g., 'chrome110'
    proxies={'https': 'YOUR_PROXY_URL'}
)

# Validation Steps:
# 1. Compare response status vs standard requests library
# 2. Verify JA3 fingerprint matches browser at fingerprint testing service
# 3. Confirm response body contains expected content (not block page)



Decision Matrix: Selecting Proxy Type After Diagnosis

Once you have diagnosed the blocking cause, use this matrix to select the appropriate proxy type. This table uses vendor-claimed success rates for reference; actual performance requires testing against your specific target.

| Target Protection Level | Recommended Proxy Type | Expected Success Rate (Vendor Claims) | Cost Model | When to Escalate |
| --- | --- | --- | --- | --- |
| No anti-bot protection | Datacenter proxy | High (test with 50+ requests to validate) | Per IP or flat-rate bandwidth | If 403s persist after headers fix, check TLS layer |
| Basic protection (rate limits only) | Rotating datacenter proxy + session management | Moderate-high | Per IP | If blocked within 50 requests per IP, escalate to residential |
| Moderate (Cloudflare basic, WAF) | Residential or ISP proxies | Vendor claims: 95-99% on protected sites | Per GB of traffic | If JS challenge persists despite residential IP |
| Aggressive (advanced bot management) | Rotating residential proxies + browser automation | Vendor claims: 40-60% datacenter vs 95-99% residential on protected sites | Per GB + compute costs | Consider a managed web scraping proxy service |

Decision rules:

  1. Always try datacenter proxies first: They are faster, more stable, and cheaper than residential proxies. Test with at least 50 requests per IP before concluding they don't work.

  2. Escalate to residential proxies when: Datacenter IPs are blocked regardless of headers/TLS fixes, error 1005 appears, or target maintains aggressive HostingProviderIPList.

  3. When to buy rotating residential proxies: High-security targets (e-commerce, social media, travel aggregators) where datacenter success rate is unacceptable and session management alone doesn't resolve blocking.

  4. When to buy datacenter proxy: Target is low-security, speed is critical, budget is constrained, and testing confirms acceptable success rate. Consider static datacenter proxies for consistent IP assignment.

  5. When a rotating proxy is needed: Rate limiting triggers at volumes that require more IP diversity than sticky sessions provide.

Vendor selection considerations: When evaluating proxies for your use case, test against your actual target site rather than relying solely on vendor claims; the right choice depends on your specific target's protection mechanisms, not generic benchmarks. You can compare proxy types at proxy001.com. A minimal test-harness sketch for that comparison follows.
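
The harness below is a sketch only (the target and proxy URLs are placeholders; 50 requests per proxy mirrors the decision rules above):

import time

import requests

TARGET = 'https://example.com/'          # placeholder target page
CANDIDATE_PROXIES = {                    # placeholder proxy URLs from the vendors under evaluation
    'vendor_a_datacenter': 'http://user:pass@dc.vendor-a.example:8000',
    'vendor_b_residential': 'http://user:pass@res.vendor-b.example:8000',
}

for name, proxy in CANDIDATE_PROXIES.items():
    ok = 0
    for _ in range(50):                  # at least 50 requests before drawing conclusions
        try:
            r = requests.get(TARGET, proxies={'http': proxy, 'https': proxy}, timeout=20)
            ok += r.status_code == 200
        except requests.RequestException:
            pass                         # count timeouts and connection errors as failures
        time.sleep(1)                    # modest pacing; do not hammer the target while testing
    print(f'{name}: {ok}/50 successful')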



Procurement Due Diligence Checklist (TEMPLATE)

Before purchasing from proxy providers for web scraping, validate these criteria. Fields marked with YOUR_* require input from vendor documentation or testing.

Technical Validation

  • [ ] IP Type Verification: Confirm whether IPs are datacenter, residential, or ISP

  • [ ] ASN Diversity: Check if provider offers IPs from multiple ASNs (reduces subnet blocking risk)

  • [ ] Geographic Coverage: Verify available locations match your target sites' expected regions

  • [ ] Session Support: Confirm ability to lock IP for session-based requests (session parameter in proxy URL)

  • [ ] Protocol Support: Verify HTTP/HTTPS/SOCKS support as needed

  • [ ] Rotation Options: Understand rotation frequency and control (per request, timed, manual)

Operational Validation

  • [ ] Request Volume Testing: Test with at least 50 requests per IP against your actual target before committing

  • [ ] Error Rate Baseline: Establish baseline error rate before production deployment

  • [ ] Support Responsiveness: Test support channel response time for technical issues

  • [ ] Dashboard/API Availability: Verify monitoring and usage tracking capabilities

  • [ ] Documentation Quality: Review developer documentation for integration clarity

Compliance and Risk

  • [ ] Proxy Sourcing Transparency: Understand how residential IPs are sourced (ethical sourcing verification)

  • [ ] Terms of Service Review: Confirm permitted use cases align with your application

  • [ ] Data Retention Policy: Understand what request logs are retained and for how long

  • [ ] Exclusion Risk: For residential proxies, understand if provider's ASN appears on HostingProviderIPLists (some residential ranges advertised from hosting ASNs may be flagged)

Trial and Escalation

  • [ ] Trial Availability: Request demo or trial access before commitment

  • [ ] Escalation Path: Understand options if current proxy type insufficient (datacenter → ISP → residential)

  • [ ] Contract Flexibility: Verify ability to adjust volume or type based on testing results



Risk Boundary Box: Compliance and Defensive Limits

This framework is for diagnosing and resolving legitimate scraping failures. The following boundaries define allowed diagnostic activities versus prohibited actions.

Allowed Activities (Defensive Diagnosis and Reliability Engineering)

  • Diagnosing why your scraper fails in cloud environment

  • Testing different proxy types (datacenter proxy, residential proxy, rotating residential proxy, rotating datacenter proxy) to find working configuration

  • Adjusting headers and TLS settings to match browser behavior for public data access

  • Implementing rate limiting and session management to reduce server load

  • Using legitimate anti-bot bypass techniques for public data collection where legally permissible

  • Validating that your rotating proxy configuration maintains ethical request rates

Boundary Conditions (Must Be Maintained)

  • Respect robots.txt directives where applicable to your use case

  • Do not overload target servers—implement appropriate delays and concurrency limits

  • Do not bypass authentication or access control for private data

  • Do not use scraped data in violation of terms of service

  • Do not scrape personal data without legal basis (GDPR, CCPA, etc.)

Stop Conditions (Immediate Halt Required)

  • If receiving legal notices from target site, consult legal counsel immediately before continuing

  • If scraping causes measurable target site performance degradation, reduce load or stop

  • If data is behind paywall or login, do not circumvent access controls

What This Guide Does NOT Cover

This diagnostic framework does not provide:

  • Techniques to bypass CAPTCHAs or challenge pages requiring human verification

  • Methods to access authenticated or paywalled content without authorization

  • Guidance on scraping personal data or content with specific legal restrictions

  • Tools or configurations for attacking or overloading target infrastructure



Pre-Deployment Checklist: Validating Before Cloud Deployment

Before moving your scraper from local development to cloud production, validate each layer. This checklist synthesizes the diagnostic framework into actionable verification steps.

IP and Network Layer

  • [ ] Target site does NOT block datacenter IPs (validated with 50+ test requests)

  • [ ] Proxy provider ASN is not on known blocklists (check against AWS/GCP published ranges if using datacenter)

  • [ ] Geo-location of proxy matches target site's expected region

  • [ ] IP rotation configured with appropriate session management (session ID parameter)

  • [ ] For residential proxies, sourcing ethics verified with provider

  • [ ] For static datacenter proxies, ASN diversity confirmed

TLS Layer

  • [ ] Using TLS impersonation library (curl_cffi, tls-client) OR full browser automation

  • [ ] JA3/JA4 fingerprint matches target browser (verify at fingerprint testing service)

  • [ ] HTTP/2 support enabled if target expects it

  • [ ] For browserless scraping, confirm library TLS fingerprint is not blacklisted

HTTP Headers Layer

  • [ ] User-Agent matches real browser (not library default like "python-requests/2.x")

  • [ ] All standard browser headers included (Accept, Accept-Language, Accept-Encoding, Referer)

  • [ ] Header order and capitalization matches browser (some sites detect capitalized headers as bot signal)

  • [ ] Referer header set appropriately for navigation context

Browser Automation Layer (If Applicable)

  • [ ] navigator.webdriver flag disabled (--disable-blink-features=AutomationControlled)

  • [ ] Stealth plugin applied (puppeteer-extra-stealth or equivalent)

  • [ ] Plugins/fonts/WebGL fingerprint appears normal (test at fingerprint validation sites)

  • [ ] No automation indicators in JavaScript environment (window.chrome, navigator.languages)

Behavioral Patterns Layer

  • [ ] Request rate configured below target's rate limit threshold (determined through testing)

  • [ ] Random delays between requests (not fixed intervals—fixed timing is automation signal)

  • [ ] Session cookies maintained consistently per IP (never send same cookie from multiple IPs)

  • [ ] Concurrency level appropriate for target capacity

  • [ ] Error handling includes exponential backoff

Monitoring and Observability

  • [ ] Baseline metrics established (response time, error rate, data completeness)

  • [ ] Alerting configured for error rate spikes (403, 429 thresholds)

  • [ ] IP block events logged with correlation to request patterns

  • [ ] Fallback path defined (escalation to different proxy type if primary fails)

For production use, ensure your residential or static proxy provider supports the session management and geographic targeting your use case requires.


Frequently Asked Questions

My scraper works with residential proxy but not datacenter proxy. Does this mean I need to buy rotating residential proxies for all my projects?

Not necessarily. Datacenter proxies should always be tried first because they are faster, more stable, and cheaper than residential proxies. Test with at least 50 requests per IP before concluding datacenter won't work. Some sites allow datacenter access if other signals (headers, TLS, behavior) appear legitimate. Only escalate to residential proxies when datacenter IPs are blocked regardless of other fixes.

I'm getting Cloudflare error 1005. What does this mean?

Error 1005 indicates ASN/proxy range blocking. The target site (via Cloudflare) has detected that your IP belongs to a known datacenter or hosting provider ASN. Solutions include using datacenter proxies from a different ASN range, switching to a different proxy provider entirely, or escalating to residential proxies.

Why do some websites block my scraper even when I use the exact same headers as my browser?

Several factors beyond headers can cause this. First, your HTTP library's TLS fingerprint differs from browsers—solutions like curl_cffi or browser automation address this. Second, your library may add extra headers or change header capitalization. Scrapy, built on Twisted, uses a different HTTP client than requests (urllib3), producing different wire-level requests even with identical configured headers. Some sites detect capitalized headers versus lowercase headers as a bot signal.

What is JA3 fingerprinting and how does it affect my scraper?

JA3 is a fingerprinting technique that hashes five fields from the TLS Client Hello message: TLSVersion, Ciphers, Extensions, EllipticCurves, and EllipticCurvePointFormats. Anti-scraping services maintain databases of JA3 fingerprints, whitelisting browser fingerprints and blacklisting known scraping tool fingerprints. If your HTTP library produces a blacklisted JA3 hash, you may be blocked before any HTTP exchange occurs. Solutions include TLS impersonation libraries or browser automation with authentic TLS fingerprints.

How do I know if my headless browser is being detected?

Check for these signals: navigator.webdriver returns true (should be false for undetected automation), missing or anomalous navigator.plugins, unusual screen sizes or WebGL renderer values, and missing fonts. Browser automation tools like Selenium, Puppeteer, and Playwright use real browser TLS fingerprints but leak other signals. Stealth plugins attempt to hide these but sophisticated detection uses machine learning across multiple signals including user behavior and IP reputation.

Should I use a rotating proxy for scraping or stick with sticky sessions?

It depends on your target and failure mode. Use rotating proxies for web scraping when: you need high request volume and rate limiting triggers quickly per IP, or when individual IPs get blocked and you need continuous diversity. Use sticky sessions when: the target requires session continuity (logged-in state, shopping cart), or when sending the same cookie from multiple IPs would trigger detection. Never send a single cookie from multiple IPs—this is impossible for real users and is an immediate automation signal.
