Why Standard Tools Miss the Real Status of Your Overseas Backlinks
Ahrefs and Majestic are built for discovery, not real-time verification. Their crawlers work on a schedule, and by the time you pull a backlink report, some of those entries are weeks or months stale. A page that showed as "live" in last month's export may be returning a 404 right now — and if your disavow decisions or outreach prioritization are based on that snapshot, you're working with bad data.
The second problem is trickier: geo-blocking. Many overseas domains in finance, media, and e-commerce serve completely different responses depending on where the request originates. When you send bulk HTTP checks from a standard datacenter IP, the fingerprint is obvious — consistent ASN, no residential ISP routing, no geographic spread across requests. Most anti-bot layers on these sites return a 403 or a deceptive soft-200 (a "you don't have access" page that still returns HTTP 200) before any real content loads. You can't trust those readings.
Residential proxies route requests through real consumer devices assigned IPs by actual ISPs. From the target site's perspective, each request looks like a different real user browsing from within that country — which is what it takes to get a reliable status reading from overseas backlinks at scale.
Before you start, you'll need:

- An Ahrefs, Majestic, or SE Ranking account to export your backlink list as CSV
- Python 3.9+ with `requests`, `pandas`, and `beautifulsoup4` installed (`pip install requests pandas beautifulsoup4`)
- A rotating residential proxy account with username/password credentials and a gateway endpoint address
---
What "Valid" Actually Means: Reading HTTP Status Codes
Nail this down before writing a single line of code, because the status code is your primary output and each one calls for a different action.
| HTTP Code | Meaning | Backlink Status |
|---|---|---|
| 200 | Page is live and accessible | ✅ Reachable — run Step 3b to confirm anchor text and link attribute are still present |
| 301 | Permanent redirect | ⚠️ Follow the chain; check `final_url` for topical relevance (see Step 4) |
| 302 | Temporary redirect | ⚠️ Investigate — could be a session redirect or in-progress site migration |
| 403 | Forbidden | ❓ Likely geo-blocked — retry from the target country's residential IP before writing it off |
| 404 | Not found | ❌ Dead link — Google may keep this in index temporarily |
| 410 | Gone permanently | ❌ Dead link — Google removes 410s from its index faster than 404s; prioritize these |
| 5xx | Server error | ⏳ Retry later; this is a server-side issue, not evidence the backlink is dead |
One important distinction on 200s: a successful response only confirms the page loaded. It doesn't confirm your backlink is still there. The anchor text and the `<a href>` pointing to your domain could have been removed in a content update while the URL itself stays live. Step 3b handles this.
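The soft-200 trap described earlier can be caught with a cheap content heuristic once you have a response body in hand. A minimal sketch; the marker phrases below are assumptions to extend with block-page text you actually see in your target markets:

```python
# Heuristic soft-200 detector. The marker phrases are illustrative
# assumptions -- extend them with block-page wording you actually observe.
SOFT_200_MARKERS = [
    "access denied",
    "not available in your region",
    "enable javascript and cookies",
]

def looks_like_soft_200(status_code: int, body: str) -> bool:
    """Return True when a 200 response is probably a block page, not content."""
    if status_code != 200:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in SOFT_200_MARKERS)
```

A page flagged by this check should be rechecked from an in-country residential IP before you trust the 200.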
---
Step 1: Export Your Backlink List
From Ahrefs: Open Site Explorer → enter your target domain → click Backlinks in the left sidebar → apply filters as needed (dofollow, minimum DR, etc.) → click Export → CSV. The file includes columns for referring page URL, anchor text, link type, and the date Ahrefs last crawled it. The column header you need is "Referring page URL" — note this exactly, you'll use it in the script.
From Majestic: Go to Site Explorer → the Backlinks tab → filter to your required link type → Export to CSV.
Strip the file down to a single url column with a header row, or note the exact column header name from your export — the script needs it. A clean input looks like:
```
url
https://example-blog.com/seo-tools-review
https://news-site.de/marketing-software-guide
https://forum.au/threads/proxy-ip-guide
```
Save it as `backlinks_to_check.csv` in your working directory.
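If you'd rather not hand-edit the export, a short pandas pass can do the trimming. This sketch assumes the Ahrefs header "Referring page URL"; pass your own column name for Majestic or SE Ranking exports. The function name and file paths are illustrative:

```python
import pandas as pd

def normalize_export(src_path: str, out_path: str,
                     source_column: str = "Referring page URL") -> int:
    """Reduce a raw backlink export to the single `url` column the
    verification script expects; returns how many URLs survived."""
    df = pd.read_csv(src_path)
    # Drop empty rows and stray whitespace before writing the clean list
    urls = df[source_column].dropna().str.strip()
    urls.to_frame(name="url").to_csv(out_path, index=False)
    return len(urls)
```

Calling `normalize_export("ahrefs_export.csv", "backlinks_to_check.csv")` leaves you with exactly the input format shown above.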
---
Step 2: Configure Your Rotating Residential Proxy
Most residential proxy providers expose a single gateway endpoint — one hostname and port — that handles IP rotation automatically. Every outbound request exits through a different residential IP. Authentication is via username and password embedded directly in the proxy URL string.
Log into your Proxy001 dashboard and navigate to the Residential Proxy → Integration section. There you'll find your gateway hostname, port, username, and password. Proxy001 supports country-level IP targeting, and the exact username parameter format for country selection is documented in that same Integration section — copy the example syntax directly from there rather than guessing, since the format varies between proxy providers and a typo here causes every request to fail silently.
With your credentials in hand, your proxy URL will follow this structure:
```
http://your_username:your_password@gateway_host:port
```
If you need results from specific markets — say, German news sites or Australian forums — set the country targeting in your username parameter per your dashboard's instructions. Routing through in-country residential IPs significantly reduces false 403 responses caused by geo-restrictions on those domains.
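As a sketch, a small helper can assemble the proxies dict safely. The helper name is mine; whatever country-tagged username your provider's dashboard gives you travels through it unchanged:

```python
from urllib.parse import quote

def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Assemble the proxies mapping that requests expects. Any provider-specific
    country tag belongs inside `username`, copied verbatim from your dashboard."""
    # Percent-encode credentials so characters like @ or : can't break the URL
    proxy_url = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}
```

Encoding the credentials up front avoids a subtle failure mode: a password containing `@` or `:` silently corrupts the proxy URL and every request errors out.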
---
Verify Your Connection Before Running the Full Batch
Run this before sending requests to your actual backlink list. If anything in your configuration is wrong — credentials, gateway address, network routing — you'll catch it in 10 seconds rather than discovering it after processing 500 URLs.
```python
import requests

PROXY_URL = "http://YOUR_USERNAME:YOUR_PASSWORD@GATEWAY_HOST:PORT"
PROXIES = {"http": PROXY_URL, "https": PROXY_URL}

# Test 1: Confirm proxy is routing correctly
r = requests.get("https://httpbin.org/ip", proxies=PROXIES, timeout=10)
print("Exit IP:", r.json())  # should show a non-local IP

# Test 2: Confirm a real target domain returns an expected status
test_backlink = "https://example-target-domain.com/page-with-your-link"
r2 = requests.head(test_backlink, proxies=PROXIES, timeout=15, allow_redirects=True)
print("Status:", r2.status_code, "→", r2.url)
```

Three signs your setup is working:

- `httpbin.org/ip` returns a JSON object with an IP address that isn't your own
- Your test backlink URL returns the status code you'd expect (200, 301, etc.)
- No `ProxyError` or `ConnectionError` exception is raised
If Test 1 fails, recheck your gateway hostname and port. If Test 2 fails but Test 1 passes, the issue is with that specific target domain (geo-block or down) — not your proxy config.
---
Step 3: Run the Batch Verification Script
This script reads your URL list, sends an HTTP HEAD request through the rotating residential proxy for each URL, follows any redirect chains to capture the final status code, and writes everything to a results CSV. Sites that return 405 (Method Not Allowed) for HEAD requests are automatically retried with a GET request — no manual intervention needed.
```python
import requests
import pandas as pd
import time
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

# ── Configuration ──────────────────────────────────────────────────────────────
PROXY_URL = "http://YOUR_USERNAME:YOUR_PASSWORD@GATEWAY_HOST:PORT"
INPUT_FILE = "backlinks_to_check.csv"
URL_COLUMN = "url"          # update to match your CSV header exactly
OUTPUT_FILE = "backlink_verification_results.csv"
MAX_WORKERS = 5             # concurrent threads; keep ≤ 10 on first run
REQUEST_TIMEOUT = 15        # seconds before timeout
MAX_RETRIES = 2             # retries on timeout or connection error
DELAY_RANGE = (1, 3)        # random pause (seconds) between retries

PROXIES = {"http": PROXY_URL, "https": PROXY_URL}
HEADERS = {"User-Agent": "BacklinkAuditBot/1.0 (+https://proxy001.com)"}

# ── Per-URL verification ───────────────────────────────────────────────────────
def verify_backlink(url: str) -> dict:
    result = {"url": url, "status_code": None, "final_url": None, "error": None}
    for attempt in range(MAX_RETRIES + 1):
        try:
            resp = requests.head(
                url,
                proxies=PROXIES,
                headers=HEADERS,
                timeout=REQUEST_TIMEOUT,
                allow_redirects=True,
            )
            # Some servers don't allow HEAD — fall back to GET automatically
            if resp.status_code == 405:
                resp = requests.get(
                    url,
                    proxies=PROXIES,
                    headers=HEADERS,
                    timeout=REQUEST_TIMEOUT,
                    allow_redirects=True,
                )
            result["status_code"] = resp.status_code
            result["final_url"] = resp.url
            return result
        except requests.exceptions.Timeout:
            result["error"] = "timeout"
        except requests.exceptions.ProxyError:
            result["error"] = "proxy_error"
            break  # auth/credential failures won't resolve on retry
        except requests.exceptions.ConnectionError:
            result["error"] = "connection_error"
        except requests.exceptions.RequestException as exc:
            result["error"] = str(exc)[:120]
            break
        if attempt < MAX_RETRIES:
            time.sleep(random.uniform(*DELAY_RANGE))
    return result

# ── Batch runner ───────────────────────────────────────────────────────────────
def main():
    df = pd.read_csv(INPUT_FILE)
    urls = df[URL_COLUMN].dropna().str.strip().tolist()
    print(f"Loaded {len(urls)} URLs.")
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(verify_backlink, url): url for url in urls}
        for i, future in enumerate(as_completed(futures), 1):
            results.append(future.result())
            if i % 50 == 0:
                print(f"  {i}/{len(urls)} checked...")
    out = pd.DataFrame(results)
    out.to_csv(OUTPUT_FILE, index=False)
    print(f"\nDone — results saved to {OUTPUT_FILE}")
    print(out["status_code"].value_counts(dropna=False).to_string())

if __name__ == "__main__":
    main()
```

**Note on the User-Agent header:** The script identifies itself as `BacklinkAuditBot/1.0` — a transparent, self-identifying string consistent with responsible automated access. Well-behaved crawlers, including Googlebot, follow this convention. If a specific domain's `robots.txt` disallows your bot name, respect that directive and exclude those URLs from the batch.
Four configuration details that affect results:
- `URL_COLUMN`: Update this to match your CSV header exactly. Ahrefs exports use "Referring page URL" rather than "url" — a mismatch here raises a `KeyError` immediately on run.
- `allow_redirects=True`: The `status_code` in your output is the final code at the end of any redirect chain, not just the first hop. The `final_url` column records where you ended up.
- `MAX_WORKERS = 5`: A safe starting point. Running more parallel threads against the same target domain can trigger rate-limiting. Confirm output quality at 5 before scaling up.
- The 405 fallback is automatic: The script detects Method Not Allowed responses and retries with GET. You don't need to handle this manually.
Sample output (`backlink_verification_results.csv`):

```
url,status_code,final_url,error
https://example-blog.com/seo-tools-review,200,https://example-blog.com/seo-tools-review,
https://news-site.de/marketing-guide,301,https://news-site.de/tools/,
https://forum.au/threads/proxy-guide,403,https://forum.au/threads/proxy-guide,
https://old-press-release.com/article,404,https://old-press-release.com/article,
https://deleted-page.net/post-42,410,https://deleted-page.net/post-42,
https://slow-server.io/resource,,,timeout
```
---
Step 3b: Check Link Attributes for 200 Responses
A 200 status tells you the page is alive. It doesn't tell you whether your link is still on it, or whether it's dofollow. Since dofollow vs. nofollow is central to any link equity assessment, run this second pass on just the 200-OK rows from your Step 3 output.
```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

# ── Configuration ──────────────────────────────────────────────────────────────
STEP3_OUTPUT = "backlink_verification_results.csv"
OUTPUT_FILE = "backlink_attributes.csv"
TARGET_DOMAIN = "yourdomain.com"   # domain you're checking links TO — no https://
MAX_WORKERS = 3                    # lower concurrency for GET requests (full page body)
REQUEST_TIMEOUT = 20
# Same credentials as Step 3 — redefined here so this script runs standalone
PROXY_URL = "http://YOUR_USERNAME:YOUR_PASSWORD@GATEWAY_HOST:PORT"
PROXIES = {"http": PROXY_URL, "https": PROXY_URL}
HEADERS = {"User-Agent": "BacklinkAuditBot/1.0 (+https://proxy001.com)"}

# ── Per-URL attribute check ────────────────────────────────────────────────────
def check_link_attributes(page_url: str) -> dict:
    result = {
        "page_url": page_url,
        "link_found": False,
        "href": None,
        "anchor_text": None,
        "is_dofollow": None,
        "error": None,
    }
    try:
        resp = requests.get(
            page_url,
            proxies=PROXIES,
            headers=HEADERS,
            timeout=REQUEST_TIMEOUT,
        )
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            if TARGET_DOMAIN in a["href"]:
                rel = a.get("rel", [])  # BeautifulSoup returns rel as a list
                result.update({
                    "link_found": True,
                    "href": a["href"],
                    "anchor_text": a.get_text(strip=True),
                    "is_dofollow": "nofollow" not in rel and "ugc" not in rel and "sponsored" not in rel,
                })
                break  # stops at the first matching link; remove break to find all
    except Exception as exc:
        result["error"] = str(exc)[:120]
    return result

# ── Batch runner ───────────────────────────────────────────────────────────────
def main():
    df = pd.read_csv(STEP3_OUTPUT)
    live_200 = df[df["status_code"] == 200]["url"].dropna().tolist()
    print(f"Running attribute check on {len(live_200)} live URLs...")
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {executor.submit(check_link_attributes, url): url for url in live_200}
        for i, future in enumerate(as_completed(futures), 1):
            results.append(future.result())
            if i % 20 == 0:
                print(f"  {i}/{len(live_200)} checked...")
    out = pd.DataFrame(results)
    out.to_csv(OUTPUT_FILE, index=False)
    print(f"\nDone — saved to {OUTPUT_FILE}")
    print(out[["link_found", "is_dofollow"]].value_counts(dropna=False).to_string())

if __name__ == "__main__":
    main()
```

The `is_dofollow` column accounts for all three Google-recognized link attribute values that suppress equity flow: `nofollow`, `ugc`, and `sponsored`. Any link carrying one of these attributes should be treated as non-equity-passing regardless of page status.
---
Step 4: Read and Act on the Results
Your Step 3 output CSV has four columns: `url`, `status_code`, `final_url`, and `error`. Your Step 3b output adds `link_found`, `anchor_text`, and `is_dofollow`. Here's how to act on each group:
**200s and 403s — the two that need extra work.** For 200-OK URLs, check your Step 3b output: if `link_found` is False, the page is alive but your link has been removed — a link reclamation opportunity worth a direct outreach email. If `is_dofollow` is False, the link exists but carries no equity. For 403 responses, don't write them off yet. A 403 through a residential proxy often means the country you routed through didn't match the site's access rules. Re-run those specific URLs with the proxy targeted to the site's home country. Persistent 403s after that usually mean the page requires an authenticated session — flag those for manual review rather than marking the link dead.
**301s — check `final_url` in your CSV.** Google has publicly confirmed that 301 redirects consolidate link equity to the destination URL, provided the destination is topically relevant. A redirect landing on the same site's relevant content preserves that equity signal. A redirect to the homepage or an unrelated page, however, may be treated as a soft 404 by Google — meaning no equity passes at all. For any 301 where `final_url` shows the homepage or a generic landing page, contact the site owner to update the link directly to your target page.
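To spot homepage redirects programmatically rather than by eyeballing the CSV, a path-based check over `url` and `final_url` works. This helper is my own sketch of that heuristic, not a Google-defined signal:

```python
from urllib.parse import urlparse

def redirect_looks_generic(original_url: str, final_url: str) -> bool:
    """Heuristic: flag redirects from a deep page to a bare homepage,
    the pattern most likely to be treated as a soft 404."""
    original_path = urlparse(original_url).path
    final_path = urlparse(final_url).path
    # A deep source collapsing to the root path is the classic
    # "redirected everything to the homepage" signature.
    return final_path in ("", "/") and original_path not in ("", "/")
```

Run it over the 301 rows of your Step 3 output to build a shortlist for site-owner outreach.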
**404s and 410s — your actionable dead links.** Prioritize 410s first. Google removes 410 pages from its index faster than 404s, so link equity from those pages is already gone or evaporating. For 404s, cross-reference with the Wayback Machine to check how recently the page existed — if it disappeared in the last 30–60 days, a link reclamation email to the site owner is worth sending.
**`error` column not null.** `proxy_error` means your credentials failed — check your dashboard. Repeated `timeout` on the same URL usually indicates the site blocks all automated requests regardless of IP type — flag those for manual review rather than assuming the link is dead.
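The decision rules above can be condensed into one triage function. The bucket labels are illustrative, and the arguments mirror the columns produced by Steps 3 and 3b:

```python
def triage(status_code, link_found=None, is_dofollow=None) -> str:
    """Map one verified backlink to an action bucket (labels are illustrative)."""
    if status_code in (404, 410):
        return "dead: reclamation or replacement outreach"
    if status_code == 403:
        return "retry via in-country residential IP"
    if status_code in (301, 302):
        return "inspect final_url for relevance"
    if status_code == 200:
        if link_found is False:
            return "page live, link removed: outreach"
        if is_dofollow is False:
            return "link present but nofollow/ugc/sponsored"
        return "healthy"
    # 5xx, timeouts, and proxy errors all land here
    return "manual review (errors, 5xx, timeouts)"
```

With pandas you can apply this row-wise over the merged Step 3/3b output to get a single sortable action column.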
---
Troubleshooting Common Failures
**Every request returns `proxy_error`.** Your `PROXY_URL` string has a formatting issue or your credentials have expired. Verify the exact syntax — `http://username:password@host:port` — and regenerate credentials from your provider dashboard if needed. Run the single-URL connectivity test from the "Verify Your Connection" section to isolate whether the issue is credentials or network routing.
**High rate of 403 responses across many domains.** The most likely cause is request rate — drop `MAX_WORKERS` to 2–3 and widen `DELAY_RANGE` to (2, 5) to bring the request pattern closer to normal browsing cadence. If 403s persist on specific domains after that, re-run those URLs with the proxy country-targeted to the site's home market; many overseas domains enforce geo-restrictions on non-local traffic, which can produce false 403s on otherwise accessible pages. Domains that return 403 consistently across all approaches likely require session-authenticated access — mark those for manual review.
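Pulling just those 403 rows back out of the Step 3 output for a country-targeted re-run is worth wrapping in a helper; the function name here is mine:

```python
import pandas as pd

def urls_needing_country_retry(results_csv: str) -> list:
    """Collect the 403 rows from the Step 3 output for a country-targeted re-run."""
    df = pd.read_csv(results_csv)
    return df[df["status_code"] == 403]["url"].dropna().tolist()
```

Feed the returned list back through `verify_backlink()` from Step 3 after switching `PROXY_URL` to a gateway targeted at the site's home country.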
**Output CSV shows all nulls in `status_code`.** The script ran but every request failed silently. Check the `error` column — if everything shows `connection_error`, your proxy gateway is unreachable; if it shows `proxy_error`, authentication is failing. Separately, confirm `URL_COLUMN` matches your CSV header exactly — a mismatch raises a `KeyError` before any requests are sent, so you'd get no output file at all.
---
A Note on Responsible Use
This workflow sends HTTP HEAD and GET requests to publicly accessible pages — functionally identical to what a standard link-checking tool does. A few boundaries to stay within: always check and respect a site's robots.txt directives, which are governed by IETF RFC 9309, the formal standard for the Robots Exclusion Protocol. Don't target pages that require authentication, session cookies, or a logged-in state. Keep your request rate low enough that it doesn't meaningfully burden the target server — 2–5 concurrent threads against any single domain is well within normal SEO audit parameters. This workflow is scoped to verifying your own existing backlinks — a clearly bounded, legitimate use of automated HTTP requests consistent with standard SEO auditing practice.
---
Verify at Scale with Proxy001
Running this workflow against thousands of overseas backlinks across European, Asian, and LATAM domains puts real demands on your proxy pool. You need enough IP coverage to avoid reuse patterns and geographic precision to get accurate responses from region-locked sites.
Proxy001 provides access to 100M+ residential IPs spanning 200+ regions, with targeting down to the country and city level. Their rotating gateway works directly with Python's requests library using the standard username:password@host:port format used in the scripts above — no SDK or additional wrapper needed. The platform also supports Scrapy, Puppeteer, and Selenium if you want to extend this workflow to include JS-rendered sites. Proxy001 offers a free trial so you can benchmark proxy performance against your specific target domains before committing to a full-scale run — check response accuracy across your top backlink markets before processing your full list. Get started at proxy001.com.