Getting external links from a website in different ways

When I got this task and started working on it, I found it very interesting. It can be solved in several ways, depending on your knowledge and on whether you have access to the database or to the WordPress dashboard with admin rights.

How to get all external links from a website

Problem: Get all external links from a website, whether we have access to the database, to WordPress, or to neither.

Solution:

1. SEO Spider. It's a cross-platform application (Ubuntu, macOS, Windows). There are free and paid versions. The free version is limited to 500 URLs; the paid version costs £149.00 per year and has no limit. If you have a small site and no access to it, this may be the solution for you, but I moved on.

2. Broken Link Checker. This is a free plugin for WordPress. If you have the right access, you can try this way. See the plugin's page for more details.

3. And then there are more interesting solutions. For example, we have a huge website with tens of thousands of posts, many custom post types, etc., and we have access to the database dump file. We'll need Linux and three commands:

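3.1. Get all links from the database dump (scheme and host only; URLs can appear anywhere in a line, so the pattern is not anchored to the line start):

grep -Eo 'https?://[A-Za-z0-9.-]+' database_name.sql > all_links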

3.2. Keep only external links from the all_links file:

sed '/your_domain_name/d' all_links > external_links

3.3. Get unique external links from the external_links file:

sort -u external_links > unique_external_links
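The three commands above can also be chained into one pipeline without intermediate files. Below is a minimal sketch of that idea, not a definitive implementation. The function name extract_external_hosts is my own (hypothetical), and it makes two assumptions that go beyond the original commands: URLs are matched anywhere in a line, and our own domain is filtered by exact host match (including subdomains) rather than by substring, so a host like notexample.com is not dropped by accident. Pass the domain in lowercase.

```shell
#!/bin/sh
# Sketch: steps 3.1-3.3 as a single pipeline.
# extract_external_hosts is a hypothetical helper name, not part of any tool.
# Usage: extract_external_hosts DUMP_FILE OWN_DOMAIN
extract_external_hosts() {
    dump_file=$1
    own_domain=$2   # assumed lowercase, e.g. example.com
    # 3.1: pull scheme://host out of every line of the dump
    grep -Eo 'https?://[A-Za-z0-9.-]+' "$dump_file" |
        # 3.2: drop our own domain and its subdomains (exact host comparison)
        awk -v d="$own_domain" '{
            sub(/[.-]+$/, "")              # trim a trailing dot picked up from prose
            host = $0
            sub(/^https?:\/\//, "", host)  # strip the scheme
            host = tolower(host)
            n = length(host) - length(d)
            if (host == d) next
            if (n > 0 && substr(host, n) == ("." d)) next
            print
        }' |
        # 3.3: sort and remove duplicates
        sort -u
}
```

For example: extract_external_hosts database_name.sql example.com > unique_external_links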

4. Now let's complicate the previous task a bit. We have a huge site and no access rights, and we still need to get all of its external links. Look at sitemap.xml. I was lucky: my site had this file. We can also use a sitemap generator service, but the free version may be limited (for example, to a maximum of 500 URLs). Next, we'll write a script that opens each link from the sitemap file, gets all links from the opened page, compares their domain names with ours, and, if they differ, writes the URL to our file. We work on Linux, and we also need to install the text browser lynx and the awk utility.

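#!/bin/sh
# Read page URLs from sitemap_links (one per line) and collect external hosts.
q=0
while read -r url
do
    # Dump the page, take the URLs from lynx's output, drop our own domain,
    # keep only scheme://host, and append the result to external_links.
    lynx -dump "$url" | awk '/http/{print $2}' | sed '/www\.site_name\.org/d' | grep -Eo '^https?://[^/]+' | sort -u >> external_links
    q=$((q+1))
    # Progress: page counter and the URL just processed.
    echo "$q $url"
done < sitemap_links
# Remove duplicates collected across different pages (in place).
sort -u external_links -o external_links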
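The script above reads page URLs from a sitemap_links file, one per line, but I didn't show how to build that file from sitemap.xml. Here is a minimal sketch of one way to do it. The function name sitemap_to_links is hypothetical, and the sketch assumes every page URL sits inside a single-line <loc>...</loc> element, as in the standard sitemap format. It is not a real XML parser.

```shell
#!/bin/sh
# Sketch: turn a sitemap into a plain list of URLs, one per line.
# sitemap_to_links is a hypothetical helper name.
# Assumption: each URL is inside <loc>...</loc> and does not span several lines.
# Usage: sitemap_to_links SITEMAP_FILE
sitemap_to_links() {
    # -o prints every <loc> element on its own line, even in a minified sitemap
    grep -Eo '<loc>[^<]+</loc>' "$1" |
        sed -e 's/<loc>//' -e 's/<\/loc>//' \
            -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
            -e 's/&amp;/\&/g'   # undo XML escaping of ampersands in URLs
}
```

For example: sitemap_to_links sitemap.xml > sitemap_links. If the file is a sitemap index, its <loc> entries point to other sitemap files rather than pages, so each of those needs to be downloaded and passed through the same function.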

Good luck! 😉
