Different ways to get external links from a website
When I got this task and started working on it, I noted for myself that it is quite interesting. It can be solved in a few ways, depending on your knowledge and on whether you have access to the database or to the WordPress dashboard with admin rights.
Problem: get all external links from a website, whether we have access to the database, to WordPress, or to neither.
Solution:
1. SEO Spider. It's a cross-platform application (Ubuntu, macOS, Windows). There are free and paid versions: the free version is limited to 500 URLs, while the paid version costs £149.00 per year with no limits. If you have a small site and no other access, this may be the solution for you, but I moved on.
2. Broken Link Checker. This is a free plugin for WordPress. If you have the right access level, you can try this route; see the plugin's page for more details.
3. And then there are more interesting solutions. For example, suppose we have a huge website with tens of thousands of posts, many custom post types, etc., and we have access to a database dump file. We'll need Linux and three commands:
3.1. Get all links from the database dump:
grep -Eo 'https?://[A-Za-z0-9.-]+' database_name.sql > all_links
3.2. Get only the external links from the all_links file:
sed '/your_domain_name/d' all_links > external_links
3.3. Get the unique external links from the external_links file:
sort -u external_links > unique_external_links
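For reference, the same three steps can be chained into a single pipeline. This is just a sketch using the same placeholder names as above (database_name.sql for the dump, your_domain_name for our own domain); substitute your real names:

# extract scheme + host of every URL in the dump, drop our own domain, de-duplicate
grep -Eo 'https?://[A-Za-z0-9.-]+' database_name.sql \
  | sed '/your_domain_name/d' \
  | sort -u > unique_external_links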
4. Now let's complicate the previous task a bit. We have a huge site, we don't have any access rights, and we still need to get all external links for the website. Look at sitemap.xml: I was lucky and my site had this file. Alternatively, we can use a sitemap generator service, but free versions are usually limited to 500 URLs. The next step is to write a script that opens each link from the sitemap file, gets all links from the opened page, compares them against our domain name and, if they differ, writes the URL to our file. We work on Linux, and we also need to install the text browser lynx and the awk utility.
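The script below reads URLs from a plain-text file called sitemap_links, one per line. A minimal sketch of how to produce that file from sitemap.xml (assuming it is a plain urlset sitemap, not a sitemap index):

# pull every <loc> entry out of the sitemap and strip the surrounding tags
grep -Eo '<loc>[^<]+</loc>' sitemap.xml \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' > sitemap_links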
#!/bin/sh
# For each URL in sitemap_links: dump the page with lynx, keep the link targets,
# drop our own domain, reduce each link to scheme + host and append it to the results.
q=0
while read -r url
do
    lynx -dump "$url" | awk '/http/{print $2}' | sed '/www.site_name.org/d' | grep -Eo '^https?://[^/]+' | sort -u >> external_links
    q=$(($q+1))
    echo "$q $url"
done < sitemap_links
# Finally, de-duplicate everything collected above.
sort -u external_links > unique_external_links
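A quick way to run it (assuming the script is saved as get_external_links.sh, a hypothetical name, in the same directory as sitemap_links):

chmod +x get_external_links.sh
./get_external_links.sh
wc -l unique_external_links    # how many distinct external hosts were found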
Good luck! 😉