

@parker-jana
Created November 1, 2016 19:12
| I'm interested in this logic ("the url contains the root domain as part of the domain or subdomain"):
| https://github.com/amelehy/email_parse/blob/adc7497de476598f743cb0a83e752407cc069ca0/parser.py#L68-L71
| Can you talk me through what it does (what do you expect ROOT_DOMAIN to be? which urls will pass the check and which fail?) and why you chose to make it work that way?
So I expect ROOT_DOMAIN to be the base domain of the initial URL passed in by the user. For example, if I were to pass "mit.edu", "jana.com", or "drive.google.com" as an argument, ROOT_DOMAIN would be "mit", "jana", or "google" respectively.
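For reference, here is a minimal sketch of how ROOT_DOMAIN could be derived that way. I'm using tldextract purely for illustration; the actual parser.py may extract it differently:

```python
# Illustrative sketch only; parser.py may derive ROOT_DOMAIN another way.
# tldextract is an assumption here, chosen because it splits hostnames cleanly.
import tldextract

def get_root_domain(url_or_host):
    # tldextract splits "drive.google.com" into
    # subdomain="drive", domain="google", suffix="com"
    return tldextract.extract(url_or_host).domain

print(get_root_domain("mit.edu"))           # "mit"
print(get_root_domain("jana.com"))          # "jana"
print(get_root_domain("drive.google.com"))  # "google"
```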
In the section where it checks whether ROOT_DOMAIN is part of the domain or subdomain of each parsed URL, the idea is to determine which of the URLs gathered from the page actually belong to (or are related to) the original website that was intended to be crawled, and which are "external links."
So, for example, if I were to pass "www.jana.com", the script will gather all of the URLs from the page, but we are really only interested in those that are specific to the Jana website. As an example, the script will gather these two URLs (among many others):
//www.jana.com/product
http://cta-redirect.hubspot.com/cta/redirect/2235268/06883901-bdf9-40b6-84d1-2f0ae3ee548c
The first URL (which has "jana" as the domain) will pass our check that the ROOT_DOMAIN (which is "jana" as well) is contained within the domain or subdomain.
The second URL (which has "hubspot" as the domain and "cta-redirect" as the subdomain) will fail our check that the ROOT_DOMAIN (which is still "jana") is contained within the domain or subdomain.
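In code, the check described above looks roughly like this. Again, this is only a sketch: tldextract and the name belongs_to_site are my own illustrative choices, not necessarily what parser.py does:

```python
import tldextract

ROOT_DOMAIN = "jana"  # derived from the starting URL, e.g. "www.jana.com"

def belongs_to_site(url):
    # Keep a URL only if ROOT_DOMAIN appears in its domain or subdomain.
    ext = tldextract.extract(url)
    return ROOT_DOMAIN in ext.domain or ROOT_DOMAIN in ext.subdomain

print(belongs_to_site("//www.jana.com/product"))               # True: domain is "jana"
print(belongs_to_site("http://cta-redirect.hubspot.com/cta/"))  # False: "jana" not present
```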
I chose to use this logic (checking whether each URL belongs to the site intended to be crawled) because it was the best way I could think of to determine which URLs from a given webpage point to other places on the same website. From what I've read, there isn't a reliable way to determine all of the available URLs or pages that are accessible on a website, aside from parsing a sitemap (if it exists) or manually trying all sorts of combinations of the base URL with additional pages appended to the end, which seems unrealistic and kind of ridiculous. There could very well be a better way to determine all of the visitable URLs for a website, but this was the most logical solution that I could think of.
*Now that you've mentioned it and I've taken a more thorough look at this, I realize that this check will fail for relative links like "/contact" or "contact.html". I'll have to take another pass at this function and make sure that it picks up relative paths, sorry about that.
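One likely fix (again just a sketch, with illustrative names) is to resolve every href against the URL of the page it was found on before running the domain check, e.g. with urllib.parse.urljoin:

```python
from urllib.parse import urljoin

page_url = "http://www.jana.com/product"  # the page the links were scraped from

for href in ["/contact", "contact.html", "//www.jana.com/about",
             "http://cta-redirect.hubspot.com/"]:
    # Relative hrefs become absolute URLs on the same site;
    # already-absolute URLs pass through unchanged.
    absolute = urljoin(page_url, href)
    print(href, "->", absolute)

# /contact      -> http://www.jana.com/contact
# contact.html  -> http://www.jana.com/contact.html
# //www.jana.com/about             -> http://www.jana.com/about
# http://cta-redirect.hubspot.com/ -> http://cta-redirect.hubspot.com/
```

After resolving, the same domain/subdomain check can be applied to the absolute URL.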