@gruber
Last active May 29, 2024 00:03
Liberal, Accurate Regex Pattern for Matching All URLs
The regex patterns in this gist are intended to match any URLs,
including "mailto:foo@example.com", "x-whatever://foo", etc. For a
pattern that attempts only to match web URLs (http, https), see:
https://gist.github.com/gruber/8891611
# Single-line version of pattern:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
# Multi-line commented version of same pattern:
(?xi)
\b
(                                         # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                          # URL protocol and colon
    (?:
      /{1,3}                              # 1-3 slashes
      |                                   #   or
      [a-z0-9%]                           # Single letter or digit or '%'
                                          # (Trying not to match e.g. "URI::Escape")
    )
    |                                     #   or
    www\d{0,3}[.]                         # "www.", "www1.", "www2." … "www999."
    |                                     #   or
    [a-z0-9.\-]+[.][a-z]{2,4}/            # looks like domain name followed by a slash
  )
  (?:                                     # One or more:
    [^\s()<>]+                            # Run of non-space, non-()<>
    |                                     #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)    # balanced parens, up to 2 levels
  )+
  (?:                                     # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)    # balanced parens, up to 2 levels
    |                                     #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct char
  )
)
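
As a rough illustration of how the single-line pattern might be used, here is a minimal Python sketch. The helper name `extract_urls` and the sample sentence are made up for this example and are not part of the gist; the pattern string is copied verbatim and relies on Python's `re` module accepting the leading `(?i)` flag.

```python
import re

# Single-line pattern copied verbatim from the gist (inline (?i) flag kept).
URL_PATTERN = re.compile(
    r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)


def extract_urls(text):
    """Return every URL-like substring (capture group 1) found in text.

    Hypothetical helper for illustration only.
    """
    return [match.group(1) for match in URL_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = (
        "Mail mailto:foo@example.com, see http://example.com/foo_(bar), "
        "or www.example.org."
    )
    for url in extract_urls(sample):
        print(url)
```

Note that `finditer` with `group(1)` is used rather than `findall`, because the pattern contains several capturing groups and `findall` would return a tuple per match. Trailing punctuation such as the comma after `foo_(bar)` and the final period is left out of the match, while the balanced parentheses are kept.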
@glensc

glensc commented Dec 27, 2021

Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect:

    [^\s`!()\[\]{};:'".,<>?«»“”‘’]		# not a space or one of these punct char

« is two bytes in UTF-8: "\xc2\xab". When the pattern is matched byte by byte, the bracket expression therefore accepts \xc2 and \xab on their own, anywhere in the input, in any order and not even close to each other!

php -r '$s="\xab \xc2 \xc2 \xab"; $v=preg_match_all("/[«]/", $s, $m); var_dump([$v, $m, $s]);' > foo.txt

You need to open foo.txt with a program that can display raw bytes.

@solaluset

Putting wide characters (Unicode characters encoded in more than 1 byte) into a bracket expression ([ ]) is incorrect

It depends on the language and regex library. It works fine in Python and Node.js.
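
Both observations can be reproduced in Python, which may help show where the difference comes from. The sketch below is only an illustration (the variable names are made up): matching a `bytes` pattern against `bytes` input behaves like the PHP example above, because the class `[«]` becomes the two separate bytes \xc2 and \xab, while matching a `str` pattern against `str` input treats « as a single character. As far as I know, PHP's `u` pattern modifier enables the equivalent UTF-8 behaviour in `preg_*` functions.

```python
import re

# Byte-oriented matching: the class contains two independent bytes,
# \xc2 and \xab, so each of them matches on its own, in any order.
byte_pattern = re.compile("[«]".encode("utf-8"))  # same as b"[\xc2\xab]"
byte_subject = b"\xab \xc2 \xc2 \xab"             # not valid UTF-8 text
byte_matches = byte_pattern.findall(byte_subject)

# Character-oriented matching: the class contains the single
# code point U+00AB, so only a real « matches.
text_pattern = re.compile("[«]")
text_subject = "a « b"
text_matches = text_pattern.findall(text_subject)

if __name__ == "__main__":
    print(len(byte_matches), byte_matches)
    print(len(text_matches), text_matches)
```

In byte mode all four stray bytes match, which is the problem described in the first comment; in text mode only the one genuine « matches, which is why the pattern works as intended in Unicode-aware engines.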