
PHP regular expressions examples

Mastering Regular Expressions quickly covers the basics of regular-expression syntax, then delves into the mechanics of expression processing, common pitfalls, performance issues, and implementation-specific differences. Written in an engaging style and sprinkled with solutions to complex real-world problems, MRE offers a wealth of information that you can put to use.

I will start with some simple usage examples of regular expressions and continue with a long list of patterns for situations where we would normally need a regex. We will use simple functions that return TRUE or FALSE. $regex is the regular expression to match against and $text is the text to search:

function do_reg($text, $regex)
{
    if (preg_match($regex, $text)) {
        return TRUE;
    }
    else {
        return FALSE;
    }
}

The next function returns the part of a given string ($text) matched by the regex ($regex), using $regs to store the match and its groups. By changing $regs[0] to $regs[1] we get the text matched by a capturing group (in this case group 1) instead of the whole match. A capturing group can also have a name ($regs['groupname']):

function do_reg($text, $regex, $regs)
{
    if (preg_match($regex, $text, $regs)) {
        $result = $regs[0];
    }
    else {
        $result = "";
    }
    return $result;
}
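
As a small illustration of the named group mentioned above, here is a minimal sketch that pulls the year out of a date string. The function name extract_named_group, the group name year and the sample pattern are made up for this example and are not part of the functions above.

```php
<?php
// Hypothetical helper: return the text captured by a named group,
// or an empty string when the regex does not match or the group is absent.
function extract_named_group($text, $regex, $group)
{
    if (preg_match($regex, $text, $regs) && isset($regs[$group])) {
        return $regs[$group];
    }
    return "";
}

// (?<year>...) names the capturing group; the same text is also stored in $regs[1].
$year = extract_named_group(
    'Released on 2014-03-09',
    '/(?<year>[0-9]{4})-[0-9]{2}-[0-9]{2}/',
    'year'
);
```

Named groups keep working when another group is later added in front of them, which is the main reason to prefer them over numeric indexes in longer patterns.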

The following function will return an array of all regex matches in a given string ($text):

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    return $result[0];
}

Next we can iterate (loop) over all matches in a string ($text) and output the results:

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
    for ($i = 0; $i < count($result[0]); $i++) {
        echo $result[0][$i], "\n";
    }
}

Extending the above, we can iterate over all matches and their capturing groups in a string ($text):

function do_reg($text, $regex)
{
    preg_match_all($regex, $text, $result, PREG_SET_ORDER);
    // $result[$matchi][0] is the whole match; higher indexes are its capturing groups.
    for ($matchi = 0; $matchi < count($result); $matchi++) {
        for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
            echo $result[$matchi][$backrefi], "\n";
        }
    }
}
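
The two ordering flags used in the functions above return the same data in different shapes. The following sketch, with sample text and a pattern invented for illustration, shows how the result arrays are laid out in each case.

```php
<?php
// Sample data and pattern are invented for illustration.
$text  = 'a=1, b=2';
$regex = '/(\w)=(\d)/';

// PREG_PATTERN_ORDER: $byGroup[0] holds all full matches,
// $byGroup[1] all first groups, $byGroup[2] all second groups.
preg_match_all($regex, $text, $byGroup, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: $byMatch[0] holds the first match with its groups,
// $byMatch[1] the second match with its groups.
preg_match_all($regex, $text, $byMatch, PREG_SET_ORDER);
```

PREG_PATTERN_ORDER is convenient when only one column of data is needed; PREG_SET_ORDER is easier to loop over when each match is processed as a unit.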

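REGULAR EXPRESSION EXAMPLES BY SITUATIONS AND NEEDS

Most of the patterns below are shown as bare regex source, without PHP delimiters. Before passing one to a preg_ function, wrap it in delimiters such as / or ~ (escaping the delimiter wherever it occurs in the pattern), add the i modifier where case-insensitive matching is needed (the email and URL patterns, for instance, list only upper-case letters), add the m modifier where ^ and $ are meant to match on every line, and substitute your own values for placeholders such as %REGEX% and %WORD%. Inside a single-quoted PHP string, a literal backslash pair \\ in a pattern has to be written as \\\\.

Addresses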

//Address: State code (US)
'/\b(?:A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])\b/'

//Address: ZIP code (US)
'\b[0-9]{5}(?:-[0-9]{4})?\b'
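
The ZIP code pattern above has no delimiters, so preg_match() will not accept it as written. A minimal sketch of how such a bare pattern could be wrapped follows; the helper name matches_bare_pattern and the choice of ~ as delimiter are assumptions made for this example.

```php
<?php
// Hypothetical helper: wrap a bare pattern in ~ delimiters and test it against $text.
// Assumes the pattern itself contains no ~ character.
function matches_bare_pattern($pattern, $text, $modifiers = '')
{
    return preg_match('~' . $pattern . '~' . $modifiers, $text) === 1;
}

$zip      = '\b[0-9]{5}(?:-[0-9]{4})?\b';
$hasZip   = matches_bare_pattern($zip, 'Beverly Hills, CA 90210-1234');
$hasNoZip = matches_bare_pattern($zip, 'Order number 1234');
```

The optional third argument takes modifiers such as i or m for the patterns that need them.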

Columns

//Columns: Match a regex starting at a specific column on a line.
'^.{%SKIPAMOUNT%}(%REGEX%)'

//Columns: Range of characters on a line, captured into backreference 1
//Iterate over all matches to extract a column of text from a file
//E.g. to grab the characters in columns 8..10, set SKIPAMOUNT to 7, and CAPTUREAMOUNT to 3
'^.{%SKIPAMOUNT%}(.{%CAPTUREAMOUNT%})'

Credit cards

//Credit card: All major cards
'^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$'

//Credit card: American Express
'^3[47][0-9]{13}$'

//Credit card: Diners Club
'^3(?:0[0-5]|[68][0-9])[0-9]{11}$'

//Credit card: Discover
'^6011[0-9]{12}$'

//Credit card: MasterCard
'^5[1-5][0-9]{14}$'

//Credit card: Visa
'^4[0-9]{12}(?:[0-9]{3})?$'

//Credit card: remove non-digits
'/[^0-9]+/'
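
The card patterns expect digits only, so the "remove non-digits" regex is normally applied first. Here is a minimal sketch combining the two steps for Visa; the function name looks_like_visa is made up for this example, and the check covers only the prefix and length encoded in the pattern.

```php
<?php
// Hypothetical helper: strip everything except digits, then test against the Visa pattern.
function looks_like_visa($input)
{
    $digits = preg_replace('/[^0-9]+/', '', $input);
    return preg_match('/^4[0-9]{12}(?:[0-9]{3})?$/', $digits) === 1;
}

$spaced = looks_like_visa('4111 1111 1111 1111');
$other  = looks_like_visa('5500-0000-0000-0004');
```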

CSV

//CSV: Change delimiter
//Changes the delimiter from a comma into a tab.
//The capturing group makes sure delimiters inside double-quoted entries are ignored.
'("[^"\r\n]*")?,(?![^",\r\n]*"$)'

//CSV: Complete row, all fields.
//Match complete rows in a comma-delimited file that has 3 fields per row,
//capturing each field into a backreference. 
//To match CSV rows with more or fewer fields, simply duplicate or delete the capturing groups.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)$'

//CSV: Complete row, certain fields.
//Set %SKIPLEAD% to the number of fields you want to skip at the start, and %SKIPTRAIL% to
//the number of fields you want to ignore at the end of each row. 
//This regex captures 3 fields into backreferences.  To capture more or fewer fields,
//simply duplicate or delete the capturing groups.
'^(?:(?:"[^"\r\n]*"|[^,\r\n]*),){%SKIPLEAD%}("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)(?:,(?:"[^"\r\n]*"|[^,\r\n]*)){%SKIPTRAIL%}$'

//CSV: Partial row, certain fields
//Match the first SKIPLEAD+3 fields of each row in a comma-delimited file that has SKIPLEAD+3
//or more fields per row.  The 3 fields after SKIPLEAD are each captured into a backreference.
//All other fields are ignored.  Rows that have fewer than SKIPLEAD+3 fields are skipped.
//To capture more or fewer fields, simply duplicate or delete the capturing groups.
'^(?:(?:"[^"\r\n]*"|[^,\r\n]*),){%SKIPLEAD%}("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)'

//CSV: Partial row, leading fields
//Match the first 3 fields of each row in a comma-delimited file that has 3 or more fields per row.
//The first 3 fields are each captured into a backreference.  All other fields are ignored.
//Rows that have fewer than 3 fields are skipped.  To capture more or fewer fields,
//simply duplicate or delete the capturing groups.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)'

//CSV: Partial row, variable leading fields
//Match the first 3 fields of each row in a comma-delimited file.
//The first 3 fields are each captured into a backreference.
//All other fields are ignored.  If a row has fewer than 3 fields, some of the backreferences
//will remain empty.  To capture more or fewer fields, simply duplicate or delete the capturing groups.
//The question mark after each group makes that group optional.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)?,("[^"\r\n]*"|[^,\r\n]*)?'

Dates

//Date d/m/yy and dd/mm/yyyy
//1/1/00 through 31/12/99 and 01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\b'

//Date dd/mm/yyyy
//01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}'

//Date m/d/y and mm/dd/yyyy
//1/1/99 through 12/31/99 and 01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
//Accepts dashes, spaces, forward slashes and dots as date separators
'\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b'

//Date mm/dd/yyyy
//01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)[0-9]{2}'

//Date yy-m-d or yyyy-mm-dd
//00-1-1 through 99-12-31 and 1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'\b(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])\b'

//Date yyyy-mm-dd
//1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'(19|20)[0-9]{2}[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
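
All the date patterns above accept impossible dates such as February 31st. One way to close that gap is to let the regex check the shape and PHP's checkdate() check the calendar. The sketch below is an assumption of how that could look: the function name is_real_iso_date is made up, and the pattern is a variant of the yyyy-mm-dd one that is anchored, accepts only dashes and captures the full year.

```php
<?php
// Hypothetical helper: accept yyyy-mm-dd only when the calendar date exists.
function is_real_iso_date($text)
{
    $regex = '/^((?:19|20)[0-9]{2})-(0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])$/';
    if (!preg_match($regex, $text, $m)) {
        return false;
    }
    // checkdate() expects month, day, year in that order.
    return checkdate((int) $m[2], (int) $m[3], (int) $m[1]);
}

$leapDay = is_real_iso_date('2024-02-29');
$feb31   = is_real_iso_date('2023-02-31');
```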

Delimiters

//Delimiters: Replace commas with tabs
//Replaces commas with tabs, except for commas inside double-quoted strings
'((?:"[^",]*+")|[^,]++)*+,'

Email addresses

//Email address
//Use this version to seek out email addresses in random documents and texts.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum. 
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

//Email address (anchored)
//Use this anchored version to check if a valid email address was entered.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Requires the “case insensitive” option to be ON.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

//Email address (anchored; no consecutive dots)
//Use this anchored version to check if a valid email address was entered.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'^[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$'

//Email address (no consecutive dots)
//Use this version to seek out email addresses in random documents and texts.
//Improves on the original email address regex by excluding addresses with consecutive dots such as john@aol...com
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b'

//Email address (specific TLDs)
//Does not match email addresses using an IP address instead of a domain name.
//Matches all country code top level domains, and specific common top level domains.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|biz|info|name|aero|jobs|museum)$'

//Email address: Replace with HTML link
'\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b'
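
The last pattern captures the address in group 1 so that it can be reused in a replacement. A minimal sketch of the replacement step, assuming a made-up function name link_emails and the i modifier for case-insensitive matching:

```php
<?php
// Hypothetical helper: turn plain email addresses (with or without a mailto: prefix) into links.
function link_emails($text)
{
    $regex = '/\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i';
    // $1 is the address without the optional mailto: prefix.
    return preg_replace($regex, '<a href="mailto:$1">$1</a>', $text);
}

$html = link_emails('Write to john@example.com today.');
```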

HTML

//HTML comment
'<!--.*?-->'

//HTML file
//Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
//Performance will be terrible on HTML files that miss some of the tags
//(and thus won’t be matched by this regular expression).  Use the atomic version instead when your search
//includes such files (the atomic version will also fail invalid files, but much faster).
'<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>'

//HTML file (atomic)
//Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
//Atomic grouping maintains the regular expression’s performance on invalid HTML files.
'<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>'

//HTML tag
//Matches the opening and closing pair of whichever HTML tag comes next.
//The name of the tag is stored into the first capturing group.
//The text between the tags is stored into the second capturing group.
'<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>'

//HTML tag
//Matches the opening and closing pair of a specific HTML tag.
//Anything between the tags is stored into the first capturing group.
//Does NOT properly match tags nested inside themselves.
'<%TAG%[^>]*>(.*?)</%TAG%>'

//HTML tag
//Matches any opening or closing HTML tag, without its contents.
'</?[a-z][a-z0-9]*[^<>]*>'

IP addresses

//IP address
//Matches 0.0.0.0 through 999.999.999.999
//Use this fast and simple regex if you know the data does not contain invalid IP addresses.
'\b([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\b'

//IP address
//Matches 0.0.0.0 through 999.999.999.999
//Use this fast and simple regex if you know the data does not contain invalid IP addresses,
//and you don’t need access to the individual IP numbers.
'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'

//IP address
//Matches 0.0.0.0 through 255.255.255.255
//Use this regex to match IP numbers with accuracy, without access to the individual IP numbers.
'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

//IP address
//Matches 0.0.0.0 through 255.255.255.255
//Use this regex to match IP numbers with accuracy.
//Each of the 4 numbers is stored into a capturing group, so you can access them for further processing.
'\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
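
To show what "further processing" of the four captured numbers might look like, here is a minimal sketch that validates a whole string as an IP address and returns the octets as integers. The function name parse_ipv4 is made up, and the word boundaries are swapped for ^ and $ anchors because the whole input is being validated rather than searched.

```php
<?php
// Hypothetical helper: return the four octets as integers,
// or null when $text is not a valid dotted IPv4 address.
function parse_ipv4($text)
{
    $octet = '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)';
    $regex = '/^' . $octet . '\.' . $octet . '\.' . $octet . '\.' . $octet . '$/';
    if (!preg_match($regex, $text, $m)) {
        return null;
    }
    // $m[0] is the whole match; $m[1] to $m[4] are the captured octets.
    return array_map('intval', array_slice($m, 1));
}

$octets  = parse_ipv4('192.168.0.1');
$invalid = parse_ipv4('256.1.1.1');
```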

Lines

//Lines: Absolutely blank (no whitespace)
//Regex match does not include line break after the line.
'^$'

//Lines: Blank (may contain whitespace)
//Regex match does not include line break after the line.
'^[ \t]*$'

//Lines: Delete absolutely blank lines
//Regex match includes line break after the line.
'^\r?\n'

//Lines: Delete blank lines
//Regex match includes line break after the line.
'^[ \t]*$\r?\n'

//Lines: Delete duplicate lines
//This regex matches two or more lines, each identical to the first line. 
//It deletes all of them, except the first.
'^(.*)(\r?\n\1)+$'
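
A minimal sketch of the duplicate-line pattern in use. The function name delete_duplicate_lines is made up for this example; the m modifier is needed so that ^ and $ match at every line, and only consecutive duplicates are affected.

```php
<?php
// Hypothetical helper: collapse runs of identical consecutive lines into one line.
function delete_duplicate_lines($text)
{
    // The m modifier lets ^ and $ match at the start and end of every line.
    return preg_replace('/^(.*)(\r?\n\1)+$/m', '$1', $text);
}

$deduped = delete_duplicate_lines("a\na\na\nb");
```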

//Lines: Truncate a line after a regex match.
//The combined regex is guaranteed to match only once on each line.
//If the original regex you specified could match more than once,
//the line will be truncated after the last match.
preg_replace('/^(.*(?:%REGEX%)).*$/m', '$1', $text);

//Lines: Truncate a line before a regex match.
//If the regex matches more than once on the same line, everything before the last match is deleted.
preg_replace('/^.*(%REGEX%)/m', '$1', $text);

//Lines: Truncate a line before and after a regex match.
//This will delete everything from the line not matched by the regular expression.
preg_replace('/^.*(%REGEX%).*$/m', '$1', $text);

Logs

//Logs: Apache web server
//Successful hits to HTML files only.  Useful for counting the number of page views.
'^((?#client IP or domain name)\S+)\s+((?#basic authentication)\S+\s+\S+)\s+\[((?#date and time)[^]]+)\]\s+"(?:GET|POST|HEAD) ((?#file)/[^ ?"]+?\.html?)\??((?#parameters)[^ ?"]+)? HTTP/[0-9.]+"\s+(?#status code)200\s+((?#bytes transferred)[-0-9]+)\s+"((?#referrer)[^"]*)"\s+"((?#user agent)[^"]*)"$'

//Logs: Apache web server
//404 errors only
'^((?#client IP or domain name)\S+)\s+((?#basic authentication)\S+\s+\S+)\s+\[((?#date and time)[^]]+)\]\s+"(?:GET|POST|HEAD) ((?#file)[^ ?"]+)\??((?#parameters)[^ ?"]+)? HTTP/[0-9.]+"\s+(?#status code)404\s+((?#bytes transferred)[-0-9]+)\s+"((?#referrer)[^"]*)"\s+"((?#user agent)[^"]*)"$'

Numbers

//Number: Currency amount
//Optional thousands separators; optional two-digit fraction
'\b[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?\b'

//Number: Currency amount
//Optional thousands separators; mandatory two-digit fraction
'\b[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}\b'

//Number: floating point
//Matches an integer or a floating point number with mandatory integer part.  The sign is optional.
'[-+]?\b[0-9]+(\.[0-9]+)?\b'

//Number: floating point
//Matches an integer or a floating point number with optional integer part.  The sign is optional.
'[-+]?\b[0-9]*\.?[0-9]+\b'

//Number: hexadecimal (C-style)
'\b0[xX][0-9a-fA-F]+\b'

//Number: Insert thousands separators
//Replaces 123456789.00 with 123,456,789.00
'(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))'
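
The thousands-separator pattern matches an empty position rather than any characters, so the replacement text is simply the separator itself. A minimal sketch, with the function name add_thousands_separators made up for this example:

```php
<?php
// Hypothetical helper: insert commas as thousands separators into the digits of $text.
function add_thousands_separators($text)
{
    // The lookbehind requires a digit on the left; the lookahead requires
    // a multiple of three digits on the right, followed by a non-digit or the end.
    return preg_replace('/(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))/', ',', $text);
}

$formatted = add_thousands_separators('123456789.00');
```

Note that the pattern treats any run of digits the same way, so a fraction with four or more digits would also receive separators.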

//Number: integer
//Will match 123 and 456 as separate integer numbers in 123.456
'\b\d+\b'

//Number: integer
//Does not match numbers like 123.456
'(?<!\S)\d++(?!\S)'

//Number: integer with optional sign
'[-+]?\b\d+\b'

//Number: scientific floating point
//Matches an integer or a floating point number.
//Integer and fractional parts are both optional.
'[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+\b)(?:[eE][-+]?[0-9]+\b)?'

//Number: scientific floating point
//Matches an integer or a floating point number with optional integer part.
//Both the sign and exponent are optional.
'[-+]?\b[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?\b'

Passwords

//Password complexity
//Tests if the input consists of 6 or more letters, digits, underscores and hyphens.
//The input must contain at least one upper case letter, one lower case letter and one digit.
'\A(?=[-_a-zA-Z0-9]*?[A-Z])(?=[-_a-zA-Z0-9]*?[a-z])(?=[-_a-zA-Z0-9]*?[0-9])[-_a-zA-Z0-9]{6,}\z'

//Password complexity
//Tests if the input consists of 6 or more characters.
//The input must contain at least one upper case letter, one lower case letter and one digit.
'\A(?=[-_a-zA-Z0-9]*?[A-Z])(?=[-_a-zA-Z0-9]*?[a-z])(?=[-_a-zA-Z0-9]*?[0-9])\S{6,}\z'
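
Each lookahead in the password patterns checks one requirement without consuming any text, and the final part checks the length and allowed characters. A minimal sketch using the first pattern, with the function name is_complex_password made up for this example:

```php
<?php
// Hypothetical helper built on the first password pattern above.
function is_complex_password($password)
{
    $regex = '/\A(?=[-_a-zA-Z0-9]*?[A-Z])'   // at least one upper case letter
           . '(?=[-_a-zA-Z0-9]*?[a-z])'      // at least one lower case letter
           . '(?=[-_a-zA-Z0-9]*?[0-9])'      // at least one digit
           . '[-_a-zA-Z0-9]{6,}\z/';         // 6 or more allowed characters, nothing else
    return preg_match($regex, $password) === 1;
}

$ok      = is_complex_password('Secret1');
$noUpper = is_complex_password('secret1');
```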

File paths

//Path: Windows
'\b[a-z]:\\[^/:*?"<>|\r\n]*'

//Path: Windows
//Different elements of the path are captured into backreferences.
'\b((?#drive)[a-z]):\\((?#folder)[^/:*?"<>|\r\n]*\\)?((?#file)[^\\/:*?"<>|\r\n]*)'

//Path: Windows or UNC
'(?:(?#drive)\b[a-z]:|\\\\[a-z0-9]+)\\[^/:*?"<>|\r\n]*'

//Path: Windows or UNC
//Different elements of the path are captured into backreferences.
'((?#drive)\b[a-z]:|\\\\[a-z0-9]+)\\((?#folder)[^/:*?"<>|\r\n]*\\)?((?#file)[^\\/:*?"<>|\r\n]*)'

Phone numbers

//Phone Number (North America)
//Matches 3334445555, 333.444.5555, 333-444-5555, 333 444 5555, (333) 444 5555 and all combinations thereof.
//Replaces all those with (333) 444-5555
preg_replace('/\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})/', '($1) $2-$3', $text);

//Phone Number (North America)
//Matches 3334445555, 333.444.5555, 333-444-5555, 333 444 5555, (333) 444 5555 and all combinations thereof.
'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'

Postal codes

//Postal code (Canada)
'\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b'

//Postal code (UK)
'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'

Programming

//Programming: # comment
//Single-line comment started by # anywhere on the line
'#.*$'

//Programming: # preprocessor statement
//Started by # at the start of the line, possibly preceded by some whitespace.
'^\s*#.*$'

//Programming: /* comment */
//Does not match nested comments.  Most languages, including C, Java, C#, etc.
//do not allow comments to be nested.  I.e. the first */ closes the comment.
'/\*.*?\*/'

//Programming: // comment
//Single-line comment started by // anywhere on the line
'//.*$'

//Programming: GUID
//Microsoft-style GUID, numbers only.
'[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}'

//Programming: GUID
//Microsoft-style GUID, with optional parentheses or braces.
//(Long version, if your regex flavor doesn’t support conditionals.)
'[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}|\([A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}\)|\{[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}\}'

//Programming: GUID
//Microsoft-style GUID, with optional parentheses or braces.
//Short version, illustrating the use of regex conditionals.  Not all regex flavors support conditionals. 
//Also, when applied to large chunks of data, the regex using conditionals will likely be slower
//than the long version.  Straight alternation is much easier to optimize for a regex engine.
'(?:(\()|(\{))?[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}(?(1)\))(?(2)\})'

//Programming: Remove escapes
//Remove backslashes used to escape other characters
preg_replace('/\\\\(.)/', '$1', $text);

//Programming: String
//Quotes may appear in the string when escaped with a backslash.
//The string may span multiple lines.
'"[^"\\]*(?:\\.[^"\\]*)*"'

//Programming: String
//Quotes may appear in the string when escaped with a backslash.
//The string cannot span multiple lines.
'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"'

//Programming: String
//Quotes may not appear in the string.  The string cannot span multiple lines.
'"[^"\r\n]*"'

Quotes

//Quotes: Replace smart double quotes with straight double quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('/[\x84\x93\x94]/', '"', $text);

//Quotes: Replace smart double quotes with straight double quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('/[\x{201C}\x{201D}\x{201E}\x{201F}\x{2033}\x{2036}]/u', '"', $text);

//Quotes: Replace smart single quotes and apostrophes with straight single quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('/[\x{2018}\x{2019}\x{201A}\x{201B}\x{2032}\x{2035}]/u', "'", $text);

//Quotes: Replace smart single quotes and apostrophes with straight single quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('/[\x82\x91\x92]/', "'", $text);

//Quotes: Replace straight apostrophes with smart apostrophes
preg_replace('/\b\'\b/', '’', $text);

//Quotes: Replace straight double quotes with smart double quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('/\B"\b([^"\x84\x93\x94\r\n]+)\b"\B/', "\x93$1\x94", $text);

//Quotes: Replace straight double quotes with smart double quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('/\B"\b([^"\x{201C}\x{201D}\x{201E}\x{201F}\x{2033}\x{2036}\r\n]+)\b"\B/u', '“$1”', $text);

//Quotes: Replace straight single quotes with smart single quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('/\B\'\b([^\'\x{2018}\x{2019}\x{201A}\x{201B}\x{2032}\x{2035}\r\n]+)\b\'\B/u', '‘$1’', $text);

//Quotes: Replace straight single quotes with smart single quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('/\B\'\b([^\'\x82\x91\x92\r\n]+)\b\'\B/', "\x91$1\x92", $text);

Escape

//Regex: Escape metacharacters
//Place a backslash in front of the regular expression metacharacters
preg_replace('/[][{}()*+?.\\\\^$|]/', '\\\\$0', $text);

Security

//Security: ASCII code characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Excludes tabs and line breaks.
'[\x00-\x08\x0B\x0C\x0E-\x1F]'

//Security: ASCII code characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Includes tabs and line breaks.
'[\x00-\x1F]'

//Security: Escape quotes and backslashes
//E.g. escape user input before inserting it into a SQL statement
preg_replace('/[\'"\\\\]/', '\\\\$0', $text);

//Security: Unicode code and unassigned characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Also matches any Unicode code point that is unused in the current Unicode standard,
//and thus should not occur in text as it cannot be displayed.
//Excludes tabs and line breaks.
'[^\P{C}\t\r\n]'

//Security: Unicode code and unassigned characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Also matches any Unicode code point that is unused in the current Unicode standard,
//and thus should not occur in text as it cannot be displayed.
//Includes tabs and line breaks.
'\p{C}'

//Security: Unicode code characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Excludes tabs and line breaks.
'[^\P{Cc}\t\r\n]'

//Security: Unicode code characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Includes tabs and line breaks.
'\p{Cc}'

SSN (Social security numbers)

//Social security number (US)
'\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b'

Trim

//Trim whitespace (including line breaks) at the end of the string
preg_replace('/\s+\z/', '', $text);

//Trim whitespace (including line breaks) at the start and the end of the string
preg_replace('/\A\s+|\s+\z/', '', $text);

//Trim whitespace (including line breaks) at the start of the string
preg_replace('/\A\s+/', '', $text);

//Trim whitespace at the end of each line
preg_replace('/[ \t]+$/m', '', $text);

//Trim whitespace at the start and the end of each line
preg_replace('/^[ \t]+|[ \t]+$/m', '', $text);

//Trim whitespace at the start of each line
preg_replace('/^[ \t]+/m', '', $text);

URLs

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferences 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'
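
PHP's preg functions accept the (?<name>...) syntax, so the named version can be used with preg_match() once it has delimiters. The sketch below is one possible way to do that; the function name split_url is made up, / is used as the delimiter (so every / in the pattern is escaped), and the i modifier is added because the character classes list only upper-case letters.

```php
<?php
// Hypothetical helper: split a URL into its named parts,
// or return null when the text does not contain a matching URL.
function split_url($url)
{
    $regex = '/\b(?<protocol>https?|ftp):\/\/(?<domain>[-A-Z0-9.]+)'
           . '(?<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?'
           . '(?<parameters>\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i';
    if (!preg_match($regex, $url, $m)) {
        return null;
    }
    $parts = array();
    foreach (array('protocol', 'domain', 'file', 'parameters') as $name) {
        // Optional groups that did not take part in the match may be missing from $m.
        $parts[$name] = isset($m[$name]) ? $m[$name] : '';
    }
    return $parts;
}

$parts = split_url('http://www.example.com/index.html?id=5');
```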

//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'

//URL: Replace URLs with HTML links
preg_replace('/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|]/i', '<a href="$0">$0</a>', $text);

Words

//Words: Any word NOT matching a particular regex
//This regex will match all words that cannot be matched by %REGEX%.
//Explanation: Observe that the negative lookahead and the \w are repeated together.
//This makes sure we test that %REGEX% fails at EVERY position in the word, and not just at any particular position.
'\b(?:(?!%REGEX%)\w)+\b'

//Words: Delete repeated words
//Find any word that occurs twice or more in a row.
//Delete all occurrences except the first.
preg_replace('/\b(\w+)(?:\s+\1\b)+/', '$1', $text);

//Words: Near, any order
//Matches word1 and word2, or vice versa, separated by at least 1 and at most 3 words
'\b(?:word1(?:\W+\w+){1,3}\W+word2|word2(?:\W+\w+){1,3}\W+word1)\b'

//Words: Near, list
//Matches any pair of words out of the list word1, word2, word3, separated by at least 1 and at most 6 words
'\b(word1|word2|word3)(?:\W+\w+){1,6}\W+(word1|word2|word3)\b'

//Words: Near, ordered
//Matches word1 and word2, in that order, separated by at least 1 and at most 3 words
'\bword1(?:\W+\w+){1,3}\W+word2\b'

//Words: Repeated words
//Find any word that occurs twice or more in a row.
'\b(\w+)\s+\1\b'

//Words: Whole word
'\b%WORD%\b'

//Words: Whole word
//Match one of the words from the list
'\b(?:word1|word2|word3)\b'

//Words: Whole word at the end of a line
//Whitespace permitted after the word
'\b%WORD%\s*$'

//Words: Whole word at the end of a line
'\b%WORD%$'

//Words: Whole word at the start of a line
'^%WORD%\b'

//Words: Whole word at the start of a line
//Whitespace permitted before the word
'^\s*%WORD%\b'
