Thursday 17 November 2016

TCL-Regular Expression Explanation

For more examples please refer to Regular Expression Examples In TCL.

What are Regular Expressions?
A regular expression, or RE, describes strings of characters (words, phrases, or any arbitrary text). It is a pattern that matches certain strings and does not match others. For example, you could write an RE to tell you whether a string contains a URL (World Wide Web Uniform Resource Locator, such as http://somehost/somefile.html). Regular expressions can be either broad and general or focused and precise.

A regular expression uses metacharacters (characters that assume special meaning for matching other characters) such as *, [], $ and . (dot). For example, the RE [Hh]ello!* would match Hello and hello and Hello! (and hello!!!!!). The RE [Hh](ello|i)!* would match Hello and Hi and Hi! (and so on). A backslash (\) disables the special meaning of the following character, so you could match the string [Hello] with the RE \[Hello\]
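The patterns from the paragraph above can be tried out with a short script. This is only a sketch: the variable names re1 and re2 and the test words are illustrative, and ^ and $ are added so that the whole word has to match rather than just a part of it.

```tcl
# The two sample patterns, anchored so the entire word must match.
set re1 {^[Hh]ello!*$}
set re2 {^[Hh](ello|i)!*$}

foreach word {Hello hello!!!!! Hi! Help} {
    puts "$word -> re1: [regexp $re1 $word], re2: [regexp $re2 $word]"
}

# A backslash removes the special meaning of [ and ].
set literal [regexp {\[Hello\]} {say [Hello] now}]
puts "literal brackets found: $literal"
```

The braces around each pattern keep Tcl from interpreting the square brackets itself.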

Regular Expressions:
Regular expressions can be expressed in just a few rules.
.
Match any single character (e.g., m.d matches mad, mod, m3d, etc.)
[]
Bracket expression: Match any one of the enclosed characters (e.g., [a-z0-9_] matches a lowercase ASCII letter, a digit, or an underscore)
^
Start-of-string anchor: Match only at the start of a string (e.g., ^hi matches hi and his but not this)
$
End-of-string anchor: Match only at the end of a string (e.g., hi$ matches hi and chi but not this)
*
Zero-or-more quantifier: makes the previous part of the RE match zero or more times (e.g., M.*D matches MD, MAD, MooD, M.D, etc.)
?
Zero-or-one quantifier: makes the previous part of the RE match zero or one time (e.g., hi!? matches hi or hi!)
+
One-or-more quantifier: makes the previous part of the RE match one or more times (e.g., hi!+ matches hi! or hi!! or hi!!! or ...)
|
Alternation (vertical bar): Match just one alternative (e.g., this|that matches this or that)
()
Sub pattern: Group part of the RE. Many uses, such as:
  • Makes a quantifier apply to a group of text (e.g., ([0-9A-F][0-9A-F])+ matches groups of two hexadecimal digits: A9 or AB03 or 8A6E00, but not A or A2C).
  • Set limits for alternation (e.g., "Eat (this|that)!" matches "Eat this!" or "Eat that!").
  • Used for subpattern matching in the regexp and regsub commands.
\
Escape: Disables the special meaning of the following metacharacter (e.g., a\.* matches a, a., a.., and so on). Note that \ also has special meaning to the Tcl interpreter (and to applications, such as C compilers).

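Eg: set TestingDuts 1/2
    regexp {/} $TestingDuts
Here we want to check whether the string 1/2 contains a /.
The / character has no special meaning in a Tcl regular expression, so it can be written as is. The form {\/} also works, because a backslash in front of a non-alphanumeric character simply makes that character literal. The backslash matters for real metacharacters and for the backslash itself: to match a literal backslash followed by n, write {\\n}, whereas {\n} matches a newline character.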
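The rules listed above can be checked with a small table-driven script. This is a sketch only: the list name cases and its layout are illustrative, and some patterns are anchored with ^ and $ so that the whole input has to match.

```tcl
# Each row holds: pattern, input string, expected result of regexp (1 or 0).
set cases {
    {m.d}                     m3d    1
    {^hi}                     this   0
    {hi$}                     chi    1
    {^hi!?$}                  hi!!   0
    {^(this|that)$}           that   1
    {^([0-9A-F][0-9A-F])+$}   AB03   1
    {^([0-9A-F][0-9A-F])+$}   A2C    0
    {^a\.*$}                  a..    1
}
foreach {re input expected} $cases {
    # "--" ends the switches, so a pattern can never be mistaken for one.
    set got [regexp -- $re $input]
    puts [format "%-24s %-5s -> %d (expected %d)" $re $input $got $expected]
}
```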

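NOTE: regexp {([^\/]+)/(.*)} $port -- devNum port1
In the above command, -- comes after the string, in the matchVar position, so it is not a switch. It is simply a throwaway variable name that receives the full match, which we do not need. The text matched by the first () (everything before the /) is stored in devNum and the text matched by the second () (everything after the /) is stored in port1.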
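A short sketch of the command from the note, assuming port holds a value such as 1/2 (the value is made up for illustration). It shows that a name written in the matchVar position still receives the full match, even when the name is as odd as --.

```tcl
set port "1/2"

# "--" is in the matchVar position here, so it is an ordinary variable name.
regexp {([^/]+)/(.*)} $port -- devNum port1

puts "devNum=$devNum port1=$port1"
# The full match is still stored; it is just kept in a variable named "--".
puts "full match: ${--}"
```

Many Tcl scripts use -> as the throwaway name for the same purpose, which avoids confusion with the -- switch that ends the list of switches.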

In regular expression parsing, the * symbol matches zero or more occurrences of the item immediately preceding the *. For example, a* would match a, aaaaa, or an empty string. If the item directly before the * is a set of characters within square brackets, then the * matches any number of characters from that set, in any order. For example, [a-c]* would match aa, abc, aabcabc, or again, an empty string.

The + symbol behaves roughly the same as the *, except that it requires at least one character to match. For example, [a-c]+ would match a, abc, or aabcabc, but not an empty string.
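The difference between * and + shows up on the empty string. The sketch below uses illustrative inputs and anchors both patterns so that the whole string has to match.

```tcl
set rows {}
foreach s {"" a abc aabcabc xyz} {
    set star [regexp {^[a-c]*$} $s]   ;# zero or more characters from a-c
    set plus [regexp {^[a-c]+$} $s]   ;# one or more characters from a-c
    puts "\"$s\": star=$star plus=$plus"
    lappend rows [list $s $star $plus]
}
```

Only the empty string gives different answers: [a-c]* accepts it and [a-c]+ does not.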

Regular expression parsing also includes a method of selecting any character not in a set. If the first character after the [ is a caret (^), then the regular expression parser will match any character not in the set of characters between the square brackets. A caret can be included in the set of characters to match (or not) by placing it in any position other than the first.
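A small sketch of both uses of the caret inside square brackets. The input strings and variable names are made up for illustration.

```tcl
# A leading caret negates the set: match a run of characters that are NOT digits.
regexp {[^0-9]+} "2016-Nov-17" nonDigits
puts $nonDigits

# A caret anywhere else in the set is an ordinary character.
regexp {[0-9^]+} "2^10 = 1024" run
puts $run
```

The first match is -Nov- (the first run of non-digits, including both hyphens), and the second is 2^10.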

The regexp command is similar to the string match command in that it matches a pattern (exp) against a string. It is different in that it can match a portion of a string, instead of the entire string, and, if a matchVar argument is given, it places the characters matched into that variable.

If a match is found for a portion of the regular expression enclosed within parentheses, regexp will copy the matching characters to the corresponding subMatch variable given after matchVar. This can be used to parse simple strings.
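As a sketch of parsing a simple string with parenthesized subpatterns, the script below splits a hypothetical name=number setting into its two parts. The input line and the variable names are assumptions made for the example.

```tcl
set line "timeout=30"

# regexp returns 1 on a match and fills in the variables that follow the string.
if {[regexp {^([a-z]+)=([0-9]+)$} $line whole key value]} {
    puts "whole=$whole key=$key value=$value"
} else {
    puts "no match"
}
```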

Regsub will copy the contents of the string to a new variable, substituting the characters that match exp with the characters in subSpec. If subSpec contains a & or \0, then those characters will be replaced by the characters that matched exp. If the number following a backslash is 1-9, then that backslash sequence will be replaced by the text that matched the corresponding parenthesized portion of exp.
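The & and \1 to \9 forms of subSpec can be sketched as follows. The sample sentence is the one used elsewhere in this post; the variable names marked and swapped are illustrative.

```tcl
set sample "Where there is a will, There is a way."

# & (or \0) stands for the whole matched text.
regsub {way} $sample {[&]} marked
puts $marked

# \1 and \2 stand for the first and second parenthesized submatches.
regsub {(will), There is a (way)} $sample {\2, There is a \1} swapped
puts $swapped
```

The subSpec is written in braces so that Tcl passes the backslashes and brackets through to regsub unchanged.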

Note that the exp argument to regexp or regsub is processed by the Tcl substitution pass. Therefore quite often the expression is enclosed in braces to prevent any special processing by Tcl.
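The effect of braces can be seen by writing the same pattern both ways. In this sketch the text and variable names are made up; -> is just a throwaway variable name for the full match.

```tcl
set text {version [42]}

# Braced: the RE reaches regexp exactly as written.
regexp {\[([0-9]+)\]} $text -> num

# Quoted: Tcl processes backslashes and square brackets first,
# so every bracket needs a backslash and every RE backslash must be doubled.
regexp "\\\[(\[0-9\]+)\\\]" $text -> num2

puts "braced: $num, quoted: $num2"
```

Both commands pass the same RE to regexp, but the braced form is much easier to read.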

Simple Examples: All Examples tested on TCL 8.4
========================================
    EXAMPLE 1.
    set sample "Where there is a will, There is a way."
    set result [regexp {[a-z]+} $sample match]
    puts $match      ;# prints: here
    puts $result     ;# prints: 1
    The regular expression above matches "here" (the first run of lowercase letters) and stores it in match.

    To match "here there is a will", add a space to the set:
    set result [regexp {[a-z ]+} $sample match]
    puts $match      ;# prints: here there is a will

    To match "here there is a will," add a comma to the set as well:
    set result [regexp {[a-z, ]+} $sample match]
    The match also includes the space after the comma, because the space is part of the set.

    To match the whole string "Where there is a will, There is a way." add uppercase letters and the period:
    set result [regexp {[A-Za-z ,.]+} $sample match]

    To match "Where there" and store "Where" and "there" as separate substrings:
    set result [regexp {([A-Za-z]+) +([a-z]+)} $sample match sub1 sub2]
    puts $match      ;# prints: Where there
    puts $sub1       ;# prints: Where
    puts $sub2       ;# prints: there
    Here match holds the complete match, i.e. "Where there". The text matched by the first ()
    is stored in sub1 and the text matched by the second () is stored in sub2.

   NOTE: If we do not need the complete match, we can write a throwaway variable name such as "--"
                in place of match. In this position -- is not a switch; it is just a variable that
                receives the full match and is then ignored. sub1 and sub2 are filled as before:
   set result [regexp {([A-Za-z]+) +([a-z]+)} $sample -- sub1 sub2]
   puts $sub1       ;# prints: Where
   puts $sub2       ;# prints: there

   To match "here there is a will, There is a way" and store "here there is a will" and
   "There is a way" in sub1 and sub2 respectively:
    set result [regexp {([a-z ]+), +([A-Za-z ]+)} $sample match sub1 sub2]
    puts $match      ;# prints: here there is a will, There is a way
    puts $sub1       ;# prints: here there is a will
    puts $sub2       ;# prints: There is a way

EXAMPLE: 2
set out "Tcl Tutorial"
regexp {([A-Z,a-z]*).([A-Z,a-z]*)} $out a b c 
puts "Full Match: $a"
puts "Sub Match1: $b"
puts "Sub Match2: $c"
Output:
Full Match: Tcl Tutorial
Sub Match1: Tcl
Sub Match2: Tutorial

set out "Tcl Tutorial"
regexp {([A-Z,a-z]*.([A-Z,a-z]*))} $out a b c 
puts "Full Match: $a"
puts "Sub Match1: $b"
puts "Sub Match2: $c"
Output:
Full Match: Tcl Tutorial
Sub Match1: Tcl Tutorial
Sub Match2: Tutorial
   

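Switches for the regexp Command
Some of the switches available in Tcl are:
-nocase − Ignore case when matching.
-indices − Store the start and end indices of the match and of each matched subpattern instead of the matched characters.
-line − Newline-sensitive matching: . and [^...] do not match a newline, and ^ and $ also match at the start and end of each line.
-start index − Start matching at the given character index of the string.
-- − Marks the end of the switches; the next argument is treated as the pattern even if it starts with -.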

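In the above examples, I have deliberately used [A-Z,a-z] for all letters. Note that the comma inside the brackets is a literal comma, so that set also matches ","; the usual form is [A-Za-z]. You can use -nocase with [A-Z] instead, as shown below: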

set out "Tcl Tutorial"
regexp -nocase {([A-Z]*.([A-Z]*))} $out a b c 
puts "Full Match: $a"
puts "Sub Match1: $b"
puts "Sub Match2: $c"
Output:
Full Match: Tcl Tutorial
Sub Match1: Tcl Tutorial
Sub Match2: Tutorial


regexp -nocase -line -- {([A-Z]*.([A-Z]*))} "Tcl \nTutorial" a b
puts "Full Match: $a"
puts "Sub Match1: $b"
regexp -nocase -start 4 -line -- {([A-Z]*.([A-Z]*))} "Tcl \nTutorial" a b 
puts "Full Match: $a"
puts "Sub Match1: $b"
Output:
Full Match: Tcl
Sub Match1: Tcl
Full Match: Tutorial
Sub Match1: Tutorial
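The -indices switch is not covered by the examples above, so here is a small sketch. The variable names all, first and second are illustrative. Each variable receives a two-element list with the index of the first and last matched character.

```tcl
set out "Tcl Tutorial"
regexp -indices {([A-Za-z]+) ([A-Za-z]+)} $out all first second

puts "all=$all first=$first second=$second"

# The index pairs can be passed to string range to recover the text.
set word [string range $out [lindex $second 0] [lindex $second 1]]
puts $word
```

For this string the full match covers indices 0 to 11, the first word 0 to 2, and the second word 4 to 11.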

REGSUB:
Syntax: regsub ?switches? exp string subSpec var
Searches string for a substring that matches the regular expression exp and replaces it with subSpec. Only the first match is replaced unless -all is given.
The resulting string is copied into var.
Eg: 1
   set sample "Where there is a will, There is a way."
   regsub "way" $sample "lawsuit" sample2
   puts $sample2    ;# prints: Where there is a will, There is a lawsuit.
   The above command replaces the string "way" with "lawsuit" and stores the result in sample2. The original sample is unchanged.
Eg: 2
    set sample "eer dfgdfgf       trt         dfsdf      sfdsf        ree"
    regsub -all { +} $sample " " var
    puts $var        ;# prints: eer dfgdfgf trt dfsdf sfdsf ree
    Each run of one or more spaces is replaced by a single space. The pattern { +} matches spaces only, not tabs.
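If the input may contain tabs as well as spaces, the set can list both. This is a sketch with a made-up input string; it also uses the fact that regsub returns the number of replacements when a result variable is given.

```tcl
# \t inside double quotes is a real tab character.
set messy "eer\tdfgdfgf   \t trt"

# [ \t]+ matches any run of spaces and tabs.
set count [regsub -all {[ \t]+} $messy " " clean]

puts $clean
puts "replacements made: $count"
```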

?: Usage
Usage: ?: is used in subpatterns in a regexp.
Whenever you want to group part of a pattern without having it captured as a submatch, put "?:" right after the opening parenthesis of that subpattern.
Example:
set string "Names: Manish Ajay Aman"
regexp "Names: (Manish|Ajay) (?:Aman|Raj|Ajay) (Aman|Raj)" $string match sub1 sub2 sub3
puts "$match\n$sub1\n$sub2\n$sub3\n"

For the above example, the output will be:
Names: Manish Ajay Aman
Manish
Aman

The subpattern that starts with ?: still has to match (here it matches Ajay), but its text is not captured. So match holds the full match,
sub1 is Manish, sub2 is Aman, and sub3 is an empty string, because the pattern has only two capturing subpatterns.
So here sub3 acts as a dummy variable.
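To see the difference side by side, the sketch below runs the same pattern with and without ?: and uses -inline, which makes regexp return the matches as a list instead of storing them in variables. The variable names are illustrative.

```tcl
set names "Names: Manish Ajay Aman"

# Three capturing subpatterns: full match plus three submatches.
set capturing [regexp -inline {Names: (Manish|Ajay) (Aman|Raj|Ajay) (Aman|Raj)} $names]

# The middle subpattern is non-capturing: full match plus two submatches.
set grouped [regexp -inline {Names: (Manish|Ajay) (?:Aman|Raj|Ajay) (Aman|Raj)} $names]

puts "capturing: [llength $capturing] elements: $capturing"
puts "grouped:   [llength $grouped] elements: $grouped"
```

The first list has four elements and the second has three, since Ajay is matched but not captured.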