Shell Script

15 Regular Expression

트리스탄1234 2022. 8. 27. 09:10

A regular expression is a user-defined pattern that allows Linux utilities to filter text. The figure below shows the process of filtering data using regular expressions.


 

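For example, the '*' wildcard character lists all files whose names match a pattern. Strictly speaking this is shell filename matching rather than a regular expression, but it illustrates the idea of a pattern. Let's take a look at the example below.

$ ls -al da*   ==> List all files starting with da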
-rw-r--r-- 1 rich rich 25 Dec 4 12:40 data.ts
-rw-r--r-- 1 rich rich 180 Nov 26 12:42 data1
-rw-r--r-- 1 rich rich 45 Nov 26 12:44 data2
-rw-r--r-- 1 rich rich 73 Nov 27 12:31 data3
-rw-r--r-- 1 rich rich 79 Nov 28 14:01 data4
-rw-r--r-- 1 rich rich 187 Dec 4 09:45 datatest
$

Types of regular expressions

The inconvenient thing about regular expressions is that different applications implement different kinds of them. Linux utilities generally use one of the following two regular expression engines.

■ BRE engine: the POSIX Basic Regular Expression engine

■ ERE engine: the POSIX Extended Regular Expression engine, an extended version of BRE
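
One way to see the difference between the two engines is to try the same pattern with each. The sketch below assumes grep, which this article does not otherwise use: plain grep works with BRE and grep -E works with ERE. The helper names matches_bre and matches_ere are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers: report whether a line matches a pattern under each engine.
matches_bre() { printf '%s\n' "$1" | grep -q -- "$2" && echo yes || echo no; }
matches_ere() { printf '%s\n' "$1" | grep -Eq -- "$2" && echo yes || echo no; }

# In ERE, + means "one or more of the preceding character".
matches_ere "beet" 'be+t'    # yes
# In BRE, + is an ordinary character, so only a literal plus sign matches.
matches_bre "beet" 'be+t'    # no
matches_bre "be+t" 'be+t'    # yes
```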

1. Defining BRE patterns

A. Filtering on plain text characters is the method used most often with sed and gawk in the other articles of this series. Let's look at a simple example.

$ echo "This is a test" | sed -n '/test/p'
This is a test
$ echo "This is a test" | sed -n '/trial/p'
$

Let's look at some things to keep in mind when defining plain text patterns. First of all, regular expressions are case-sensitive. Second, a BRE pattern matches wherever it occurs in the line, so the pattern does not have to be a whole word. Let's take a look at the example below.

$ echo "The books are expensive" | sed -n '/book/p'
==> book is found inside books, so the line is printed
The books are expensive
$
Reversing the example above produces no output, because the pattern books does not occur in the line.
$ echo "The book is expensive" | sed -n '/books/p'
$

Regular expressions can also contain numbers and spaces.

$ echo "This is line number 1" | sed -n '/ber 1/p'   ==> ber 1 is matched, so the line is printed
This is line number 1
$
This time the input has no space before the 1, so there is no output. A space is treated like any other character.
$ echo "This is line number1" | sed -n '/ber 1/p'
$
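
The three rules above (case sensitivity, matching anywhere in the line, and spaces counting as characters) can be checked with a small script. This is a minimal sketch; the helper name sed_match is hypothetical and is not part of sed.

```shell
#!/bin/bash
# Hypothetical helper: print "match" if sed finds the pattern anywhere in the line.
# The pattern must not contain a slash, because / delimits the sed address.
sed_match() {
    if [ -n "$(printf '%s\n' "$1" | sed -n "/$2/p")" ]; then
        echo match
    else
        echo "no match"
    fi
}

sed_match "The books are expensive" "book"    # match: book occurs inside books
sed_match "The books are expensive" "Book"    # no match: the case differs
sed_match "This is line number1" "ber 1"      # no match: the input has no space
```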

Handling special characters

In regular expressions, some characters have a predefined special meaning and cannot be used directly as plain text in a pattern. To match one of these special characters (.*[]^${}\+?|()) literally, put a backslash in front of it. The forward slash is not a regular expression special character, but because it delimits the pattern in sed and gawk it must be escaped as well. Let's look at some examples.

$ cat data2
The cost is $4.00
$ sed -n '/\$/p' data2   ==> To use $ as a pattern, put a \ in front of the $.
The cost is $4.00
$
$ echo "\ is a special character" | sed -n '/\\/p'
==> Put a \ in front of the backslash to search for a backslash
\ is a special character
$
This is an example that searches for a forward slash.
$ echo "3 / 2" | sed -n '/\//p'
3 / 2
$
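
As a further sketch of escaping, the commands below combine several escaped characters in one pattern. The input strings are made up for illustration.

```shell
#!/bin/bash
# Each command prints the input line only if the escaped characters are found.
printf '%s\n' 'The cost is $4.00' | sed -n '/\$4\.00/p'   # literal $ and literal dot
printf '%s\n' 'The cost is $4x00' | sed -n '/\$4\.00/p'   # prints nothing: x is not a dot
printf '%s\n' 'C:\temp'           | sed -n '/\\/p'        # literal backslash
printf '%s\n' '3 / 2'             | sed -n '/\//p'        # literal forward slash
```

Without the backslash, the dot in the first pattern would match any character, and the second command would print its line as well.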

Using anchor characters

In regular expressions, there are special characters that anchor a pattern to the start or the end of a line. Among them, the caret character (^) marks the beginning of a text line. Let's look at the example below.

Searching for the beginning of a line

$ echo "The book store" | sed -n '/^book/p'   ==> Search for lines that start with book
$
$ echo "Books are great" | sed -n '/^Book/p'   ==> Search for lines that start with Book
Books are great
$
$ cat data3
This is a test line.
this is another test line.
A line that tests this feature.
Yet more testing of this
$ sed -n '/^this/p' data3   ==> Search for lines that start with this
this is another test line.
$

Searching for the end of a line

The opposite of the caret character (^) is the dollar sign ($), which anchors a pattern to the end of a line. Let's see how to use it through the examples below.

$ echo "This is a good book" | sed -n '/book$/p'
This is a good book
$ echo "This book is good" | sed -n '/book$/p'
$

Using both start and end characters

You can also combine the caret (^) at the start of a pattern with the dollar sign ($) at its end. Let's take a look at some useful examples below.

$ cat data4
this is a test of using both anchors
I said this is a test
this is a test
I'm sure this is a test.
$ sed -n '/^this is a test$/p' data4
==> Only the line that is exactly "this is a test" is printed
this is a test
$
As shown in the example above, when the start anchor ^ and the end anchor $ are used together, the pattern must match the entire line, so lines that contain any other text are not printed.
$ cat data5
This is one test line.

This is another test line.
$ sed '/^$/d' data5   ==> ^$ matches a line with nothing between its start and its end
This is one test line.
This is another test line.
$
The pattern ^$ finds the blank line in the file, and sed's delete command, d, removes it.
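
The anchor behaviors described in this section can be tried side by side. The sketch below keeps some made-up sample text in a variable instead of a data file, so it can be run as is.

```shell
#!/bin/bash
# Sample input held in a variable (hypothetical data, including one empty line).
sample='this is a test
I said this is a test

this is a test of anchors'

printf '%s\n' "$sample" | sed -n '/^this/p'             # lines starting with "this"
printf '%s\n' "$sample" | sed -n '/test$/p'             # lines ending with "test"
printf '%s\n' "$sample" | sed -n '/^this is a test$/p'  # the whole line must match
printf '%s\n' "$sample" | sed '/^$/d'                   # drop empty lines
```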

Using the dot character

The dot (.) special character matches any single character in a regular expression. Let's take a look at the usage example below.

$ cat data6
This is a test of a line.
The cat is sleeping.
That is a very nice hat.
This test is at line four.
at ten o'clock we'll go home.
$ sed -n '/.at/p' data6   ==> Find any single character followed by at
The cat is sleeping.   ==> cat
That is a very nice hat.   ==> hat
This test is at line four.   ==> space + at
$
The thing to remember here is that the dot must match exactly one character. A space counts as a character, so the fourth line matches, but the fifth line does not, because no character comes before the at that starts the line.
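
The same behavior can be reproduced without a data file. In this small sketch the three input lines are passed directly to sed; only the first two are printed.

```shell
#!/bin/bash
# Only lines where some character comes directly before "at" are printed.
printf '%s\n' 'The cat is sleeping.' 'This test is at line four.' 'at ten we go home.' |
    sed -n '/.at/p'
```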

 

Using character classes

Character classes can be used when you want to match only specific characters at a position. You define one by putting the characters to be matched in square brackets. Let's look at an example below.

$ sed -n '/[ch]at/p' data6   ==> Print lines in which at is preceded by c or h
The cat is sleeping.
That is a very nice hat.
$
Character classes are useful when you do not know whether a word starts with a capital letter or a lowercase letter.
$ echo "Yes" | sed -n '/[Yy]es/p'
Yes
$ echo "yes" | sed -n '/[Yy]es/p'
yes
$
Example using more than one character class
$ echo "Yes" | sed -n '/[Yy][Ee][Ss]/p'
Yes
$ echo "yEs" | sed -n '/[Yy][Ee][Ss]/p'
yEs
$ echo "yeS" | sed -n '/[Yy][Ee][Ss]/p'
yeS
$
Example of using digits in a character class
$ cat data7
This line doesn't contain a number.
This line has 1 number on it.
This line a number 2 on it.
This line has a number 4 on it.
$ sed -n '/[0123]/p' data7
This line has 1 number on it.
This line a number 2 on it.
$
Find a five-digit pattern. Note that 223001 is also printed, because a six-digit number contains five consecutive digits.
$ cat data8
60633
46201
223001
4353
22203
$ sed -n '
> /[0123456789][0123456789][0123456789][0123456789][0123456789]/p
> ' data8
60633
46201
223001
22203
$
Find lines that match either of two patterns, allowing for common misspellings
$ cat data9
I need to have some maintenence done on my car.
I'll pay that in a seperate invoice.
After I pay for the maintenance my car will be as good as new.
$ sed -n '
> /maint[ea]n[ae]nce/p
> /sep[ea]r[ea]te/p
> ' data9
I need to have some maintenence done on my car.
I'll pay that in a seperate invoice.
After I pay for the maintenance my car will be as good as new.
$
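
A common use of character classes is accepting an answer regardless of case. The sketch below is one possible way to do it; the helper name is_yes is hypothetical, and it uses grep -q (an assumption, not used in this article) together with the anchors from the previous section.

```shell
#!/bin/bash
# Hypothetical helper: accept "yes" in any mix of upper and lower case.
is_yes() {
    # ^ and $ are added so that longer words such as "yesterday" are rejected.
    if printf '%s\n' "$1" | grep -q '^[Yy][Ee][Ss]$'; then
        echo accepted
    else
        echo rejected
    fi
}

is_yes "YeS"         # accepted
is_yes "yesterday"   # rejected
is_yes "no"          # rejected
```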

Negating character classes

You can also reverse the effect of a character class. To match any character that is not in the class, place a caret (^) as the first character inside the square brackets.

$ sed -n '/[^ch]at/p' data6   ==> Print lines in which at is preceded by a character other than c or h
This test is at line four.
$
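
One detail worth checking is that a negated class still has to match a character, and that a line is printed as soon as any one occurrence qualifies. A small sketch with made-up input lines:

```shell
#!/bin/bash
# Print only the lines where "at" is preceded by a character other than c or h.
printf '%s\n' 'The cat sat.' 'A nice hat.' 'at home' | sed -n '/[^ch]at/p'
```

Only "The cat sat." is printed: cat is excluded, but sat qualifies. The line "at home" is not printed because nothing precedes its at.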

Searching with ranges

You can use a dash (-) inside a character class to specify a range of characters to search for.

$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8
==> Search for lines that consist of exactly five digits
60633
46201
22203
$
$ sed -n '/[c-h]at/p' data6
==> Search for lines in which at is preceded by a character between c and h
The cat is sleeping.
That is a very nice hat.
$
$ sed -n '/[a-ch-m]at/p' data6
==> Search for lines in which at is preceded by a character from a to c or from h to m
The cat is sleeping.
That is a very nice hat.
$
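
The five-digit check can be wrapped in a small filter. This is a sketch only; the function name five_digit_lines and the input values are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: keep only lines that are exactly five digits long.
five_digit_lines() { sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p'; }

# Prints 60633 and 22203; the other values are too long, too short, or not all digits.
printf '%s\n' 60633 223001 4353 22203 1234a | five_digit_lines
```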

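Using special character classes provided by BRE

BRE also provides predefined special character classes, written in the form [[:name:]], that match a whole category of characters. Let's take a look at the examples and the list below.

$ echo "abc" | sed -n '/[[:digit:]]/p'   ==> Print lines containing a digit
$
$ echo "abc" | sed -n '/[[:alpha:]]/p'   ==> Print lines containing a letter
abc
$ echo "abc123" | sed -n '/[[:digit:]]/p'   ==> Print lines containing a digit
abc123
$ echo "This is, a test" | sed -n '/[[:punct:]]/p'   ==> Print lines containing punctuation
This is, a test
$ echo "This is a test" | sed -n '/[[:punct:]]/p'   ==> No punctuation, so no output
$

Frequently used BRE special character classes
[[:alpha:]] ==> any alphabetic character, upper or lower case
[[:alnum:]] ==> any alphabetic character or digit
[[:blank:]] ==> a space or a tab
[[:digit:]] ==> a digit from 0 to 9
[[:lower:]] ==> any lowercase letter
[[:upper:]] ==> any uppercase letter
[[:punct:]] ==> any punctuation character
[[:space:]] ==> any whitespace character (space, tab, newline, and so on)
[[:print:]] ==> any printable character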
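
Special character classes are handy for simple content checks. The sketch below reports which kinds of characters a string contains; the helper name describe is hypothetical, and grep -q is used here as an assumption in place of sed.

```shell
#!/bin/bash
# Hypothetical helper: describe which kinds of characters a string contains.
describe() {
    printf '%s\n' "$1" | grep -q '[[:digit:]]' && echo "has digit"
    printf '%s\n' "$1" | grep -q '[[:upper:]]' && echo "has uppercase"
    printf '%s\n' "$1" | grep -q '[[:punct:]]' && echo "has punctuation"
    return 0
}

describe "Hello, world 42"   # prints all three messages
describe "abc"               # prints nothing
```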

Using the asterisk

Placing an asterisk (*) after a character means that the character may appear zero or more times at that position. Let's take a look at the example below.

$ echo "ik" | sed -n '/ie*k/p'
==> The e before the asterisk may be absent or may appear multiple times.
ik
$ echo "iek" | sed -n '/ie*k/p'
iek
$ echo "ieek" | sed -n '/ie*k/p'
ieek
$ echo "ieeek" | sed -n '/ie*k/p'
ieeek
$ echo "ieeeek" | sed -n '/ie*k/p'
ieeeek
$
$ echo "bt" | sed -n '/b[ae]*t/p'
==> Between b and t, the characters a or e may be absent or may appear any number of times
bt
$ echo "bat" | sed -n '/b[ae]*t/p'
bat
$
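
A practical use of the asterisk is tolerating spelling variants. In this sketch (the words are chosen only for illustration) the pattern accepts both spellings because the u may appear zero or more times.

```shell
#!/bin/bash
# u* allows zero or more u characters, so color and colour are both printed.
# colr is not printed because the o is still required.
printf '%s\n' 'color' 'colour' 'colr' | sed -n '/colou*r/p'
```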

Using Extended Regular Expressions

Using the question mark

The question mark (?) is used in a similar way to the asterisk. Unlike the asterisk, however, the question mark does not allow the preceding character to repeat: the character may appear once or not at all. In other words, it matches lines in which the character is either absent or appears exactly once at that position. By default sed works with BRE patterns, so the ERE examples below use gawk. Let's look at an example below.

$ echo "bt" | gawk '/be?t/{print $0}'   ==> Printed: e does not appear
bt
$ echo "bet" | gawk '/be?t/{print $0}'   ==> Printed: e appears once
bet
$
$ echo "beet" | gawk '/be?t/{print $0}'   ==> Not printed: e appears twice
$
$ echo "beeet" | gawk '/be?t/{print $0}'   ==> Not printed: e appears three times
$

Using the plus sign

The plus sign (+) matches lines in which the character before the plus sign appears at least once. That is, lines in which that character does not appear at all are not matched. Let's look at the example below.

$ echo "beeet" | gawk '/be+t/{print $0}'
beeet
$ echo "beet" | gawk '/be+t/{print $0}'
beet
$ echo "bet" | gawk '/be+t/{print $0}'
bet
$ echo "bt" | gawk '/be+t/{print $0}'
$
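
To compare the asterisk, the question mark, and the plus sign directly, the sketch below runs the same four words through each pattern. It assumes grep -E, which also uses ERE, in place of gawk; the helper name words_matching is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list which of the sample words fully match an ERE.
words_matching() {
    printf '%s\n' bt bet beet beeet | grep -E "^$1\$" | tr '\n' ' '
    echo
}

words_matching 'be*t'   # zero or more e: bt bet beet beeet
words_matching 'be?t'   # zero or one e:  bt bet
words_matching 'be+t'   # one or more e:  bet beet beeet
```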

Using braces to specify the number of matched characters

You can use braces to specify how many times the character before the braces may be repeated. This is called an interval, and it has two forms:

■ m: the preceding character appears exactly m times

■ m,n: the preceding character appears at least m times and at most n times

※ Older versions of gawk do not recognize regular expression intervals by default, so you have to pass the --re-interval option as in the examples below. Since gawk 4.0, intervals are enabled by default and the option is accepted but no longer required.

$ echo "bt" | gawk --re-interval '/be{1}t/{print $0}'
$
$ echo "bet" | gawk --re-interval '/be{1}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1}t/{print $0}'
$
$ echo "bt" | gawk --re-interval '/be{1,2}t/{print $0}'
$
$ echo "bet" | gawk --re-interval '/be{1,2}t/{print $0}'
bet
$ echo "beet" | gawk --re-interval '/be{1,2}t/{print $0}'
beet
$ echo "beeet" | gawk --re-interval '/be{1,2}t/{print $0}'
$
Using an interval with a character class
$ echo "bt" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
==> Search for lines in which b is followed by one or two characters that are a or e, and then by t
$
$ echo "bat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bat
$ echo "bet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
bet
$ echo "beat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beat
$ echo "beet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
beet
$ echo "beeat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
$ echo "baeet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
$ echo "baeaet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
$
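
Intervals are useful for fixed-width fields. The sketch below checks that a string looks like a clock time such as 9:30 or 09:30. It checks the format only, not whether the values are in range. It assumes grep -E rather than gawk, and the helper name looks_like_time is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: one or two digits, a colon, then exactly two digits.
looks_like_time() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{1,2}:[0-9]{2}$'; then
        echo ok
    else
        echo bad
    fi
}

looks_like_time "9:30"     # ok
looks_like_time "09:30"    # ok
looks_like_time "123:30"   # bad: three digits before the colon
looks_like_time "9:3"      # bad: only one digit after the colon
```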

Using the pipe symbol

The pipe symbol (|) gives the effect of a logical OR by joining multiple regular expressions. If any one of the patterns matches, the line is printed.

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'
The cat is asleep
$ echo "The dog is asleep" | gawk '/cat|dog/{print $0}'
The dog is asleep
$ echo "The sheep is asleep" | gawk '/cat|dog/{print $0}'
$

Using grouping expressions

You can use parentheses to group multiple characters together so that the group is treated as a single unit by the special characters that follow it. Let's take a look at the example below.

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}'   ==> The group (urday) may or may not appear after Sat
Sat
$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'
Saturday
$
Examples of grouping used together with the pipe symbol
$ echo "cat" | gawk '/(c|b)a(b|t)/{print $0}'
==> Search for lines containing c or b, followed by a, followed by b or t
cat
$ echo "cab" | gawk '/(c|b)a(b|t)/{print $0}'
cab
$ echo "bat" | gawk '/(c|b)a(b|t)/{print $0}'
bat
$ echo "bab" | gawk '/(c|b)a(b|t)/{print $0}'
bab
$ echo "tab" | gawk '/(c|b)a(b|t)/{print $0}'
$
$ echo "tac" | gawk '/(c|b)a(b|t)/{print $0}'
$
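
Grouping and the pipe symbol combine naturally when several optional forms are allowed. The sketch below accepts a weekend day in its short or long form. It assumes grep -E instead of gawk, and the helper name is_weekend is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: accept Sat, Saturday, Sun, or Sunday and nothing else.
is_weekend() {
    if printf '%s\n' "$1" | grep -Eq '^(Sat(urday)?|Sun(day)?)$'; then
        echo weekend
    else
        echo other
    fi
}

is_weekend "Sat"      # weekend
is_weekend "Sunday"   # weekend
is_weekend "Satur"    # other: the group urday must appear whole or not at all
is_weekend "Mon"      # other
```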

Regular expressions in practice

First, let's look at a script that counts the number of files in each directory listed in the PATH environment variable.

#!/bin/bash
# count number of files in each directory of your PATH
# replace each colon in PATH with a space so the for loop can split it
mypath=$(echo "$PATH" | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls "$directory")
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done
$ ./countfiles
/usr/local/bin - 79
/bin - 86
/usr/bin - 1502
/usr/X11R6/bin - 175
/usr/games - 2
/usr/java/j2sdk1.4.1_01/bin - 27
$
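
The script above splits both PATH and the ls output on whitespace, so a directory or file name containing a space would be miscounted. The following is an alternative sketch, not the original author's method: it counts entries with a filename pattern instead of parsing ls, and it skips PATH entries that are not directories. The function name count_entries is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the visible entries in one directory.
count_entries() {
    n=0
    for item in "$1"/*; do
        # An unmatched pattern is left as literal text, so check that the entry exists.
        if [ -e "$item" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Put each PATH directory on its own line, then report the count for each one.
printf '%s\n' "$PATH" | tr ':' '\n' | while IFS= read -r directory; do
    if [ -d "$directory" ]; then
        echo "$directory - $(count_entries "$directory")"
    fi
done
```

The output depends on the system it runs on, so no sample output is shown here.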

Second example: validating a phone number

#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}/{print $0}'

Analyzing the pattern
^\(? ==> The line begins with or without an opening parenthesis '('
[2-9][0-9]{2} ==> One digit from 2 to 9 followed by two digits from 0 to 9
\)? ==> An optional closing parenthesis ')'
(| |-|\.) ==> Nothing, a space, a dash '-', or a dot '.'
[0-9]{3} ==> Three digits from 0 to 9
( |-|\.) ==> A space, a dash '-', or a dot '.'
[0-9]{4} ==> Four digits from 0 to 9
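
The pattern can be exercised against a few sample numbers. The sketch below makes three assumptions that differ from the script above: it uses grep -E instead of gawk, it writes the separators as the character class [ .-] (with ? for the optional one) instead of an alternation, and it adds a trailing $ so that extra digits after the number are rejected. The helper name is_phone and the sample numbers are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check one string against the phone number pattern.
is_phone() {
    if printf '%s\n' "$1" | grep -Eq '^\(?[2-9][0-9]{2}\)?[ .-]?[0-9]{3}[ .-][0-9]{4}$'; then
        echo valid
    else
        echo invalid
    fi
}

is_phone "317-555-1234"     # valid
is_phone "(317) 555-1234"   # valid
is_phone "317.555.1234"     # valid
is_phone "000-000-0000"     # invalid: the first digit must be 2 to 9
is_phone "317 555 12345"    # invalid: five digits in the last group
```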

Parsing an e-mail address

A great deal of business is now done through e-mail, so checking the validity of e-mail addresses has become an important task. The regular expression below can be used to check the basic form of an e-mail address. An e-mail address is divided into a username part and a hostname part, separated by the @ sign, and the special characters that are valid in each part are as follows.

Valid special characters in username

■ Dot

■ Dash

■ Plus sign

■ Underscore

Valid special characters in hostname

■ Dot

■ Underscore

Now let's take a look at the regular expression example below.

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)
Let's look first at the username part, which comes before the @.
^([a-zA-Z0-9_\-\.\+]+)
==> Starting at the beginning of the line, match one or more characters that are lowercase letters a to z, uppercase letters A to Z, digits 0 to 9, or one of '_', '-', '.', and '+'.
Then look at the expression after the @, which matches the domain and server name.
([a-zA-Z0-9_\-\.]+)
==> Match one or more characters that are lowercase letters a to z, uppercase letters A to Z, digits 0 to 9, or one of '_', '-', and '.'.
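
To try the e-mail pattern, the sketch below wraps it in a small function. It assumes grep -E rather than gawk, adds a trailing $ so the whole string must match, and writes the character classes without backslashes, with the dash placed last so that it is taken literally. The helper name is_email and the sample addresses are hypothetical. This checks only the basic form of an address, not whether it exists.

```shell
#!/bin/bash
# Hypothetical helper: check the basic username@hostname form of an address.
is_email() {
    if printf '%s\n' "$1" | grep -Eq '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+$'; then
        echo valid
    else
        echo invalid
    fi
}

is_email "rich@here.now"                     # valid
is_email "rich.blum+list@here-now.example"   # valid
is_email "rich@here@now"                     # invalid: a second @ is not allowed
is_email "rich/blum@here.now"                # invalid: slash is not a valid character
```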

Regular expressions like these can be used in many other ways to check the validity of input data, and they are a must-know tool in the Linux world.
