15 Regular Expression


트리스탄1234 2022. 8. 27. 09:10


A regular expression is a user-defined pattern that allows Linux utilities to filter text. The figure below shows the process of filtering data using regular expressions.

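For example, the "*" wildcard matches any run of characters, so ls lists every file whose name fits the pattern. (Strictly speaking, the shell expands this wildcard itself, a mechanism called globbing that is related to, but not the same as, a regular expression.) Let's take a look at the example below.

$ ls -al da* ==> List all files starting with da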

-rw-r--r-- 1 rich rich 25 Dec 4 12:40 data.ts

-rw-r--r-- 1 rich rich 180 Nov 26 12:42 data1

-rw-r--r-- 1 rich rich 45 Nov 26 12:44 data2

-rw-r--r-- 1 rich rich 73 Nov 27 12:31 data3

-rw-r--r-- 1 rich rich 79 Nov 28 14:01 data4

-rw-r--r-- 1 rich rich 187 Dec 4 09:45 datatest

Types of regular expressions

One inconvenience of regular expressions is that different applications implement different flavors of them. Linux utilities generally rely on one of the two POSIX regular expression engines:

■ BRE: the POSIX Basic Regular Expression engine

■ ERE: the POSIX Extended Regular Expression engine, which adds more pattern symbols to BRE

1. Defining the BRE pattern

A. Plain text patterns were used frequently with sed and gawk in earlier articles. Let's look at a simple example.

$ echo "This is a test" | sed -n '/test/p'

This is a test

$ echo "This is a test" | sed -n '/trial/p'

There are a few things to keep in mind when defining plain text patterns. First of all, regular expressions are case-sensitive. Second, a BRE pattern matches wherever it appears in the text: the pattern does not have to be a whole word, so the line is printed as long as the pattern occurs anywhere in it. Let's take a look at the example below.

$ echo "The books are expensive" | sed -n '/book/p'

==> book matches inside books, so the line is printed

The books are expensive

Reversing the example produces no output, because the pattern books does not appear in the text:

$ echo "The book is expensive" | sed -n '/books/p'

Regular expressions can also contain numbers and spaces.

$ echo "This is line number 1" | sed -n '/ber 1/p' ==> "ber 1" matches, so the line is printed

This is line number 1

This time the text has no space before the 1, so there is no output. A space is treated as a character like any other.

$ echo "This is line number1" | sed -n '/ber 1/p'
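The three rules above (case sensitivity, matching inside a word, and spaces counting as characters) can be checked with a short script. This is only a minimal sketch: the helper name contains_pattern is hypothetical, and it simply wraps the sed -n '/pattern/p' idiom used in the examples.

```shell
#!/bin/bash
# Hypothetical helper: prints "match" if the BRE in $1 occurs in the text $2.
contains_pattern() {
    if printf '%s\n' "$2" | sed -n "/$1/p" | grep -q .; then
        echo "match"
    else
        echo "no match"
    fi
}

case_result=$(contains_pattern 'this' 'This is a test')           # wrong case
word_result=$(contains_pattern 'book' 'The books are expensive')  # part of a word
space_result=$(contains_pattern 'ber 1' 'This is line number1')   # no space in text
echo "$case_result / $word_result / $space_result"
```

Only the second check should report a match; the first fails on case and the third on the missing space.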

Special character handling

In regular expressions, some characters have a predefined special meaning and cannot be used directly as plain text in a pattern. To match one of these special characters (.*[]^${}\+?|()) literally, put a backslash in front of it. Let's look at some examples.

$ cat data2

The cost is $4.00

$ sed -n '/\$/p' data2 ==> To match a literal $, put a \ in front of it

The cost is $4.00

$ echo "\ is a special character" | sed -n '/\\/p'

==> Put a \ in front of the backslash to match a backslash

\ is a special character

The forward slash is not a regular expression special character, but because sed uses it as the pattern delimiter, it must also be escaped when you search for a slash.

$ echo "3 / 2" | sed -n '/\//p'

3 / 2
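To see why the backslash matters, the sketch below compares an escaped and an unescaped dot. The sample strings are made up for illustration; only standard sed behavior is assumed.

```shell
#!/bin/bash
line='The cost is $4.00'

# Unescaped, the dot matches any character, so the pattern also accepts "4x00".
loose=$(printf '%s\n' '4x00' | sed -n '/4.00/p')

# Escaped, the dot must be a literal period, so "4x00" is rejected.
strict=$(printf '%s\n' '4x00' | sed -n '/4\.00/p')

# \$ matches a literal dollar sign instead of anchoring the end of the line.
dollar=$(printf '%s\n' "$line" | sed -n '/\$4\.00/p')

echo "loose=[$loose] strict=[$strict] dollar=[$dollar]"
```

The loose pattern prints 4x00, the strict one prints nothing, and the last one prints the full line from data2.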

Using anchor characters

In regular expressions, there are special characters that anchor a pattern to the start or the end of a line. Among them, the caret (^) marks the beginning of a text line. Let's look at the example below.

Searching for the beginning of a line

$ echo "The book store" | sed -n '/^book/p' ==> No output: the line does not start with book

$ echo "Books are great" | sed -n '/^Book/p' ==> Prints lines that start with Book

Books are great

$ cat data3

This is a test line.

this is another test line.

A line that tests this feature.

Yet more testing of this

$ sed -n '/^this/p' data3 ==> Prints lines that start with this

this is another test line.

Searching for the end of a line

The counterpart of the caret (^) is the dollar sign ($), which anchors a pattern to the end of a line. Let's see how to use it through some examples below.

$ echo "This is a good book" | sed -n '/book$/p'

This is a good book

$ echo "This book is good" | sed -n '/book$/p'

Using both start and end anchors

You can combine the caret (^) at the start of a pattern with the dollar sign ($) at its end. Let's take a look at some useful examples below.

$ cat data4

this is a test of using both anchors

I said this is a test

this is a test

I’m sure this is a test.

$ sed -n '/^this is a test$/p' data4

==> Only the line that consists of exactly "this is a test" is printed

this is a test

As shown in the example above, when you use ^ and $ together, the pattern must match the entire line; lines that contain any other text before or after it are not printed.

$ cat data5

This is one test line.

This is another test line.

$ sed '/^$/d' data5 ==> ^$ matches lines with nothing between the start and the end

This is one test line.

This is another test line.

The pattern ^$ finds the blank lines in the file, and sed's delete command, d, removes them.
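The two uses of anchors, matching a whole line and removing blank lines, can be tried together. The following is a minimal sketch; the sample file contents are made up, and the temporary file stands in for data4 and data5.

```shell
#!/bin/bash
# Build a small sample file (contents invented for this sketch).
sample=$(mktemp)
printf '%s\n' 'this is a test' '' 'I said this is a test' '' \
    'this is a test of both anchors' > "$sample"

exact=$(sed -n '/^this is a test$/p' "$sample" | wc -l)  # whole-line matches
blank=$(grep -c '^$' "$sample")                         # empty lines
kept=$(sed '/^$/d' "$sample" | wc -l)                    # lines left after deletion

echo "exact=$((exact)) blank=$((blank)) kept=$((kept))"
rm -f "$sample"
```

Of the five lines, only one is matched by the fully anchored pattern, two are empty, and three remain after the d command.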

Using the dot character

The dot (.) special character matches any single character except a newline. If there is no character at that position, the pattern does not match. Let's take a look at the example below.

$ cat data6

This is a test of a line.

The cat is sleeping.

That is a very nice hat.

This test is at line four.

at ten o’clock we’ll go home.

$ sed -n '/.at/p' data6 ==> Matches any single character followed by at

The cat is sleeping. ==> cat

That is a very nice hat. ==> hat

This test is at line four. ==> space + at

Note that the dot must match some character. A space counts as a character, which is why the fourth line is printed, while the fifth line, which begins with at, is not printed because nothing precedes the at.
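The effect of the dot can be made visible by counting matching lines with and without it. This is a small sketch using grep -c (which counts matching lines) on three of the lines from data6.

```shell
#!/bin/bash
sample=$(printf '%s\n' 'The cat is sleeping.' 'This test is at line four.' \
    "at ten o'clock we'll go home.")

# .at needs one character in front of "at"; plain at does not.
with_dot=$(printf '%s\n' "$sample" | grep -c '.at')
without_dot=$(printf '%s\n' "$sample" | grep -c 'at')

echo "lines matching .at: $with_dot, lines matching at: $without_dot"
```

The line that starts with at is counted only by the pattern without the dot.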

Using character classes

Character classes can be used when you want to restrict a position to a specific set of characters. Put the characters to be matched in square brackets. Let's look at an example below.

$ sed -n '/[ch]at/p' data6 ==> Prints lines in which at is preceded by c or h

The cat is sleeping.

That is a very nice hat.

This is an example to use when you do not know whether a word starts with a capital letter or a lowercase letter.

$ echo "Yes" | sed -n '/[Yy]es/p'

Yes

$ echo "yes" | sed -n '/[Yy]es/p'

yes

Example using more than one character class

$ echo "Yes" | sed -n '/[Yy][Ee][Ss]/p'

Yes

$ echo "yEs" | sed -n '/[Yy][Ee][Ss]/p'

yEs

$ echo "yeS" | sed -n '/[Yy][Ee][Ss]/p'

yeS

Example of using numbers in character class

$ cat data7

This line doesn’t contain a number.

This line has 1 number on it.

This line a number 2 on it.

This line has a number 4 on it.

$ sed -n '/[0123]/p' data7

This line has 1 number on it.

This line a number 2 on it.

Find a five-digit pattern

$ cat data8

60633

46201

223001

4353

22203

$ sed -n '
> /[0123456789][0123456789][0123456789][0123456789][0123456789]/p
> ' data8

==> 223001 is printed too, because it contains five consecutive digits

60633

46201

223001

22203

$ cat data9

I need to have some maintenence done on my car.

I’ll pay that in a seperate invoice.

After I pay for the maintenance my car will be as good as new.

Find sentences that match two patterns

$ sed -n '
> /maint[ea]n[ae]nce/p
> /sep[ea]r[ea]te/p
> ' data9

I need to have some maintenence done on my car.

I’ll pay that in a seperate invoice.

After I pay for the maintenance my car will be as good as new.

Using negated character classes

You can also reverse a character class: instead of searching for characters contained in the class, you search for any character that is not in it. To do so, place a caret (^) at the start of the character class.

$ sed -n '/[^ch]at/p' data6 ==> Prints lines in which at is preceded by a character other than c or h

This test is at line four.

Search using ranges

You can use a dash (-) inside a character class to specify a range of characters.

$ sed -n '/^[0-9][0-9][0-9][0-9][0-9]$/p' data8

==> Prints lines that consist of exactly five digits

60633

46201

22203

$ sed -n '/[c-h]at/p' data6

==> Prints lines in which at is preceded by a character between c and h

The cat is sleeping.

That is a very nice hat.

$ sed -n '/[a-ch-m]at/p' data6

==> Prints lines in which at is preceded by a character from a to c or from h to m

The cat is sleeping.

That is a very nice hat.
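Ranges combined with anchors give a simple validator. The sketch below checks the values from data8 against the anchored five-digit pattern; the helper name is_zip is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if $1 consists of exactly five digits.
is_zip() {
    printf '%s\n' "$1" | grep -q '^[0-9][0-9][0-9][0-9][0-9]$'
}

valid=""
for candidate in 60633 46201 223001 4353 22203; do
    if is_zip "$candidate"; then
        valid="$valid $candidate"
    fi
done
echo "valid:$valid"
```

Because of the ^ and $ anchors, the six-digit 223001 and the four-digit 4353 are rejected, which the unanchored version of the pattern cannot guarantee.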

Using special character classes provided by BRE

BRE provides named special character classes that match a whole category of characters. Let's take a look at the examples below.

$ echo "abc" | sed -n '/[[:digit:]]/p' ==> Prints lines containing a digit (no output here)

$ echo "abc" | sed -n '/[[:alpha:]]/p' ==> Prints lines containing a letter

abc

$ echo "abc123" | sed -n '/[[:digit:]]/p' ==> Prints lines containing a digit

abc123

$ echo "This is, a test" | sed -n '/[[:punct:]]/p' ==> Prints lines containing a punctuation character

This is, a test

$ echo "This is a test" | sed -n '/[[:punct:]]/p' ==> No output: the line has no punctuation

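BRE special character classes

The frequently used character classes are listed below.

■ [[:alpha:]]: any alphabetic character, upper or lower case

■ [[:alnum:]]: any alphanumeric character (0-9, A-Z, a-z)

■ [[:blank:]]: a space or tab character

■ [[:digit:]]: a digit from 0 to 9

■ [[:lower:]]: any lowercase letter

■ [[:print:]]: any printable character

■ [[:punct:]]: a punctuation character

■ [[:space:]]: any whitespace character (space, tab, newline, form feed, vertical tab, carriage return)

■ [[:upper:]]: any uppercase letter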
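The named classes can be explored with a short script. This is a sketch only: the helper name classify is hypothetical, and it tests just five of the POSIX class names.

```shell
#!/bin/bash
# Hypothetical helper: lists which of a few named classes occur in the text $1.
classify() {
    found=""
    for class in alpha digit punct space upper; do
        if printf '%s\n' "$1" | grep -q "[[:$class:]]"; then
            found="$found $class"
        fi
    done
    echo "${found# }"
}

classify 'abc123'
classify 'This is, a test'
```

The first call reports alpha and digit; the second reports alpha, punct, space and upper, since the text contains a comma, spaces and a capital T.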

Using the asterisk

Placing an asterisk (*) after a character means that the character must appear zero or more times in the text for the pattern to match. Let's take a look at the example below.

$ echo "ik" | sed -n '/ie*k/p'

==> The e before the asterisk may be absent or may appear any number of times

ik

$ echo "iek" | sed -n '/ie*k/p'

iek

$ echo "ieek" | sed -n '/ie*k/p'

ieek

$ echo "ieeek" | sed -n '/ie*k/p'

ieeek

$ echo "ieeeek" | sed -n '/ie*k/p'

ieeeek

$ echo "bt" | sed -n '/b[ae]*t/p'

==> Matches b, then any number of a or e characters (including none), then t

bt

$ echo "bat" | sed -n '/b[ae]*t/p'

bat
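The zero-or-more rule is easy to misread, so here is a small check of it. The word list is invented for this sketch; iak is included as a word that must not match.

```shell
#!/bin/bash
matched=""
for word in ik iek ieeek iak; do
    # ie*k: an i, then zero or more e characters, then a k
    if printf '%s\n' "$word" | grep -q 'ie*k'; then
        matched="$matched $word"
    fi
done
echo "matched:$matched"
```

ik is accepted because zero occurrences of e are allowed, while iak is rejected because a is not e.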

Using Extended Regular Expressions

ERE patterns add a few more symbols on top of BRE. The examples in this part use gawk, which recognizes ERE patterns.

Using the question mark

The question mark (?) works much like the asterisk, with one difference: the preceding character may appear zero times or exactly once, but not more than once. In other words, it matches text in which that character is either absent or present a single time. Let's look at an example below.

$ echo "bt" | gawk '/be?t/{print $0}' ==> Printed: e does not appear

bt

$ echo "bet" | gawk '/be?t/{print $0}' ==> Printed: e appears once

bet

$ echo "beet" | gawk '/be?t/{print $0}' ==> No output: e appears twice

$ echo "beeet" | gawk '/be?t/{print $0}' ==> No output: e appears three times

Using the plus sign

The plus sign (+) means that the character before it must appear at least once. Lines in which that character is missing do not match. Let's look at the example below.

$ echo "beeet" | gawk '/be+t/{print $0}'

beeet

$ echo "beet" | gawk '/be+t/{print $0}'

beet

$ echo "bet" | gawk '/be+t/{print $0}'

bet

$ echo "bt" | gawk '/be+t/{print $0}'

Using braces to specify the number of repetitions

You can use braces in an ERE pattern to specify how many times the preceding character may repeat. This is called an interval, and it takes one of two forms:

■ m: the preceding character appears exactly m times

■ m,n: the preceding character appears at least m times and at most n times

※ Older versions of gawk do not recognize regular expression intervals by default. To enable them, pass the --re-interval option as in the examples below. (In gawk 4.0 and later, intervals are enabled by default and the option is still accepted.)

$ echo "bt" | gawk --re-interval '/be{1}t/{print $0}'

$ echo "bet" | gawk --re-interval '/be{1}t/{print $0}'

bet

$ echo "beet" | gawk --re-interval '/be{1}t/{print $0}'

$ echo "bt" | gawk --re-interval '/be{1,2}t/{print $0}'

$ echo "bet" | gawk --re-interval '/be{1,2}t/{print $0}'

bet

$ echo "beet" | gawk --re-interval '/be{1,2}t/{print $0}'

beet

$ echo "beeet" | gawk --re-interval '/be{1,2}t/{print $0}'

Using an interval with a character class

$ echo "bt" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

==> Matches b, then one or two a or e characters, then t

$ echo "bat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

bat

$ echo "bet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

bet

$ echo "beat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

beat

$ echo "beet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

beet

$ echo "beeat" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

$ echo "baeet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'

$ echo "baeaet" | gawk --re-interval '/b[ae]{1,2}t/{print $0}'
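The three ERE repetition symbols can be compared side by side. This sketch uses grep -E rather than gawk, on the assumption that it is available everywhere and needs no extra option for intervals; the helper name words_matching is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: prints the sample words that match the ERE in $1,
# joined on a single line.
words_matching() {
    printf '%s\n' bt bet beet beeet | grep -E "$1" | paste -s -d ' ' -
}

optional=$(words_matching 'be?t')      # zero or one e
one_plus=$(words_matching 'be+t')      # one or more e
interval=$(words_matching 'be{1,2}t')  # one or two e

echo "?     : $optional"
echo "+     : $one_plus"
echo "{1,2} : $interval"
```

Only the question mark accepts bt, only the plus sign accepts beeet, and the interval accepts exactly bet and beet.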

Using the pipe symbol

The pipe symbol (|) lets you join two or more patterns with a logical OR. If any one of the patterns matches, the line is printed.

$ echo "The cat is asleep" | gawk '/cat|dog/{print $0}'

The cat is asleep

$ echo "The dog is asleep" | gawk '/cat|dog/{print $0}'

The dog is asleep

$ echo "The sheep is asleep" | gawk '/cat|dog/{print $0}'

Using grouping expressions

You can use parentheses to group part of a pattern so that the group is treated as a single unit by the symbols that follow it. Let's take a look at the example below.

$ echo "Sat" | gawk '/Sat(urday)?/{print $0}' ==> Printed with or without the group (urday)

Sat

$ echo "Saturday" | gawk '/Sat(urday)?/{print $0}'

Saturday

Examples of use with pipes

$ echo "cat" | gawk '/(c|b)a(b|t)/{print $0}'

==> Matches c or b, then a, then b or t

cat

$ echo "cab" | gawk '/(c|b)a(b|t)/{print $0}'

cab

$ echo "bat" | gawk '/(c|b)a(b|t)/{print $0}'

bat

$ echo "bab" | gawk '/(c|b)a(b|t)/{print $0}'

bab

$ echo "tab" | gawk '/(c|b)a(b|t)/{print $0}'

$ echo "tac" | gawk '/(c|b)a(b|t)/{print $0}'
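One thing to watch with an optional group is that, without anchors, the pattern still matches longer words. The sketch below illustrates this with grep -E (assumed here as a stand-in for gawk) and an extra, made-up input word, Saturn.

```shell
#!/bin/bash
# Count matching lines with and without anchors around the optional group.
loose=$(printf '%s\n' Sat Saturday Saturn | grep -Ec 'Sat(urday)?')
strict=$(printf '%s\n' Sat Saturday Saturn | grep -Ec '^Sat(urday)?$')

echo "unanchored: $loose lines, anchored: $strict lines"
```

Unanchored, Saturn matches because it contains Sat; with ^ and $ only Sat and Saturday are accepted.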

Regular expression usage examples

First, let's look at a script that counts the files in each directory listed in the PATH environment variable.

#!/bin/bash
# count number of files in your PATH

# replace each colon in PATH with a space so the for loop can split it
mypath=$(echo $PATH | sed 's/:/ /g')
count=0
for directory in $mypath
do
    check=$(ls $directory)
    for item in $check
    do
        count=$(( count + 1 ))
    done
    echo "$directory - $count"
    count=0
done

$ ./countfiles

/usr/local/bin - 79

/bin - 86

/usr/bin - 1502

/usr/X11R6/bin - 175

/usr/games - 2

/usr/java/j2sdk1.4.1_01/bin - 27
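The script above splits on spaces, so a PATH directory or a file name that contains a space would be miscounted. As a possible variation (not the original author's method), the sketch below splits PATH on colons with tr and counts entries with a glob instead of parsing ls output; the helper name count_entries is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: counts the entries in directory $1 using a glob.
count_entries() {
    count=0
    for item in "$1"/*; do
        # An unmatched glob stays literal, so check that the entry exists.
        if [ -e "$item" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Put each PATH directory on its own line so embedded spaces survive.
printf '%s\n' "$PATH" | tr ':' '\n' | while IFS= read -r directory; do
    if [ -d "$directory" ]; then
        echo "$directory - $(count_entries "$directory")"
    fi
done
```

Directories listed in PATH that do not exist are skipped silently rather than producing an ls error.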

Second, validating a phone number

#!/bin/bash
# script to filter out bad phone numbers
gawk --re-interval '/^\(?[2-9][0-9]{2}\)?(| |-|\.)[0-9]{3}( |-|\.)[0-9]{4}/{print $0}'

Analyzing the pattern

^\(? ==> The line may or may not begin with an opening parenthesis '('

[2-9][0-9]{2} ==> One digit from 2 to 9 followed by two digits from 0 to 9

\)? ==> An optional closing parenthesis ')'

(| |-|\.) ==> Nothing, a space, a dash '-', or a dot '.'

[0-9]{3} ==> Three digits from 0 to 9

( |-|\.) ==> A space, a dash '-', or a dot '.'

[0-9]{4} ==> Four digits from 0 to 9
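The pattern analyzed above can be exercised against a few sample numbers. This is a sketch under some assumptions: it uses grep -E instead of gawk, writes the optional first separator as ( |-|\.)? rather than with an empty alternative, and adds a closing $ anchor so that trailing digits are rejected. The helper name is_phone and the sample numbers are made up.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if $1 has the phone number form described above.
is_phone() {
    printf '%s\n' "$1" |
        grep -Eq '^\(?[2-9][0-9]{2}\)?( |-|\.)?[0-9]{3}( |-|\.)[0-9]{4}$'
}

accepted=""
for number in '317-555-1234' '(317)555-1234' '(317) 555-1234' '317.555.1234' \
              '000-000-0000' '317-555-123'; do
    if is_phone "$number"; then
        accepted="$accepted[$number]"
    fi
done
echo "accepted: $accepted"
```

The last two samples are rejected: the first because an area code cannot start with 0 under this pattern, the second because the final group has only three digits.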

Parsing an e-mail address

E-mail has become a primary means of business communication, so checking the validity of e-mail addresses is a common task. The regular expression below checks the basic form of an address. An e-mail address is divided into a username part and a hostname part, separated by the @ sign (username@hostname), and the special characters allowed in each part are as follows.

Valid special characters in username

■ Dot

■ Dash

■ Plus sign

■ Underscore 

Valid special characters in hostname

■ Dot

■ Underscore 

Now let's take a look at the regular expression below.

^([a-zA-Z0-9_\-\.\+]+)@([a-zA-Z0-9_\-\.]+)

Let's look at the username part, to the left of the @.

^([a-zA-Z0-9_\-\.\+]+)

==> Starting at the beginning of the line, match one or more characters, each of which is a lowercase letter a to z, an uppercase letter A to Z, a digit 0 to 9, or one of the special characters '_', '-', '.' and '+'

Then look at the hostname part, to the right of the @, which holds the server and domain name.

([a-zA-Z0-9_\-\.]+)

==> Match one or more characters, each of which is a lowercase letter a to z, an uppercase letter A to Z, a digit 0 to 9, or one of the special characters '_', '-' and '.'
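A username@hostname check along these lines can be tried with grep -E. The sketch below rests on several assumptions: the bracket expressions are written in POSIX style (no backslashes, dash placed last), a closing $ anchor is added, and the helper name is_email and the sample addresses are invented. It checks only the basic shape of an address, not whether it really exists.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if $1 has the form username@hostname using
# only the characters allowed in each part.
is_email() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9_.-]+$'
}

for address in 'rich@here.now' 'rich.blum+list@here.now' \
               'rich@here@now' 'rich*@here.now'; do
    if is_email "$address"; then
        echo "valid:   $address"
    else
        echo "invalid: $address"
    fi
done
```

The third sample fails because the hostname part may not contain a second @, and the fourth because * is not among the allowed username characters.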

Beyond these examples, regular expressions can be used to validate many other kinds of input data, which makes them essential knowledge in the Linux world.


License: Attribution, NonCommercial, NoDerivatives
