Patterns and Arrays

We now have most of the tools we need to create exceptionally powerful web sites. Perl supplies some other tools that can add even more functionality. These concepts will help to 'round you out' as a perl programmer. Specifically, we are discussing pattern matching and arrays.

Arrays in perl

You are already familiar with scalar variables in perl. As you remember, a scalar is a variable with only one value. The '$' character that precedes variable names in perl denotes them as scalars. Scalars are very easy to work with, and they can do a lot. In the real world, though, you often have to work with groups or lists of values. A scalar variable is not enough. You could of course generate a whole group of scalars, but this could get unwieldy quickly.

Imagine a program that would keep track of a miniature golf score. We would need a score for each player on each hole, and another variable to handle the total for each player. If we have an eighteen hole course with four players, that comes out to 19*4 = 76 variables to keep track of! That would not be fun. The holes are numbered, so it would make sense to have one 'super-variable' for each player's score. That variable could contain a separate value for each score, and we could use the hole number to keep track. For example, we could have playerOneScore[1], which means 'score for player one for hole 1.'

In perl, arrays are denoted with the '@' symbol. (Get it? $ for Scalar, @ for Array). Like scalars, you create an array simply by using it. As usual, perl is very flexible about arrays. In many languages, you must explicitly declare the length (number of elements) of an array before you use it, and if you try to access elements outside that length, you will have big troubles. Perl is much easier-going. Perl arrays are dynamic, which means the size can change as we go. Here's an example of an array for the golf scores in perl:


$score[1] = 1;
$score[2] = 2;
$score[3] = 3;
$score[4] = 4;
$score[5] = 5;
$score[6] = 1;
$score[7] = 2;
$score[8] = 3;
$score[9] = 4;

print $score[3];

print "\n";

As you would expect, an array is created, the scores are assigned, and we will see the value 3 as the result. Note that we use the $ identifier to refer to a single SCALAR value in the array. We can also use the @ identifier with an index, but that refers to a smaller array (called a slice). We'll look at that in more detail later, but for now, most of the time you will want to deal with the $ identifier.

Take a look at this alternate way to fill up an array:

@score = (0, 1, 2, 3, 4, 5, 1, 2, 3, 4);
print "$score[3] \n";
As you can see, this is very easy if you already know the values you want to place in an array. Notice that I actually placed 10 elements in the array, starting with a zero. This is because arrays in perl start out with zero as a default index, and I wanted to place the scores in values one through nine.
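
To make the zero-based indexing concrete, here is a small sketch built on the same list. The names $placeholder, $holeOne, $lastIndex and @holes are made up for illustration, and the 'use strict' line and 'my' declarations are a stylistic assumption of this sketch rather than something the listings in this chapter require.

```perl
use strict;
use warnings;

my @score = (0, 1, 2, 3, 4, 5, 1, 2, 3, 4);

# element 0 is the placeholder, so hole 1 lives at index 1
my $placeholder = $score[0];
my $holeOne     = $score[1];

# $#score is the index of the last element (one less than the length)
my $lastIndex = $#score;

# an @ with a range of indexes extracts a smaller array (a slice)
my @holes = @score[1 .. $lastIndex];

print "last index: $lastIndex \n";
print "holes 1 through $lastIndex: @holes \n";
```

Run as written, this should report a last index of 9 and list the nine real scores without the placeholder.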

Looking at the values of an array

Of course, whenever we have an array, we will probably want to look at its elements at some point. Perl gives you many ways to do this. Examine the code below for a number of interesting ways to look at an array:


@score = (0, 1, 2, 3, 4, 5, 1, 2, 3, 4);

print "no quotes: \t";
print @score;
print " \n";

print "in quotes: \t";
print "@score \n";

$length = @score;
print "the length of score is: $length";


print "\n\nUsing standard for loop:\n";
print "hole \t score \n";
for ($i = 1; $i<10; $i++){
  print "$i \t $score[$i] \n"; 
} # end for loop

print "\n\nUsing foreach:\n";
foreach $hole (@score){
    print "$hole \n";
} # end foreach
And here is the output produced:
no quotes: 	0123451234 
in quotes: 	0 1 2 3 4 5 1 2 3 4 
the length of score is: 10

Using standard for loop:
hole 	 score 
1 	 1 
2 	 2 
3 	 3 
4 	 4 
5 	 5 
6 	 1 
7 	 2 
8 	 3 
9 	 4 


Using foreach:
0 
1 
2 
3 
4 
5 
1 
2 
3 
4 

We did a number of things here, so let's see if we can determine how all these things worked. Our first form of output was a simple print statement, like this:

print @score;
It printed out every value in the array as one big concatenated string, like this:
 no quotes: 	0123451234 
This is a very useful feature that most languages do not have. Most of the time, if you want to look at the elements of an array, you must use a loop. Perl gives you a quick way to see the elements of the array. There is a problem, though. These are all single-digit values, so it is not hard to figure out which is which. If an array consists of multi-character values such as strings, it will be impossible to tell where one element ends and the next begins. Perl offers a simple solution. If you interpolate the variable into a string literal (with double quotes), the array will be printed with a space between elements.

print "@score \n";
produces
in quotes: 	0 1 2 3 4 5 1 2 3 4 

Using loops to look at arrays

Usually, the most powerful way to look at an array is with a loop. For loops are especially handy with arrays. Of course, to write a for loop, you need to know how many elements are in the array. Perl's dynamic arrays can have lengths that change, so there must be a way to find the current length, or number of elements. You do not even need a special function for this. This line does the trick:

$length = @score;

It is a little strange, but it works. We are assigning the ENTIRE array to a scalar. A scalar can hold only one value, so perl evaluates the array in what is called scalar context, and an array in scalar context gives back its number of elements rather than any of its values. This is not intuitive, so you will have to think about it, but it will soon become natural to you.
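
The difference is easiest to see side by side. This is a minimal sketch; the names $first and $explicit are invented for the example, and 'use strict' with 'my' is my own habit, not a requirement.

```perl
use strict;
use warnings;

my @score = (0, 1, 2, 3, 4, 5, 1, 2, 3, 4);

my $length   = @score;          # scalar on the left: we get the element count
my ($first)  = @score;          # parentheses make a list: we get the first element
my $explicit = scalar(@score);  # scalar() asks for the count explicitly

print "length: $length, first: $first, explicit: $explicit \n";
```

The parentheses around $first are the whole difference. Without them perl counts the elements, and with them it copies values.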

Once we have the number of elements in an array, it becomes reasonably straightforward to create a for loop that steps through the array's elements:

print "\n\nUsing standard for loop:\n";
print "hole \t score \n";
for ($i = 1; $i<10; $i++){
  print "$i \t $score[$i] \n"; 
} # end for loop
produces:
Using standard for loop:
hole 	 score 
1 	 1 
2 	 2 
3 	 3 
4 	 4 
5 	 5 
6 	 1 
7 	 2 
8 	 3 
9 	 4 

The nice thing about for loops is the amount of control you have. You can use the value of the index variable to make nicely labeled output as we have done here.

Much of the time, though, we don't need that much information, and just need a simple way to step through the array. Perl offers a special version of the for loop for just this case:

print "\n\nUsing foreach:\n";
foreach $hole (@score){
    print "$hole \n";
} # end foreach
produces
Using foreach:
0 
1 
2 
3 
4 
5 
1 
2 
3 
4 

The foreach loop takes a scalar and an array. It repeats once for each element of the array. Inside the loop, the scalar takes on the value of each element in turn. This foreach loop works much like the for loop above, except that it visits every element, including the placeholder at index 0. It is very easy to work with, but it has one major drawback. Inside the loop, you do not have access to the index variable. In this case, that meant I could not easily print the hole number. (Of course, I could have simulated this by creating a variable and incrementing it myself.) The real lesson is this: You can always use a for loop, and it is very powerful. For simpler applications, the foreach loop is often all you need, and it is simpler to write.
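
Simulating the index inside a foreach loop might look something like this. It is only a sketch: $holeNumber and $report are names I made up, and the output is collected in a string so it can be printed in one go.

```perl
use strict;
use warnings;

my @score = (0, 1, 2, 3, 4, 5, 1, 2, 3, 4);

my $report     = "";
my $holeNumber = 0;   # our own counter, since foreach gives us no index

foreach my $hole (@score) {
    # skip the placeholder stored at index 0
    if ($holeNumber > 0) {
        $report .= "$holeNumber \t $hole \n";
    }
    $holeNumber++;
} # end foreach

print "hole \t score \n";
print $report;
```

By the time the counter is set up and incremented, the plain for loop is usually the shorter choice, which is why I reach for foreach only when the index does not matter.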

Split and join

It is quite common to want to convert scalars to arrays, and arrays to scalars. Perl has some operators that make this quite easy to do. We have already used the split operator, but we haven't yet seen its full functionality. As an example of how these operators work, let's look at a silly program that reverses the words in a phrase.
print "Please enter a phrase to reverse: \n";

$phrase = <STDIN>;
chomp $phrase;
@words = split (/ /, $phrase);
$phrase = join (" ", reverse @words);
print $phrase;
and here is a typical output:
Please enter a phrase to reverse:
this is a cool program
program cool a is this

The reverse function expects to receive an array. It then reverses the order of the elements in the array. Unfortunately, when we ask for input, we get a SCALAR, not an array. We can use the split function to break the scalar into an array. In this case, we are splitting on the space character.

Once we have an array, it is a simple matter to reverse it. We will take that reversed array, and join it back together with spaces, and store this back into the $phrase scalar.

In CGI programming, the most common use of this kind of operation is when we are doing some kind of file handling. We will often store a record as a line, and use some special character for delimiting the fields. We can of course split into an array, or into specific values, or even a combination. For example, if you had a file that had a different record on each line, and you knew that the first field would be a name, and the rest would be grades, you might split the line like this:

($name, @grades) = (split /\t/, $line);

Split is used very frequently, but join is not used as much, as we can use string concatenation or interpolation to achieve the same effect.
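
Here is a small sketch of the record-splitting idea, with join used to glue the grades back together for display. The record in $line is invented sample data, and $gradeList is a name chosen just for this example.

```perl
use strict;
use warnings;

# a made-up tab-delimited record: name first, then grades
my $line = "Pat\t90\t85\t77";

# the first field goes to $name, everything left over goes to @grades
my ($name, @grades) = split(/\t/, $line);

# join turns the array back into one scalar, with a delimiter between values
my $gradeList = join(", ", @grades);

print "$name scored: $gradeList \n";
```

Interpolating "@grades" would also work here, but it always separates the values with a single space, so join is the tool to use when you want a different delimiter.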

Using arrays to simplify handling a file

While we are talking about arrays, you might be interested in an approach that is used to handle a file as an array of lines. The following code will open a file, slurp the entire thing into an array, and print out a line at a time:

open THEFILE, "patterns.html";
@wholeThing = <THEFILE>;
close THEFILE;

foreach $line (@wholeThing){
    print "I said: " . $line;
} # end foreach

Of course, you know other ways to do this, but this approach has some advantages. The entire file is read by a single statement, so the file access will usually be fast, and the file will not stay open as long. Finally, your entire file will be in memory, where you can manipulate it all you want, so if you want to do sorting, deletions, or whatever, you can do it on the array. To save the array, just print the array to a file open for output. This program will sort the lines of a file in alphabetical order:

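#read every line of the file into an array
open THEFILE, "sortThis.dat" or die "Cannot read sortThis.dat: $!";
@wholeThing = <THEFILE>;
close THEFILE;

#reopen the same file for output and write the sorted lines
#(stopping on a failed read above keeps us from wiping the file)
open THEFILE, ">sortThis.dat" or die "Cannot write sortThis.dat: $!";
print THEFILE sort(@wholeThing);
close THEFILE;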
sortThis.dat before sorting:
z
r
a
s
q
l
and after sorting
a
l
q
r
s
z

There is one significant limitation to loading the entire file into memory. It can take a lot of memory to do this, especially if it is a really huge data file. Perl will let you grab a massive file, but you could overwhelm the resources of the server if you were to grab several megabytes of data as one array, for example.
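
When a file is too large to slurp, reading one line at a time keeps the memory use small. The sketch below makes some assumptions so it can run anywhere: $fakeFile stands in for a real data file (perl can open a reference to a scalar and read it like a file), and the three-argument open with a 'my' filehandle is a newer style than the bareword THEFILE used elsewhere in this chapter.

```perl
use strict;
use warnings;

# stand-in data so the sketch runs without a real file on disk
my $fakeFile = "z\nr\na\n";
open(my $fh, '<', \$fakeFile) or die "Cannot open: $!";

my $lineCount = 0;
while (my $line = <$fh>) {   # only one line is held in memory at a time
    $lineCount++;
    print "I said: " . $line;
} # end while
close($fh);

print "read $lineCount lines \n";
```

The trade-off is that you can no longer sort or rearrange the whole file, since you never have all of it at once.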

Push and pop

Since arrays are dynamic in perl, there are some operators that are used to add or remove elements from the array. The most important of these are push and pop. These are named after a computer science construct called a stack, which is reminiscent of a stack of plates on a buffet line. The restaurant workers will load up the plates one on top of another. This is called pushing. Each plate goes on the top (or end) of the stack. Customers come by and take off plates starting at the top (end) of the stack. Taking a plate off is called 'popping' it off the stack. Perl arrays can act like stacks by using the push and pop functions. Push adds a new element to the end of the array. Pop removes a value from the end of the array. Here's another version of the reverser program that uses push and pop to reverse a string.
#revArray.pl
#reverse an array using push and pop

print "Please enter a phrase to reverse: \n";
$phrase = <STDIN>;
chomp $phrase;

@oldArray = split(/ /, $phrase);
@newArray = ();

$numWords = @oldArray;
for ($i = 0; $i < $numWords; $i++){
  $word = pop (@oldArray);
  push (@newArray, $word);
} # end for loop

print "@newArray \n";
This works by splitting the phrase into an array. We then loop once for each word in the array, popping the last word off the old array and pushing it onto the end of the new array.

Patterns and Regular Expressions

Arrays are one of the exceptional features of perl. Another is the ability to deal with regular expressions. If you have ever used the 'find' or 'search and replace' features of a text editor, you have used a form of pattern matching already. Perl has a number of pattern-matching features, but there are two operators we will look at especially carefully. The first is the match operator. Take a look at the following code, and you will see what it does:

#matcher.pl

open THEFILE, "patterns.html";
@wholeThing = <THEFILE>;
close THEFILE;

foreach $line (@wholeThing){
  if ($line =~ m/perl/){
    print "I found a reference to perl: \n   $line";
  } # end if
} # end foreach
It will output this:
I found a reference to perl: 
   perl programmer.  Specifically, we are discussing pattern matching and
I found a reference to perl: 
   <h2>Arrays in perl</h2>
I found a reference to perl: 
   You are already familiar with scalar variables in perl.  As you
I found a reference to perl: 
   character that precedes variable names in perl denotes them as
I found a reference to perl: 
   In perl, arrays are denoted with the '@' symbol.  (Get it?  $ for
I found a reference to perl: 
   using it.  As usual, perl is very flexible about arrays.  In many
I found a reference to perl: 
   scores in perl:
I found a reference to perl: 
   a zero.  This is because arrays in perl start out with zero as a
What's going on here?
Well, the magic line here is this one:
  if ($line =~ m/perl/){

Read the '=~ m' as 'matches', and the line means 'if the line matches on perl.' This condition will come out true if the pattern 'perl' is anywhere in the line, or false otherwise. I ran it on this chapter (well, an earlier version, or we'd have references to this code output as well!!) and it returns a message every time we find a line with the word 'perl' in it. That's handy!!

The m stands for 'match'. We are looking for lines that match the pattern between the slashes. This is much like using perl's index function (an 'instring' search) to look for one string inside another. You might wonder why we need this, since index already finds substrings. The answer is the pattern. The pattern does not have to be a simple string. It is what is called a regular expression. Regular expressions are almost a miniature programming language in their own right. They have been part of unix for a long time. They can be intimidating, but they give you really exciting power. For example, we can change the code to this:

  if ($line =~ m/^perl/){
and the only output would be this:
I found a reference to perl: 
   perl programmer.  Specifically, we are discussing pattern matching and
The caret (^) says 'anchor to the beginning of the string', so it looks only for lines that BEGIN with the value 'perl.'

If you look back, we didn't get nearly enough matches in our first time through. It looked only for lowercase perl. I used perl as the first word in a number of sentences, so we should also be looking for uppercase p. This pattern would do that:

  if ($line =~ m/[pP]erl/){

The square brackets hold a set of alternative characters, any one of which may appear at that position. We are looking here for a lowercase or uppercase p, then the letters 'erl.' Actually, there is another way to do this:

  if ($line =~ m/perl/i){
The i after the last slash means 'ignore case.'

There are a number of special characters we can use as well. The . means 'any single character' (anything except a newline, including spaces and punctuation), so

m/.erl/ 
would match on perl, Perl, gerl, 3erl, and so on.
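
To compare the patterns so far, here is a quick sketch that counts how many of four sample lines each one matches. The sample lines and the counter names are invented for the example.

```perl
use strict;
use warnings;

# four made-up lines to search
my @lines = ("perl programmer", "Perl supplies tools", "a gerl?", "properly");

my ($anchored, $eitherCase, $ignoreCase, $anyChar) = (0, 0, 0, 0);
foreach my $line (@lines) {
    $anchored++   if $line =~ m/^perl/;    # 'perl' at the start of the line only
    $eitherCase++ if $line =~ m/[pP]erl/;  # p or P, then erl, anywhere in the line
    $ignoreCase++ if $line =~ m/perl/i;    # perl in any mix of case
    $anyChar++    if $line =~ m/.erl/;     # any one character, then erl
} # end foreach

print "anchored: $anchored \n";
print "either case: $eitherCase \n";
print "ignore case: $ignoreCase \n";
print "any character: $anyChar \n";
```

Notice that 'properly' counts as a match for everything except the anchored pattern, because the letters p-e-r-l appear in the middle of the word. A pattern matches anywhere in the string unless you anchor it.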

The code \d means a single numeric digit, so we could look for phone numbers like this:

if ($line =~ m/\d\d\d-\d\d\d\d/){
Yuck. You can see the biggest problem with regular expressions. They can get really ugly to read. There are some special things you can do. I really don't like combining the / normally used with patterns and the \ characters that commonly show up. You can replace the / with almost any punctuation character you wish, although it should of course be something that won't come up in your pattern. I often prefer #, since it is not used anywhere else in perl, except to begin a comment. Let's look at the phone number example again that way:
if ($line =~ m#\d\d\d-\d\d\d\d#){
That's a little better. Now you can see that we are looking for three digits, a dash, and four digits. In fact, when characters repeat like this, we can use a number in {} brackets to say how many times the previous item must repeat, so we could also write the above line like this:
if ($line =~ m#\d{3}-\d{4}#){

Another handy special character is the \b (word boundary) character. This helps ensure that you are checking for complete words. For example, if you wanted to match on 'the' but not 'them' or 'there' or 'together,' you can try this pattern:

if ($line =~ m#\bthe\b#){
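This match will only return true if there is a word boundary before and after the word 'the'. A word boundary is any spot where a word character (a letter, digit, or underscore) sits next to something that is not one: whitespace, punctuation, or the beginning or end of the string.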

One more handy feature is the repetition operators. If we want to match a number but we don't know how many digits it is, we can use the following pattern:

if ($line =~ m#\d+#){
The plus sign means 'match one or more occurrences of the preceding character.' An asterisk is used to match zero or more occurrences of the character.
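
Here is a sketch that tries the last few patterns against some made-up strings and notes which ones match. The test strings and the @found array are inventions for this example.

```perl
use strict;
use warnings;

my @found;
foreach my $text ("call 555-1234 now", "the cat", "them", "room 12") {
    my $notes = "";
    $notes .= " phone"  if $text =~ m#\d{3}-\d{4}#;  # three digits, dash, four digits
    $notes .= " the"    if $text =~ m#\bthe\b#;      # 'the' as a complete word
    $notes .= " number" if $text =~ m#\d+#;          # one or more digits
    push @found, "$text:$notes";
    print "$text ->$notes \n";
} # end foreach
```

The string 'them' matches none of the three, since the word boundary after 'the' is missing, while the phone number line matches both the phone pattern and the looser 'one or more digits' pattern.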

Pattern memory

If we want to keep track not only of the line that contained the match, but the actual value that was matched, we place that part of the pattern in parentheses. The matched value will be in a special variable called $1. If there is a second set of parentheses, its match will be in $2, and so on. Here's an example:

if ($line =~ m#(\d\d\d-\d\d\d\d)#){
  print "The phone number is $1 \n";
} # end if
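
With two sets of parentheses, the same idea can pull the phone number apart. This is a sketch with an invented sample line; $exchange and $number are names I picked for the two pieces.

```perl
use strict;
use warnings;

my $line = "Call the office at 555-1234 before noon";

my ($exchange, $number) = ("", "");
if ($line =~ m#(\d{3})-(\d{4})#) {
    # copy the values right away; the next successful match replaces $1 and $2
    ($exchange, $number) = ($1, $2);
    print "exchange: $exchange, number: $number \n";
} # end if
```

Copying $1 and $2 into your own variables is a good habit, because the next successful match will overwrite them.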

Pattern matching and CGI

In CGI programming, pattern matching is mainly used in database-style applications, to aid in search functions. You might also use it in form validation, for example, to check that an email address contains an @ sign. Another very common use is the use of HTML files as templates. You can generate HTML files with special HTML comments, and then look for those comments to see where to place code. The slideshow program used as a sample final project extensively uses this technique.
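
As a sketch of the form validation idea, this checks that an address has at least one character on each side of an @ sign. It is a very rough test and nowhere near a complete email validator. The addresses are invented, and note that the @ is written as \@ in the pattern and the test strings use single quotes, so perl does not mistake the @ for the start of an array name.

```perl
use strict;
use warnings;

my @good;
foreach my $address ('elmer@example.com', 'no-at-sign.example.com', '@example.com') {
    # at least one character, then an @ sign, then at least one more character
    if ($address =~ m/.+\@.+/) {
        push @good, $address;
        print "$address looks plausible \n";
    } else {
        print "$address is missing something \n";
    } # end if
} # end foreach
```

Only the first address passes. The second has no @ sign at all, and the third has nothing in front of it.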

Substitution

In addition to matching, perl supports substitution of regular expressions. Substitution works basically like the 'search and replace' feature of a word processor. For an example, we will write a simple program to handle a common issue for writers. I learned to type in the old days with a typewriter. In those days, we were taught to end every sentence with two spaces after the period. Now, publishers often prefer there to be only one space after the period. Hard as I try, I still end up typing two spaces every time. Let's write a simple program that will look at a file and replace every occurrence of ".  " (a period with two spaces) with ". " (a period with one space). Here's the program:

#kill spaces
#ask user for a file name and kill double-spaces after periods.

print "What file do you want me to modify? \n";
$file = <STDIN>;
chomp $file;

open THEFILE, "$file" or die "Cannot read $file: $!";
@wholeThing = <THEFILE>;
close THEFILE;

foreach $line (@wholeThing){
  # \. is a literal period; a bare . would match ANY character
  $line =~ s#\.  #. #g;
  push @newArray, $line;
} # end forEach

open THEFILE, ">$file" or die "Cannot write $file: $!";
print THEFILE @newArray;
close THEFILE;

The only part that is unfamiliar is the substitution line. Substitution works much like matching, except there is an 's' instead of an 'm', and there are TWO parts between the delimiters. The first is the pattern we are searching for, and the second is the text we will replace it with. In addition, the substitution operator changes the variable itself, which then holds the results of the substitution.

Special operators in substitution

All the special characters and tricks of regular expressions used in the matching examples will work in substitutions too. At the end of the substitution pattern, we can place additional codes, like the i used in matches for 'ignore case'. In addition, we can use 'g' for global. This will ensure that if more than one match of the pattern occurs in the variable, we will replace them all.
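
The effect of these codes is easiest to see by running the same substitution three ways. This is only a sketch, with an invented test string and variable names.

```perl
use strict;
use warnings;

my $once   = "perl Perl perl";
my $global = $once;
my $both   = $once;

$once   =~ s#perl#PERL#;    # no codes: only the first lowercase match changes
$global =~ s#perl#PERL#g;   # g: every lowercase match changes
$both   =~ s#perl#PERL#gi;  # g and i: every match changes, whatever its case

print "$once \n$global \n$both \n";
```

Without the g, the substitution stops after the first match. Without the i, the capitalized 'Perl' in the middle is left alone.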

Here's an example of a more complex program that uses both matching and substitution to convert a phrase into 'pig latin':

#igpay.pl
#given a phrase, converts it to pig latin
#uses matching and substitution

print "Please enter a phrase to 'pigify': \n";
$phrase = <STDIN>;

chomp $phrase;
@words = split(/ /, $phrase);

foreach $word (@words){
  #if the word begins with a vowel
  if ($word =~m#^[aeiou]#i){
    #simply add 'way' to the end
    $word .= "way";
  } else {
    #do this substitution
    $word =~s#(^.)(.*)#$2$1ay#gi;
  } #end 'starts with vowel' if
  $newPhrase .= "$word ";
} # end foreach
print "$newPhrase \n";

The first part of the program is familiar by now. We ask the user for a string, then break it into an array of words. Then we use a foreach loop to look at each word in turn. There are two simple rules to pig latin. If a word starts with a vowel, we just add 'way' to the end. If it starts with a consonant, we move that letter to the end, and then add 'ay' to the end of the word.

We check for vowels with this match:

$word =~m#^[aeiou]#i
This will return true if the FIRST character (the ^ means the beginning of the string) is a, e, i, o, or u. The i tells us that case is irrelevant. If the first letter is a vowel, we do a simple string concatenation to add 'way' to the end of the word.

If the word started with a consonant, we have a lot more work to do. Fortunately, it can be done with one substitution, if we build the substitution carefully. Here's the code that does it:

    $word =~s#(^.)(.*)#$2$1ay#gi;
There's a lot going on in that one line of code. Let's take it apart. First, we're doing a substitution on $word. That much makes sense. We're looking for two patterns. First, we want whatever is the first character. The ^ marks the beginning of the string, and . means a character, so ^. together means 'the first character.' We will use this later, so we will store its value by placing it in parentheses. Now whatever the first character was will be stored in $1.

The next part of the pattern we are matching is (.*). The . means 'any character', and the * means 'any number of times'. In effect, we are grabbing all the rest of the characters. This expression is also inside parentheses, which means it too will be stored in pattern memory as $2. We do this because we will need its value later as well.

Now we move on to the replacement part of the expression, which is '$2$1ay'. '$2' means 'whatever was in the second stored match', which is the rest of the word after the first character. So, the replacement will start with the whole word after the first character. The '$1' is the first match, so that is the first character. At this point, the variable will contain the last letters of the word, followed by the first character of the word. Finally we add 'ay'.

The 'gi' part of the expression says 'do it globally' and 'ignore case'. Neither is actually needed here: the pattern is anchored to the start of the string and already consumes the whole word, so it can match only once, and it contains no letters whose case could matter.
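
To watch the two stored matches at work, here is a small sketch that runs the same pattern on one sample word. The word 'latin' and the names $first and $rest are made up for the example, and I left off the 'gi' codes.

```perl
use strict;
use warnings;

my $word = "latin";
my ($first, $rest) = ("", "");

# the same pattern used as a match, so we can look at what gets stored
if ($word =~ m#(^.)(.*)#) {
    ($first, $rest) = ($1, $2);
    print "first character: $first, rest of word: $rest \n";
} # end if

# now the substitution itself: rest of word, first character, then 'ay'
$word =~ s#(^.)(.*)#$2$1ay#;
print "$word \n";
```

The match stores 'l' and 'atin', and the substitution turns the word into 'atinlay'.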

As an illustration of just how powerful these features are, here's the same algorithm done without regular expressions:

#pig latin converter

print "Please enter a phrase you want converted: \n";

$inPhrase = <STDIN>;

@words = split(/ /, $inPhrase);

foreach $wordIn (@words){

  chomp($wordIn);
  $wordOut = "";
  $startsWithVowel = "false";

  $firstChar = substr($wordIn, 0, 1);
  $restOfWord = substr($wordIn, 1, length($wordIn)-1);

  #print "$wordIn \n$firstChar \n$restOfWord\n";

  if (uc($firstChar) eq "A"){
    $startsWithVowel = "true";
  } elsif (uc($firstChar) eq "E"){
    $startsWithVowel = "true";
  } elsif (uc($firstChar) eq "I"){
    $startsWithVowel = "true";
  } elsif (uc($firstChar) eq "O"){
    $startsWithVowel = "true";
  } elsif (uc($firstChar) eq "U"){ 
    $startsWithVowel = "true";
  } # end if
  
  if ($startsWithVowel eq "true"){
    $wordOut = $wordIn . "way";
  } else {
    $wordOut = $restOfWord . $firstChar . "ay";
  } # end if
  
  
  print "$wordOut ";
} # end foreach

As you can guess, we are just skimming the surface of pattern matching and replacing, but these examples will get you started. You will need to practice this a lot before you will be comfortable. Still, this should be enough for most basic operations.

Substitution in CGI

One very significant use of substitution is to create 'new' markup languages. This allows us to generate new HTML-like codes for special purposes. As an example, look at the slideshow program, which allows the user to generate an entire set of HTML slides from one HTML-like script. It is described later as an example of a final project using CGI.

Laboratory Assignment

Create an 'Elmer Fudd' simulator. This program should allow the user to type in some text, and then the text should be rewritten with the peculiar speech impediment of Elmer Fudd. Specifically, use pattern matching and / or substitution. As usual for CGI programs, please provide links to both the source code and the running program.