
Hobby Public Radio

Your ideas, projects, opinions - podcasted.

New episodes Monday through Friday.



In-Depth Series

Learning Awk

Episodes about using Awk, the text manipulation language. It comes in various forms called awk, nawk, mawk and gawk, but the standard version on Linux is GNU Awk (gawk). It's a programming language optimised for the manipulation of delimited text.

Gnu Awk - Part 11 - b-yeezi | 2018-05-17

Awk Part 11

Gnu Awk Documentation: https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions

Numerical functions

  • atan2: arctangent of y / x in radians
  • cos: cosine of x in radians
  • exp: e raised to the power x
  • int: truncate a number to an integer (towards zero)
  • log: natural log
  • rand: (pseudo) random number between 0 and 1
  • sin: sine of x in radians
  • sqrt: square root
  • srand: set the seed for rand, so that the (pseudo) random sequence can be controlled manually
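
As a quick illustration of a few of these functions, here is a minimal sketch run from the shell. The input values are arbitrary and the shell variable name out is only for this example. LC_ALL=C is set as a precaution so that the decimal point is printed as a full stop.

```shell
# Try a few numeric functions in a BEGIN rule (no input file is needed)
out=$(LC_ALL=C awk 'BEGIN {
    srand(1)                       # seed the generator
    r = rand()                     # r is at least 0 and less than 1
    # int() truncates towards zero; atan2(0, -1) is pi
    printf "%d %d %.4f %.1f %.3f %d\n",
        int(3.9), int(-3.9), atan2(0, -1), sqrt(16), exp(1), (r >= 0 && r < 1)
}')
echo "$out"    # 3 -3 3.1416 4.0 2.718 1
```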

String functions

  • asort: array sort. Sorts the values of the array in place and returns the number of elements
  • asorti: array sort. Sorts the keys (indices) of the array and returns the number of elements
  • gensub: Search the target string target for matches of the regular expression regexp. Returns the string with the substituted text; the original target is not changed.
  • gsub: Search target for all of the longest, leftmost, nonoverlapping matching substrings it can find and replace them with replacement. Modifies target in place and returns the number of substitutions made.
  • sub: Search target, which is treated as a string, for the leftmost, longest substring matched by the regular expression regexp and replace it with replacement. Modifies target in place and returns the number of substitutions made (0 or 1).
  • index: Search the string in for the first occurrence of the string find. Returns the position where that occurrence begins, or 0 if it is not found
  • length: returns length of string
  • match: Search string for the longest, leftmost substring matched by the regular expression regexp and return the character position (index) at which that substring begins, or 0 if there is no match.
  • split: Divide string into pieces delimited by the field separator and store them in an array. Returns the number of pieces
  • sprintf: Returns the string that printf would have printed, so that it can be stored in a variable
  • strtonum: Turn a string holding a decimal, octal (leading 0) or hexadecimal (leading 0x) representation into a number
  • substr: Substring starting at position x for length of y. Returns string
  • tolower: Lower-case the string
  • toupper: Upper-case the string
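
The following is a small sketch of some of the string functions in use. It sticks to functions found in most awk implementations (so no asort, asorti, gensub or strtonum, which are gawk extensions). The sample string and the variable names s, t, parts and out are made up for the illustration.

```shell
out=$(awk 'BEGIN {
    s = "apple,red,4"
    n = split(s, parts, ",")     # split fills parts[] and returns the number of pieces
    t = s
    c = gsub(/p/, "P", t)        # gsub changes t in place and returns the number of changes
    print n, parts[2], length(s), index(s, "red"), substr(s, 1, 5), toupper(parts[1]), c, t
}')
echo "$out"    # 3 red 11 7 apple APPLE 2 aPPle,red,4
```

Note that gsub is given a copy (t) because it would otherwise overwrite s.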

References


Gnu Awk - Part 10 - Dave Morriss | 2018-04-09

Gnu Awk - Part 10

Introduction

This is the tenth episode of the "Learning Awk" series which is being produced by b-yeezi and myself.

In this episode I want to talk more about the use of arrays in GNU Awk and then I want to examine some real-world examples of the use of awk.

Long notes

The notes for the rest of this episode are available here.


Gnu Awk - Part 9 - b-yeezi | 2018-01-29

Awk Series Part 9 - printf

The printf function allows for greater control over the output, in comparison to print.

To follow along, you can either use these show notes or refer to the gawk manual.

There are 3 main areas to cover:

  • Basic printf syntax
  • Format Control letters
  • Format modifiers

Syntax

printf format, item1, item2, …

The big difference in the syntax of printf statements is the format argument. It allows you to use complex formatting and layouts for outputs. Unlike print, printf does not automatically append a newline to its output. This can be useful when you want to print all of the items in a column on a single line.

For example, remember the example file, file1.csv:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

Look at the difference between the following outputs:

awk -F, 'NR!=1{print "Color", $2, "has", $3}' file1.csv

and

awk -F, 'NR!=1{printf "Color %s has %s. ", $2, $3}' file1.csv
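
To see the difference without creating the full file, here is a sketch that uses a shortened copy of file1.csv (the header plus three records) written to a temporary file. The shell variable names are only for this example.

```shell
# Shortened stand-in for file1.csv
tmp=$(mktemp)
printf 'name,color,amount\napple,red,4\nbanana,yellow,6\nstrawberry,red,3\n' > "$tmp"

with_print=$(awk -F, 'NR!=1{print "Color", $2, "has", $3}' "$tmp")
with_printf=$(awk -F, 'NR!=1{printf "Color %s has %s. ", $2, $3}' "$tmp")

echo "$with_print"     # three lines, one per record
echo "$with_printf"    # a single line, because printf adds no newline
rm -f "$tmp"
```

With print each record ends up on its own line ("Color red has 4" and so on), while the printf version runs everything together as "Color red has 4. Color yellow has 6. Color red has 3. ".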

Control Letters

Control letters control or cast the output to specific types. Use them as a way to convert ints to floats, ints to chars, etc.

%c = to char. printf "%c", 65 prints A
%i, %d = to int. printf "%i", 3.4 prints 3
%f = to float. printf "%f", 65 prints 65.000000
%e, %E = to scientific notation. printf "%e", 65 prints 6.500000e+01. Using %E prints a capital E instead of e.
%g = to either scientific notation or int. printf "%.2g", 65 prints 65, while printf "%.1g", 65 prints 6e+01
%s = to string. printf "%s", 65 prints 65
%u = to unsigned int. printf "%u", -6 prints 18446744073709551610

There are others. See documentation.
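
A short sketch that puts several control letters in one format string, with "|" used purely as a visual separator. %d is used rather than %i, and LC_ALL=C is set so the decimal point is a full stop; both are precautions for this example rather than requirements of gawk.

```shell
out=$(LC_ALL=C awk 'BEGIN {
    # char, int, float, scientific notation, string
    printf "%c|%d|%f|%e|%s\n", 65, 3.4, 65, 65, 65
}')
echo "$out"    # A|3|65.000000|6.500000e+01|65
```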

Formatting

N$ = positional specifier. printf "%2$s %1$s", "second", "first" prints first second
n = minimum field width. The value is right-justified and padded with spaces on the left.
-n = minimum field width, left-justified. The value is padded with spaces on the right.
space = prefix positive numbers with a space, negative numbers with a -
+ = prefix all numbers with a sign (either + or -)
0n = leading 0's before input. printf "%03i", 65 prints 065.
' = thousands separator, where the locale defines one. printf "%'i", 6500 prints 6,500
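
Here is a minimal sketch of the width, justification, zero padding and sign modifiers, with square brackets added only to make the padding visible. The N$ positional specifier and the ' flag are left out because support for them varies between awk implementations and locales.

```shell
out=$(awk 'BEGIN {
    # right-justified, left-justified, zero-padded, forced sign, space for positive
    printf "[%5s][%-5s][%03d][%+d][% d]\n", "ab", "ab", 65, 65, 65
}')
echo "$out"    # [   ab][ab   ][065][+65][ 65]
```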

Below is a (crude) illustration of how I like to think when formatting output:

          7          2
├──────┼───────┼────┼──┤
 Color: RedXXXX Sum: X6
       18            3
├──────────────────╂───┤
 Total Sum:XXXXXXXX X34

See the following awk file

BEGIN {
    FS=",";
}
NR != 1 {
    a[$2]+=$3;
    c+=$3;
    d+=1;
}
END {
    for (b in a) {
        printf "Color: %-7s Sum: %2i\n", b, a[b];
    }
    print "----------------------"
    printf "%-18s %3i\n", "Total Sum:", c;
    printf "%-18s %3i\n", "Total Count:", d;
    printf "%-18s %3.1f\n", "Mean:", c / d;
}

This gives the following output:

Color: brown   Sum: 13
Color: purple  Sum: 12
Color: red     Sum:  7
Color: yellow  Sum: 11
Color: green   Sum:  8
----------------------
Total Sum:          51
Total Count:         9
Mean:              5.7

Resources

  1. https://www.gnu.org/software/gawk/manual/gawk.html#Printf
  2. http://www.grymoire.com/Unix/Awk.html
  3. http://datascienceatthecommandline.com/

Gnu Awk - Part 8 - Dave Morriss | 2017-12-06

Gnu Awk - Part 8

Introduction

This is the eighth episode of the "Learning Awk" series that b-yeezi and I are doing.

Recap of the last episode

  • The while loop: tests a condition and performs commands while the test returns true

  • The do while loop: performs commands after the do, then tests afterwards, repeating the commands while the test is true.

  • The for loop (type 1): initialises a variable, performs a test, and increments the variable all together, performing commands while the test is true.

  • The for loop (type 2): sets a variable to successive indices of an array, performing a collection of commands for each index.

These types of loops were demonstrated by examples in the last episode.

Note that the example for 'do while' was an infinite loop (perhaps as a test of the alertness of the audience!):

#!/usr/bin/awk -f
BEGIN {

    i=2;
    do {
        print "The square of ", i, " is ", i*i;
        i = i + 1
    }
    while (i != 2)

exit;
}

The condition in the while is always true:

The square of  2  is  4
The square of  3  is  9
The square of  4  is  16
The square of  5  is  25
The square of  6  is  36
The square of  7  is  49
The square of  8  is  64
The square of  9  is  81
The square of  10  is  100
...
The square of  1269630  is  1611960336900
The square of  1269631  is  1611962876161
The square of  1269632  is  1611965415424
The square of  1269633  is  1611967954689
The square of  1269634  is  1611970493956
...

The variable i is set to 2, the print is executed, then i is set to 3. The test "i != 2" is true and will be ad infinitum.
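
One way to make this loop terminate is to test against an upper bound instead of the starting value. This is only a sketch: the bound of 4 is an assumption chosen to keep the output short, and the extra spaces in the print strings have been dropped.

```shell
out=$(awk 'BEGIN {
    i = 2
    do {
        print "The square of", i, "is", i*i
        i = i + 1
    } while (i <= 4)    # assumed upper bound; the body still runs at least once
}')
echo "$out"
```

This prints the squares of 2, 3 and 4 and then stops, because the test becomes false once i reaches 5.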

Some more statements

We will come back to loops later in this episode, but first this seems like a good point to describe another statement: the switch statement.

Long notes

The notes for the rest of this episode are available here.


Awk Part 7 - b-yeezi | 2017-07-07

In this episode, I will (very) briefly go over loops in the Awk programming language. Loops are useful when you want to run the same command(s) on a collection of data or when you just want to repeat the same commands many times.

When using loops, a command or group of commands is repeated until a condition (or many) is met.

While Loop

Here is a silly example of a while loop:

#!/bin/awk -f
BEGIN {

# Print the squares from 1 to 10 the first way

    i=1;
    while (i <= 10) {
        print "The square of ", i, " is ", i*i;
        i = i+1;
    }

exit;
}

Our condition is set in the parentheses after the while statement. We set a variable, i, before entering the loop, then increment i inside of the loop. If you forget to provide a way for the condition to become false, the while loop will go on forever.

Do While Loop

Here is an equally silly example of a do while loop:

#!/bin/awk -f
BEGIN {

    i=2;
    do {
        print "The square of ", i, " is ", i*i;
        i = i + 1
    }

    while (i != 2)

exit;
}

Here, the commands in the do code block are executed at the start, then the looping begins.

For Loop

Another silly example of a for loop:

#!/bin/awk -f
BEGIN {

    for (i=1; i <= 10; i++) {
        print "The square of ", i, " is ", i*i;
    }

exit;
}

As you can see, we set the variable, set the condition and set the increment method all in the parentheses after the for statement.

For Loop Over Arrays

Here is a more useful example of a for loop. Here, we are adding the different values of column 2 into an array/hash-table called a. After processing the file, we print the different values.

For file.txt:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

Using the awk file of:

NR != 1 {
    a[$2]++
}
END {
    for (b in a) {
        print b
    }
}

We get the results of:

brown
purple
red
yellow
green

In another example, we do a similar process. This time, not only do we store all the distinct values of the second column, we perform a sum operation on column 3 for each distinct value of column 2.

For file.csv:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

Using the awk file of:

BEGIN {
    FS=",";
    OFS=",";
    print "color,sum";
}
NR != 1 {
    a[$2]+=$3;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

We get the results of:

color,sum
brown,13
purple,12
red,7
yellow,11
green,8

As you can see, we are also printing a header row prior to processing the file using the BEGIN code block.
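
The order in which "for (b in a)" visits the array is not specified, so the colors may come out in a different order on another system. If a stable order matters, one simple option is to pipe the output through sort. The sketch below uses a shortened copy of the csv data and leaves out the header line so that it does not get sorted in with the data.

```shell
tmp=$(mktemp)
printf 'name,color,amount\napple,red,4\nbanana,yellow,6\nstrawberry,red,3\ngrape,purple,10\n' > "$tmp"

# Sum column 3 per color, then sort the result lines
out=$(awk -F, -v OFS=, '
    NR != 1 { a[$2] += $3 }
    END     { for (b in a) print b, a[b] }
' "$tmp" | LC_ALL=C sort)

echo "$out"
rm -f "$tmp"
```

For this shortened data the result is purple,10 then red,7 then yellow,6, whatever order awk used internally.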


Gnu Awk - Part 6 - Dave Morriss | 2017-03-01

Gnu Awk - Part 6

Introduction

This is the sixth episode of the “Learning Awk” series that b-yeezi and I are doing.

Recap of the last episode

Regular expressions

In the last episode we saw regular expressions in the ‘pattern’ part of a ‘pattern {action}’ sequence. Such a sequence is called a ‘RULE’, (as we have seen in earlier episodes).

$1 ~ /p[elu]/ {print $0}

Meaning: If field 1 contains a ‘p’ followed by one of ‘e’, ‘l’ or ‘u’ print the whole line.

$2 ~ /e{2}/ {print $0}

Meaning: If field 2 contains two instances of letter ‘e’ in sequence, print the whole line.

It is usual to enclose the regular expression in slashes, which make it a regexp constant.

We had a look at many of the operators used in regular expressions in episode 5. Unfortunately, some small errors crept into the list of operators mentioned in that episode. These are incorrect:

  • \A (beginning of a string)
  • \z (end of a string)
  • \b (on a word boundary)

The first two operators exist, in languages like Perl and Ruby, but not in GNU Awk.

For the ‘\b’ sequence the GNU manual says:

In other GNU software, the word-boundary operator is ‘\b’. However, that conflicts with the awk language’s definition of ‘\b’ as backspace, so gawk uses a different letter. An alternative method would have been to require two backslashes in the GNU operators, but this was deemed too confusing. The current method of using ‘\y’ for the GNU ‘\b’ appears to be the lesser of two evils.

The corrected list of operators is discussed later in this episode.

Replacement

Last episode we saw the built-in functions that use regular expressions for manipulating strings. These are sub, gsub and gensub. Regular expressions are used in other functions but we will look at them later.

We will be looking at sub, gsub and gensub in more detail in this episode.

Long notes

I have written out a set of longer notes for this episode and these are available here.


Gnu Awk - Part 5 - b-yeezi | 2016-12-15

GNU AWK - Part 5

Regular Expressions in AWK

The syntax for using regular expressions to match lines in AWK is as follows:

word ~ /match/

Or for not matching, use the following:

word !~ /match/

Remember the following file from the previous episodes:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

We can run the following command:

$1 ~ /p[elu]/ {print $0}

We will get the following output:

apple      red    4
grape      purple 10
apple      green  8
plum       purple 2
pineapple  yellow 5

In another example:

$2 ~ /e{2}/ {print $0}

Will produce the output:

apple      green  8

Regular expression basics

Certain characters have special meaning when using regular expressions.

Anchors

  • ^ - beginning of the line
  • $ - end of the line
  • \A - beginning of a string
  • \z - end of a string
  • \b - on a word boundary

Characters

  • [ad] - a or d
  • [a-d] - any character a through d
  • [^a-d] - not any character a through d
  • \w - any word character (letter, digit or underscore)
  • \s - any white-space character
  • \d - any digit

The capital version of w, s, and d are negations.

Or, you can reference characters the POSIX standard way:

  • [:alnum:] - Alphanumeric characters
  • [:alpha:] - Alphabetic characters
  • [:blank:] - Space and TAB characters
  • [:cntrl:] - Control characters
  • [:digit:] - Numeric characters
  • [:graph:] - Characters that are both printable and visible (a space is printable but not visible, whereas an ‘a’ is both)
  • [:lower:] - Lowercase alphabetic characters
  • [:print:] - Printable characters (characters that are not control characters)
  • [:punct:] - Punctuation characters (characters that are not letters, digits, control characters, or space characters)
  • [:space:] - Space characters (such as space, TAB, and formfeed, to name a few)
  • [:upper:] - Uppercase alphabetic characters
  • [:xdigit:] - Characters that are hexadecimal digits
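
One detail worth illustrating: these class names are only recognised inside a bracket expression, so a digit is matched with [[:digit:]] rather than with [:digit:] on its own. A minimal sketch, using three made-up input lines:

```shell
# Print the first field of lines whose second field consists only of digits
out=$(printf 'apple 4\nkiwi x\ngrape 10\n' |
    awk '$2 ~ /^[[:digit:]]+$/ { print $1 }')
echo "$out"
```

This prints apple and grape; the kiwi line is skipped because its second field is not numeric.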

Quantifiers

  • . - match any character
  • + - match preceding one or more times
  • * - match preceding zero or more times
  • ? - match preceding zero or one time
  • {n} - match preceding exactly n times
  • {n,} - match preceding n or more times
  • {n,m} - match preceding between n and m times

Grouped Matches

  • (...) - Parentheses are used for grouping
  • | - Means or in the context of a grouped match

Replacement

  • The sub command substitutes the match with the replacement string. This only applies to the first match.
  • The gsub command substitutes all matching items.
  • The gensub command substitutes in a similar way to sub and gsub, but with extra functionality
  • The & character in the replacement field references the matched text. You have to use \& to replace the match with the literal & character.

Example:

{ sub(/apple/, "nut", $1);
    print $1}

The output is:

name
nut
banana
strawberry
grape
nut
plum
kiwi
potato
pinenut

Another example:

{ sub(/.+(pp|rr)/, "test-&", $1);
    print $1}

This produces the following output:

name
test-apple
banana
test-strawberry
grape
test-apple
plum
kiwi
potato
test-pineapple
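
To compare sub, gsub and the two uses of & side by side, here is a small sketch on a single made-up line. Note that inside a quoted awk string the backslash itself has to be escaped, so the literal ampersand is written as "\\&". The variables a, b and c are copies of the line, because sub and gsub change their target in place.

```shell
out=$(echo "apple pineapple" | awk '{
    a = $0; sub(/apple/, "[&]", a)     # first match only; & stands for the matched text
    b = $0; gsub(/apple/, "[&]", b)    # every match
    c = $0; sub(/apple/, "\\&", c)     # escaped, so a literal & is inserted
    print a; print b; print c
}')
echo "$out"
```

The three lines printed are "[apple] pineapple", "[apple] pine[apple]" and "& pineapple".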

Resources


Gnu Awk - Part 4 - Dave Morriss | 2016-11-16

Gnu Awk - Part 4

Introduction

This is the fourth episode of the series that b-yeezi and I are doing. These shows are now collected under the series title “Learning Awk”.

Recap of the last episode

Logical Operators

We have seen the operators ‘&&’ (and) and ‘||’ (or). These are also called Boolean Operators. There is also one more operator ‘!’ (not) which we haven’t yet encountered. These operators allow the construction of Boolean expressions which may be quite complex.

If you are used to programming you will expect these operators to have a precedence, just like operators in arithmetic do. We will deal with this subject in more detail later since it is relevant not only in patterns but also in other parts of an Awk program.
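
As a small taste of the ‘!’ operator, the sketch below selects records whose color is not purple and whose amount is less than 5. The four input lines are a made-up subset of the example file, and the parentheses are there to make the grouping explicit.

```shell
out=$(printf 'apple red 4\nplum purple 2\ngrape purple 10\nkiwi brown 4\n' |
    awk '!($2 == "purple") && $3 < 5 { print $1 }')
echo "$out"
```

Only apple and kiwi are printed: plum has a small amount but is purple, and grape fails both tests.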

The next statement

We saw this statement in the last episode and learned that it causes the processing of the current input record to stop. No more patterns are tested against this record and no more actions in the current rule are executed. Note that “next” is a statement like “print”, and can only occur in the action part of a rule. It is also not permitted in BEGIN or END rules (more of which anon).

The BEGIN and END rules

The BEGIN and END elements are special patterns, which in conjunction with actions enclosed in curly brackets make up rules in the same sense that the ‘pattern {action}’ sequences we have seen so far are rules. As we saw in the last episode, BEGIN rules are run before the main ‘pattern {action}’ rules are processed and the input file is (or files are) read, whereas END rules run after the input files have been processed.

It is permitted to write more than one BEGIN rule and more than one END rule. These are just concatenated together in the order they are encountered by Awk.
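
A minimal sketch of this behaviour, with the rules deliberately interleaved. /dev/null is used as the input so that there are no records to process.

```shell
out=$(awk '
    BEGIN { print "first BEGIN" }
    END   { print "first END" }
    BEGIN { print "second BEGIN" }
    END   { print "second END" }
' /dev/null)
echo "$out"
```

Both BEGIN rules run first, in the order written, followed by both END rules.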

Awk will complain if either BEGIN or END is not followed by an action since this is meaningless.

Variables, arrays, loops, etc

Learning a programming language is never a linear process, and sometimes reference is made to new features that have not yet been explained. A number of new features were mentioned in passing in the last episode, and we will look at these in more detail in this episode.

Long notes

I have written out a moderately long set of notes for this episode and these are available here http://hackerpublicradio.org/eps/hpr2163/full_shownotes.html.

With a view to making portable notes for this series I have included ePub and PDF versions with this episode. Feedback is welcome to help decide which version is preferable, as are any suggestions on the improvement of the layout.


Gnu Awk - Part 3 - b-yeezi | 2016-10-19

Awk Part 3

Remember our file:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

Replace Grep

As we saw in earlier episodes, we can use awk to filter for rows that match a pattern or text. If you know the grep command, you know that it performs the same function, but has extended capabilities. For simple filters, you don't need to pipe grep outputs to awk. You can just filter in awk.

Logical Operators

You can use logical operators "and" and "or" represented as "&&" and "||", respectively. See example:

$2 == "purple" && $3 < 5 {print $1}

Here, we are selecting for color equal to "purple" AND amount less than 5.

Next command

Say we want to flag every record in our file where the amount is greater than or equal to 8 with a '**'. Every record between 5 (inclusive) and 8, we want to flag with a '*'. We can use consecutive filter commands, but their effects will be additive. To remedy this, we can use the "next" command. This tells awk that after the action is taken, it should proceed to the next record. See the following example:

NR == 1 {
  print $0;
  next;
}

$3 >= 8 {
  printf "%s\t%s\n", $0, "**";
  next;
}

$3 >= 5 {
  printf "%s\t%s\n", $0, "*";
  next;
}

$3 < 5 {
  print $0;
}
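
Here is a sketch of the same rules run on a shortened copy of the file (the header and three records), written as a one-liner. The last rule is given no pattern, which is equivalent here because every record with an amount of 5 or more has already been handled by a rule ending in next.

```shell
out=$(printf 'name color amount\napple red 4\nbanana yellow 6\ngrape purple 10\n' |
    awk 'NR == 1 { print; next }
         $3 >= 8 { printf "%s\t%s\n", $0, "**"; next }
         $3 >= 5 { printf "%s\t%s\n", $0, "*";  next }
                 { print }')
echo "$out"
```

The grape line gets a tab and "**", the banana line a tab and "*", and the header and apple lines are printed unchanged. Without the next statements the grape line would match both numeric rules and be printed twice.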

End Command

The "BEGIN" and "END" commands allow you to perform actions before awk reads any input and after it has processed all of it. For instance, sometimes we want to evaluate all records, then print the cumulative results. In this example, we pipe the output of the df command into awk. Our command is:

df -l | awk -f end.awk

Our awk file looks like this:

$1 != "tmpfs" {
    used += $3;
    available += $4;
}

END {
    printf "%d GiB used\n%d GiB available\n", used/2^20, available/2^20;
}

Here, we are setting two variables, "used" and "available". We add the records in the respective columns all together, then we print the totals.

In the next example, we create a distinct list of colors from our file:

NR != 1 {
    a[$2]++
}
END {
    for (b in a) {
        print b
    }
}

This is a more advanced script. We will get into the details of it in future episodes.

BEGIN command

As stated above, the BEGIN command lets us print and set variables before awk starts reading its input. For instance, we can set the input and output field separators inside our awk file as follows:

BEGIN {
    FS=",";
    OFS=",";
    print "color,count";
}
NR != 1 {
    a[$2]+=1;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

In this example, we are finding the distinct count of colors in our csv file, and formatting the output in csv format as well. We will get into the details of how this script works in future episodes.

For another example, instead of distinct count, we can get the sum of the amount column grouped by color:

BEGIN {
    FS=",";
    OFS=",";
    print "color,sum";
}
NR != 1 {
    a[$2]+=$3;
}
END {
    for (b in a) {
        print b, a[b]
    }
}

Gnu Awk - Part 2 - Dave Morriss | 2016-09-29

Gnu Awk - Part 2

This is the second episode in a series where b-yeezi and I will be looking at the AWK language (more particularly its GNU variant gawk). It is a comprehensive interpreted scripting language designed to be used for manipulating text.

I have written out a moderately long set of notes for this episode and these are available here http://hackerpublicradio.org/eps/hpr2129/full_shownotes.html.


Gnu Awk - Part 1 - b-yeezi | 2016-09-08

Introduction to Awk

Awk is a powerful text parsing tool for unix and unix-like systems.

The basic syntax is:

awk [options] 'pattern {action}' file

Here is a simple example file that we will be using, called file1.txt:

name       color  amount
apple      red    4
banana     yellow 6
strawberry red    3
grape      purple 10
apple      green  8
plum       purple 2
kiwi       brown  4
potato     brown  9
pineapple  yellow 5

First command:

awk '{print $2}' file1.txt

As you can see, the “print” command will display whatever follows it. In this case we are showing the second column using “$2”. This is intuitive. To display all columns, use “$0”.

This example will output:

color
red
yellow
red
purple
green
purple
brown
brown
yellow

Second command:

awk '$2=="yellow"{print $1}' file1.txt

This will output:

banana
pineapple

As you can see, the command selects the rows where column 2 equals “yellow”, but prints column 1.

Field separator

By default, awk uses white space as the field separator. You can change this by using the -F option. For instance, file1.csv looks like this:

name,color,amount
apple,red,4
banana,yellow,6
strawberry,red,3
grape,purple,10
apple,green,8
plum,purple,2
kiwi,brown,4
potato,brown,9
pineapple,yellow,5

A similar command as before:

awk -F"," '$2=="yellow" {print $1}' file1.csv

will still output:

banana
pineapple

Regular expressions work as well:

awk '$2 ~ /p.+p/ {print $0}' file1.txt

This returns:

grape      purple 10
plum       purple 2

Numbers are interpreted automatically:

awk '$3>5 {print $1, $2}' file1.txt

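Will output (the header line is included because the non-numeric string “amount” is compared with 5 as a string, and it sorts after “5”):

name color
banana yellow
grape purple
apple green
potato brown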

Using output redirection, you can write your results to file. For example:

awk -F, '$3>5 {print $1, $2}' file1.csv > output.txt

This writes the results of the query to the file output.txt.

Here’s a cool trick! You can automatically split a file into multiple files grouped by column. For example, if I want to split file1.txt into multiple files by color, here is the command.

awk '{print > ($2 ".txt")}' file1.txt

This will produce files named yellow.txt, red.txt, etc. In upcoming episodes, we will show how to improve the outputs.
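
One possible improvement, shown here only as a sketch, is to skip the header line so that no file is created for the word “color”. The example works in a temporary directory with a shortened copy of file1.txt, and the output file name is put in parentheses to avoid any ambiguity in how the redirection is parsed.

```shell
dir=$(mktemp -d)
printf 'name color amount\napple red 4\nbanana yellow 6\nstrawberry red 3\n' > "$dir/file1.txt"

# NR > 1 skips the header; each record goes to a file named after its color
( cd "$dir" && awk 'NR > 1 { print > ($2 ".txt") }' file1.txt )

files=$(cd "$dir" && echo *.txt)
red=$(cat "$dir/red.txt")
echo "$files"
echo "$red"
rm -rf "$dir"
```

After the run the directory holds red.txt and yellow.txt next to the input file, and red.txt contains the apple and strawberry records.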

Resources

  1. http://www.theunixschool.com/p/awk-sed.html
  2. http://www.tecmint.com/category/awk-command/
  3. http://linux.die.net/man/1/awk

Coming up

  • More options
  • Built-in Variables
  • Arithmetic operations
  • Awk language and syntax