Slice and Dice Text

Chapter 15 - Slice and Dice Text

In Chapter 14 we looked at how to use the grep command to search through text and filter text. In this chapter we're going to look at some of the basic commands which we can use to manipulate text. There are a whole raft of commands and options available.

We'll start with the basics and move onto some of the more sophisticated commands in the next chapter.

Heads and Tails

The commands head and tail are very simple but incredibly useful.

head is used to extract part of the top of a file and tail is used to extract part of the end of a file. Once you starting using these commands you'll find yourself using them regularly.

Let's start with head. Imagine we have a data file which has been sent to us, we don't know exactly what is in it, but we know it is large. How can we take a quick look?

$ head ~/effective-shell/data/top100.csv

"Rank","Rating","Title","Reviews"
"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"
"3","93","Us (2019)","536"
"4","97","Toy Story 4 (2019)","445"
"5","99","Lady Bird (2017)","393"
"6","100","Citizen Kane (1941)","94"
"7","97","Mission: Impossible - Fallout (2018)","430"
"8","98","The Wizard of Oz (1939)","120"
"9","96","The Irishman (2019)","441"

The head command just shows the first ten lines of a file. Here we can see that this is a comma separated values file which seems to be a list of movies. This file is actually a list of the top 100 films on ‘Rotten Tomatoes’ at the time of writing, with the score, tomato meter, name and number of votes. We'll use it a lot in this chapter to demonstrate text manipulation.

You can use the -n flag to specify the number of lines you want to see, for example:

$ head -n 3 ~/effective-shell/data/top100.csv

"Rank","Rating","Title","Reviews"
"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"

The tail command works in the same way - but looks at the end of a file. This is more useful when you are looking content which changes over time, like log files. In this case you probably want to see only the most recent entries.

Here's how we can see the ten most recent commands we entered in our shell:

$ tail $HISTFILE

: 1606818280:0;ls
: 1606818300:0;ln -s $(pwd) ~/effective-shell
: 1606818308:0;cat ~/effective-shell/data/top100.csv
: 1606818342:0;head -n 3 ~/effective-shell/data/top100.csv
: 1606819062:0;head ~/effective-shell/data/top100.csv
: 1606819647:0;gcd
: 1606819649:0;git stash
: 1606819650:0;gcd
: 1606819662:0;git stash pop
: 1606819803:0;tail $HISTFILE

What is $HISTFILE?

Most Bash-like shells keep a file called the history file. This is essentially a record of all of the commands which have been written in the shell. The history command can be used to show the contents of this file. But if we want to work with the file directly, we can find its location with the special variable called $HISTFILE.

Enter help history for more information on the shell history.

We can be more specific, just like with head, by specifying the number of lines to show:

$ tail -n 3 $HISTFILE

: 1606819650:0;gcd
: 1606819662:0;git stash pop
: 1606819803:0;tail $HISTFILE

tail can also be used to show the changes to a file in real time. Add the -f flag to follow the contents of the file - this means the tail command show each new line as it gets added to the file.

To try it out, run the following command in one shell:

$ tail -f $HISTFILE

In another terminal window, start entering commands. You'll see that the tail command in the first window is writing the updates to the terminal as they are entered in the file. Press Ctrl+C to close the tail program.

Another trick I use a lot with tail is to use -n +2. This shows everything from the second line - the + symbol indicates we show everything from the given line onwards. This makes it easy to strip the header, or first line, from content. Here's how you might use it:

$ head ~/effective-shell/data/top100.csv | tail -n +2

"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"
"3","93","Us (2019)","536"
"4","97","Toy Story 4 (2019)","445"
"5","99","Lady Bird (2017)","393"
"6","100","Citizen Kane (1941)","94"
"7","97","Mission: Impossible - Fallout (2018)","430"
"8","98","The Wizard of Oz (1939)","120"
"9","96","The Irishman (2019)","441"

Here I've taken the head of the file (otherwise the output gets quite difficult to follow), then piped the results into tail -n +2 to grab everything from the second line onwards - which removes the heading line. We see the films only, not the titles of each column.

We're going to use head and tail quite a lot when working with text. These are two crucial tools which can really speed up your work.

Replacing Text

The next tool we'll look at is tr (translate characters). This program is very simple. My most common use for tr is to perform a simple substitution of characters.

Let's create a list of each of the columns in the data file we saw before to show how the command works:

$ head -n 1 ~/effective-shell/data/top100.csv | tr ',' '\n'

"Rank"
"Rating"
"Title"
"Reviews"

What about if we wanted to remove the quotes?

$ head -n 1 ~/effective-shell/data/top100.csv | tr ',' '\n' | tr -d '"'

Rank
Rating
Title
Reviews

Here we've seen two variations on how we can run the command. The first form is used to replace characters. Running:

tr ',' '\n'

Replaces the first specified character with the second. The \n character is the special newline character, which is used to create a line break at the end of a file.

The second form uses the -d flag to specify a set of characters to delete:

tr -d '"'

In the form above we delete quote (") characters.

When using tr remember that it works on characters. For example, the following might not work as you expect:

$ echo "Welcome to the shell" | tr 'shell' 'machine'

Wcicomc to tac macii

The reason the output is like this is that we're specifying character replacements - so we're changing characters as shown below:

s -> m
h -> a
e -> c
l -> h
l -> i

There are plenty of ways to replace entire words or perform more complex operations, but we'll use sed or awk for these operations - which we'll see in the following chapter.

There is one final thing it is worth mentioning about tr. It can be provided with character classes. This is easiest to explain with an example:

$ echo "Use your inside voice..." | tr '[[:lower:]]' '[[:upper:]]'

USE YOUR INSIDE VOICE...

In this case we are transforming characters in the lower class (lowercase characters) to the upper class (uppercase characters).

On Linux systems you can find more about character classes with man 7 regex. I am not going to go deeper into character classes at this stage. They provide a simple way to specify things like digits, alphabetic characters and so on, but there are other ways to do this (with extended regexes) which I think are likely to be more useful to learn about instead.

How to Cut

The next command is one which I've used far more than I expected. The cut command splits a line of text, using a given delimiter. Let's see some examples:

$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | head

"Title"
"Black Panther (2018)"
"Avengers: Endgame (2019)"
"Us (2019)"
"Toy Story 4 (2019)"
"Lady Bird (2017)"
"Citizen Kane (1941)"
"Mission: Impossible - Fallout (2018)"
"The Wizard of Oz (1939)"
"The Irishman (2019)"

This is the first way to use cut. We specify the -d flag to choose a delimiter which we will cut the text with, then -f to choose which field we want to see. In this case we show split on the command character and show the third field - the title of the film in the data file.

This can be extraordinarily useful. Let's see how to get the names of the Kubernetes pods I have running on a cluster. I can use the following command to get the pods:

$ kubectl get pods

NAME                                  READY   STATUS    RESTARTS   AGE
elastic-operator-0                    1/1     Running   0          35d
elk-apm-server-65b698fb8c-rzncz       1/1     Running   0          13d
elk-es-default-0                      1/1     Running   0          35d
elk-kb-6f8bb6457b-bbbnn               1/1     Running   0          35d
filebeat-beat-filebeat-ccgl7          1/1     Running   1          13d
filebeat-beat-filebeat-dvf2l          1/1     Running   2          13d
filebeat-beat-filebeat-mnpms          1/1     Running   329        13d
kube-state-metrics-5cb57bdc45-mqv9d   1/1     Running   0          35d
metricbeat-beat-metricbeat-2xm7t      1/1     Running   6103       35d
metricbeat-beat-metricbeat-96dkt      1/1     Running   6097       35d
metricbeat-beat-metricbeat-n7kxm      1/1     Running   6109       35d

Now to get the name I can just cut the lines on the ‘space’ character and grab the first field:

$ kubectl get pods | cut -d' ' -f 1

NAME
elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm

And if we want to strip the first line? We can use the tail -n +2 command to tail everything from the second line onwards:

$ kubectl get pods | cut -d' ' -f 1 | tail -n +2

elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm

Bingo - we've removed the heading line. If you remember grep from the previous chapter, you might have spotted that we could also just filter the content:

$ kubectl get pods | cut -d' ' -f 1 | grep -v NAME

elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm

With even just a few simple shell commands there are often many ways to accomplish the same goal!

There is another way we can cut text. We can cut by slicing a number of characters from each line.

Let's take a look at our web logs file:

$ tail ~/effective-shell/logs/web-server-logs.txt

2020-11-29T12:50:52.721Z: info - Request: GET /en.search.min.1f83b222e24a227c0f5763727cb9e4f3b435f08b936f6ce529c9c9359f6b61a8.js
2020-11-29T12:50:52.722Z: info - Serving file '../../../website/public/en.search.min.1f83b222e24a227c0f5763727cb9e4f3b435f08b936f6ce529c9c9359f6b61a8.js'...
2020-11-29T12:50:52.762Z: info - Request: GET /svg/menu.svg
2020-11-29T12:50:52.763Z: info - Serving file '../../../website/public/svg/menu.svg'...
2020-11-29T12:50:52.763Z: info - Request: GET /svg/calendar.svg
2020-11-29T12:50:52.764Z: info - Serving file '../../../website/public/svg/calendar.svg'...
2020-11-29T12:50:52.765Z: info - Request: GET /svg/edit.svg
2020-11-29T12:50:52.766Z: info - Serving file '../../../website/public/svg/edit.svg'...
2020-11-29T12:50:52.784Z: info - Request: GET /fonts/roboto-v19-latin-300italic.woff2
2020-11-29T12:50:52.785Z: info - Serving file '../../../website/public/fonts/roboto-v19-latin-300italic.woff2'...

We can use the -c (characters) flag to specify the characters in the line we want to see. Let's extract the timestamp only:

$ tail -n 3 ~/effective-shell/logs/web-server-logs.txt | cut -c 12-19

12:50:52
12:50:52
12:50:52

We can also use the character option to extract everything from a specific point onwards:

$ tail -n 3 ~/effective-shell/logs/web-server-logs.txt | cut -c 27-

info - Serving file '../../../website/public/svg/edit.svg'...
info - Request: GET /fonts/roboto-v19-latin-300italic.woff2
info - Serving file '../../../website/public/fonts/roboto-v19-latin-300italic.woff2'...

By cutting from the 27th character onwards (-c 27-) we remove the timestamp and just get the log message.

As a nice trick you can use the same syntax when splitting by fields:

$ tail -n 3 ~/effective-shell/data/top100.csv | cut -d',' -f 3-

"Pinocchio (1940)","55"
"Chinatown (1974)","75"
"The Dark Knight (2008)","342"

This is field three onwards. If we just want fields two and three, we use:

$ tail -n 3 ~/effective-shell/data/top100.csv | cut -d',' -f 2,3

"100","Pinocchio (1940)"
"99","Chinatown (1974)"
"94","The Dark Knight (2008)"

There's a surprising amount you can do with the cut tool. As we introduce more complex tools later on, like sed and awk, we'll see other ways to accomplish the same goals, but I often find that by filtering down the content with grep first I can cut my way to what I need without having to use more complex tools.

A Trick with Rev

There is a very simple command called rev which reverses the given input. For example:

$ echo "A nut for a jar of tuna" | rev

anut fo raj a rof tun A

At first glance this doesn't seem very useful - but there's a nice trick we can do with this:

$ pwd | rev | cut -d\ -f 1 | rev

effective-shell

Here we take the current working directory, reverse it, cut the first field, then reverse it again. Here's what's happening at each stage:

pwd              /Users/dwmkerr/effective-shell
rev              llehs-evitceffe/rrekmwd/sresU/
cut -d'/' -f 1   llehs-evitceffe
rev              effective-shell

This is a neat trick to rip all of the text from the final occurrence of a character. You might not use it very often but it's an interesting reminder that you can often do more than you think by chaining together simple commands into a pipeline!

Sort and Unique

Two other commands which can be really helpful are sort and uniq. Let's see sort first:

$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort | head

"12 Years a Slave (2013)"
"A Hard Day's Night (1964)"
"A Night at the Opera (1935)"
"A Quiet Place (2018)"
"A Star Is Born (2018)"
"Alien (1979)"
"All About Eve (1950)"
"Argo (2012)"
"Arrival (2016)"
"Avengers: Endgame (2019)"

Here we've grabbed the third field in our data file (the name of the film), sorted, then shown the first ten values.

You can reverse the direction of sort with the -r flag:

$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort -r | head

"Zootopia (2016)"
"Wonder Woman (2017)"
"Won't You Be My Neighbor? (2018)"
"Widows (2018)"
"War for the Planet of the Apes (2017)"
"Us (2019)"
"Up (2009)"
"Toy Story 4 (2019)"
"Toy Story 3 (2010)"
"Toy Story 2 (1999)"

There are actually quite a few other options for sort, you can see them with man sort. However, most of them perform functionality which you can get from other tools (such as making the lines unique, which we can do with uniq). You might find some of them useful so don't be shy to explore some of the other options.

The uniq command removes duplicate lines from a stream of text. Note that this only removes duplicate lines when they are next to each other. This means that you will often have to sort first.

Here's an example of where I might use uniq - getting all unique error messages in a log file:

$ cut -c 27- ~/effective-shell/logs/web-server-logs.txt | grep error | sort | uniq

error - Unhandled error EACCES trying to read '../../../website/public/docs/part-1-transitioning-to-the-shell/5-getting-help/index.html', returning a 500
error - Unhandled error EACCES trying to read '../../../website/public/svg/calendar.svg', returning a 500
error - Unhandled error EACCES trying to read '../../../website/public/svg/edit.svg', returning a 500
info - Request: GET /docs/1-getting-started/images/ls-applications-windows-error.png
info - Request: GET /docs/part-1-transitioning-to-the-shell/3-managing-your-files/images/rm-error-directory.png
info - Serving file '../../../website/public/docs/1-getting-started/images/ls-applications-windows-error.png'...
info - Serving file '../../../website/public/docs/part-1-transitioning-to-the-shell/3-managing-your-files/images/rm-error-directory.png'...

Let's break this down:

  • cut -c 27- ~/effective-shell/logs/web-server-logs.txt - extract log messages from a log file, skipping the timestamp
  • grep error - filter down to lines which contain the text error
  • sort - sort the output
  • uniq - show only unique values

This is a powerful technique - if we had thousands of errors in the file, this would make sure we only see distinct errors, rather than showing every error.

Don't Forget Your Pager!

In Chapter 5 - Getting Help we talked about the pager - the program your shell uses to make it easier to look through larger text files, giving the option to move backwards and forwards a page at a time (or searching and so on). Don't forget to use your pager when you are working with text. When you are trying to build a pipeline and want to see intermediate results (perhaps before you use head or tail) then you can use the pager to avoid filling your screen and terminal with too much text.

For example, when looking at the sorted list of films, I might run this:

$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort | less

"Jaws (1975)"
"King Kong (1933)"
"La Grande illusion (Grand Illusion) (1938)"
"La La Land (2016)"
"Lady Bird (2017)"
"Laura (1944)"

/Jaws

I've made the output smaller so that it is easier to see what is happening. In this example I've cut out the film name from my data file, sorted it, then piped the result into less so that I can page through the data and ensure it is correct - I've also searched for the text Jaws to see where it is in the file.

Summary

In this chapter we introduced a number of basic tools which let us work with text.

  • head will show the first ten lines of a file.
  • head -n 30 will show the first thirty lines of a file, using the -n flag to specify the number of lines.
  • tail will show the final ten lines of a file.
  • tail -n 3 uses the -n flag to specify three lines only.
  • The $HISTFILE environment variable holds the path to the shell command history file.
  • tail -f $HISTFILE uses the -f flag to follow the file, printing output as it is written to the file.
  • tr 'a' 'b' is the translate text command, which turns one set of characters into another
  • tr -d '!' shows how the -d or delete flag can specify characters to delete.
  • The cut command can be used to extract parts of a line of text.
  • cut -d',' -f 3 shows how the -d or delimiter flag is used to specify the delimiter to cut on and how the -f or field flag specifies which of the fields the text has been cut into is printed.
  • cut -c 2-4 uses the -c or characters flag to specify that we are extracting a subset of characters in the line, in this case characters two to four.
  • cut -c 10- cuts from character ten to the end of the line
  • The cut command also allows for multiple fields to be specified when cutting by field, such as -f 2,3 for the second and third field, or -f 4- for fields four onwards.
  • rev reverses text - by reversing, cutting and then re-reversing you can quickly extract text from the end of a line.
  • sort sorts the incoming text alphabetically.
  • The -r flag for sort reverses the sort order.
  • The uniq command removes duplicate lines - but only when they are next to each other, so you'll often use it in combination with sort.
  • Your pager, for example the less program can be useful when inspecting the output of your text transformation commands.
comments powered by Disqus

Copyright © , Dave Kerr