Chapter 15 - Slice and Dice Text
In Chapter 14 we looked at how to use the grep command to search through and filter text. In this chapter we're going to look at some of the basic commands we can use to manipulate text. There are a whole raft of commands and options available; we'll start with the basics and move on to some of the more sophisticated commands in the next chapter.
Heads and Tails
The head and tail commands are very simple but incredibly useful. head is used to extract part of the top of a file and tail is used to extract part of the end of a file. Once you start using these commands you'll find yourself using them regularly.
Let's start with head. Imagine we have a data file which has been sent to us; we don't know exactly what is in it, but we know it is large. How can we take a quick look?
$ head ~/effective-shell/data/top100.csv
"Rank","Rating","Title","Reviews"
"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"
"3","93","Us (2019)","536"
"4","97","Toy Story 4 (2019)","445"
"5","99","Lady Bird (2017)","393"
"6","100","Citizen Kane (1941)","94"
"7","97","Mission: Impossible - Fallout (2018)","430"
"8","98","The Wizard of Oz (1939)","120"
"9","96","The Irishman (2019)","441"
The head command shows the first ten lines of a file by default. Here we can see that this is a comma separated values file which seems to be a list of movies. This file is actually a list of the top 100 films on 'Rotten Tomatoes' at the time of writing, with the rank, rating, title and number of reviews. We'll use it a lot in this chapter to demonstrate text manipulation.
You can use the -n flag to specify the number of lines you want to see, for example:
$ head -n 3 ~/effective-shell/data/top100.csv
"Rank","Rating","Title","Reviews"
"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"
The tail command works in the same way - but looks at the end of a file. This is most useful when you are looking at content which changes over time, like log files. In this case you probably want to see only the most recent entries.
Here's how we can see the ten most recent commands we entered in our shell:
$ tail $HISTFILE
: 1606818280:0;ls
: 1606818300:0;ln -s $(pwd) ~/effective-shell
: 1606818308:0;cat ~/effective-shell/data/top100.csv
: 1606818342:0;head -n 3 ~/effective-shell/data/top100.csv
: 1606819062:0;head ~/effective-shell/data/top100.csv
: 1606819647:0;gcd
: 1606819649:0;git stash
: 1606819650:0;gcd
: 1606819662:0;git stash pop
: 1606819803:0;tail $HISTFILE
What is $HISTFILE?
Most Bash-like shells keep a file called the history file. This is essentially a record of all of the commands which have been written in the shell. The history command can be used to show the contents of this file, but if we want to work with the file directly, we can find its location with the special variable called $HISTFILE. Run help history for more information on the shell history.
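If you're curious where your own shell keeps this file, you can simply print the variable (the typical paths below are illustrative - the exact location varies by shell and configuration):

```shell
# Print the location of the shell's history file.
# Typical values: ~/.bash_history for Bash, ~/.zsh_history for Z shell.
echo "$HISTFILE"
```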
We can be more specific, just like with head, by specifying the number of lines to show:
$ tail -n 3 $HISTFILE
: 1606819650:0;gcd
: 1606819662:0;git stash pop
: 1606819803:0;tail $HISTFILE
tail can also be used to show the changes to a file in real time. Add the -f flag to follow the contents of the file - this means the tail command shows each new line as it gets added to the file.
To try it out, run the following command in one shell:
$ tail -f $HISTFILE
In another terminal window, start entering commands. You'll see that the tail command in the first window writes the updates to the terminal as they are entered in the file. Press Ctrl+C to close the tail program.
Another trick I use a lot with tail is the -n +2 option. The + symbol indicates that we show everything from the given line onwards - in this case, from the second line. This makes it easy to strip the header, or first line, from content. Here's how you might use it:
$ head ~/effective-shell/data/top100.csv | tail -n +2
"1","97","Black Panther (2018)","515"
"2","94","Avengers: Endgame (2019)","531"
"3","93","Us (2019)","536"
"4","97","Toy Story 4 (2019)","445"
"5","99","Lady Bird (2017)","393"
"6","100","Citizen Kane (1941)","94"
"7","97","Mission: Impossible - Fallout (2018)","430"
"8","98","The Wizard of Oz (1939)","120"
"9","96","The Irishman (2019)","441"
Here I've taken the head of the file (otherwise the output gets quite difficult to follow), then piped the results into tail -n +2 to grab everything from the second line onwards - which removes the heading line. We see the films only, not the titles of each column.
We're going to use head and tail quite a lot when working with text. These are two crucial tools which can really speed up your work.
The Translate Command
The next tool we'll look at is tr (translate characters). This program is very simple. My most common use for tr is to perform a simple substitution of characters.
Let's create a list of each of the columns in the data file we saw before to show how the command works:
$ head -n 1 ~/effective-shell/data/top100.csv | tr ',' '\n'
"Rank"
"Rating"
"Title"
"Reviews"
What if we wanted to remove the quotes?
$ head -n 1 ~/effective-shell/data/top100.csv | tr ',' '\n' | tr -d '"'
Rank
Rating
Title
Reviews
Here we've seen two variations on how we can run the command. The first form is used to replace characters. Running:
tr ',' '\n'
replaces the first specified character with the second. The \n character is the special newline character, which is used to create a line break.
The second form uses the -d flag to specify a set of characters to delete:
tr -d '"'
In the form above we delete the double quote (") characters from the input. When using tr, remember that it works on characters, not words. For example, the following might not work as you expect:
$ echo "Welcome to the shell" | tr 'shell' 'machine'
Wcicomc to tac macii
The reason the output looks like this is that we're specifying character replacements - each character in the first set is mapped to the character at the same position in the second set:
s -> m
h -> a
e -> c
l -> h
l -> i
Note that l appears twice in the first set - the final mapping (l -> i) is the one which applies, which is why both l characters in "shell" become i.
There are plenty of ways to replace entire words or perform more complex operations, but we'll use awk for these operations - which we'll see in the following chapter.
There is one final thing worth mentioning about tr - it can be provided with character classes. This is easiest to explain with an example:
$ echo "Use your inside voice..." | tr '[:lower:]' '[:upper:]'
USE YOUR INSIDE VOICE...
In this case we are transforming characters in the lower class (lowercase characters) to the upper class (uppercase characters).
On Linux systems you can find out more about character classes with man 7 regex. I am not going to go deeper into character classes at this stage. They provide a simple way to specify things like digits, alphabetic characters and so on, but there are other ways to do this (with extended regexes) which I think are likely to be more useful to learn about instead.
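That said, character classes can still be handy for quick jobs. As a small sketch (the sample text here is just an illustration), the digit class combined with the -d flag strips numbers from input:

```shell
# Delete every digit character using the 'digit' character class.
echo "agent 007 reporting" | tr -d '[:digit:]'
# Prints: agent  reporting
```

Note the doubled space in the output - tr deleted the digits but left the surrounding spaces untouched.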
How to Cut
The next command is one which I've used far more than I expected. The cut command splits a line of text, using a given delimiter. Let's see some examples:
$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | head
"Title"
"Black Panther (2018)"
"Avengers: Endgame (2019)"
"Us (2019)"
"Toy Story 4 (2019)"
"Lady Bird (2017)"
"Citizen Kane (1941)"
"Mission: Impossible - Fallout (2018)"
"The Wizard of Oz (1939)"
"The Irishman (2019)"
This is the first way to use cut. We specify the -d flag to choose a delimiter to cut the text with, then -f to choose which field we want to see. In this case we split on the comma character and show the third field - the title of the film in the data file.
This can be extraordinarily useful. Let's see how to get the names of the Kubernetes pods I have running on a cluster. I can use the following command to get the pods:
$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
elastic-operator-0                    1/1     Running   0          35d
elk-apm-server-65b698fb8c-rzncz       1/1     Running   0          13d
elk-es-default-0                      1/1     Running   0          35d
elk-kb-6f8bb6457b-bbbnn               1/1     Running   0          35d
filebeat-beat-filebeat-ccgl7          1/1     Running   1          13d
filebeat-beat-filebeat-dvf2l          1/1     Running   2          13d
filebeat-beat-filebeat-mnpms          1/1     Running   329        13d
kube-state-metrics-5cb57bdc45-mqv9d   1/1     Running   0          35d
metricbeat-beat-metricbeat-2xm7t      1/1     Running   6103       35d
metricbeat-beat-metricbeat-96dkt      1/1     Running   6097       35d
metricbeat-beat-metricbeat-n7kxm      1/1     Running   6109       35d
Now to get the name I can just cut the lines on the space character and grab the first field:
$ kubectl get pods | cut -d' ' -f 1
NAME
elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm
And if we want to strip the first line? We can use the tail -n +2 command to take everything from the second line onwards:
$ kubectl get pods | cut -d' ' -f 1 | tail -n +2
elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm
Bingo - we've removed the heading line. If you remember grep from the previous chapter, you might have spotted that we could also just filter the content:
$ kubectl get pods | cut -d' ' -f 1 | grep -v NAME
elastic-operator-0
elk-apm-server-65b698fb8c-rzncz
elk-es-default-0
elk-kb-6f8bb6457b-bbbnn
filebeat-beat-filebeat-ccgl7
filebeat-beat-filebeat-dvf2l
filebeat-beat-filebeat-mnpms
kube-state-metrics-5cb57bdc45-mqv9d
metricbeat-beat-metricbeat-2xm7t
metricbeat-beat-metricbeat-96dkt
metricbeat-beat-metricbeat-n7kxm
With even just a few simple shell commands there are often many ways to accomplish the same goal!
There is another way we can cut text - we can cut by slicing a number of characters from each line.
Let's take a look at our web logs file:
$ tail ~/effective-shell/logs/web-server-logs.txt
2020-11-29T12:50:52.721Z: info - Request: GET /en.search.min.1f83b222e24a227c0f5763727cb9e4f3b435f08b936f6ce529c9c9359f6b61a8.js
2020-11-29T12:50:52.722Z: info - Serving file '../../../website/public/en.search.min.1f83b222e24a227c0f5763727cb9e4f3b435f08b936f6ce529c9c9359f6b61a8.js'...
2020-11-29T12:50:52.762Z: info - Request: GET /svg/menu.svg
2020-11-29T12:50:52.763Z: info - Serving file '../../../website/public/svg/menu.svg'...
2020-11-29T12:50:52.763Z: info - Request: GET /svg/calendar.svg
2020-11-29T12:50:52.764Z: info - Serving file '../../../website/public/svg/calendar.svg'...
2020-11-29T12:50:52.765Z: info - Request: GET /svg/edit.svg
2020-11-29T12:50:52.766Z: info - Serving file '../../../website/public/svg/edit.svg'...
2020-11-29T12:50:52.784Z: info - Request: GET /fonts/roboto-v19-latin-300italic.woff2
2020-11-29T12:50:52.785Z: info - Serving file '../../../website/public/fonts/roboto-v19-latin-300italic.woff2'...
We can use the -c (characters) flag to specify the characters in the line we want to see. Let's extract the timestamp only:
$ tail -n 3 ~/effective-shell/logs/web-server-logs.txt | cut -c 12-19
12:50:52
12:50:52
12:50:52
We can also use the character option to extract everything from a specific point onwards:
$ tail -n 3 ~/effective-shell/logs/web-server-logs.txt | cut -c 27-
info - Serving file '../../../website/public/svg/edit.svg'...
info - Request: GET /fonts/roboto-v19-latin-300italic.woff2
info - Serving file '../../../website/public/fonts/roboto-v19-latin-300italic.woff2'...
By cutting from the 27th character onwards (-c 27-) we remove the timestamp and just get the log message.
As a nice trick you can use the same syntax when splitting by fields:
$ tail -n 3 ~/effective-shell/data/top100.csv | cut -d',' -f 3-
"Pinocchio (1940)","55"
"Chinatown (1974)","75"
"The Dark Knight (2008)","342"
This is field three onwards. If we just want fields two and three, we use:
$ tail -n 3 ~/effective-shell/data/top100.csv | cut -d',' -f 2,3
"100","Pinocchio (1940)"
"99","Chinatown (1974)"
"94","The Dark Knight (2008)"
There's a surprising amount you can do with the cut tool. As we introduce more complex tools later on, like awk, we'll see other ways to accomplish the same goals, but I often find that by filtering down the content with grep first I can cut my way to what I need without having to use more complex tools.
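As a quick sketch of that grep-then-cut approach, here's how you might look up a single film's rating in the data file (this assumes you have the effective-shell samples downloaded as shown earlier):

```shell
# Filter down to one film with grep, then cut out the second field - the rating.
grep 'Citizen Kane' ~/effective-shell/data/top100.csv | cut -d',' -f 2
```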
A Trick with Rev
There is a very simple command called rev which reverses the given input. For example:
$ echo "A nut for a jar of tuna" | rev
anut fo raj a rof tun A
At first glance this doesn't seem very useful - but there's a nice trick we can do with this:
$ pwd | rev | cut -d'/' -f 1 | rev
effective-shell
Here we take the current working directory, reverse it, cut the first field, then reverse it again. Here's what's happening at each stage:
pwd              /Users/dwmkerr/effective-shell
rev              llehs-evitceffe/rrekmwd/sresU/
cut -d'/' -f 1   llehs-evitceffe
rev              effective-shell
This is a neat trick to grab all of the text after the final occurrence of a character. You might not use it very often, but it's an interesting reminder that you can often do more than you think by chaining together simple commands into a pipeline!
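The same rev-cut-rev pipeline works nicely for things like file extensions - here's a small example, where reversing the text makes the last dot-separated field become the first:

```shell
# Grab everything after the final dot - i.e. the file extension.
echo "archive.2020.tar.gz" | rev | cut -d'.' -f 1 | rev
# Prints: gz
```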
Sort and Unique
Two other commands which can be really helpful are sort and uniq. Let's see sort first:
$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort | head
"12 Years a Slave (2013)"
"A Hard Day's Night (1964)"
"A Night at the Opera (1935)"
"A Quiet Place (2018)"
"A Star Is Born (2018)"
"Alien (1979)"
"All About Eve (1950)"
"Argo (2012)"
"Arrival (2016)"
"Avengers: Endgame (2019)"
Here we've grabbed the third field in our data file (the name of the film), sorted, then shown the first ten values.
You can reverse the direction of sort with the -r flag:
$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort -r | head
"Zootopia (2016)"
"Wonder Woman (2017)"
"Won't You Be My Neighbor? (2018)"
"Widows (2018)"
"War for the Planet of the Apes (2017)"
"Us (2019)"
"Up (2009)"
"Toy Story 4 (2019)"
"Toy Story 3 (2010)"
"Toy Story 2 (1999)"
There are actually quite a few other options for sort; you can see them with man sort. However, most of them perform functionality which you can get from other tools (such as making the lines unique, which we can do with uniq). You might find some of them useful, so don't be shy to explore some of the other options.
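Two of those options you may find yourself reaching for are -u, which removes duplicate lines as part of the sort, and -n, which compares numbers by value rather than character by character (the sample input here is just an illustration):

```shell
# 'sort -u' sorts and de-duplicates in a single step.
printf 'banana\napple\nbanana\n' | sort -u
# apple
# banana

# 'sort -n' sorts numerically - a plain sort would put '100' before '9'.
printf '10\n9\n100\n' | sort -n
# 9
# 10
# 100
```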
The uniq command removes duplicate lines from a stream of text. Note that it only removes duplicate lines when they are next to each other - this means you will often have to sort your input first.
Here's an example of where I might use
uniq - getting all unique error messages in a log file:
$ cut -c 27- ~/effective-shell/logs/web-server-logs.txt | grep error | sort | uniq
error - Unhandled error EACCES trying to read '../../../website/public/docs/part-1-transitioning-to-the-shell/5-getting-help/index.html', returning a 500
error - Unhandled error EACCES trying to read '../../../website/public/svg/calendar.svg', returning a 500
error - Unhandled error EACCES trying to read '../../../website/public/svg/edit.svg', returning a 500
info - Request: GET /docs/1-getting-started/images/ls-applications-windows-error.png
info - Request: GET /docs/part-1-transitioning-to-the-shell/3-managing-your-files/images/rm-error-directory.png
info - Serving file '../../../website/public/docs/1-getting-started/images/ls-applications-windows-error.png'...
info - Serving file '../../../website/public/docs/part-1-transitioning-to-the-shell/3-managing-your-files/images/rm-error-directory.png'...
Let's break this down:
- cut -c 27- ~/effective-shell/logs/web-server-logs.txt - extract the log messages from the log file, skipping the timestamp
- grep error - filter down to lines which contain the text 'error'
- sort - sort the output
- uniq - show only unique values
This is a powerful technique - if we had thousands of errors in the file, this would make sure we only see distinct errors, rather than showing every error.
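If you also want to know how often each distinct line appears, the -c flag of uniq prefixes every line with a count - pipe that through a numeric reverse sort and you get a quick frequency table (the sample input here is just an illustration):

```shell
# Count occurrences of each line and show the most frequent first.
printf 'error\ninfo\nerror\nerror\ninfo\n' | sort | uniq -c | sort -rn
# Shows 'error' with a count of 3, then 'info' with a count of 2.
```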
Don't Forget Your Pager!
In Chapter 5 - Getting Help we talked about the pager - the program your shell uses to make it easier to look through larger text files, giving the option to move backwards and forwards a page at a time (or searching and so on). Don't forget to use your pager when you are working with text. When you are trying to build a pipeline and want to see intermediate results (perhaps before you use tail) then you can use the pager to avoid filling your screen and terminal with too much text.
For example, when looking at the sorted list of films, I might run this:
$ cut -d',' -f 3 ~/effective-shell/data/top100.csv | sort | less
"Jaws (1975)"
"King Kong (1933)"
"La Grande illusion (Grand Illusion) (1938)"
"La La Land (2016)"
"Lady Bird (2017)"
"Laura (1944)"
/Jaws
I've made the output smaller so that it is easier to see what is happening. In this example I've cut out the film name from my data file, sorted it, then piped the result into less so that I can page through the data and ensure it is correct - I've also searched for the text Jaws to see where it is in the file.
Summary
In this chapter we introduced a number of basic tools which let us work with text:
- head will show the first ten lines of a file.
- head -n 30 will show the first thirty lines of a file, using the -n flag to specify the number of lines.
- tail will show the final ten lines of a file.
- tail -n 3 uses the -n flag to specify three lines only.
- The $HISTFILE environment variable holds the path to the shell command history file.
- tail -f $HISTFILE uses the -f flag to follow the file, printing output as it is written to the file.
- tr 'a' 'b' is the translate characters command, which turns one set of characters into another.
- tr -d '!' shows how the -d or delete flag can specify characters to delete.
- The cut command can be used to extract parts of a line of text.
- cut -d',' -f 3 shows how the -d or delimiter flag is used to specify the delimiter to cut on, and how the -f or field flag specifies which of the fields the text has been cut into is printed.
- cut -c 2-4 uses the -c or characters flag to specify that we are extracting a subset of characters in the line, in this case characters two to four.
- cut -c 10- cuts from character ten to the end of the line.
- The cut command also allows multiple fields to be specified when cutting by field, such as -f 2,3 for the second and third fields, or -f 4- for fields four onwards.
- rev reverses text - by reversing, cutting and then re-reversing you can quickly extract text from the end of a line.
- sort sorts the incoming text alphabetically.
- sort -r reverses the sort order.
- The uniq command removes duplicate lines - but only when they are next to each other, so you'll often use it in combination with sort.
- Your pager, for example the less program, can be useful when inspecting the output of your text transformation commands.