Week 2: More shell, regex, version control
Part I: More shell
I/O redirection
Whenever you execute a command or script in Linux, three files are always open.
These files are mapped to the standard input, standard output, and standard
error streams (STDIN, STDOUT, STDERR). By default, STDIN is your keyboard and
STDOUT is the terminal from which you are executing the command. For STDERR, it
depends: error messages from shell commands typically appear on the terminal,
but some programs direct the error log to an appropriate file.
Here are 3 examples of output redirection:
$ echo "foo" > file.txt
$ echo "bar" >> file.txt
$ :> file.txt
The first command outputs "foo", but redirects the standard output to file.txt
instead of the terminal using the > operator (which is why you won't see "foo"
printed when you execute it). However, if the file was nonempty, its previous
content is lost. This is not the case with the second command: any output is
appended to the file when using >>. Finally, the last command uses the : shell
builtin, which is a null operator, and redirects its output to file.txt.
Since : produces no output, that effectively resets the file contents.
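The three operators can be compared side by side with a short script. This is only a sketch: the temporary directory and the variable names (workdir, lines_after_append and so on) are made up for illustration.

```shell
# Work in a throwaway directory so no real file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "foo" > file.txt       # file.txt holds one line: foo
echo "bar" >> file.txt      # appended: foo, then bar
lines_after_append=$(wc -l < file.txt)

echo "baz" > file.txt       # > truncates first, so only baz remains
content_after_overwrite=$(cat file.txt)

: > file.txt                # : prints nothing, so the file ends up empty
size_after_reset=$(wc -c < file.txt)

echo "lines after append: $lines_after_append"
echo "content after overwrite: $content_after_overwrite"
echo "bytes after reset: $size_after_reset"
```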
Input redirection is done with the help of the < operator. Consider the read
command, which reads a line from standard input into variables (splitting the
input into tokens on whitespace). For example:
$ cat file.txt
foo bar
$ read var_1 var_2 < file.txt
$ echo $var_1
foo
$ echo $var_2
bar
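A natural follow-up question is what happens when the line has more tokens than there are variables. The sketch below (file names and variable names are arbitrary) suggests the answer: the last variable receives the remainder of the line.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Same situation as above: two tokens, two variables.
echo "foo bar" > file.txt
read var_1 var_2 < file.txt

# Four tokens but only two variables: "rest" takes everything after the first token.
echo "a b c d" > many.txt
read first rest < many.txt

echo "var_1=$var_1 var_2=$var_2"
echo "first=$first rest=$rest"
```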
Another useful construct related to input redirection is what is known as a here document. It is best described using an example:
$ cat <<EndOfMsg
This is a line.
This is another line.
More lines might follow.
EndOfMsg
# output:
# This is a line.
# This is another line.
# More lines might follow.
In the above, everything between the first and second appearances of EndOfMsg
gets redirected to the standard input of cat (cat expects the user to type
an input string by default, but here input is redirected). For it to work properly,
the closing EndOfMsg must appear on a line by itself. Finally, note the use of
the << operator (instead of <).
Note: You can combine input and output redirection at will. Run the following examples on your terminal and see what they do:
$ echo "1 2 3" > file.txt
$ cat < file.txt > output1.txt
$ cat > output2.txt < file.txt
$ cat > output3.txt <<EndOfMsg
Foo
Bar
Baz
EndOfMsg
Finally, it is also possible to redirect one of the three streams to another.
The syntax is i>&j, where i and j are file descriptors. By default, 0 is STDIN,
1 is STDOUT, and 2 is STDERR. Here is an example, where we run the ls command,
redirect its output to output.log, and redirect the standard error to the
standard output (which has been set to output.log):
$ ls -l > output.log 2>&1
In fact, you can use these file descriptors more generally. Some examples can be found here.
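One detail worth trying out is that redirections are processed from left to right, so the position of 2>&1 matters. The sketch below uses a made-up helper function, emit, that writes one line to each stream; the log file names are also arbitrary.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: one line to STDOUT, one line to STDERR.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# STDOUT goes to both.log first; STDERR is then pointed at the same place.
emit > both.log 2>&1

# Each stream in its own file.
emit > out.log 2> err.log

# Reversed order: STDERR is duplicated from the STDOUT in effect at that moment
# (here leaked.log, set for the whole { ...; } group), and only afterwards is
# STDOUT sent to stdout_only.log.
{ emit 2>&1 > stdout_only.log; } > leaked.log

cat both.log
```

With the reversed order, the error line does not end up in stdout_only.log, which is why 2>&1 is usually written last.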
One of the advantages of the UNIX shell is composability: creating a pipeline
of simple commands to accomplish complex tasks. This is why the last I/O
redirection operator we will look at is the pipe, |. This operator
"chains" input and output streams, connecting the STDOUT of the previous
program to the STDIN of the next.
Here is a simple example that uses the grep command introduced below.
By default, grep <pattern> <input> searches for lines containing <pattern>
in its input, which can be either STDIN or a list of files:
$ who | grep "vasilis"
vasilis tty7 2021-02-08 09:17 (:0)
Here, the output of who (a list of users currently logged into the system)
becomes the standard input for grep, which searches for the pattern vasilis.
You can chain more than one pipe. For example, here we count the number of files
in the current directory whose names contain the word "backup" using the wc
utility; wc -l counts the number of lines in the standard input.
$ ls | grep "backup" | wc -l
Here is a similar example: suppose you have a list of contacts, contacts.txt,
with lines in the format <FIRSTNAME> <SURNAME> <PHONENUMBER>.
You want to find all the lines that contain the name "Johnson", sort them in
increasing order according to first name, strip the phone numbers and put them
in a file called johnsonses.txt:
$ grep "Johnson" contacts.txt | sort -k1 | awk '{print $1, $2}' > johnsonses.txt
Here, awk '{print $1, $2}' prints the first and second columns for every line
in its standard input, which correspond to the first and last names of all the
Johnsonses. Similarly, you can collect all their phone numbers in a separate file:
$ grep "Johnson" contacts.txt | awk '{print $3}' > johnsonses_numbers.txt
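To try both pipelines without a real address book, you can generate a small contacts.txt first. The names and phone numbers below are invented sample data, and the temporary directory is only there to keep the experiment self-contained.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented sample data in the <FIRSTNAME> <SURNAME> <PHONENUMBER> format.
cat > contacts.txt <<EndOfContacts
Mary Johnson 555-0101
Alan Turing 555-0102
Bob Johnson 555-0103
Alice Johnson 555-0104
EndOfContacts

# Names of all the Johnsons, sorted by first name, without phone numbers.
grep "Johnson" contacts.txt | sort -k1 | awk '{print $1, $2}' > johnsonses.txt

# Their phone numbers, in the order they appear in contacts.txt.
grep "Johnson" contacts.txt | awk '{print $3}' > johnsonses_numbers.txt

# How many Johnsons are there?
johnson_count=$(grep "Johnson" contacts.txt | wc -l)

cat johnsonses.txt
```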
Many nontrivial uses of piping arise in text processing. For that reason, we will now move on to regular expressions.
Part II: Regular expressions
Regular expressions (regex for short) are patterns that match characters in strings. They are a mix of "ordinary" characters (like substrings you wish to match exactly) and "special" characters that allow for repetitions, combinations, and other interesting features.
Regular expressions are supported by several languages and command-line tools.
For example, the grep utility in UNIX allows you to probe files for patterns
using regular expression syntax, the sed utility allows you to perform
substitutions using regular expressions, and so on. Python has a library called
re to create regular expressions as well. During this part of the course, we
will be putting on our UNIX hat and working with command-line tools, but feel
free to use Python to practice them in your spare time.
Setup & Basic Constructs
The most common use of regular expressions is filtering a collection of strings trying to find matches to a given pattern. Writing correct and unambiguous patterns is the essence of writing regular expressions.
Consider the simplest task of filtering through a set of strings, returning all those that contain the sequence "vasilis". For example:
$ cat example.txt
My name is vasilis and I like to write code.
My name is also Vasilis but I don't like to write code.
I spoke to Mateo and he told me about Python.
I spoke to vasilid and he taught me about regular expressions.
asdfasdfasdfasdfvasilisasdfasdf.
This file contains a collection of sentences (one per line), and we wish to output
each line that contains the sequence "vasilis". To do so, we can write a simple
grep command as follows:
$ egrep "vasilis" example.txt
My name is vasilis and I like to write code.
asdfasdfasdfasdfvasilisasdfasdf.
The grep utility works as follows: it treats the first argument as the pattern
and the second argument (typically) as the input file. It applies the pattern to each
line in the file, and prints all the lines that match. Note that patterns are case-sensitive;
for example, we ignored the second line that contains the word "Vasilis" because
the leading "v" must be lower case to match the pattern. It is good practice
to enclose the pattern in double-quotes when using grep in a script.
Note: We are using egrep here for reasons that will be clarified later;
namely, to make sure that meta-characters are treated as expected.
Character ranges
To circumvent the above problem (we matched "vasilis" but not "Vasilis") we introduce a fundamental construct: character ranges. If there is a part of your pattern where more than one character should match, you can enclose the set of acceptable characters in square brackets:
$ egrep "[Vv]asilis" example.txt
My name is vasilis and I like to write code.
My name is also Vasilis but I don't like to write code.
asdfasdfasdfasdfvasilisasdfasdf.
When using a character range, there are some tricks to simplify the resulting
pattern. For example, if you want to match any number between 0 and 9, you can
write [0123456789] or [0-9]; the two are equivalent. The same is true for
[abcdefghijklmnopqrstuvwxyz] and [a-z]. If you want to be case-insensitive,
you can also mix the two: [a-zA-Z] will match any letter between "a" and "z"
as well as their capital versions.
A note of caution: whatever you put inside the brackets will be treated as
a collection of characters to match (or not match), not as a string. For example,
writing [vasilis] will match one letter from the set {a, i, l, s, v} rather
than the string.
What if we want to exclude a set of characters from our pattern? In this case we can use the caret (^) inside the square brackets. For example:
$ egrep "vasili[^s]" example.txt
I spoke to vasilid and he taught me about regular expressions.
Here, we match all strings containing the sequence "vasili" immediately followed by any character other than "s". If there is more than one character you wish to avoid, you can add them inside the same block following the caret.
$ grep "vasili[^ds]" example.txt
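The character ranges above can be checked against the same example.txt by counting matches instead of printing them. The sketch below recreates the file in a temporary directory and uses grep -E, which is the same as egrep, together with the -c flag (print the number of matching lines). The variable names are arbitrary.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > example.txt <<EndOfExample
My name is vasilis and I like to write code.
My name is also Vasilis but I don't like to write code.
I spoke to Mateo and he told me about Python.
I spoke to vasilid and he taught me about regular expressions.
asdfasdfasdfasdfvasilisasdfasdf.
EndOfExample

either_case=$(grep -E -c "[Vv]asilis" example.txt)    # lines 1, 2 and 5
not_s=$(grep -E -c "vasili[^s]" example.txt)          # line 4 only
not_d_or_s=$(grep -E -c "vasili[^ds]" example.txt)    # no line qualifies

echo "[Vv]asilis: $either_case, vasili[^s]: $not_s, vasili[^ds]: $not_d_or_s"
```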
Metacharacters
In the above, the brackets as well as the caret are so-called metacharacters, i.e., characters that take on special function and meaning inside regexes. If we want to match the meta-character itself, we typically add a backslash in front of it (something referred to as "escaping" the character). Note the difference between the following two:
$ egrep "\[vasilis\]" example.txt
...
$ egrep "[vasilis]" example.txt
In the first example, we escape [ and ] in order to indicate that we want to
treat them as ordinary characters and match the substring "[vasilis]". In the
second example, we are not escaping them and instead end up with a character range
that will match any character from the set {a,i,l,s,v}.
Note: forgetting to escape a metacharacter is one of the most common mistakes for newcomers to regular expressions. Make sure you remember the ones you learn!
Here is another metacharacter: the so-called Kleene star (*).
The star operator indicates that the preceding character can be "matched" as
many (or as few) times as necessary. Consider, for example, trying to match all
strings of the form "hello", "helllo", "hellllo" etc. Here, the words we are
looking for start with "he", followed by at least 2 "l" characters and the
character "o" last. The following will work fine:
$ cat example_star.txt
hello
helllo
hellllo
helo
$ egrep "helll*o" example_star.txt
hello
helllo
hellllo
Here, we are telling grep to match any strings containing "hell" followed by
any number of occurrences of "l", followed by "o". A similar operator to the
Kleene star is the Kleene plus (+), which matches at least one occurrence
of the preceding character (recall that * can match as few as zero of them). For
example:
$ cat example_plus.txt
heo
helo
hello
$ egrep "hel+o" example_plus.txt
helo
hello
$ egrep "hel*o" example_plus.txt
heo
helo
hello
Another useful construct is specifying the number of occurrences explicitly. The
general syntax for that is <character>{lower_bound,upper_bound}. For example:
$ egrep "hel{2,3}o" example_star.txt
hello
helllo
The above matched all strings starting with "he" followed by between 2 and 3 "l"'s, followed by "o". You can also omit the upper or lower bound:
$ egrep "hel{2,}o" example_star.txt
hello
helllo
hellllo
The example above matches at least 2 "l"'s. On the other hand, the command below matches at most 2 "l"'s:
$ egrep "hel{,2}o" example_star.txt
hello
helo
Note: omitting the lower bound will allow zero occurrences of the sub-expression to be matched. For example:
$ egrep "hel{,2}o" <(echo heo)
heo
Note that the curly braces are also metacharacters, as demonstrated below:
$ cat example_meta.txt
hello
hel{,2}o
$ egrep "hel{,2}o" example_meta.txt
hello
$ egrep "hel\{,2\}o" example_meta.txt
hel{,2}o
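To compare the repetition operators side by side, you can run them over one word list and count the matches. The sketch below is an illustration only: the word list and variable names are made up, grep -E stands in for egrep, and the "at most 2" case is written as {0,2} since, as far as we know, spelling out the lower bound is the more portable form.

```shell
# Made-up word list: "he", then zero to four "l" characters, then "o".
words=$(printf '%s\n' heo helo hello helllo hellllo)

# grep -c prints the number of matching lines instead of the lines themselves.
star=$(printf '%s\n' "$words" | grep -E -c "hel*o")              # zero or more "l"
plus=$(printf '%s\n' "$words" | grep -E -c "hel+o")              # one or more "l"
two_to_three=$(printf '%s\n' "$words" | grep -E -c "hel{2,3}o")  # 2 or 3 "l"
at_least_two=$(printf '%s\n' "$words" | grep -E -c "hel{2,}o")   # 2 or more "l"
at_most_two=$(printf '%s\n' "$words" | grep -E -c "hel{0,2}o")   # 0, 1 or 2 "l"

echo "*: $star  +: $plus  {2,3}: $two_to_three  {2,}: $at_least_two  {0,2}: $at_most_two"
```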
Exercise: Write a regex matching US-style phone numbers, i.e., a 3-digit area code followed by a dash and 7 more digits. Note the first digit in the area code cannot be zero.
Solution
$ egrep "[1-9][0-9]{2,2}-[0-9]{7,7}" <file>
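A quick way to check the solution is to feed it a few candidate strings. The sketch below uses invented phone numbers and writes {2} and {7}, which mean the same as {2,2} and {7,7}; grep -E is the same as egrep.

```shell
# Same pattern as the solution, with {n} as shorthand for {n,n}.
pattern="[1-9][0-9]{2}-[0-9]{7}"

# Invented test inputs, one per line; grep prints only the lines that match.
matches=$(printf '%s\n' \
  "607-2551234" \
  "012-3456789" \
  "607-25512" \
  "call 212-5550199 now" | grep -E "$pattern")

echo "$matches"
```

The second input is rejected because its area code starts with zero, and the third because it has too few digits after the dash. Note that the pattern matches anywhere in the line, so a line containing extra digits around a valid number would still be printed.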
Here is one more: the "optional" metacharacter. Consider the following scenario: you are profiling a piece of code and generate a log file that reports how many function calls were performed during a test run. You wish to match lines that look like
24 calls found.
3 calls found.
1 call found.
Here, you decide to match any lines that contain "call", optionally followed by one "s" character. Two equivalent ways to do it:
$ egrep "calls{0,1} found" output.log
$ egrep "calls? found" output.log
Here, "?" applies to the preceding character and indicates that we should try to match "call" or "calls" (whichever produces a successful match).
Exercise: write a regular expression that matches a string starting with "a", followed by any sequence of letters, followed by at most 1 number, and ending in "z".
Solution
The following will work: a[a-zA-Z]*[0-9]?z
Conditional matches
This is another useful construct: suppose you are parsing a file containing paths
to other files and want to list all image files that end in .jpeg or .png.
Naively, you can write a regular expression that matches all ".jpeg" substrings,
another that matches all ".png" substrings, and append the output of both to a file:
$ egrep "\.jpeg" paths.log >> output.txt
$ egrep "\.png" paths.log >> output.txt
Notice that we are escaping the dot, since it is also a meta-character (matches any character). Because either match is valid, you can use the following:
$ egrep "\.(jpeg|png)" paths.log
Optionally, since some endings might be capitalized, you can use the -i flag
of the egrep command to ignore case. This will also match, e.g., a line
containing file.PNG.
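To see the effect of the -i flag, you can count matches with and without it. The paths.log below is invented sample data, and grep -E is used in place of egrep (the two are equivalent).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Invented list of paths; only three of them are images.
cat > paths.log <<EndOfPaths
/home/user/photos/cat.jpeg
/home/user/photos/dog.PNG
/home/user/notes/todo.txt
/home/user/photos/bird.png
/home/user/jpeg_notes.txt
EndOfPaths

case_sensitive=$(grep -E -c "\.(jpeg|png)" paths.log)       # misses dog.PNG
case_insensitive=$(grep -E -i -c "\.(jpeg|png)" paths.log)  # finds all three images

echo "without -i: $case_sensitive, with -i: $case_insensitive"
```

The last path is never counted: it contains "jpeg", but not preceded by a literal dot, which is exactly what escaping the dot buys us.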
More resources
You can find useful overviews of regular expression syntax here.
Beyond grep and egrep, two programs that use regular expressions regularly
(pun unintended) are sed and awk. You can find some cheatsheets here:
Part III: git and version control
Have you ever found yourself naming your files script.py, then script_1.py,
script_2.py (or, even worse, script_1(1).py) because you want to be able to
go back to the previous version in case anything goes wrong?
If you are this person, git is the tool for you.
Basic workflow
The git workflow mostly adheres to the following pattern:
- Create a new project (either local or remotely in a code repository)
- Make incremental changes to the project (e.g. add / edit / remove files)
- "Commit" the last batch of changes (with a message summarizing what they change in the project)
- "Push" the changes to a remote repository
- Repeat steps 2-4 until the project is completed (or abandoned :))
There are several variations to this, and the way people implement each step depends on the nature of the project. For example, if you are working at a software company, you likely want to maintain several "views" of the project:
- A "stable" view, containing the version of the code that you serve to your customers. This code carries the implicit promise that it is well-tested and free of any known software vulnerabilities.
- A "testing" view, which is a version of your software that is experimental but stable enough to offer to the public, which acts as a beta tester. This code is not bug-free, but having several users try it is key to finding any additional bugs.
- A "development" view (or more!), where new features are currently implemented (a work in progress). Typically, this view is intended to be used by experienced users and other developers, but not the end-user.
Git offers tools that make this workflow remarkably easy (via the concept of branches, which we will introduce soon).
Creating a repository, adding and committing changes
If you just installed git, you first need to set a username and an email.
This is done via git config:
$ git config --global user.name <your_username_here>
$ git config --global user.email <your_email_here>
This means that you will be using this username and email for all your git
projects / repositories. You can also create a local configuration that only
applies to a particular project (e.g. if you need to use your company's email
domain or a particular username, or any other project-specific settings that
are not necessarily username and email). You can read all about it with
git config --help.
All git projects are developed in so-called repositories. The most common
use case is when you create a repository in an online service, such as GitHub,
and then create a local copy for the computer you are working on. To make a
local copy, you use the git clone command:
$ git clone <repository_address>.git
# or
# git clone <repository_address>.git <local_directory>
The first command above creates a local folder with the same name as the
remote repository, while the second command specifies the name of the
local directory (to be created). For now, we will assume that you are cloning
a remote repository, which is the most common use case (git clone also accepts
a path to a repository on your own machine).
How does git track changes? The git system maintains a local index of
changes to files (also sometimes called the staging area). For example, if
you changed a file, you use git add <file> (or git rm if you deleted
something, or git mv if you renamed it) to record the changes to the
staging area. You can repeat this process multiple times:
# change file1 (e.g. via an editor), record the changes
$ git add file1.txt
# change file2, record the changes
$ git add file2.txt
# delete a file completely, record the changes
$ git rm file3.txt
# rename a file to something else
$ git mv file4.txt new_file4.txt
At this point, the staging area has the updated content for these 4 files.
Now comes the important part: committing your changes. Whenever you
git commit something, git creates a new snapshot of your project and
assigns a unique identifier to it, called a commit hash.
$ git commit -m "Added files 1 and 2, deleted file3, renamed file4 -> new_file4"
Here we see the git commit command in action. The -m flag specifies a commit
message, which is a short summary describing the changes introduced by the new
snapshot. It is recommended to make your commit messages as informative as
possible, as that gives you an idea of what changes in each snapshot without
having to look at the code itself.
To keep commit messages short but informative, it is a good idea to make a habit of committing changes in small chunks rather than introducing huge blocks of changes. For example, consider the sequence below:
# edit script.py
$ git add script.py
$ git commit -m "Added get_parameters(), get_info()"
# edit it some more
$ git add script.py
$ git commit -m "Fixed error in set_parameter()"
# rename it
$ git mv script.py utils/script.py
$ git commit -m "Move to utilities"
Contrast the above with the shorter (but messier) sequence:
# edit script.py
$ git add script.py
# edit it some more
$ git add script.py
# rename it
$ git mv script.py utils/script.py
$ git commit -m "Added some functions, fixed a typo and introduced new util"
In addition to the first sequence being way more informative (at the cost of extra commit messages), it makes it easier to identify where a mistake happened by looking at the history (reverting problematic changes is easier for the same reason). As a rule of thumb, you should create a new commit for every major change you make to a component of your code (that being said, you should not create a new commit for each new typo you find & fix).
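If you want to replay the small-commits workflow without touching a real project, a throwaway repository works well. The sketch below makes several assumptions for illustration: the directory comes from mktemp, the identity is a placeholder set only for this repository, and the file contents are dummy Python lines.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
git init -q .

# Placeholder identity, local to this repository (no --global), so the
# sketch neither depends on nor changes your own configuration.
git config user.name "Example User"
git config user.email "user@example.com"

echo "def get_parameters(): pass" > script.py
git add script.py
git commit -q -m "Added get_parameters()"

echo "def get_info(): pass" >> script.py
git add script.py
git commit -q -m "Added get_info()"

mkdir utils                        # git mv needs the target directory to exist
git mv script.py utils/script.py
git commit -q -m "Move to utilities"

# One snapshot per logical change.
commit_count=$(git rev-list --count HEAD)
git --no-pager log --oneline
```

The log then reads as a short changelog, one line per change, which is the payoff of committing in small chunks.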
Pushing and pulling changes
Keeping with our assumption that you are working with code on a remote repository,
we now wish to push our changes to the remote repository, so that other people
can grab the updated code. This is the role played by git push:
$ git push <remote_name> <branch_name>
This pushes your code to the remote repository (its address is specified by
<remote_name>) at the given branch <branch_name>. This publishes your local
changes and makes them available to other users.
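You can practice pushing without a hosting service by letting a bare repository on your own disk play the role of the remote. Everything in this sketch is an assumption made for illustration: the directory names remote.git and local, the placeholder identity, and the file readme.txt.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A bare repository on disk stands in for the remote server.
git init -q --bare remote.git

# Cloning it sets up a remote called origin; git warns that the clone is empty.
git clone -q remote.git local 2>/dev/null
cd local || exit 1
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > readme.txt
git add readme.txt
git commit -q -m "Add readme"

# The branch name depends on your git defaults (e.g. master or main).
branch_name=$(git rev-parse --abbrev-ref HEAD)
git push -q origin "$branch_name"

# Ask the stand-in remote how many commits it now has on that branch.
remote_commits=$(git --git-dir="$workdir/remote.git" rev-list --count "$branch_name")
echo "remote has $remote_commits commit(s) on $branch_name"
```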
Tip: Saving time
If your remote is pointing to an HTTPS address, you will be asked for your
credentials every time you perform a git push. To avoid typing your
password all the time, you can tell git to keep it in memory for a few
minutes. To do so, type
$ git config --global credential.helper cache
By default, this keeps your credentials in memory for 15 minutes. To change the default duration, you can specify the time (in seconds):
$ git config --global credential.helper "cache --timeout 3600"
The above specifies that git should cache your credentials in memory
for an hour (3600 seconds). Note the use of quotes here.
The names used most often are origin for the remote (by convention, the address
from which you cloned the repository) and master for the branch (by convention,
the "main" branch of your code). Below we introduce and explain these concepts
in some more detail.
Branches
Branches are essentially different "paths in time" for your code. The "main"
branch is called master by convention, and all other branches were derived
from master at some point in time. The purpose of branches is best explained
in terms of a development workflow.
Suppose you and your collaborator are working on a project and want to work
on two different features at the same time. Since you will be working on
your local copies of the project, you will be creating different snapshots
that are interspersed with each other in time. In other words, if someone could
see the snapshots of the project in the order you and your collaborator
committed them, the order would not make sense. This has very real implications
when you eventually both want to "push" your changes, since git does not know
how to combine out-of-order changes (except in very special cases). This
is because every commit has a parent commit, and you and your collaborator's
commits do not have consistent parents (except for your very first commit
after you start working independently).
With branches, you and your collaborator can create parallel timelines and
merge them at the end. For example, you create a branch called feature_1
and your collaborator creates a branch called feature_2. You create commits
independently on the respective branches, and then you merge each of the
two branches into master.
To work with branches, you typically use the git checkout command:
$ git checkout -b <new_branch> # creates new branch
$ git checkout <existing_branch> # updates the working branch
Branching is a very important feature of git, but things can get technical
when explaining the mechanics behind it. Reading the reference tutorial for
branching is a must for every git user.
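The feature_1 / feature_2 scenario described earlier can be simulated by one person in a throwaway repository. This is a sketch under assumptions: both "collaborators" work in the same local repository, the identity and file names are placeholders, and the merging is done with the git merge command, which has not been introduced above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"

# The main branch may be called master or main depending on your git defaults.
main_branch=$(git rev-parse --abbrev-ref HEAD)

# First timeline: feature_1.
git checkout -q -b feature_1
echo "one" > feature_1.txt
git add feature_1.txt
git commit -q -m "Add feature 1"

# Second timeline: feature_2, also starting from the main branch.
git checkout -q "$main_branch"
git checkout -q -b feature_2
echo "two" > feature_2.txt
git add feature_2.txt
git commit -q -m "Add feature 2"

# Bring both timelines back into the main branch.
git checkout -q "$main_branch"
git merge -q feature_1                        # main has not moved: fast-forward
git merge -q -m "Merge feature_2" feature_2   # histories diverged: merge commit

commit_count=$(git rev-list --count HEAD)
git --no-pager log --oneline --graph
```

After the two merges, the main branch contains the files from both features, and the graph view of the log shows where the timelines split and rejoined.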
Remotes
Remotes are repositories whose branches you are tracking. More often than not, you will only work with a single remote (the one where your project started). A common scenario for working with more than one remote is if you maintain a code repository with multiple hosting services (e.g. GitHub and Bitbucket). Because working with multiple remotes is somewhat uncommon, this is a topic better deferred to the reference tutorial.
Exercise
To get started with git, activate your Cornell GitHub account and create your first repository!