Week 1: Linux, Command Line
Part I: Setting up a Linux machine
Several of the tools presented in this class work (and were developed) in a UNIX-style operating system. In particular, we will be using GNU/Linux:
- Linux is the name of the operating system kernel; a kernel is a collection of low-level code that interfaces with hardware and provides basic functionality, such as a file system. For example, it contains code that allows it to recognize when a keyboard has been plugged into the USB port and register it so that you can use it properly.
- GNU is a collection of programs, including libraries for developing software, text editors, a web-browser, etc.
Because GNU and Linux are free software, there is a variety of community-developed distributions of it. A Linux distribution ('distro') is essentially a collection of packages that are considered 'sane defaults' by a community. Importantly, Linux distributions contain package managers, which are pieces of software that allow you to manage, add and remove applications. If you haven't tried Linux before, we recommend trying a distribution that is easy to install and has plenty of documentation available online. Ubuntu and Debian are two such distributions.
Installing as a virtual machine
If you haven't used Linux before, it is probably a good idea to try it in a virtual machine. To do so, try the following:
- Download and install Virtualbox, the software that will emulate the Linux machine.
- Download a stable version of Ubuntu or Debian and make a virtual machine in Virtualbox. When prompted, make sure to allocate at least 2GB (or more!) of RAM and at least 16GB of hard drive space.
Installing on a dedicated machine
It is also possible to install GNU/Linux as a standalone operating system, either by itself or alongside Windows ('dual boot'). If you wish to do so, make sure to back up all your important files and read the installation instructions carefully.
Part II: The shell
What is the shell? The shell is a program that exposes an OS's services to a user (or another program). Contrary to popular belief, what we call the shell is not the same as what is called the terminal. However, in Linux the main way to interact with the shell is via the terminal, which provides a command-line interface (meaning: not graphical) to the shell, which is why we will sometimes use the words "shell" and "terminal" interchangeably.
There are several different shells available in Linux. Most distributions come
with the bash shell. Other shells include zsh, fish, tcsh, and csh.
Getting started
To get started, fire up your distribution's terminal. Depending on your distribution, it might be called "Terminal", "Console" or some variant thereof. This will start a command-line interface (CLI) where you will be able to type commands and see their output. You should see something like the following (called the prompt):
[user@computer ~]$
What does this mean? The shell keeps track of which user is currently logged
in and displays their login name (user). It also displays the name of the
computer they are logged in to. The last piece of information is the
working directory, which is the current location in your computer's filesystem.
Here ~ is a special character that denotes your home directory (assuming one
is defined; more on this later).
Finally, anything to the right of the $ is a command that you type.
For example, my prompt looks like the one below:
[vchariso@vchariso-pc ~]$
Getting Help
Shell commands usually provide a manual page and/or can be invoked with the
--help flag, which documents their behavior, arguments, return values, etc. To
access the manual page, type man <command>. For example, the following:
[vchariso@vchariso-pc ~]$ man cd
will open up the manual page for the cd
command (described below). Knowing
about the existence of and using these manual pages is essential for working
in the shell. Most newcomers forget that they even exist, and spend precious
time googling for documentation (even though several manual pages also provide
usage examples) that is already available.
Note: To use the man
command, you need to know which command you are looking
for help with. If you have a general idea of what you want to do but do not know
which command to use for it, you can use the apropos
utility, which will search
inside the manual pages for a keyword (and also supports regular expressions;
more on these later). For example, if you want to find out which command to use
to make a directory, you can type the following:
[vchariso@vchariso-pc ~]$ apropos "make directories"
mkdir (1) - make directories
mkdir (1p) - make directories
Indeed, man mkdir
will convince you that mkdir
is what you want to use to
make directories. The different numbers ((1) vs. (1p)) correspond to different
sections in the manual pages. To understand what these are, you can read
more here.
Another help utility you might occasionally need is type
:
[vchariso@vchariso-pc ~]$ type python
python is /usr/bin/python
[vchariso@vchariso-pc ~]$ type echo
echo is a shell builtin
Navigation
The shell is not very useful without knowing how to move around between
different directories. The cd
command does exactly that. For example, to change
to a directory called "Documents", we type:
[vchariso@vchariso-pc ~]$ cd Documents
[vchariso@vchariso-pc ~/Documents]$
Since you will eventually build a mental map of where your files are starting
from your home directory, typing cd
on the shell (without any arguments) will
return you to your home directory:
[vchariso@vchariso-pc ~/Documents/Books]$ cd
[vchariso@vchariso-pc ~]$
Paths: Absolute vs. Relative
When navigating in UNIX, it's important to distinguish between absolute paths (also known as full paths) and relative paths.
Absolute paths: They always start with "/" (the so-called root directory).
For example, to find out the absolute path to the working directory, you can
type pwd (from Print Working Directory):
[vchariso@vchariso-pc ~/Documents]$ pwd
/home/vchariso/Documents
In fact, "~" is a so-called shell expansion for the user's home directory, and the following two commands are equivalent:
[vchariso@vchariso-pc ~/SomeFolder]$ cd ~
[vchariso@vchariso-pc ~/SomeFolder]$ cd /home/vchariso
Relative paths: a path that doesn't start with "/" is a relative path.
Relative paths are resolved against the current working directory, as if the
working directory were prepended to them. For example:
[vchariso@vchariso-pc ~/Documents]$ pwd
/home/vchariso/Documents
[vchariso@vchariso-pc ~/Documents]$ cd Books
[vchariso@vchariso-pc ~/Documents/Books]$ pwd
/home/vchariso/Documents/Books
When you type cd Books in the above, the path Books is resolved relative to
your working directory, so the effect is the same as calling cd with the full
path /home/vchariso/Documents/Books.
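As a small sanity check of the above, the sketch below builds a throwaway directory tree (the names Documents and Books are placeholders created under a temporary directory) and confirms that a relative cd and an absolute cd end up in the same place. It uses the $(...) syntax, explained later in these notes, to capture the output of pwd.

```shell
#!/bin/bash
# Create a scratch directory tree: <tmp>/Documents/Books
base=$(mktemp -d)
mkdir -p "$base/Documents/Books"

# Reach Books via a relative path, starting from Documents.
# $(...) runs its commands in a subshell, so the working directory
# of the current shell is left untouched.
relative=$(cd "$base/Documents" && cd Books && pwd)

# Reach Books directly via its absolute path.
absolute=$(cd "$base/Documents/Books" && pwd)

echo "relative: $relative"
echo "absolute: $absolute"

# Clean up the scratch tree.
rm -rf "$base"
```

Both lines should print the same directory.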
Making & inspecting directories
To inspect the contents of your current directory, simply type ls
:
[vchariso@vchariso-pc ~/Documents]$ ls
file.txt Folder program.sh
The Documents directory contains 2 files and 1 folder: file.txt, program.sh,
and Folder. In fact, Folder is also a type of file (but a special one!).
However, unless your terminal environment uses colors or a special font to indicate different
types of files, the output of the above ls command does not give you any information
about whether Folder is a folder or just a terribly-named ordinary file. To
get this type of information, you can invoke ls with an extra argument:
[vchariso@vchariso-pc ~/Documents]$ ls -F
file.txt Folder/ program.sh*
I read about the -F argument on the ls manual page. Here, an indicator is
appended to the file name to indicate its type. For example, a "/" is appended
to "Folder" to indicate that it is an actual directory, and "*" is appended to
program.sh to indicate it is an executable file (i.e. a program).
Another option is to use ls -l
, which prints a lot more information:
[vchariso@vchariso-pc ~/Documents]$ ls -l
total 4
-rw-r--r-- 1 vchariso vchariso 0 Jan 25 22:00 file.txt
drwxr-xr-x 2 vchariso vchariso 4096 Jan 25 22:01 Folder
-rwxr-xr-x 1 vchariso vchariso 0 Jan 25 22:00 program.sh
The above prints more detailed information, including each file's permissions,
its user and group owners, its size, and the date and time it was last
modified. Here, "total 4" is the total disk space allocated to the listed
files, counted in blocks (1024 bytes each by default in GNU ls); it is not the
sum of the file sizes in bytes.
File permissions
What about the weird -rw-r--r-- bits at the beginning of the line for file.txt? This
part is a sequence indicating the file's permissions - make sure to consult the table here!
You can see that Folder
has a d
character in its permissions, which indicates
it is a directory. Also, note that the x
bit means that a file is executable,
which shows us that program.sh
is executable (even though it seems to be empty).
Somewhat (un?)surprisingly, UNIX allows you to change a file's permissions (as long
as you are the owner of the file). For example, you might want program.sh
to not
be executable until you have inspected its contents (again, forget that it is empty
for now). For that reason, you can use the chmod
command. Consider the following
two calls:
[vchariso@vchariso-pc ~/Documents]$ chmod -x program.sh
[vchariso@vchariso-pc ~/Documents]$ chmod +x program.sh
The first command removes the executable mode bit from program.sh, while the second
one adds it. Another (advanced) way of using chmod is specifying the permission bits
explicitly:
[vchariso@vchariso-pc ~/Documents]$ chmod u=rw,g=r,o=r program.sh
Here, "u", "g" and "o" stand for "user", "group" and "other". The above indicates:
- the user that owns the file can read and write (i.e. edit) it
- any additional users that are in the group that owns the file can only read it (only makes sense if the group contains more than just the current user)
- any other users (i.e. not in the current group) can only read the file
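To see the effect of explicit permission bits, the following sketch applies them to a scratch file and prints the resulting permission string. It assumes GNU stat (the -c '%A' format option is specific to GNU coreutils, which is what Ubuntu and Debian ship); the scratch file is a placeholder created with mktemp.

```shell
#!/bin/bash
# Work on a throwaway file so nothing important is modified.
scratch=$(mktemp)

# user: read + write, group: read only, other: read only
chmod u=rw,g=r,o=r "$scratch"
mode_plain=$(stat -c '%A' "$scratch")

# Add the executable bit for all three classes (a = all).
chmod a+x "$scratch"
mode_exec=$(stat -c '%A' "$scratch")

echo "$mode_plain"   # -rw-r--r--
echo "$mode_exec"    # -rwxr-xr-x

rm -f "$scratch"
```

The two strings printed have the same layout as the first column of ls -l.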
To make a directory, you can use the command mkdir
(from MaKe DIRectory):
[vchariso@vchariso-pc ~]$ mkdir test
[vchariso@vchariso-pc ~]$ cd test
[vchariso@vchariso-pc ~/test]$
By default, mkdir
will only create directories with one level of nesting, i.e.
[vchariso@vchariso-pc ~]$ mkdir test/test
will succeed if there already is a directory called test in your working directory,
and create another directory called test inside it. But if the parent
directory does not exist, mkdir will fail with an error message:
[vchariso@vchariso-pc ~]$ mkdir one/two/three
mkdir: cannot create directory 'one/two/three': No such file or directory
How do we get around this issue? We'll ask the manual pages for help! Typing
man mkdir
will open up the manual page of mkdir
, where you will see that
you can use the -p
argument if multiple directories need to be created. The
following will work:
[vchariso@vchariso-pc ~]$ mkdir -p one/two/three
There is a wealth of shell commands we will encounter in the coming weeks, and
it's completely normal to feel overwhelmed at the moment. For now, I encourage
you to set up your Linux machine and browse around using the shell to get a
feel for it. Think about tasks you ultimately want to accomplish (e.g. maybe
a text processing pipeline) and try to find some commands that could help you do
it using apropos
and read about their usage using man
.
Here are some quick exercises to get you started:
Exercise 1: Which command would you use to list the contents in the current directory, sorted by increasing order of file size? (Hint: man ls
)
Solution
The following should work: ls --sort=size -r
. The first argument instructs
ls
to sort contents by file size, and the second argument (-r
) to reverse
the order of the result.
Exercise 2: Suppose your current directory contains two files called test1.txt
and test2.txt
. You type the following commands:
$ touch test.txt
$ ls -lt
Which file do you expect to appear first in the directory listing? Why? (Hint: look up ls
and touch
)
Solution
The touch
command updates the access and modification time of test.txt
(and creates it if it did not exist before). Because the -t
flag to ls
indicates to sort by modification time, this means that test.txt
will
appear first.
Exercise 3: The whatis
command displays one-line manual page descriptions. You are curious about what printf
does, and decide to look it up.
You get the following output:
$ whatis printf
printf (1) - format and print data
printf (1p) - write formatted output
printf (3) - formatted output conversion
printf (3p) - print formatted output
What do these numbers indicate? Can you tell which of these printf
s is the one used by the shell?
Solution
These numbers indicate different parts of the manual. According to the output
of man man
, the first section of the manual contains pages for executable
programs or shell commands, while the third section is about library calls.
Therefore, the first two candidates are about the printf
used by the shell.
Writing scripts
Here comes the fun part - our first script! Open a file in your distribution's text editor, name it example.sh
,
and write the following:
#!/bin/bash
echo "Hello World!"
There are two ways to run this program. The first, and more straightforward one, is to open up a terminal and
navigate to the directory containing example.sh
, and type bash example.sh
. The other way is to run
$ chmod +x example.sh
$ ./example.sh
Hello World!
What is this doing? The first line adds the x (executable) bit to the file modes, which makes the file executable. The
second line instructs the shell to execute example.sh. We prepend ./ before the actual filename, because
we have to indicate that the file is located in the current working directory.
But how does the shell know that this is a bash
executable? That's because of the first line in your
script:
#!/bin/bash
This is an interpreter directive (often called a "shebang"), which tells the operating system that this file
is intended to be run by the executable found under /bin/bash (i.e. the bash shell itself). To take
effect, it must be the very first line of the file.
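The whole cycle (write the script, run it with bash, mark it executable, run it directly) can be reproduced from the shell itself. This is only a sketch: it writes the script into a temporary directory with printf and the > redirection instead of a text editor, and it assumes bash is installed at /bin/bash, as in the text.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Write the two-line script from above into example.sh.
printf '%s\n' '#!/bin/bash' 'echo "Hello World!"' > "$workdir/example.sh"

# Without the executable bit, we can still hand the file to bash explicitly.
via_bash=$(bash "$workdir/example.sh")

# With the executable bit set, the system reads the #! line and starts
# /bin/bash on the file for us.
chmod +x "$workdir/example.sh"
via_shebang=$("$workdir/example.sh")

echo "$via_bash"
echo "$via_shebang"

rm -rf "$workdir"
```

Both invocations should print the same greeting.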
Variables
Now that you wrote your first script, let us look into different syntactical constructs you can use.
Arguably one of the most important ones is defining and using variables. To define a variable, you
use the format <NAME>=<VALUE>
. For example:
$ myvar=10
$ anothervar=hello
Shell variable names start with a letter or underscore and may contain
any number of following letters, digits, or underscores. By default, bash
interprets all variable values as strings, unless you explicitly declare
them differently. There are 4 variable types in bash
:
- string variables (default)
- integer variables
- constant variables (i.e., read-only after they are declared)
- array variables (rarely encountered in practice; not all shells support it.)
To access / retrieve a variable's value, you need to add the "$" symbol in front of the variable name:
$ echo $myvar
10
$ echo $anothervar
hello
When assigning a value that contains spaces, use quotes:
$ anothervar="hello world"
$ echo $anothervar
hello world
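The four kinds of variables listed above can be declared explicitly with the declare builtin. The sketch below is bash-specific (other shells may not support declare or arrays), and all variable names are made up for illustration.

```shell
#!/bin/bash
# String (the default): the value is stored as plain text.
greeting="hello world"

# Integer: with -i, the right-hand side is evaluated arithmetically.
declare -i count=10
count=count+5            # count is now 15, not the text "count+5"

# Constant: -r makes the variable read-only from this point on.
declare -r course="Week 1"

# Array: elements are indexed starting from 0.
declare -a shells=(bash zsh fish)

echo "$greeting"         # hello world
echo "$count"            # 15
echo "$course"           # Week 1
echo "${shells[1]}"      # zsh
echo "${#shells[@]}"     # 3 (the number of elements)
```

Trying to assign to course after the declare -r line would produce an error rather than change its value.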
Quoting and variable substitution
We saw that $
is used to access a variable's content. The process of doing so
is called variable substitution. Examine the three versions below:
$ echo $myvar # output: 10
$ echo '$myvar' # output: $myvar
$ echo "$myvar" # output: 10
The above reveals two different types of quoting:
- strong quoting, i.e. using single quotes. In this case, no variable substitution takes place.
- weak quoting, i.e. using double quotes. Weak quoting does not interfere with substitution.
Generally, it is recommended to use weak quoting (i.e., write "$myvar"
),
especially when the content of a variable might contain whitespace. See
here for a discussion.
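To see why quoting matters when a value contains whitespace, consider the sketch below. The helper function count_args is hypothetical (it is not a standard command); it simply reports how many arguments it received, using the special variable $#.

```shell
#!/bin/bash
# Hypothetical helper: print the number of arguments it was called with.
count_args() {
    echo "$#"
}

title="hello world"

unquoted=$(count_args $title)     # the shell splits the value into 2 words
weak=$(count_args "$title")       # the value is passed as 1 argument
strong=$(count_args '$title')     # 1 argument: the literal text $title

echo "unquoted: $unquoted"   # unquoted: 2
echo "weak: $weak"           # weak: 1
echo "strong: $strong"       # strong: 1
```

A command that expects a single file name would therefore see two separate names in the unquoted case.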
Shell variables are mutable, which means you can update their values after you have defined them like we did here. If a variable is not defined, its value is the so-called NULL value (it expands to an empty string), and accessing it returns nothing:
$ echo $undefined_variable
<a blank line should be printed here>
Note: variables you define are not persistent across shell sessions. If you
close your terminal after the above commands are issued, open a new one, and
type echo $myvar, you will get a blank line. Even if you call a bash script from
bash that tries to access this variable, it will not find it unless you
explicitly export it. To convince yourself, write a script called check.sh
like below:
#!/bin/bash
echo "The value is: $myvar"
and then open up a shell, navigate to the directory containing check.sh
, and type:
$ myvar=10
$ bash check.sh
The value is:
On the other hand, if you explicitly make the value of myvar
available,
scripts that you invoke from this shell will be able to access it:
$ myvar=10
$ export myvar
$ bash check.sh
The value is: 10
Keep that in mind when thinking about what the scripts you write will try to access. In fact, it is always a better idea to make sure that your scripts accept all the required information in terms of command-line arguments, which we examine below.
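The same experiment can be run without creating check.sh, by asking a child bash process to print the variable directly. This is a sketch: bash -c '...' runs the given text as a small script in a new process, and the variable names are placeholders.

```shell
#!/bin/bash
# The child script is in single quotes so that the *child* shell, not the
# current one, performs the variable substitution.
child='echo "The value is: $demo_var"'

demo_var=10
before=$(bash -c "$child")    # demo_var is not exported: the child sees nothing

export demo_var
after=$(bash -c "$child")     # now the child process inherits demo_var

echo "$before"   # The value is:
echo "$after"    # The value is: 10
```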
Special variables
There are a number of "special" variables in bash, whose values you may access but
not set. These variables typically hold function / script parameters, process IDs, and
so on. For example, having echo $1 inside a script will output the first positional
parameter the script was called with. Consider the following script:
#!/bin/bash
# example.sh
echo "The first argument was: $1"
You should expect the following output:
$ bash example.sh first_param
The first argument was: first_param
$ bash example.sh 1 2 3
The first argument was: 1
You can read more about internal variables in bash here.
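Besides $1, two other special variables come up constantly: $# (the number of arguments) and $* (all the arguments, joined into one string when quoted). The sketch below uses a function, since functions receive positional parameters the same way scripts do; the function name describe_args is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper that reports on the arguments it receives.
describe_args() {
    echo "count: $#"
    echo "first: $1"
    echo "all: $*"
}

describe_args red green blue
# count: 3
# first: red
# all: red green blue
```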
Control Flow
The term "control flow" refers to constructs that decide which statements are executed and in what order, such as if statements and loops. Bash supports both of these constructs. An IF statement in bash looks like the following:
# example.sh
if <condition_1>
then
<statements_1>
elif <condition_2>
then
<statements_2>
else
<more_statements>
fi
Note 1: You always need to add the then
keyword after an if
or an elif
.
Note 2: The elif ...
and/or else
part is optional. As such, the following
are all valid examples:
# example_noelse.sh
if <condition_1>
then
<statements_1>
elif <condition_2>
then
<statements_2>
fi
# example_noelif.sh
if <condition_1>
then
<statements_1>
else
<statements_2>
fi
# example_onlyif.sh
if <condition>
then
<statements>
fi
Note 3: A common gotcha is placing the then keyword on the same line as the condition without separating the two:
if <condition> then
<statements>
fi
The correct way to write this is to include a semicolon, as you would when writing two commands one after the other on the same line:
if <condition>; then
<statements>
fi
Examples of conditions
We examined the skeleton of if statements above, so a natural question to ask
is: what kind of conditions do we usually have? We already saw that bash treats
the content of variables essentially as strings, so the answer is not obvious.
Indeed, the answer seems confusing at first: the <condition> blocks above are
a sequence of statements. If the last statement executed exits successfully, the
condition is met and we proceed to the "then" block. If it does not exit
successfully, the condition is not met and we move on to the next elif or else
block, if there is one. Here is an example:
$ if echo "hi"; then echo "hello"; fi
hi
hello
What just happened? The command echo "hi"
was executed, and it exited successfully
(printing "hi" along the way). Therefore the condition was met, and we proceeded
to the "then" block of our if
statement.
Note: All shell commands have an exit status code that they emit after their
execution. By convention, successful execution returns a status code of 0. All
status codes >= 1 are considered failures, and their meaning can vary
depending on the program. Manual pages document the precise meaning of each
status code for the program at hand.
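The exit status of the most recent command is available in the special variable $?. A short sketch, using the standard true and false commands (which do nothing except succeed and fail, respectively); the variable names are placeholders:

```shell
#!/bin/bash
true
status_true=$?       # 0: success

false
status_false=$?      # 1: failure

# The same statuses drive an if statement: false fails, so the else
# branch is taken.
if false; then branch="then"; else branch="else"; fi

echo "$status_true $status_false $branch"   # 0 1 else
```

Note that $? is overwritten by every command, so it has to be saved immediately if it is needed later.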
Of course, the example above was contrived. We usually want to test more
interesting conditions. For that reason, we commonly use the test command,
which evaluates a unary or binary expression and exits with status 0 if
the expression evaluates to true (and with a nonzero status otherwise).
Here are some examples involving the use of test:
# test if file.txt exists
test -e file.txt
# test if file.txt is a directory
test -d file.txt
# test if file.txt is readable
test -r file.txt
# test if file.txt is a regular file (i.e. not a directory or other special file)
test -f file.txt
# test if variable var1 is greater than variable var2, when interpreted as integers
test "$var1" -gt "$var2"
# same, but test if var1 is greater than or equal:
test "$var1" -ge "$var2"
# test if $var1 is equal to $var2 (interpreted as strings):
test "$var1" = "$var2"
You can find out more about possible conditions that you can check using
test
by checking the manual page: man test
.
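Putting test together with the if skeleton from earlier gives something like the sketch below. The function name describe_path is hypothetical, and the paths are placeholders created in a temporary directory.

```shell
#!/bin/bash
# Hypothetical helper: classify a path using test inside if/elif/else.
describe_path() {
    if test -d "$1"
    then
        echo "directory"
    elif test -f "$1"
    then
        echo "regular file"
    else
        echo "missing or special"
    fi
}

scratch_dir=$(mktemp -d)
touch "$scratch_dir/file.txt"

describe_path "$scratch_dir"               # directory
describe_path "$scratch_dir/file.txt"      # regular file
describe_path "$scratch_dir/nothing-here"  # missing or special

rm -rf "$scratch_dir"
```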
There is also a variant form of test, written with square brackets, which works identically to the test command:
# the two below are equivalent:
$ test <expression>
$ [ <expression> ]
Exercise 4: the spaces after [
and before ]
above are important. Can you guess why?
Solution
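The output of type [ tells us it is a shell built-in command. Like any other
command, it must be separated from its arguments by whitespace: if we omit
the space after [, bash will not recognize it as a command but will instead
look for a command with a name such as [-e. Similarly, the closing ] must be
passed as a separate, final argument, which is why it needs a space before it.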
For example, we can rewrite some of the above test
commands using its variant
form:
# test if file.txt exists
[ -e file.txt ]
# test if file.txt is a directory
[ -d file.txt ]
# test if file.txt is readable
[ -r file.txt ]
# test if file.txt is a regular file (i.e. not a directory or other special file)
[ -f file.txt ]
Logical operators
You can combine more than one expression in your test constructs, using the
logical AND, OR, and NOT operators. For example:
[ -e file.txt -a -x parse_file.py ]
The above tests if file.txt exists and parse_file.py is executable (we
use -a for the AND condition). However, it is recommended (for the sake of
writing portable code) to use shell-level tests, using the !, &&, || operators:
# file.txt exists and parse_file.py is executable
[ -e file.txt ] && [ -x parse_file.py ]
# file.txt is readable and parse_file.py is not a directory
[ -r file.txt ] && ! [ -d parse_file.py ]
# file.txt is a directory or a character special file
[ -d file.txt ] || [ -c file.txt ]
Shell-level tests also allow you to chain expressions, like below:
[ -d dir_name ] && cd dir_name
The above does the following: it first runs test -d dir_name
. If it succeeds
(dir_name
indeed pointed to a directory), it runs the second command which
changes the working directory to dir_name
. Here is another demonstration:
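{ [ -d dir_name ] && cd dir_name; } || echo "dir_name is not a directory"
Here, we first check if dir_name points to a directory and make it our
working directory if so; otherwise, we evaluate the other part of the OR
(||) construct which outputs "dir_name is not a directory". Note that the
commands are grouped with braces rather than parentheses: parentheses would
run cd in a subshell, leaving the working directory of the current shell
unchanged.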
Exercise 5: Using the appropriate commands, write a one-liner that
tests if file.txt
is readable, and:
* if it is readable, it displays its content.
* if it is not, it changes its permissions so that it is readable.
Solution
The following command does exactly that:
[ -r file.txt ] && cat file.txt || chmod +r file.txt
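One caveat about the one-liner above: cmd1 && cmd2 || cmd3 is not exactly the same as if/else, because cmd3 also runs when cmd1 succeeds but cmd2 fails. The sketch below illustrates the difference, with true and false standing in for real commands; when the distinction matters, a full if statement is the safer choice.

```shell
#!/bin/bash
# Chained form: the fallback runs because the middle command failed,
# even though the condition (true) succeeded.
chained=$(true && false || echo "fallback ran")

# if/else form: the else branch depends on the condition only.
explicit=$(if true; then false; else echo "fallback ran"; fi)

echo "chained:  '$chained'"    # chained:  'fallback ran'
echo "explicit: '$explicit'"   # explicit: ''
```

In the exercise, this means chmod +r would also run if file.txt were readable but cat failed for some other reason.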
Looping
There are three looping constructs: for
, while
, and until
.
The for
loop iterates over a list of objects and executes the loop body for
each object. For example:
for i in *.out
do
cat "$i"
done
This loops over all files that end with .out
in the current directory (see
here
and here
for an explanation on how the *
pattern-matching operator is used) and
displays their content.
The in ...
part is optional; if you omit it, the
shell will loop over all the command line arguments, if any:
for i
do
<process arguments here>
done
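For instance, the following sketch wraps such a loop in a function (functions see their own arguments as positional parameters, just like scripts do). The name print_args is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print each argument on its own line, with a prefix.
print_args() {
    for arg            # no "in ...": loops over the positional parameters
    do
        echo "arg: $arg"
    done
}

print_args one "two words" three
# arg: one
# arg: two words
# arg: three
```

Note that the quoted argument "two words" is kept intact as a single item.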
Looping over ranges of integers is easy using the syntax below:
for i in {1..10}
do
echo $i
done
Equivalently, you can use the seq command, which also allows you to specify
increments. For example, the following will output 1 3 5 7 9 (on separate
lines):
for i in $(seq 1 2 10) # pattern: seq <start> <increment> <stop>
do
echo $i
done
Note: here, we wrap the seq 1 2 10
command using $(...)
because we want
to capture its output and loop over it. Here is what happens if we don't use
this syntax:
for i in seq 1 2 10
do
echo $i
done
# output:
# seq
# 1
# 2
# 10
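The while and until constructs mentioned at the start of this section follow the same do ... done shape. A minimal sketch, assuming bash's arithmetic expansion $((...)) to update the counters (this syntax has not been introduced above) and the -le ("less than or equal") integer comparison of test:

```shell
#!/bin/bash
# while: repeat as long as the condition succeeds.
i=1
while [ "$i" -le 3 ]
do
    echo "while: $i"
    i=$((i + 1))
done

# until: repeat as long as the condition fails.
j=3
until [ "$j" -le 0 ]
do
    echo "until: $j"
    j=$((j - 1))
done
```

The first loop counts up from 1 to 3 and the second counts down from 3 to 1.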
Exercise 6: Write a bash
script that lists all files in the current
directory that are not themselves directories, sorted in decreasing file size.
You can assume that none of the filenames contain spaces.
Solution
There are two steps here: ls -S will print the contents of the current
directory in decreasing file size, but will also include directories (which
are typically reported with a size of 4096 bytes). To do some postprocessing, we
will use the [ -d <file> ] test to exclude directories:
for f in $(ls -S)
do
[ ! -d "$f" ] && echo "$f"
done
Note that if any of the filenames contain spaces, the loop will split them into separate words and produce incorrect results. You can read more about this here.
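One way to tolerate spaces is to read the output of ls line by line instead of letting the shell split it into words. The sketch below is an alternative under stated assumptions, not the solution above: it relies on a pipe (|) and the read builtin, neither of which has been covered yet, and it still breaks on filenames that contain newlines. The function name list_files_by_size is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the non-directories in the current directory,
# largest first, one name per line.
list_files_by_size() {
    # read -r takes one whole line at a time; IFS= keeps leading and
    # trailing whitespace in the name intact.
    ls -S | while IFS= read -r f
    do
        [ ! -d "$f" ] && echo "$f"
    done
}

list_files_by_size
```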