Response to “Class Material – reposted”

In his post titled Class Material – reposted, zickbe asked a very good question about the content of ECE2524.  It is a question that has come up at least once every semester. To paraphrase: “Since there are modern GUI tools for Linux now, why are we learning all these old command line tools?” The example given in the post was a simple task of replacing all periods ‘.’ with commas ‘,’ in some text input.  Indeed, many graphical editors do have search and replace functionality that makes this particular task quite easy.  So what’s the point of learning to do it from the command line?

There are two answers to this question, each from a different perspective.

You as the User

The first is probably the perspective you are all thinking about right now: you as a user of a general-purpose operating system, editing files, writing code, surfing the web, etc.  As we have seen already, Unix has a strong tradition as a platform for text manipulation (remember, its first use was as an OS to run a word processing system for the AT&T Bell Labs patent department).  When we store our data in plain text we have a large collection of powerful tools to manipulate and process that data.

Of course, when learning new concepts we start with simple examples.  One of the simplest ways we can manipulate text is with a literal substitution, for example “replace all occurrences of the word ‘cat’ with ‘dog’”, or “replace all occurrences of ‘.’ with ‘,’”.  Literal substitutions are used often enough that many graphical tools have built the feature into the interface.  Let’s say we have a file myfile.txt and we want to change all occurrences of ‘dog’ to ‘cat’. We could either use the terminal:

sed -i 's/dog/cat/g' myfile.txt

Or we could open myfile.txt in our favorite text editor, choose the menu option for “search and replace”, enter “dog” and “cat” in the appropriate fields, click “ok” and we’re done.  For this simple case it seems like it’s hardly worth the brain-space to remember how to use sed.  Let’s kick it up a notch though.  When writing software applications we often have many files associated with one project.  What if we wanted to replace ‘dog’ with ‘cat’ across several files?  Using the GUI we would open each file in succession, click the menu that contains “search and replace”, fill in our search and replace words, hit ‘ok’ and then repeat for the remaining files.  This is probably doable for a few files.  What about 100?  1000?

find project/ -name '*.txt' -exec sed -i 's/dog/cat/g' '{}' \+

or

find project/ -name '*.txt' -print0 | xargs -0 sed -i 's/dog/cat/g'

The nice thing about this is that the amount of effort we put in is the same no matter how many files we want to process, whether it be 3, 100, 1000 or more.  Try doing 100 text substitutions in a GUI and you’re asking for a repetitive stress injury!

“Ok”, you’re saying to yourself, “but how often am I working with hundreds of files at once? I usually just have one or two files I want to modify, it’s not too bad to navigate the GUI menu a few times to do text substitution.”  Let’s think of some more examples of text manipulation you might want to do. In my previous post I described the process I went through to compile a list of links to last semester’s projects. At one point I wanted to prepend each line with a ‘-’ character to generate a list in Markdown syntax.  I could have just manually added the character to each line (there were only 19, after all), but instead I used a sed command:

sed -r 's/^(.*)$/- \1/g'

It didn’t really save me many keystrokes in this case, but it easily fit into the automated workflow I had set up to convert the list of urls to a nice HTML format suitable for posting on the blog. It’s also a task that would have become quite tedious to do by hand if there were more than the 20 or so items that I had. And if I wanted to do something a bit more complex, like “prepend only the lines containing a url with ‘-’ but leave all others unchanged”:

sed -r 's|^(.*https?://.*)|- \1|g'

Now I can selectively convert lines to a Markdown style list. This is much quicker for even medium-sized files than scanning each line by eye to find urls, and then adding a ‘-’. Can your GUI do that? And of course, if I had a few, or a few hundred files that I wanted to process like this, I could use the same `find … -exec` or `find … | xargs …` idiom I used above.
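To make it concrete what that expression does, here is a minimal Python sketch of the same substitution applied to a few lines of sample text. The pattern is the one from the sed command (with an explicit `$` added); the function name `bullet_url_lines` and the example.com addresses are made up for illustration.

```python
import re

# Same idea as the sed expression: match a whole line that contains
# http:// or https:// somewhere in it.
URL_LINE = re.compile(r'^(.*https?://.*)$')

def bullet_url_lines(text):
    """Prefix lines containing a URL with '- '; leave other lines unchanged."""
    return "\n".join(URL_LINE.sub(r'- \1', line) for line in text.split("\n"))

# Hypothetical sample input, not the real project list.
sample = "Projects from last semester\nhttp://example.com/a\nsee https://example.com/b"
print(bullet_url_lines(sample))
```

Only the two lines that contain a URL pick up the leading ‘- ’; the heading line passes through untouched.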

Another quick example: You are probably familiar with the two main styles of naming functions with multiple words, camelCase and underscore_case:

def myHelloFunction():
    pass

def my_hello_function():
    pass

Which style you use is largely a matter of preference, although sometimes when working on collaborative projects the project will define a particular style that you must adhere to. Let’s say you’ve been using one style for a few projects and then decide you want to switch (or you get a bunch of code from a friend who was using a different style, or… )

sed -r 's/([A-Z])/_\l\1/g'

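This will convert camelCase names such as myHelloFunction to my_hello_function. (A name that starts with a capital letter, such as CamelCase, comes out with a leading underscore: _camel_case. The \l escape is a GNU sed extension.) Doing the same automatic formatting in a GUI of your choice is left as an exercise for the reader.  A quick Google search will turn up a sed command to do the reverse transformation.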

The take-away from all of this is that while the examples we use in class may be simple enough that a GUI editor happens to have implemented similar functionality, the tools themselves are much more powerful. GUIs are great in that they make it really easy to do the things that the GUI designers planned for. However, they make it difficult or impossible to do things that the designers didn’t plan for.  In the case where you want to perform a text manipulation on a large number of files, or a complex manipulation on one or more files, the command line tools provide a solution where the graphical tools do not.

You as the Developer

But you’re not just any user, are you? You are getting a degree in Computer Systems Engineering, and even if you plan to focus on hardware you will almost certainly be writing software at some point (probably many points).  You may even write some software that needs to do text manipulation: perhaps a preprocessor for a compiler, or even your own text editor.  What if you want to build in some functionality to allow the end-user to do some text manipulation? Maybe a simple text substitution, or perhaps you’re writing an IDE and want to provide a menu option to automatically convert camelCase to camel_case across a set of project files.  How would you implement this?  For these examples it probably makes sense to use the regular expression library of whichever language you are programming in, but even in that case the expressions themselves will be essentially the same as in the sed examples.  In some cases you may actually want to spawn a child process running one of the sed commands from above directly (maybe you want to run a complex text manipulation on a large number of files that a user selects with a GUI, and let the manipulation run in the background while the GUI is free to take additional requests from the user).
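As a rough sketch of what that IDE menu option might look like underneath, here is one way to do it with Python’s `re` module. The search pattern is the same `([A-Z])` used in the sed command; Python’s `re` has no `\l` escape, so the replacement is built by a small function instead. The names `camel_to_snake` and `convert_tree` are hypothetical, and the demo runs on a throwaway temporary directory rather than a real project.

```python
import os
import re
import tempfile

# Same search pattern as the sed example: every upper-case ASCII letter.
UPPER = re.compile(r'([A-Z])')

def camel_to_snake(text):
    # Python's re module has no \l escape, so a function builds the replacement.
    return UPPER.sub(lambda m: '_' + m.group(1).lower(), text)

def convert_tree(root, suffix='.py'):
    """Rewrite every file under root whose name ends with suffix.

    Returns the list of paths that were actually changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path) as f:
                original = f.read()
            converted = camel_to_snake(original)
            if converted != original:
                with open(path, 'w') as f:
                    f.write(converted)
                changed.append(path)
    return changed

# Demo on a temporary directory so that no real project is touched.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'example.py')
with open(demo_file, 'w') as f:
    f.write('def myHelloFunction():\n    pass\n')

convert_tree(demo_dir)
with open(demo_file) as f:
    print(f.read())
```

Like the sed one-liner, this rewrites every capital letter it finds, including those in strings, comments and class names, so a real tool would need to be more selective about what it touches.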

Summary

As you are working with the command line and working through the examples for this class remember to keep in mind the flexibility of the commands you are learning.  In many cases the examples will be so simple that the same functionality has been implemented in any of the popular graphical tools, but the command line version provides much more control and flexibility, as I hope these few examples have demonstrated.  Can you think of any other examples that could be done using command-line text manipulation tools but would be impossible in a general purpose graphical environment?

As I mentioned before, this question comes up every semester.  How could the material in the class be modified to make the power of the tools we learn more apparent?  Should more complex examples be included at the possible expense of clarity? More examples?  Was the explanation I gave here convincing?  If not, please explain why in the comments and I’ll do my best to revise!

Response to “Advanced Python exercises”

Recently Matt brought up some good points in his post Advanced Python exercises.  First, the more easily addressed one:

To break up a Python program into multiple files, just store related functions in a separate `.py` file, then use import in the main source.

For example, if you have a file named `hello.py` that contains

def greeting(string):
    return "Hello, {}".format(string)

Then in your main source file (located in the same directory), you can import `hello.py` and all the functions will be available from the `hello` namespace:

import hello

print(hello.greeting("world"))

Easy as .py! (sorry)

Also in the post Matt talked about Python’s use of “try” and “except” as a means of flow control, for example:

import sys

# line is assumed to be a string read from the input
try:
    mynumber = float(line)
except ValueError as e:
    sys.stderr.write("We have a problem: {}\n".format(e))
else:
    print("We have the number {}".format(mynumber))

He said “if complex return types are needed such that you’re throwing exceptions to communicate logic information rather than true fatal errors, your function needs to be redesigned”, which I agree with. But I disagree that Python itself relies on this technique, or encourages it, though of course individual developers may misuse it.

Like the programs they are a part of, functions should be written to do one thing and do that one thing well. In the preceding example the function float converts its argument to a floating point value. The name of the function makes it very clear what it does, and it does its job well. If it can’t do its job, then it raises a ValueError exception.  All other built-in and library functions I’ve seen work the same way.  Think about the alternative without exceptions, for example, C:

// atof is declared in <stdlib.h>, printf in <stdio.h>
double n = atof(line);
printf("We have the number %f\n", n);
// But if n == 0.0 we cannot tell whether line held a zero or no valid number at all

According to the documentation for atof:

On success, the function returns the converted floating point number as a double value.
If no valid conversion could be performed, the function returns zero (0.0).
There is no standard specification on what happens when the
converted value would be out of the range of representable values by a double. 
See strtod for a more robust cross-platform alternative when this is a possibility.

This is less than ideal. If we wanted to check whether an input even contained a valid numeric string (which we usually would), we’d have to work harder. The strtod alternative provides a means to do that, but we’d still have to do an explicit check after calling the function.  Other C-style functions use a return code to indicate success or failure, and it is extremely easy to forget to check return codes, in which case the problem may only manifest itself in a crash later on or, worse, not crash at all and just produce bad output.  These types of problems are very hard to track down and debug.  With exceptions, the program stops precisely where the problem occurred unless the programmer handles that particular exception.

So to summarize: Each of your functions should have a well-defined job.  They should do only one job and do it well.  If they can’t do their job because of improper input, then they should raise an appropriate exception.  I think following that idiom results in cleaner, easier-to-debug code.  You could certainly still return complex data types where appropriate, but trying to incorporate success/failure information in a return type will often lead to difficult-to-debug errors when the programmer forgets to check the return status!
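Here is a small sketch of that idiom, assuming a made-up task of reading percentages from lines of text. The function `parse_percentage` and its status-code counterpart `parse_percentage_status` are hypothetical names invented for this illustration, not anything from the class material.

```python
def parse_percentage(text):
    """Convert text to a float between 0 and 100, or raise ValueError."""
    value = float(text)  # float() itself raises ValueError on non-numeric text
    if not 0.0 <= value <= 100.0:
        raise ValueError("percentage out of range: {!r}".format(text))
    return value

def parse_percentage_status(text):
    """Return-code style: (ok, value). The caller has to remember to check ok."""
    try:
        return True, parse_percentage(text)
    except ValueError:
        return False, 0.0

lines = ["12.5", "oops", "250", "80"]

# Forgetting to check the status flag quietly lets bogus zeros into the data.
unchecked = [parse_percentage_status(line)[1] for line in lines]

# With exceptions, a bad line cannot slip through unnoticed: either we
# handle it here or the program stops at the offending call.
accepted, rejected = [], []
for line in lines:
    try:
        accepted.append(parse_percentage(line))
    except ValueError:
        rejected.append(line)

print(unchecked)
print(accepted, rejected)
```

The unchecked list ends up with a 0.0 for each of the two bad lines, which would silently skew any average computed from it, while the exception version keeps the bad input separate.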