Stranger in a Commonplace Land

As I began reading the two introductory essays to The New Media Reader, by Janet Murray and Lev Manovich, I was at first a bit overwhelmed by the length of each.  This immediately made me think of an article referenced in the previous reading, "Is Google Making Us Stupid?": was the fact that I initially gawked at so many words and pages a result of my immersion in a world of near-instant informational gratification and 140-character thoughts? The thing is, I have no problem whatsoever reading a 500-page novel if it's interesting, and indeed there were pieces of each introduction that jumped out at me:

All creativity can be understood as taking in the world as a problem. The problem that preoccupies all of the authors in this volume is the pullulating consciousness that is the direct result of 500 years of print culture. – Janet Murray

The concept of defining a unifying model that describes all of creativity is quite appealing to me.  "The world as a problem" seems at once both a grossly oversimplified and a perfectly succinct description of creativity as I see it, particularly in my field of engineering.  Murray then goes on to draw contrasts between "engineers" and "disciplinary humanists," which particularly piqued my interest, because I often feel like an outsider looking in when talking to other engineers about humanistic concepts, but also an outsider when trying to explain how I see engineering to "disciplinary humanists."  The second essay provided a nugget that helped direct my thoughts on this curious feeling of duality.

Human-computer interface comes to act as a new form through which all older forms of cultural production are being mediated. – Lev Manovich

Whether we like it or not, this is becoming the reality.  We now get our books, music, movies, and even long-distance personal interaction mediated by a computer and the interface it provides us.  The thing is, any good engineer knows that if a piece of technology is doing its job, it should be transparent to the user.  While reading both of these essays I found myself thinking: why are we trying to force so much focus on the "new" in "new media"?  Is our doing so an indication that we as engineers still have more work to do to make the current technology transparent (I think we do), or is society so transfixed by "new" technology for some other reason that we are refusing to let it become as transparent as it could be?

Manovich, I think, would disagree on that point, at least in the U.S.: one of his explanations for the late start of new media exhibits in the U.S. was, in part, the rapid assimilation of new technology, which became ubiquitous before we had time to reflect upon its potential impacts.  As I write that I feel myself rethinking my own view, because I don't want to suggest that we not reflect upon the impact of technology we now take for granted; in fact, I have often felt we need to do much more reflecting.  I agree wholeheartedly that we have adopted technologies that have drastically changed our day-to-day lives (who plans things in advance any more when you can just text your friends last minute to find out where people are?) and that may have consequences extending far beyond the superficial sphere of their direct influence (if we don't plan our days, are we losing our skill at thinking into the future and acting accordingly in general? Are we becoming a species obsessed with living in the moment and unable to live any other way?).

I'm in danger of rambling, but I now have a better understanding of why I found it difficult to focus on the entirety of both essays.  Everything around each nugget seemed either redundant, overly descriptive, or a distraction from the thought process that had started forming in my head.  If good technology should be transparent to the user, why are we spending so much time worrying about it? And what are the consequences if we don't?

A Matter of Standards

Last night on my bike ride home from PFP class I mentally prepared a "to-do list" of things to get done in the couple of hours I'd have before getting too tired to be productive.  In a classic example of the story of my life, all that mental preparation went out the window when I finally arrived home, checked my email (probably mistake #1, checking email wasn't on the original to-do list) and read a message from a student in the class I'm teaching, ECE2524: Introduction to Unix for Engineers.

On the face of it, the question seemed to be a simple one: "how do I display a certain character on the screen?" Furthermore, they noted that when they compiled their program on Windows it worked fine and displayed the character they wanted, a block symbol: ▊, but when compiling and running on Linux the character displayed as a question mark: '?'.

Now, before you get turned off by words like "compile" and "Linux," let me assure you, this all has a point, and it relates to a discussion we had in PFP about "standards for the Ph.D."  Plus, it resulted in one of my favorite methods of procrastination: exploring things we take for granted and discovering why we do things the way we do.

After some googling around I came across this excellent post, from which I pulled many of the examples that I use here.

The problem was one of standards, but before we can talk about that we need to know a little bit about the history of how characters are stored and represented on a computer.  Even if you aren't a computer engineer you probably know that computers don't work with letters at all, they work with numbers, and you probably know they work with numbers represented in base 2, or binary, where '10' represents 2, '11' represents 3, '100' is 4, and so on.  And if you didn't know some or any of that, that's perfectly OK, because you don't actually need to know how a computer stores and manipulates information in order to use one any more.  Back in the early days of computing, though, you did.

Also important for the story: back then the kind of information people needed to represent was much more limited.  Pictures and graphics of any kind were far beyond the capabilities of the hardware used to represent information; in fact, early computer terminals were just glorified typewriters, only capable of representing the letters of the classical Latin alphabet, a-z and A-Z, the numbers 0-9 and, because much of the early development was done in the United States, the punctuation used in the English language.  To represent these letters with numbers a code had to be developed: a 1-to-1 relationship between a number and a letter.  The code that came into widespread use was called the American Standard Code for Information Interchange, or ASCII.

ASCII chart

This was a nice code for the time: with a total of 128 characters, any one character could be represented with 7 binary bits (2^7 = 128).  So, for instance, 100 0001 in binary, which is 65 in good ol' base 10, represents an upper case 'A', while 110 0001, or 97, represents a lower case 'a'.  For technical reasons it is convenient to store binary data in chunks of bits totaling a power of 2.  7 is not a power of two, but 8 is, and so early computers stored and used information in chunks of 8 bits (today's modern processors use data in chunks of 32 or 64 bits).
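
(If you have Python handy, you can check those numbers yourself.  Python is just what I happen to reach for here; the history doesn't depend on it.)

for ch in ("A", "a"):
    # letter, its decimal ASCII code, and that code written as 7 binary bits
    print(ch, ord(ch), format(ord(ch), "07b"))
# prints: A 65 1000001
#         a 97 1100001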

Well, this was all fine and good: we could represent all the printed characters we needed, along with a set of "control" characters used for other purposes, such as transmitting data from one location to another.  But soon 128 characters started feeling limited; for one thing, even in English it is sometimes useful to print accented characters, such as the é in résumé.  Well, people noticed that ASCII only used 7 bits, but recognized that information was stored in groups of 8, so there was a whole other bit that could be used.  People got creative and created extended ASCII, which assigned symbols to the integer range 128-255, thereby making complete use of all 8 bits while taking care not to change the meaning of the lower 128 codes.  So, for instance, 130 was now used to represent é.

The problem was that even 256 characters are not enough to represent the richness of all human languages around the world, and so as computer use became more prevalent in other parts of the world the upper 128 codes were used to represent different sets of symbols; for instance, computers sold in Israel used 130 to represent the Hebrew letter gimel (ג) instead of the accented é.  At first, everyone was happy.  People could represent all or most of the symbols needed for their native language (ignoring for the moment Chinese and Japanese, which have thousands of different symbols, with no hope of fitting into an 8-bit code).
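
To make that concrete, here's a little Python sketch.  I'm using cp437 (the original IBM PC code page) and cp862 (the DOS Hebrew code page) as stand-ins for "a U.S. machine" and "an Israeli machine"; the exact code pages in use varied, so treat these as illustrative assumptions.

same_byte = bytes([130])
print(same_byte.decode("cp437"))   # é on a machine using the U.S. code page
print(same_byte.decode("cp862"))   # ג on a machine using the Hebrew code page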

Then the unthinkable happened.  The Internet, and more to the point email, changed the landscape of character representation, because all of a sudden people were sending and receiving information to and from different parts of the world.  So now, when an American sent their résumé to a colleague in Israel, it showed up as rגsumג.  Whoops!
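
That mangled résumé is easy to reproduce with the same two (assumed) code pages:

text = "résumé"
stored = text.encode("cp437")      # the é is stored as the single byte 130
print(stored.decode("cp862"))      # read back with the Hebrew code page: rגsumג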

But what to do?  At this point there were hundreds of different "code pages" used to represent a set of 256 characters with 8 bits.  While the lower 128 codes remained mostly consistent between code pages, the upper 128 were a bit of a free-for-all.  It became clear that a new standard was needed for representing characters on computers, one that could be used on any computer to represent any printed character of any human language, including those that could not easily be represented by only 256 characters.

The solution is called Unicode, and it is a fundamentally different way of thinking about character representation.  In ASCII, and in all the code pages developed after it, the relationship between a character and how that character was stored in computer memory was exact (even if different people didn't agree on what that relationship was).  In ASCII, an upper case 'A' was stored as 0100 0001, and if you could look at the individual bits physically stored in memory, that is what you would see, end of story.  Unicode relates letters to an abstract concept called a "code point": a Unicode 'A' is represented as U+0041.  A code point does not tell you anything about how a letter is stored in 1s and 0s; instead, U+0041 just means the concept or idea of "upper case A," likewise U+00E9 means "lower case accented e" (é), and U+05D2 means "the Hebrew letter gimel" (ג).  You can find the Unicode representation for any supported character on the Unicode website, or for quick reference at a variety of online charts, like this one.

But remember, Unicode representations are associated with the concept of a letter, not with how it is stored on a computer.  The relationship between a Unicode value and a storage value is determined by an encoding scheme, the most common being UTF-8.  A neat property of the UTF-8 encoding is that it is backwards compatible with the first 128 ASCII characters, so if those are the only characters you are using they'll show up just fine in older software that doesn't know anything about Unicode and assumes everything is ASCII.
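
Here's the distinction in a few lines of Python: ord() gives the abstract code point, while .encode("utf-8") gives the actual bytes UTF-8 stores for it.  Notice that 'A' is still the single ASCII byte 0x41, while é and ג each take two bytes.

for ch in "Aéג":
    # the code point (the concept), then the bytes UTF-8 actually stores
    print(f"U+{ord(ch):04X}", ch.encode("utf-8").hex())
# prints: U+0041 41
#         U+00E9 c3a9
#         U+05D2 d792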

I know I'm risking losing my point at this point, but one last thing.  Right-click on this web page and click "View Page Source".  Near the top of the page you should see something that looks like

<meta charset="UTF-8" />

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This is the line that tells your web browser what encoding scheme is used for the characters on this web page.  "But wait," you might say, "there's a self-reference problem: to write out 'charset=UTF-8' you need to first pick an encoding, so how can you tell the web browser what encoding you're using without assuming it already knows what encoding you're using?"  Well, luckily, all the characters needed to write out the first few header lines, including "charset=UTF-8", happen to be contained in the lower 128 characters of the original ASCII specification, which is the same as UTF-8 for that small range.  So a web browser can safely assume UTF-8 until it reads a line like <meta charset="UTF-16" />, at which point it will reload the page and switch to the specified encoding scheme.
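
You can convince yourself that this bootstrapping trick works, because the bytes for the charset declaration are identical whether you call them ASCII or UTF-8:

header = '<meta charset="UTF-8" />'
# every character in the declaration is in the ASCII range, so the two
# encodings produce byte-for-byte identical output
print(header.encode("ascii") == header.encode("utf-8"))   # True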

OK, so where the heck was I going with this?  Well, for one thing, the history of character representation is quite interesting in its own right: it highlights various aspects of the history of computing and sheds light on something we all take for granted now, that I can open up a web page on any computer and be reasonably sure that the symbols displayed are the ones the author intended.

But it also highlights the importance of forming good standards, because without them it is difficult to communicate across boundaries.  Standards don't need to specify the details of implementation (how a character is stored in computer memory), but at the very least, to be useful and flexible, they need to specify a common map between a specific concept (the letter 'A' in the Latin alphabet) and some agreed-upon label (U+0041).

Currently, we don't really have a standardized way of talking about the Ph.D.  What is a "qualifier exam"?  "Prelims"?  A "proposal"?  All of these could mean something different depending on your department and discipline.  While trying to standardize details such as "how many publications" or "how many years" or "what kind of work" across disciplines would be difficult at best, and nonsensical in many cases, we could start standardizing the language we use to talk about the parts of the Ph.D. process that are similar across fields.

And incidentally, this is why I still haven’t finished grading the stack of homeworks I told myself I’d finish last night.

And for what it's worth, the answer to the student's question was to use the Unicode representation of the ▊ symbol, which is standardized, rather than the extended-ASCII representation, which is not a standard way to represent that symbol.
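
The student's program was in a compiled language, but the idea translates to a couple of lines of Python (a sketch, not their actual code): refer to the character by its Unicode identity and let UTF-8 worry about the bytes.

block = "▊"                          # pasted directly; this source file is saved as UTF-8
print(f"U+{ord(block):04X}", block)  # its standardized code point, then the character itself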