First steps with Scala: XML pull parsing

I’m now going to share some of the results of my recent experiments with the Scala programming language. In May I wrote that I had started looking at it. I’ve been using it to make some support tools that I needed for research work since.

First a disclaimer: It’s been 4+ years since I did serious work with a functional programming language (Haskell, in first year of university), so my style is imperative-sprinkled-with-functional rather than the opposite. Also, since I haven’t spent that much time with this language yet, I’m bound to be making obvious mistakes. That said, I’m happy to be able to recommend Scala to pretty much anyone at this point. The learning curve is not steep if you know Java, and it allows for a variety of approaches depending on who you are.

For this particular tool, I needed to parse XML files, edit the contents of certain tags, and spit the data back out again. I’d like to show what I ended up with and point out some of Scala’s powerful features. Let’s first look at some interesting parts, and then the entirety.

1
2
3
object XMLTool {
  val interLink = """\[\[(.*)\]\]""".r
}

Scala lets you define objects as well as classes. Objects are singletons and can be referred to by name. Otherwise they are like classes; they participate in the type hierarchy.
Scala has three kinds of declarations: val, var and def. Values are evaluates once and cannot be reassigned. Vars are variables which can be reassigned. Defs are definitions and can as such be functions or values. My understanding is that they are lazily evaluated. The type of this val declaration is inferred by the highly powerful type system automatically using Hindley Milner type inference. One of my biggest surprises with Scala is how little type information the programmer has to provide, yet how powerful the static checking is. Incidentally, the .r at the end is a shortcut for turning the string into a regular expression object.

1
2
3
4
5
6
def main(args : Array[String]) : Unit = {
 
    val p = new XMLEventReader().initialize(Source.fromFile(args(0)))
    p.foreach(matchEvent)
 
  }

This function is the analogue of Java’s public static void main() and actually compiles to the same bytecode. Unlike in Java, types come after the variable name, separated from it by a colon. We can tell that we’re dealing with a functional language when we see foreach being applied to a function which I’ll declare next:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def matchEvent(ev: XMLEvent) = {
    ev match {
      case EvElemStart(_, "text", _, _) => { 
        readingText = true
        print(backToXml(ev))
      }
      case EvElemStart(_, _, _, _) => { print(backToXml(ev)) }
      case EvText(text) => {
        if (readingText) print(filterText(text)) else print(text) 
      } 
      case EvElemEnd(_, "text") => {
        readingText = false
        print(backToXml(ev))
      }
      case EvElemEnd(_, _) => { print(backToXml(ev)) }
      case _ => {}
    }
  }

Here we see pattern matching in action. We can match on lots of things, including types, partially instantiated types, strings and regular expressions. This style of programming is encouraged in FP languages, unlike in imperative ones. By matching on something like EvElemStart(_, "text", _, _) I’m looking for XML tags whose name is “text”, and I don’t care about their namespace or attributes. _ is a wildcard character.

Incidentally, it’s perfectly fine for me to leave out the return type of this function. Scala will infer that the return type is Unit (which vaguely corresponds to void in Java).

Here’s the whole thing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
import scala.xml._
import scala.xml.pull._
import scala.io.Source
 
object XMLTool {
  val interLink = """\[\[(.*)\]\]""".r
  var readingText = false
 
  def main(args : Array[String]) : Unit = {
 
    val p = new XMLEventReader().initialize(Source.fromFile(args(0)))
    p.foreach(matchEvent)
 
  }
 
  def matchEvent(ev: XMLEvent) = {
    ev match {
      case EvElemStart(_, "text", _, _) => { 
        readingText = true
        print(backToXml(ev))
      }
      case EvElemStart(_, _, _, _) => { print(backToXml(ev)) }
      case EvText(text) => {
        if (readingText) print(filterText(text)) else print(text) 
      } 
      case EvElemEnd(_, "text") => {
        readingText = false
        print(backToXml(ev))
      }
      case EvElemEnd(_, _) => { print(backToXml(ev)) }
      case _ => {}
    }
  }
 
  def backToXml(ev: XMLEvent) = {
    ev match {
      case EvElemStart(pre, label, attrs, scope) => {
        "<" + label + attrsToString(attrs) + ">"
      }
      case EvElemEnd(pre, label) => {
        "</" + label + ">"
      }
      case _ => ""
    }
  }
 
  def attrsToString(attrs:MetaData) = {
    attrs.length match {
      case 0 => ""
      case _ => attrs.map( (m:MetaData) => " " + m.key + "='" + m.value +"'" ).reduceLeft(_+_)
    }
  }
 
  def filterText(text: String) = {
    val matches = interLink.findAllIn(text)
    if (matches.hasNext) matches.reduceLeft(_+_) else ""
  }
}

So the purpose of this program is to read the XML, remove everything inside <text> tags that doesn’t match the interLink regular expression, and output the XML again. Towards the end, note how pleasant map and reduceLeft are for string processing – in Java I can’t really think of a succinct way of expressing the same notion.

Another couple of disclaimers: someone brought to my attention that there’s a very compact way of doing XPath queries in Scala, which probably makes my pattern matching on EvElemStart unnecessarily verbose. (Here’s a blog post on the xpath technique) Also, there was no particular reason for me to use pull parsing – push parsing might have been more natural, but I started down that path and this is what I ended up with. It works.

You can tell that I still have an imperative style from the way I use the readingText state variable to keep track of what the program is doing. A much more functional style program is probably hiding behind this one. Fortunately Scala is very forgiving towards people who mix styles like this.

My experience has been that it’s quite easy to get started and do useful things with Scala, once you get past the initial ideas (such as the difference between objects and classes, traits, val/def/var, declaration syntax). I would recommend it to anyone doing things with the JVM.

The future of the web browser

Internet ExplorerThe web browser, it is safe to say, has gone from humble origins to being the single most widely used piece of desktop software (based on my own usage, but I don’t think I’m untypical). This development continues today. The battles being fought and the tactical decisions being made here reach a very large audience and have a big impact.

When exactly did the web browser make the transition from being a hypertext viewer to an application platform? This transition seems in retrospect to have been a very fluid affair. Forms with buttons, combo boxes and lists were supported very early. Javascript came in not too long after. When the XmlHttpRequest was introduced it wasn’t long until AJAX took off, paving the way for today’s “rich” web browser applications.

A couple of years ago I had a personal project ongoing for some time. I had decided that web browsers weren’t designed for the kind of tasks they were being made to do (displaying applications), and I wanted to make a new kind of application platform for delivering applications over the web. Today I’m convinced that this would never have succeeded. Even if I had gotten the technology right (which I don’t think I was ever close to), I would have had no way of achieving mass adoption. Incremental developments of the web browser have, however, placed a new kind of application platform in the hands of the masses. Today the cutting edge seems to be browsers like Google’s Chrome, aggressively optimised for application delivery. But some new vegetables have been added to the browser soup.

chrome_logo

Google’s GWT web toolkit has been available for some time. This framework makes it easier to develop AJAX applications. Some hardcore AJAX developers may consider it immature, but these frameworks are going to be increasingly popular since they bridge the differences between browsers very smoothly, I think. What’s interesting is that the same company is developing GWT and Chrome though. The two sides of the browser-application equation have a common creator. This helps both: GWT can become more popular if Chrome is a popular browser, and Chrome can become more popular if GWT is a popular framework. Google can make and has made GWT apps run very fast with the Chrome browser (I tested this personally with some things I’ve been hacking on). The sky is the limit here; they can easily add special native features in the browser that GWT alone can hook into.

Microsoft have something a little bit similar with their Silverlight, which while not playing quite the same role, has a co-beneficial relationship with Internet Explorer.

firefox_logo

Everyone’s favorite browser, Firefox, recently passed 1 billion downloads. Firefox doesn’t really have a web development kit of their own as I understand it. It just tries to implement the standards well. Which is fair and good, but it demotes FF from the league of agenda setters to people who play catch up, in some sense. Though, it must be said, the rich variety of plugins available for FF might go a long way to remedy this.

All this, and I haven’t even touched on Google’s recent foray into the OS market with “Chrome OS”…

An unusual Java construct

I now break the longstanding tradition of not posting code on this blog. I just wanted to share what I believe to be a somewhat unusual pattern in Java:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
public IMethod findMethod(String name, String[] types) {
		outer:
		for (IMethod m: m_methods)
		{
			if (m.getName().equals(name))
			{
				int i = 0;
				for (IType t: m.getArgList().getTypes())
				{
					if (! t.getName().equals(types[i]))
					{
						continue outer;
					}
					i++;
				}
				return m;
			}
		}
		return null;
	}

On line 2, there’s a label outer: which identifies a location in the code. Normally this feature is used with keywords like the hotly debated goto in C. Java has goto as a keyword, but doesn’t support the feature. However, you can still use the labels with statements like continue above (line 12), which in this case starts a new iteration of the outer loop rather than the inner one.

I can’t remember ever having had to use this feature of any C-like language before (perhaps once) so it was intriguing when it popped up. It’s possible that a neater implementation of this would put the inner loop in a matchesSignature method in the IMethod interface instead.

Quantity as a success metric

I have something of an engineering background, so I easily end up thinking of success in terms of quantity. Maximizing this variable or that. Ensuring the greatest possible reward, or the smallest possible cost. But sometimes this is fallacious thinking.

As an academic, I would like to publish prestigious articles. It would be nice to publish 10 papers at second or third rate conferences, but they might all be made irrelevant by a single article at a first rate conference (or even an article in Nature or Science, say). So quality is a better measure than quantity.

I would also like to come up with new and influential ideas, but I suspect I would probably be happier if I managed to influence 10 very highly regarded people than if I managed to influence 10 000 laymen. (These exact numbers were computed using the “wild guess” algorithm and further evaluation may be needed.)

In professional life, I’ve found it dangerously easy to fall into a mode of thinking where you evaluate yourself by your income. This is true up to a point, but I’ve found that there’s a point beyond which additional income has diminishing returns in terms of how much it adds to my overall rewards from life. So beyond this point, quality is a better measure than quantity. What are my tasks, how do they force me to learn and evolve, what kind of satisfaction do I feel and why? So quality is a better measure than quantity.

User satisfaction with computer software can, to some extent, be measured using response time and latency. A snappy, responsive user interface usually produces more satisfaction than a sluggish one. But this can often be compensated for to a surprising extent by having appropriate progress indicators, animations and design features that placate the user in some way, assuring them something is being done. This is in a sense the opposite of the money situation: up to a certain point, quality makes up for quantity, after that point (when the slowness becomes impossible to mask), quantity becomes increasingly important.

What’s most interesting is perhaps the convertibility between quality and quantity. In engineering a device or a software system, quantitative metrics can be crucial tools in the construction process, but the final user experience must be qualitatively right. So quantity is a tool to construct quality. And in the real life situations where quantity is actually the best measure — bargaining, comparing, communicating, constructing, … — I think of it as a way to mask qualities. The numbers are simply easier to consider than the vast number of qualities that lie underneath.

Best bibliography management systems?

A question for readers who happen to manage bibliographies: what, if any, bibliography management systems do you use?

I started using Aigaion for mine. Then I found out that there’s an open system called bibsonomy, which is potentially much better since it lets you tag and share bibliographies socially, and it seems to already know about all the major computer science papers.

Again (see: The problem with standards), I’m frustrated by the fact that I can’t move my data around between applications as I like without lots of manual effort. A worthy research problem would be making data truly application independent once and for all.