First steps with Scala: XML pull parsing

I’m now going to share some of the results of my recent experiments with the Scala programming language. In May I wrote that I had started looking at it. Since then I’ve been using it to build some support tools that I needed for research work.

First a disclaimer: It’s been 4+ years since I did serious work with a functional programming language (Haskell, in first year of university), so my style is imperative-sprinkled-with-functional rather than the opposite. Also, since I haven’t spent that much time with this language yet, I’m bound to be making obvious mistakes. That said, I’m happy to be able to recommend Scala to pretty much anyone at this point. The learning curve is not steep if you know Java, and it allows for a variety of approaches depending on who you are.

For this particular tool, I needed to parse XML files, edit the contents of certain tags, and spit the data back out again. I’d like to show what I ended up with and point out some of Scala’s powerful features. Let’s first look at some interesting parts, and then the entirety.

object XMLTool {
  val interLink = """\[\[(.*)\]\]""".r
}

Scala lets you define objects as well as classes. Objects are singletons and can be referred to by name. Otherwise they are like classes; they participate in the type hierarchy.
Scala has three main kinds of declarations: val, var and def. A val is evaluated once and cannot be reassigned. A var is a variable that can be reassigned. A def is a definition, with or without parameters, whose body is re-evaluated every time it is referenced. That is not the same as lazy evaluation: a value that is computed once, on first use, is declared with lazy val. The type of this val declaration (scala.util.matching.Regex) is inferred automatically. Scala’s inference is local rather than full Hindley-Milner inference, so method parameters still need type annotations, but one of my biggest surprises with Scala is how little type information the programmer has to provide, yet how powerful the static checking is. Incidentally, the .r at the end is a shortcut for turning the string into a regular expression object.
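To make the difference between the declaration kinds concrete, here is a minimal sketch using only the standard library. The object DeclarationsDemo and its members (evaluations, eager, deferred, recomputed) are made up for illustration; only interLink is taken from the tool above.

```scala
object DeclarationsDemo {
  // Counts how many times the right-hand sides below have been run.
  var evaluations = 0

  // val: the right-hand side runs once, when the object is initialized.
  val eager: Int = { evaluations += 1; 1 }

  // lazy val: the right-hand side runs once, on first access.
  lazy val deferred: Int = { evaluations += 1; 2 }

  // def: the body runs again on every reference.
  def recomputed: Int = { evaluations += 1; 3 }

  // .r turns the string into a scala.util.matching.Regex.
  val interLink = """\[\[(.*)\]\]""".r

  def main(args: Array[String]): Unit = {
    println(evaluations)                  // 1: only eager has run so far
    val a = deferred + deferred
    println(a + " " + evaluations)        // 4 2: deferred ran once
    val b = recomputed + recomputed
    println(b + " " + evaluations)        // 6 4: recomputed ran twice

    // A Regex can search inside a string or act as a pattern in a match.
    println(interLink.findFirstIn("see [[Scala]] here"))   // Some([[Scala]])
    "[[Scala]]" match {
      case interLink(target) => println(target)            // Scala
      case _                 => println("no link")
    }
  }
}
```

When a Regex is used as a pattern, as in the last few lines, it has to match the whole string, and each capture group is bound to a name.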

def main(args : Array[String]) : Unit = {
 
    val p = new XMLEventReader().initialize(Source.fromFile(args(0)))
    p.foreach(matchEvent)
 
  }

This function is the analogue of Java’s public static void main(): for a top-level object like this one, the compiler generates a static main method that the JVM can use as its entry point. Unlike in Java, types come after the variable name, separated from it by a colon. We can tell that we’re dealing with a functional language when we see a function being passed as an argument to foreach. I’ll declare that function next:

def matchEvent(ev: XMLEvent) = {
    ev match {
      case EvElemStart(_, "text", _, _) => { 
        readingText = true
        print(backToXml(ev))
      }
      case EvElemStart(_, _, _, _) => { print(backToXml(ev)) }
      case EvText(text) => {
        if (readingText) print(filterText(text)) else print(text) 
      } 
      case EvElemEnd(_, "text") => {
        readingText = false
        print(backToXml(ev))
      }
      case EvElemEnd(_, _) => { print(backToXml(ev)) }
      case _ => {}
    }
  }

Here we see pattern matching in action. We can match on lots of things, including types, partially instantiated case classes, strings and regular expressions. This style of programming is encouraged in FP languages, unlike in imperative ones. By matching on something like EvElemStart(_, "text", _, _) I’m looking for XML tags whose name is “text”, and I don’t care about their prefix, attributes or namespace scope. _ is the wildcard pattern. Cases are tried from top to bottom, which is why the specific "text" cases come before the general ones.

Incidentally, it’s perfectly fine for me to leave out the return type of this function. Scala will infer that the return type is Unit (which vaguely corresponds to void in Java).
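The same matching style can be tried out without any XML library. The sketch below uses hypothetical, simplified stand-ins (Start, Text, End) for the scala.xml.pull event classes; they are not the real API, just small case classes with the same flavour.

```scala
object MatchDemo {
  // Hypothetical stand-ins for the pull-parser events, for illustration only.
  sealed trait Event
  final case class Start(prefix: String, label: String) extends Event
  final case class Text(text: String) extends Event
  final case class End(prefix: String, label: String) extends Event

  // The return type could be left out here; String would be inferred.
  def describe(ev: Event): String = ev match {
    // A literal in the label position, a wildcard for the prefix.
    case Start(_, "text")          => "start of the text element"
    // A variable pattern binds whatever label the event carries.
    case Start(_, label)           => "start of <" + label + ">"
    // A guard adds a condition to a case.
    case Text(t) if t.trim.isEmpty => "whitespace"
    case Text(t)                   => "text of length " + t.length
    case End(_, label)             => "end of <" + label + ">"
  }

  def main(args: Array[String]): Unit = {
    val events = List(Start("", "page"), Start("", "text"),
                      Text("hello"), End("", "text"), End("", "page"))
    // Functions are passed as arguments, as with p.foreach(matchEvent) above.
    events.map(describe).foreach(println)
  }
}
```

Because Event is sealed, the compiler can warn when a match leaves out one of the cases, which is one of the static checks that comes for free with this style.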

Here’s the whole thing:

import scala.xml._
import scala.xml.pull._
import scala.io.Source
 
object XMLTool {
  val interLink = """\[\[(.*)\]\]""".r
  var readingText = false
 
  def main(args : Array[String]) : Unit = {
 
    val p = new XMLEventReader().initialize(Source.fromFile(args(0)))
    p.foreach(matchEvent)
 
  }
 
  def matchEvent(ev: XMLEvent) = {
    ev match {
      case EvElemStart(_, "text", _, _) => { 
        readingText = true
        print(backToXml(ev))
      }
      case EvElemStart(_, _, _, _) => { print(backToXml(ev)) }
      case EvText(text) => {
        if (readingText) print(filterText(text)) else print(text) 
      } 
      case EvElemEnd(_, "text") => {
        readingText = false
        print(backToXml(ev))
      }
      case EvElemEnd(_, _) => { print(backToXml(ev)) }
      case _ => {}
    }
  }
 
  def backToXml(ev: XMLEvent) = {
    ev match {
      case EvElemStart(pre, label, attrs, scope) => {
        "<" + label + attrsToString(attrs) + ">"
      }
      case EvElemEnd(pre, label) => {
        "</" + label + ">"
      }
      case _ => ""
    }
  }
 
  def attrsToString(attrs:MetaData) = {
    attrs.length match {
      case 0 => ""
      case _ => attrs.map( (m:MetaData) => " " + m.key + "='" + m.value +"'" ).reduceLeft(_+_)
    }
  }
 
  def filterText(text: String) = {
    val matches = interLink.findAllIn(text)
    if (matches.hasNext) matches.reduceLeft(_+_) else ""
  }
}

So the purpose of this program is to read the XML, remove everything inside <text> tags that doesn’t match the interLink regular expression, and output the XML again. Towards the end, note how pleasant map and reduceLeft are for string processing – in Java I can’t really think of a succinct way of expressing the same notion.
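Two details of filterText are worth a closer look; the sketch below covers both. The names JoinDemo, perLink, joinReduce and joinMk are made up for illustration. First, mkString joins the matches in one call and also copes with an empty iterator, so the hasNext check becomes unnecessary. Second, .* is greedy, so two links on the same line come back as a single match; a non-greedy .*? is one possible variant if each link should be matched separately.

```scala
object JoinDemo {
  // The pattern from the tool above: .* is greedy.
  val greedy = """\[\[(.*)\]\]""".r
  // Hypothetical non-greedy variant that stops at the first closing ]].
  val perLink = """\[\[(.*?)\]\]""".r

  // The approach from filterText: reduceLeft needs a non-empty input.
  def joinReduce(parts: Iterator[String]): String =
    if (parts.hasNext) parts.reduceLeft(_ + _) else ""

  // mkString concatenates the elements and returns "" for an empty input.
  def joinMk(parts: Iterator[String]): String = parts.mkString

  def main(args: Array[String]): Unit = {
    val line = "[[a]] x [[b]]"
    println(greedy.findAllIn(line).toList)    // List([[a]] x [[b]])
    println(perLink.findAllIn(line).toList)   // List([[a]], [[b]])
    println(joinReduce(perLink.findAllIn(line)))   // [[a]][[b]]
    println(joinMk(perLink.findAllIn(line)))       // [[a]][[b]]
  }
}
```

Whether the greedy or the non-greedy behaviour is the right one depends on the data, so this is an observation about the regular expression rather than a change to the tool.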

Another couple of disclaimers: someone brought to my attention that there’s a very compact way of doing XPath queries in Scala, which probably makes my pattern matching on EvElemStart unnecessarily verbose. (Here’s a blog post on the xpath technique) Also, there was no particular reason for me to use pull parsing – push parsing might have been more natural, but I started down that path and this is what I ended up with. It works.

You can tell that I still have an imperative style from the way I use the readingText state variable to keep track of what the program is doing. A much more functional style program is probably hiding behind this one. Fortunately Scala is very forgiving towards people who mix styles like this.
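As a rough idea of what that more functional program might look like, here is a sketch that threads the state through a foldLeft instead of mutating a var. It is not a drop-in replacement for the tool: Event, Start, Text, End, State, step and render are hypothetical names, and the events are simplified stand-ins without prefixes or attributes.

```scala
object FoldDemo {
  // Hypothetical, simplified events (no prefixes or attributes).
  sealed trait Event
  final case class Start(label: String) extends Event
  final case class Text(text: String) extends Event
  final case class End(label: String) extends Event

  val interLink = """\[\[(.*)\]\]""".r

  // Keep only the parts of the text that match interLink.
  def filterText(text: String): String = interLink.findAllIn(text).mkString

  // Everything the loop needs to remember: whether we are inside <text>,
  // and the output produced so far.
  final case class State(inText: Boolean, out: Vector[String])

  // One step: take the old state and an event, return the new state.
  def step(s: State, ev: Event): State = ev match {
    case Start("text") => State(inText = true, s.out :+ "<text>")
    case Start(label)  => s.copy(out = s.out :+ s"<$label>")
    case Text(t)       => s.copy(out = s.out :+ (if (s.inText) filterText(t) else t))
    case End("text")   => State(inText = false, s.out :+ "</text>")
    case End(label)    => s.copy(out = s.out :+ s"</$label>")
  }

  // foldLeft runs step over all events, starting outside any <text>.
  def render(events: Seq[Event]): String =
    events.foldLeft(State(inText = false, Vector.empty))(step).out.mkString

  def main(args: Array[String]): Unit = {
    val events = Seq(Start("page"), Start("text"),
                     Text("see [[Scala]] now"), End("text"), End("page"))
    println(render(events))   // <page><text>[[Scala]]</text></page>
  }
}
```

The trade-off is that this version builds the whole output in memory before printing it, whereas the imperative version prints as it goes, which matters for very large files.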

My experience has been that it’s quite easy to get started and do useful things with Scala, once you get past the initial ideas (such as the difference between objects and classes, traits, val/def/var, declaration syntax). I would recommend it to anyone doing things with the JVM.

Comments 1

  1. johan wrote:

    A factual correction since lots of people keep reading this post: “def” values are not lazily evaluated. They are just definitions of expressions or functions (same thing really). In order to get lazy evaluation, one should use the “lazy val” syntax.

    Posted 09 Dec 2009 at 3:25 pm
