Parsing HTML Table Fragments

Posted on May 14, 2020

go html

Honestly, when I need to write a quick and dirty script, my go-to language is Python. But the other day, as I realized I needed to write yet another small web scraper, I decided to forgo Python’s BeautifulSoup and instead take a look at Go’s golang.org/x/net/html package. Needless to say, it is quite bare in comparison… But I also realized that it makes absolutely no concessions when it comes to strictly following the HTML specification¹.

Go is often simple, by no means simplistic and not always easy. It has a tendency to keep abstractions at a minimum (an intentional choice of its creators²), which can be very confusing to people coming from a more complex language (like Python). But less can be more: because Go makes very little effort to hide things away, it provides developers with plenty of opportunities to learn about what other languages do not expose. This time was no different: I ended up learning a great deal about HTML in general and tables in particular, and figured I could share my findings in a blog post.

A Surprising Parse Result

A little confession: when writing simple little scripts, I usually skip writing any kind of test, and simply try out my code as I go. Not a good practice, I know. But this time was to be different: I had just finished reading The Pragmatic Programmer, which strongly advocates writing tests, and I decided I would write simple unit tests to check my code.

It so happens that my code contained logic working with table rows (the tr element). The code looked like:

func parseRow(row *html.Node) {
  if !(row.Type == html.ElementNode && row.Data == "tr") {
    panic("this is not a row")
  }

  // Do some stuff
}

And so I wrote my unit test:

func TestParseRow(t *testing.T) {
  raw := "<tr><td>1</td><td>2</td></tr>"
  doc, _ := html.Parse(strings.NewReader(raw))

  parseRow(doc)
}

Easy enough, right? I thought so too. And so I ran go test.

--- FAIL: TestParseRow (0.00s)
panic: this is not a row [recovered]
	panic: this is not a row

Wait, what? How is this possible? After a little bit of debugging, I found out that the result of html.Parse(strings.NewReader(raw)) was in fact a node that contained the following elements:

<>
  <html>
    <head>
    </head>
    <body>
      12
    </body>
  </html>
</>

I was perplexed. Where are my tr and tds? I was expecting something like what Python returns:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<tr><td>1</td><td>2</td></tr>", 'html.parser')
print(soup.prettify())

# this will print:
# <tr>
#  <td>
#   1
#  </td>
#  <td>
#   2
#  </td>
# </tr>

In short, to me, it looked like a bug in the Go HTML parser! But I held my horses, and resisted my initial impulse to create an issue in the Go repository³. Instead, I played around a bit more, and replaced the raw string in my test with "<table><tr><td>1</td><td>2</td></tr></table>". This time, the Parse result was very different:

<>
  <html>
    <head>
    </head>
    <body>
      <table>
        <tbody>
          <tr>
            <td>
              1
            </td>
            <td>
              2
            </td>
          </tr>
        </tbody>
      </table>
    </body>
  </html>
</>

Aside from not really solving my problem, this raised more questions than it answered. Why did the parser ignore all tags in the first case but not when adding a table? Why did it add tags like html, body or tbody? This is the moment I remembered that the golang.org/x/net/html documentation actually points to the HTML specification, and I decided to take a look.

Doing so rewarded me with answers to my questions (spoiler: the Go HTML parser works perfectly fine), and allowed me to find a clean solution to my problem. Before showing the exact Go code that did the trick for me, I will first explain why these surprising results in fact show that the Go HTML parser is strictly following the HTML specification. If you have no interest in that or find yourself short on time, feel free to jump directly to the solution.

The Reason Why

I started browsing the HTML specification with two very simple questions in mind:

Why is the parser generating extra tokens like body or tbody?
What explains the weird output when parsing tr and td outside of table?

Optional Tags: Omitted, But Not Forgotten

I only have basic notions of HTML, but I vaguely remembered that some tags, like html, can be omitted when writing a document. Yet, Go’s HTML parser seemed determined to generate those, even when they were not explicitly present in the input string. Why bother with this extra work? I thought. Reading the first few lines of the HTML syntax page was very enlightening:

Documents must consist of the following parts, in the given order:

Optionally, a single U+FEFF BYTE ORDER MARK (BOM) character.

Any number of comments and ASCII whitespace.

A DOCTYPE.

Any number of comments and ASCII whitespace.

The document element, in the form of an html element.

Any number of comments and ASCII whitespace.

This means that a valid HTML document must contain a DOCTYPE, and that all subsequent tags have to be enclosed in a mandatory <html></html> node. Besides, all normal elements (which is to say pretty much all HTML nodes) have a content model, describing the requirements that all children of said element must match.

In the case of the html element, the content model is very straightforward:

A head element followed by a body element.

In other words, the minimum viable HTML document holding content other than metadata is:

<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <!-- content goes here -->
  </body>
</html>

What about omitting tags then? This is of course possible, as the specification states in its optional tags section. But the key takeaway from that section is that omitted tags are not absent: they are implied, but still here! And as it turns out, html, head and body can all have their start and end tag omitted (assuming a few conditions are satisfied).

It is the parser’s role to ensure that omitted tags are still present. The precise steps are described in the specification, but in short, generating tokens like html, body or head is part of the job of Go’s HTML parser: it is simply ensuring that inputs are turned into viable HTML documents. In my case, the specification was also dictating the generation of the tbody token, as the parsing algorithm is supposed to automatically generate a tbody when encountering a tr in a table (outside of any other tbody, thead or tfoot).

Eagle-eyed readers might have noticed that I am carefully leaving out the DOCTYPE. This is because, to be honest, I am still unsure as to why the Go HTML parser neither adds it nor complains about its absence. Shouldn’t it, according to the specification? My best guess is that, since the DOCTYPE is there to indicate to browsers that the content following it is to be rendered as an HTML document, the Go HTML parser does not require it (it assumes the input is HTML anyway).

HTML and Parsing Errors

Optional tags are well and good, but what about the weird output I observed at first? How could the parser possibly transform this:

"<tr><td>1</td><td>2</td></tr>"

Into this?

<>
  <html>
    <head>
    </head>
    <body>
      12
    </body>
  </html>
</>

While the result arguably appears counterintuitive to an HTML novice like myself, the parser is in fact working exactly as intended. Unlike the table example, where the input was semantically correct (although containing omitted tokens), the parser is this time supplied with an incorrect HTML string. Like before, the parser will understand that tags are omitted, generate html, head and body and parse the input assuming it is enclosed in <body></body> tags. But, according to the specification, the body element’s content model only accepts flow content, which does not include tr (nor td)!

To understand how the parser handles such parsing errors, I tried to follow the rules for parsing tokens.

1. All tags up to body are generated according to the rules of optional tags (the bottommost node of the tag stack, i.e. the current node, is thus body).
2. The parser encounters tr, which is illegal in a body context. It ignores it.
3. The parser encounters td, which is illegal in a body context. It ignores it.
4. The parser encounters the raw character 1, creates a Text node whose data is “1”, and inserts it as a child of body.
5. The parser encounters /td, which is illegal as the current node is body and not td. It ignores it.
6. Same as 3.
7. The parser encounters the raw character 2, but because there is a Text node just before it, it appends “2” to that node’s data, turning the Text node generated in step 4 into a Text node whose data is “12”.
8. Same as 5.
9. The parser encounters /tr, which is illegal as the current node is body and not tr. It ignores it.

The result of all these steps is a body node containing a Text node with “12”, which is exactly the conclusion reached by the Go HTML parser. Working as intended :-)

Because an (animated) picture is worth a thousand words, I’ve created a GIF that shows what the parser does given a similar example:

[Animated GIF: Parsing HTML Table Fragments (/img/html-parser.gif)]

Solution

Back to my initial problem. How can I generate the following node architecture?

<tr>
  <td>
    1
  </td>
  <td>
    2
  </td>
</tr>

The answer stems directly from the HTML spec I’ve detailed in the previous section: such a sequence can only exist in the context of a tbody. Given this, the only thing that has to be done is to supply the Go HTML parser with the context in which it is to parse my fragment:

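func TestParseRow(t *testing.T) {
  raw := "<tr><td>1</td><td>2</td></tr>"
  // The context node tells the parser to treat raw as the content of a tbody.
  // atom.Tbody comes from the golang.org/x/net/html/atom package.
  context := &html.Node{
    Type:     html.ElementNode,
    Data:     "tbody",
    DataAtom: atom.Tbody,
  }
  nodes, err := html.ParseFragment(strings.NewReader(raw), context)
  if err != nil {
    t.Fatal(err)
  }

  // Unlike Parse, ParseFragment can return multiple nodes,
  // but we know that we can only have one here.
  parseRow(nodes[0])
}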

And that’s it! I think this little adventure with HTML tables embodies how Go and its libraries, while sometimes showing a frustratingly low level of abstraction, also provide opportunities to learn more. And in that sense, I feel like Go contributes to raising more knowledgeable software engineers⁴.

As always, shoot me a message or tweet @nicol4s_c if you want to chat about any of this, if you spotted any mistakes or typos, or if you’d like me to cover anything else! Have a great day :)

¹ I am not being entirely fair to Python here: BeautifulSoup is a library built on top of html.parser, which is itself just as bare as Go’s parser. But my point about respecting the spec still stands.
² In Rob Pike’s own words, one of Go’s objectives was to avoid having “each programmer using a different subset of the language”, which naturally pushes towards simpler APIs (see this talk).
³ The Pragmatic Programmer says “if you see hoof prints, think horses, not zebras”: when encountering a bug, it is most likely not the library’s fault.
⁴ Again, it is arguably one of its creators’ goals, as Rob Pike says in this talk that Go was originally targeted at “typically, fairly young, fresh out of school” developers.