Parsing things in Python

Julius Seporaitis

2017-05-19 Fri

Created: 2017-05-17 Wed 20:30

Who recognizes this?

1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 d5 5. exd5 Nxd5 6. Nxf7 Kxf7

Portable Game Notation (PGN): a plain-text, computer-processible format for recording chess games.

We're going to write a parser for it…

With `pyparsing`

pip install pyparsing
from pyparsing import *  # noqa :-)

What's in a move?

1. e4 e5
  • Move number (`1.`)
  • Two half-moves (`e4 e5`)
  • Pawn or piece half-move (`d5`, `Ng5`)
  • Pawn or piece capture (`exd5`, `Nxf7`)
  • There's more, but we'll skip the rest.

Move number

move_number = Word(nums) + Literal(".")
  • `Word` - matches words made of allowed character sets.
  • `Literal` - exactly matches a specified string.
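A minimal sketch of what this expression returns, assuming `pyparsing` is installed. The sample input `"12."` and the name `tokens` are made up for illustration.

```python
from pyparsing import Literal, Word, nums

move_number = Word(nums) + Literal(".")

# Word(nums) consumes the run of digits, Literal(".") consumes the dot.
# Each matched piece comes back as its own token.
tokens = move_number.parseString("12.").asList()
print(tokens)  # ['12', '.']
```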

Coordinates

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
  • `oneOf` - helper to quickly define a set of alternative `Literal`s.
  • It returns the first alternative that matches, and tests longer alternatives first when they overlap.
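A quick check of the two coordinate expressions, as a sketch. The inputs `"e4"` and `"z"` and the names `square` and `on_board` are illustrative only.

```python
from pyparsing import ParseException, oneOf

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")

# A file followed by a rank gives two separate tokens.
square = (file_coord + rank_coord).parseString("e4").asList()
print(square)  # ['e', '4']

# Anything outside the listed alternatives raises ParseException.
try:
    file_coord.parseString("z")
    on_board = True
except ParseException:
    on_board = False
print(on_board)  # False
```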

Pawn half-moves

A pawn half-move is just a file and a rank.

pawn_move = file_coord + rank_coord

Pieces

Any idea how?

piece = oneOf("K Q N B R")

Piece half-moves

A piece half-move has the piece prefix and coordinates.

piece_move = piece + file_coord + rank_coord
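To see what the two move expressions produce so far, here is a small sketch. The names `pawn_tokens` and `piece_tokens` are made up for this example.

```python
from pyparsing import oneOf

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
piece = oneOf("K Q N B R")

pawn_move = file_coord + rank_coord
piece_move = piece + file_coord + rank_coord

# Each sub-expression contributes one token to the result.
pawn_tokens = pawn_move.parseString("d5").asList()
piece_tokens = piece_move.parseString("Ng5").asList()
print(pawn_tokens)   # ['d', '5']
print(piece_tokens)  # ['N', 'g', '5']
```

The tokens are still split into single characters at this point; joining them into one string comes later.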

Capturing

A capture is marked by a literal `x` before the destination coordinates.

capture = Literal("x")

Pawn captures

Start with the file (letter) the pawn comes from, then the capture marker, and end with the destination coordinates.

pawn_capture = file_coord + capture + file_coord + rank_coord

Piece captures

Same as a pawn capture, but with a piece prefix instead of a file.

piece_capture = piece + capture + file_coord + rank_coord
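Both capture expressions can be tried on moves from the opening game. This is only a sketch, and the names `pawn_tokens` and `piece_tokens` are illustrative.

```python
from pyparsing import Literal, oneOf

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
piece = oneOf("K Q N B R")
capture = Literal("x")

pawn_capture = file_coord + capture + file_coord + rank_coord
piece_capture = piece + capture + file_coord + rank_coord

# The "x" is kept as a token, so a capture stays distinguishable from a plain move.
pawn_tokens = pawn_capture.parseString("exd5").asList()
piece_tokens = piece_capture.parseString("Nxf7").asList()
print(pawn_tokens)   # ['e', 'x', 'd', '5']
print(piece_tokens)  # ['N', 'x', 'f', '7']
```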

Half-move

A half-move is one of:

  • pawn move, or
  • piece move, or
  • pawn capture, or
  • piece capture.
half_move = Combine(pawn_move | pawn_capture | piece_move | piece_capture)
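Two things happen in that line: `|` tries the alternatives from left to right, and `Combine` joins the matched tokens into a single string while requiring them to be adjacent. A sketch that exercises both, with made-up names `results` and `spaced_ok`:

```python
from pyparsing import Combine, Literal, ParseException, oneOf

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
piece = oneOf("K Q N B R")
capture = Literal("x")

pawn_move = file_coord + rank_coord
piece_move = piece + file_coord + rank_coord
pawn_capture = file_coord + capture + file_coord + rank_coord
piece_capture = piece + capture + file_coord + rank_coord

half_move = Combine(pawn_move | pawn_capture | piece_move | piece_capture)

# One joined token per half-move, whichever alternative matched.
results = [half_move.parseString(s)[0] for s in ("e4", "exd5", "Nf3", "Kxf7")]
print(results)  # ['e4', 'exd5', 'Nf3', 'Kxf7']

# Combine does not allow whitespace between the pieces of a half-move.
try:
    half_move.parseString("e 4")
    spaced_ok = True
except ParseException:
    spaced_ok = False
print(spaced_ok)  # False
```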

Move

A move is a:

  • move number (don't forget!)
  • half-move
  • half-move
move = Group(Suppress(move_number) + half_move + half_move)
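`Suppress` matches the move number but drops it from the results, and `Group` wraps the two half-moves in their own nested list. A sketch on a single move, where the name `parsed` is illustrative:

```python
from pyparsing import Combine, Group, Literal, Suppress, Word, nums, oneOf

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
piece = oneOf("K Q N B R")
capture = Literal("x")

half_move = Combine(
    file_coord + rank_coord
    | file_coord + capture + file_coord + rank_coord
    | piece + file_coord + rank_coord
    | piece + capture + file_coord + rank_coord
)
move_number = Word(nums) + Literal(".")
move = Group(Suppress(move_number) + half_move + half_move)

# "1." is consumed but not reported; the half-moves form one nested list.
parsed = move.parseString("1. e4 e5").asList()
print(parsed)  # [['e4', 'e5']]
```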

Putting it all together

We started with a PGN, so we finish with:

pgnGrammar = ZeroOrMore(move)

pgn = pgnGrammar.parseString("""
1. e4 e5
2. Nf3 Nc6
3. Bc4 Nf6
4. Ng5 d5
5. exd5 Nxd5
6. Nxf7 Kxf7""")

[list(m) for m in pgn]
[['e4', 'e5'],
 ['Nf3', 'Nc6'],
 ['Bc4', 'Nf6'],
 ['Ng5', 'd5'],
 ['exd5', 'Nxd5'],
 ['Nxf7', 'Kxf7']]
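One caveat worth knowing: by default `parseString` stops quietly at the first thing the grammar cannot match. Passing `parseAll=True` turns that into an error. The sketch below uses a made-up input, `truncated`, where the game ends after White's move, which this simplified grammar does not cover.

```python
from pyparsing import (Combine, Group, Literal, ParseException, Suppress,
                       Word, ZeroOrMore, nums, oneOf)

file_coord = oneOf("a b c d e f g h")
rank_coord = oneOf("1 2 3 4 5 6 7 8")
piece = oneOf("K Q N B R")
capture = Literal("x")

half_move = Combine(
    file_coord + rank_coord
    | file_coord + capture + file_coord + rank_coord
    | piece + file_coord + rank_coord
    | piece + capture + file_coord + rank_coord
)
move = Group(Suppress(Word(nums) + Literal(".")) + half_move + half_move)
pgn_grammar = ZeroOrMore(move)

truncated = "1. e4 e5 2. Nf3"  # second move has only one half-move

# Default: the incomplete second move is silently left unparsed.
lenient = pgn_grammar.parseString(truncated).asList()
print(lenient)  # [['e4', 'e5']]

# parseAll=True: leftover input raises ParseException instead.
try:
    pgn_grammar.parseString(truncated, parseAll=True)
    strict_ok = True
except ParseException:
    strict_ok = False
print(strict_ok)  # False
```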

What is it useful for?

  • Writing parsers for semi-structured data
  • Writing quick and dirty DSLs (think 'Cucumber')

Slightly more complicated example

Thank you