
Exploring a Large Text Dataset with the Command Line

I have a pet project to create an etymological database. Between Python and graph databases like Neo4j, I can finally realize my Latin major dreams of mapping all the words back to their roots.

I found a likely data source in a snapshot of Wiktionary's dictionary. But what's inside the file? How will I model it?

Command line tools to the rescue. The command line offers powerful tools for searching, counting, moving, transforming, and chopping up files. They let you explore data without programming.

Here's a walkthrough of how I'm using command line tools to explore Wiktionary.

Sizing it up

How big is the Wiktionary file? Listing the files and sizes with ls -l shows 3.8GB uncompressed, 611MB compressed. Oh, my. Configure Time Machine to exclude this from backups.
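
The check itself is just a long listing with human-readable sizes; the compressed filename below is my guess at what the download was called:

ls -lh dictionary.xml dictionary.xml.bz2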

How many lines of text? Print and count:
cat dictionary.xml | wc -l

139,474,905 lines. Manual processing is not an option.

Find the structure

Let's start reading the file with more dictionary.xml

The first line shows:

<mediawiki  version="0.10" xml:lang="en">

It's in Mediawiki format. The Mediawiki documentation provides an overview of the structure, including namespaces.

    <namespaces>
      <namespace key="-2" case="case-sensitive">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="case-sensitive" />

This document lists 40 namespaces in all, and most of them don't look relevant to my project.
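
The namespace list sits near the top of the dump, so a quick peek doesn't need to read the whole file. Something like this works (the 200-line cutoff is just a safe guess):

head -200 dictionary.xml | grep '<namespace '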

Browsing farther, the dictionary <page>s begin after the namespace list. Pages with definitions all have namespace 0.

How many pages in the dictionary?

grep -c '<page>' dictionary.xml

4,225,697 pages total.

How many pages are word definitions, in namespace 0?
grep -c "<ns>0</ns>" dictionary.xml

4,010,537, about 95%. With 4.2M pages in total, that leaves about 215K irrelevant pages.

Smallifying the problem

Let's carve off a smaller chunk to work with, say 100K pages.
head -100000 dictionary.xml > sample.xml

To make it easier to see single pages, split the sample into individual files with csplit. csplit will name the new files with a prefix, xx, and a sequence number.

The csplit command is a little quirky on OS X. It wants to know how many files to split into before it begins, so I picked an improbably huge number, 9000. Split the file wherever there's a <page> tag, up to a max of 9000. Allow 4 digits in the name (-n 4), and don't delete the files if there are errors (-k, perhaps for keep).

csplit -n 4 -k sample.xml "/<page>/" {9000}

Splitting the 100,000 line sample produces about 4700 files. This directory will look like an episode of Hoarders if I split the entire 4 million pages. 4700 files is a good size to explore.
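
A quick sanity check on that count, assuming csplit's default xx prefix as above:

ls xx* | wc -l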

Move the non-dictionary entry pages out of the way to a separate directory. List the names of all files that don't have namespace 0, and feed them to the move command.

mkdir not-dictionary-pages
grep -L '<ns>0</ns>' * | xargs -I % mv % not-dictionary-pages/

Are these entries all the same size? No: ls -l -S -h | more shows the biggest are at 64K and the smallest, half a K. The biggest files were entries for 'time', 'be', 'you', 'go', 'man', and 'do'.
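
To see which words those biggest files hold, something like this does the trick, pulling the title line out of the six largest split files (assuming only the xx files are left in the directory):

grep '<title>' $(ls -S xx* | head -6)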

Looking more closely

What does a page consist of? Here's the entry for cat.

The word being defined appears in the title tag.

<page>
    <title>cat</title>

Each entry has two identifiers, one for the word, and one for the current revision.

    <ns>0</ns>
    <id>36</id>
    <revision>
      <id>32507082</id>
      <parentid>32498100</parentid>
      <timestamp>2015-04-10T14:37:16Z</timestamp>

There's a sha1 hash at the end of the file, another way to identify the version.

If (more likely when) I reimport the data, these will signal whether something has changed or whether it can be skipped. They also provide a unique identifier within the dataset, so I need not generate one of my own.

Scrolling down more, we find the good stuff is in a field called text:

<text xml:space="preserve">{{ also|Cat|CAT|cât|.cat}}
==English==
{{ wikipedia }}
[[Image:Cat03.jpg|thumb|A domestic cat (1.1)]]

===Pronunciation===
* {{ a|UK }} {{ IPA|/kæt/|[kʰæt]|[kʰæʔ]|lang=en}}
* {{ a|US }} {{ IPA|/kæt/|[kʰæt]|[kʰæʔ]|[kʰeə̯t̚]|[kʰæt̚ ]|[kʰæʔt̚ ]|lang=en}}
===Etymology 1===
From {{ etyl|enm|en }} {{ m|enm|cat }}, {{ m|enm|catte }}, from {{ etyl|ang|en }} {{ m|ang|catt||male cat}} and {{ m|ang|catte||female cat}}, from {{ etyl|LL.|en }} {{ m|la|cattus||domestic cat}}, from {{ etyl|la|en }} {{ m|la||catta }} (used around 75 {{ C.E. }} by Martial)

----
==Indonesian==
===Etymology===
From {{ etyl|ms|id }} {{ term|cat|lang=ms}}, from {{ etyl|nan|id }} {{ term|漆|tr=chhat|lang=nan}}, from {{ etyl|ltc|id }} {{ term|漆|tr=tsit|lang=ltc}}.

Exciting! It points out several considerations for how to process entries.

Each section has a header, which starts with two or more equals signs. Languages appear with two equals signs. Subsequent headers have three or more, depending on whether they're a main section or a subsection. The same header word can appear with different numbers of equals signs, so if I want to handle a header like 'Noun' consistently, I'll have to look past those variations.

How many unique headers are there in the sample? Find all the headers, remove the equals signs, make it a unique list, and count the lines:

grep -h '^==' * | sed s/=//g | sort | uniq | wc -l

513 headers in the sample. Some are plurals, some have extra spaces. The Etymology header, which I'm most interested in, includes numbered variations from Etymology 1 to 9. I need to look at those words more closely when I decide how to model relationships.
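
To see just how the Etymology headers vary, a small tweak of the same pipeline counts each spelling (my addition, not part of the original exploration):

grep -h '^===*Etymology' xx* | sed s/=//g | sort | uniq -c | sort -rn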

There are multiple languages per file, separated by four dashes. Grep shows that about half the sample files contain multiple languages.
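
The check is roughly this, counting split files that contain the four-dash separator (the -e keeps grep from reading the dashes as options):

grep -l -e '^----' xx* | wc -l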

Unicode: we got it.

The text includes embedded Mediawiki templates. The Etymology templates encode word relationships in a regular format. Bottom line, I can extract the information programmatically and rely less on understanding the text.
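
As a first pass at that extraction, grep -o can pull just the etymology templates out of the sample. The pattern below follows the spacing in the excerpt above and assumes the templates don't nest, which they sometimes do:

grep -o -h '{{ *etyl|[^}]*}}' xx* | head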

The etymology entries link to words not present in the data set. On Wiktionary pages these show as red links. I'll have to fill them in.

Discoveries and Questions

Exploring the data has told me several things.

  • The data set is very big. I should process it as a stream, not by loading it into memory first. I'll work off a small sample while I refine the approach.
  • It includes non-Latin character sets.
  • There are large chunks I don't need. I'll have to locate the parts that are useful, and accommodate variations in how those are identified.
  • One entry can have words from multiple languages. How will I model it?
  • One word can have multiple etymologies. How will I model it?
  • Entries may link to missing words. How will I fill them in, and how will I differentiate them from words present in the data source?

Overall, it tells me to work iteratively, start small, and expect plot twists along the way.

Best words so far:

Resources

Is the command line new to you? Check out the Introduction to the Bash Command Line and Counting and mining research data with Unix.

For learning more about individual commands, look for search results from university computing help desks. They're written as walkthroughs, so they're easier to understand. Man pages, the traditional source for help, are best as references and reminders.