Processing arbitrary amounts of data in Python
I’ve recently written about data processing in F#, and I thought I’d keep up the trend, but this time showcase a bit of Python. A naive Pythonista can easily exhaust all available memory when working with large datasets, because the default Python data structures, while easy to use, evaluate eagerly: a million elements end up materialized in memory at once instead of being processed one element at a time.
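To make the distinction concrete, here is a minimal sketch (process is a stand-in for whatever per-line work you need to do):
# Eager: readlines() pulls every line of the file into a list at once,
# so a multi-gigabyte file means multiple gigabytes of memory.
lines = open('votes.csv').readlines()

# Lazy: a file object is an iterator, so only one line is held in memory
# at a time as the loop advances.
with open('votes.csv') as f:
    for line in f:
        process(line)  # process() is a hypothetical per-line handler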
First we need data. A few minutes of searching the internet last night led me to a csv dump of Reddit’s voting habits. Let’s download the data to votes.csv.
# Data is compressed, so as we download we decompress it
curl "http://redditketralnis.s3.amazonaws.com/publicvotes.csv.gz" |
gzip -d > votes.csv
Discover a bit of what the data looks like by taking the top 10 lines of the file with head votes.csv.
00ash00,t3_as2t0,1
00ash00,t3_ascto,-1
00ash00,t3_asll7,1
00ash00,t3_atawm,1
00ash00,t3_avosd,1
0-0,t3_6vlrz,1
0-0,t3_6vmwa,1
0-0,t3_6wdiv,1
0-0,t3_6wegp,1
0-0,t3_6wegz,1
There are no headers and we can see the format looks like username,link,vote
where the vote is either -1 or 1. This made me curious if there were votes
other than -1 or 1. Maybe there is a super user or admin that can have five
votes. Luckily, standard Linux tools come to our rescue.
cut --delimiter=',' --fields=3 votes.csv |
sort |
uniq
outputs
1
-1
As an aside, csv mongers are probably upset that cut was used instead of some robust and accurate csv tool that can handle escaping characters, but from the output we can see that it is not a problem: no user has a comma in their username.
If you are following along, you probably noticed that the last step took a relatively significant amount of time (i.e. it wasn’t instant). Executing time on the previous command (bash shortcut !!) gives about nine seconds:
real 0m9.100s
user 0m8.916s
sys 0m0.144s
We could do further analysis on the file, but in an effort to get to the meat of this post, we’ll just find out the number of lines in the file by executing wc --lines votes.csv. 7.5 million lines of data! Puny, but it will work for this example.
The goal of this post is to convert this csv file into JSON without using excessive memory. I’ll post the code below and explain each section afterwards. Feel free to skip the explanations if you already know Python inside and out.
#!/usr/bin/env python2.7
import csv
import fileinput
import sys
from collections import namedtuple
from itertools import imap
from json import JSONEncoder
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
votes = imap(RedditVote._asdict, votes)
class Generator(list):
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self.generator

    def __len__(self):
        return 1

encoder = JSONEncoder(indent=2)
for chunk in encoder.iterencode(Generator(votes)):
    sys.stdout.write(chunk)
Here is a sample of the output:
[
  {
    "username": "00ash00",
    "link": "t3_as2t0",
    "score": 1
  },
  {
    "username": "00ash00",
    "link": "t3_ascto",
    "score": -1
  }
]
Section 1: Parsing
# Create a type that has three fields: username, link, and score
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
# From each row in the csv convert it into the RedditVote by taking the first
# field and storing as the username, the second field as the link, etc.
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
# The csv read in strings, so we convert the score to an integer
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
# Convert each instance into a dictionary, which can easily be serialized to JSON
votes = imap(RedditVote._asdict, votes)
These few lines of Python code pack a punch and rely heavily on the following modules (other than the csv module):
- namedtuple: takes away all the boilerplate of creating an immutable class that you can treat as a tuple or a dictionary
- fileinput: probably the simplest way to read from standard input, with the option of reading multiple files in the future (in case we had multiple data files)
- generators: easily the hardest concept to grok in this list. Each use of imap says that the function passed to imap is invoked only as elements are requested during iteration. The variable votes is only iterated once (and that happens on the encoder.iterencode line). A short sketch of this laziness follows the list.
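Here is a minimal sketch of that laziness (the parse function and the sample rows are made up for illustration):
from itertools import imap

def parse(row):
    print 'parsing', row       # side effect so we can see when work actually happens
    return int(row)

rows = iter(['1', '-1', '1'])
votes = imap(parse, rows)      # nothing is printed yet: no row has been parsed

next(votes)                    # prints "parsing 1"; only now is the first row parsed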
What remains after this section is serializing a sequence of dictionaries to JSON.
Section 2: Inheriting from List
The Generator class is the most “hacky” part of the code, but it is needed because Python struggles to serialize arbitrary iterators in a memory-efficient way. The Generator class is adapted from an answer on StackOverflow.
# Derive from list
class Generator(list):
    # Constructor accepts a generator
    def __init__(self, generator):
        self.generator = generator

    # When asked for the iterator, return the generator
    def __iter__(self):
        return self.generator

    # When asked for the length, always return 1. Additional explanation below.
    def __len__(self):
        return 1
The Generator class has to derive from list because when serializing an object to JSON, Python checks isinstance(obj, list). Not deriving from list will cause an exception because the json module can’t serialize generators. I don’t like this implementation because by deriving from list, Generator is stating that it is a list, but an iterator is not a list! An iterator masquerading as a list gives me nightmares because the user may assume certain truths about lists that are violated when given an iterator.
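A quick way to see the limitation for yourself (a minimal sketch with made-up data):
import json

votes = ({'username': 'a', 'link': 'b', 'score': 1} for _ in xrange(3))
# Raises TypeError: the json module does not know how to serialize a generator
json.dumps(votes)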
The three functions defined are known as special methods, and are easily identifiable by the signature double underscore wrapping:
- __init__ defines how a Generator is initialized when someone invokes Generator(foobar).
- __iter__ defines how elements are iterated (e.g. for x in Generator(foobar):). Our implementation simply forwards the request onto the actual generator.
- __len__ defines how to calculate the length of the class (e.g. length = len(Generator(foobar))). Defining __len__ is needed in Generator because the JSON encoder checks whether the list is “falsy”, and a Python list evaluates to false if it has zero length (e.g. if not []: print 'a' will print a), so without it only [] would be outputted. Defining the function to return a nonzero length gets around this; a short demonstration follows the list.
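To see the problem in isolation, here is a minimal sketch of the same class without __len__ (NaiveGenerator is a throwaway name):
from json import JSONEncoder

class NaiveGenerator(list):
    def __init__(self, generator):
        self.generator = generator
    def __iter__(self):
        return self.generator

encoder = JSONEncoder(indent=2)
print ''.join(encoder.iterencode(NaiveGenerator(n for n in xrange(3))))
# prints [] because the underlying list is empty and therefore falsy;
# adding __len__ returning 1 makes the encoder walk the iterator instead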
What’s frustrating is that in the json module documentation for encoders, there is an example “to support arbitrary iterators”, but it ends up allocating a whole list for the iterator, which is what we’re trying to avoid!
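For reference, the documented workaround looks roughly like the following (the class name is mine); note the list(iterable) call that materializes everything at once:
from json import JSONEncoder

class EagerIteratorEncoder(JSONEncoder):
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            # the entire iterator is copied into a list here, so a
            # 7.5 million element input ends up in memory anyway
            return list(iterable)
        return JSONEncoder.default(self, o)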
There is a way around this hack using simplejson, which is “the externally maintained development version of the json library included with Python 2.6 and Python 3.0.” An issue was raised about four years ago, lamenting the same problem. The author created a branch that fixed the problem and asked for testing volunteers. Unfortunately, no one responded so the author never merged the branch into simplejson. Luckily for us, we can still pip install the branch.
pip install git+https://github.com/simplejson/simplejson@iterable_as_array-gh1
Then we can remove the Generator class and just use:
from simplejson import JSONEncoder
# [...]
encoder = JSONEncoder(indent=2, iterable_as_array=True)
for chunk in encoder.iterencode(votes):
sys.stdout.write(chunk)
The performance difference between the two implementations (deriving from list vs. custom simplejson) is nearly negligible, with the simplejson approach being about 5% faster.
Doing my open source duty, I made sure that I let the author know that their implementation worked for our purposes. Here’s to hoping he hears and merges it into the next version!
Update: The author responded and asked me to merge the branch back into master so it is updated. I accepted and made a pull request the next day.
Update: I’m happy to say that the pull request has been accepted and, as of simplejson 3.8.0, iterable_as_array can be used, so there is no need to reference a specific (and outdated) branch on Github for the functionality. pip install simplejson will include the option. Now, combining all the tricks of the trade, we end with:
#!/usr/bin/env python2.7
import csv
import fileinput
import sys
import simplejson as json
from collections import namedtuple
from itertools import imap
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
votes = imap(RedditVote._asdict, votes)
json.dump(votes, sys.stdout, indent=2, iterable_as_array=True)
Section 3: Writing
The shortest and (hopefully) the easiest section:
# Create a JSON encoder that pretty prints json with an indent of 2 spaces per
# nesting
encoder = JSONEncoder(indent=2)

# Iteratively encode our generator as new information is processed
for chunk in encoder.iterencode(Generator(votes)):
    # Send each chunk to stdout as it is generated. Since chunks contain the
    # necessary spaces and newlines, there is no need to use `print` as that
    # would add additional formatting
    sys.stdout.write(chunk)
Results
We’re interested in the amount of time and the max memory usage to show that the votes file is, in fact, not loaded entirely into memory at once. To measure max memory usage, there is a nice utility called memusg available online, which we can grab.
curl -O "https://gist.githubusercontent.com/netj/526585/raw/9044a9972fd71d215ba034a38174960c1c9079ad/memusg"
chmod +x memusg
Let’s run it!
time cat votes.csv | ./memusg ./to-json.py >/dev/null
real 3m36.384s
user 3m31.271s
sys 0m2.445s
memusg: peak=6200
Ok, 3 minutes and 36 seconds to run. Not terrible, but not great. Peak memory usage is about 6MB (memusg reports the peak in kilobytes), so the whole file was processed without exhausting memory.
You may have noticed that I’m piping the output to /dev/null. I do this as a way to measure CPU performance, not IO performance. Though, interestingly enough, writing to a file didn’t affect CPU performance, which leads me to think that the algorithm is strangely CPU bound.
Comparison with Rust
A program that transforms an input shouldn’t be CPU bound. This made my mind wander. What if I coded it up in a more naturally performant language? Most people would drift towards C/C++, but I chose Rust because the state of package management in C/C++ is sad. I don’t want to re-code a robust csv and json parser, or spend an inordinate amount of time trying to integrate packages like rapidjson so that everything is redistributable.
Anyways, below is the first cut of the first program I have ever written in Rust, and it produces an identical result to the Python code. Don’t judge too harshly!
#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]

extern crate serde;
extern crate csv;

use serde::ser::Serializer;
use serde::ser::impls::SeqIteratorVisitor;

#[derive(Serialize)]
struct RedditVote {
    username: String,
    link: String,
    vote: i32
}

fn main() {
    let mut rdr = csv::Reader::from_reader(std::io::stdin()).has_headers(false);
    let votes = rdr.records().map(|r| {
        let va = r.unwrap();
        RedditVote {
            username: va[0].to_string(),
            link: va[1].to_string(),
            vote: va[2].parse::<i32>().unwrap()
        }
    });

    let mut writer = serde::json::Serializer::pretty(std::io::stdout());
    writer.visit_seq(SeqIteratorVisitor::new(votes, None)).unwrap();
}
The code is technically longer, but I feel like it hasn’t lost any readability. It has different syntax than Python, but that shouldn’t be a big turnoff.
Since this was the first Rust code I have ever written, I can’t tell if it is idiomatic, but I’m satisfied. The hardest part was looking up function signatures for serde and lamenting the fact that the Rust csv reader decodes records natively using rustc-serialize, but the community is moving towards serde because it has more features and is faster. There is an issue open for the Rust csv parser to move to serde, so the posted code should only become more concise as time passes.
At the time of writing this, Cargo.toml looked like:
[dependencies]
csv = "0.14.2"
serde = "0.4.2"
serde_macros = "0.4.1"
Running the code (after compiling with cargo build --release):
time cat votes.csv | ./memusg ./rs-to-json >/dev/null
real 0m35.366s
user 0m25.384s
sys 0m7.629s
memusg: peak=5588
That’s about a 6x speedup compared to even the Python version, which drops down to C for speed. Thus, if speed is of no concern, use Python, but if you want speed and C/C++ might not be the right fit – use Rust.
I’ve avoided talking about the sys timing because until now it hasn’t constituted a significant part of the timing, but now that sys makes up over a fifth of the real time, it is time to talk about it. In our example, sys measures reading the file and memory allocation, two jobs that are relegated to the kernel to perform. A hypothesis would be that the Rust code is so fast that the program is becoming IO bound, and this statement is backed up by watching top display our Rust program consistently using less CPU (about 10%).
Roundtrip
If you take a look at the resulting json file, it is quite large. Converting it back to csv might be a good idea because we don’t really gain anything from the json, and a csv is much smaller in comparison. We use the same tools as before, except we need ijson, which allows us to stream JSON in.
#!/usr/bin/env python2.7
import ijson.backends.yajl2 as ijson
import sys
import csv
from itertools import imap
votes = ijson.items(sys.stdin, 'item')
votes = ((x['username'], x['link'], x['score']) for x in votes)
csv.writer(sys.stdout, lineterminator='\n').writerows(votes)
There are more lines of code dedicated to importing modules than to the actual conversion process. If that is not a powerful snippet of code, I’m not sure what is!
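If the 'item' argument to ijson.items looks mysterious, here is a minimal sketch of what it does (the in-memory document is made up, and I use the default backend here to keep the sketch dependency-free):
import io
import ijson

doc = io.BytesIO(b'[{"username": "0-0", "link": "t3_6vlrz", "score": 1}]')
# 'item' addresses each element of the top-level array, so we get one dict
# at a time instead of the whole parsed document
for vote in ijson.items(doc, 'item'):
    print vote['username'], vote['link'], vote['score']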
I chose the yajl2 ijson backend because the pure Python implementation is twice as slow. The downside to this approach is that it may not be cross-platform, as yajl2 requires a compilation step.
pip install ijson
apt-get install libyajl2
time cat votes.json | ./memusg ./to-csv.py >/dev/null
real 4m21.140s
user 3m58.029s
sys 0m2.489s
memusg: peak=9772
We can ensure that roundtripping produces identical output without creating any additional files:
diff votes.csv <(cat votes.csv | ./to-json.py | ./to-csv.py)
How fast can we go?
From here on out, the content is trivial, but let’s say that you were tasked with converting csv to JSON as fast as possible, and it didn’t matter how reusable the code was.
For this task we are going to dip our toes into C.
Compile the following code with gcc -O3 -o to-json to-json.c. gcc is needed because we use a couple of gnu-isms, such as getline to read a line at a time and __builtin_expect for branch prediction.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    ssize_t read;
    char* line = malloc(256 * sizeof(char));
    char* username = malloc(256 * sizeof(char));
    char* link = malloc(256 * sizeof(char));
    size_t len = 255;
    int i = 0;

    fputc('[', stdout);
    while ((read = getline(&line, &len, stdin)) != -1) {
        if (__builtin_expect(i, 1)) {
            fputc(',', stdout);
        }
        i++;

        char* username_end = memchr(line, ',', read);
        ssize_t username_length = username_end - line;
        memcpy(username, line, username_length);
        username[username_length] = 0;

        char* link_end = memchr(username_end + 1, ',', read - username_length - 1);
        ssize_t link_length = link_end - username_end - 1;
        memcpy(link, username_end + 1, link_length);
        link[link_length] = 0;

        fputs("\n  {\n    \"username\": \"", stdout);
        fputs(username, stdout);
        fputs("\",\n    \"link\": \"", stdout);
        fputs(link, stdout);
        fputs("\",\n    \"score\": ", stdout);
        if (*(link_end + 1) == '-') {
            fputc('-', stdout);
        }
        fputs("1\n  }", stdout);
    }
    fputs("\n]", stdout);

    free(line);
    free(username);
    free(link);
    return 0;
}
and the results:
time cat votes.csv | ./memusg ./to-json >/dev/null
real 0m1.886s
user 0m1.284s
sys 0m0.270s
memusg: peak=972
That’s 114.73x speedup compared to our python solution and 18.75x speedup compared to our rust solution. Also note the low memory usage.
Despite the speedup, please don’t code like this. There are many inputs that could wreck our toy example (multi-line records, quotations, long usernames, etc). We get all of these features in the Rust and Python versions because we used libraries that handle all the corner cases.
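As a quick illustration of what the hand-rolled parser would mangle but the csv module handles for free, consider a quoted username containing a comma (the record below is made up):
import csv
from StringIO import StringIO

# A legal CSV record: the username contains a comma, so it is quoted.
# The C program above would split on the first comma inside the quotes.
tricky = StringIO('"0,0",t3_6vlrz,1\n')
print list(csv.reader(tricky))
# [['0,0', 't3_6vlrz', '1']]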
As a side note, the C version has the longest line count and took the longest to code.
How about CSVKit?
CSVKit is nice when
working with CSVs and it even has a tool called
csvjson
that
will convert a CSV file into JSON. How does it stack up to our other methods?
First, csvjson determines the JSON keys from column headers, but our dataset doesn’t have headers. Also, csvjson doesn’t natively stream data, and when using the --stream option, it won’t output valid JSON! The former problem is easily fixed, but the latter renders this test almost useless. Still, we’ll execute it and record the results.
# First we add the headers: username, link, and score
time (echo "username,link,score" && cat votes.csv) |
./memusg csvjson --stream -i 2 >/dev/null
real 7m54.152s
user 6m55.465s
sys 0m11.561s
memusg: peak=12952
Wow, slow as molasses (over 250 times slower than our C version) and the final result is still incorrect, but I figured I should include this example to be complete, as for small CSV files it should be the quickest approach because there is no code you have to write, just a command to execute!
And PyPy?
PyPy is a fast, compliant alternative implementation of the Python language (2.7.9 and 3.2.5). […] Thanks to its Just-in-Time compiler, Python programs often run faster on PyPy
time cat votes.csv | ./memusg pypy to-json.py >/dev/null
real 1m22.147s
user 1m12.649s
sys 0m1.827s
memusg: peak=78924
Nice. Dropping in PyPy yielded about a 2.6x speedup without any additional work. Memory usage is significantly higher, most likely due to PyPy’s JIT compiler.
Summary
| test      | real     | user     | sys      | memusg peak (KB) |
|-----------|----------|----------|----------|------------------|
| Python    | 3m36.384 | 3m31.271 | 0m2.445  | 6200             |
| Rust      | 0m35.366 | 0m25.384 | 0m7.629  | 5588             |
| C         | 0m1.886  | 0m1.284  | 0m0.270  | 972              |
| PyPy      | 1m22.147 | 1m12.649 | 0m1.827  | 78924            |
| CSVKit    | 7m54.152 | 6m55.465 | 0m11.561 | 12952            |
| Roundtrip | 4m21.140 | 3m58.029 | 0m2.489  | 9772             |