Processing arbitrary amounts of data in Python
I’ve recently written about data processing in F#, and I thought I’d keep up the trend, but this time showcase a bit of Python. A naive Pythonista can easily exhaust all available memory when working with large datasets, because the default Python data structures, while easy to use, evaluate eagerly: a million elements end up materialized in memory at once instead of being processed one element at a time.
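To make the distinction concrete, here is a minimal sketch (process is a stand-in for whatever per-line work you need to do):
# Eager: readlines() pulls every line of the file into a list at once,
# so a multi-gigabyte file means multiple gigabytes of memory.
lines = open('votes.csv').readlines()

# Lazy: a file object is an iterator, so only one line is held in memory
# at a time as the loop advances.
with open('votes.csv') as f:
    for line in f:
        process(line)  # process() is a hypothetical per-line handler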
First we need data. A few minutes of searching the internet last night led me to a csv dump of Reddit’s voting habits. Let’s download the data to votes.csv.
# Data is compressed, so as we download we decompress it
curl "http://redditketralnis.s3.amazonaws.com/publicvotes.csv.gz" |
gzip -d > votes.csv
Discover a bit of what the data looks like by taking the top 10 lines of the file with head votes.csv.
00ash00,t3_as2t0,1
00ash00,t3_ascto,-1
00ash00,t3_asll7,1
00ash00,t3_atawm,1
00ash00,t3_avosd,1
0-0,t3_6vlrz,1
0-0,t3_6vmwa,1
0-0,t3_6wdiv,1
0-0,t3_6wegp,1
0-0,t3_6wegz,1
There are no headers and we can see the format looks like username,link,vote
where the vote is either -1 or 1. This made me curious if there were votes
other than -1 or 1. Maybe there is a super user or admin that can have five
votes. Luckily, standard Linux tools come to our rescue.
cut --delimiter=',' --fields=3 votes.csv |
sort |
uniq
outputs
1
-1
As an aside, csv mongers are probably upset that cut was used instead of some robust and accurate csv tool that can handle escaping characters, but from the output we can see that it is not a problem: no user has a comma in their username.
If you are following along, you probably noticed that the last step took a relatively significant amount of time (i.e. it wasn’t instant). Executing time on the previous command (bash shortcut !!) gives about nine seconds:
real 0m9.100s
user 0m8.916s
sys 0m0.144s
We could do further analysis on the file, but in an effort to get to the meat of this post, we’ll just find out the number of lines in the file by executing wc --lines votes.csv. 7.5 million lines of data! Puny, but it will work for this example.
The goal of this post is to convert this csv file into JSON without using excessive memory. I’ll post the code below and explain each section afterwards. Feel free to skip the explanations if you already know Python inside and out.
#!/usr/bin/env python2.7
import csv
import fileinput
import sys
from collections import namedtuple
from itertools import imap
from json import JSONEncoder
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
votes = imap(RedditVote._asdict, votes)
class Generator(list):
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self.generator

    def __len__(self):
        return 1

encoder = JSONEncoder(indent=2)
for chunk in encoder.iterencode(Generator(votes)):
    sys.stdout.write(chunk)
Here is a sample of the output:
[
  {
    "username": "00ash00",
    "link": "t3_as2t0",
    "score": 1
  },
  {
    "username": "00ash00",
    "link": "t3_ascto",
    "score": -1
  }
]
Section 1: Parsing
# Create a type that has three fields: username, link, and score
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
# From each row in the csv convert it into the RedditVote by taking the first
# field and storing as the username, the second field as the link, etc.
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
# The csv read in strings, so we convert the score to an integer
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
# Convert each instance into a dictionary, which can easily be serialized to JSON
votes = imap(RedditVote._asdict, votes)
These few lines of Python code pack a punch and rely heavily on the following modules (other than the csv module):
- namedtuple: takes away all the boilerplate of creating an immutable class that you can treat as a tuple or a dictionary
- fileinput: probably the simplest way to read from standard input, with the option of reading multiple files in the future (in case we had multiple data files)
- generators: easily the hardest concept to grok in this list. Each use of imap says that the function passed to imap is invoked only as elements are requested during iteration. The variable votes is only iterated once (and that happens on the encoder.iterencode line). A short sketch of this laziness follows the list.
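Here is a minimal sketch of that laziness (the parse function and the sample rows are made up for illustration):
from itertools import imap

def parse(row):
    print 'parsing', row       # side effect so we can see when work actually happens
    return int(row)

rows = iter(['1', '-1', '1'])
votes = imap(parse, rows)      # nothing is printed yet: no row has been parsed

next(votes)                    # prints "parsing 1"; only now is the first row parsed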
What remains after this section is serializing a sequence of dictionaries to JSON.
Section 2: Inheriting from List
The Generator class is the most “hacky” part of the code, but it is needed because Python struggles to serialize arbitrary iterators in a memory-efficient way. The Generator class is adapted from an answer on StackOverflow.
# Derive from list
class Generator(list):
    # Constructor accepts a generator
    def __init__(self, generator):
        self.generator = generator

    # When asked for the iterator, return the generator
    def __iter__(self):
        return self.generator

    # When asked for the length, always return 1. Additional explanation below.
    def __len__(self):
        return 1
The Generator class has to derive from list because when serializing an object to JSON, Python checks isinstance(obj, list). Not deriving from list will cause an exception because the json module can’t serialize generators. I don’t like this implementation because by deriving from list, Generator is stating that it is a list, but an iterator is not a list! An iterator masquerading as a list gives me nightmares because the user may assume certain truths about lists that are violated when given an iterator.
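A quick way to see the limitation for yourself (a minimal sketch with made-up data):
import json

votes = ({'username': 'a', 'link': 'b', 'score': 1} for _ in xrange(3))
# Raises TypeError: the json module does not know how to serialize a generator
json.dumps(votes)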
The three functions defined are known as special methods, and are easily identifiable by the signature double underscore wrapping:
- __init__ defines how a Generator is initialized when someone invokes Generator(foobar).
- __iter__ defines how elements are iterated (e.g. for x in Generator(foobar):). Our implementation simply forwards the request onto the actual generator.
- __len__ defines how to calculate the length of the class (e.g. length = len(Generator(foobar))). Defining __len__ is needed in Generator because the JSON encoder checks whether the list is “falsy”, and a Python list evaluates to false if it has zero length (e.g. if not []: print 'a' will print a), so without it only [] would be outputted. Defining the function to return a nonzero length gets around this; a short demonstration follows the list.
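To see the problem in isolation, here is a minimal sketch of the same class without __len__ (NaiveGenerator is a throwaway name):
from json import JSONEncoder

class NaiveGenerator(list):
    def __init__(self, generator):
        self.generator = generator
    def __iter__(self):
        return self.generator

encoder = JSONEncoder(indent=2)
print ''.join(encoder.iterencode(NaiveGenerator(n for n in xrange(3))))
# prints [] because the underlying list is empty and therefore falsy;
# adding __len__ returning 1 makes the encoder walk the iterator instead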
What’s frustrating is that in the json module documentation for encoders, there is an example “to support arbitrary iterators”, but it ends up allocating a whole list for the iterator, which is what we’re trying to avoid!
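For reference, the documented workaround looks roughly like the following (the class name is mine); note the list(iterable) call that materializes everything at once:
from json import JSONEncoder

class EagerIteratorEncoder(JSONEncoder):
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            # the entire iterator is copied into a list here, so a
            # 7.5 million element input ends up in memory anyway
            return list(iterable)
        return JSONEncoder.default(self, o)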
There is a way around this hack using simplejson, which is “the externally maintained development version of the json library included with Python 2.6 and Python 3.0.” An issue was raised about four years ago, lamenting the same problem. The author created a branch that fixed the problem and asked for testing volunteers. Unfortunately, no one responded so the author never merged the branch into simplejson. Luckily for us, we can still pip install the branch.
pip install git+https://github.com/simplejson/simplejson@iterable_as_array-gh1
Then we can remove the Generator class and just use:
from simplejson import JSONEncoder
# [...]
encoder = JSONEncoder(indent=2, iterable_as_array=True)
for chunk in encoder.iterencode(votes):
sys.stdout.write(chunk)
The performance difference between the two implementations (deriving from list vs. custom simplejson) is nearly negligible, with the simplejson approach being about 5% faster.
Doing my open source duty, I made sure that I let the author know that their implementation worked for our purposes. Here’s to hoping he hears and merges it into the next version!
Update: The author responded and asked me to merge the branch back into master so it is updated. I accepted and made a pull request the next day.
Update: I’m happy to say that the pull request has been accepted and, as of simplejson 3.8.0, iterable_as_array can be used, so there is no need to reference a specific (and outdated) branch on Github for the functionality. pip install simplejson will include the option. Now, combining all the tricks of the trade, we end with:
#!/usr/bin/env python2.7
import csv
import fileinput
import sys
import simplejson as json
from collections import namedtuple
from itertools import imap
RedditVote = namedtuple('RedditVote', ['username', 'link', 'score'])
votes = imap(RedditVote._make, csv.reader(fileinput.input()))
votes = imap(lambda x: x._replace(score=int(x.score)), votes)
votes = imap(RedditVote._asdict, votes)
json.dump(votes, sys.stdout, indent=2, iterable_as_array=True)
Section 3: Writing
The shortest and (hopefully) the easiest section:
# Create a JSON encoder that pretty prints json with an indent of 2 spaces per
# nesting
encoder = JSONEncoder(indent=2)

# Iteratively encode our generator as new information is processed
for chunk in encoder.iterencode(Generator(votes)):
    # Send each chunk to stdout as it is generated. Since chunks contain the
    # necessary spaces and newlines, there is no need to use `print` as that
    # would add additional formatting
    sys.stdout.write(chunk)
Results
We’re interested in the amount of time and the max memory usage to show that the votes file is, in fact, not loaded entirely into memory at once. To measure max memory usage, there is a nice utility called memusg available online, which we can grab.
curl -O "https://gist.githubusercontent.com/netj/526585/raw/9044a9972fd71d215ba034a38174960c1c9079ad/memusg"
chmod +x memusg
Let’s run it!
time cat votes.csv | ./memusg ./to-json.py >/dev/null
real 3m36.384s
user 3m31.271s
sys 0m2.445s
memusg: peak=6200
Ok, 3 minutes and 36 seconds to run. Not terrible, but not great. Peak memory usage is about 6MB (memusg reports the peak in kilobytes), so the whole file was processed without exhausting memory.
You may have noticed that I’m piping the output to /dev/null. I do this as a way to measure CPU performance, not IO performance. Though, interestingly enough, writing to a file didn’t affect CPU performance, which leads me to think that the algorithm is strangely CPU bound.
Comparison with Rust
A program that transforms an input shouldn’t be CPU bound. This made my mind wander. What if I coded it up in a more naturally performant language? Most people would drift towards C/C++, but I chose Rust because the state of package management in C/C++ is sad. I don’t want to re-code a robust csv and json parser, or spend an inordinate amount of time trying to integrate packages like rapidjson so that everything is redistributable.
Anyways, below is the first cut of the first program I have ever written in Rust, and it produces an identical result to the Python code. Don’t judge too harshly!
#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]

extern crate serde;
extern crate csv;

use serde::ser::Serializer;
use serde::ser::impls::SeqIteratorVisitor;

#[derive(Serialize)]
struct RedditVote {
    username: String,
    link: String,
    vote: i32
}

fn main() {
    let mut rdr = csv::Reader::from_reader(std::io::stdin()).has_headers(false);
    let votes = rdr.records().map(|r| {
        let va = r.unwrap();
        RedditVote {
            username: va[0].to_string(),
            link: va[1].to_string(),
            vote: va[2].parse::<i32>().unwrap()
        }
    });

    let mut writer = serde::json::Serializer::pretty(std::io::stdout());
    writer.visit_seq(SeqIteratorVisitor::new(votes, None)).unwrap();
}
The code is technically longer, but I feel like it hasn’t lost any readability. It has different syntax than Python, but that shouldn’t be a big turnoff.
Since this was the first Rust code I have ever written, I can’t tell if it is idiomatic, but I’m satisfied. The hardest part was looking up function signatures for serde and lamenting the fact that the Rust csv reader decodes records natively using rustc-serialize, but the community is moving towards serde because it has more features and is faster. There is an issue open for the Rust csv parser to move to serde, so the posted code should only become more concise as time passes.
At the time of writing this, Cargo.toml looked like:
[dependencies]
csv = "0.14.2"
serde = "0.4.2"
serde_macros = "0.4.1"
Running the code (after compiling with cargo build --release):
time cat votes.csv | ./memusg ./rs-to-json >/dev/null
real 0m35.366s
user 0m25.384s
sys 0m7.629s
memusg: peak=5588
That’s about a 6x speedup compared to even the Python version, which drops down to C for speed. Thus, if speed is of no concern, use Python, but if you want speed and C/C++ might not be the right fit – use Rust.
I’ve avoided talking about the sys timing because until now it hasn’t constituted a significant part of the timing, but now that sys makes up over a fifth of the real time, it is time to talk about it. In our example, sys measures reading the file and memory allocation, two jobs that are relegated to the kernel to perform. A hypothesis would be that the Rust code is so fast that the program is becoming IO bound, and this statement is backed up by watching top display our Rust program consistently using less CPU (about 10%).
Roundtrip
If you take a look at the resulting json file, it is quite large. Converting it back to csv might be a good idea because we don’t really gain anything from the json, and a csv is much smaller in comparison. We use the same tools as before, except we need ijson, which allows us to stream JSON in.
#!/usr/bin/env python2.7
import ijson.backends.yajl2 as ijson
import sys
import csv
from itertools import imap
votes = ijson.items(sys.stdin, 'item')
votes = ((x['username'], x['link'], x['score']) for x in votes)
csv.writer(sys.stdout, lineterminator='\n').writerows(votes)
There are more lines of code dedicated to importing modules than to the actual conversion process. If that is not a powerful snippet of code, I’m not sure what is!
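If the 'item' argument to ijson.items looks mysterious, here is a minimal sketch of what it does (the in-memory document is made up, and I use the default backend here to keep the sketch dependency-free):
import io
import ijson

doc = io.BytesIO(b'[{"username": "0-0", "link": "t3_6vlrz", "score": 1}]')
# 'item' addresses each element of the top-level array, so we get one dict
# at a time instead of the whole parsed document
for vote in ijson.items(doc, 'item'):
    print vote['username'], vote['link'], vote['score']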
I chose the yajl2 ijson backend because the pure Python implementation is twice as slow. The downside to this approach is that it may not be cross-platform, as yajl2 requires a compilation step.
pip install ijson
apt-get install libyajl2
time cat votes.json | ./memusg ./to-csv.py >/dev/null
real 4m21.140s
user 3m58.029s
sys 0m2.489s
memusg: peak=9772
We can ensure that roundtripping produces identical output without creating any additional files:
diff votes.csv <(cat votes.csv | ./to-json.py | ./to-csv.py)
How fast can we go?
From here on out, the content is trivial, but let’s say that you were tasked with converting csv to JSON as fast as possible, and it didn’t matter how reusable the code was.
For this task we are going to dip our toes into C.
Compile the following code with gcc -O3 -o to-json to-json.c. gcc is needed because we use a couple of gnu-isms, such as getline to read a line at a time and __builtin_expect for branch prediction.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
    ssize_t read;
    char* line = malloc(256 * sizeof(char));
    char* username = malloc(256 * sizeof(char));
    char* link = malloc(256 * sizeof(char));
    size_t len = 255;
    int i = 0;

    fputc('[', stdout);
    while ((read = getline(&line, &len, stdin)) != -1) {
        if (__builtin_expect(i, 1)) {
            fputc(',', stdout);
        }
        i++;

        char* username_end = memchr(line, ',', read);
        ssize_t username_length = username_end - line;
        memcpy(username, line, username_length);
        username[username_length] = 0;

        char* link_end = memchr(username_end + 1, ',', read - username_length - 1);
        ssize_t link_length = link_end - username_end - 1;
        memcpy(link, username_end + 1, link_length);
        link[link_length] = 0;

        fputs("\n  {\n    \"username\": \"", stdout);
        fputs(username, stdout);
        fputs("\",\n    \"link\": \"", stdout);
        fputs(link, stdout);
        fputs("\",\n    \"score\": ", stdout);
        if (*(link_end + 1) == '-') {
            fputc('-', stdout);
        }
        fputs("1\n  }", stdout);
    }
    fputs("\n]", stdout);

    free(line);
    free(username);
    free(link);
    return 0;
}
and the results:
time cat votes.csv | ./memusg ./to-json >/dev/null
real 0m1.886s
user 0m1.284s
sys 0m0.270s
memusg: peak=972
That’s 114.73x speedup compared to our python solution and 18.75x speedup compared to our rust solution. Also note the low memory usage.
Despite the speedup, please don’t code like this. There are many inputs that could wreck our toy example (multi-line records, quotations, long usernames, etc). We get all of these features in the Rust and Python versions because we used libraries that handle all the corner cases.
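As a quick illustration of what the hand-rolled parser would mangle but the csv module handles for free, consider a quoted username containing a comma (the record below is made up):
import csv
from StringIO import StringIO

# A legal CSV record: the username contains a comma, so it is quoted.
# The C program above would split on the first comma inside the quotes.
tricky = StringIO('"0,0",t3_6vlrz,1\n')
print list(csv.reader(tricky))
# [['0,0', 't3_6vlrz', '1']]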
As a side note, the C version has the longest line count and took the longest to code.
How about CSVKit?
CSVKit is nice when
working with CSVs and it even has a tool called
csvjson
that
will convert a CSV file into JSON. How does it stack up to our other methods?
First, csvjson determines the JSON keys from column headers, but our dataset doesn’t have headers. Also, csvjson doesn’t natively stream data, and when using the --stream option, it won’t output valid JSON! The former problem is easily fixed, but the latter renders this test almost useless. Still, we’ll execute it and record the results.
# First we add the headers: username, link, and score
time (echo "username,link,score" && cat votes.csv) |
./memusg csvjson --stream -i 2 >/dev/null
real 7m54.152s
user 6m55.465s
sys 0m11.561s
memusg: peak=12952
Wow, slow as molasses (over 250 times slower than our C version) and the final result is still incorrect, but I figured I should include this example to be complete, as for small CSV files it should be the quickest approach because there is no code you have to write, just a command to execute!
And PyPy?
PyPy is a fast, compliant alternative implementation of the Python language (2.7.9 and 3.2.5). […] Thanks to its Just-in-Time compiler, Python programs often run faster on PyPy
time cat votes.csv | ./memusg pypy to-json.py >/dev/null
real 1m22.147s
user 1m12.649s
sys 0m1.827s
memusg: peak=78924
Nice. Dropping in PyPy yielded about a 2.6x speedup without any additional work. Memory usage is significantly higher, most likely due to PyPy’s JIT compiler.
Summary
| test      | real     | user     | sys      | memusg peak (KB) |
|-----------|----------|----------|----------|------------------|
| Python    | 3m36.384 | 3m31.271 | 0m2.445  | 6200             |
| Rust      | 0m35.366 | 0m25.384 | 0m7.629  | 5588             |
| C         | 0m1.886  | 0m1.284  | 0m0.270  | 972              |
| PyPy      | 1m22.147 | 1m12.649 | 0m1.827  | 78924            |
| CSVKit    | 7m54.152 | 6m55.465 | 0m11.561 | 12952            |
| Roundtrip | 4m21.140 | 3m58.029 | 0m2.489  | 9772             |