Parse Log Output from the Hadoop YARN MapReduce DFSIO Benchmark Utility to CSV Files

Recently I've been spending a lot of time running DFSIO benchmark jobs.
The DFSIO utility is part of the Hadoop distribution and can be found in JARs located in ./hadoop/share/mapred; the JARs have names like "hadoop-mapreduce-client-jobclient-*-tests.jar".
The tool reports its results as labeled rows, which reads fine for a single run. However, I've been running these jobs in a loop with varying parameters (a sketch of that loop follows the sample output below), and the accumulated output, while friendly to the eyes, is not great to parse and analyze afterwards. A single run's report looks like this:
----- TestDFSIO ----- : read
Date & time: Wed Oct 16 11:09:00 EDT 2013
Number of files: 10
Total MBytes processed: 10000.0
Throughput mb/sec: 40.946519750553804
Average IO rate mb/sec: 45.240928649902344
IO rate std deviation: 18.27387874605978
Test exec time sec: 47.937
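For context, here is a minimal sketch of the kind of driver loop I mean, written in Python. The jar lookup, the parameter grid, and the log file name are assumptions for illustration; TestDFSIO's -write/-read, -nrFiles, and -fileSize arguments are the usual ones, but check them against your Hadoop version.

#!/usr/bin/env python3
# Hypothetical driver: run TestDFSIO over a grid of parameters and
# funnel all console output into one log for later parsing.
import glob
import subprocess

# Locate the tests jar without hard-coding a version (assumed layout).
JAR = glob.glob('./hadoop/share/mapred/'
                'hadoop-mapreduce-client-jobclient-*-tests.jar')[0]

with open('dfsio.log', 'a') as log:
    for nr_files in (10, 100):
        for file_size in (100, 1000):           # MB per file (assumed grid)
            for mode in ('-write', '-read'):    # write first so reads have data
                subprocess.check_call(
                    ['hadoop', 'jar', JAR, 'TestDFSIO', mode,
                     '-nrFiles', str(nr_files),
                     '-fileSize', str(file_size)],
                    stdout=log, stderr=subprocess.STDOUT)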
So, the following Python 3 script can take this output and turn it into a usable CSV file, using the row headings as the columns:
#!/usr/bin/env python3
import argparse
import csv


def parse_file(file_name):
    # Collect one dict per benchmark run, splitting each line on the
    # first ':' only so values containing colons (e.g. "Date & time")
    # survive intact.
    keys = []   # column names, in order of first appearance
    rows = []
    row = {}
    with open(file_name, 'r') as log:
        items = [line.split(':', 1) for line in log]
    for item in items:
        if len(item) < 2:
            # A line with no colon marks the boundary between runs;
            # flush the row we've built so far.
            if row:
                rows.append(row)
                row = {}
            continue
        key = item[0].strip()
        value = item[1].strip()
        if key not in keys:
            keys.append(key)
        row[key] = value
    if row:
        # Keep the last run even if the log lacks a trailing separator.
        rows.append(row)
    return keys, rows


def write_csv(file_name, keys, rows):
    # newline='' stops the csv module writing blank lines between rows
    # on some platforms.
    with open(file_name, 'w', newline='') as outfile:
        w = csv.DictWriter(outfile, keys)
        w.writeheader()
        w.writerows(rows)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--file', required=True, help='File to parse')
    parser.add_argument('-o', '--output', required=True, help='Output file')
    return parser.parse_args()


def main():
    args = parse_arguments()
    keys, rows = parse_file(args.file)
    write_csv(args.output, keys, rows)


if __name__ == '__main__':
    main()
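Invoke it with the input log and an output path (the script name here is just what I saved it as):

python3 parse_dfsio.py -f dfsio.log -o dfsio.csv

Run against a log containing the sample report above, the CSV header comes straight from the row headings: "----- TestDFSIO -----", Date & time, Number of files, Total MBytes processed, Throughput mb/sec, Average IO rate mb/sec, IO rate std deviation, Test exec time sec, with one row per benchmark run.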