Notes: Reading The CSV
In the interest of continuing to write instead of fixating on perfect form, I decided to publish intermediate posts as "Notes". The paragraphs will be more stream of consciousness and less polished, but perhaps they will be consolidated into more coherent pieces at a later time.
I considered doing a deep dive on shell utilities like cut, awk, and awk for csv, to name a few. I decided not to spend too much time on them because I already know I will be doing this work in Rust. However, I really want to play with XSV at some point -- it looks incredible.
Both Python's csv and Rust's csv solutions optimize for reading by record. But pandas allows you to do optimized field-wise operations with something like apply. I started looking into how they do this and ran into BlockManager. It sounds interesting, but I quickly realized I didn't want to get too deep into the weeds there. I have a habit of optimizing prematurely, and if I keep that up my project will never get finished. So I'm linking to these resources for now, and when the time comes I can refer back to them: good starter summary, linked reference to an article by Wes McKinney, McKinney's motivation for an alternative to the earlier block manager implementation, and Pandas roadmap for its rewrite.
As I was writing this, I came across numpy's loadtxt. It returns a transposed version of the data when unpack=True is passed. This is really neat because as I was reading through Rust's CSV reader implementation, I was thinking that I want the same reader as if I had rotated my data 90 degrees (gosh, I hope I remember geometry). I couldn't think of the word before, but I remember now that I have run into it through the TRANSPOSE function in Excel and Google Sheets. Once again, I arrive at a potential optimization question that I will defer answering: if we want easy access to both record-wise and field-wise data, what's the most efficient way to store it? The cheap solution, for now, seems to be storing the data twice. The harder solution might involve matrices (ndarray, nalgebra), and potentially digging more into Pandas/numpy internals. We'll cross that bridge when we get there.
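To make the "store it twice" idea concrete, here is a minimal sketch using only Python's standard library. The sample data and variable names are made up for illustration, and it assumes every record has the same number of fields.

```python
import csv
import io

# Hypothetical sample data, made up for illustration.
raw = "name,qty,price\napple,3,0.50\npear,5,0.75\n"

# Record-wise copy: csv.reader yields one row (a list of strings) at a time.
records = list(csv.reader(io.StringIO(raw)))

# Field-wise copy: zip(*records) transposes rows into columns.
# Assumes every record has the same number of fields; zip stops at
# the shortest row otherwise, silently dropping trailing cells.
fields = [list(column) for column in zip(*records)]

print(records[1])  # the second record
print(fields[1])   # the second column, header included
```

Both lists hold every cell, so memory use roughly doubles. That is the price of the cheap solution.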
So storing twice. Although I love the idea of rotating the data 90 degrees, it doesn't seem feasible to do that since data comes in one byte at a time? I'm bound to forget this so I'll write it here: the Rust csv library uses the csv_core crate, which has a method called read_field. The field referred to there is more of a cell in spreadsheet parlance, not the entire column as I had hoped. After playing with the csv library a little more, I decided to write a wrapper struct that can iterate by record or by field. The data actually ends up buffered in memory once, and I re-read it whenever I traverse the records or the fields. Here's the messy code for reference.
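The linked code is in Rust. Purely as an illustration of the same shape, here is a rough Python sketch of a wrapper that keeps one buffered copy and re-parses it on every traversal. The class name BufferedCsv and its methods are hypothetical, not taken from the csv crate or from my actual code, and it assumes no row is shorter than the first one.

```python
import csv
import io


class BufferedCsv:
    """Hypothetical wrapper: holds the raw CSV text once, re-parses per traversal."""

    def __init__(self, text):
        self._text = text  # the single in-memory copy of the data

    def records(self):
        # Re-read the buffer from the start, yielding one row at a time.
        yield from csv.reader(io.StringIO(self._text))

    def fields(self):
        # Re-read the buffer once per column, pulling cell i out of every row.
        # Assumes the first row defines the column count and that no row
        # is shorter than it.
        width = len(next(self.records(), []))
        for i in range(width):
            yield [row[i] for row in self.records()]


data = BufferedCsv("id,city\n1,Oslo\n2,Lima\n")
rows = list(data.records())
cols = list(data.fields())
```

The trade-off is time for memory: fields() parses the whole buffer once per column instead of keeping a second, transposed copy around.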
Not captured very well here is that while I was tinkering with the library, I started thinking at length about SQL. I've been following the Aurae project and was surprised to see that SQLite was one of their building blocks for something as complex as a systemd alternative. For background, my early introduction to SQLite had me believe that it might not be right for "important work". With fresh eyes, and after actually reading the website, I have a newfound appreciation for and interest in the database format. I think the flexible typing is what really sold me on giving it an earnest try. Realm also looked interesting, but since I know there will be extensive work with relational data for these flat files, I'll have to set it aside for now. I think where I'm headed is that I'm going to have to get decent at SQL in the next couple of months. I'll pick some text to study along the way, but I want to emphasize building, so I need to think about how to share artifacts of my work.
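For a feel of what flexible typing means in practice, here is a small sketch using Python's built-in sqlite3 module. The table and values are made up for illustration. In SQLite the declared column type acts as an affinity (a preference), so a column declared INTEGER can still hold a real number or a piece of text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; in SQLite the declared types are affinities,
# not strict constraints (unless the table is created as STRICT).
conn.execute("CREATE TABLE cells (label TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?)",
    [("count", 42), ("ratio", 0.5), ("note", "n/a")],
)

# typeof() reports the storage class actually used for each value.
stored = conn.execute(
    "SELECT label, typeof(value) FROM cells ORDER BY rowid"
).fetchall()
print(stored)
```

That seems like a comfortable fit for messy flat files, where one column can mix numbers with placeholders like "n/a".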
Support Yusuph Mkangara by becoming a sponsor. Any amount is appreciated!