I would be very interested if you or tsholmes could discuss the thought process ...

tsholmes · on Dec 2, 2021

Sure thing!

The sqlite file format spec (https://www.sqlite.org/fileformat.html) is fairly short. I looked at hexdumps of where the db header should be, and a few page headers, and it seemed to line up, but not everything was filled in correctly. Trying to patch the header fields to match the sqlite spec and running it through sqlite3 didn't work, so I figured there was some more structural differences.

Since the spec is pretty short, I wrote a small parser for the sqlite files, printing out hexdumps of various sections as I went to check against the spec. The cell pointers in the btree leaf pages were missing, so I just tried reading them out in order from the cell data section and that worked (the pointers are really just an optimization to skip over cells when scanning). The table schemas are stored differently, but its a text format so its pretty easy to read (comma-separated columnName:type).

The record format was the most different part. Instead of type-tagged values like in sqlite, these records just had the column count and offsets of values. The offsets can be 1 or 2 bytes depending on the length of the record, which tripped me up for a bit. This was mostly figured out by looking at hexdumps of records. The values have to be interpreted based on the schema of the table the record is in.

The file format used little-endian for the record values, despite everything else in the file being big-endian. The schema in these files allowed array-of-vals and array-of-structs column types that I haven't decoded yet, but they don't seem to be on any of the important tables, mostly file metadata tables. There are some untyped binary columns containing an array of 64-bit floats, which was found by comparing against the csv of the same data.

Not counting some of the binary columns which I haven't decoded yet, the alternate version of the various sections were still pretty simple, so I was able to figure them out by eyeballing it. There weren't any numeric enums (outside of the ones in the sqlite spec) or flag masks that would have taken a while to decipher.

luckywatcher · on Dec 2, 2021

I had been working on this myself! I haven’t dug into SQLite to this level before, but found the header easy to correct. I got stumped when eyeballing the additional pages of information.

I really enjoyed learning more about b-trees and binary encoding formats taking this on. My primary exposure to those concepts has come from reading Designing Data-intensive Applications — so having something concrete to work on to test that knowledge was fun.

I’m excited to review your script. Great work!