I work with webshops, and as such I work with .XLSX files a lot, converting them to .CSV prior to import/export. It's the OS-independent format. The big standard. Also: the much bigger file! An 821 KB XLSX becomes an 87 MB CSV! Considering the difference, even though the two supposedly contain the same data, I thought I'd dig into the file format a bit, using this particular file as my example, and see how it really works and what these files really contain.
I used to think XLSX files used some smart algorithm to store data more efficiently, like finding similar values and grouping them together as one entity, or skipping delimiter symbols altogether and instead using spaces or line breaks (which I assumed take less space), but as it turns out it's nothing like that. They just use heavy compression, and if you change the file extension to .zip you can see exactly what they contain: not just the worksheet itself (apparently only 8 MB, in XML format), but a lot of other resources, and even the image file for the script icon! Some of it feels very unnecessary, but the file size does decrease drastically with compression.
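In case you want to peek inside without renaming anything, here's a minimal Python sketch (the filename "export.xlsx" is just a stand-in for whatever workbook you're poking at):

import zipfile

# An .xlsx file is just a ZIP archive, so the standard zipfile module opens it directly
with zipfile.ZipFile("export.xlsx") as xlsx:
    for name in xlsx.namelist():
        print(name)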
Looking at the files, the worksheet I'm currently working with (it's a workbook with only one worksheet) contains these:
Folder
- file.ext uncompressed size/compressed size (in bytes)
docProps
- app.xml 781/398
- core.xml 616/338
xl
- drawings
-- _rels
--- drawing1.xml 292/190
--- vmlDrawing1.vml 292/190
-- drawing1.xml 7,027/978
-- vmlDrawing1.vml 9,808/1,150
- media
-- image1.gif 42/42 (Hitchhiker's Guide to the Galaxy anyone?)
-- image2.png 238/238
- printerSettings
-- printerSettings1.bin 3,408/536
- theme
-- theme1.xml 7,079/1,684
- worksheets
-- _rels
--- sheet1.xml.rels 612/240
-- sheet1.xml 8,284,198/715,170
- _rels
-- workbook.xml.rels 698/244
- sharedStrings.xml 660,797/112,797
- styles.xml 16,808/2,203
- workbook.xml 543/337
_rels
- .rels 588/245
[Content_Types].xml 1,602/422
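The listing above can be reproduced with the same zipfile trick, since each archive member carries both its raw and its compressed size (again just a sketch, with a stand-in filename):

import zipfile

with zipfile.ZipFile("export.xlsx") as xlsx:
    for info in xlsx.infolist():
        # file_size is the raw size, compress_size is what it occupies inside the archive
        print(f"{info.filename}  {info.file_size:,}/{info.compress_size:,}")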
The styles are basically standard, so there's not much to look at there, but if we dig into the sheet XML we can see that each row is actually stored along these lines, with each value being a numbered reference rather than the text itself:
<row r="2"><c r="A2" t="s"><v>25580</v></c><c r="B2" t="s"><v>25581</v></c><c r="C2" t="s"><v>25582</v></c><c r="D2" t="s"><v>25583</v></c><c r="E2" t="s"><v>25584</v></c><c r="F2" t="s"><v>25585</v></c></row>
Whereas in a CSV export, it's stored like this:
COL1;COL2;COL3;COL4;COL5;COL6;COL7
COL1 Entry; COL2 Entry;"Column Entry with script";
...and so forth.
The CSV has repeated values in all the entries under the column headers (too big to include here in full). In total, there are 6 repeated entries for columns 2 through 7, and unique IDs in column 1. It's all stored in order, as plain text.
In the XML variant, the structure tags do take up more space up front, but repeated entries are stored once in sharedStrings.xml and referenced by number in each cell, rather than written out in full every time.
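To make that concrete, here's a rough Python sketch of how those numbered references resolve: sharedStrings.xml holds each distinct string once, and each cell marked t="s" in sheet1.xml stores only an index into that list (the paths match the listing above; the filename and the rest of the code are purely illustrative):

import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("export.xlsx") as xlsx:
    # Every distinct string in the workbook, stored exactly once
    shared = [si.findtext(f"{NS}t") or ""
              for si in ET.fromstring(xlsx.read("xl/sharedStrings.xml"))]
    sheet = ET.fromstring(xlsx.read("xl/worksheets/sheet1.xml"))
    for row in sheet.iter(f"{NS}row"):
        cells = []
        for c in row.iter(f"{NS}c"):
            v = c.findtext(f"{NS}v") or ""
            # t="s" means the cell value is an index into the shared strings list
            cells.append(shared[int(v)] if c.get("t") == "s" and v else v)
        print(";".join(cells))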
Conclusion: for worksheets with unique entries in all fields, the XML version will be bigger, whereas the CSV carries no such formatting and takes less space by default. I assume most bigger worksheets do have repeated entries, though. That's just how these things usually work.
So, it appears this turned into a file reference and comparison between storage format techniques, but I hope you found it useful. That's about all I've learned about the inner workings of these particular files, and about all I feel I need to know. Tune in again for potential rare reports on similar self-studies! Until then.
Comments
Surprised?? I was a bit baffled to find out that .doc files are smaller than .txt equivalents too, even though .doc contains formatting, fonts, and metadata. The compression does the trick! It's just raw versus compressed files. In fact, I've just used pure CSV files, so columns are separated by commas and rows by carriage returns, or, if I just need separate entries, everything goes on a single line delimited by commas. Might as well just stick with Excel spreadsheets (or perhaps better OpenOffice, LibreOffice, or if you're nerdy enough R... not sure how well they pack data though) and their extended table-modifying abilities.
Interesting! Hadn't noticed the difference between .txt and .doc. I started opting for .txt instead of .rtf as soon as I noticed the latter was so much bigger, but this...
Well, I just tested throwing together a .docx from a .txt I use, and this seems to apply only with larger files. For example: a 196 KB .txt file (ca 200,000 characters) took up just 124 KB as .docx (it works!), yet a .txt with just a three-letter word took just 3 bytes, whereas the .docx alternative took 9.74 KB. So: for any .txt file smaller than ca 20 KB I assume the formatting overhead outweighs what the compression saves, and up to at least 100 KB (which is a LOT for a .txt file) the difference should be slim... or hey, might as well test: cut out segments of a .txt file, mirroring the content in a .docx, to see where you start saving size with the latter. Seems I was right about that 20 KB mark! :) A 19.4 KB .txt was 21.4 KB as a .docx, and up around 30 KB the .docx starts getting a few KB smaller.
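If anyone wants to repeat that little experiment in code rather than by hand, here's a rough sketch using the python-docx package (purely my own assumption; the tests above were presumably done in Word itself) that saves truncated copies of a text file as both .txt and .docx and prints the resulting sizes:

import os
from docx import Document  # python-docx, an assumed tool for this sketch

text = open("sample.txt", encoding="utf-8").read()  # stand-in source file
for n in (1_000, 10_000, 20_000, 50_000, 100_000):
    chunk = text[:n]
    with open("chunk.txt", "w", encoding="utf-8") as f:
        f.write(chunk)
    doc = Document()
    doc.add_paragraph(chunk)  # plain text, no formatting applied
    doc.save("chunk.docx")
    print(n, os.path.getsize("chunk.txt"), os.path.getsize("chunk.docx"))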
The trade-off between the saved file size and the extra start-up time caused by said compression though... that would be interesting to research a bit!
Yeah, if you need to just edit a cell it's much easier via .CSV, since it's basically just .txt format anyway. Can be opened with Notepad or similar too. Can't carry formatting though, so for larger spreadsheets the .xlsx has some benefit! And macros. Formulas. Useful stuff.
I've had text files up to a GB, but at that point they're basically just meant to be interpreted as database-type files. Like you said, there really is no fundamental difference between a plain text file and a .csv file.
Don't even dare use Excel to open a file that's over 100 MB though LOL! I don't think Excel was ever intended to handle big-data work; even after all these years they've never really improved its file buffering capabilities. For sorting & filtering, I use R, and/or write my own sorting/filtering algorithms.
Right, yeah, I've had pretty big database exports too, in .sql format, though that's really the same thing as well...
Apart from .CSV files then. :) I gotta try that sometime, wonder what'll happen... R hmm... you mean the language R, R Project? Do you have any graphical interface for that too? Seems to be a bunch of programs built around it.
I use R now, but for a different purpose... back then I had only used it once, for sorting a very large number of entries.
I think R by default comes with RGUI, which is decent, but RStudio is really nice.
Ah man I didn't remember you mentioned R here back in the day too. Interesting.
But for what purpose, I wonder? :P Very curious as to what kind of data you dabble with here.