I work with webshops, and as such I work with .XLSX files a lot, and convert these to .CSV prior to import/export. It's the OS-independent format. The big standard. Also: the much bigger file! A 821KB XLSX is 87MB large as a CSV!! Considering the difference - even though they supposedly contain the same data, I thought I'd dig into the file format a bit, using this particular file as my example, and see how it really works and what they really contain.
I used to think XLSX files used some smart algorithms or formulas to effectivise the way they store data, like finding similar instances and grouping them together as one entity, or skipping delimiter symbols altogether, and instead using spaces or line breaks (which I assume take less space), but as it turns out it's nothing like that. They just use heavy compression, and if you change the filename you can see exactly what they contain - which is not just the worksheet itself (apparently only 8MB big, though, in XML format), but a lot of other resources, and even the image file for the script icon! Some very unnecessary things, I feel, but the file size does decrease drastically with compression.
Looking at the files, the worksheet I'm currently working with (it's a workbook with only one worksheet) contains these:
Folder - file.ext size/size compressed (in bytes)
docProps - app.xml 781/398 - core.xml 616/338 xl - drawings -- _rels --- drawing1.xml 292/190 --- vmlDrawing1.vml 292/190 -- drawing1.xml 7,027/978 -- vmlDrawing1.vml 9,808/1,150 - media -- image1.gif 42/42 (Hitchhiker's Guide to the Galaxy anyone?) -- image2.png 238/238 - printerSettings -- printerSettings1.bin 3,408/536 - theme -- theme1.xml 7,079/1,684 - worksheets -- _rels --- sheet1.xml.rels 612/240 -- sheet1.xml 8,284,198/715,170 - _rels -- workbook.xml.rels 698/244 - sharedStrings.xml 660,797/112,797 - styles.xml 16,808/2,203 - workbook.xml 543/337 _refs - .rels 588/245 [Content_Types].xml 1602/422
The styles are basically standard, so not much to look at there, but if we dig into the XML file we can see that each row is actually stored in a format as so:
25580 25581 25582 25583 25584 25585
Whereas in a CSV export, it's stored like this:
COL1;COL2;COL3;COL4;COL5;COL6;COL7 COL1 Entry; COL2 Entry;"Column Entry with script";
...and so fourth.
The latter has repeat values in all entries under the column header - too big to include here. In total, there are 6 repeat entries for column 2 through 7, and unique IDs in column 1. This is all stored in order, and plain text.
In the XML variant, the structure tags do take up more space initially, yet repeat entries are referenced by number within each field, rather than actually repeated one field at a time.
Conclusion: for worksheets with unique entries in all fields, the XML version will be both bigger in size, whereas the CSV has no such formatting, and takes less space by default. I assume most bigger worksheets do have repeat entries, though. That's just how these things usually work.
So, it appears this turned into a file reference and comparison between storage format techniques, but I hope you found it useful. That's about all I've learned about the inner workings of these particular files, and about all I feel I need to know. Tune in again for potential rare reports on similar self-studies! Until then.