{
  "WorkItem": {
    "AffectedComponent": {
      "Name": "",
      "DisplayName": ""
    },
    "ClosedComment": "fixed in changeset 53391.  First binary:  v1.9.0.35",
    "ClosedDate": "2010-01-01T11:07:21.337-08:00",
    "CommentCount": 0,
    "Custom": null,
    "Description": "I use the DotNetZip library to store a large number of XML files in a ZIP file, \nwhich is then archived in a Subversion repository. Binary files are\nstored in SVN using a binary diff. \nIt would be great if the files in the ZIP file were stored \nsorted by file path, so that the binary diff is as small as \npossible (containing only the compressed changed \nfiles in the ZIP file + the header changes).",
    "LastUpdatedDate": "2013-05-16T05:31:57.097-07:00",
    "PlannedForRelease": "",
    "ReleaseVisibleToPublic": false,
    "Priority": {
      "Name": "Low",
      "Severity": 50,
      "Id": 1
    },
    "ProjectName": "DotNetZip",
    "ReportedDate": "2009-12-27T01:10:08.857-08:00",
    "Status": {
      "Name": "Closed",
      "Id": 4
    },
    "ReasonClosed": {
      "Name": "Unassigned"
    },
    "Summary": "Store files in zip file sorted by file path",
    "Type": {
      "Name": "Issue",
      "Id": 3
    },
    "VoteCount": 1,
    "Id": 9831
  },
  "FileAttachments": [
    {
      "FileId": 2551,
      "FileName": "ZipEntryList.cs",
      "DownloadUrl": ".\\2551"
    }
  ],
  "Comments": [
    {
      "Message": " I agree that it might be nice to have the zip entries sorted in the saved archive. \r\nBut in your case, a better option might be to diff the output of \"unzip -l\" or some other text representation of the zip. \r\n",
      "PostedDate": "2009-12-27T07:51:21.49-08:00",
      "Id": -2147483648
    },
    {
      "Message": "In my use case your solution does not work, since I unpack the needed files only to memory, never to disk\r\n(Subversion is extremely slow when it has to handle many files, even small ones).\r\nIn my copy of your library (1.8.x) I implemented the sorted output by replacing the List<ZipEntry> type\r\nof the _entries field (ZipFile class) with the attached class. Feel free to use it if you like.",
      "PostedDate": "2009-12-28T01:20:24.037-08:00",
      "Id": -2147483648
    },
    {
      "Message": "",
      "PostedDate": "2009-12-28T01:22:11.497-08:00",
      "Id": -2147483648
    },
    {
      "Message": "Thanks for the code. Did you consider using the SortedList type that is included in the .NET Framework? \r\n\r\nBecause of performance problems checking for uniqueness of names for the ZipEntry objects (See workitem 9596), in v1.9 I'm modifying DotNetZip to use Dictionary<String,ZipEntry>, rather than List<ZipEntry>.  SortedDictionary is the wrong thing in the general case, because not everyone wants sorting, and it can be expensive for zip files with many, many entries.  So there's a new property, SortEntriesWhileSaving, that if set to true, causes the sort to be performed just once, at the time of Save().\r\n\r\nThis change is in changeset 53130.\r\n",
      "PostedDate": "2009-12-28T10:35:55.677-08:00",
      "Id": -2147483648
    },
    {
      "Message": "I also looked at the SortedList type as a replacement for the List or Dictionary, but its ContainsKey method, at O(log n),\r\nis much slower than the equivalent ContainsKey method of the Dictionary type, which is O(1). Therefore I use (see my code) a dictionary to overcome the\r\nperformance problem you also described, plus a normal list that is always kept sorted by a modified Add method, which\r\nuses the BinarySearch method (O(log n)) to find the correct place to insert new entries.",
      "PostedDate": "2009-12-28T12:31:39.2-08:00",
      "Id": -2147483648
    },
    {
      "Message": "I looked at your changes made in changeset 53130. My first try at implementing the\r\nsort looked similar, but it was very slow with the large number of files I have\r\n(>150000, most are <1kB). The main reason is that once the zip file\r\nis already sorted, the Quicksort still costs O(n log n) without accomplishing much.\r\nThis was the reason for me to \"sort\" the list while inserting the entries into it.\r\nThis way I can do a simple comparison with the last entry\r\n\"(zipEntryCompare.Compare(zipEntry, zipEntryList[zipEntryList.Count - 1]) == 1))\"\r\nand append the new entry in case of success, and I have to do a search only if the\r\nnew item does not fit the established order. This resulted in a dramatic \r\nperformance improvement for (almost) completely ordered ZIP files.",
      "PostedDate": "2009-12-28T23:47:04.783-08:00",
      "Id": -2147483648
    },
    {
      "Message": "Taschna, interesting. Do you have a test harness that models the performance of the various scenarios?  It sounds like you have a particular scenario you'd like to support - sorted zip files with \"almost completely ordered\" sets of files.  That seems very specific; I don't think it's the 80% scenario.   I'm inclined to keep the balance that I struck with the Dictionary and an optional sort.  I'll do some further testing to understand the performance implications, varying the randomness of the input filenames, and also the number of entries in the zip.  Pending the outcome of those tests, I'll finalize the design.   If you can show me compelling evidence that your approach to sorting supports the more general case well, I'll include it in DotNetZip.  But right now I'm leaning the other way.\r\n\r\n",
      "PostedDate": "2009-12-31T11:50:45.023-08:00",
      "Id": -2147483648
    },
    {
      "Message": "My main concern about the way the SortEntriesWhileSaving feature is implemented right now is that\r\npeople will wonder why it is so slow if they do, for example, the following with your library:\r\n\r\n- create a ZIP file with several thousand files and save it sorted\r\n- open the ZIP file, delete one of the entries and save it sorted\r\n\r\nThe second save has to Quicksort the already sorted list and is\r\ntherefore surprisingly slow for the user.\r\n\r\nMy proposal would be, if you want to provide such a sorted save feature,\r\nto declare _entries as IDictionary<String,ZipEntry> and add a new constructor to\r\nthe ZipFile class that lets the user specify whether they want a\r\nsorted save. In the default case (no sort) you could initialize the _entries field\r\nwith a normal Dictionary<String,ZipEntry>, and in the sort case you could specify a class\r\nwhich uses the described method with the always-sorted list.",
      "PostedDate": "2009-12-31T17:00:49.233-08:00",
      "Id": -2147483648
    },
    {
      "Message": "I just ran some tests.  On my machine, adding 60,000 random entries into a Dictionary<String,Object> took .25s if unsorted, .55s if sorted once after all entries were added, and .55s if sorted during the add (via SortedDictionary).  Clearly there is a \"big\" difference between sorting and not sorting: the sorted version is 2x the cost.  But the difference between sort-as-you-go and sort-at-the-end is very small, effectively zero.   And even the high range of that cost is likely to be small in comparison to the cost incurred in compressing, maybe encrypting, and then writing zip entries to a file or stream.  Supposing your special handling with a binary search has a cost that is somewhere between 0.25s and 0.55s, the incremental cost of sorting is 0.3s.  You didn't provide timings, so I can only theorize about the performance of your custom approach.   Let's be very generous and estimate your custom approach incurs 0.15s for sorting, rather than 0.3s; in other words it takes half as much time as my straightforward sort-at-the-end approach.   I find it doubtful that people will think \"it is so slow\" when the difference is 0.15s for 60,000 entries.  Even if your custom approach takes no time at all, which is impossible, the total savings is still only 0.3s. No one is going to notice that, for 60,000 entries.   \r\n\r\nUsing 1.2m items, the sort cost becomes more significant.  Unsorted: 5.9s.  Sorted once: 15.3s (sort cost of 9.4s).  Sorted continuously: 19s (sort cost of 12.4s).   We might guess the cost for your custom approach would be between 5.9s and 19s. There may be some savings to be had, but again these figures will likely be small in comparison to the other costs involved in producing a zip file.  This is all for totally random (unsorted) input keys. \r\n\r\nI also timed the results for \"mostly sorted\" input keys, where 1 out of every 100 keys was out of order. In this case, for 60,000 items, the SortedDictionary was .47s (versus 0.55s for the random input), and the regular Dictionary with one sort after all adds was .57s (versus 0.55s).  For 1.2m items, the SortedDictionary was 11.8s (versus 19s for random input), and the regular Dictionary with one sort after all adds was 13.3s (versus 15.3s).  The time advantage of the sort-as-you-go approach (my rough approximation of your proposal) over sort-once-at-the-end (my current implementation) for mostly sorted input doesn't justify special handling for the mostly sorted case, because (a) the sort-as-you-go approach has a more substantial disadvantage for completely unsorted input, which in my estimation is the 60-80% case; and (b) the cost difference in the special case isn't substantial enough, in comparison to the compression and IO cost, to justify the additional code complexity. \r\n\r\nI hope this all makes sense. \r\n\r\nThe result of all this is that I'll stay with the current implementation, which does a sort at the end, only when necessary.  \r\n",
      "PostedDate": "2010-01-01T11:04:10.683-08:00",
      "Id": -2147483648
    },
    {
      "Message": "",
      "PostedDate": "2010-01-01T11:07:21.337-08:00",
      "Id": -2147483648
    },
    {
      "Message": "",
      "PostedDate": "2013-02-21T18:43:45.227-08:00",
      "Id": -2147483648
    },
    {
      "Message": "",
      "PostedDate": "2013-05-16T05:31:57.097-07:00",
      "Id": -2147483648
    }
  ]
}