wget ENOTDIR patch
Friday June 26, 2009
When wget downloads recursively (archiving web content, for example) it tends to break on weird sites that serve content at a URI and also treat that same URI as a subdirectory.
Most of it stems from a shitty implementation of the file-existence check, file_exists_p() in utils.c. To summarize:
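int file_exists_p(const char *filename) {
    struct stat buf;  /* from <sys/stat.h> */
    return stat(filename, &buf) >= 0;  /* any failure, whatever the errno, reads as "does not exist" */
}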
^ The problem above is that certain error conditions are ignored. When stat() or access() fail they set the global “errno” variable to indicate the cause of the failure (ENOENT when nothing exists at the path, ENOTDIR when a component of the path is not a directory), but the function above doesn’t check this.
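To make the difference concrete, here is a minimal sketch of an existence check that looks at errno instead of collapsing every failure into "not there". The enum and the name classify_path() are made up for illustration; they are not part of wget or of the patch.

```c
#include <errno.h>
#include <sys/stat.h>

/* Hypothetical three-way result; not part of wget. */
enum path_status {
    PATH_EXISTS,   /* stat() succeeded */
    PATH_MISSING,  /* ENOENT: nothing exists at this path */
    PATH_BLOCKED   /* ENOTDIR, EACCES, ...: the path cannot be used as-is */
};

/* Classify a path by checking errno after a failed stat(). */
enum path_status classify_path(const char *filename)
{
    struct stat buf;

    if (stat(filename, &buf) == 0)
        return PATH_EXISTS;
    if (errno == ENOENT)
        return PATH_MISSING;
    /* Any other failure means the path is unusable, not merely absent. */
    return PATH_BLOCKED;
}
```

With a check like this, a path such as at/this/uri/second, where "uri" is a regular file, comes back as PATH_BLOCKED rather than looking like a file that simply hasn't been created yet.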
There are two scenarios here – Scenario 1:
1. wget downloads /at/this/uri/first
2. wget downloads /at/this/uri
The content /at/this/uri/first is created fine, with “uri” as a directory on the filesystem. /at/this/uri is also fine: wget sees through stat() that “uri” already exists, so it saves the content as /at/this/uri.1 instead.
Scenario 2:
1. wget downloads /at/this/uri
2. wget downloads /at/this/uri/second
The content /at/this/uri downloads initially, so “uri” is now a file on the filesystem. When wget then tries to save /at/this/uri/second, stat() fails and, because there’s no error handling, wget assumes the file simply doesn’t exist. As a consequence it does not try to rename anything, and the error that comes from writing out the file (ENOTDIR) gets swallowed.
Patch (for version 1.11.4) can be found here
The patch allows wget (in the second scenario) to download /at/this/uri.1/second, where uri.1 is a directory.
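As a rough illustration of that rerouting idea (this is not the code from the patch, and the function name reroute_around_file() is invented here), a sketch might walk the leading components of the path, find one that exists but is not a directory, and give it a numeric suffix:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, not taken from the wget patch. If a leading
   component of `path` exists but is not a directory, write a rerouted
   path (uri/second -> uri.1/second) into `out`.
   Returns 1 if rerouted, 0 if no reroute is needed, -1 on failure. */
int reroute_around_file(const char *path, char *out, size_t outlen)
{
    char prefix[4096];
    char suf[16];
    struct stat st;
    const char *slash = path;

    if (path[0] == '\0')
        return 0;

    /* Examine every prefix that ends just before a '/'. */
    while ((slash = strchr(slash + 1, '/')) != NULL) {
        size_t len = (size_t)(slash - path);
        if (len >= sizeof prefix)
            return -1;
        memcpy(prefix, path, len);
        prefix[len] = '\0';

        if (stat(prefix, &st) != 0 || S_ISDIR(st.st_mode))
            continue;  /* missing or a real directory: keep walking */

        /* prefix is a non-directory: try prefix.1, prefix.2, ... */
        for (int n = 1; n < 1000; n++) {
            int slen = snprintf(suf, sizeof suf, ".%d", n);
            int w = snprintf(out, outlen, "%s%s%s", prefix, suf, slash);
            if (w < 0 || (size_t)w >= outlen)
                return -1;

            out[len + slen] = '\0';  /* look at just "prefix.N" */
            int usable = stat(out, &st) == 0 ? S_ISDIR(st.st_mode)
                                             : errno == ENOENT;
            out[len + slen] = '/';   /* restore the full path */
            if (usable)
                return 1;
        }
        return -1;
    }
    return 0;
}
```

The sketch only computes the new name; the caller would still have to create the uri.1 directory before writing /at/this/uri.1/second into it.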