How to copy a directory tree from one place to another preserving permissions (and skipping erroneous directory entries)

I already documented how to use tar and ssh to copy a directory tree to another host. This time let's copy a filtered subset of a directory to another (local) directory.

Let's start in the middle:
find . \( -type l -exec sh -c 't="$(readlink "$1" 2> /dev/null)" && [ -n "$t" ] && [ -e "$1" ]' sh {} \; -o -type f \) -size -2097152 -print0 2>> /tmp/error.log | tar -cf - --ignore-failed-read --ignore-command-error --null -T - 2>> /tmp/error.log | tar -xpf - --ignore-zeros -C /mnt/target 2>> /tmp/error.log

The above command finds all regular files and valid symbolic links smaller than a gigabyte (find's -size counts 512-byte blocks by default, so 2097152 blocks is 1 GiB) and copies them from the current working directory to /mnt/target. It appends every entry it could not read or had any problem with to a logfile (/tmp/error.log). I used this particular command line to back up the contents of a read-only mounted filesystem that had become corrupted and contained a few invalid entries (e.g. files with bogus sizes of several gigabytes). Obviously I didn't want to copy such "bogus files", so I restricted the list to the files that are probably "good" and worth saving. The readlink check is there to skip bogus symbolic links as well.
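To see what the symlink filter keeps and drops, here is a small sketch that runs only the find part in a scratch directory. It is an adapted form of the test, not the exact expression above: it passes the name to sh as "$1" and checks the link itself with -e. The file names (real.txt, good.lnk, dangling.lnk) are made up for the demonstration.

```shell
# Scratch directory with one regular file, one valid and one dangling symlink.
d=$(mktemp -d)
printf 'data\n' > "$d/real.txt"
ln -s real.txt "$d/good.lnk"
ln -s missing.txt "$d/dangling.lnk"

# Keep regular files, and symlinks whose target is non-empty and exists.
kept=$(cd "$d" && find . \( -type l \
    -exec sh -c 't=$(readlink "$1" 2>/dev/null) && [ -n "$t" ] && [ -e "$1" ]' sh {} \; \
    -o -type f \) -print | sort | tr '\n' ' ')

echo "$kept"
rm -rf "$d"
```

The dangling link is filtered out because -e follows the link and finds nothing at the other end.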

The filesize check is of course far from perfect, but choosing a proper size limit for find might get you close enough to distinguish the effectively valid directory entries from the corrupted ones.
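The size limit is easy to get wrong because of units, so here is a minimal sketch of how the numbers work out. Without a suffix, -size counts 512-byte blocks and rounds file sizes up to whole blocks. With GNU find, a limit such as -size -1G is a trap: sizes are rounded up to whole gigabytes first, so it only matches empty files. The scratch files small and large are hypothetical.

```shell
# 2097152 blocks of 512 bytes each is exactly 1 GiB.
limit_bytes=$((2097152 * 512))
echo "$limit_bytes"

# Same mechanism on a small scale: "-size -4" means fewer than 4 blocks (2048 bytes).
d=$(mktemp -d)
head -c 100 /dev/zero > "$d/small"    # rounds up to 1 block
head -c 5000 /dev/zero > "$d/large"   # rounds up to 10 blocks
matched=$(find "$d" -type f -size -4 -exec basename {} \;)
echo "$matched"
rm -rf "$d"
```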

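The -print0 switch of find and the --null switch of tar make sure that even the most exotic file and directory names (e.g. the ones containing a whitespace character or a newline) are handled properly. The --ignore-failed-read switch tells tar to carry on instead of failing when it hits a file it cannot read. The --ignore-zeros switch of the second tar command makes it skip blocks of zeros in the archive instead of treating them as the end of the archive. The -p switch of the second tar command makes sure file and directory permissions are preserved.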
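The effect of -print0 and --null can be checked with a small experiment. The sketch below assumes GNU tar and uses made-up file names, one with a space and one with an embedded newline, and copies them between two scratch directories.

```shell
src=$(mktemp -d)
dst=$(mktemp -d)
nl_name=$(printf 'with\nnewline.txt')   # a name containing a newline

printf 'a\n' > "$src/plain.txt"
printf 'b\n' > "$src/with space.txt"
printf 'c\n' > "$src/$nl_name"

# NUL-separated names survive the trip from find to tar unchanged.
(cd "$src" && find . -type f -print0 | tar -cf - --null -T -) | tar -xpf - -C "$dst"

# Count copied files by counting NUL terminators, which is newline-safe.
copied=$(find "$dst" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' ')
content=$(cat "$dst/with space.txt")
echo "$copied files copied"
rm -rf "$src" "$dst"
```

With newline-separated names (plain -print and no --null) the third file would be read as two names that do not exist.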

Note that I do all the heavy lifting to determine "bad" filesystem entries because tar is quite sensitive when it comes to invalid files. E.g. if it runs into a "bad" symbolic link, it simply exits (or segfaults) and you'll end up wondering why it did not copy the whole directory tree. To debug issues like this you can create a verbose log of all the files it reads by supplying the -v and the --index-file switches (the latter specifies the filename for the verbose log).
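As a rough sketch of that debugging approach (assuming GNU tar, with hypothetical file names one and two), the verbose listing of the archiving tar can be sent to its own file. If tar dies halfway, the last line of that file names the last entry it got to.

```shell
src=$(mktemp -d)
dst=$(mktemp -d)
log=$(mktemp)
printf 'x\n' > "$src/one"
printf 'y\n' > "$src/two"

# -v lists every entry as it is archived; --index-file redirects that list to $log.
(cd "$src" && find . -type f -print0 | tar -cvf - --index-file="$log" --null -T -) \
    | tar -xpf - -C "$dst"

logged=$(sort "$log" | tr '\n' ' ')
echo "$logged"
rm -rf "$src" "$dst" "$log"
```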

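P.S.: The seemingly complex validity check on symbolic links is there because on failing filesystems I've seen "symlinks" that readlink accepted (it returned a zero exit status, meaning the symlink is OK) even though the resolved path it printed to stdout seemed to be an empty string. I say "seemed" because test -e $(readlink filepath) returned zero as well! That last result is probably less mysterious than it looks: with an empty, unquoted substitution the command collapses to test -e, and a one-argument test only checks that its argument (here the string "-e") is non-empty, so it always succeeds.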
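The behaviour of test with an empty, unquoted expansion can be reproduced without a broken filesystem. In this small sketch the variable t stands in for an empty readlink result:

```shell
t=""   # stands in for an empty result from readlink

# Unquoted: $t expands to nothing, so this runs "test -e" with one argument,
# which only checks that the string "-e" is non-empty.
if test -e $t; then unquoted=exists; else unquoted=missing; fi

# Quoted: test receives an empty file name, which cannot exist.
if test -e "$t"; then quoted=exists; else quoted=missing; fi

echo "unquoted: $unquoted, quoted: $quoted"
```

This is why the check in the find command quotes the variable and tests for a non-empty string explicitly.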