Tuesday, March 1, 2011

How can I find all of the distinct file extensions in a folder hierarchy?

On a Linux machine I would like to traverse a folder hierarchy and get a list of all of the distinct file extensions within it.

What would be the best way to achieve this from a shell?

From stackoverflow
  • Try this (not sure if it's the best way, but it works):

    find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u
    

    It works as follows:

    • Finds all files under the current folder
    • Prints each file's extension, if it has one
    • Makes a unique, sorted list
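
For comparison, the three steps in the answer above (find the files, extract each extension, build a unique sorted list) can be sketched in Python. This is only a minimal sketch, not the answerer's method: the name `distinct_extensions` is made up for illustration, and `os.walk` lists every non-directory entry, whereas `find -type f` keeps regular files only.

```python
import os


def distinct_extensions(root):
    """Return a sorted list of the distinct extensions found under root."""
    found = set()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Like the regex: take the text after the last dot in the base name.
            _stem, dot, ext = name.rpartition(".")
            # Skip names with no dot and names that end in a dot.
            if dot and ext:
                found.add(ext)
    return sorted(found)
```

Calling `distinct_extensions(".")` from the top of the hierarchy returns the list; printing one entry per line gives output comparable to the pipeline.
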
  • Recursive version:

    find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort -u
    

    If you want totals (how many times each extension was seen):

    find . -type f | sed -e 's/.*\.//' | sed -e 's/.*\///' | sort | uniq -c | sort -rn
    

    Non-recursive (single folder):

    for f in *.*; do printf "%s\n" "${f##*.}"; done | sort -u
    

    I based this on this forum post; credit should go there.

    : you are going to execute bash for each file name you find??
    ChristopheD : Good point, changed the solution...
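
The totals pipeline in the answer above (`sort | uniq -c | sort -rn`) could be approximated in Python with `collections.Counter`. This is a rough sketch under stated assumptions: `extension_counts` is a hypothetical name, and unlike the sed pipeline, which prints the whole file name for a file without a dot, this version skips such files.

```python
import os
from collections import Counter


def extension_counts(root):
    """Return (extension, count) pairs under root, most frequent first."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            _stem, dot, ext = name.rpartition(".")
            if dot and ext:
                counts[ext] += 1
    # most_common() orders by descending count, much like `sort -rn`.
    return counts.most_common()
```
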
  • Find everything with a dot and show only the suffix.

    find . -type f -name "*.*" | awk -F. '{print $NF}' | sort -u
    

    If you know that all suffixes have three characters, then:

    find . -type f -name "*.???" | awk -F. '{print $NF}' | sort -u
    

    Or, with sed, show all suffixes of one to four characters. Change {1,4} to the range of lengths you expect in the suffix.

    find . -type f | sed -n 's|.*\.\([^./]\{1,4\}\)$|\1|p' | sort -u
    
    SiegeX : No need for the pipe to 'sort', awk can do it all: find . -type f -name "*.*" | awk -F. '!a[$NF]++{print $NF}'
    : And its output is also unique! Nice!
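
SiegeX's `!a[$NF]++` idiom prints a suffix only the first time it is seen, so the output is unique but keeps first-seen order rather than sorted order. The same idea, sketched in Python (`first_seen` is a hypothetical helper name, and the sample list is made up):

```python
def first_seen(items):
    """Yield each item the first time it appears, preserving input order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Duplicates are dropped, but nothing is sorted.
unique_suffixes = list(first_seen(["txt", "py", "txt", "c", "py"]))
print(unique_suffixes)  # ['txt', 'py', 'c']
```

If sorted output matters, the result still needs a final sort, which is what `sort -u` does in the other pipelines.
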
  • Since there's already another solution which uses Perl:

    If you have Python installed you could also do (from the shell):

    python -c "import os;e=set();[[e.add(os.path.splitext(f)[-1]) for f in fn]for _,_,fn in os.walk('/home')];print('\n'.join(e))"
    
  • None of the replies so far deal with filenames with newlines properly (except for ChristopheD's, which just came in as I was typing this). The following is not a shell one-liner, but works, and is reasonably fast.

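    import os
    import sys

    def names(roots):
        """Yield the base name of every file found under each root."""
        for root in roots:
            for _dirpath, _dirnames, basenames in os.walk(root):
                for basename in basenames:
                    yield basename

    # Collect the distinct suffixes; the set removes duplicates.
    sufs = set(os.path.splitext(x)[1] for x in names(sys.argv[1:]))
    for suf in sufs:
        if suf:  # skip files that have no extension
            print(suf)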
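
The script above relies on `os.path.splitext`, which behaves a little differently from the shell pipelines: the suffix keeps its leading dot, and a dotfile such as `.bashrc` counts as having no extension. A small sketch with made-up file names, including one that contains a newline:

```python
import os.path

# Hypothetical file names chosen to show the edge cases.
samples = ["photo.jpg", "archive.tar.gz", "README", ".bashrc", "odd\nname.txt"]

# Map each name to the suffix os.path.splitext reports for it.
suffixes = {name: os.path.splitext(name)[1] for name in samples}

for name, suffix in suffixes.items():
    print(repr(name), "->", repr(suffix))
```

The name with the embedded newline is still handled as a single file name here, whereas a line-based pipeline would see it as two separate records.
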
    
  • PowerShell: dir -recurse | select-object extension -unique

    thanks to http://kevin-berridge.blogspot.com/2007/11/windows-powershell.html

    GloryFish : Hey that's pretty cool, and very readable.
    Kevin Berridge : Thanks for linking to my blog!
