
23 March 2006 @ 10:18 am
Command-line file searching  
Just found an excellent article on using Mac OS X’s Spotlight file index from the command line. The usual options are the very slow:
find / -name '*Oracle*'

or the not-up-to-date:
locate Oracle

Both of these just work by name. If I don’t know for sure that there’s an Oracle in the name then I have to resort to:
grep -r Oracle /*

That sucks. But on Mac OS X you can just do:
mdfind Oracle

Now, I’ve done a lot of work with Oracle, so this still takes a while (5 seconds on a second run). The bigger problem is that it returns 942 records, including text files, Excel spreadsheets and e-mail messages. I could pipe the results through grep to narrow them down by filename, but if I know I’m looking for a spreadsheet I can get a lot cleverer (and here’s where find and locate just don’t cut it).
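For what it's worth, the grep route might look something like this. It's only a rough sketch: only_ext is a helper name made up for illustration, and the real search is only attempted where mdfind is actually installed.

```shell
#!/bin/sh
# only_ext: hypothetical helper. Keeps only the paths on stdin that end in
# the given extension, ignoring case.
only_ext() {
    grep -i "\.$1\$"
}

# Only run the real search where mdfind exists (i.e. on Mac OS X).
if command -v mdfind >/dev/null 2>&1; then
    mdfind Oracle | only_ext xls
fi
```

This still only filters on the name, so a spreadsheet saved without the .xls extension would slip through.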

mdfind works with the whole file: not only its content but also the format-specific metadata attached to it. That means I can look for spreadsheets that I wrote. But how do I find out what the metadata fields are? The best way is by example, so I take an existing Excel file that I know I wrote and run it through mdls:
kMDItemAttributeChangeDate     = 2006-03-21 17:56:41 +0000
kMDItemAuthors                 = ("Gareth Boden")
kMDItemContentCreationDate     = 2006-01-20 07:12:01 +0000
kMDItemContentModificationDate = 2006-03-21 17:56:40 +0000
kMDItemContentType             = "com.microsoft.excel.xls"
kMDItemContentTypeTree         = ("com.microsoft.excel.xls", "public.data", "public.item")
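If all you want is the list of field names to query on, something like the following might do. It's a sketch rather than anything official: attr_names and fields_of are made-up helper names, and it assumes mdls prints one "name = value" pair per line as in the listing above.

```shell
#!/bin/sh
# attr_names: hypothetical helper. Prints the field name (first column) of
# every "name = value" line read from stdin; other lines are skipped.
attr_names() {
    awk '$2 == "=" { print $1 }'
}

# fields_of: hypothetical wrapper that runs mdls on one file and lists
# only its field names. Needs mdls, so Mac OS X only.
fields_of() {
    mdls "$1" | attr_names
}
```

On a Mac, fields_of followed by the path to a spreadsheet should list just the kMDItem names for that file.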

OK, so I can revise my Spotlight search to look for spreadsheets in my home directory that I created and that mention Oracle:
mdfind -onlyin /Users/gareth "kMDItemAuthors = 'Gareth Boden' && kMDItemContentType = 'com.microsoft.excel.xls' && kMDItemTextContent = 'Oracle'"
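Typing that query out every time gets old, so here is one way it could be wrapped up. Treat it as a sketch: spreadsheet_query is a hypothetical helper name, it reuses the three fields from the query above unchanged, and it makes no attempt to escape quotes inside the author name or search term.

```shell
#!/bin/sh
# spreadsheet_query: hypothetical helper. Builds the query string for
# Excel spreadsheets by a given author ($1) that mention a given term ($2).
# Assumption: neither argument contains a single quote.
spreadsheet_query() {
    printf "kMDItemAuthors = '%s' && kMDItemContentType = 'com.microsoft.excel.xls' && kMDItemTextContent = '%s'" "$1" "$2"
}

# Only run the real search where mdfind exists (i.e. on Mac OS X).
if command -v mdfind >/dev/null 2>&1; then
    mdfind -onlyin "$HOME" "$(spreadsheet_query 'Gareth Boden' 'Oracle')"
fi
```

Using $HOME rather than a hard-coded /Users/gareth is just a convenience so the same lines work for any account.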

Bingo! Which reminds me of another neat feature in Mac OS X—content types.

Not content with MIME types and file extensions and creator and file types, Apple decided to invent yet another classification scheme. But this time, they did a good job. Rather like MIME, it’s hierarchical (but to more than two levels, which is better). Unlike MIME (and like other smart ideas like Java package names) it piggybacks onto DNS registration to avoid centralised registration. Apple manage the public namespace (and the com.apple one) but the rest is up for grabs. From my example before, you can see that an Excel spreadsheet is also a public.data and a public.item content type. My AAC grab of Gorillaz’ DARE is:

And I can use those to search for all MPEG-4 audio, all audio files, etc. Very cool.
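A search on the hierarchy might look roughly like this. It's a sketch under a couple of assumptions: type_query is a made-up helper name, the query uses the kMDItemContentTypeTree field from the mdls listing earlier, and public.audio is assumed to be the name of the general audio type.

```shell
#!/bin/sh
# type_query: hypothetical helper. Builds a query matching any file whose
# content type tree contains the given type name.
type_query() {
    printf "kMDItemContentTypeTree = '%s'" "$1"
}

# Only run the real search where mdfind exists (i.e. on Mac OS X).
# Assumption: public.audio is the type that covers all audio files.
if command -v mdfind >/dev/null 2>&1; then
    mdfind "$(type_query public.audio)"
fi
```

Because the match is against the whole tree rather than the exact content type, the more general the type name passed in, the more files come back.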

Is there anything like Spotlight for Linux? Is it part of Darwin’s open source stuff?
jarel on March 23rd, 2006 11:35 am (UTC)
There's Beagle, which I think does much the same thing (and I think it's been around in various incarnations since a while before Apple announced Spotlight).
tapina on March 23rd, 2006 11:50 am (UTC)
Looks cool.. I didn't know there was a C# port of the excellent Lucene, either.

I take it inotify is the kernel's way of telling user processes that something's been updated? And extended attributes in filesystems permit the addition of file metadata? Is there a standardised schema for file metadata (like the "kMDItemAttributeChangeDate" stuff mentioned above)?
jarel on March 23rd, 2006 05:07 pm (UTC)
Re: Beagle
Yep, inotify is a mechanism (in recent kernels) for getting file change notifications - it's a replacement for the older 'fam' daemon, avoiding all the polling overheads etc.

No idea about the metadata, though!