Monday, 28 September 2009

Index your hard drive and find duplicates with M-Trees

Truth be told, I'm not a computer scientist... I'm a developer. I don't work in research, and it's not because I'm too dumb to think; I just prefer to build.
As a developer, I was curious to learn about AI, so during a trip to China I bought Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (it was not expensive!!).
Unluckily, this book is aimed at mathematicians, and I'm really, really too dumb to understand what's going on... (I understood some parts, but my mind had to turn off its creative mode.)

AI was too hard, so I decided to learn data mining instead, and I stackoverflowed a little bit.

Then I amazoned Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank, and I don't regret it!!!
I discovered that data mining is, to a large extent, just another name for a part of AI, and some sections of this book overlap with my previous one. The difference is that this book really is for developers. Don't wait, just buy it: if you want ideas, you'll get them!
Metric trees (M-trees) are not covered in this book, but the authors mention them; curiosity obliged, I googled.

In short, you just have to define a distance between two objects (not necessarily two numbers) that is a metric as defined on Wikipedia: it is never negative, it is zero only for identical objects, it is symmetric, and it respects the triangle inequality.
Then you can easily search for the nearest neighbour of an object, or for all the objects within a given range. And that is very, very quick: approximately O(log(n)), where n is the number of objects.
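To make the two kinds of query concrete, here is a minimal Python sketch. It is NOT an M-tree: it is a plain linear scan, O(n) per query, that returns the same answers an M-tree would give much faster by pruning with the triangle inequality. The names `nearest_neighbour`, `range_search` and `char_diff` are mine, and `char_diff` is just a toy metric on strings that I made up for the illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def nearest_neighbour(
    query: T, items: Sequence[T], distance: Callable[[T, T], float]
) -> Tuple[T, float]:
    """Return the item closest to `query` and its distance (linear scan)."""
    best = min(items, key=lambda item: distance(query, item))
    return best, distance(query, best)


def range_search(
    query: T, items: Sequence[T], distance: Callable[[T, T], float], radius: float
) -> List[T]:
    """Return every item whose distance to `query` is at most `radius`."""
    return [item for item in items if distance(query, item) <= radius]


def char_diff(a: str, b: str) -> int:
    """Toy metric: differing positions, plus the difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


words = ["dog", "cart", "bat"]
closest, dist = nearest_neighbour("cat", words, char_diff)
within_two = range_search("cat", words, char_diff, 2)
```

The point is that the search code never looks inside the objects; it only calls `distance`. That is why the same index can work for numbers, strings or files.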

What if these objects were the files on my hard drives, and the distance function were the Hamming distance between two files? Yeah, I would easily find similar or duplicate files on my hard drive! See you, I need to code NOW!
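Here is a rough Python sketch of what such a distance function could look like. Two assumptions of mine: I count differing bytes, not bits, and since the Hamming distance is only defined for sequences of equal length, I count every extra byte of the longer file as a difference. The name `file_hamming_distance` and the chunk size are my own choices.

```python
import os

CHUNK_SIZE = 64 * 1024  # read the files piece by piece, not all at once


def file_hamming_distance(path_a: str, path_b: str) -> int:
    """Count the byte positions at which two files differ.

    Assumption: the extra bytes of the longer file all count as differences.
    """
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    differing = abs(size_a - size_b)
    remaining = min(size_a, size_b)

    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while remaining > 0:
            n = min(CHUNK_SIZE, remaining)
            chunk_a = file_a.read(n)
            chunk_b = file_b.read(n)
            # Compare the two chunks byte by byte.
            differing += sum(x != y for x, y in zip(chunk_a, chunk_b))
            remaining -= n

    return differing
```

A distance of 0 means the two files have the same content, so they are duplicates; a small distance means they are similar. Note that this compares bytes position by position, so one byte inserted at the start of a file makes it look very different.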
