I've been playing with some Fuzzy Logic code.
Basically, you tell it to match on DomainName (from the user's email address) plus their Country and it looks for matches based on that.
Then comes the Fuzzy part: you give it the Company name to search for, and it looks for matches based on a similarity score, expressed as a percentage.
-- Jaro-Winkler returns a value between 0 and 1; the closer to 1,
-- the more similar the strings are. This variable allows us to ignore
-- matches with a lower score.
@SimilarThreshold = 0.825;
So you set the threshold to 0.825, and it only returns rows whose similarity score is equal to or above that value.
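For illustration, here is a rough Python sketch of the idea (not the T-SQL behind this post). It assumes each row is a dict with hypothetical keys `domain`, `country` and `company`. Rows are first grouped on an exact domain plus country match, then company names inside each group are scored with a plain textbook Jaro-Winkler and filtered by the threshold.

```python
SIMILAR_THRESHOLD = 0.825  # same cut-off as the snippet above


def jaro(s1, s2):
    """Textbook Jaro similarity: 0.0 (nothing in common) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they sit within this distance.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def find_matches(left, right, threshold=SIMILAR_THRESHOLD):
    """Exact match on (domain, country), then fuzzy match on company name."""
    index = {}
    for row in right:
        key = (row["domain"].lower(), row["country"].lower())
        index.setdefault(key, []).append(row)
    results = []
    for row in left:
        key = (row["domain"].lower(), row["country"].lower())
        for cand in index.get(key, []):
            score = jaro_winkler(row["company"].upper(), cand["company"].upper())
            if score >= threshold:
                results.append((row["company"], cand["company"], score))
    return results
```

Grouping on domain and country first means the fuzzy comparison only runs against a handful of candidates per row, rather than every row against every other row.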
I've been running the code against 375k rows, and it takes about half an hour, give or take.
Once it returns results, I can see what percentage of rows matched and adjust the threshold accordingly.
Actually, I'd prefer to keep the threshold higher rather than lower, to keep the matching airtight.
However, once I have the figures in place, I can begin testing lower thresholds, as well as doing some Fuzzy matching on other fields, like IP Address, Address, etc.
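As a small sketch of that tuning step, the Python below reports the share of rows that would count as matched at several candidate thresholds. It assumes you have already collected the best similarity score for each row (with `None` where no candidate was found). The function names and the sample scores are made up for illustration.

```python
def match_rate(best_scores, threshold):
    """Fraction of rows whose best score is at or above the threshold."""
    if not best_scores:
        return 0.0
    hits = sum(1 for s in best_scores if s is not None and s >= threshold)
    return hits / len(best_scores)


def sweep(best_scores, thresholds):
    """Match rate at each candidate threshold, to compare them side by side."""
    return {t: match_rate(best_scores, t) for t in thresholds}


# Made-up scores for five rows; None means no candidate was found.
sample_scores = [0.95, 0.83, 0.80, None, 0.60]
rates = sweep(sample_scores, [0.90, 0.825, 0.75])
```

Since the scores are computed once and reused, trying a lower threshold does not require rerunning the half-hour matching job.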
This really opens up a lot of doors when trying to match disparate data sources across servers that otherwise couldn't be mashed together.
Fun stuff for a Friday!