My Understanding of Mahout Machine Learning

Mahout is the machine learning library used in the Hadoop ecosystem.


It consists of three primary features:

  1. Classification - learns from labeled data, then labels new data based on what it learned; e.g., email systems classify new messages as spam based on existing examples
  2. Clustering - explores data by grouping it into buckets based on common characteristics; used when the categories are not known or labeled up front. You can specify the number of clusters as well as min and max thresholds. Here are the three commands to accomplish this:
    1. mahout seqdirectory - converts a directory of text files into Sequence Files
    2. mahout seq2sparse - transforms the Sequence Files into vectors
    3. mahout kmeans - does the actual clustering (iterative, so it can be slow; there's also Fuzzy k-means)
  3. Recommendations - combines your local data with community data to determine the likelihood that you'll like something (this area has a lot of support in the Mahout ecosystem)
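
The three clustering commands above can be sketched as a pipeline. This is a minimal sketch against Mahout 0.7: the HDFS paths and the parameter values (10 clusters, 20 iterations) are hypothetical placeholders, not from a real run.

```shell
# 1. Convert a directory of text files on HDFS into SequenceFiles
mahout seqdirectory -i /user/demo/docs -o /user/demo/docs-seq

# 2. Transform the SequenceFiles into TF-IDF vectors
mahout seq2sparse -i /user/demo/docs-seq -o /user/demo/docs-vectors

# 3. Run k-means: -k samples random initial centroids into the -c directory,
#    -x caps the number of iterations, -cl assigns points to final clusters
mahout kmeans -i /user/demo/docs-vectors/tfidf-vectors \
  -c /user/demo/clusters-init -o /user/demo/clusters \
  -k 10 -x 20 -cl
```

Each step launches one or more MapReduce jobs, which is why the k-means step in particular can take a while on larger data.
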
With data growing to massive levels, machine learning is an effective tool for finding patterns in it.  However, it's not an exact science; it usually takes an iterative approach.  Most of the work involved is at the ETL level, preparing the data, which is usually done in Hadoop.

Machine Learning using Mahout is a multi-step process from start to finish: gathering data, ingesting it into Hadoop, running algorithms, indexing the results, and constructing readable output, perhaps in a visualization format for consumption.  It's an iterative process where the model gets updated from time to time, which means the data must be re-run against the new model for more accurate results.
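
As an example of the "running algorithms" step, Mahout 0.7 ships an item-based recommender job that reads preference data straight off HDFS. The input path and the choice of similarity measure below are illustrative assumptions, not a verified configuration:

```shell
# Input: a CSV of userID,itemID,preference triples already loaded into HDFS
mahout recommenditembased \
  -i /user/demo/ratings.csv \
  -o /user/demo/recommendations \
  --similarityClassname SIMILARITY_COOCCURRENCE \
  --numRecommendations 5
```

The output is itself just files on HDFS, which is why the indexing/visualization steps afterward matter for making the results consumable.
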

Here's a link to walk through a demo.

I have a few versions of Hadoop running on my local machine:

 After starting and connecting to the HDP 1.3 for Windows version from Hortonworks, I started the services:

Next, navigated to the Mahout directory:

Read the README.TXT file:

Opened the Mahout \bin directory:

Typed "mahout", and it spawned off a job:

And basically displayed a Help file with a list of Mahout commands:

Not sure what just happened; it appeared to kick off a job, so I ran a "hadoop fs -ls" to scan the folders:
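
For reference, a couple of ways to poke around HDFS for newly generated output (the /user/demo path is just an example, not a folder from this walkthrough):

```shell
# List the HDFS root and the user directories
hadoop fs -ls /
hadoop fs -ls /user

# Drill into a specific folder to check for fresh job output
hadoop fs -ls /user/demo
```
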


In the meantime, went to the Hortonworks site to verify the Mahout specs for HDP 1.3 for Windows:

It says Apache Mahout 0.7.0 is included in the bundle.  And apparently there is sample data in one of the Mahout subfolders:

Opened the web UI to drill down into the file system and see if any new files were generated:

The only folder with today's date is the MapRed folder; after drilling down, I didn't see anything that would have been generated by the Mahout sample job:

While I'm here, might as well check the Map/Reduce Administrator:

Well, I've gotten a little further in learning Mahout today.  I'll need to do some more research.  In order to run this in HUE, I may need to download and install HDP version 2.0 for Windows.  Running from a GUI is usually easier than the command line.  Having Hadoop available on the Windows platform sure is nice; you get to run DIR instead of ls.  I've been working with the DOS command line since the mid-1980s, so it definitely feels easier.  Everyone has their own personal preferences.
