So who are the early adopters of #Hadoop?

Who should be learning Hadoop?

DBAs?  They administer databases, do backups and restores, grant / deny permissions and make sure the database is up and running efficiently.  Many do not know how to write SQL.  Or .NET.  Or Java.  Or Python.  So why should they be excited about learning new skills?

Programmers?  I've always been surprised how many programmers don't know SQL.  However, a programmer is concerned about front end visualization, middle tier functionality or connectivity to the database.  They must know security, networking, authentication and client side scripting.  I don't see many programmers making a conscious effort to learn Hadoop.

Project Managers?  They manage projects.  They should know the basics, but don't expect them to be installing Hadoop clusters any time soon.

Business Analysts?  They understand the business and they understand technology.  They are the go-betweens / middlemen.  Same thing: I don't see many writing PIG or HIVE scripts, although they may consume the data via reports.

Report Writers?  I could see some report writers dabbling in Hadoop.  They write reports against data, so why not Big Data?  Except why would they want to architect Hadoop?  Do they know ETL and how to move data through the system?  Can they program in real languages?  Maybe.  Maybe not.

Database Developers?  I see this group as the most active in learning Hadoop.  They know data.  Probably know the business.  Maybe they can code in something besides SQL.  Perhaps they have the curiosity to stand up a cluster.

Supervisors / Managers?  They stopped coding long ago.  They are up to their ears in politics, paperwork and a mild form of babysitting.  They may not want to risk their careers in jumping head first.  Or the time.  Or the resources.  Or the capital.  Or the blessing from Execs, to get started with Hadoop.

Quality Assurance?  I find this group to be somewhat isolated.  In my 20 years of coding I've worked at only one or two places that had dedicated QA.  I'm sure they'll exist in Hadoop land, except they will not be the drivers.

What I'm trying to point out is who would become early adopters of Hadoop.

Given the complexity, the number of projects within the ecosystem, the rapid change and, most of all, the number of skills required, who exactly is the main target audience for Hadoop?

And even if a group or segment had some or most of the skill sets, and even if some actually had the desire, I see many people steer clear of jumping into Hadoop for the reasons mentioned above.  And most of all, they fear change.  They are comfortable in their knowledge set.  People run from change more than heights and public speaking.  You want to scare someone?  Tell them they must learn all new skills, their old job is going away, and then let them know how much of a learning curve is involved.  Ha!  Funny but true.

I have:
  • Database Admin experience
  • SQL experience
  • Report Writing experience
  • Database Development experience 
  • Data Warehouse experience
  • Programming experience 
  • Business Intelligence experience
  • Project Management experience
  • Supervisor experience
According to my stats, I should be an early adopter of Hadoop.  Except not everyone has a wide skill set.  So who are the early adopters of Hadoop?  What is the typical skill set?  From what pool is that skill set coming?


Install Mahout, RHadoop, Configure Map Reduce Job

Found a good blog post on installing Hadoop Mahout and RHadoop:


So I got out my reference for Linux VIM editor and started down the path:


And off we went.

Installed Mahout, then installed the streaming software.  Ran into an issue because I skipped a step, adding the "export" commands for the environment variables, which took an hour to troubleshoot.

Installed files:

Ran the RHadoop example from the blog:

Found the pg100.txt file on the web, downloaded it and then uploaded it into the HDFS file system:

Uploaded file into File System:

And the GUI file system:

Typed in the code for Map Reduce job in VIM:

Tried running it, found a missing comma and found I had specified the incorrect JAR file in the .bash_profile.  Corrected both, removed the /Data/Output folder and reran:

 Output folder:

And the browser view:

So it seems Streaming is working as well as RHadoop.  Thanks again for the great tutorial:
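
For anyone who hasn't seen what a Map Reduce job actually does, here's a minimal sketch in plain Python.  To be clear about the assumptions: the tutorial's job was typed in R and ran on the cluster, and the post doesn't say what the job computed, so the word count below (and the function names map_words, reduce_counts and word_count) is purely my illustration.  It runs in one process on a couple of sample lines and doesn't touch Hadoop at all.

```python
from collections import defaultdict
import re


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)


# Hypothetical sample input standing in for a text file such as pg100.txt.
sample = ["To be, or not to be", "that is the question"]
counts = word_count(sample)
```

On a real cluster the map step runs in parallel across blocks of the file and the framework groups the pairs by key before the reduce step; the sketch only shows the shape of the two functions.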


Hortonworks Ambari on Sandbox 2.0 VM

Today we downloaded the Hortonworks Sandbox 2.0 VM for VMware:


Took about an hour to download on wireless internet.

First step is to load the VM into VMware Player:

Then, set the Network Adapters:

Logged into CentOS: Username: root Password: hadoop

Hadoop Registration Form:

First Web Page:

And now HUE the GUI web based navigation page:

A new feature, Ambari, gives a warning to bump up the memory:

Bump memory to 4096:

Enable Ambari in the web based GUI:

Ambari is now started:

Log in to Ambari: Username: admin Password: admin

And we're in:

Nice visualization dashboard.  If you're accustomed to the command line, this is really great!  Here's a screenshot of the NameNode Dashboard:

And Summary Dashboard:

And HIVE screenshot:

And new feature Security:

And High Availability:

And Users:

And Ambari Summary:

And finally Heatmaps:

This is quite an awesome improvement: centralized monitoring and management of Hadoop systems.

And that should give a good look and feel of Ambari and how you can get up and running on the VM in a short amount of time.

Now to explore the VM further...with HIVE (Stinger), PIG, SQOOP, HCatalog, WEBHDFS, OOZIE, ZooKeeper and more!


3 Technologies to Learn 2014

My opinion on the top 3 things to learn for 2014:

  1. Data (Hadoop)
  2. AI Machine Learning
  3. Mobile
Here's why:
  • Data isn't getting any smaller.
  • Algorithms trump human interpretation for:
    • accuracy
    • speed
    • volume
    • no down time, coffee breaks, vacation, health insurance, pensions
    • adapt over time
    • no bias
  • Seems everybody's connected to the web somehow.
  • Increasing number of automated jobs.
  • Increase in lower paying jobs.
  • Baby boomers at the tail end of their careers.
  • New generation born into a Technology world.
  • Mashing of disparate data sets.
My only question: when all the jobs have been automated, who's going to have any money to buy anything?  Guess we'll cross that bridge when we come to it.

What's Going on in the Data Space?

Data to information.  It's been around for a while.  So why all the hype?  Actually seems like layers upon layers of hype.  In search of the golden nuggets of insight and wisdom.  Propelled by success stories of reducing costs and increasing sales and process efficiency.

I think there should be a website where actual customers can show off their actual accomplishments of finding actual insight and what the total ROI was for their investment.  Yes, companies are using new technologies and winning, there's no question.  But selling it to everyone in hopes of finding a pot of gold at the end of the rainbow, that's fuel for the current bubble called the "Big Data Bubble".

I've posted on this subject before:

Silicon Valley Data Rush of 2014

I personally don't know many companies using Hadoop in a production environment.  And if you look at the skill sets listed for potential hires, talk about delusional: they're asking for every skill set imaginable.  Hadoop in itself is a full time job: understanding the Projects, their integration, Architecture, Versions, Vendors, writing code, and then keeping up with the changes / new features added about every other day.  Then throw in multiple years of Business Intelligence.  Then throw in Java object oriented coding, then Python, Ruby, Project Management, then Business acumen, then a PhD, Statistician, Mathematician.  I couldn't imagine somebody who's got all those skills.

I think the shift has already occurred.  Traditional reporting is passé, Big Data has got everyone's attention, but the people changing the world are into Artificial Intelligence, pure and simple.  And they are creating a digital surveillance society.  I blogged about it here:

Future Prediction of Glass House Living

You could think of it as a Digital Cage where your privacy no longer exists.  Your job will be automated perhaps.  Robots will be working alongside you.  The Hybrid Homo Roboticus, a new species being formed.  Like when the Neanderthals and Homo sapiens lived side by side: the smaller framed, nimble, adaptable species won and the other one eventually got wiped out.  Blog post here:

Opportunity + Threat = Change 

See, my Anthropology degree finally paid off...

I'm interested in Big Data, I've got some good high level knowledge, a proof of concept and some training from Cloudera and some experimentation with Hortonworks and HDInsight, blogged about it here:

Hadoop Basics 

My father worked for IBM for 34 years, many of them in Poughkeepsie, NY.  He had friends who worked in the labs as researchers.  He said they were eccentric and didn't have to wear white shirts and blue suits, they made their own hours and they thought of bizarre stuff.  I've always wanted to work in a lab doing research, like a think tank.  Maybe.



Centralized Data Science - Selling Pre Interpreted Insights for a Fee

Data is the raw material.
Information is the final output.
Everyone is so concerned with converting Data into Information.
Because that's the hot thing right now.
And because people can see the value.
And it has a ton of media press.
Which is attracting venture capital.
In hopes of making some mega bucks.

Let's fast forward a few years.
What happens when almost every company is turning data into information?
There will be tremendous data, tremendous information leading to tremendous insight.
At some point we'll need a way to organize this information.
Perhaps insights will be a commodity.

I've got this subset of information on this particular subject.
What if I want to offer that information as a service?
You want to know everything about the woolly mammoth?  Connect to an informational hub, insert your request, out comes knowledge.  You do with it what you choose.

Companies will be scouring the world, learning new things.
What will they learn?

Facts.  Opinions.  Past events.  Present events.  Future events.  History.  Subject matter experts.  Physics.  Math.  Astronomy.  Anthropology.  Comics.

I'm suggesting a collection of information providers to replace the newspapers.
Dealers of knowledge.  Dispensed electronically, consumed publicly and devoured on your terms.

A distributed knowledge center.  And your data can be certified, grade A, top quality facts.
e.g. What is the boiling point of water?  How many miles to the moon?  When was a person born, where, to whom?  etc.

In order to get your knowledge certified, it must pass some organizational approval.  Perhaps for a fee, 100 facts approved for $10.

Or it could be UN-certified meaning you are not getting certified facts, perhaps opinions, theories or interpretations.  "I think so and so will win the election.  Hamburgers are my favorite, although greasy.  Wine is good for you.  Cholesterol is bad for you, etc."

People can accumulate as much or little knowledge as possible.  Learn at your own rate.

Of course business could leverage this Public Knowledge base.  Purchase demographics on specific knowledge.  Why go through the effort of mining and interpreting your own data to form insight, why not purchase pre-calculated insights?  For a fee.

That way you could centralize your Data Scientists and have them do the heavy lifting on a variety of data.

Kind of like a production line, here's some data, tell me everything there is to know, we'll package it and sell it.

Centralized Data Science.  Coming to a planet near you.

T-SQL CTE Recursive Update DimDate

I was asked to update the OPSMTDValue field in the DimDate table.
The OPSMTDValue is the month-to-date running total: for each day, the sum of the OPS value from the first day of that month through that day.
Here's the query in T-SQL:

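-- Reconstructed: the CTE header, parts of the column lists and the SET clause
-- were lost from the post, so everything marked "assumed" is a best guess.
WITH CTE_OPSMTD (date_date, OPSVALUE, mtdopsvalue, UnionType) -- column list assumed
AS (
       -- First day of each month: the MTD value is just that day's OPS value
       SELECT FirstDayOfMonth AS date_date
             ,opsvalue                  -- assumed
             ,opsvalue AS mtdopsvalue   -- assumed
             ,0 AS UnionType
       FROM dimdate
       WHERE opsvalue <> 0
             AND date_date = FirstDayOfMonth
       UNION ALL
       -- Every day of the month: sum OPS value from the first of the month through that day
       SELECT d.date_date
             ,d.opsvalue                -- assumed
             ,(
                    SELECT Sum(Opsvalue)
                    FROM dimDate
                    WHERE opsvalue <> 0
                           AND date_date >= d.FirstDayOfMonth
                           AND date_date <= d.date_date
              ) AS mtdopsvalue
             ,1 AS UnionType
       FROM dimdate d
       WHERE opsvalue <> 0
             AND d.date_date >= d.FirstDayOfMonth
             AND d.date_date <= d.LastDayOfMonth
       -- original GROUP BY date_date dropped, assuming one row per date in DimDate
)
UPDATE dd
SET dd.OPSMTDValue = c.mtdopsvalue      -- SET clause assumed
FROM dimdate AS dd
JOIN CTE_OPSMTD AS c ON c.date_date = dd.date_date
       AND c.OPSVALUE = dd.OPSVALUE;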
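
To sanity check what the month-to-date value should look like, here's a small Python sketch of the same arithmetic.  It's only an illustration under my own assumptions: the function name month_to_date and the sample rows are made up, and it works on an in-memory list rather than the DimDate table.

```python
from datetime import date
from itertools import groupby


def month_to_date(rows):
    """rows: iterable of (date, ops_value) pairs.

    Returns {date: running total of ops_value within that date's month}.
    """
    result = {}
    ordered = sorted(rows)  # sort by date so each month's days are contiguous
    for _, month_rows in groupby(ordered, key=lambda r: (r[0].year, r[0].month)):
        running = 0  # the total resets at the start of every month
        for day, value in month_rows:
            running += value
            result[day] = running
    return result


# Hypothetical sample data, not taken from the real DimDate table.
rows = [
    (date(2014, 3, 1), 10),
    (date(2014, 3, 2), 5),
    (date(2014, 3, 3), 20),
    (date(2014, 4, 1), 7),
]
mtd = month_to_date(rows)
```

Comparing a handful of dates from this against the updated OPSMTDValue column is a quick way to confirm the total restarts on the first of each month.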

#AI Must Have Intuition

Will computers take over?

They sure are gaining ground.  With Machine Learning, computers can sift through data, predict your actions based on past experience.  With Neural Networks, computer models can provide future probability based on given weights and existing data.  Do we really want computers to attain the level of Humans, or even surpass us?

Have you ever called a company to ask for service?  In order to speak with the call rep, you must first answer the IVR.  The IVR is a set of voice based commands which receive input based on voice recognition or keys entered on the phone.  IVR systems are unforgiving.  If you don't pay particular attention, you may be routed to the wrong place.  Trying to navigate your way through the maze is a challenge in itself, because the computer can only provide certain choices and receive certain choices.  It does not think outside the box.

What is it missing?

Intuition.  Intuition is that gut feeling you get.  When you meet someone and something doesn't feel right, that's a warning to you to pay attention.  Computers don't have this, yet.  Sure they can analyze details, but it's cold hard rational logic.  They don't have the awareness that Humans have to sense and cross reference our built in experience, knowledge and gut feelings.

I would say if an IVR system could sense your frustration, anger, if it could sense you were about to throw the phone against the wall, it could route your call faster, maybe even to the correct department.  Yet the IVR system understands the number 3 followed by the pound sign, it can understand "I want to cancel my account", it knows how many callers are before you, but it's just a freaking computer program with lines of code behind it with a voice synthesizer front end, it has no intuition.

With all the advances today, they can scan brain images to record memories.  That will catapult AI towards its ultimate goal, but the computer needs to do more than react to input; it needs to understand, to feel and to have intuition.

If our goal is simply to mimic the human brain, you better program the computer to know things like jealousy, revenge, anger, love, hatred and lust.  Otherwise your AI system will be severely lacking: it won't pick up emotions, and that limitation makes the computer a cold heartless piece of hardware / software with the ability to route your IVR call to the incorrect department.  I'm saying that it should know what these traits and characteristics are; I'm not saying it should be programmed to carry out these acts.

Take the characters Spock and Data on Star Trek: they are purely logical.  They cannot deviate from logic.  And as we know, humans may think they are logical, but emotions drive almost every action.  If an AI system cannot understand and respond to human emotions, it will be lacking.  AI must have intuition.


Speaking Event March 22 2014

Today I presented at the IT Pro Camp in Sarasota, Florida, hosted at Keiser University.

The topic of my discussion was originally Intro to Enterprise Data Warehouse, which I presented earlier this month at the Microsoft SQL BI User Group in Tampa.

Last night I redid my slide deck to focus on Intro to Hybrid Data Warehouse: Slides

The session started at 9am with a half dozen people in attendance.  The talk began with why there is a need for an EDW.  Then I gave an introduction to the EDW.  Then it switched gears towards Hadoop and how it integrates with the EDW to form the Hybrid Data Warehouse.

The audience participation was good; they asked some good questions and seemed interested in the topic.  For somebody new to this world, the presentation may have been too low level, so I skimmed over the slides and ad libbed mostly.

One member of the audience works for Microsoft as a PFE.  We spoke afterwards.  He liked the topic, had a desire to learn Hadoop and had seen the slide deck last night.  Said he enjoyed the presentation.

Overall pleased with the delivery this time and think having the computer in back of the classroom may have helped.  I spent most of school sitting in the back of the class going unnoticed by teachers so perhaps feel comfortable there.  Felt like I knew the topic well, now just need some on the job experience to complete the learning life cycle and then Hadoop will be second nature.  Hadoop is wide and deep, with so many projects to learn, and working on a real world solution would help turn high level knowledge into a practical skill set.

And they gave us a free thumb drive for speaking!


Opportunity + Threat = Change

How are people motivated to change?

1. Threat
2. Opportunity

This is based on evolution.  Those in the jungle who wanted to survive went out of the forest into the grasslands in search of opportunity.  There was a threat of starvation.  And an opportunity of food.

Both of which helped mankind to "change".

Without the threat, we may still be swinging from the vines like Tarzan.

With programming, we are always changing.  To adapt to threats and opportunity for rewards.

Perhaps we should apply this model to Artificial Intelligence.  Models could adapt and change if there were some benefits, or threats.

On a side note, humans need to adapt to upcoming events.  At some point in the near future, algorithms and data models will replace jobs.  And we sit enjoying ourselves, like the last hours on the Titanic, with the orchestra playing, as the ship was slowly sinking.

Humans need to wake up to the possible threat of loss of jobs on a massive scale, and / or the opportunity to thrive in the new economy.

Threat or Opportunity.  Change.  That's the blueprint.  What you do with it is up to you.

Hadoop Basics

Data Gosinta (goes into)

When thinking about Hadoop, we think of data.  How to get data into HDFS and how to get data out of HDFS.  Luckily, Hadoop has some popular processes to accomplish this.

SQOOP was created to move data back and forth easily between an External Database and HDFS or HIVE.  There are standard commands for Importing and Exporting data.  When data is moved to HDFS, it creates files on the HDFS folder system.  Those folders can be partitioned in a variety of ways.  Data can be appended to the files through SQOOP jobs.  And you can add a WHERE clause to pull just certain data.  For example, bring in only the data from yesterday and run the SQOOP job daily to populate Hadoop.
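
The daily "just yesterday" import described above can be sketched as a small Python helper that builds the SQOOP command line.  Treat it as a rough illustration, not a tested job: the JDBC URL, table, column and folder names are hypothetical, credentials are left out, and the helper name build_sqoop_import is my own.  The flags themselves (--connect, --table, --where, --target-dir, --append) are standard sqoop import options.

```python
from datetime import date, timedelta


def build_sqoop_import(run_date, jdbc_url, table, date_column, target_dir):
    """Build the argument list for a Sqoop import of the previous day's rows."""
    yesterday = run_date - timedelta(days=1)
    # WHERE clause restricting the import to a single day of data.
    where = f"{date_column} = '{yesterday.isoformat()}'"
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--where", where,
        "--target-dir", target_dir,
        "--append",  # add new files to the existing HDFS folder
    ]


# Hypothetical values for illustration only.
cmd = build_sqoop_import(
    run_date=date(2014, 3, 23),
    jdbc_url="jdbc:mysql://dbhost/sales",
    table="orders",
    date_column="order_date",
    target_dir="/data/orders",
)
```

On a machine with SQOOP installed, the list could be handed to subprocess.run from a daily scheduler; here it is only built, never executed.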

Once data gets moved to Hadoop HDFS, you can add a layer of HIVE on top which structures the data into relational format.  Once applied, the data can be queried by HIVE SQL.  When creating a table in the HIVE database schema, you can create an External table, which is basically a pass through metadata layer that points to the actual data.  So if you drop the External table, the data remains intact.

From HIVE SQL, the tables are exposed to ODBC to allow data to be accessed via Reports, Databases, ETL, etc.

So as you can see from the basic description above, you can move data back and forth easily between Hadoop and your Relational Database (or flat files).

In addition, you can use a Hadoop language called PIG (not making this up) to massage the data through a structured series of steps, a form of ETL if you will.

Hybrid Data Warehouse
You can keep the data up to date by using SQOOP, then add data from a variety of systems to build a Hybrid Data Warehouse.  Data Warehousing is a concept, a documented framework to follow with guidelines and rules.  Storing the data in both Hadoop and Relational Databases is typically known as a Hybrid Data Warehouse.

Connect to the Data
Once data is stored in the HDW, it can be consumed by users via HIVE ODBC or Microsoft Power BI, Tableau, QlikView or SAP HANA or a variety of other tools sitting on top of the data layer, including Self Service tools.

Machine Learning
In addition, you could apply MAHOUT Machine Learning algorithms to your Hadoop cluster for Clustering, Classification and Collaborative Filtering.  And you can run statistical analysis in the R language with RHadoop from Revolution Analytics.

And you can receive Streaming Data.

There's Zookeeper which is a centralized service to keep track of things.

And Giraph, which gives Hadoop the ability to process Graph connections between nodes.

In Memory
And Spark, which allows faster processing by bypassing Map Reduce and running In Memory.

You can run your Hybrid Data Warehouse in the Cloud with Microsoft Azure HDInsight (on Blob storage) or Amazon Web Services.

On Premise
You can run On Premise with IBM Infosphere BigInsights, Cloudera, Hortonworks and MapR.

Hadoop 2.0
And with the latest Hadoop 2.0, there's the addition of YARN, which is a new layer that sits between HDFS2 and the application layers.  Although Map Reduce was originally designed as the sole batch oriented approach to getting data from HDFS, it's no longer the only way.  SQL on Hadoop has been sped up through Impala, which completely bypasses Map Reduce, and through the Stinger initiative, which runs HIVE atop Tez.  Stinger also brings the columnar ORC file format, which compresses data in column stores and allows the interaction to be sped up.

New Features 2.0
With Hadoop 2.0, you can now monitor your clusters with Ambari which has an API layer for 3rd party tools to hook into.  A well known limitation of Hadoop has been Security which has now been addressed as well.

HBase is a separate database that allows random read/write access to the HDFS data, and surprisingly it too sits on the HDFS cluster.  Data can be ingested into HBase and interpreted On Read, which Relational Databases do not offer.

Sometimes when developing, users don't know where data is stored.  And sometimes the data can be stored in a variety of formats, because HIVE, PIG and Map Reduce can have separate data model types.  So HCatalog was created to alleviate some of the frustration.  It's a table abstraction layer, meta data service and a shared schema for Pig, Hive and M/R.  It exposes info about the data to applications.

Here's a quick diagram showing the basics of SQOOP, Hive, HDFS, HIVE ODBC, etc.

I hope you enjoyed this blog on Hadoop basics.