Gutting Hadoop with YARN

There's a building near my home.  The building is owned by a company.  The building was probably built in the 1960s or earlier.  Lately, they've been remodeling the building to give it a fresh new feel.  The thing is, they did not bulldoze the existing building.  They simply kept the structure in place, and built around it.

Now take Hadoop.  Its core was based upon two things: the HDFS file system and the MapReduce framework.  Along comes YARN, a layer between HDFS (now HDFS2) and MapReduce.  Although MapReduce is still bundled into the product, it's basically one of many applications which sit atop YARN.  So it's no longer the sole way of getting at the data / file system.
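
To make that concrete, here's the classic MapReduce word count, a minimal sketch using the standard Hadoop MapReduce API (the input and output paths are made up).  When you submit it, the job runs as just another YARN application, side by side with whatever else the cluster happens to be hosting.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word after the shuffle
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/data/in"));    // hypothetical input path
    FileOutputFormat.setOutputPath(job, new Path("/data/out")); // hypothetical output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The point isn't the word count itself.  It's that nothing in this code knows or cares that YARN is underneath doing the resource management.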

In fact, with Tez, the code actually gets compiled down to a lower level than MapReduce, which explains the speed-up.  And yes, MapReduce still runs its legacy applications.  However, it will be compiled down to a lower level to take advantage of YARN (not sure if this is available now).

The way I heard it is that MapReduce has to spawn off multiple jobs/threads to handle complex SQL operations like Join or Group By.  Those extra jobs have to be monitored, they have to write to disk, then get aggregated and shuffled, then brought back together, and all of that slows things down.  YARN doesn't need as many jobs to get the same results, and that's one way it's able to return queries faster.

So what they've done is the same thing as that building down the street from me.  They've built around it using the existing framework, added a basement, new walls, and made it two stories.

Very impressive.

Likewise, Microsoft has found an alternate method of storing the data instead of the HDFS file system.  They use Blob Storage, which allows easy scale-out on the web.
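
If I understand it right, this works because Hadoop's FileSystem API abstracts the storage layer, so pointing at Blob Storage is mostly a matter of the URI scheme.  Here's a rough sketch, not a tested recipe, assuming the hadoop-azure module is on the classpath and the storage account key is already configured; the account, container and file names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlobStorageRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Same FileSystem abstraction the rest of Hadoop codes against;
    // only the "wasb" URI scheme says to talk to Blob Storage instead of HDFS.
    URI blobUri = new URI("wasb://mycontainer@myaccount.blob.core.windows.net");
    FileSystem fs = FileSystem.get(blobUri, conf);

    Path file = new Path("/sample/data.txt");  // resolves inside the container
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```

In theory, code written against HDFS paths keeps working when the path becomes a wasb:// URI, which is the appeal.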

So are things changing in the world of Hadoop?  Yes siree Bob.  Change is a good thing.

And so it goes~!

Just Give Me the Insights Please

Everyone is jockeying for position in the Big Data space, as there's a lot of loot to be made.  Developers are scrambling to learn as fast as possible.  Businesses know they need big data but aren't sure why.

So let's cut to the chase.  People want insights.  Getting from raw data to insight takes resources and technology.  Yet if people could go straight to insights, they would.  Bypass the middle man.

So let's say you need insights on a particular subject.  Wouldn't it be a lot easier to go out and purchase those insights?  "Insights as a Service".

Your CEO probably isn't too technical.  Once you start blabbing about the intricacies, the costs and the risks involved with a big data project, they don't want that.  They want insights.  Results.  Which translates to: how can this big data give us more sales, lower costs and streamlined processes?

In a sense: how can it bring us more money, in order to increase the stock price, so I can get a seven-figure bonus this year?

However, getting to insight requires a person to collect the data, mash it, cleanse it and work with it to bake out the insights.

And that's where the real value is at this point in time.  Personally, I plan to be one of those people who create insights.  There's a lot of value to be added and you can earn a nice living doing so.


No Computer Science Degree?

My degree is in Anthropology.  Thing is, I probably couldn't do that for a living.  First off, I only learned the basics, didn't go for a Master's degree, and I'm not sure what jobs are out there in that industry.  Perhaps work in a museum, or work for an archaeology company doing digs, leading projects or doing administration.

Still, I have a 4 year degree from a major university (University of Florida, Gainesville, FL).

Even though I attended a few computer courses, I'm mostly self-taught.  My skill set has grown over the years, revolving mostly around data, programming and reporting.  My current job requires me to build data warehouses through the full life cycle.  And soon I hope to be programming in Hadoop; it's just a matter of time.

So what do I think about the IT industry?  It changes fast.  Too fast.  The speed at which it evolves is surpassed by the complexity in which technology unfolds.  So not only is there more to learn, the pace at which one must learn is increasing, as is the complexity.  No longer can you do one technology and survive; each technology is tightly integrated with other technology.  So if you're into data, for example, you have to know about relational databases, NoSQL databases and Hadoop.  Plus SQL, all the reporting tools out there, the dashboards and visualizations, and the ETL tools.  Plus the web, security, authorization, project management, agile methodology, sales, presentation skills, interacting with management and clients; the list goes on and on.

And you must keep pace with learning in addition to doing your full-time job, plus have a life outside of work.

So it is a difficult career indeed.  You can never stop learning, or you'll become unmarketable.

Now, I'm here to say that if you are looking for the best programmer in the world, one who writes flawless code, who knows everything about everything, they may be out there.  I've seen a lot of talented people in my years and there are some brilliant people in the workforce.  Just looking at my Twitter connections, it blows your mind to see how smart some people really are.

My coding style is based on maintainability, so the next person can pick it up easily and figure it out.  I use the coding techniques which I know, and when running into difficulty, I first scan the internet to look for existing solutions.  When none can be found, I can roll up my sleeves and troubleshoot with the best of them.  My level of knowledge is okay, I'm not really an expert in anything at a deep level, so no superstar status for me.  But at the end of the day, I do quality work and the clients are generally satisfied with the results.

Had my degree been in computer science, there probably wouldn't be much change to my coding style, problem solving ability or work ethic.  And besides, I graduated in 1991, things were a bit different back then. 

So at the end of the day, nobody knows everything in the world of IT.  I make an effort to keep current, my skills are decent and I don't think having a degree in Anthropology has limited me in any capacity.

Successful Data Warehouse Projects

Working as a consultant who builds data warehouses, you get to see a lot of different organizations and how they operate.

You go in, assess the project, estimate a scope and begin work.  From my perspective one of the biggest challenges seems to be the business rules.

More times than not, the business rules are not documented.  They are embedded in people's heads or buried deep within the code.  Deciphering the business logic is the toughest part of the project, in my opinion.

At the end of the day the numbers have to match theirs.  Identifying and locating the data sources is sometimes difficult.  Translating the business logic from Access, Excel or the programmer's noggin is quite difficult.  And the exceptions.  Those are the ones that get you.

What makes sense is to grab the data from the source systems.  Many times we are given views or tables preloaded.  There are business rules hidden from view, or the views are outdated, or there's missing or incorrect data, or the timing of the loads is not in sync.

There are many reasons why it makes sense to get the data from the source data repository.

Finally, to create a data warehouse you need access to the correct data, you need to understand the business rules as well as business processes and you need someone from the organization to assist with questions as they arise.

Otherwise you're just asking for trouble.  Enough said.


Generalize or Specialize

The question is, do you generalize or specialize?

Do you learn surface-level stuff about everything?  Or do you go deep on specific technologies and become an expert?

In today's day and age, things are changing so fast nobody can learn everything there is to know about everything.  So, you can take the approach where you learn enough to talk about any subject.  Or you pick a technology that interests you and learn everything there is to know.

So with Hadoop, there's a lot to learn.  Hadoop version 1 is already legacy, now there's version 2.  And there must be 20 individual projects associated with it now.  And there's the installation, administration, support, loading data, getting the data out, security, graph databases, machine learning, SQL-like queries, ETL languages, data ingestion tools, streaming data, and so on.

Plus there are dozens of third-party tools which integrate with Hadoop.

How can one become an expert on everything and stay current?  So do you generalize or specialize?  That is the question.


3 Fears Blocking Your Move to the #Cloud

I say the Cloud is taking off.

Yet I hear many people shy away.

Hiding behind "data security", "data breaches", "HIPAA", "PCI", etc.

Fact is, if you store your data with a top vendor, it's probably more secure than your on-premises data.

What's the deeper concern with the Cloud? 
  1. Change (a 4 letter word)
  2. Fear of losing their job (stagnant cushy jobs, no learning involved = lazy)
  3. It's not how we've always done it (see #1)
There's a dozen reasons why you should be thinking about the Cloud now.

Once you get past the 3 fears.


Attended Hadoop Class in Orlando

I attended the Orlando Data Science Meetup Group today 6/11/2014.

Great turnout.  I was working in Lakeland about an hour away, so the drive to Orlando wasn't too bad, even with the rain and rush hour downtown traffic.

Started with introductions, then dove into a clean install of Hadoop on a Linux box.

Note: one difference between Hadoop v1 and v2 is that they've reduced the number of folders which comprise Hadoop.

For me Linux isn't an everyday environment, so I just watched the install on the projector.

Meanwhile, I already had Hortonworks Hyper-V VMs running a 1.3 Linux install and a 2.0 Linux install, as well as a Windows 2.0 version of Hadoop, but I don't have a bare Linux operating system to do the install myself.

After a short time the environment was setup and Hadoop was up and running.

The presenter who did the Hadoop installation really knew his stuff and managed to do the install from scratch seamlessly.  I've poked around in the folders and config files, and there's really a lot to know, so I understand the level of knowledge and detail involved.

Understanding Linux is required for the install including commands, editors, RSA keys, passwords, users, folder structures and a ton more.

They recommend using a bunch of Mac minis to set up a small cluster, cool huh?

For larger clusters, they recommend renting time on someone else's cluster.

There was mention of creating a cluster on VMs, but overall there are no cheap solutions for creating, running and administering a large cluster.

In speaking with some attendees, they recommended downloading a Linux ISO and installing from scratch on Hyper-V.

Here's a link: http://blogs.msdn.com/b/virtual_pc_guy/archive/2010/10/21/installing-ubuntu-server-10-10-on-hyper-v.aspx

This was the first of three sessions; next up is YARN, then MapReduce.

Looking forward to it~!

New SlideShare Intro to Hadoop

I just posted a new SlideShare on "Intro to Hadoop".


Hopefully I'll get to present it soon~!


Tampa Bay Technology Meetup Groups

There's a lot of Meetup groups forming in the Tampa Bay area for advanced technology.

I'm currently a member of:

Tampa Bay BI User Group (just renamed to Tampa Bay Analytics User Group) (member)

Tampa Bay SQL Server User Group (Pinellas + Tampa) (member)

Tampa Analytics Professionals (co-organizer)

Tampa Bay Cloud User Group (member)

Tampa Bay Hadoop User Group (asked to speak at this event)

Tampa Bay MongoDB User Group (member)

Orlando Data Science (member)

Tampa Tableau User Group

The Tampa Bay area is definitely experiencing an uptick in advanced data technology.

Stay tuned for more updates~!


My Personal Computers Over Time

What was the first interactive computer game I played?  Pong.

My next interactive electronic game?  Atari.

Our friend down the street had an Intellivision, which connected somehow to the main server.

What was the first personal computer I worked on?  An original IBM PC.

In the 8th or 9th grade, not sure which, we used the TRS-80.

In college, I used the VAX.

My first laptop, an IBM (suitcase sized, orange screen).

My first computer of my own, an IBM PS/2 (connected to AOL).

My next computer, an HP.

Then a Dell.

And then a Dell laptop with a 17-inch monitor.

And a Samsung tablet.

Then an iPhone 4.

And finally a Windows Phone.

And there you have it, my personal computer choices over a lifetime.

A blast from the past.