My Intro to Hadoop

Big Data is big news!

And what's driving Big Data?


So today I did some research.

I watched the following video:

Disclaimer: this is not my information. I transcribed the following notes from the YouTube video listed above; I don't own or take credit for any of this info.

Hadoop is downloadable from the Apache Website.

It basically works with large sets of files, structured or unstructured.

It works based on the premise of 'Data Nodes'.

Map-Reduce = Computation 

A way of taking the distributed out of a distributed system.

HDFS = Storage 

Hadoop Distributed File System - (all file system things) make directory, copy, list, permissions, groups, file sizes, rm, etc.

Hadoop works well with really big files.

It splits files into blocks, 128 MB each by default.

Each block is kept as 3 copies, or replicas.

Blocks / replicas go onto the individual machines that make up the cluster.

They are distributed as evenly as possible.
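Here's a rough sketch of the arithmetic behind those notes. It assumes the 128 MB block size and 3 copies quoted above; the function names and the 1,000 MB file are made up for illustration, and the round-robin spread is my own simplification, not how Hadoop really places blocks.

```python
import math

BLOCK_SIZE_MB = 128   # block size quoted in the notes above
REPLICATION = 3       # copies kept of each block, per the notes above


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    replicas = blocks * REPLICATION
    # The last block is only as big as the data left over, so the raw
    # storage is simply the file size times the number of copies.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


def spread_evenly(replica_count, node_count):
    """Deal replicas out so no node holds more than one extra (simplified)."""
    base, extra = divmod(replica_count, node_count)
    return [base + (1 if i < extra else 0) for i in range(node_count)]


blocks, replicas, raw_mb = hdfs_footprint(1000)   # hypothetical 1,000 MB file
print(blocks, replicas, raw_mb)                   # 8 24 3000
print(spread_evenly(replicas, 5))                 # [5, 5, 5, 5, 4]
```

So a 1,000 MB file turns into 8 blocks and 24 block copies, taking up about 3,000 MB across the cluster.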

The NameNode watches over the other nodes, kind of like a 'pointer' in the C++ language.

The NameNode watches the nodes, looking for failures and replicating as needed.

If all the nodes are lost, that's bad. If you lose all but one, you are okay, because it will replicate back the lost ones.

It's basically a file system app written in Java.

The NameNode is a single point of failure, but it doesn't fail that often.

Which you can query using 'Map-Reduce' jobs.

Where it groups everything into key/value pairs.

Which get fed into a reducer.

Which takes keys and values and creates more keys and values.
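To picture the key/value flow, here's a minimal word-count sketch in plain Python. It is not Hadoop code and doesn't use any Hadoop API; the function names are my own, and everything runs in one process just to show the map, group and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: turn one line of text into (word, 1) key/value pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, the way pairs are grouped before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: take a key and its values and emit a new key/value pair."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data is big", "data is news"]))
# {'big': 2, 'data': 2, 'is': 2, 'news': 1}
```

On a real cluster the mappers and reducers would run on different machines, next to the blocks they read.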

The Elephant represents Hadoop.

Then you can query this data using a new language called 'Pig'.

This is equivalent to the 'Assembly' language of long ago, very tedious.

And then there's another language called 'Hive' which is very similar to SQL which is widely used.

The Apache 'HBase' project, a database that sits on top of Hadoop, is also gaining steam.

And another 'real time' project similar to HBase, called 'Accumulo'.

Which has cell level security.

And many more new languages with different features.

And another project, for data serialization, called 'Avro'.

It defines a data serialization format that keeps the schema with the data, like a map which travels with the data.
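To show what "the schema travels with the data" means, here's a small sketch using plain JSON from the Python standard library. It is not the real Avro file format or the Avro library; the `Visit` schema, the field names and the `pack` / `unpack` helpers are all made up for illustration. The schema is just written in the JSON record style that Avro schemas use.

```python
import json

# Hypothetical schema, written in the JSON style Avro uses for records.
SCHEMA = {
    "type": "record",
    "name": "Visit",
    "fields": [
        {"name": "page", "type": "string"},
        {"name": "hits", "type": "int"},
    ],
}


def pack(schema, records):
    """Store the schema once, then the rows as bare values in field order."""
    names = [field["name"] for field in schema["fields"]]
    rows = [[record[name] for name in names] for record in records]
    return json.dumps({"schema": schema, "rows": rows})


def unpack(payload):
    """Rebuild the records using only the schema found inside the payload."""
    doc = json.loads(payload)
    names = [field["name"] for field in doc["schema"]["fields"]]
    return [dict(zip(names, row)) for row in doc["rows"]]


payload = pack(SCHEMA, [{"page": "/home", "hits": 3}, {"page": "/about", "hits": 1}])
print(unpack(payload))
```

The reader never needs a separate copy of the schema, because it reads the "map" out of the payload itself.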

You can really store a lot of data in Hadoop.

And here's a tutorial online from Cloudera

And another YouTube Link.

And this link gives good summary.

And one last link.

I started to download Hadoop at work today and got the files unzipped and loaded, then downloaded the Java JDK from Oracle / Sun. I still need to set the JAVA_HOME config setting.

This will take some time and patience, that's for sure!


Some Concerns #BigData #BiWisdom

As we are learning, Big Data and Business Intelligence are pivotal in today's IT strategy.

Get the data in front of the person who makes decisions, in a timely manner and in a readable format.

In order to make decisions.

Yes, we have larger data sets available and better technology to slice and dice.

Yet, who are all these people making decisions?

And what decisions are they making?

Do we know any of them?

And what questions / concerns do they have?

Timeliness, accuracy, security, privacy, completeness, etc.?

The info I'm seeing is mostly driven from the people who have access to the data.

In order to drive this technology forward, it seems to me that we need more input from the actual users.

The people who do make the decisions.

To identify what types of problems we are trying to solve.

Also, what if we present our version of the data and it's wrong?

What if the data is outdated?

Or what if 2/3 of the data is correct, but the 1/3 that was mashed together with it was wrong?

And the customer that pays the bills made a business decision based on incorrect data.

Then what?

Kind of slows down the process, makes you think about the bigger picture.


And what if after spending all this time and money, the data is marginal, meaning the decision could go either way, resulting in a decision based on intuition or business experience instead of 'facts'.

And what if one of these extremist groups uses this new technology to slant their version of the truth.

It could happen.  Intentionally or not.  Skew the facts.  Paint a picture that satisfies a particular agenda.

I'm just throwing these ideas out there.

With all the mad rush to dive in and be the first to cash in, we may want to step back and look at some of the bigger issues.


Create Order from Chaos #BIWisdom #BigData

Business Intelligence has been around for a while, long before the name appeared.

Basically, getting data in front of the user as information to base decisions on.

However, Big Data has spawned because of the need to extract nuggets of gold from the mountains of data.

The data is not uniform.  Instead it's humongous, scattered and cluttered.

So I would say the purpose of both BI (Business Intelligence) and BD (Big Data) is "Create Order from Chaos".

Simply put, our goal is to parse through all this information and structure it in a way that is comprehensible, in real time, to produce some level of value, ongoing.

Information, the new "man made oil / gold".

To be mined and harnessed.

Bought and sold as a commodity.

And the new gold diggers are the data miners and business intelligence developers.

Remember the Gold Rush of the mid-1800s?

Move out west and claim your fortune.

Fast forward to today, we search for gold in that heap of data, shovel optional.

I figure there's no more land on this planet to plunder.

Might as well create some artificial commodity.

And you can do that by creating some type of order out of chaos.


10k Blog Visits

Tonight this blog BloomConsultingBI.com passed 10,000 visits.

I didn't realize so many people were reading.

I am surprised and humbled.

Thank you everyone for taking the time out of your busy day to have a read.

I do not proclaim to know much about much.

But I do like to type my thoughts every day.

And I thank you very much.



First Year Team Lead

Looking back at the past year of supervising the reporting team, I look for milestones.

What have I accomplished?

First, I tried to standardize the reports.

I implemented report templates.

I introduced re-usable queries for permissions and date parameters.

I added dynamic code instead of hard coded values across the enterprise reports.

Second, in an effort to wrap processes around chaos, I introduced a New Report Request template.

All new requests require a form to be completed and attached to the ticket request.

No ticket, no report. It goes against the 'Self Service' delivery model, but Rome wasn't built in a year.

At least I can prioritize the reports based on needs, deadlines and available resources.

Third, I am on the committee to clean up the ticketing system.

I introduced ideas to clean out the tickets older than 180 days.

It's similar to sending a letter: once you drop it in the mailbox, you are out of the loop. The mailman picks up the letter and delivers it to the mail warehouse, and then he's out of the loop. A plane takes the letter to the next destination, and so on.

There's no need to keep everyone on every ticket, so remove your name off the ticket when you complete your portion of the code.

That will ensure we can identify the owner of the ticket at any time.

There's also a new rule that a ticket must have a comment at least every 30 days to inform the user.

So anyone, at any time, from anywhere can find the history of the ticket, who worked on it and why it's not complete.

I also created reports that get sent to the field users every week to identify outstanding tickets, as well as a report to identify 'unassigned' tickets.
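The rules above boil down to three simple checks. Here's a minimal sketch of them; the `Ticket` fields and the `weekly_report` function are hypothetical names I made up, not our actual ticketing system or report code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ticket:
    number: int
    opened: date
    last_comment: date
    owner: Optional[str] = None   # None means nobody is assigned


def weekly_report(tickets, today):
    """Flag tickets that break the 180-day, 30-day and ownership rules."""
    return {
        "older_than_180_days": [
            t.number for t in tickets if (today - t.opened).days > 180
        ],
        "no_comment_in_30_days": [
            t.number for t in tickets if (today - t.last_comment).days > 30
        ],
        "unassigned": [t.number for t in tickets if not t.owner],
    }


tickets = [
    Ticket(1, date(2012, 1, 1), date(2012, 8, 25), "alice"),
    Ticket(2, date(2012, 8, 1), date(2012, 8, 1)),
    Ticket(3, date(2012, 8, 20), date(2012, 8, 30), "bob"),
]
print(weekly_report(tickets, today=date(2012, 9, 1)))
```

Ticket 1 shows up as too old, and ticket 2 shows up twice: no comment in over 30 days and no owner.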

Since implementing these procedures, we have reduced the number of tickets by half.

Fourth, I've learned to manage a small team of developers, delegate workloads, set customer expectations, document reports, maintain the web server, maintain all reports in the Visual SourceSafe code repository, hold meetings with customers to gather specs, do performance appraisals, approve time off, work with other departments, automate reports internally and for customers, etc.

Overall, my first year as Team Lead, or Supervisor, or Application Administrator, whatever the correct term is, has been a great year to say the least.

Luckily I still get to code, however, I do see the benefits of finally being in a position to make a difference, to improve things, where my voice finally gets heard.

And there you have it!


Big Data Questions #BigData

With all the hype of big data going on there are still basic questions that need to be addressed.

What is big data?

Are we talking huge volumes of data?

Or the gathering of disparate data across multiple sources?

Or is it the blending of structured and unstructured data?

Is it mining data?

Or is it predictive analysis?

Is big data just the next iteration of Business Intelligence?

And who performs the functions of big data?

Since algorithms and mathematical formulas are involved should the role be dedicated to a statistician?

Since programming is involved would an IT person do the analysis?

Would the person work independently or as a team?

Would one person have all the necessary information to accomplish the task?

Who would oversee the operation?

How would the goals be determined?

How would success be defined?

Who would ensure data governance?

Who would ensure the data is correct?

Who would understand and translate the business rules to the developer?

Who would oversee the project to ensure timelines are met?

Who would fund the project?

Would the project work in tandem with other parts of the organization or be a silo?

Could big data be sold as a service?

I think these and other questions need to be addressed with industry standards.

So we are not aiming at a moving target.

Any thoughts?


Microsoft BI Stack

When you think of Microsoft Business Intelligence, there is so much under that umbrella.

Each utility is a tool in a programmer's tool belt.

There's reporting - SSRS.  I've heard some people call this 'simple'.  However, it is quite complex and a great info delivery utility.  There is big demand for report developers, although many positions also require DBA, .NET, SharePoint, etc. skills in combination.

There's ETL - SSIS.  I've been writing DTS packages for a decade.  However, this is really a great skill to have and I wouldn't mind getting more exposure in order to learn the intricacies involved.  I have basically created data extracts in comma / pipe delimited files or Excel destination files, and transformed the data along the way.  There seems to be an increased demand for ETL developers and a limited supply of qualified developers.

There's cubes - SSAS.  These are like relational databases on steroids.  You can have multiple layers of data, pre-aggregated for faster retrieval, with the ability to slice and dice the data quickly.  This sits atop relational databases or data warehouses and can be queried using a language called MDX.  There is a big demand for this skill, it is more on the advanced side of Business Intelligence, and the pay is commensurate.

There's SharePoint - which is twofold.  You can integrate it with SSRS and / or use it as a file repository for an organization.  There are complexities regarding pricing and installation, and it sometimes requires one or many full time employees to maintain.  Also, once in production, it is capable of growing into a jungle of files and disarray.
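To give a feel for what "pre-aggregated" means, here's a toy sketch in Python. It has nothing to do with how SSAS actually stores a cube; the `build_cube` function and the sample sales rows are made up. It just totals a measure ahead of time for every combination of dimensions, so a later slice is a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(facts, dimensions, measure):
    """Pre-compute totals of `measure` for every combination of dimensions."""
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in facts:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


# Hypothetical fact rows
facts = [
    {"region": "East", "year": 2011, "sales": 100},
    {"region": "East", "year": 2012, "sales": 150},
    {"region": "West", "year": 2012, "sales": 70},
]
cube = build_cube(facts, ["region", "year"], "sales")

print(cube[((), ())])                    # grand total: 320.0
print(cube[(("region",), ("East",))])    # slice by region: 250.0
print(cube[(("year",), (2012,))])        # slice by year: 220.0
```

The trade-off is the same as with a real cube: more storage and processing up front in exchange for fast answers later.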

There's Data Quality Services - which is fairly new.  This puts the business logic into the hands of the business user along with Data Governance in order to ensure 'good' data.  When data is brought into the ecosystem, it can be cleansed based on defined business rules and this utility allows that to happen.

There's Master Data Services.  This allows for a centralized repository of data to be maintained via a web-front end.  See my Blog Post.

There's Power Pivot.  This is a free plug-in for Excel 2010 that allows business users to bring in data from a variety of sources, apply joins and create pseudo Cubes rather quickly.  These cubes can then be dissected using Excel Pivot Tables and uploaded into SharePoint.  A great tool for business users, just be sure to keep security in mind.  In 2012, you can also develop these cubes in BIDS and then upload to SharePoint or SSAS.

There's Performance Point.  This allows users to create visually exciting and informative Graphs, Scorecards, Charts, etc. with drill through capabilities.  These are intended for high level execs.

There's PowerView.  This allows users to create, on the fly, visually enticing reports in SharePoint.

There's a variety of other new features included in the Microsoft Stack.  I'm just going off memory here.

All in all, if you submerge yourself in this stack of Business Intelligence, you will have plenty of opportunity.

SSRS vs. Crystal Reports

Not all reporting tools are created equal.

Well, maybe they are, but I don't believe it's so.

I started off with Crystal Reports version 5 in 1996.

That was the latest version back then.

I learned Crystal Reports in a day, from a contractor from St. Louis, who answered all my questions.

And then I was the Reporting guy for NationsBank.

They also had a product back then called Crystal Info, which was a scheduling mechanism to deploy reports.

That product is still in production today, under a different name, Business Objects.

I found that Crystal Reports is a great tool for generating quick reports.

You can add complexity to the reports, but only so far.

Back in the day, we had to manually keep track of running totals using Variables, and then set the Variable to zero in the Group you wanted to keep track of.

This was time consuming and labor intensive.
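For anyone who never had to do it, the manual technique looked roughly like this. This is a Python sketch of the idea, not Crystal Reports formula syntax; the function name and the sample rows are made up.

```python
def running_totals(rows):
    """rows: (group, amount) pairs already sorted by group.

    Keeps one running-total variable and resets it to zero whenever
    the group changes, which is what we did by hand in each Group.
    """
    current_group = object()   # sentinel that never equals a real group
    total = 0
    for group, amount in rows:
        if group != current_group:
            current_group = group
            total = 0          # reset the variable at the start of the group
        total += amount
        yield group, amount, total


rows = [("East", 10), ("East", 5), ("West", 7)]
for line in running_totals(rows):
    print(line)
# ('East', 10, 10)
# ('East', 5, 15)
# ('West', 7, 7)
```

Forget the reset, or put it in the wrong group, and every total after that is off.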

This got corrected in later versions, and the new programmers never knew how difficult it was back in the day.

Seagate Software used to have a preferred Help Desk support line I called many times around 1998.

By calling this 800 #, your call was automatically pushed to the front of the queue.

They had cool music on the phone, and you could actually change the music genre.

They eventually did away with the 800 # over time.

Crystal Reports changed with each version, but the basic concepts stayed the same.

That's how I was able to jump versions over time and not have to really re-learn anything.

However, all of a sudden, Business Objects appeared with Universes and I kind of got left behind overnight.

Around the same time I learned Microsoft SQL-Server Reporting Services (SSRS).

To be honest, the first time I viewed the SSRS web, I thought it was primitive and childlike.

Sure you could create reports in BIDS or embed them in .net, similar to Crystal Reports.

You could create sub-reports, same as Crystal Reports.

However, the introduction of multiple Data Sets is far superior.

The formula syntax was easy to use.

The ability to program in a pseudo .net environment was cool.

So was the ability to deploy reports, set permissions, view the report execution logs, schedule reports via Subscriptions to email or network folders, etc.

The list goes on and on.

I've been out of Crystal Reports for over a year now.

I don't really miss it.

Given that SSRS is free with SQL-Server, has the backing of one of the largest software companies in the world, and is part of the complete Microsoft Business Intelligence suite of products, I find it far superior to Crystal Reports.

However, if a contract requires me to program in Crystal Reports, Actuate, whatever, I'm not against it.

I just have a preference for SSRS.

That's all I'm saying!


T-SQL Auto Format

Working in SQL-Server for the past decade, I often wondered how to standardize the formatting of T-SQL.

Everywhere you go, people have certain 'Styles'.

Upper case this, lower case this, indent here, etc.

I have finally found a great online utility to 'Auto Format'.


I've been using this link for a few weeks now and I love it.

Just pop your code in there, wait a few seconds, copy to clipboard and paste in T-SQL editor.

Good to go!

Addendum 8/31/2012 - this one actually works:



Early days of BI

I worked a job in Tampa for a while.

I was on the reporting team doing Crystal Reports.

And my boss, one day in a meeting volunteered me to monitor the DTS packages that ran each day.

Only thing was, the jobs ran in a window from 1am to 4pm.

So each night, I received a phone call around 1am, saying the packages were not going to finish in time.

So I had to get up, drive 50 minutes to work, log on to my computer, fix the problem, then drive home, go back to bed, only to wake up and go to work for real.

This was not my favorite job in the world.

Then I managed to get remote access.

However, I still awoke at all hours of the night.

Then a new guy showed up.

He was about the smartest person I ever met.

And we chatted for a while.

Next thing you know, he spoke with the CEO and got me off the nightly monitor shift onto his team of 1.

And that was awesome.

Because I still got to write reports in Crystal Reports.

But we also worked with Crystal ScoreCard.

We even had a conference call with the Crystal developers in England.

And they sent us some contractors to write the KPI's.

This was way before BI got popular.

And we also had Microsoft Cubes.

I tried reading books on the subject but to be honest it was over my head back then.

So overall it became a good job.

I got to write DTS packages which read XML fields stored in the database, nice design eh!

So I had to write VBA code to extract from the XML nodes.
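The original was VBA inside a DTS package, which I no longer have. Here's a rough Python equivalent of the idea using the standard library's `xml.etree.ElementTree`; the `<claim>` XML and the tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def extract_fields(xml_text, tags):
    """Pull the text of each named child node out of one XML field value."""
    root = ET.fromstring(xml_text)
    # findtext returns None when a node is missing, so absent tags are kept
    # as None instead of raising an error.
    return {tag: root.findtext(tag) for tag in tags}


# Hypothetical value of an XML column read from the database
xml_field = "<claim><id>42</id><amount>100.50</amount></claim>"

print(extract_fields(xml_field, ["id", "amount", "status"]))
# {'id': '42', 'amount': '100.50', 'status': None}
```

Every value comes back as text, so numbers and dates still have to be converted before they land in a typed column.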

I also wrote MDX code back then which ran inside the Crystal Report - except the wizard wrote the code and I copied and pasted...

Eventually I wanted to work closer to home because the hour drive was too much.

So I accepted a $10k pay cut to work closer to home.

After a few months on the new job, the owners of the company bumped up my salary $15k after they saw my work.

And I lasted there close to 4 years and balanced the company books for two insurance companies doing the month end for 33 months plus a slew of reports for marketing, claims, accounting, hurricanes, etc.

So I got to work in BI before they called it BI.

I actually started doing Crystal Reports in 1995/1996, long ago.

When reporting was the red-headed stepchild of the programming world.

My how things have changed.

Java Coder

When I was a Visual Basic / Oracle / Crystal Reports developer back in the late 1990s, for some reason, I wanted to learn Java.

So my father, who worked for IBM at the time, gave me some software called IBM VA Java.

So I learned the basics.

And then asked my boss at the time if I could do some Java at work and attend a conference or training.

He said no.

So I left for another job.

Years went by, no Java.

My wife's friend was a manager who was looking for a Java programmer and asked me constantly to come work for her.

I explained I knew how to code, I knew some Java but never wrote a line of production Java code in my life.

Well guess what?

They were looking for a Java programmer who knew guess what?

IBM VA Java from the late 1990s.

So I interviewed for the job.

When I got there, guess what?

I knew the hiring manager from before and he liked me.

So guess what?

I got the Senior Java Programmer position without writing a single line of production Java code.

And then I learned Java on the fly with no training and minor assistance from another developer.

My task was to solve the problem that he could not solve.

Which was to get Cookies & Captcha working in production.

I stepped through the code and everything was okay.

It turns out the Middleware guy never disclosed the architectural structure of the servers.

And there were two internal servers and two external servers.

So the trick was to drop a cookie on each of the 4 servers.

Presto it worked.

I was able to solve the problem.

So all my bosses were happy.

Except I inadvertently made the other programmer look bad.

And I gained an enemy with the Middleware guy.

And for the rest of my 4 years working, not one of my deployments worked 1st time, or 2nd time.

There was always some issue.

And the Middleware guy blamed the coders.

It wasn't just me that had issues with the deployments. Either the guy had no idea what he was doing or something fishy was going on, because the entire Java department of 20 people had issues with deployments, and I especially had them.

Yet it was a black box environment; we could not see the production WebSphere server, ever.

You would think that code that worked on a dev box and on the test server would automatically work in production, but it never did, and we had to fix the issue every time.

They eventually got rid of that guy. Although he was with the org for 30 years and nobody knew his job, they still let him go during the re-org.  And the problems with deployments disappeared immediately.

Although I survived all three downsizings, I left to get out of Java and into Microsoft Business Intelligence.

And haven't looked back.


New Report Request Form

Based on recent experience, I've decided to implement a 'New Report Request Form'.

Why?  Because sometimes you need them.

Why?  Because sometimes the user changes their mind once or twice or twelve times.

Why?  Because business rules change or the report needs tweaking or the requester didn't think through the entire process at the beginning.

So in no particular order, I grabbed some bullet points off the web:
    1. Is this a new report, or a modification to an existing report?

    2. Report owner?

    3. Business Contact?

    4. What is the purpose of the report?

    5. By what date does the report need to be available?

    6. Summary or Detail Report?

    7. (Proposed) Report Title?

    8. What is the frequency of publication (Ad Hoc, Monthly, Quarterly, 6 Monthly, Annually, Other (please specify))?

    9. What system will this report be based on (name of application/database)?

    10. What fields/information should be displayed on the report?

    11. What criteria/conditions will you need?

    12. What (special or conditional) formatting do you need on each column?

    13. Report period (is the report for a single day, a month, a year, Begin/End Dates, etc.)?

    14. Who are the Users of the report / Security?

    15. What are the sorts that should be applied?

    16. What special calculations need to be performed?

    17. Totals and Subtotals?

    18. Export Format (Email, CSV, TXT (Tab Delimited), XLS, Other (please specify))?

    19. Provide a Report mock-up with all data elements labeled: column descriptions, formula calculations, format type, max characters, etc.

    20. What else would you like included?

What are some other things to request?  Gotchas?

Because my department is understaffed, the workload is constant and the number of requests increases over time, I need to implement some structure to prevent us from drowning.


SSIS and Master Data Services

Back to work tomorrow, Spring Break is officially over.

I never worked so hard as I did this week while on vacation.

Got plenty of hours on my part time job.

And I started another part time project.

I got to work in SSIS the past two days.

Mostly massaging code which another developer wrote.

However, I had to learn the biz rules in order to fix the bugs and enhance everything.

The packages were a bit complex, for loops, multiple sequences, derived fields, variables, truncating tables.

The packages had more going on than what I'm accustomed to, so I went through the process flow of every item, which took some time but allowed me to learn what was going on.

Because as you probably know, SSIS is a bit fragmented and not easy to follow, everything is embedded or hidden so you have to click on everything to figure it out.

Also, I worked in Master Data Services 2008 R2.

Basically, there's a web front end, where you can create Models (another name for Schema), and there are Entities (another name for tables) and leaves (another name for columns) in the System Administrator.

You can view and edit the data in the Explorer.

There's versioning.

And there's Integration Management, which allows you to import / export data.  The underlying data structure is a bit whacked, so they have the ability to create VIEWS to make things easier, which are runnable in T-SQL.

And I learned how to remove data from the database, which is a lot different than what you'd think.

You have to Insert data into a table, which gets picked up in another job, which De-Activates the data, strange but true.

Overall, glad I got to learn another product in the BI stack, although not an expert, I learned fast enough to complete the first assignment.

And I guess that's all we can really ask for, to learn something new on the fly with minimal training.

Because there's too much to learn and not enough time.

So back to work tomorrow.

Let's make it a good one!