Top 10 Things to Look For at a Company

When you look at a company to decide if they're up and coming, here are a few things to investigate:

  1. Product - Is the product respected in the industry?
  2. Net Worth - Do they have Cash on Hand?
  3. Revenue - What is their Monthly Intake versus Monthly Outgo?
  4. Culture - How is the atmosphere?
  5. Salary - Does the company pay Market Rates?
  6. Technology - Are they using the Latest / Industry Standard Software?
  7. Commute - Will you be stuck in traffic for hours per day?
  8. Bonus - Will you be receiving unexpected Monies during the year?
  9. Upward Mobility - Do you have a chance to progress?
  10. Benefits - Will you be able to pay for your Health necessities?


#PerformancePoint 2013, Here I Come!

I've been tasked with the creation of Enterprise level Dashboards.

Our development platform is Microsoft:

Database: SQL-Server 2012
Tabular Model Analysis Services: SQL-Server 2012
Delivery: SharePoint 2013
ScoreCards / Dashboards: PerformancePoint 2013

I do have access to the Database, SSAS, Visual Studio and SQL-Agent.

So I was able to get the data organized, create the Tabular Model project in Visual Studio 2010, create the SSAS cube, and refresh the cube using SSIS, scheduled in SQL-Agent.

And that's when the trouble began.

First thing to note, it's locked down tight: I have almost no permissions to the SharePoint Web interface or the Server.

Working with the Server Admins, we got SharePoint 2013 up and running; however, I could not deploy the Dashboard Designer. Once that was resolved, I could not create a Data Connection for PerformancePoint. After that was resolved, I could only access via "Unattended Service Account" and not "Per-User Identity". Once that was resolved, I could not access the PerformancePoint Content List. After that was resolved, hallelujah! I now have access. How simple was that?

So now I'm diligently working on the creation of some new KPIs, which will reside in the new ScoreCards, which will reside in new Dashboards, which will reside on SharePoint.

The one issue I have so far, is for a particular KPI, you have an Actual Value / Target Value.

So let's say the Actual Value is 98 and the Target Value is 100. The Indicator shows Green, which is correct; however, it shows -2% instead of 98%. The users would prefer the 98%.

Does anyone have a suggestion on how to accomplish this?
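While I wait for suggestions, here's the arithmetic behind the two displays, just to make the ask concrete (a minimal sketch using the values above):

```python
# KPI display: variance vs. percent-of-target (values from the example above)
actual, target = 98, 100

# What the scorecard currently shows: variance from target
variance_pct = 100 * (actual - target) // target      # -2 (shown as -2%)

# What the users would prefer: percent of target attained
percent_of_target = 100 * actual // target            # 98 (shown as 98%)

print(variance_pct, percent_of_target)
```

In PerformancePoint terms, the question is how to get the scorecard to render the second calculation instead of the first.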

Much obliged!


Obsolete Items From My Generation

I looked down at my watch because my beeper went off, so I tried to call someone from my home phone, then ran down to the nearest payphone, and on the way passed an AAA to stop in for a Trip Ticket.

You see, I posted an ad in the classifieds to sell my encyclopedia collection, which I purchased on dial-up internet, AOL of course, and I went to speak to a teller at the local bank.

I was listening to my Walkman tape player, which I kept next to my Polaroid instant camera, and had to look up a phone number in the Yellow Pages because my typewriter stopped working.

Turns out the fax machine was out of paper, so I jogged to the post office to mail a letter, because my VCR was acting up and I had no head cleaner at the time.

You see, I was trying to find things that were obsolete during my generation, and the only thing I could find was our political system, our civil liberties and our chance for prosperity.

Get it?


#BigData Is Here to Stay

Big data is here to stay.

And the amount of data will continue to grow.

Except not all of it will be in Traditional Relational Structured format.

So will relational databases survive?

Sure, for transactions perhaps. And as a pass-through: to get unstructured data into reports, it often flows through a structured format.

Yet the amount of relational data is being outpaced by data generated from alternative sources.

And relational data flows into Traditional Data Warehouses.

Which are difficult to maintain and change, as well as expensive to build and to find talent to support.

Big Data is an extension of the Data Warehouse. And Big Data is now accessible via SQL-like languages such as Hive.

As well as SQL-based engines such as Cloudera's Impala, which bypasses Map-Reduce.

So you ingest as much data in any format into Hadoop, perhaps move that data into a NoSQL database such as HBase. From there you can expose that data in structured format and run ANSI SQL to query it super fast. On top of that, you have software for visualizations, graphs and maps, as well as Dashboards and Reports.

From my perspective, the technology is moving closer to the Data Analysts. 

I see Data Scientists moving towards the technology, perhaps, but not as much due to the learning curve and shortage of talent. Also, there's some difficulty in the exact definition of the role, as well as hiring manager expectations of what the position entails.

Granted, there's a learning curve to become proficient in Big Data, but there was also a curve to learn Star Schemas and MDX.

As technology improves, the average Data Analyst will know enough to tame Big Data in my opinion as the trend has already begun.

As with Traditional Reporting, Big Data poses challenges such as bad data in, bad data out, Data Cleansing, Data Governance and producing accurate reports.

And finally, we still need a simple way to interpret the data.

And then do something with that insight.

Big Data is here to stay.  As far as Traditional Relational Databases and Traditional Data Warehouses, time will tell.


8 Hours of Coding Today

Today was my first day back from training class.

It was most enjoyable, why?

Because I got to code all day.

What did I code? A new Tabular Model from scratch.

A teammate gave me the source data from the system he maintains.

I wrote an SSIS job to port the data over to my database.

And wrote a Stored Procedure / SQL-Agent job to pull in some data from SalesForce.

Once the data was local, I created a Visual Studio Tabular Model project to ingest the data into SSAS, and scheduled that to run nightly.

Then created another SSIS package to refresh the data, scheduled that to run just after the first job.

Then created a Business Intelligence Semantic Model (BISM) connection in SharePoint 2013 and assigned roles.

Then created a PowerView and a PowerPivot to point to that data, and sent it out to the main users for verification.

I kid you not, they've been asking for this data for as long as I've worked at this company, and now we have it, for user consumption, to slice and dice at will.

Some days you get to code for 8 hours; today was one of them, which felt great and exhausting at the end.

Tomorrow, who knows what I'll get to work on, that's half the fun of working as a programmer!


Attempt to Install Cloudera Manager

Went to the Cloudera website, 1st step is to download the Cloudera Manager.

Make sure you are running a 64-bit VM; I'm using CentOS.

Be sure to set SELINUX=disabled in /etc/selinux/config.


Had to search for the VM's IP address.

Had to set user permissions.

However, it crashed here because I was reusing a VM from 3 months ago. A simple modification to the HOSTS file with the new IP address fixed it, no problem!
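The screenshots from those steps didn't survive, but the commands were along these lines (the IP address and hostname here are placeholders, not the actual values from my VM):

```shell
# Disable SELinux (edit the config, then reboot for it to take effect)
sudo sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

# Find the VM's IP address
ifconfig eth0 | grep "inet "

# Map the VM's new IP address to its old hostname
# (placeholder values -- this is the HOSTS file fix that resolved the crash)
echo "192.168.56.101  cloudera-vm" | sudo tee -a /etc/hosts
```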

and Success!!!


Top 10 Bad Practices When Writing #SQL

Here are some bad practices I see when viewing SQL code:
  1. Spaces in the table names
    1. ...From [User Info]
  2. Spaces in field names
    1. Select a.[First Name]...
  3. Dates stored as Strings
    1. CommissionDate as String
  4. Lack of descriptive 'Alias' names
    1. Inner Join Contacts a with (nolock)
  5. Inconsistent programming styles
    1. Using CTE, then Temp Tables, then Table Variables
  6. Unnecessary code 
    1. i.e. "Ltrim(RTrim(CustomerName))"
  7. Scattering Tables across Multiple Databases on a Server
    1. ...From Sales.dbo.Customer Inner Join Leads.dbo.Customer...Inner Join Demographics.dbo.State
  8. Messy Code
    1. Difficult to follow, not spaced for easy reading
  9. Using Cursors
    1. Use set-based Joins instead; they're more efficient
  10. No Comments / Documentation
    1. Please add comments to inform next person what you are doing
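To make item 3 concrete, here's a quick sketch (Python's built-in sqlite3, with a hypothetical one-column table) showing how dates stored as strings sort in the wrong order:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical table with a date stored as a string (bad practice #3)
cur.execute("CREATE TABLE Commission (CommissionDate TEXT)")
cur.executemany("INSERT INTO Commission VALUES (?)",
                [("9/1/2013",), ("10/1/2013",)])

# String comparison sorts character by character, so October lands
# before September -- chronologically wrong
rows = [r[0] for r in cur.execute(
    "SELECT CommissionDate FROM Commission ORDER BY CommissionDate")]
print(rows)  # ['10/1/2013', '9/1/2013']
```

Store dates in a proper date type and the engine sorts (and indexes) them correctly.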


Programming Courses Only Teach Surface Level Info

Programmers can take classes to learn new subjects / languages.

Or they can learn on their own.

When you do attend a class, you should understand what you're getting.

And that is, you will be introduced to topics you've never seen before.

You will not learn in depth detailed knowledge or how to solve specific problems.

They will gloss over subjects at a fast pace, and you will walk away with new knowledge of a subject.

But not the subject itself.

For example, they will teach you what a peanut butter and jelly sandwich is.

Then you'll have a lab to build one yourself.  And then another very similar lab building the same sandwich with 3 slices of bread instead of two.

And another lab using extra jelly.

The labs introduce you to the subject, then you repeat the process over and over, with slight modification.

At the end of the class, you will know what a PB&J sandwich is, and you will have had experience building it.

But if you want real in depth knowledge, you may as well just read a few good books and get your own on-hand experience.

Learn it yourself, because taking courses only introduces you to a new topic; it doesn't teach you what you need to know to earn a living or get certified.

That's all I'm saying!


Recap of my Cloudera Hadoop Training

I just completed a 4 day course on Big Data Hadoop from Cloudera.

And my takeaway is this.

Map Reduce is powerful, can be written in Java, has a bit of complexity to it, yet you can re-use code to ease the pain.

HIVE is definitely a friend of any BI developer.  Through SQOOP, you can ingest raw data from a relational database.

Why would you do this? Mainly for the amount of data you can store. It overcomes the size limitations of your standard database, as you can throw as much data into it as you want.

And then access it through HIVE SQL which is very familiar to any SQL developer.  You won't get rid of your SQL-Server any time soon, as that's ideal for transactional data with fast reads, writes, updates and deletes.

Once the data's in Hive, you can run whatever queries you like, and output the results back to SQL-Server through Sqoop if you choose.
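As an example of how familiar it looks, a Hive query over the class's movie data might read like this (a sketch; the column names are my assumption):

```sql
-- HiveQL sketch: average rating per movie (movierating table from class)
SELECT movieid, AVG(rating) AS avg_rating
FROM movierating
GROUP BY movieid
ORDER BY avg_rating DESC
LIMIT 10;
```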

It's got lots of potential and I hope to be using it soon.

In 2 weeks, I'll be attending another Cloudera course to learn HBase database for Hadoop.

Should be fun!

#Hadoop Table Joins

It is possible to join 2 or more tables in Hadoop.

One way is to have a Mapper put the first table in Memory.  Then loop through and do the Joins to the second table.  However, this could overflow your memory buffer with large data sets.

Another way is to Join the two tables on the Reducer side, by passing in the Joining key, identifying which table each record came from, and applying the join based on the key. This also requires lots of resources.
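Here's the reduce-side idea sketched in plain Python (hypothetical tables and keys; in real Hadoop the shuffle/sort does the grouping for you):

```python
from collections import defaultdict

# Two hypothetical "tables" keyed on a join column
orders = [(1, "order-A"), (2, "order-B")]
customers = [(1, "Alice"), (2, "Bob")]

# Map phase: tag each record with which table it came from
mapped = [(k, ("O", v)) for k, v in orders] + \
         [(k, ("C", v)) for k, v in customers]

# Shuffle/sort: group the tagged records by the join key
groups = defaultdict(list)
for k, tagged in mapped:
    groups[k].append(tagged)

# Reduce phase: within each key, pair customer rows with order rows
joined = []
for k, tagged in sorted(groups.items()):
    custs = [v for tag, v in tagged if tag == "C"]
    ords = [v for tag, v in tagged if tag == "O"]
    joined.extend((k, c, o) for c in custs for o in ords)

print(joined)  # [(1, 'Alice', 'order-A'), (2, 'Bob', 'order-B')]
```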

Both of these methods are possible, and both use lots of memory and coding.

I prefer my way: run your first Map/Reduce job from your Driver, then, based on the output of that Outer M/R job, call an Inner Map/Reduce job.

This is how Visual Basic programmers back in the day joined two tables without having to know what a Join was.  It was slow then, and it's slow now.  Except the coding is simpler and is not memory intensive.

#Hadoop #SQOOP Training

In today's Cloudera training, we covered Sqoop, short for "SQL-to-Hadoop".

It's a tool created at Cloudera to ingest data from a relational database into HDFS.

So the first step was to log into MySQL.

Then poke around and view some table structures.

mysql> DESCRIBE movierating;

And then imported a MySQL table into HDFS using the import command.

We imported over 1 million rows in about 21 seconds...
Once ingested, we can view the contents from within HDFS.
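The command screenshots didn't make it into this post; the exercise's commands looked roughly like this (connection string, credentials and paths are placeholders from memory, not exact):

```shell
# Import the movierating table from MySQL into HDFS
sqoop import \
  --connect jdbc:mysql://localhost/movielens \
  --username training --password training \
  --table movierating

# View the imported contents from within HDFS
hadoop fs -cat movierating/part-m-00000 | head
```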

Once in Hadoop HDFS, we can now run Map Reduce jobs.



#Hadoop Garbage Truck Analogy

When I think of Hadoop, I think of large sets of data.

Stored in the Hadoop Distributed File System (HDFS).

Where you can query its contents.

It's similar to a garbage pickup truck, basically the Map phase.

Each truck goes around the neighborhood, picking up a copy of your garbage.

There's a separate truck for each house.

Once they've got your load, they return to a meeting point, where all the garbage is collected.

By that, I mean it gets Shuffled, Sorted and Merged.

Then they send that batch of merged, sorted garbage to another randomly chosen hub, where it's re-sorted with the garbage from all the other trucks; that's the Reducer Phase.

The Reducer sorts all the garbage again and produces its own output, in the form of a file (or files).

Kind of a stretch analogy, but it's close.

And there you have it.


Day 2 Hadoop Training

Today is day 2 of Hadoop training.

First we did an exercise using Eclipse in Java.

We loaded some sample Java files into the IDE: the Driver, Mapper and Reducer.

Then exported to a Jar file and ran a Hadoop command to start the job, specifying the Input and Output locations.

Then we learned about Unit testing in JUnit and did an exercise.

One cool feature is the ability to step through the Java code in the Eclipse IDE.

After lunch we discussed Combiners, which sit between the Mapper Phase and the Reducer Phase.  Basically, a Combiner sums up the data nice and neat prior to sending it to the Reducer, which saves network traffic when dealing with huge sets of data.

Next were Partitioners; basically you can pre-assign which partition a Reducer's input gets sent to.  For example, if you partition by month, you specify the 12 months in a configuration, then create 12 partitions (0-11), and based on the month, the record is sent to that partition for the Reduce phase.
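The month-to-partition mapping we configured can be sketched like this (plain Python standing in for the Java Partitioner class):

```python
# Month-based partitioning: 12 partitions (0-11), one per month
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def partition_for(record_date):
    """Return the partition number (0-11) for a 'Mon DD YYYY' date string."""
    month = record_date.split()[0]
    return MONTHS.index(month)

print(partition_for("Apr 15 2013"))  # 3
```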

We did labs on both these topics.  Basically, they provide the solutions, I  get them to run, then study the code. 

I think the next topic will be Map-only jobs (no Reducers) and Counters.

We'll see what happens!


#Hadoop Basics I learned in Class

So I got training today on Hadoop.

Teacher was from Cloudera.

18 other programmers attended as well as our CTO.

Today was day 1 of 4.

We learned the history of Hadoop, started out as Lucene to Nutch to Hadoop.

And the Hadoop Distributed File System (HDFS), which is a file system that sits on top of the native file system.

Data gets ingested through Hadoop commands and distributed into chunks, each of which sits inside a Block, typically 64 MB.  Each block is replicated 2 more times for failover and redundancy.

The data sits in HDFS waiting to be queried.

We do that with Map Reduce.  The Mapper phase goes through each of the files, line by line, where it's dissected into Key Value pairs.

Those K/V pairs are then collected from all the Data Nodes into one or more randomly chosen Data Nodes, where they are Reduced.

There they are aggregated (summed, averaged, min, max, etc.) and output to a file within Hadoop HDFS.

There's a "conductor" which keeps track of all the Data Nodes via Metadata stored on the Name Node.  The Name Node happens to be a single point of failure, whereas the Data Nodes can lose 2 of the 3 copies of the data and the system will self-heal by copying that data to other healthy nodes.

The Name Node keeps track of Jobs, or Programs, via the Job Tracker.  It issues jobs to each of the Data Nodes through its Task Tracker and keeps a 'heartbeat' to know when a task completes or fails, and will spawn a new task on another node if it does fail.

It will also send the same job to another Data Node, where the first job to finish wins; this is called Speculative Execution and can be toggled on and off in the config.

There's also an interim step called Sort, Shuffle and Merge.

This occurs after the Mapper phase and it sorts the data, shuffles it to the new Data Node and is merged with all the data from the other tasks.

It creates a Key / Value pair where the value is an array of items, which then gets split in the Reduce phase.

You can keep the Mapper jobs simple, as in the WordCount sample, which parses files into raw lines of data, which gets split into words, which get sent to the Reducer for summation.
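The WordCount flow the class used (in Java) maps directly to a few lines of plain Python, which is how I keep it straight in my head:

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: each line is split into (word, 1) Key/Value pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort/merge: group the values by key
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: sum each word's counts
counts = {word: sum(vals) for word, vals in sorted(grouped.items())}
print(counts["the"])  # 2
```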

You can also apply complex logic in the Mapper phase by applying filters or exploding the data, or you can skip the Reducer phase entirely and just keep the raw detailed data from the Mapper.

The Mapper receives 2 main parameters, the Input file and Output file, along with the data types.  The data types must be comparable and sortable for it to work, so you must declare the variables appropriately.

You can split the Reduce work across multiple Reducers to maximize performance.

That's basically what I remember from the class today, they covered a lot, and I took detailed notes which I need to transcribe.  What I do is write down almost everything the instructor says, then go back and write it so I can read it, then study it.  That's my learning pattern and it got me through college.

They also touched on the administration aspects of Hadoop, as there's a whole course dedicated to that subject.  Typically a small Cluster consists of 10-40 Data Nodes.  40+ is considered big and requires much tuning for maximum performance.

There are Master Nodes and Slave Nodes.  Some are single points of failure and some aren't.

We've got three more classes to go, where they'll talk more about HBase, Pig, Hive maybe Impala.

HBase is a NoSQL database which sits atop HDFS.  It's got sparse data, wide columns, and can sit within the Cluster.  It typically holds less data than HDFS, and you can specify what and how much data to ingest.

Pig uses a scripting language called Pig Latin.  It uses data sets, called in sequence, where you can sort, group and aggregate data; it gets converted to Map Reduce, which is not that fast.

Hive is a Hadoop Query Language similar to SQL.  It too gets converted to a Map Reduce job and is not that fast.

Impala was developed by Cloudera and was released just a few weeks ago.  It too is SQL based, except it runs natively and doesn't get converted to Map Reduce, thus it's extremely fast.  With my SQL background, that has potential!

There's also a Machine Learning Mahout module we may learn if time permits.

Anyway, got to rest up for tomorrow's class - should be loads of fun!!!

Post from this am...

#Cloudera Training - Map Reduce - Tampa Florida #BigData

Today's our first day of Cloudera Big Data Training for Map Reduce in Tampa, Florida.

Our presenter is from Cloudera.

20 of our developers including CTO are here.

And the presentation is great.  Good overview, history and details surrounding Hadoop, HDFS, Map Reduce so far.

We have a VM of Cloudera's version of Hadoop, which will probably be used for the demos shortly.

I've been taking some good notes as there's lots to learn.

Here's a follow up post from today's class...


Web Browsers Are Just Smart Terminals

If you think back to the mainframe days, all the users had was a dumb terminal and a keyboard.

And if you think about it, a Web Browser is basically a Smart terminal.

Through it, you can connect anywhere in the world, get to any website, perform some amazing graphics, video, you name it.

I remember my father's IBM PC from 1982, with a 1200 baud modem.  The green lines scrolled so fast on the color monitor that we thought, who would want anything faster?  We couldn't even read the screens, they scrolled so fast.

So in a sense, we have come a long way since the 1970's and 80's.

Dumb Terminals, to Color Monitors to freaking awesome Web Browsers!

Where's That Data Come From?

I would estimate half my day is spent searching for data.

Where can I find this info?  How does that data get populated?

How are regions determined?  What countries make up the Latam region?

You think I'm kidding, but it's true.

Some places have no authoritative source for where all the data is, how it flows through all the systems, and what the data means.

People think Self Service means you just give the users a place to create and run their own queries.

Some places, the full-time report writers don't even know the answers to some of the basic questions about the data.

And don't get me started on clean data.  Fat fingering data.  No lookup tables.  No edits on the front end systems to prevent garbage data.

Report Writing is not an easy profession.  Yet I saw a job posted last week for $35k - $40k depending on experience.  I also saw an ad to migrate an Access db to SQL Server.  Paid $10 per hour.

Last time my toilet backed up, the plumber showed up for an hour and cost me over $100.  What am I missing here?

Who Reboots Their PC Every Day?

How often do you reboot your PC?

Every day?  Week?  Month?  Never?

How about when Windows pushes a new update, and you show up for work the next day to a message that your PC was updated automatically?

Thing is, rebooting really takes a long time, who's got that kind of time in the morning?

At work I reboot every few days.  At home, every night, because it's a laptop and the light emitted from the monitor keeps us awake.

Who'd like a faster time to reboot?  All of us probably.  Then we'd do it more frequently perhaps.

Sure does waste a lot of time if you think about it.