The first week, the client offered to stand up the Hadoop cluster. Except we learned Hadoop was no longer supported on the Windows operating system. I was on vacation the second week of the project, but I decided to take the laptop along to the cabin and got Hadoop working on Microsoft Windows 10 Hyper-V, no easy feat. I configured it to allow remote connections from the laptop. After doing some research, it turned out Visual Studio 2015 had three new components for working with Hadoop. I played around with them, got flat files to flow to HDFS, sent Pig scripts which worked, and pushed data into Hive tables. That was all I needed before returning for week 3 of the project.

Upon return, I was brought in to assist with the troubleshooting, and we reinstalled Hadoop one or two more times, finally getting it stood up on Linux: a Master Node with three Data Nodes. I ported my Visual Studio 2015 source code to the server, connected it with Git, and the project was humming along nicely. Data flowed from Excel to text files, was pushed from a shared folder into HDFS, massaged by a Pig script, loaded into Hive tables, then into Hive ORC tables. Then I figured out how to install PolyBase on SQL Server 2016, allowing data to flow seamlessly from Hadoop to SQL Server using common T-SQL.

I architected the SSIS project. It flowed the data into Master Data Services entities and ran it through some custom logic to parse and clean the addresses. An SSIS component in C# then called an internal web service to geocode the latitude and longitude, and another C# call sent the lat and lon to a second web service for geo tagging. Finally, the data was pushed into SQL Server ODS tables.

However, around that time, the Hadoop cluster went down. I troubleshot the server for hours and hours and hours, deep into the nights. The thing about Hadoop: if it stops working, there are so many configuration files to go through and investigate. If one file has one entry with an extra space, the entire thing stops working.
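The flat-file-to-HDFS-to-Hive leg of that pipeline boils down to a couple of commands. Here is a minimal sketch, not the project's actual scripts: paths, file names, and the Pig script name are illustrative, and `RUN=echo` makes it a dry run (set `RUN=` on a real cluster to execute).

```shell
# Dry-run by default: every command is echoed instead of executed.
RUN=${RUN:-echo}

# Copy a local flat file into HDFS, creating the target folder first.
stage_flat_file() {
  src=$1
  dest=$2
  $RUN hdfs dfs -mkdir -p "$(dirname "$dest")"
  $RUN hdfs dfs -put -f "$src" "$dest"
}

# Run a Pig script; -useHCatalog lets the script STORE results
# straight into Hive tables via HCatalog.
run_pig() {
  $RUN pig -useHCatalog -f "$1"
}
```

The same shape works whether the files are dropped by hand or pushed by a scheduled job from the shared folder.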
I looked through every folder and many files, interrogating everything, and took tedious notes along the way. I searched many a website and blog attempting to fix it. Permissions. Config files. User accounts. Error logs. Host files. You name it. I was told to hold off on the issue, as I had to jump back on my actual assignment. But I couldn't let it rest and would work on it after hours. Then I saw something: in one of the web UIs, you could see the jobs as success, failed, killed, and so on. There it was. After picking through the logs, I saw the job was still running. And so were 150 other jobs, which I had initiated during my development. The server had gotten backed up. I found a site explaining how to kill the processes, and proceeded to execute a command to kill one process at a time, for 150+ processes. Restarted Ambari, and bam! I could execute Hive with no errors on the command line. Then I flowed some Pig scripts through Visual Studio 2015 using the WebHCat server, and sure enough, we were back in business. Solved it~!
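That one-command-per-process grind can be scripted. A hedged sketch of the cleanup, assuming Hadoop's `yarn` CLI on the cluster (the application ids below are made up, and `RUN=echo` keeps it a dry run):

```shell
# Dry-run by default: the kill commands are echoed, not executed.
RUN=${RUN:-echo}

# Pull application ids out of `yarn application -list` output on stdin;
# data rows begin with an id like application_1499_0001.
running_app_ids() {
  awk '$1 ~ /^application_/ { print $1 }'
}

# Kill each application id read from stdin, one yarn invocation apiece.
kill_each() {
  while IFS= read -r app_id; do
    $RUN yarn application -kill "$app_id"
  done
}

# On a live cluster this would be driven by:
#   yarn application -list -appStates RUNNING | running_app_ids | kill_each
```

Piping the list straight into the kill loop beats retyping the command 150 times, and the dry run lets you eyeball exactly which jobs will die before you commit.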
I like to solve difficult problems. Especially the ones that others gave up on. Those are the juicy problems, the ones that are not easy to find, and solving this one took meticulous troubleshooting over many, many hours. I rolled off the Hadoop project before the Data Warehouse was completed, but I created the SSIS files to handle the Dim and Fact tables, as well as refresh the Tabular Model. That is what you call a great project.
And there you have it~!