First Microsoft CNTK Project in Visual Studio on Windows

The other day, I posted about CNTK, the new open source technology from Microsoft.

There are now many open source technologies to handle Machine Learning, Deep Learning, Artificial Intelligence, Neural Networks, and many flavors of each.

But this one runs on Microsoft operating systems, a feature that I especially enjoy.

Here's a URL to get started with an example.

So let's get started.

Here we are in Visual Studio 2013, in the project we created the other day:

What is all this stuff?  Well, the project was written in C++, which is portable to Windows or Linux or Unix.

Here's the folder containing the raw files that comprise the project:

Unfortunately, there isn't much documentation on getting started, as the technology was just released on GitHub the other day.  So I do what I traditionally do when learning any technology: poke around to see what's there.

First thing, there's a folder that contains some PDF files:

Interesting... there's an example folder:

Perhaps some clues...

With a "Readme" file:

"This example demonstrates usage of NDL to train a neural network on CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html).
CIFAR-10 dataset is not included in CNTK distribution but can be easily downloaded and converted by running the following command from this folder"
Says there's a data set on the web that can be downloaded, running a Python script:  python CIFAR_convert.py

So we open a command prompt, navigate to the desired directory, and do a search for the Python file "CIFAR_convert.py":

Thar she blows... except the command did not run, because there's no "Python" entry in the Environment Variables.  So let's add it:

Added entry to Environment Variables:
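A quick way to confirm the PATH change took effect is to ask Python itself, from a fresh command prompt, whether it can see the program. This is only a minimal sketch, and it assumes a Python 3 interpreter (shutil.which is not available in the Python 2.7 used in this post); the helper names are made up for illustration.

```python
import os
import shutil


def find_on_path(program, path=None):
    """Return the full path of `program` if it is on PATH, else None."""
    return shutil.which(program, path=path)


def path_entries(path=None):
    """Split a PATH-style string into its non-empty directory entries."""
    raw = os.environ.get("PATH", "") if path is None else path
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    location = find_on_path("python")
    if location:
        print("python found at", location)
    else:
        print("python is not on PATH; current entries:")
        for entry in path_entries():
            print("  ", entry)
```

If the interpreter is not found, the printed list shows exactly which folders the command prompt is searching.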

Next error, we need to install "Numpy":

After reading the documentation, there's an alternate way: loading a different Python distribution, Anaconda:

Installing the 64-bit version:

One note: every time the Environment Variables were modified, I rebooted so the change would take effect, although opening a new command prompt is usually enough.

It still does not recognize NumPy.  Researching, it turns out the versions of Python and NumPy need to line up, in my case 2.7.3: http://stackoverflow.com/questions/11200137/installing-numpy-on-64bit-windows-7-with-python-2-7-3
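To see what actually needs to line up, it helps to print which interpreter is running, whether it is 32-bit or 64-bit, and whether that particular interpreter can import NumPy. Here's a small sketch using only the standard library; the function names are hypothetical.

```python
import importlib
import struct
import sys


def interpreter_summary():
    """Describe the running interpreter: version string and pointer width."""
    version = "%d.%d.%d" % sys.version_info[:3]
    bits = struct.calcsize("P") * 8  # 32 or 64
    return version, bits


def module_version(name):
    """Return a module's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version, bits = interpreter_summary()
    print("Python %s (%d-bit) at %s" % (version, bits, sys.executable))
    print("numpy:", module_version("numpy") or "not installed for this interpreter")
```

The sys.executable path is the useful part when more than one Python is installed, since a package installed into one distribution is invisible to the other.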

However, this loaded the file into the Anaconda installation; we wanted it in the c:\Python27\Scripts folder.

So re-downloaded the correct numpy file as instructed in the Stackoverflow article above:


Downloaded file, copied to the Scripts directory of the Python27 folder on the c: drive:

Installed successfully!

So running the application in the IDE, bypassing Debug mode, we get the following:

This indicates "no command line arguments found".  So we need to run the app from the command line apparently, and pass in parameters.

The CNTK.exe file resides here: C:\Users\jbloom\Desktop\BloomConsulting\MicrosoftCNTK\CNTK-master\x64\Debug

Here's the command to execute the job:

configFile=01_Conv.config configName=01_Conv

Holy Toledo, looks like it ran, no errors... interrogating further...

It created a log file in the Output directory:

Looking at the log file, the run threw an error midway through:


command: Train Test
precision = float
CNTKModelPath: ./Output/Models/01_Convolution
CNTKCommandTrainInfo: Train : 30
CNTKCommandTrainInfo: CNTKNoMoreCommands_Total : 30
CNTKCommandTrainBegin: Train
NDLBuilder Using CPU
Reading UCI file ./Train.txt
About to throw exception 'UCIParser::ParseInit - error opening file'


    -UCIParser,std::allocator > >::ParseInit
    -std::_Ref_count_obj >::_Ref_count_obj >
    -std::make_shared,Microsoft::MSR::CNTK::ConfigParameters & __ptr64>
    -CreateObject >

EXCEPTION occurred: UCIParser::ParseInit - error opening file
Usage: cntk configFile=yourConfigFile
For detailed information please consult the CNTK book
"An Introduction to Computational Networks and the Computational Network Toolkit"

Adding command line arguments to the Visual Studio Project:

"From the output of the above command you simply copy the 'VS debugging command args' to the command arguments of the CNTK project in Visual Studio (Right click on CNTK project -> Properties -> Configuration Properties -> Debugging -> Command Arguments). Start debugging the CNTK project."

CNTK needs to know the actual path of the config file, so I modified the command line argument slightly:

configFile=C:\Users\jbloom\Desktop\BloomConsulting\MicrosoftCNTK\CNTK-master\Examples\Image\Miscellaneous\CIFAR-10\01_Conv.config configName=01_Conv
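To avoid typing the long path by hand, the argument string can be built from a relative file name. This is just a sketch: configFile= and configName= are the keys used in this post, while the function and its base_dir parameter are hypothetical.

```python
import os


def cntk_args(config_file, config_name=None, base_dir=None):
    """Build CNTK-style key=value arguments with an absolute config path.

    base_dir is a hypothetical parameter: the folder a relative
    config_file is resolved against (defaults to the current directory).
    """
    base = base_dir if base_dir is not None else os.getcwd()
    absolute = os.path.normpath(os.path.join(base, config_file))
    args = ["configFile=" + absolute]
    if config_name:
        args.append("configName=" + config_name)
    return args
```

Joining the result with spaces, as in " ".join(cntk_args("01_Conv.config", "01_Conv")), gives a string that can be pasted into the Command Arguments box. A path containing spaces would still need quoting.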

Running Visual Studio project in debug mode...it lets you step through the C++ code, memory allocation, low level stuff...

Turns out, when you run through VS, it places the Output file in a different location:


command: Train Test
precision = float
CNTKModelPath: ./Output/Models/01_Convolution
CNTKCommandTrainInfo: Train : 30
CNTKCommandTrainInfo: CNTKNoMoreCommands_Total : 30
CNTKCommandTrainBegin: Train
About to throw exception 'error opening file './Macros.ndl': No such file or directory'


    -Microsoft::MSR::CNTK::attempt< >
    -Microsoft::MSR::CNTK::attempt< >
    -std::_Ref_count_obj >::_Ref_count_obj >
    -std::make_shared,Microsoft::MSR::CNTK::ConfigParameters const & __ptr64>

attempt: error opening file './Macros.ndl': No such file or directory, retrying 2-th time out of 5...

Turns out, we need to set the output directory:
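Both errors in the logs involve relative paths (./Train.txt and ./Macros.ndl). Relative paths are normally resolved against the folder the process is started from, so a quick pre-flight check from that folder can save a failed run. A minimal sketch, with a hypothetical helper name:

```python
import os


def missing_files(working_dir, relative_names):
    """Return the names that do not exist when resolved against working_dir."""
    return [
        name for name in relative_names
        if not os.path.isfile(os.path.join(working_dir, name))
    ]
```

For example, missing_files(os.getcwd(), ["./Train.txt", "./Macros.ndl"]) returns an empty list only when both files are visible from the current folder.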

Well, if you remember, earlier in the post, we got stuck at pip, trying to load the NumPy files, because it loaded them into the Anaconda Python folders.  Uninstalled the Anaconda application.

Ran file get-pip.py from command line again:

Okay, now that pip got installed, we go back to this URL to get the correct version of the files (Python and NumPy versions align): http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

Alright!  I reran the Python script to download the data set from the web so we could train the model, except it threw an Out of Memory error.

Turns out Python will eat all your available memory, then kaput!
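The internals of CIFAR_convert.py aren't covered here, so this is not a patch for it. It is a general sketch of one way to keep memory flat when turning a large data set into text: write one row at a time instead of building the whole output in memory first. The fake_images generator is a hypothetical stand-in for real image data.

```python
import io


def write_rows(rows, out, sep=" "):
    """Write each row as one line of text, one row at a time.

    `rows` can be any iterable (including a generator), so only the
    current row has to be held in memory as text.
    """
    count = 0
    for row in rows:
        out.write(sep.join(str(value) for value in row))
        out.write("\n")
        count += 1
    return count


def fake_images(n, width):
    """Hypothetical stand-in for image data: yields n rows of `width` ints."""
    for i in range(n):
        yield [(i + j) % 256 for j in range(width)]


if __name__ == "__main__":
    buffer = io.StringIO()
    print(write_rows(fake_images(3, 4), buffer), "rows written")
```

With a real file opened for writing in place of the StringIO buffer, the memory used stays about the size of a single row no matter how many rows there are.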

Well, at this point, I changed to another example project.  Was able to extract the files for another experiment, in the same Image folder:


It created the necessary files:

It actually pulled the files from here.

 So to continue from this URL:

Run the example from the Image/MNIST/Data folder using:

CNTK.exe configFile=../Config/01_OneHidden.config

Now getting access violation...

Well, stepping through the code, not sure if this project is supposed to work.  The highlighted C++ code, in yellow:

int wmain(int argc, wchar_t* argv[]) // wmain wrapper that reports Win32 exceptions
{
    set_terminate(terminate_this);   // insert a termination handler to ensure stderr gets flushed before actually terminating
    _set_error_mode(_OUT_TO_STDERR); // make sure there are no CRT prompts when CNTK is executing

    // Note: this does not seem to work--processes with this seem to just hang instead of terminating
    __try
    {
        return wmain1(argc, argv);
    }
    __except (1 /*EXCEPTION_EXECUTE_HANDLER, see excpt.h--not using constant to avoid Windows header in here*/)
    {
        fprintf(stderr, "CNTK: Win32 exception caught (such an access violation or a stack overflow)\n"); // TODO: separate out these two into a separate message
    }
}

Moving to the 3rd example project, "Speech", we set the command line arguments,


 or from the CMD (as administrator)

### Run

Run the example from the Speech/Data folder using:

`cntk configFile=../Config/FeedForward.config`

It throws up a warning message:

Check the boxes, let it run...

Well, another access violation or stack overflow.

That's as far as I'm going to take it for now.  From downloading the project, getting it to compile in Visual Studio, then downloading the required files for Python and its dependencies, and getting the versions to line up, it appears my old laptop does not have the memory to handle this code at this point in time.

Just realized there's a page that lists existing bugs for this project: 

So thanks for following along at home.  It's important to stay current with technology.  This just happens to be some rather complex and advanced stuff.  A nice feature, it works on Windows and seems like it has lots of potential going forward!

Again, you can read the first blog post on this subject: http://www.bloomconsultingbi.com/2016/01/first-try-at-microsoft-cntk.html


First Try at Microsoft CNTK Installation on Windows plus (GPUs are coming to Azure)

Saw a post today on Twitter, "Microsoft releases CNTK, its open source deep learning toolkit, on GitHub"

This is big news.  Because now anybody can download an application to run Neural Networks on their own machines.  On Windows operating systems:

So let's get started.

Windows Visual Studio setup

Create an account or log on to GitHub, then go to this link.

Download and unzip from the release page to the folder where you want to install CNTK.

Click on the "Download Zip" button:

File downloads:

And the extracted files:

If Visual C++ Redistributable for Visual Studio 2013 is not installed on your computer, install it from http://www.microsoft.com/en-us/download/details.aspx?id=40784.

Download (x86 or x64) and run:

For the GPU built version, ensure the latest NVIDIA driver is installed for your CUDA-enabled GPU.

You do not need to install the CUDA SDK, though it would be fine if you do so now or in the future.

Install Microsoft MS-MPI SDK and runtime from https://msdn.microsoft.com/en-us/library/bb524831(v=vs.85).aspx


Install CUDA 7.0 from the NVIDIA website.

Download NVidia CUB from GitHub ...
Install NVIDIA CUDA Deep Neural Network library (cuDNN) by downloading the Windows version of cuDNN v4 using the following link. Unzip the file to a folder, e.g. c:\NVIDIA\cudnn-4.0 and set environment variable CUDNN_PATH to the cuDNN cuda directory, for example:

CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0
Install Boost. We are using the Boost library (http://www.boost.org/) for unit tests. We are probably going to incorporate the Boost library into CNTK code in the future. Download and install Boost version 1.59 (you need msvc-12.0 binaries) from SourceForge.
Set the environment variable BOOST_INCLUDE_PATH to your Boost installation, e.g.:
Set the environment variable BOOST_LIB_PATH to the boost libraries, e.g.:
To integrate Boost into the Visual Studio Test Framework you can install a runner for Boost tests in VS from the VisualStudio Gallery.

Downloaded 1.59 Binaries from here: http://sourceforge.net/projects/boost/files/boost-binaries/1.59.0/

Next, Install ACML 5.3.1 or above (make sure to pick the ifort64_mp variant, e.g., acml5.3.1-ifort64.exe) from the AMD website.


Set the environment variable ACML_PATH, to the folder you installed the library to, e.g.

Set the environment variables:


CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0

Next, MKL. (didn't install)

An alternative to the ACML library is the Intel MKL library.

To use MKL you have to define USE_MKL in the CNTKMath project. MKL is faster and more reliable on Intel chips, but it might require a license.

Next, install the latest Microsoft MS-MPI SDK (version 7 or later) and runtime from Microsoft Developer Network.

Next, If you want to use ImageReader, install OpenCV v3.0.0. Download and install OpenCV v3.0.0 for Windows from OpenCV-Org.
Set environment variable OPENCV_PATH to the OpenCV build folder, e.g.

At this point, all files were downloaded off the web:

Installed, and set the environment variables.  
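With this many variables, it's easy to miss one or point it at the wrong folder. Here's a small sketch that reports the state of each one. The variable names come from the steps above (BOOST_INCLUDE_PATH is assumed to be spelled with underscores throughout); the function and the report wording are made up for illustration.

```python
import os

# Variable names mentioned in the setup steps above.
EXPECTED_VARS = [
    "CUDA_PATH",
    "CUDNN_PATH",
    "BOOST_INCLUDE_PATH",
    "BOOST_LIB_PATH",
    "ACML_PATH",
    "OPENCV_PATH",
]


def check_env(names, environ=None):
    """Map each name to 'unset', 'missing directory', or 'ok'."""
    env = os.environ if environ is None else environ
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "unset"
        elif not os.path.isdir(value):
            report[name] = "missing directory"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_env(EXPECTED_VARS).items():
        print("%-20s %s" % (name, status))
```

Run it from a new command prompt, since windows opened before the variables were set still hold the old values.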

Opened the solution in Visual Studio 2013, performed a "Build", it ran for a long time:

At that point, one of the projects did not load in Visual Studio 2013.  

Troubleshot, turns out the CUDA install did not complete successfully.  

So I re-downloaded the cuda_7.0.28_windows_network file and performed a reinstall.

At that point, all the projects loaded correctly in Visual Studio 2013:

Next, attempt a "Build" on the entire Solution:

Successful build!   

Overall, it took several hours to bring down the install files, set the variables, etc.

Will try to run one of the example demos next.  This is not easy stuff; it involves an understanding of advanced math, statistics, as well as real world machine learning.  As well as deep learning.  And Neural Networks.  And Artificial Neural Networks.  And NNs with long short-term memory.

Scanning images, pattern recognition, speech detection. 
Self learning Neural Nets. 
A lot different than writing standard SQL.  But that's what makes it great.  A challenge indeed.

In the meantime, here's a video indicating GPU technology will be on the Microsoft Azure platform at some time in the future:


Tutorial link: http://research.microsoft.com/en-us/um/people/dongyu/CNTK-Tutorial-NIPS2015.pdf

 Good stuff~!


Speed up the Adoption of the Internet of Things - Wireless Connectivity

The Internet of Things, or IoT for short, is poking its head into mainstream technology.  It's the ability to have remote sensors, distributed across the planet, that send packets of information across the internet, to a central hub, for storage, processing, analytics and alerts.

There are existing frameworks, standards, best practices and protocols.  However, the space is already showing signs of splintering: proprietary hardware, a variety of software packages, etc.  Throw in security concerns, project costs, lack of qualified talent, and there are potential hiccups along the way.

Looking from a 10,000-foot view, one thing that could speed up the process, in my opinion, is to have a worldwide wireless internet.  Not individual connections to the internet via cable or WiFi, but completely wireless access from anywhere, anytime.

That would remove one of the restrictions on the process flow for the complete Internet of Things architecture and ecosystem.  A wireless internet would not be concerned if my connected washing machine was temporarily moved outside my house WiFi connection, onto a shipping truck, then to the repair shop.  It would always be connected.  So if I had a sensor in my shirt, to track certain metrics, I could roam the planet, always connected, with the sensor sending information uninterrupted, rather than having to sync back to the hub, in batch mode, once reconnected.

What we need is an actual web that anyone can access from anywhere, anytime.  Could be done with satellites, balloons in the stratosphere, Universal Radio Frequency, or some other technology, to be named later.  That would surely speed up the adoption of the Internet of Things.

Sergey Brin & Larry Page: Inside the Google Brother's Master Mission

I never knew the story behind Google.  I do remember the day the company went public.  Quite a remarkable story.

Lots to Consider with Internet of Things

Let's say I ran a vending machine company.  10 vending machines, each one selling candy bars.  Every day, I'd make the rounds, stocking them, collecting change and dollar bills, fixing machines as needed.  Making sure the energy supply was good and connected.  And refunding any change to people who lost money in the machine.

Sure would be nice to perform these tasks remotely.  Through sensors.  I could ping the machine, every so often, make sure it's functioning.  Check to see what version of software was running.  Perform diagnostics, make sure no issues.  Pull the money from the machine as it's deposited, perhaps through credit cards instead of real money.  Run reports to see peak activity hours and popular brands.  Would save on gas money, driving to and from each location.  Save on resources as I wouldn't have technicians and stocking people, etc.  Collect the information from decentralized machines, collect, maintain and control things remotely, store data centrally, constant communication with multiple devices.

That's one way to look at the Internet of Things.  Instead of vending machines, perhaps sensors in your car, washing machine, lawn mower, air conditioner, sweater, tennis racket.  Anything really.  The internet of anything and everything.

Data would flow in, from the sensor, through a well defined protocol, continuously, mini messages with limited data, streaming in and collected, looking for patterns, anomalies, and activating alerts or triggers when criteria met.  Sensor down, send message to perform diagnostic or reboot.  Detecting malfunction in washing machine, initiate call to home owner, warranty people and schedule repair man.  Also, check data, is this issue common?  Was there a recall on the unit?  How many other affected?  Initiate email to vendor, inform widespread issue with product x, recall unit?
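The "look for anomalies and trigger an alert" step can be sketched in a few lines. This is only one simple approach, assumed here for illustration: compare each new reading with a rolling window of recent normal readings and flag anything more than a few standard deviations away. The class name, window size and threshold are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class AnomalyDetector:
    """Flag readings that sit far from the recent rolling average."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if `value` is anomalous relative to recent history."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            avg = mean(self.history)
            spread = pstdev(self.history)
            if spread > 0 and abs(value - avg) > self.threshold * spread:
                anomalous = True
        if not anomalous:
            # Only normal readings update the baseline.
            self.history.append(value)
        return anomalous
```

A True result is where the alert would fire: send the diagnostic message, call the home owner, or schedule the repair.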

All this can be done unassisted, without the intervention of humans.  Automation is key.  A continuous loop of information, tightly secured, over the internet, through predefined protocol.

What are some sore points?  Internet connection.  Security.  Sensors that break, malfunction or become outdated.  Operating systems not having latest patches.  Bottlenecks on the central server that collects millions of data packets per second.  Bugs in the software along any point in the chain.  People snooping the data, exposing sensitive information, exposing entry access points to "things" or networks, reporting wrong information, outages.

What about vendors who don't secure their sensors correctly?  Or not having enough or the correct software developers to support the rise in IoT?  How about the complexity of layers, as in hardware devices, software code, vendor products, software package options?  Who's going to create these decoupled, decentralized systems and who's going to maintain them after they've been in production for 5 years and everything is outdated?  How much will it cost to upgrade everything to latest standards and protocols?

Lots to consider when thinking about the Internet of Things.  Sensors.  Devices.  Power supply.  Internet connection.  Communication between server hub and things.  Volume of data.  Storage and backup of data.  Software creation and maintenance.  Warranties, SLAs, patches, upgrades.  Maintenance, repair and downtime costs.

Who owns the system?  Because it crosses so many layers.  DBA?  Software developer/architect?  Network people?  Cloud people?  Sensor manufacturer?  Security people?  Does the business own it?

Only thing, the IoT plane just pulled from the gate.  Now approaching the runway.  Got clearance for takeoff.  Let's hope for a smooth ride.

Larry Page: Where's Google going next?

Picking off some main topics / themes / statements:

"We need revolutionary change, not incremental." 

Balloons in the stratosphere.  

Bikes for everyone, instead of blacktop roads and parking lots.

Shared knowledge.

Apply machine learning to any subject, to understand better.

Curiosity, looking at things people may not be thinking about or working on, take a risk.

Seems like a true visionary!


The Need to Bake Security into Core #IoT Systems

The interesting thing about the creation of Hadoop, is the software application was never designed for widespread consumption.  Therefore, "Security" was not baked into the initial product.  And over the years, there's been effort to integrate security into the ecosystem.

With the Internet of Things (IoT), one of the main concerns right out of the gate is security.  To summarize IoT technology, sensors reside in a variety of products, information is passed to a gateway, through a specific protocol, to send data in the form of messages to a centralized repository, for storage, analysis and some type of action upstream.  
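One common shape for this kind of messaging is publish/subscribe: devices publish messages to named topics, and a broker hands each message to whoever subscribed to that topic. The sketch below is an in-memory toy to show the idea, not a client for any real protocol; the class and topic names are made up.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory publish/subscribe hub; a teaching toy, not a real broker."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register `callback(topic, payload)` for an exact topic name."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, payload):
        """Deliver `payload` to every subscriber of `topic`; return the count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, payload)
        return len(callbacks)
```

A real broker adds the hard parts this toy leaves out, such as network transport, authentication and delivery guarantees.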

There is two way communication from the central location to the decentralized devices and sensors, in which data is captured regarding whether the sensor is active, its operating system, the power supply, version number, etc.  Software updates can be pushed to the device; however, with the potentially huge number of devices out in the wild, constant communication could cause excessive network traffic / noise.  

Devices get registered and, once the connection is established, there's a way to maintain session state.  There's also a way to secure the communication using SSL, which can slow down the communication; however, the system is designed to have small constant packets of information sent across the wire for real time monitoring.  Usernames and Passwords can be sent as part of the message.
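As a complement to SSL and credentials in the message, here's a sketch of a different technique: signing each message with a secret shared between the device and the hub, so the hub can tell whether a message was altered or came from someone without the secret. This is an assumption for illustration, not how any particular IoT platform does it; the function names and message layout are hypothetical.

```python
import hashlib
import hmac
import json


def sign_message(payload, secret):
    """Serialize `payload` to JSON and attach an HMAC-SHA256 signature."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_message(message, secret):
    """Return the decoded payload if the signature matches, else None."""
    expected = hmac.new(
        secret, message["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, message["signature"]):
        return None
    return json.loads(message["body"])
```

Signing only proves integrity and origin. The body is still readable in transit, so it would sit on top of an encrypted connection rather than replace it.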

However, any software can be hacked.  Any device connected to the internet can be compromised and penetrated.  These are legitimate concerns and they echo the similar security concerns from years ago, when the Cloud was beginning to be adopted.  The Cloud has since become standard business practice, minus sensitive data like HIPAA or PCI data, but many companies have chosen the Hybrid approach, by storing sensitive data locally, and pushing aggregated, non customer centric data to the Cloud.

IoT revolves around devices, sensors, communication protocols, data packets, data storage, message queues, big data, analytics, real time streaming data, alerts, web services, hardware, operating systems, etc.  And the number of decentralized sensors can be enormous, and the incoming messages can be millions per second.  So there's some inherent complexity of layers and types of technology involved.  

For IoT to get traction and become mainstream, the concept of "Security" needs to be addressed up front and definite standards put in place.  Because malicious hackers don't take vacations and are probably finding ways to infiltrate existing IoT systems as we speak.  

Imagine all the data interception hacks that could occur, all the stolen information or electronic assets lifted from unsuspecting, un-monitored systems.  Let's not take the same path that Hadoop took and casually leave security out of the core product.  Because IoT has the potential to take the world of data and applications and insights to a whole new level.

The Internet of Things is Just Getting Started: Arlen Nipper at TEDxNewBedford

The Internet of Things (IoT) requires communication between the enterprise server and all the devices and sensors embedded within downstream "things".  One way this is accomplished is through an open protocol named MQTT.  Like HTTP, it runs over TCP/IP, but it is built around publishing and subscribing to messages through a broker.  The protocol is lightweight and can push and pull messages.  One broker that handles the orchestration is the HiveMQ Enterprise MQTT Broker.  From the link provided:

"The implementation of MQTT is the de-facto reference implementation and available in Java, C, C++, JavaScript, Lua, Python and soon also C#."

In the world of Microsoft, we have the Azure Event Hub.  This implementation, available on the Azure Platform and based off the original Service Broker, then Service Hub, is designed to handle massive loads in the cloud.  The best part about this technology is the "guaranteed delivery".  Event Hub takes it a step further using "partitions" to allow scalability for millions of transactions.  As the messages arrive, you can store them in Azure SQL Server or Azure HDInsight or Azure VM which hosts Hadoop, or you can re-route the messages to another Event Hub.  Quite powerful.  Another useful link.
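The idea behind partitions can be sketched with a hash: every message carries a key (a device ID, say), the key is hashed, and the hash picks one of N partitions, so messages from the same device always land in the same place while the load spreads out. The hashing that Event Hub uses internally is not shown here; SHA-256 and the function name below are assumptions for illustration only.

```python
import hashlib


def partition_for(key, partition_count):
    """Map a key to a partition index in [0, partition_count)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count
```

Because the mapping depends only on the key and the partition count, any number of senders can compute it independently and still agree on where a device's messages go.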

IoT is already a reality.  However, there are some concerns about "security".  As every application has potential to be hacked and any device connected to the internet can be penetrated.  Some reason for concern, although, chances are, some smart people are working on this exact issue as we speak.

And there you have it.  Reporting?  Great.  BI and Data Warehousing?  Great too.  Internet of Things?  The future~!

The Quantum Conspiracy: What Popularizers of QM Don't Want You to Know

What the lecturer is saying, in my opinion, is that Einstein's theory of relativity directly blocks the existence of reverse time travel.  Yet Quantum Entanglement, which Einstein termed "spooky" behavior, disregards the theory.

Perhaps the theory is not correct or partially correct as it does not handle the spooky behavior.

To summarize in a simplistic loosely defined sentence, "Perhaps reverse or forward time travel is possible."

You gotta love science...


Artificial Intelligence Isn't Another Recycled Business Process Software App

What do you do for a living?

Well, I get paid to re-write an existing software application.  It's designed in X technology. My job is to upgrade it to the latest "hot" language/framework.

My, that's very impressive.  Are you inventing a new product?

No, the actual product has been around for a half a century.  In fact, it has all the basic functions to run a business.  From accounting, to lead generation, to tracking the sales and finances.  Before it was an application, people did the same function, yet by paper and pen. In fact, you can trace the same business process back hundreds of years.  Maybe thousands.

So in fact, you aren't inventing anything new.  Just putting a new wrapper on the same product, bundle it and resell for profit.


And so, our story describes many software developers. 

What have you done that's never been done before?

Well, we take these numbers here, and we tweak them here, then roll them up here, and display them here.  Then some people review the numbers and make some decisions, like who to fire, replace, automate, or give a bonus to.

That process has been done for a very long time.  Maybe one of man's first occupations.  Abacus maybe?

So again, what have you invented that's never been done before?

Well, we do all our work in the Cloud.  The Cloud is a bunch of hosted servers, not on-premise.  It allows workers from all over the world to contribute and collaborate.  In addition, it provides a platform for our site and data.

Haven't you just moved the internal infrastructure off site, pay as you go service or monthly contract?  We had the same functionality back in the mainframe days.

I don't think anyone is creating something entirely new.

Oh, there's one other thing.  It's called Artificial Intelligence.  It's basically some intelligent software that can learn over time.  It simulates the human brain in that it's wired with neurons and synapses and can select the best probable outcome based on millions of data points, in a fraction of a second.  These Artificial Algorithms will soon be able to process multiple domains.  And the programs will be integrated into Robots and other devices, that have the ability to move from place to place, they have vision, they can process sounds, touch, tastes and smells.  In fact, they can perform the function of many skilled workers.  Oh, they don't require vacation, or health benefits.  They'll work 24 hours per day.  And you can replace them without much thought or severance pay. Is this technology the type of invention you were looking for?

Precisely.  This my friend, is something that's never been done before.  This is not the recycling of business processes with the latest framework.  This is truly a new concept.  This will revolutionize the workforce as well as life on planet Earth.

Finally, something new.  Artificial Intelligence.


The Shift in Reporting from Centralized to Decentralized

IT used to own the reports.  The time it took to create the reports was long.  Did not meet the needs of the business unit.  So Self Service reporting was created.  And Vendors lined up to meet the demand by offering easy to use tools for the business folks.

What has become of the traditional report writer?  They still exist.  Yet the business has picked up a good percentage of the report work load.  Sort of hybrid of "rogue IT" as they still reside within the business units, yet they leverage the infrastructure and data from IT.  Best of both worlds.

Yet not all power user report writers know what they are doing.  Likewise, nobody's there to enforce standards.  Or data accuracy.  Nobody checking to verify the data is fresh or current.  And no longer can they stack up unlimited report requests, billed to the IT budget.  Now the cost is associated with their unit, and perhaps not every department can afford their own full time report person.  Or software licenses.

It seems like the classical swap from "Centralized" reporting to "Decentralized" reporting.  The heavy lifting is now performed from the actual business units that consume the data.  So one advantage is, most likely, they know the business rules.  No question there.  And they can work at their own pace, as in prioritizing reports based on necessity and availability.  

And they no longer have a department to kick around, stating in every meeting that they don't trust the data, reports take too long, none of the reports match.  Nope, IT is no longer the kicking bag for every department in the org.

Except.  Now the Business has to overcome all the difficulties that IT had to deal with.  Missing, bad or duplicate data.  Reports not running in a timely manner or timing out.  Data isn't fresh.  How do we integrate the leads data with sales data with financial data with call center data, or integrate disparate data sources?  Nobody to enforce "Data Governance".  Or not having enough resources to handle the load of unlimited report requests.  Data and report security.  Basically, everything IT had to deal with is now in the hands of the business.  So with great power, comes great responsibility.

The business units wanted control of the data and reports and information.  Well, now they've got it.


Advanced Technologies on Linux Operating Systems Only

I grew up on PC-Dos.  Then Windows.  From time to time, had jobs that required Unix or Linux.  The majority of my career has been Windows based.

Enter Hadoop, stage left.  Based in Linux world.

Enter Artificial Intelligence platforms, stage right.  Based in Linux world.

For me personally, having to maneuver in Linux, everything takes a lot longer.  The commands are not at the tip of the tongue.

Perhaps those of us who work on Windows should just learn Linux once and for all.  In the meantime, this is a potential blocker and additional hurdle for entry into the advanced concepts required to learn new technology.

Yes, Azure has HDInsight, cloud based.  And Hortonworks and Microsoft teamed up to provide a Windows based version of Hadoop, much appreciated.

But then again, if we are going to learn advanced concepts such as artificial intelligence and neural networks, I suppose learning the operating system is just part of the process.


Intro to Quantum Concepts

Quantum teleportation.  The movement of information without the movement of particles.

In classical bits, you have a zero or one, on or off.

In Quantum world, there is Quantum Bits or Qubits, which encode information, called Quantum Information.
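One way to make the difference concrete: a classical bit is either 0 or 1, while a qubit is described by two numbers (amplitudes), one for 0 and one for 1, and the squared size of each amplitude gives the probability of reading that value when the qubit is measured. Here's a small sketch of that arithmetic; the function name is made up, and this is plain number crunching, not a quantum simulator.

```python
import math


def measurement_probabilities(alpha, beta):
    """Return (P(0), P(1)) for the state alpha|0> + beta|1>.

    The amplitudes may be complex; they are normalized first so the two
    probabilities always sum to 1.
    """
    norm = math.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
    if norm == 0:
        raise ValueError("at least one amplitude must be non-zero")
    p0 = abs(alpha / norm) ** 2
    p1 = abs(beta / norm) ** 2
    return p0, p1
```

Equal amplitudes, as in measurement_probabilities(1, 1), give a 50/50 chance of reading 0 or 1, which is the "both at once" picture usually used to introduce qubits.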

Here's a link with better description: https://en.wikipedia.org/wiki/Quantum_teleportation

I believe the movies Stargate and Contact have a similar concept, of charging up a teleport and connecting to a similar one on the other side, and sending things through, teleporting.


How about Quantum Cryptography.  A very useful tool for disguising your information for secure communication: https://en.wikipedia.org/wiki/Quantum_cryptography

And of course, there's Quantum Computing: https://en.wikipedia.org/wiki/Quantum_computing

As well as Quantum Entanglement, the joining of two separate particles such that they are indistinguishable as separate units, allowing the transfer of information without movement: https://en.wikipedia.org/wiki/Quantum_entanglement

The Quantum world is perhaps a difficult concept to fathom.  Yet the potential for advancement is tremendous.  Might be beneficial to read a bit about the new phenomenon.  

Or have a beer.  Whichever you prefer.


Built in Opportunity Costs when Deciding Technology Path

There's an opportunity cost when doing anything.  If I do this, then I can't do that, simultaneously.  In order for me to work in the data space, I do not have the bandwidth to become expert in other technologies. There's only so much time in the day.

So you have to specialize to some degree.  If you look at the data technologies out there, there's a lot to know.

We have traditional databases, many longtime vendors and they have standardized on the SQL language.  There hasn't been too much disruption in this space, minus In Memory databases, newer ways to compress data, additions like XML and JSON and some other features.  But for the most part, this space hasn't changed too much.

Within databases, we have seen new technologies emerge.  NoSQL, Hadoop and Graph database sprung up to handle different business cases for storing and retrieving data.

This has opened up the ETL (Extract, Transform and Load) space.  As we have to account for the new types of data and storage capacities.

And further downstream, we have to report on this data, so that has evolved as well.  New tools to visualize data embedded within application, via the web and mobile devices.

So too, has the volume of data grown.  The amount of data stored now far exceeds what was stored twenty years ago.  With the addition of sensors and Internet of Things we have to account for huge quantities of data stored indefinitely.  As well as reading the data as it flows into the ecosystem.

And there's the data mining side.  Machine learning is all the rage, as we have new technologies in the hands of everyday programmers to predict, look for outliers and make recommendations.

Throw in Artificial Intelligence programs that can learn over time and the world of data is just exploding.

So as an opportunity cost, if I decide to focus on a specific piece of the data space, I will not be able to learn another piece.  So where do I focus my time and effort?

Well, it seems to me the role of the traditional report writer is losing its luster.  Why?  Because report writers were slow and users didn't trust the data.  Self Service tools have sprung forth, putting the ability to work with data in the hands of everyone.

If they offer a product, that can pull data from almost any source, store that data using compression data algorithms, allow users to mash and integrate data without having to know specific SQL syntax, then report on this data in charts and visualizations, and refresh the data automatically, why do they need full time report writers?  Why would I try to compete with a free product, available on premise or in the cloud?  What service could I provide that isn't already available for free?  I could train the users, except the product is so dang easy to use and documented, I don't see much opportunity there.

We still have complex data warehousing which hasn't been automated yet, so there's still some steam left in that space.  We have newer technologies which haven't been totally explored like NoSQL, although the use cases and ROI haven't lived up to the hype.  There's the still-new machine learning which hasn't been fully tapped yet, due to a lack of qualified developers and not clearly defining the role of the data scientist, trying to lump too much into a single position.

Where are the sore points in the data space?  Data integration is still a big bowl of spaghetti.  And the advanced topics that haven't been automated yet, which require a higher level of thinking.  As well as industry specific niches where you know everything about a specific industry as well as the technology, an example would be medical data.  And I believe technology training will be in huge demand for the next x years.

  1. Data integration
  2. Advanced technologies
  3. Domain Knowledge + Technology
  4. Training

That's what I see as the opportunities developing today in the data space.  There's been so much change, upheaval and growth in every direction it's impossible to keep up.  So you have to determine what you want to specialize in and march forward.  Because there's a built in opportunity cost in any direction you go.