When working with Hadoop, we move files through a pipeline. We send text or CSV files into HDFS, then load the file with Pig, perhaps add some business logic, filter out some values, transform some data, and output the results to another file or folder.
Then we pick up that file and load it into, perhaps, a Hive table. And then, maybe, into an ORC table.
And we leave a trail of files littered throughout our HDFS cluster.
The next time you run the same scripts with new data, you may hit an error because the output file or folder already exists. So we need a way to clean out the existing folders to make room for the new ones and keep the data flow from failing.
I searched for a bit and found individual pieces to remove a file, or files recursively, or even the folder. But if the item didn't already exist, the remove command would throw an error too.
So how do you know if the file/directory already exists? Well, the way I did it was to create a dummy file in the directory I wanted to remove recursively. That way, I know for sure that the file exists as well as the folder. The first line creates a dummy txt file; the second line removes the folder recursively:
fs -touchz /user/pig/Agency.out/tmp.txt;
fs -rm -r /user/pig/Agency.out;
I run this from the Hadoop Pig Script in Visual Studio 2015. Seems to work.
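Another option that may be worth trying, as a minimal sketch rather than a tested recipe for this setup: Pig has a built-in rmf command that removes a path and does not fail when the path is missing, which would avoid the dummy file. The path below is the same one used in the example above; the commented-out alternative assumes a Hadoop version whose fs shell supports the -f flag on rm.

```pig
-- Sketch: clear the previous run's output before the script stores new results.
-- rmf removes the path recursively and does not fail if the path does not exist.
rmf /user/pig/Agency.out;

-- Alternative (assumes the cluster's fs shell supports -f on rm):
-- fs -rm -r -f /user/pig/Agency.out;
```

Either line would go at the top of the Pig script, before the STORE statement that writes to the same folder.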
After posting this blog, I got a tweet from Josh Fennessy with a good suggestion:
I tested it and it works great!
Thanks for the feedback!