10/10/2016

Remove Folder Content in HDFS using Pig Script

When working with Hadoop, we move files through a pipeline.  We send text or CSV files into HDFS, then mount the file using the Pig language, perhaps apply some business logic, filter out some values, transform some data, and output the results to another file or folder.
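A minimal sketch of that kind of flow might look like this (the file name, schema, and paths are hypothetical, just for illustration):

agency = LOAD '/user/pig/Agency.csv' USING PigStorage(',')
    AS (id:int, name:chararray, amount:double);
-- business logic: keep only the rows with a positive amount
filtered = FILTER agency BY amount > 0;
-- output the results to another folder in HDFS
STORE filtered INTO '/user/pig/Agency.out' USING PigStorage(',');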

Then we pick up that file and mount it in, perhaps, a Hive table.  And from there, maybe load it into an ORC table.

And we leave a trail of files littered throughout our HDFS cluster.

The next time you run the same scripts with new data, you may hit an error because the output file or folder already exists.  So we need a way to clean out the existing folders, make room for the new ones, and keep our data flow from erroring out.
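For example, rerunning the STORE from the sketch above fails the second time through, since MapReduce refuses to write into an output folder that is already there:

-- second run: /user/pig/Agency.out already exists, so the job aborts
-- with an "output directory already exists" validation error
STORE filtered INTO '/user/pig/Agency.out' USING PigStorage(',');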

I searched for a bit and found individual pieces: remove a file, remove files recursively, even remove the folder.  But if the item didn't already exist, those commands would throw an error too.
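Those individual pieces look roughly like this (paths hypothetical; part-r-00000 is just a typical MapReduce output file name).  Each one throws an error if its target is missing:

-- remove a single file; errors if the file does not exist
fs -rm /user/pig/Agency.out/part-r-00000;
-- remove a folder and its contents recursively; errors if the folder does not exist
fs -rm -r /user/pig/Agency.out;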

So how do you know if the file/directory already exists?  Well, the way I did it was to create a dummy file in the directory I wanted to remove recursively.  Since touchz creates the folder along with the zero-length file when they're missing, I know for sure the file exists as well as the folder before the delete runs.  First line, create a dummy txt file; second line, remove the folder recursively:

-- create a dummy file; touchz also creates the folder if it is missing
fs -touchz /user/pig/Agency.out/tmp.txt;
-- the folder is now guaranteed to exist, so the recursive delete cannot fail
fs -rm -r /user/pig/Agency.out;
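In context, those two lines go at the top of the Pig script, ahead of the STORE that recreates the folder; something like this, reusing the hypothetical paths from the earlier sketch:

-- clean out any previous output first
fs -touchz /user/pig/Agency.out/tmp.txt;
fs -rm -r /user/pig/Agency.out;

agency = LOAD '/user/pig/Agency.csv' USING PigStorage(',')
    AS (id:int, name:chararray, amount:double);
filtered = FILTER agency BY amount > 0;
STORE filtered INTO '/user/pig/Agency.out' USING PigStorage(',');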


I run this from a Hadoop Pig script in Visual Studio 2015, and it seems to work.

After posting this blog, I got a tweet from Josh Fennessy with a good suggestion.
I tested it and it works great!

Thanks for the feedback!
