When working with Hadoop, we move files through the pipeline. We load text or CSV files into HDFS, then process them with a Pig script: perhaps add some business logic, filter out some values, transform some data, and output the results to another file or folder.
Then we pick up that output file and load it into, say, a Hive table. And from there, maybe into an ORC table.
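That pipeline might look something like the sketch below in Pig Latin. The path, schema, and filter condition are all made up for illustration; only the general LOAD / FILTER / STORE shape is the point:

```pig
-- Load a CSV from HDFS (hypothetical path and schema)
agency = LOAD '/user/pig/Agency.csv' USING PigStorage(',')
         AS (id:int, name:chararray, amount:double);

-- Apply some business logic: keep only rows with a positive amount
filtered = FILTER agency BY amount > 0.0;

-- Store the results to another HDFS folder
STORE filtered INTO '/user/pig/Agency.out' USING PigStorage(',');
```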
And we leave a trail of files littered throughout our HDFS cluster.
Next time you go to run the same scripts with new data, you may encounter an error because the file or folder already exists. So we need a way to clean out the existing folders, to make room for the new ones, and keep our data flow from erroring out.
I searched for a bit, and found individual commands to remove a file, remove files recursively, or remove a folder. But if the item didn't exist beforehand, the command itself would throw an error.
So how do you know if the file/directory already exists? Well, the way I did it was to create a dummy file in the directory I wanted to remove recursively. That way, I know for sure both the file and the folder exist. First line creates a dummy txt file, second line removes the folder recursively:
fs -touchz /user/pig/Agency.out/tmp.txt;
fs -rm -r /user/pig/Agency.out;
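One way to skip the dummy-file step, assuming a recent enough Hadoop version, is the -f flag on fs -rm, which suppresses the diagnostic when the target doesn't exist. Worth testing on your own cluster:

```pig
-- Remove the output folder recursively; -f skips the error if it doesn't exist
fs -rm -r -f /user/pig/Agency.out;
```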
I run this from a Hadoop Pig script in Visual Studio 2015, and it seems to work.
After posting this blog, I got a tweet from Josh Fennessy with a good suggestion. I tested it and it works great!
Thanks for the feedback!