037a_AnimalNamesStructStreamingFiles(Scala)

ScaDaMaLe Course site and book

Write files with animal names continuously for structured streaming

This notebook writes a file every 2 seconds into the distributed file system. Each file contains a single row: a time stamp followed by two animals chosen at random from the six animals listed in an animals.txt file on the driver.

After running the commands in this notebook you should have a set of files named by minute and second, which makes it easy to set up structured streaming jobs in another notebook. The purpose is mainly to create a stream of files for learning structured streaming. In a real deployment, such streams would come from more robust ingestion frameworks such as Kafka queues.

It is a good idea to understand how to run executables from the driver in order to set up a stream of files for ingestion by structured streaming tasks downstream.

The following seven steps (Steps 0-6) can be adapted to more complex situations, such as running a more elaborate simulator from an executable file.

Step 0

Let's get our bearings and prepare to set up structured streaming from files.

First, find the working directory using %sh.

%sh
pwd
/databricks/driver

We are in the /databricks/driver directory.

To run the script and be able to kill it later, you need a few installs. The psmisc package provides process tools such as killall.

%sh
apt-get install -y psmisc
Reading package lists...
Building dependency tree...
Reading state information...
psmisc is already the newest version (23.1-1ubuntu0.1).
psmisc set to manually installed.
The following packages were automatically installed and are no longer required:
  libcap2-bin libpam-cap zulu-repo
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
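
To confirm that the process tools are actually available after the install, a check along the following lines can be run. This is only a sketch: the helper name check_tools is made up for illustration, and it assumes that killall, fuser and pstree are the psmisc tools of interest.

```shell
# Print one status line per tool name given as an argument.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found at $(command -v "$tool")"
    else
      echo "$tool: missing"
    fi
  done
}

# killall, fuser and pstree all ship in the psmisc package.
check_tools killall fuser pstree
```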

Step 1

Let's first make the animals.txt file in the driver.

%sh
rm -f animals.txt &&
echo "cat" >> animals.txt &&
echo "dog" >> animals.txt &&
echo "owl" >> animals.txt &&
echo "pig" >> animals.txt &&
echo "bat" >> animals.txt &&
echo "rat" >> animals.txt &&
cat animals.txt
cat
dog
owl
pig
bat
rat
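
The random choice of two animals relies on shuf from GNU coreutils. A minimal sketch of that sampling step, run in a scratch directory (the variable names are hypothetical) so that the real animals.txt is left untouched:

```shell
# Work in a scratch directory so the real animals.txt is not modified.
workdir=$(mktemp -d)
printf '%s\n' cat dog owl pig bat rat > "$workdir/animals.txt"

# Count the lines; six animals are expected.
n_animals=$(wc -l < "$workdir/animals.txt" | tr -d ' ')
echo "animals: $n_animals"

# Draw two lines at random; without -r, shuf samples without replacement,
# so the two animals are always different.
sample=$(shuf -n2 "$workdir/animals.txt")
echo "$sample"
n_sampled=$(echo "$sample" | wc -l | tr -d ' ')
```

Running the shuf line repeatedly should give a different pair most of the time.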

Step 2

Now let's make a bash shell script that, once started, writes a new .log file every two seconds into the local directory logsEvery2Secs, with each file name built from the current minute and second. The cell that creates the file every2SecsRndWordsInFiles.sh uses the following pieces:

  • #!/bin/bash is how we tell the system that this is a bash script which needs the /bin/bash binary. I remember the magic two characters #! as "SHA-BANG" (usually written "shebang"): "hash" for # and "bang" for !
  • rm -f every2SecsRndWordsInFiles.sh && forcefully removes the file every2SecsRndWordsInFiles.sh; the trailing && runs the command on the next line only if the command preceding it succeeded
  • echo "blah" >> every2SecsRndWordsInFiles.sh writes the content of the string, i.e., blah, into the file every2SecsRndWordsInFiles.sh, in append mode because of >>
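
The behaviour of >> and && described in the list above can be checked with a small experiment. This is a sketch in a scratch directory; the file name demo.txt is made up for illustration.

```shell
scratch=$(mktemp -d)
out="$scratch/demo.txt"

# '>' truncates the file before writing, '>>' appends to it.
echo "first"  >  "$out"
echo "second" >> "$out"
lines_after_append=$(wc -l < "$out" | tr -d ' ')

# '&&' short-circuits: the right-hand command runs only if the
# left-hand command succeeded (exit status 0).
true  && echo "ran after true"  >> "$out"
false && echo "ran after false" >> "$out"
lines_after_chain=$(wc -l < "$out" | tr -d ' ')

cat "$out"
```

Only three lines end up in demo.txt, because the echo after false never runs.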

The rest of the commands create a fresh directory logsEvery2Secs and then, in a loop, write the time stamp and two randomly chosen animals from the animals.txt file into a .log file in that directory. Each file is named by the minute and second of the current time, so there are only finitely many file names (at most 60 × 60 = 3600 unique .log file names), and an older file is overwritten when its name recurs.
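
Before writing the script, the row-building step can be tried on its own. The sketch below is an equivalent variant, not the script's exact pipeline: it joins the lines with paste -s instead of the sed loop, and the function name make_row is made up for illustration. It assumes GNU date and shuf.

```shell
scratch=$(mktemp -d)
printf '%s\n' cat dog owl pig bat rat > "$scratch/animals.txt"

# Build one row: time stamp, a semicolon, then two random animals.
# paste -s joins the three input lines into one, separated by spaces.
make_row() {
  { echo "$(date --rfc-3339=seconds);"; shuf -n2 "$1"; } | paste -s -d ' ' -
}

row=$(make_row "$scratch/animals.txt")
echo "$row"

# File name built from the current minute and second, e.g. 07_42.log.
name=$(date '+%M_%S.log')
echo "$name"

# 60 minute values times 60 second values bounds the distinct names.
max_names=$((60 * 60))
echo "$max_names"
```

Each row is a single line with four space-separated fields: date, time with UTC offset followed by a semicolon, and the two animals.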

%sh
rm -f every2SecsRndWordsInFiles.sh &&
echo "#!/bin/bash" >> every2SecsRndWordsInFiles.sh &&
echo "rm -rf logsEvery2Secs" >> every2SecsRndWordsInFiles.sh &&
echo "mkdir -p logsEvery2Secs" >> every2SecsRndWordsInFiles.sh &&
echo "while true; do echo \$( date --rfc-3339=second )\; | cat - <(shuf -n2 animals.txt) | sed '\$!{:a;N;s/\n/ /;ta}' > logsEvery2Secs/\$( date '+%M_%S.log' ); sleep 2; done" >> every2SecsRndWordsInFiles.sh &&
cat every2SecsRndWordsInFiles.sh
#!/bin/bash
rm -rf logsEvery2Secs
mkdir -p logsEvery2Secs
while true; do echo $( date --rfc-3339=second )\; | cat - <(shuf -n2 animals.txt) | sed '$!{:a;N;s/\n/ /;ta}' > logsEvery2Secs/$( date '+%M_%S.log' ); sleep 2; done
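
Because the script is written with echo inside double quotes, every $ that should survive into the script has to be escaped as \$; anything left unescaped is expanded at the moment the file is written. A minimal sketch of the difference, with a quoted here-document shown as one possible alternative (the file names are made up for illustration):

```shell
scratch=$(mktemp -d)

# Unescaped: $(...) runs now, and only its output lands in the file.
echo "echo $(echo now)" > "$scratch/expanded.sh"

# Escaped: \$(...) is written literally and runs when the script runs.
echo "echo \$(echo later)" > "$scratch/deferred.sh"

# Alternative: a here-document with a quoted delimiter writes its body
# verbatim, so no escaping is needed at all.
cat > "$scratch/heredoc.sh" <<'EOF'
echo $(echo later)
EOF

expanded=$(cat "$scratch/expanded.sh")
deferred=$(cat "$scratch/deferred.sh")
heredoc=$(cat "$scratch/heredoc.sh")
printf '%s\n' "$expanded" "$deferred" "$heredoc"
```

The first file contains echo now, while the other two contain the command substitution unchanged.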

Step 3

Time to run the script!

The next two cells in %sh do the following:

  • make sure the Bash script every2SecsRndWordsInFiles.sh is executable
  • run the script in the background without hangup

%sh
chmod 744 every2SecsRndWordsInFiles.sh
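
The mode 744 gives the owner read, write and execute permission (7) and gives group and others read permission only (4 and 4). A small sketch that checks this on a throwaway script, where hello.sh is a made-up name used only for illustration:

```shell
scratch=$(mktemp -d)
script="$scratch/hello.sh"
printf '%s\n' '#!/bin/sh' 'echo hello' > "$script"

chmod 744 "$script"

# The first ten characters of ls -l show the file type and permission bits.
perms=$(ls -l "$script" | cut -c1-10)
echo "$perms"   # -rwxr--r--
```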