ScaDaMaLe Course site and book

Download Files Periodically

This notebook sets up and runs a script that periodically downloads files, in this case the "Our World in Data" COVID-19 dataset csv files, which are updated daily.

Content is based on "037a_AnimalNamesStructStreamingFiles" by Raazesh Sainudiin.

To be able to kill the .sh process later with killall, we first need to install psmisc:

apt-get install -y psmisc 

create a new directory for our files if needed

dbutils.fs.mkdirs("file:///databricks/driver/projects/group12")
res0: Boolean = true

create a shell script to periodically download the dataset (the sleep of 216000 seconds below corresponds to a download roughly every 2.5 days; use 86400 seconds for once per day). The shell cell below:

1. removes the previous shell script
2. writes to the script: the bash shebang line
3. writes to the script: remove the folder where previously downloaded files are located
4. writes to the script: make a new directory to put the downloaded files in
5. writes to the script: a while loop that
   1. removes the old downloaded csv dataset
   2. downloads the new csv dataset
   3. copies the csv file to the newly created directory, using the timestamp as its name
   4. sleeps until the next download
6. prints the contents of the shell script

A more conventionally structured sketch of the same loop follows the cell output below.

rm -f projects/group12/group12downloadFiles.sh &&
echo "#!/bin/bash" >> projects/group12/group12downloadFiles.sh &&
echo "rm -rf projects/group12/logsEveryXSecs" >> projects/group12/group12downloadFiles.sh &&
echo "mkdir -p projects/group12/logsEveryXSecs" >> projects/group12/group12downloadFiles.sh &&
echo "while true; rm owid-covid-data.csv; wget https://covid.ourworldindata.org/data/owid-covid-data.csv; do echo \$( date --rfc-3339=second )\; | cp owid-covid-data.csv projects/group12/logsEveryXSecs/\$( date '+%y_%m_%d_%H_%M_%S.csv' ); sleep 216000; done" >> projects/group12/group12downloadFiles.sh &&
cat projects/group12/group12downloadFiles.sh
#!/bin/bash
rm -rf projects/group12/logsEveryXSecs
mkdir -p projects/group12/logsEveryXSecs
while true; rm owid-covid-data.csv; wget https://covid.ourworldindata.org/data/owid-covid-data.csv; do echo $( date --rfc-3339=second )\; | cp owid-covid-data.csv projects/group12/logsEveryXSecs/$( date '+%y_%m_%d_%H_%M_%S.csv' ); sleep 216000; done
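For readability, here is a sketch of an equivalent script with the loop written in the usual while/do form. It uses the same URL and paths as the generated script above; the only deliberate difference is the sleep of 86400 seconds (once per day) instead of 216000.

#!/bin/bash
# sketch of an equivalent download loop (same paths and URL as the generated script above)
rm -rf projects/group12/logsEveryXSecs
mkdir -p projects/group12/logsEveryXSecs
while true; do
  # remove the previous download, if any
  rm -f owid-covid-data.csv
  # fetch the latest dataset
  wget -q https://covid.ourworldindata.org/data/owid-covid-data.csv
  # archive it under a timestamped name
  cp owid-covid-data.csv "projects/group12/logsEveryXSecs/$(date '+%y_%m_%d_%H_%M_%S').csv"
  # wait one day before the next download (the generated script above sleeps 216000 seconds instead)
  sleep 86400
done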

make the shell script executable

chmod 744 projects/group12/group12downloadFiles.sh

execute the shell script in the background (nohup keeps it running after the cell finishes, & returns control to the notebook)

nohup projects/group12/group12downloadFiles.sh &
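To confirm the loop really is running in the background, one can list matching processes. This is just a quick check; it assumes pgrep (with the -a and -f flags) is available on the driver.

pgrep -af group12downloadFiles.sh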

look at the files

pwd
ls -al projects/group12/logsEveryXSecs
/databricks/driver
total 14244
drwxr-xr-x 2 root root     4096 Jan  7 09:05 .
drwxr-xr-x 3 root root     4096 Jan  7 09:05 ..
-rw-r--r-- 1 root root 14577033 Jan  7 09:05 21_01_07_09_05_33.csv

look at the file content (XXXX is a placeholder; substitute one of the timestamped file names listed above)

cat projects/group12/logsEveryXSecs/XXXX.csv
cat: projects/group12/logsEveryXSecs/XXXX.csv: No such file or directory
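To avoid typing the file name by hand, a small sketch that prints the first lines of the most recently downloaded csv (it assumes at least one file already exists in the directory):

latest=$(ls -t projects/group12/logsEveryXSecs | head -n 1)
head -n 3 "projects/group12/logsEveryXSecs/$latest"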

kill the .sh process

killall group12downloadFiles.sh
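As a sanity check that the loop has stopped, the same pgrep query as before should now return nothing (pgrep is again assumed to be available; the || branch just prints a confirmation):

pgrep -f group12downloadFiles.sh || echo "download loop stopped"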

copy the downloaded files to another location (DBFS) to make sure we don't lose the datasets when the driver's local storage is cleaned up

// dbutils.fs.mkdirs("/datasets/group12/")
dbutils.fs.cp("file:///databricks/driver/projects/group12/logsEveryXSecs/","/datasets/group12/",true)
res5: Boolean = true
display(dbutils.fs.ls("/datasets/group12/"))
path name size
dbfs:/datasets/group12/20_12_04_08_31_44.csv 20_12_04_08_31_44.csv 1.4181338e7
dbfs:/datasets/group12/20_12_04_08_32_40.csv 20_12_04_08_32_40.csv 1.4181338e7
dbfs:/datasets/group12/20_12_04_10_47_08.csv 20_12_04_10_47_08.csv 1.4190774e7
dbfs:/datasets/group12/21_01_07_08_50_05.csv 21_01_07_08_50_05.csv 1.4577033e7
dbfs:/datasets/group12/21_01_07_09_05_33.csv 21_01_07_09_05_33.csv 1.4577033e7