Download Files Periodically
This notebook sets up and runs a script that periodically downloads files, in this case the "Our World in Data" dataset CSV files, which are updated daily.
Content is based on "037a_AnimalNamesStructStreamingFiles" by Raazesh Sainudiin.
To be able to kill a .sh process later, we first need to install psmisc, which provides killall
apt-get install -y psmisc
create a new directory for our files if needed
dbutils.fs.mkdirs("file:///databricks/driver/projects/group12")
res0: Boolean = true
create a shell script to periodically download the dataset (currently set to download once per day):
1. shell
2. remove the previous shell script
3. write to script: bash binaries
4. write to script: remove folder where previous downloaded files are located
5. write to script: make new directory to put downloaded files
6. write to script: while loop:
   6.1) remove old downloaded csv dataset
   6.2) download new csv dataset
   6.3) copy the csv file to the newly created directory using the timestamp as name
7. print the contents of the shell script
rm -f projects/group12/group12downloadFiles.sh &&
echo "#!/bin/bash" >> projects/group12/group12downloadFiles.sh &&
echo "rm -rf projects/group12/logsEveryXSecs" >> projects/group12/group12downloadFiles.sh &&
echo "mkdir -p projects/group12/logsEveryXSecs" >> projects/group12/group12downloadFiles.sh &&
echo "while true; rm owid-covid-data.csv; wget https://covid.ourworldindata.org/data/owid-covid-data.csv; do echo \$( date --rfc-3339=second )\; | cp owid-covid-data.csv projects/group12/logsEveryXSecs/\$( date '+%y_%m_%d_%H_%M_%S.csv' ); sleep 86400; done" >> projects/group12/group12downloadFiles.sh &&
cat projects/group12/group12downloadFiles.sh
#!/bin/bash
rm -rf projects/group12/logsEveryXSecs
mkdir -p projects/group12/logsEveryXSecs
while true; rm owid-covid-data.csv; wget https://covid.ourworldindata.org/data/owid-covid-data.csv; do echo $( date --rfc-3339=second )\; | cp owid-covid-data.csv projects/group12/logsEveryXSecs/$( date '+%y_%m_%d_%H_%M_%S.csv' ); sleep 86400; done
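The `echo ... >>` chain above needs `\$` escapes so that `date` runs when the script executes rather than when it is written. A minimal sketch of one alternative, assuming a quoted heredoc is acceptable here: the helper name `write_download_script` and the variables `OUT_DIR` and `INTERVAL_SECONDS` are hypothetical and not part of the original notebook, while the URL and directory are taken from the script above. In this sketch a failed download is reported and the loop carries on, and the interval is written as 24 * 60 * 60 = 86400 seconds, i.e. one day.

```shell
#!/bin/bash
# Hypothetical helper: writes a download script to the path given as $1.
write_download_script() {
  local script_path="$1"
  # The quoted delimiter ('EOF') stops the shell from expanding $( ... ) now,
  # so the date commands run each time the generated script loops.
  cat > "$script_path" <<'EOF'
#!/bin/bash
OUT_DIR=projects/group12/logsEveryXSecs
INTERVAL_SECONDS=$(( 24 * 60 * 60 ))
rm -rf "$OUT_DIR"
mkdir -p "$OUT_DIR"
while true; do
  # Download straight into a timestamped file.
  target="$OUT_DIR/$( date '+%y_%m_%d_%H_%M_%S.csv' )"
  if wget -q -O "$target" https://covid.ourworldindata.org/data/owid-covid-data.csv; then
    echo "$( date --rfc-3339=second ) saved $target"
  else
    # Report the failure, drop the partial file, and keep looping.
    echo "$( date --rfc-3339=second ) download failed" >&2
    rm -f "$target"
  fi
  sleep "$INTERVAL_SECONDS"
done
EOF
  chmod 744 "$script_path"
}
```

Calling `write_download_script projects/group12/group12downloadFiles.sh` would write the file and make it executable; nothing is downloaded until the generated script itself is run.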
make the shell script executable
chmod 744 projects/group12/group12downloadFiles.sh
execute the shell script
nohup projects/group12/group12downloadFiles.sh
look at the files
pwd
ls -al projects/group12/logsEveryXSecs
/databricks/driver
total 14244
drwxr-xr-x 2 root root 4096 Jan 7 09:05 .
drwxr-xr-x 3 root root 4096 Jan 7 09:05 ..
-rw-r--r-- 1 root root 14577033 Jan 7 09:05 21_01_07_09_05_33.csv
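The file name in the listing comes from the `date '+%y_%m_%d_%H_%M_%S.csv'` format used in the script: two-digit year, month, day, hour, minute and second, joined by underscores. A small sketch of that naming step in isolation, assuming GNU `date` (which the script already relies on for `--rfc-3339`); the function name `timestamp_name` is hypothetical, and it takes a Unix epoch value and uses UTC so that the result is reproducible.

```shell
# Hypothetical helper: build a file name like the script does, but for a
# given Unix epoch ($1) in UTC instead of the current local time.
timestamp_name() {
  date -u -d "@$1" '+%y_%m_%d_%H_%M_%S.csv'
}

# Example: epoch 0 is 1970-01-01 00:00:00 UTC.
timestamp_name 0   # prints 70_01_01_00_00_00.csv
```

Because every field is zero-padded and ordered from year down to second, sorting the names alphabetically also sorts them chronologically.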
look at the file content (replace XXXX with the name of a downloaded file from the listing above)
cat projects/group12/logsEveryXSecs/XXXX.csv
cat: projects/group12/logsEveryXSecs/XXXX.csv: No such file or directory
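The `XXXX` placeholder has to be replaced by hand, which is why the command above fails as written. A minimal sketch of picking the most recent download automatically, assuming the timestamped file names produced by the script; the function name `latest_csv` is hypothetical.

```shell
# Hypothetical helper: print the path of the newest .csv file in directory $1.
latest_csv() {
  local dir="$1"
  # File names are zero-padded yy_mm_dd_HH_MM_SS timestamps, so the
  # alphabetically last name is also the most recent download.
  ls "$dir"/*.csv 2>/dev/null | sort | tail -n 1
}

# Usage (prints only the first few lines rather than the whole file):
#   head -n 5 "$(latest_csv projects/group12/logsEveryXSecs)"
```

Using `head` instead of `cat` keeps the cell output short, since each download is around 14 MB.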
kill the .sh process
killall group12downloadFiles.sh
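`nohup` on its own keeps the script in the foreground, and `killall` stops every process with the given name. A sketch of an alternative that starts the script in the background and later stops exactly that process, assuming a plain bash session (behaviour inside a notebook shell cell may differ); the function names and the PID and log file arguments are hypothetical.

```shell
# Hypothetical helper: start script $1 in the background, write its output
# to log file $3 and record its process ID in file $2.
start_in_background() {
  local script="$1" pid_file="$2" log_file="$3"
  # Redirecting output avoids the default nohup.out file.
  nohup "$script" > "$log_file" 2>&1 &
  echo $! > "$pid_file"
}

# Hypothetical helper: stop the process recorded in PID file $1.
stop_from_pid_file() {
  local pid_file="$1"
  # Signals only the recorded process, then removes the PID file.
  kill "$(cat "$pid_file")" && rm -f "$pid_file"
}

# Usage:
#   start_in_background projects/group12/group12downloadFiles.sh dl.pid dl.log
#   stop_from_pid_file dl.pid
```

Note that this signals the script's own shell; a `wget` or `sleep` it started may keep running until it finishes.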
move downloaded files to another location to make sure we don't delete the datasets
// dbutils.fs.mkdirs("/datasets/group12/")
dbutils.fs.cp("file:///databricks/driver/projects/group12/logsEveryXSecs/","/datasets/group12/",true)
res5: Boolean = true
display(dbutils.fs.ls("/datasets/group12/"))
path | name | size (bytes) |
---|---|---|
dbfs:/datasets/group12/20_12_04_08_31_44.csv | 20_12_04_08_31_44.csv | 14181338 |
dbfs:/datasets/group12/20_12_04_08_32_40.csv | 20_12_04_08_32_40.csv | 14181338 |
dbfs:/datasets/group12/20_12_04_10_47_08.csv | 20_12_04_10_47_08.csv | 14190774 |
dbfs:/datasets/group12/21_01_07_08_50_05.csv | 21_01_07_08_50_05.csv | 14577033 |
dbfs:/datasets/group12/21_01_07_09_05_33.csv | 21_01_07_09_05_33.csv | 14577033 |