ScaDaMaLe Course site and book

We start by checking whether the files needed to run our notebooks are already in the distributed file system.

dbutils.fs.ls("dbfs:///FileStore/06_LHC")
res0: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/FileStore/06_LHC/LICENSE, LICENSE, 1071), FileInfo(dbfs:/FileStore/06_LHC/README.md, README.md, 3150), FileInfo(dbfs:/FileStore/06_LHC/data/, data/, 0), FileInfo(dbfs:/FileStore/06_LHC/h5/, h5/, 0), FileInfo(dbfs:/FileStore/06_LHC/models/, models/, 0), FileInfo(dbfs:/FileStore/06_LHC/scripts/, scripts/, 0), FileInfo(dbfs:/FileStore/06_LHC/utils/, utils/, 0))

Important!

Run the command above to check whether the required files and data are already available in the distributed file system. You should see the following:

Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/FileStore/06_LHC/LICENSE, LICENSE, 1071), FileInfo(dbfs:/FileStore/06_LHC/README.md, README.md, 3150), FileInfo(dbfs:/FileStore/06_LHC/data/, data/, 0), FileInfo(dbfs:/FileStore/06_LHC/h5/, h5/, 0), FileInfo(dbfs:/FileStore/06_LHC/models/, models/, 0), FileInfo(dbfs:/FileStore/06_LHC/scripts/, scripts/, 0), FileInfo(dbfs:/FileStore/06_LHC/utils/, utils/, 0))

If these items appear, you can skip most of this notebook and go to Command Cell 21 to import the data to the local driver.
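If you prefer to check programmatically rather than by eye, here is a minimal Scala sketch (an addition, not part of the original workflow) that wraps the same dbutils.fs.ls call; dbutils.fs.ls throws an exception when the path does not exist:

import scala.util.{Failure, Success, Try}

// dbutils.fs.ls throws if the path does not exist, so wrap it in Try.
val lhcPath = "dbfs:///FileStore/06_LHC"
Try(dbutils.fs.ls(lhcPath)) match {
  case Success(files) =>
    println(s"Found ${files.length} items under $lhcPath -- skip ahead to the import step.")
    files.foreach(f => println(s"  ${f.name} (${f.size} bytes)"))
  case Failure(_) =>
    println(s"$lhcPath not found -- run the download and preprocessing cells below first.")
}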

First, we install h5py, the Python library for reading and writing HDF5 files.

pip install h5py

Next, we prepare a folder on the local driver in which to process our files.

rm -rf 06_LHC
mkdir 06_LHC

Now we get all the necessary files from the project repository on GitHub.

cd 06_LHC

wget https://github.com/dgedon/ProjectParticleClusteringv2/archive/main.zip
unzip main.zip
mv ProjectParticleClusteringv2-main/* .
rm -r ProjectParticleClusteringv2-main/ main.zip

We download the necessary data (first training, then validation) and untar the files.

cd 06_LHC
mkdir data
cd data


wget https://zenodo.org/record/3602254/files/hls4ml_LHCjet_100p_train.tar.gz
tar --no-same-owner -xvf hls4ml_LHCjet_100p_train.tar.gz
cd 06_LHC/data
wget https://zenodo.org/record/3602254/files/hls4ml_LHCjet_100p_val.tar.gz
tar --no-same-owner -xvf hls4ml_LHCjet_100p_val.tar.gz

Now we preprocess the data. The prepare_data_multi.py script converts the raw jet files in data/ into the preprocessed form used by the later notebooks; running it a second time with the --make_eval flag produces the evaluation set. The raw data folder is removed afterwards since it is no longer needed.

cd 06_LHC/scripts
python prepare_data_multi.py --dir ../data/
cd 06_LHC/scripts
python prepare_data_multi.py --dir ../data/ --make_eval
rm -r ../data
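Before copying to DBFS, a quick sanity check can help. The following Scala sketch (an addition, assuming the default driver working directory /databricks/driver) lists the top level of the local 06_LHC folder; the exact contents depend on what prepare_data_multi.py writes.

import java.io.File

// List the top level of the local 06_LHC folder on the driver node.
val localDir = new File("/databricks/driver/06_LHC")
if (localDir.exists) {
  localDir.listFiles.sortBy(_.getName).foreach { f =>
    val kind = if (f.isDirectory) "dir " else "file"
    println(s"$kind ${f.getName} (${f.length} bytes)")
  }
} else {
  println(s"${localDir.getPath} does not exist -- run the cells above first.")
}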

Finally, we copy everything onto the distributed file system. All necessary files are stored under dbfs:/FileStore/06_LHC.

dbutils.fs.cp("file:////databricks/driver/06_LHC", "dbfs:///FileStore/06_LHC", recurse=true)
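To confirm that the recursive copy is complete, here is a short sketch (again an addition, using only the dbutils.fs.ls call shown earlier) that compares the total bytes on the driver with the total bytes on DBFS. In a DBFS listing, directories appear with a trailing slash and size 0, so they are recursed into.

// Total bytes under a local directory on the driver.
def localBytes(f: java.io.File): Long =
  if (f.isDirectory) f.listFiles.map(localBytes).sum else f.length

// Total bytes under a DBFS path; entries whose name ends with "/" are directories.
def dbfsBytes(path: String): Long =
  dbutils.fs.ls(path).map(f => if (f.name.endsWith("/")) dbfsBytes(f.path) else f.size).sum

println(s"driver: ${localBytes(new java.io.File("/databricks/driver/06_LHC"))} bytes")
println(s"dbfs:   ${dbfsBytes("dbfs:///FileStore/06_LHC")} bytes")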

Important (continued from above)

Import files to local driver

For the following notebooks, run the command below to import the files to the local driver. This may take a minute.

dbutils.fs.cp("dbfs:///FileStore/06_LHC", "file:////databricks/driver/06_LHC", recurse=true)
res0: Boolean = true

Run the command below to list the items in the 06_LHC folder. You should see the following:

  • LICENSE
  • README.md
  • h5
  • models
  • scripts
  • utils
ls 06_LHC/

You are now ready to go!