We start by checking whether the files necessary to run our notebooks are already in the distributed file system (DBFS).
dbutils.fs.ls("dbfs:///FileStore/06_LHC")
res0: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/FileStore/06_LHC/LICENSE, LICENSE, 1071), FileInfo(dbfs:/FileStore/06_LHC/README.md, README.md, 3150), FileInfo(dbfs:/FileStore/06_LHC/data/, data/, 0), FileInfo(dbfs:/FileStore/06_LHC/h5/, h5/, 0), FileInfo(dbfs:/FileStore/06_LHC/models/, models/, 0), FileInfo(dbfs:/FileStore/06_LHC/scripts/, scripts/, 0), FileInfo(dbfs:/FileStore/06_LHC/utils/, utils/, 0))
Important!
Run the command above to check whether the required files and data are already available in the distributed file system; you should see the following:
Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(FileInfo(dbfs:/FileStore/06_LHC/LICENSE, LICENSE, 1071), FileInfo(dbfs:/FileStore/06_LHC/README.md, README.md, 3150), FileInfo(dbfs:/FileStore/06_LHC/data/, data/, 0), FileInfo(dbfs:/FileStore/06_LHC/h5/, h5/, 0), FileInfo(dbfs:/FileStore/06_LHC/models/, models/, 0), FileInfo(dbfs:/FileStore/06_LHC/scripts/, scripts/, 0), FileInfo(dbfs:/FileStore/06_LHC/utils/, utils/, 0))
If these items appear, you can skip most of this notebook: go to Command Cell 21 to import the data to the local driver.
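If you prefer a programmatic check, here is a minimal sketch (a Python cell; the dbfs_path_exists helper is hypothetical) that probes DBFS and reports whether the setup steps below can be skipped:
def dbfs_path_exists(path):
    # dbutils.fs.ls raises an exception when the path does not exist
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False

if dbfs_path_exists("dbfs:///FileStore/06_LHC"):
    print("Files already staged on DBFS -- skip ahead to the import step.")
else:
    print("Files not found -- run the download and preprocessing cells below.")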
Otherwise, we first install the required Python package.
pip install h5py
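To confirm the installation worked, a quick import check (Python) suffices:
# verify that h5py is importable and report its version
import h5py
print(h5py.__version__)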
Next, we prepare a clean folder on the local driver to process our files.
rm -r 06_LHC
mkdir 06_LHC
Now get all necessary files from the project repository on GitHub.
cd 06_LHC
wget https://github.com/dgedon/ProjectParticleClusteringv2/archive/main.zip
unzip main.zip
mv ProjectParticleClusteringv2-main/* .
rm -r ProjectParticleClusteringv2-main/ main.zip
We download the necessary data (first training, then validation) and untar the archives.
cd 06_LHC
mkdir data
cd data
wget https://zenodo.org/record/3602254/files/hls4ml_LHCjet_100p_train.tar.gz
tar --no-same-owner -xvf hls4ml_LHCjet_100p_train.tar.gz
cd 06_LHC/data
wget https://zenodo.org/record/3602254/files/hls4ml_LHCjet_100p_val.tar.gz
tar --no-same-owner -xvf hls4ml_LHCjet_100p_val.tar.gz
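Before preprocessing, it can help to peek inside the extracted files. This is a hedged sketch: it assumes the archives unpack to *.h5 files somewhere under 06_LHC/data/ on the driver, and simply prints the top-level entries of the first one found.
import glob
import h5py

# find any HDF5 files the tarballs produced (exact layout is assumed)
files = sorted(glob.glob("/databricks/driver/06_LHC/data/**/*.h5", recursive=True))
print("found", len(files), "HDF5 files")
if files:
    with h5py.File(files[0], "r") as f:
        # print each top-level entry and, for datasets, its shape
        for name in f:
            print(name, getattr(f[name], "shape", "(group)"))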
Now we preprocess the data: prepare_data_multi.py converts the raw downloads into the HDF5 inputs consumed by the later notebooks.
cd 06_LHC/scripts
python prepare_data_multi.py --dir ../data/
cd 06_LHC/scripts
python prepare_data_multi.py --dir ../data/ --make_eval
rm -r ../data
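As a sanity check, list the preprocessed output. The h5/ location is an assumption based on the folder listings later in this notebook:
import os

# assumed output folder of prepare_data_multi.py
out_dir = "/databricks/driver/06_LHC/h5"
print(os.listdir(out_dir))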
Finally, copy everything onto the distributed file system. All necessary files are stored under FileStore/06_LHC.
dbutils.fs.cp("file:////databricks/driver/06_LHC", "dbfs:///FileStore/06_LHC", recurse=true)
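To verify the copy succeeded, list the destination (a Python equivalent of the Scala listing at the top of this notebook):
# list everything now staged under dbfs:/FileStore/06_LHC
for info in dbutils.fs.ls("dbfs:///FileStore/06_LHC"):
    print(info.path, info.size)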
Important (continued from above)
Import files to local driver
For future notebooks, run the command below to import the files to the local driver. This may take a minute.
dbutils.fs.cp("dbfs:///FileStore/06_LHC", "file:////databricks/driver/06_LHC", recurse=true)
res0: Boolean = true
Run the command below to list the items in the 06_LHC folder. You should see the following:
- LICENSE
- README.md
- h5
- models
- scripts
- utils
ls 06_LHC/
You are now ready to go!