ScaDaMaLe Course site and book

yfinance Stock Data

Johannes Graner (LinkedIn), Albert Nilsson (LinkedIn) and Raazesh Sainudiin (LinkedIn)

2020, Uppsala, Sweden

This project was supported by Combient Mix AB through summer internships at:

Combient Competence Centre for Data Engineering Sciences, Department of Mathematics, Uppsala University, Uppsala, Sweden


This notebook builds on the following repositories in order to obtain SparkSQL DataSets and DataFrames of freely available Yahoo! Finance Data so that they can be ingested into delta.io Tables for trend analysis and more:

Resources:

Yfinance is a python library that makes it easy to download various financial data from Yahoo Finance.

pip install yfinance
"./000a_finance_utils"

To illustrate the library, we use two stocks of the Swedish bank SEB. The (default) data resolution is one day so we use 20 years of data to get a lot of observations.

import yfinance as yf
dataSEBAST = yf.download("SEB-A.ST", start="2001-07-01", end="2020-07-12")
dataSEBCST = yf.download("SEB-C.ST", start="2001-07-01", end="2020-07-12")
dataSEBAST.size
dataSEBAST

We can also download several tickers at once.

Note that the result is in a pandas dataframe so we are not doing any distributed computing.

This means that some care has to be taken to not overwhelm the local machine.

defined object TrendUtils
dataSEBAandCSTBST = yf.download("SEB-A.ST SEB-C.ST", start="2020-07-01", end="2020-07-12", group_by="ticker")
dataSEBAandCSTBST
type(dataSEBAST)

Loading the data into a Spark DataFrame.

dataSEBAST_sp = spark.createDataFrame(dataSEBAandCSTBST)
dataSEBAST_sp.printSchema()

The conversion to Spark DataFrame works but is quite messy. Just imagine if there were more tickers!

dataSEBAST_sp.show(20, False)

When selecting a column with a dot in the name (as in ('SEB-A.ST', 'High')) using PySpark, we have to enclose the column name in backticks `.

dataSEBAST_sp.select("`('SEB-A.ST', 'High')`")

We can also get information about individual tickers.

msft = yf.Ticker("MSFT")
print(msft)
msft.info

We write a function to transform data downloaded by yfinance and write a better formatted Spark DataFrame.

import pandas as pd
import yfinance as yf
import sys, getopt

# example:
# python3 yfin_to_csv.py -i "60m" "SEB-A.ST INVE-A.ST" "2019-07-01" "2019-07-06" "/root/GIT/yfin_test.csv"
def ingest(interval, tickers, start, end, csv_path):
  df = yf.download(tickers, start=start, end=end, interval=interval, group_by='ticker')
  findf = df.unstack().unstack(1).sort_index(level=1)
  findf.reset_index(level=0, inplace=True)
  findf = findf.loc[start:end]
  findf.rename(columns={'level_0':'Ticker'}, inplace=True)
  findf.index.name='Time'
  findf['Volume'] = pd.to_numeric(findf['Volume'], downcast='integer')
  findf = findf.reset_index(drop=False)
  findf['Time'] = findf['Time'].map(lambda x: str(x))
  spark.createDataFrame(findf).write.mode('overwrite').save(csv_path, format='csv')
  return(findf)

Let's look at some top value companies in the world as well as an assortment of Swedish Companies.

The number of tickers is now much larger, using the previous method would result in over 100 columns.

This would make it quite difficult to see what's going on, not to mention trying to analyze the data!

The data resolution is now one minute, meaning that 7 days gives a lot of observations.

Yfinance only allows downloading 1-minute data one week at a time and only for dates within the last 30 days.

topValueCompanies = 'MSFT AAPL AMZN GOOG BABA FB BRK-B BRK-A TSLA'
swedishCompanies = 'ASSA-B.ST ATCO-A.ST ALIV-SDB.ST ELUX-A.ST ELUX-B.ST EPI-A.ST EPI-B.ST ERIC-A.ST ERIC-B.ST FORTUM.HE HUSQ-A.ST HUSQ-B.ST INVE-A.ST INVE-B.ST KESKOA.HE KESKOB.HE KNEBV.HE KCR.HE MTRS.ST SAAB-B.ST SAS.ST SEB-A.ST SEB-C.ST SKF-A.ST SKF-B.ST STE-R.ST STERV.HE WRT1V.HE'
tickers = topValueCompanies + ' ' + swedishCompanies
interval = '1m'
start = '2021-01-01'
end = '2021-01-07'
csv_path = TrendUtils.getYfinancePath() + 'stocks_' + interval + '_' + start + '_' + end + '.csv'
dbutils.fs.rm(csv_path, recurse=True)
df = ingest(interval, tickers, start, end, csv_path)

Having written the result to a csv file, we can use the parser in the Trend Calculus library https://github.com/lamastex/spark-trend-calculus to read the data into a Dataset in Scala Spark.

import org.lamastex.spark.trendcalculus._
import org.lamastex.spark.trendcalculus._
val rootPath = TrendUtils.getYfinancePath
val csv_path = rootPath + "stocks_1m_2021-01-01_2021-01-07.csv"
val yfinDF = spark.read.yfin(csv_path)
yfinDF.count
res1: Long = 71965
yfinDF.show(20, false)
+-------------------+---------+------------------+------------------+------------------+------------------+------------------+------+
|time               |ticker   |open              |high              |low               |close             |adjClose          |volume|
+-------------------+---------+------------------+------------------+------------------+------------------+------------------+------+
|2021-01-04 14:45:00|BRK-B    |230.7899932861328 |230.99000549316406|230.72999572753906|230.97000122070312|230.97000122070312|13367 |
|2021-01-04 14:45:00|ELUX-A.ST|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|ELUX-B.ST|190.14999389648438|190.14999389648438|190.14999389648438|190.14999389648438|190.14999389648438|2770  |
|2021-01-04 14:45:00|EPI-A.ST |null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|EPI-B.ST |null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|ERIC-A.ST|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|ERIC-B.ST|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|FB       |271.30499267578125|271.42498779296875|271.30499267578125|271.42498779296875|271.42498779296875|28330 |
|2021-01-04 14:45:00|FORTUM.HE|20.31999969482422 |20.31999969482422 |20.31999969482422 |20.31999969482422 |20.31999969482422 |49824 |
|2021-01-04 14:45:00|GOOG     |1749.4150390625   |1751.260009765625 |1749.4150390625   |1750.18994140625  |1750.18994140625  |4109  |
|2021-01-04 14:45:00|HUSQ-A.ST|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|HUSQ-B.ST|108.9000015258789 |108.9000015258789 |108.9000015258789 |108.9000015258789 |108.9000015258789 |7170  |
|2021-01-04 14:45:00|INVE-A.ST|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|INVE-B.ST|612.5999755859375 |612.5999755859375 |612.4000244140625 |612.4000244140625 |612.4000244140625 |1628  |
|2021-01-04 14:45:00|KCR.HE   |29.040000915527344|29.040000915527344|29.040000915527344|29.040000915527344|29.040000915527344|0     |
|2021-01-04 14:45:00|KESKOA.HE|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|KESKOB.HE|null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|KNEBV.HE |null              |null              |null              |null              |null              |null  |
|2021-01-04 14:45:00|MSFT     |220.25            |220.32000732421875|220.1699981689453 |220.1699981689453 |220.1699981689453 |96255 |
|2021-01-04 14:45:00|MTRS.ST  |null              |null              |null              |null              |null              |null  |
+-------------------+---------+------------------+------------------+------------------+------------------+------------------+------+
only showing top 20 rows