Parsing FIA PDF Data
Save the PDF Locally
Perform Imports and Add Our Development Directory to the Python Path
In [1]:
import sys
import os
import seaborn as sns
%matplotlib inline
# handle sphinx building docs from higher level directory
# '__file__' is a plain string here, so this resolves to the current working directory
path = os.path.realpath(os.path.dirname('__file__'))
if 'notebooks' not in path:
    rel_path = ''
else:
    rel_path = '../../'
# get access to formulapy module
sys.path.append(os.path.realpath(rel_path))
Import the FIA PDF Parser
In [2]:
from formulapy.data.fia.parsers import parse_laptimes
Call the Parser on our PDF File
This demonstrates the primary data format used by this project: the pandas DataFrame. A DataFrame is a tabular structure that works much like an in-memory database table, which can be more intuitive than the nested structures often used with MATLAB.
This example looks complicated only because the relative path changes when the docs are built. When running it locally, you would simply call:
parse_laptimes(r'\full\path\to\thepdf.pdf')  # raw string, so \f and \t are not read as escape sequences
In [3]:
df = parse_laptimes(os.path.realpath(rel_path + 'data/fia_qualifying.pdf'))
df
Out[3]:
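To make the "in-memory database" comparison concrete, here is a minimal sketch of a DataFrame with the three columns used in the cells of this notebook (`driver_no`, `name`, `time`). The rows are invented for illustration and are not parser output; the parsed data may contain other columns, and `time` is assumed to be a lap time in seconds.

```python
import pandas as pd

# Hypothetical rows standing in for the parser output. The values are
# invented; only the column names come from the examples in this notebook.
laps = pd.DataFrame(
    {
        "driver_no": [1, 1, 77, 77],
        "name": ["A.EXAMPLE", "A.EXAMPLE", "V.BOTTAS", "V.BOTTAS"],
        "time": [95.2, 93.8, 94.1, 130.5],  # assumed: lap time in seconds
    }
)

print(laps.shape)    # (number of rows, number of columns)
print(laps.dtypes)   # one dtype per column, much like a table schema
print(laps.head(2))  # first two rows
```

Each column holds a single dtype and each row is one record, which is what makes the query-like selections in the following cells possible.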
Example of Using DataFrames on Parsed Data
Select a Driver by Number
In [4]:
df[df.driver_no == 1]
Out[4]:
Or, Select by Name
In [5]:
df[df.name == 'V.BOTTAS']
Out[5]:
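The two selections above can also be combined. The following sketch uses a small invented DataFrame (the rows are hypothetical, not parser output) to show a combined boolean mask and a selection of several driver numbers at once.

```python
import pandas as pd

# Hypothetical data; only the column names match the parsed DataFrame.
laps = pd.DataFrame(
    {
        "driver_no": [1, 1, 77, 77, 44],
        "name": ["A.EXAMPLE", "A.EXAMPLE", "V.BOTTAS", "V.BOTTAS", "B.EXAMPLE"],
        "time": [95.2, 93.8, 94.1, 130.5, 94.9],  # assumed: seconds
    }
)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
bottas_fast = laps[(laps["name"] == "V.BOTTAS") & (laps["time"] < 100)]

# Select several drivers at once by number.
selected = laps[laps["driver_no"].isin([1, 77])]

print(bottas_fast)
print(selected)
```

Bracket access (`laps["name"]`) is used here instead of attribute access because it cannot clash with existing DataFrame attributes or methods.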
Select the Lowest Time
In [6]:
df[df.time == df.time.min()]
Out[6]:
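There are other ways to reach the fastest lap. The sketch below, again on invented rows, shows `idxmin` for a single row and `nsmallest` for the top few rows; which one fits depends on whether tied times should all be returned.

```python
import pandas as pd

# Hypothetical data; only the column names match the parsed DataFrame.
laps = pd.DataFrame(
    {
        "driver_no": [1, 1, 77, 77],
        "name": ["A.EXAMPLE", "A.EXAMPLE", "V.BOTTAS", "V.BOTTAS"],
        "time": [95.2, 93.8, 94.1, 130.5],  # assumed: seconds
    }
)

# idxmin returns the index label of the first minimum; .loc fetches that row.
fastest_row = laps.loc[laps["time"].idxmin()]

# nsmallest returns the n rows with the lowest values, already sorted.
top_two = laps.nsmallest(2, "time")

print(fastest_row)
print(top_two)
```

Comparing against the minimum, as in the cell above, keeps every row that ties for the lowest time, whereas `idxmin` returns only the first one.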
Sort Rows by a Given Column
In [7]:
df.sort_values(by='time')  # DataFrame.sort was removed in later pandas releases
Out[7]:
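In current pandas releases, sorting is done with `sort_values`. The sketch below, on invented rows, shows ascending and descending order and sorting by more than one column.

```python
import pandas as pd

# Hypothetical data; only the column names match the parsed DataFrame.
laps = pd.DataFrame(
    {
        "driver_no": [1, 1, 77, 77],
        "name": ["A.EXAMPLE", "A.EXAMPLE", "V.BOTTAS", "V.BOTTAS"],
        "time": [95.2, 93.8, 94.1, 130.5],  # assumed: seconds
    }
)

# Fastest first (ascending is the default).
by_time = laps.sort_values(by="time")

# Slowest first.
slowest_first = laps.sort_values(by="time", ascending=False)

# Group each driver's laps together, fastest first within each driver,
# and renumber the rows from zero.
by_driver_then_time = laps.sort_values(by=["driver_no", "time"]).reset_index(drop=True)

print(by_time)
print(slowest_first)
print(by_driver_then_time)
```

`sort_values` returns a new DataFrame and leaves `laps` unchanged, so the result has to be assigned if it is needed later.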
Seaborn Works Well for Statistical Plots
pandas has some built-in plotting functionality as well. Note that tools built around DataFrames can infer information from the DataFrame or Series that would not be available with plain arrays, such as labeling the x-axis 'time'.
In [8]:
sns.histplot(df.time, kde=True, stat='density')  # distplot is deprecated in recent seaborn releases
Out[8]:
Only Look at the Lower Times Posted
In [9]:
sns.histplot(df.loc[df.time < 100, 'time'], kde=True, stat='density')  # distplot is deprecated in recent seaborn releases
Out[9]:
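The fixed threshold of 100 works for this session but would need adjusting for other circuits. One possible alternative, sketched here on invented values, is to derive the cutoff from the fastest lap. The 7% margin is an arbitrary figure chosen for illustration, and the times are assumed to be in seconds.

```python
import pandas as pd

# Hypothetical lap times in seconds; the last two stand in for slow laps.
times = pd.Series([93.8, 94.1, 94.9, 95.2, 130.5, 141.0], name="time")

# Keep laps within 7% of the fastest one (the margin is an assumption).
cutoff = times.min() * 1.07
fast_laps = times[times <= cutoff]

print(cutoff)
print(fast_laps.describe())  # count, mean, spread of the retained laps
```

The filtered Series keeps its name, so passing `fast_laps` to a plotting call would still label the axis 'time'.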