Parsing FIA PDF Data

Save the PDF Locally

Perform Imports and Add Our Development Directory to the Python Path

In [1]:

import sys
import os
import seaborn as sns
%matplotlib inline

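# handle sphinx building the docs from a higher-level directory
# note: '__file__' is a string literal here, so dirname() returns '' and
# realpath() resolves to the current working directory
path = os.path.realpath(os.path.dirname('__file__'))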
if 'notebooks' not in path:
    rel_path = ''
else:
    rel_path = '../../'

# get access to formulapy module
sys.path.append(os.path.realpath(rel_path))

Import the FIA PDF Parser

In [2]:

from formulapy.data.fia.parsers import parse_laptimes

Call the Parser on our PDF File

This demonstrates the primary data format used by this project, the pandas DataFrame. A DataFrame is a tabular format that works much like an in-memory database table, which can be more intuitive than the nested structures often used with MATLAB.

The cell below looks more complicated than necessary because it has to handle different paths when the docs are built. When using the parser locally, you would simply call:

parse_laptimes(r'\full\path\to\thepdf.pdf')
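A note on Windows-style paths like the placeholder above: in a normal Python string literal, sequences such as `\t` and `\f` are escape characters, so a backslash path should be written as a raw string or built with `pathlib`. The sketch below illustrates this; the drive letter and folder names are hypothetical placeholders, not paths from this project.

```python
from pathlib import PureWindowsPath

# In a normal string literal, '\t' is a single tab character,
# not a backslash followed by 't'.
plain = 'C:\temp'
raw = r'C:\temp'   # raw string: the backslash is kept literally

print(len(plain))  # 6 characters: 'C', ':', tab, 'e', 'm', 'p'
print(len(raw))    # 7 characters: the backslash and 't' are separate

# Building the path from parts avoids escape issues entirely.
# The path pieces here are hypothetical placeholders.
pdf_path = PureWindowsPath('C:/') / 'full' / 'path' / 'to' / 'thepdf.pdf'
print(pdf_path)
```

Forward slashes also work on Windows for most Python file APIs, which is why the notebook cell below can use `'data/fia_qualifying.pdf'`.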

In [3]:

df = parse_laptimes(os.path.realpath(rel_path + 'data/fia_qualifying.pdf'))
df

Out[3]:

	driver_no	name	time
0	1	S.VETTEL	846.000
1	1	S.VETTEL	94.784
2	1	S.VETTEL	95.517
3	1	S.VETTEL	111.270
4	1	S.VETTEL	112.372
5	1	S.VETTEL	559.995
6	1	S.VETTEL	95.726
7	1	S.VETTEL	94.766
8	1	S.VETTEL	109.571
9	1	S.VETTEL	107.159
10	1	S.VETTEL	977.535
11	1	S.VETTEL	185.605
12	1	S.VETTEL	95.104
13	1	S.VETTEL	94.432
14	1	S.VETTEL	108.724
15	1	S.VETTEL	117.619
16	1	S.VETTEL	348.486
17	3	D.RICCIARDO	846.000
18	3	D.RICCIARDO	94.466
19	3	D.RICCIARDO	95.613
20	3	D.RICCIARDO	122.818
21	3	D.RICCIARDO	119.784
22	3	D.RICCIARDO	546.494
23	3	D.RICCIARDO	95.593
24	3	D.RICCIARDO	94.503
25	3	D.RICCIARDO	107.671
26	3	D.RICCIARDO	106.924
27	3	D.RICCIARDO	987.803
28	3	D.RICCIARDO	189.837
29	3	D.RICCIARDO	95.180
...	...	...	...
258	77	V.BOTTAS	843.000
259	77	V.BOTTAS	93.443
260	77	V.BOTTAS	94.301
261	77	V.BOTTAS	113.357
262	77	V.BOTTAS	113.219
263	77	V.BOTTAS	487.053
264	77	V.BOTTAS	103.937
265	77	V.BOTTAS	93.329
266	77	V.BOTTAS	1188.277
267	77	V.BOTTAS	110.096
268	77	V.BOTTAS	93.801
269	77	V.BOTTAS	293.273
270	77	V.BOTTAS	115.373
271	77	V.BOTTAS	93.128
272	77	V.BOTTAS	398.983
273	77	V.BOTTAS	126.088
274	99	A.SUTIL	842.000
275	99	A.SUTIL	123.845
276	99	A.SUTIL	96.338
277	99	A.SUTIL	397.805
278	99	A.SUTIL	116.811
279	99	A.SUTIL	96.656
280	99	A.SUTIL	96.653
281	99	A.SUTIL	113.326
282	99	A.SUTIL	113.668
283	99	A.SUTIL	460.364
284	99	A.SUTIL	451.681
285	99	A.SUTIL	95.736
286	99	A.SUTIL	95.364
287	99	A.SUTIL	116.468

288 rows × 3 columns
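For readers without the PDF at hand, the tabular layout can be reproduced with a small hand-built frame. This is only a sketch: the six rows are copied from the output above, the name `sample` is made up for illustration, and the `time` values are presumably lap times in seconds. It also shows the same data as a nested structure for comparison.

```python
import pandas as pd

# A few rows copied from the parser output shown above
sample = pd.DataFrame({
    'driver_no': [1, 1, 3, 3, 77, 77],
    'name': ['S.VETTEL', 'S.VETTEL', 'D.RICCIARDO',
             'D.RICCIARDO', 'V.BOTTAS', 'V.BOTTAS'],
    'time': [94.784, 95.517, 94.466, 95.613, 93.443, 94.301],
})

# The same information as a nested structure: one list of times per driver
nested = {name: group['time'].tolist()
          for name, group in sample.groupby('name')}

print(sample.shape)        # (rows, columns)
print(nested['V.BOTTAS'])  # times for a single driver
```

In the flat layout every row is one observation, so filtering, sorting, and grouping all operate on the same table instead of requiring loops over nested containers.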

Example of Using DataFrames on Parsed Data

Select a Driver by Number

In [4]:

df[df.driver_no == 1]

Out[4]:

	driver_no	name	time
0	1	S.VETTEL	846.000
1	1	S.VETTEL	94.784
2	1	S.VETTEL	95.517
3	1	S.VETTEL	111.270
4	1	S.VETTEL	112.372
5	1	S.VETTEL	559.995
6	1	S.VETTEL	95.726
7	1	S.VETTEL	94.766
8	1	S.VETTEL	109.571
9	1	S.VETTEL	107.159
10	1	S.VETTEL	977.535
11	1	S.VETTEL	185.605
12	1	S.VETTEL	95.104
13	1	S.VETTEL	94.432
14	1	S.VETTEL	108.724
15	1	S.VETTEL	117.619
16	1	S.VETTEL	348.486

Or, Select by Name

In [5]:

df[df.name == 'V.BOTTAS']

Out[5]:

	driver_no	name	time
258	77	V.BOTTAS	843.000
259	77	V.BOTTAS	93.443
260	77	V.BOTTAS	94.301
261	77	V.BOTTAS	113.357
262	77	V.BOTTAS	113.219
263	77	V.BOTTAS	487.053
264	77	V.BOTTAS	103.937
265	77	V.BOTTAS	93.329
266	77	V.BOTTAS	1188.277
267	77	V.BOTTAS	110.096
268	77	V.BOTTAS	93.801
269	77	V.BOTTAS	293.273
270	77	V.BOTTAS	115.373
271	77	V.BOTTAS	93.128
272	77	V.BOTTAS	398.983
273	77	V.BOTTAS	126.088
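The two selections above each use a single boolean mask. As a hedged sketch of how such masks can be combined, the example below works on a small hand-built frame (rows copied from the output above; the name `sample` is hypothetical) rather than the parsed PDF.

```python
import pandas as pd

# Rows copied from the parser output; `sample` stands in for `df`
sample = pd.DataFrame({
    'driver_no': [1, 1, 3, 3, 77, 77],
    'name': ['S.VETTEL', 'S.VETTEL', 'D.RICCIARDO',
             'D.RICCIARDO', 'V.BOTTAS', 'V.BOTTAS'],
    'time': [94.784, 95.517, 94.466, 95.613, 93.443, 94.301],
})

# Combine masks with & (and) or | (or); each condition needs parentheses
fast_bottas = sample[(sample['name'] == 'V.BOTTAS') & (sample['time'] < 94)]

# isin() selects several drivers in one mask
two_drivers = sample[sample['driver_no'].isin([1, 77])]

print(fast_bottas)
print(two_drivers)
```

Bracket access (`sample['name']`) is used here instead of attribute access (`sample.name`) because it keeps working when a column name collides with an existing DataFrame attribute or method.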

Select the Lowest Time

In [6]:

df[df.time == df.time.min()]

Out[6]:

	driver_no	name	time
51	6	N.ROSBERG	92.506
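Comparing against the minimum returns every row that ties for the lowest time. Two related operations are sketched below on a small hand-built frame (rows copied from the outputs in this notebook; `sample` is a hypothetical name): `idxmin` to fetch a single fastest row, and `groupby` to get each driver's best time.

```python
import pandas as pd

# Rows copied from the parser output; `sample` stands in for `df`
sample = pd.DataFrame({
    'driver_no': [6, 6, 44, 77],
    'name': ['N.ROSBERG', 'N.ROSBERG', 'L.HAMILTON', 'V.BOTTAS'],
    'time': [92.629, 92.506, 92.703, 93.128],
})

# idxmin() gives the index label of the first minimum; .loc fetches that row
fastest_row = sample.loc[sample['time'].idxmin()]

# Best time per driver, ordered from fastest to slowest
best_per_driver = sample.groupby('name')['time'].min().sort_values()

print(fastest_row)
print(best_per_driver)
```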

Sort Rows by a Given Column

In [7]:

df.sort_values(by='time')

Out[7]:

	driver_no	name	time
51	6	N.ROSBERG	92.506
45	6	N.ROSBERG	92.629
254	44	L.HAMILTON	92.703
248	44	L.HAMILTON	92.946
52	6	N.ROSBERG	92.950
255	44	L.HAMILTON	92.982
271	77	V.BOTTAS	93.128
265	77	V.BOTTAS	93.329
259	77	V.BOTTAS	93.443
149	19	F.MASSA	93.527
155	19	F.MASSA	93.527
143	19	F.MASSA	93.551
247	44	L.HAMILTON	93.611
44	6	N.ROSBERG	93.671
119	14	F.ALONSO	93.675
132	14	F.ALONSO	93.740
268	77	V.BOTTAS	93.801
125	14	F.ALONSO	93.858
128	14	F.ALONSO	94.005
152	19	F.MASSA	94.059
30	3	D.RICCIARDO	94.075
159	20	K.MAGNUSSEN	94.229
171	20	K.MAGNUSSEN	94.242
260	77	V.BOTTAS	94.301
201	22	J.BUTTON	94.317
13	1	S.VETTEL	94.432
165	20	K.MAGNUSSEN	94.437
18	3	D.RICCIARDO	94.466
144	19	F.MASSA	94.483
120	14	F.ALONSO	94.497
...	...	...	...
134	17	J.BIANCHI	842.000
34	4	M.CHILTON	842.000
205	25	J.VERGNE	842.000
219	26	D.KVYAT	842.000
109	13	P.MALDONADO	842.000
258	77	V.BOTTAS	843.000
55	7	K.RAIKKONEN	843.000
245	44	L.HAMILTON	843.000
42	6	N.ROSBERG	843.000
142	19	F.MASSA	843.000
118	14	F.ALONSO	844.000
188	22	J.BUTTON	845.000
158	20	K.MAGNUSSEN	845.000
17	3	D.RICCIARDO	846.000
0	1	S.VETTEL	846.000
81	9	M.ERICSSON	847.000
220	26	D.KVYAT	849.817
208	25	J.VERGNE	874.166
66	7	K.RAIKKONEN	934.009
10	1	S.VETTEL	977.535
27	3	D.RICCIARDO	987.803
198	22	J.BUTTON	1036.979
43	6	N.ROSBERG	1116.970
246	44	L.HAMILTON	1125.072
126	14	F.ALONSO	1144.167
150	19	F.MASSA	1177.634
266	77	V.BOTTAS	1188.277
50	6	N.ROSBERG	1201.803
253	44	L.HAMILTON	1214.207
166	20	K.MAGNUSSEN	1239.418

288 rows × 3 columns
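Sorting returns a new DataFrame and leaves the original untouched. A few variations are sketched below, assuming a recent pandas version and a small hand-built frame (rows copied from the output above; `sample` is a hypothetical name).

```python
import pandas as pd

# Rows copied from the parser output; `sample` stands in for `df`
sample = pd.DataFrame({
    'driver_no': [6, 44, 1, 6, 77],
    'name': ['N.ROSBERG', 'L.HAMILTON', 'S.VETTEL', 'N.ROSBERG', 'V.BOTTAS'],
    'time': [92.506, 92.703, 94.432, 92.629, 93.128],
})

# Sort by several columns: first by driver number, then by time
by_driver_then_time = sample.sort_values(by=['driver_no', 'time'])

# Descending order puts the slowest entries first
slowest_first = sample.sort_values(by='time', ascending=False)

# nsmallest() is a shortcut for "sort, then take the first n rows"
top3 = sample.nsmallest(3, 'time')

print(by_driver_then_time)
print(slowest_first.head(1))
print(top3)
```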

Seaborn Works Well for Statistical Plots

pandas also has some built-in plotting functionality. Note that tools built around DataFrames can infer information from a DataFrame or Series that plain arrays do not carry, such as the column name: here the x axis is labeled 'time' automatically.

In [8]:

sns.histplot(df.time, kde=True)

Out[8]:

<matplotlib.axes.AxesSubplot at 0x16a0a160>

Only Look at the Lower Times Posted

In [9]:

sns.histplot(df.loc[df.time < 100, 'time'], kde=True)

Out[9]:

<matplotlib.axes.AxesSubplot at 0x167f1358>
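The built-in pandas plotting mentioned earlier can produce a similar histogram without seaborn. The following is a minimal sketch, assuming matplotlib is installed: the values are a handful of times copied from the parser output, the 100 threshold is the one used in the cell above, and the `Agg` backend is chosen only so the code runs without a display (in a notebook, `%matplotlib inline` takes its place).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption for scripts
import pandas as pd

# A few times copied from the parser output, as a named Series
times = pd.Series([92.506, 94.784, 95.517, 111.270, 559.995, 846.000],
                  name='time')

# Keep only the lower times, using the same threshold as the cell above
fast = times[times < 100]

# Series.plot.hist() returns a matplotlib Axes object
ax = fast.plot.hist(bins=5)

# Label the x axis explicitly from the Series name
ax.set_xlabel(fast.name)
ax.figure.savefig('fast_times_hist.png')  # hypothetical output file name
```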
