Parsing FIA PDF Data

Save the PDF Locally

Japanese Qualifying

Perform Imports and Add Our Development Directory to the Python Path

In [1]:
import sys
import os
import seaborn as sns
%matplotlib inline

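# Sphinx may build the docs from a higher-level directory, so check where we are running.
# '__file__' is deliberately a string: dirname() returns '', and realpath('')
# resolves to the current working directory (notebooks do not define __file__).
path = os.path.realpath(os.path.dirname('__file__'))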
if 'notebooks' not in path:
    rel_path = ''
else:
    rel_path = '../../'

# make the formulapy module importable
sys.path.append(os.path.realpath(rel_path))

Import the FIA PDF Parser

In [2]:
from formulapy.data.fia.parsers import parse_laptimes

Call the Parser on our PDF File

This demonstrates the primary data format that this project uses: the pandas DataFrame. It is a tabular format that works much like an in-memory database table, which can be more intuitive than the nested structures often used with Matlab.

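The call in the next cell looks complicated because it has to work with different paths when the docs are built. When using the parser locally, you would just call it with the full path (a raw string keeps the backslashes from being read as escape sequences such as \f and \t):

parse_laptimes(r'\full\path\to\thepdf.pdf')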
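As a small aside on paths, here is a sketch using only the standard library. It shows why a Windows-style path is safer as a raw string, and how the relative path used in this notebook could be assembled with os.path.join instead of string concatenation. The full path is only a placeholder.

```python
import os

# A raw string keeps every backslash literally (placeholder Windows-style path).
raw_path = r'\full\path\to\thepdf.pdf'
print(raw_path)

# Without the r prefix, '\f' is read as a single form feed character.
print(len(r'\f'), len('\f'))

# os.path.join builds the relative path with the right separator for the OS.
pdf_path = os.path.join('data', 'fia_qualifying.pdf')
print(os.path.basename(pdf_path))
```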

In [3]:
df = parse_laptimes(os.path.realpath(rel_path + 'data/fia_qualifying.pdf'))
df
Out[3]:
driver_no name time
0 1 S.VETTEL 846.000
1 1 S.VETTEL 94.784
2 1 S.VETTEL 95.517
3 1 S.VETTEL 111.270
4 1 S.VETTEL 112.372
5 1 S.VETTEL 559.995
6 1 S.VETTEL 95.726
7 1 S.VETTEL 94.766
8 1 S.VETTEL 109.571
9 1 S.VETTEL 107.159
10 1 S.VETTEL 977.535
11 1 S.VETTEL 185.605
12 1 S.VETTEL 95.104
13 1 S.VETTEL 94.432
14 1 S.VETTEL 108.724
15 1 S.VETTEL 117.619
16 1 S.VETTEL 348.486
17 3 D.RICCIARDO 846.000
18 3 D.RICCIARDO 94.466
19 3 D.RICCIARDO 95.613
20 3 D.RICCIARDO 122.818
21 3 D.RICCIARDO 119.784
22 3 D.RICCIARDO 546.494
23 3 D.RICCIARDO 95.593
24 3 D.RICCIARDO 94.503
25 3 D.RICCIARDO 107.671
26 3 D.RICCIARDO 106.924
27 3 D.RICCIARDO 987.803
28 3 D.RICCIARDO 189.837
29 3 D.RICCIARDO 95.180
... ... ... ...
258 77 V.BOTTAS 843.000
259 77 V.BOTTAS 93.443
260 77 V.BOTTAS 94.301
261 77 V.BOTTAS 113.357
262 77 V.BOTTAS 113.219
263 77 V.BOTTAS 487.053
264 77 V.BOTTAS 103.937
265 77 V.BOTTAS 93.329
266 77 V.BOTTAS 1188.277
267 77 V.BOTTAS 110.096
268 77 V.BOTTAS 93.801
269 77 V.BOTTAS 293.273
270 77 V.BOTTAS 115.373
271 77 V.BOTTAS 93.128
272 77 V.BOTTAS 398.983
273 77 V.BOTTAS 126.088
274 99 A.SUTIL 842.000
275 99 A.SUTIL 123.845
276 99 A.SUTIL 96.338
277 99 A.SUTIL 397.805
278 99 A.SUTIL 116.811
279 99 A.SUTIL 96.656
280 99 A.SUTIL 96.653
281 99 A.SUTIL 113.326
282 99 A.SUTIL 113.668
283 99 A.SUTIL 460.364
284 99 A.SUTIL 451.681
285 99 A.SUTIL 95.736
286 99 A.SUTIL 95.364
287 99 A.SUTIL 116.468

288 rows × 3 columns
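For readers who do not have the PDF at hand, the same three-column layout can be built by hand. This is only a sketch: the rows are copied from the output above, and the variable name sample is made up for illustration. It does not show how parse_laptimes builds its result.

```python
import pandas as pd

# A few rows copied from the parser output above.
sample = pd.DataFrame({
    "driver_no": [1, 1, 6, 6, 77, 77],
    "name": ["S.VETTEL", "S.VETTEL", "N.ROSBERG",
             "N.ROSBERG", "V.BOTTAS", "V.BOTTAS"],
    "time": [846.000, 94.784, 92.506, 92.629, 843.000, 93.443],
})

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # one dtype per column
```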

Example of Using DataFrames on Parsed Data

Select a Driver by Number

In [4]:
df[df.driver_no == 1]
Out[4]:
driver_no name time
0 1 S.VETTEL 846.000
1 1 S.VETTEL 94.784
2 1 S.VETTEL 95.517
3 1 S.VETTEL 111.270
4 1 S.VETTEL 112.372
5 1 S.VETTEL 559.995
6 1 S.VETTEL 95.726
7 1 S.VETTEL 94.766
8 1 S.VETTEL 109.571
9 1 S.VETTEL 107.159
10 1 S.VETTEL 977.535
11 1 S.VETTEL 185.605
12 1 S.VETTEL 95.104
13 1 S.VETTEL 94.432
14 1 S.VETTEL 108.724
15 1 S.VETTEL 117.619
16 1 S.VETTEL 348.486
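The selection above works in two steps: the comparison produces a boolean Series (a mask), and indexing with that mask keeps only the rows where it is True. A minimal sketch on a hand-built frame (rows copied from the output above; sample is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame({
    "driver_no": [1, 1, 3, 77],
    "name": ["S.VETTEL", "S.VETTEL", "D.RICCIARDO", "V.BOTTAS"],
    "time": [94.784, 95.517, 94.466, 93.443],
})

# Step 1: the comparison yields one True/False value per row.
mask = sample["driver_no"] == 1
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows marked True.
vettel = sample[mask]
print(vettel)
```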

Or, Select by Name

In [5]:
df[df.name == 'V.BOTTAS']
Out[5]:
driver_no name time
258 77 V.BOTTAS 843.000
259 77 V.BOTTAS 93.443
260 77 V.BOTTAS 94.301
261 77 V.BOTTAS 113.357
262 77 V.BOTTAS 113.219
263 77 V.BOTTAS 487.053
264 77 V.BOTTAS 103.937
265 77 V.BOTTAS 93.329
266 77 V.BOTTAS 1188.277
267 77 V.BOTTAS 110.096
268 77 V.BOTTAS 93.801
269 77 V.BOTTAS 293.273
270 77 V.BOTTAS 115.373
271 77 V.BOTTAS 93.128
272 77 V.BOTTAS 398.983
273 77 V.BOTTAS 126.088
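Masks can also be combined or built from a list of values. The sketch below, again on a small hand-built frame with rows taken from the output above, uses isin for several drivers at once, & to combine two conditions, and query as an alternative spelling of the same filter.

```python
import pandas as pd

sample = pd.DataFrame({
    "driver_no": [6, 77, 77, 99],
    "name": ["N.ROSBERG", "V.BOTTAS", "V.BOTTAS", "A.SUTIL"],
    "time": [92.506, 843.000, 93.443, 96.338],
})

# Rows for any of several drivers.
two_drivers = sample[sample["name"].isin(["V.BOTTAS", "N.ROSBERG"])]

# Two conditions at once; each one needs its own parentheses.
quick_bottas = sample[(sample["name"] == "V.BOTTAS") & (sample["time"] < 100)]

# The same filter written as a query string.
same = sample.query("name == 'V.BOTTAS' and time < 100")

print(two_drivers)
print(quick_bottas)
```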

Select the Lowest Time

In [6]:
df[df.time == df.time.min()]
Out[6]:
driver_no name time
51 6 N.ROSBERG 92.506
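A related question is each driver's best time rather than the single best overall. One way to sketch that is with groupby, shown here on a few rows copied from the tables in this notebook. Note the bracket access for the name column: on a single row, .name would return the row's index label instead.

```python
import pandas as pd

sample = pd.DataFrame({
    "driver_no": [1, 1, 6, 6, 77, 77],
    "name": ["S.VETTEL", "S.VETTEL", "N.ROSBERG",
             "N.ROSBERG", "V.BOTTAS", "V.BOTTAS"],
    "time": [94.784, 94.432, 92.506, 92.629, 93.443, 93.128],
})

# Best time per driver, fastest driver first.
best = sample.groupby("name")["time"].min().sort_values()
print(best)

# idxmin returns the index label of the lowest time; loc fetches that row.
fastest_row = sample.loc[sample["time"].idxmin()]
print(fastest_row["name"], fastest_row["time"])
```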

Sort Rows by a Given Column

In [7]:
df.sort_values(by='time')
Out[7]:
driver_no name time
51 6 N.ROSBERG 92.506
45 6 N.ROSBERG 92.629
254 44 L.HAMILTON 92.703
248 44 L.HAMILTON 92.946
52 6 N.ROSBERG 92.950
255 44 L.HAMILTON 92.982
271 77 V.BOTTAS 93.128
265 77 V.BOTTAS 93.329
259 77 V.BOTTAS 93.443
149 19 F.MASSA 93.527
155 19 F.MASSA 93.527
143 19 F.MASSA 93.551
247 44 L.HAMILTON 93.611
44 6 N.ROSBERG 93.671
119 14 F.ALONSO 93.675
132 14 F.ALONSO 93.740
268 77 V.BOTTAS 93.801
125 14 F.ALONSO 93.858
128 14 F.ALONSO 94.005
152 19 F.MASSA 94.059
30 3 D.RICCIARDO 94.075
159 20 K.MAGNUSSEN 94.229
171 20 K.MAGNUSSEN 94.242
260 77 V.BOTTAS 94.301
201 22 J.BUTTON 94.317
13 1 S.VETTEL 94.432
165 20 K.MAGNUSSEN 94.437
18 3 D.RICCIARDO 94.466
144 19 F.MASSA 94.483
120 14 F.ALONSO 94.497
... ... ... ...
134 17 J.BIANCHI 842.000
34 4 M.CHILTON 842.000
205 25 J.VERGNE 842.000
219 26 D.KVYAT 842.000
109 13 P.MALDONADO 842.000
258 77 V.BOTTAS 843.000
55 7 K.RAIKKONEN 843.000
245 44 L.HAMILTON 843.000
42 6 N.ROSBERG 843.000
142 19 F.MASSA 843.000
118 14 F.ALONSO 844.000
188 22 J.BUTTON 845.000
158 20 K.MAGNUSSEN 845.000
17 3 D.RICCIARDO 846.000
0 1 S.VETTEL 846.000
81 9 M.ERICSSON 847.000
220 26 D.KVYAT 849.817
208 25 J.VERGNE 874.166
66 7 K.RAIKKONEN 934.009
10 1 S.VETTEL 977.535
27 3 D.RICCIARDO 987.803
198 22 J.BUTTON 1036.979
43 6 N.ROSBERG 1116.970
246 44 L.HAMILTON 1125.072
126 14 F.ALONSO 1144.167
150 19 F.MASSA 1177.634
266 77 V.BOTTAS 1188.277
50 6 N.ROSBERG 1201.803
253 44 L.HAMILTON 1214.207
166 20 K.MAGNUSSEN 1239.418

288 rows × 3 columns
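Sorting can also be reversed or done on several columns. A short sketch on hand-copied rows from the tables above:

```python
import pandas as pd

sample = pd.DataFrame({
    "driver_no": [1, 1, 6, 6],
    "name": ["S.VETTEL", "S.VETTEL", "N.ROSBERG", "N.ROSBERG"],
    "time": [846.000, 94.432, 92.506, 843.000],
})

# Largest times first.
slowest_first = sample.sort_values(by="time", ascending=False)
print(slowest_first)

# Group each driver's rows together, fastest time first within a driver.
by_driver = sample.sort_values(by=["name", "time"])
print(by_driver)
```

sort_values returns a new DataFrame and leaves the original untouched, so the result has to be assigned if it is needed later.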

Seaborn Works Well for Statistical Plots

Pandas has some built-in plotting functionality as well. One thing to notice is that tools built around DataFrames can infer information from the DataFrame or Series that would not be available with plain arrays. Here, for example, the x-axis is labeled 'time', taken from the name of the Series.

In [8]:
# distplot is deprecated since seaborn 0.11; histplot and displot are its replacements
sns.distplot(df.time)
Out[8]:
<matplotlib.axes.AxesSubplot at 0x16a0a160>
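For comparison, here is a minimal sketch of the pandas built-in plotting mentioned above, using a handful of times copied from the first table. It assumes matplotlib is installed; the Agg backend is selected only so the snippet also runs without a display. The x-axis label is set explicitly from the Series name here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import pandas as pd

# A few lap times copied from the parser output.
times = pd.Series([94.784, 95.517, 111.270, 112.372, 95.726, 94.766],
                  name="time")

# Series.plot.hist draws a histogram and returns the matplotlib Axes.
ax = times.plot.hist(bins=5)
ax.set_xlabel(times.name)  # label the x-axis with the Series name
```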

Only Look at the Lower Times Posted

In [9]:
sns.distplot(df.loc[df.time < 100, 'time'])
Out[9]:
<matplotlib.axes.AxesSubplot at 0x167f1358>
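The same cut-off can be used to look at numbers instead of a plot. The sketch below applies the threshold of 100 from the cell above to a few of V.BOTTAS's times copied from the table, then summarizes what is left with describe.

```python
import pandas as pd

# Some of V.BOTTAS's times from the parser output.
times = pd.Series(
    [843.000, 93.443, 94.301, 113.357, 113.219, 487.053, 103.937, 93.329],
    name="time",
)

# Keep only the entries below the threshold used in the plot above.
lower = times[times < 100]
print(lower.tolist())

# Count, mean, spread, and quartiles of the remaining times.
print(lower.describe())
```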
