Data Processing

Data Parsing

F1 data isn’t as readily available as something like college football data, and the finest granularity I have been able to find is per lap. I have not found any source that lets you directly download CSV files; the most open one is Ergast’s API, which is available for non-commercial use. The majority of the data used for modeling will likely have to come from web scraping and/or parsing PDFs.

Data Tools

The data parsing tools live in the formulapy.data module, and any local files go into the data folder, which sits outside the formulapy module itself. Local files, such as FIA PDF files, are excluded through the .gitignore file, so they are not, and cannot be, included in the repository.

formulapy.data.fia.parsers.append_time_info(driver, idx, pos, row)

Factors out some common logic for getting the time information for the driver based on their relative position in the row.

Parameters:
  • driver – dict with driver info
  • idx – the indices of the columns where times should be
  • pos – the last column for this driver; the driver should have no values past it
  • row – the current row being processed
Returns:

None. The driver dict is modified in place, so no return value is needed.

formulapy.data.fia.parsers.format_times(t)

Converts a list of time strings into a list of floats, representing the number of seconds for the given lap. Assumes the values are always minutes:seconds.milliseconds.

Parameters: t – list of time strings
Returns: list of times as floats
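
The conversion described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the helper name format_times_sketch is hypothetical, and it assumes every input string has the minutes:seconds.milliseconds form stated above.

```python
def format_times_sketch(times):
    """Convert 'M:SS.mmm' strings to seconds as floats (hypothetical helper)."""
    seconds = []
    for t in times:
        # Split once on the colon: minutes on the left, seconds on the right.
        minutes, secs = t.strip().split(":", 1)
        seconds.append(int(minutes) * 60 + float(secs))
    return seconds


# Made-up sample values, for illustration only.
laps = format_times_sketch(["1:32.456", "0:59.900"])
```

Here laps holds roughly 92.456 and 59.9 seconds. A string without a colon raises ValueError in this sketch, which matches the stated assumption about the input format.
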

formulapy.data.fia.parsers.get_drivers(table)

Takes a pdftable corresponding to the data on one page of an FIA lap timing report from qualifying or practice. See Japan Qualifying Report for an example PDF that this function helps parse.

Parameters: table – pdftable from calling parse_laptimes()
Returns:
  • list of python dicts, where each dict represents one driver on this page

formulapy.data.fia.parsers.init_driver(driver_strings, driver_idx)

Initializes the driver dict and stores any values found in driver_strings in it.

Parameters:
  • driver_strings – tuple with driver number and name within it
  • driver_idx – 0, 1, or 2, representing the index of the 3 drivers processed at a time
Returns:

Initialized driver dict, which is empty if there was no info for the given idx.

formulapy.data.fia.parsers.parse_laptimes(filepath)

Parses a PDF of qualifying or practice report lap times from the FIA into data that we can further analyze. See an example at Japan Qualifying Report.

Parameters: filepath – a string path to the PDF on your local computer
Returns:
  • a pandas DataFrame with columns for number, name, and times
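
To make the shape of that result more concrete, here is a rough sketch of how per-driver dicts could be flattened into rows before being handed to pandas. Everything in it is an assumption for illustration: the helper name drivers_to_rows, the dict keys "number", "name", and "times", and the one-row-per-lap layout are hypothetical and may not match what parse_laptimes() actually does.

```python
def drivers_to_rows(drivers):
    """Flatten driver dicts into (number, name, lap, time) tuples.

    Hypothetical helper: the keys used here are assumed, not taken
    from the formulapy source.
    """
    rows = []
    for driver in drivers:
        if not driver:
            # Empty dict: no driver info was found for this slot.
            continue
        for lap, time in enumerate(driver["times"], start=1):
            rows.append((driver["number"], driver["name"], lap, time))
    return rows


# Made-up sample input, for illustration only.
drivers = [
    {"number": "1", "name": "DRIVER A", "times": [92.5, 91.75]},
    {},
]
rows = drivers_to_rows(drivers)
```

A list of tuples like rows can be passed to pandas.DataFrame together with a columns list, which is one plausible way to arrive at a table with number, name, and time columns.
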

formulapy.data.fia.parsers.split_joined_times(t)

Splits times that can sometimes be joined together based on how the PDF is formatted. If it doesn’t detect joined times, then it will just return the single time in a list.

Parameters: t – a string that likely contains times
Returns: a list containing individual times
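
One way to picture the splitting behavior is the sketch below. It is not the module's actual code: the helper name split_joined_times_sketch and the regular expression are assumptions, and it presumes joined times are simply concatenated strings that each follow the minutes:seconds.milliseconds form with two-digit seconds and three-digit milliseconds.

```python
import re

# Assumed lap-time shape: minutes, colon, two-digit seconds,
# dot, three-digit milliseconds.
TIME_PATTERN = re.compile(r"\d+:\d{2}\.\d{3}")


def split_joined_times_sketch(t):
    """Return each time found in t, or [t] when fewer than two are found."""
    found = TIME_PATTERN.findall(t)
    return found if len(found) > 1 else [t]


# Made-up sample strings, for illustration only.
joined = split_joined_times_sketch("1:32.4561:33.012")
single = split_joined_times_sketch("1:32.456")
```

With these inputs, joined contains the two separate time strings and single contains the original string unchanged, mirroring the fallback described above.
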

formulapy.data.fia.parsers.split_num_str(s)

Data Sources

When utilizing web-based data sources, you must be considerate and abide by their requests. This means that much of the data available cannot be directly rehosted. Here is a list of sources that could be used for gathering data to use in building models.