Talk

How you can write a dataframe-agnostic library

Thursday, May 23

11:40 - 12:25
Room: Lasagna
Language: English
Audience level: Advanced
Elevator pitch

If you’re writing a library which consumes dataframes, should you choose to support pandas, Polars, cuDF, Modin, Vaex, PySpark, Dask, or something else?

Don’t choose - learn how the DataFrame Interchange Protocol and/or Narwhals enable you to support them all!

Abstract

In 2023, we saw several libraries - which had previously only supported pandas - add support for other dataframe libraries such as Polars, Modin, and cuDF. They typically did this by keeping their existing code, and converting non-pandas inputs to pandas. They’ve usually been smart about only converting the parts of the dataframe which they need, but nonetheless, this approach has limitations.

Downsides of the “just convert to pandas” approach are:

  • it requires users who otherwise weren’t using pandas to have an extra, non-lightweight dependency
  • transferring data between GPU and CPU can be expensive
  • the pandas API is very flexible, and sometimes overly so, meaning that anyone developing on top of it may end up with code less robust than would be ideal

This talk will introduce you to the DataFrame Interchange Protocol and to Narwhals, which allow library developers to:

  • support multiple dataframe libraries with just a single API
  • minimise, or even eliminate, conversion costs
  • use a strict, minimal API for maximal robustness

The format will roughly be:

  • 2-3 mins: an overview of the dataframe landscape
  • 2-3 mins: what happened in 2023, and which libraries started supporting Polars in addition to pandas
  • 5 mins: what are the limitations of just converting to pandas?
  • 5 mins: what’s the DataFrame Interchange Protocol?
  • 7-8 mins: how can you use Narwhals to support multiple dataframe libraries? How can you get Narwhals to support your dataframe library?
  • 2-3 mins: what comes next
  • Q&A / awkward silence

By the end of the talk, attendees will have learned about the dataframe ecosystem, and those involved with dataframe-consuming libraries will know all they need in order to effectively support multiple dataframe libraries. Library maintainers and contributors will get the most out of the talk, but anyone regularly using dataframes will also learn a lot about the tools they use.

Tags: Pandas, APIs, Abstractions
Participant

Marco Gorelli

Marco is a Senior Software Engineer at Quansight Labs, where he works on pandas, Polars, the DataFrame Interchange Protocol, and assorted consulting and training activities. He holds an MSc in Mathematics and Foundations of Computer Science from the University of Oxford, and was one of the prize winners in the M6 Forecasting Competition.