DataFrames in Nushell: Introduction and Benchmarks

# Dataframes

::: warning Important!
This feature requires the Polars plugin. See the [Plugins Chapter](plugins.md) to learn how to install it.

To test that this plugin is properly installed, run `help polars`.
:::

As we have seen so far, Nushell makes working with data its main priority. `Lists` and `Tables` are there to help you cycle through values in order to perform multiple operations or find data in a breeze. However, there are certain operations where a row-based data layout is not the most efficient way to process data, especially when working with extremely large files. Operations like group-by or join on large datasets can be costly memory-wise, and may lead to long computation times if they are not done using the appropriate data format.

For this reason, the `DataFrame` structure was introduced to Nushell. A `DataFrame` stores its data in a columnar format based on the [Apache Arrow](https://arrow.apache.org/) specification, and uses [Polars](https://github.com/pola-rs/polars) as the engine for performing extremely [fast columnar operations](https://h2oai.github.io/db-benchmark/).

You may be wondering now how fast this combo could be, and how it could make working with data easier and more reliable. For this reason, we'll start this chapter by presenting benchmarks on common operations that are done when processing data.

[[toc]]

## Benchmark Comparisons

For this little benchmark exercise we will be comparing native Nushell commands, dataframe Nushell commands and [Python Pandas](https://pandas.pydata.org/) commands. For the time being don't pay too much attention to the [`Dataframe` commands](/commands/categories/dataframe.md). They will be explained in later sections of this page.

::: tip System Details
The benchmarks presented in this section were run on a MacBook with an M1 Pro processor and 32 GB of RAM. All examples were run on Nushell version 0.97 using `nu_plugin_polars 0.97`.
:::

### File Information

The file that we will be using for the benchmarks is the [New Zealand business demography](https://www.stats.govt.nz/assets/Uploads/New-Zealand-business-demography-statistics/New-Zealand-business-demography-statistics-At-February-2020/Download-data/Geographic-units-by-industry-and-statistical-area-2000-2020-descending-order-CSV.zip) dataset. Feel free to download it if you want to follow these tests.

The dataset has 5 columns and 5,429,252 rows. We can check that by using the `polars store-ls` command:

```nu
let df_0 = polars open --eager Data7602DescendingYearOrder.csv
polars store-ls | select key type columns rows estimated_size
# => ╭──────────────────────────────────────┬───────────┬─────────┬─────────┬────────────────╮
# => │                 key                  │   type    │ columns │  rows   │ estimated_size │
# => ├──────────────────────────────────────┼───────────┼─────────┼─────────┼────────────────┤
# => │ b2519dac-3b64-4e5d-a0d7-24bde9052dc7 │ DataFrame │       5 │ 5429252 │       184.5 MB │
# => ╰──────────────────────────────────────┴───────────┴─────────┴─────────┴────────────────╯
```

::: tip
As of Nushell 0.97, `polars open` opens a file as a lazy dataframe instead of an eager dataframe. To open it as an eager dataframe, use the `--eager` flag.
:::

We can have a look at the first lines of the file using [`first`](/commands/docs/first.md):

```nu
$df_0 | polars first
# => ╭───┬──────────┬─────────┬──────┬───────────┬──────────╮
# => │ # │ anzsic06 │  Area   │ year │ geo_count │ ec_count │
# => ├───┼──────────┼─────────┼──────┼───────────┼──────────┤
# => │ 0 │ A        │ A100100 │ 2000 │        96 │      130 │
# => ╰───┴──────────┴─────────┴──────┴───────────┴──────────╯
```

...and finally, we can get an idea of the inferred data types:

```nu
$df_0 | polars schema
# => ╭───────────┬─────╮
# => │ anzsic06  │ str │
# => │ Area      │ str │
# => │ year      │ i64 │
# => │ geo_count │ i64 │
# => │ ec_count  │ i64 │
# => ╰───────────┴─────╯
```

### Group-by Comparison

To output more statistically correct timings, let's load and use the `std bench` command.
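
Loading and calling `bench` might look like the following. This is a minimal sketch, assuming the standard library that ships with Nushell 0.97 and the `$df_0` dataframe opened earlier. The closure body and the `--rounds` value are illustrative choices, not the benchmark this chapter reports.

```nu
# Bring the `bench` command from the standard library into scope
use std bench

# Run the closure several times and report summary statistics
# instead of a single, noisy timing.
# Assumption: `$df_0` is the eager dataframe opened above; the number
# of rounds here is arbitrary.
bench { $df_0 | polars first } --rounds 10
```

Repeating the closure over several rounds smooths out one-off effects such as a cold file cache, which is why the results are more trustworthy than a single timed run.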

This section introduces DataFrames in Nushell, a columnar data structure based on Apache Arrow and powered by Polars for efficient data processing. It highlights the advantages of DataFrames over Lists and Tables, especially for large datasets and operations like group-by and join. The chapter then presents benchmark comparisons between native Nushell commands, DataFrame commands, and Python Pandas, using the New Zealand business demography dataset to illustrate the performance differences.