# Dataframes
::: warning Important!
This feature requires the Polars plugin. See the
[Plugins Chapter](plugins.md) to learn how to install it.
To test that this plugin is properly installed, run `help polars`.
:::
As we have seen so far, Nushell makes working with data its main priority.
`Lists` and `Tables` are there to help you cycle through values in order to
perform multiple operations or find data in a breeze. However, there are
certain operations where a row-based data layout is not the most efficient way
to process data, especially when working with extremely large files. Operations
like group-by or join on large datasets can be costly memory-wise and may
lead to long computation times if they are not done using the appropriate
data format.
For this reason, the `DataFrame` structure was introduced to Nushell. A
`DataFrame` stores its data in a columnar format based on the [Apache
Arrow](https://arrow.apache.org/) specification, and uses
[Polars](https://github.com/pola-rs/polars) as the engine for performing
extremely [fast columnar operations](https://h2oai.github.io/db-benchmark/).
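
As a minimal sketch of what this looks like in practice, and assuming the Polars plugin is installed, a regular Nushell table can be converted into a dataframe with `polars into-df`. The table contents below are made up purely for illustration:

```nu
# A small, made-up table converted into a columnar dataframe
let df = [[name value]; [a 1] [b 2] [c 3]] | polars into-df

# Inspect the column data types that Polars inferred
$df | polars schema
```
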
You may be wondering how fast this combination can be, and how it could make
working with data easier and more reliable. To answer that, we'll start this
chapter by presenting benchmarks on common operations that are done when
processing data.
[[toc]]
## Benchmark Comparisons
For this little benchmark exercise we will be comparing native Nushell
commands, dataframe Nushell commands and [Python
Pandas](https://pandas.pydata.org/) commands. For the time being, don't pay too
much attention to the [`Dataframe` commands](/commands/categories/dataframe.md); they will be explained in later
sections of this page.
::: tip System Details
The benchmarks presented in this section were run on a MacBook with an M1 Pro processor and 32 GB of RAM. All examples were run on Nushell version 0.97 using `nu_plugin_polars 0.97`.
:::
### File Information
The file that we will be using for the benchmarks is the
[New Zealand business demography](https://www.stats.govt.nz/assets/Uploads/New-Zealand-business-demography-statistics/New-Zealand-business-demography-statistics-At-February-2020/Download-data/Geographic-units-by-industry-and-statistical-area-2000-2020-descending-order-CSV.zip) dataset.
Feel free to download it if you want to follow these tests.
The dataset has 5 columns and 5,429,252 rows. We can check that by using the
`polars store-ls` command:
```nu
let df_0 = polars open --eager Data7602DescendingYearOrder.csv
polars store-ls | select key type columns rows estimated_size
# => ╭──────────────────────────────────────┬───────────┬─────────┬─────────┬────────────────╮
# => │                 key                  │   type    │ columns │  rows   │ estimated_size │
# => ├──────────────────────────────────────┼───────────┼─────────┼─────────┼────────────────┤
# => │ b2519dac-3b64-4e5d-a0d7-24bde9052dc7 │ DataFrame │       5 │ 5429252 │       184.5 MB │
# => ╰──────────────────────────────────────┴───────────┴─────────┴─────────┴────────────────╯
```
::: tip
As of Nushell 0.97, `polars open` opens a file as a lazy dataframe instead of an eager dataframe.
To open as an eager dataframe, use the `--eager` flag.
:::
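
A lazy dataframe can also be materialized after the fact with `polars collect`. The following is a minimal sketch, assuming the same CSV file is in the current directory; the variable names are only illustrative:

```nu
# Without --eager, this returns a lazy dataframe (as of Nushell 0.97)
let lazy_df = polars open Data7602DescendingYearOrder.csv

# Materialize the lazy dataframe into an eager one
let eager_df = $lazy_df | polars collect
```
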
We can have a look at the first lines of the file using [`first`](/commands/docs/first.md):
```nu
$df_0 | polars first
# => ╭───┬──────────┬─────────┬──────┬───────────┬──────────╮
# => │ # │ anzsic06 │  Area   │ year │ geo_count │ ec_count │
# => ├───┼──────────┼─────────┼──────┼───────────┼──────────┤
# => │ 0 │ A        │ A100100 │ 2000 │        96 │      130 │
# => ╰───┴──────────┴─────────┴──────┴───────────┴──────────╯
```
...and finally, we can get an idea of the inferred data types:
```nu
$df_0 | polars schema
# => ╭───────────┬─────╮
# => │ anzsic06  │ str │
# => │ Area      │ str │
# => │ year      │ i64 │
# => │ geo_count │ i64 │
# => │ ec_count  │ i64 │
# => ╰───────────┴─────╯
```
### Group-by Comparison
To get more statistically reliable timings, let's load and use the `std bench` command.