# => ╰──────────────────────────────────────┴───────────┴─────────┴─────────┴────────────────╯
```

::: tip
As of Nushell 0.97, `polars open` opens a file as a lazy dataframe instead of an eager dataframe.
To open as an eager dataframe, use the `--eager` flag.
:::

We can have a look at the first lines of the file using [`polars first`](/commands/docs/first.md):

```nu
$df_0 | polars first
# => ╭───┬──────────┬─────────┬──────┬───────────┬──────────╮
# => │ # │ anzsic06 │  Area   │ year │ geo_count │ ec_count │
# => ├───┼──────────┼─────────┼──────┼───────────┼──────────┤
# => │ 0 │ A        │ A100100 │ 2000 │        96 │      130 │
# => ╰───┴──────────┴─────────┴──────┴───────────┴──────────╯
```

...and finally, we can get an idea of the inferred data types:

```nu
$df_0 | polars schema
# => ╭───────────┬─────╮
# => │ anzsic06  │ str │
# => │ Area      │ str │
# => │ year      │ i64 │
# => │ geo_count │ i64 │
# => │ ec_count  │ i64 │
# => ╰───────────┴─────╯
```

### Group-by Comparison

To get more statistically reliable timings, let's load and use the `std bench` command.

```nu
use std/bench
```

We are going to group the data by year, and sum the column `geo_count`.
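
To make the target result concrete, here is a minimal sketch of the same aggregation, written in plain Python (the language of the pandas script used later) with only the standard library. The first data row is taken from the file preview above; the other two rows and the helper name `sum_geo_count_by_year` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows that reuse the column names of the real file.
# Only the first data row comes from the preview; the rest are invented.
SAMPLE = """anzsic06,Area,year,geo_count,ec_count
A,A100100,2000,96,130
A,A100200,2000,4,10
A,A100100,2001,10,20
"""

def sum_geo_count_by_year(text):
    """Group rows by `year` and sum `geo_count` within each group."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[int(row["year"])] += int(row["geo_count"])
    return dict(totals)

print(sum_geo_count_by_year(SAMPLE))  # {2000: 100, 2001: 10}
```

Each of the three benchmarks in this section computes this kind of per-year total, only over the full CSV file.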

First, let's measure the performance of a pipeline of native Nushell commands.

```nu
bench -n 10 --pretty {
    open 'Data7602DescendingYearOrder.csv'
    | group-by year --to-table
    | update items {|i|
        $i.items.geo_count
        | math sum
    }
}
# => 3sec 268ms +/- 50ms
```

So, 3.3 seconds to perform this aggregation.
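
The figure above has the form `value +/- spread`. As a rough sketch of how such a summary can be produced, the Python snippet below runs a function several times and reports the mean and the sample standard deviation of the timings. The helper `bench` and the `workload` function are hypothetical, and this is not the implementation of `std bench`; in particular, treating the spread as a sample standard deviation is an assumption.

```python
import statistics
import time

def bench(func, rounds=10):
    """Run `func` `rounds` times; return (mean, stdev) of wall-clock seconds."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    # Assumption: spread is the sample standard deviation of the runs.
    return statistics.mean(timings), statistics.stdev(timings)

def workload():
    # Stand-in for the pipeline being measured.
    return sum(range(100_000))

mean, spread = bench(workload)
print(f"{mean * 1000:.2f}ms +/- {spread * 1000:.2f}ms")
```

Repeating the run and reporting a spread makes it easier to tell a real difference between tools from run-to-run noise.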

Let's try the same operation in pandas:

```nu
('import pandas as pd

df = pd.read_csv("Data7602DescendingYearOrder.csv")
res = df.groupby("year")["geo_count"].sum()
print(res)'
| save load.py -f)
```

And the result from the benchmark is:

```nu
bench -n 10 --pretty {
    python load.py | complete | null
}
# => 1sec 322ms +/- 6ms
```

Not bad at all. Pandas managed to get it done about 2.5 times faster than Nushell.
And with bigger files, the advantage of Pandas should grow.

To finish the comparison, let's try Nushell dataframes. We are going to put
all the operations in one `nu` file, to make sure we are doing the correct
comparison:

```nu
( 'polars open Data7602DescendingYearOrder.csv
    | polars group-by year
    | polars agg (polars col geo_count | polars sum)
    | polars collect'
| save load.nu -f )
```

and the benchmark with dataframes (starting a new Nushell and `polars`
instance for each run, for a fair comparison) is:

```nu
bench -n 10 --pretty {
    nu load.nu | complete | null
}
# => 135ms +/- 4ms
```

The `polars` dataframes plugin managed to finish the operation about 10 times
faster than `pandas` with Python. Isn't that great?
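
The speedup ratios can be recomputed from the three mean timings reported in this section. The short Python sketch below does the arithmetic; the dictionary keys and the `speedup` helper are illustrative names, and the numbers are the means from the benchmark outputs above.

```python
# Mean timings in milliseconds, copied from the benchmark outputs above.
timings_ms = {"native": 3268, "pandas": 1322, "polars": 135}

def speedup(slow, fast):
    """How many times faster `fast` is than `slow`, rounded to one decimal."""
    return round(timings_ms[slow] / timings_ms[fast], 1)

print(speedup("native", "pandas"))  # 2.5
print(speedup("pandas", "polars"))  # 9.8
print(speedup("native", "polars"))  # 24.2
```

These ratios apply to this file on the machine used for the measurements; they will differ with other data sizes and hardware.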

As you can see, Nushell's `polars` plugin is as performant as `polars` itself.
Coupled with Nushell commands and pipelines, it is capable of conducting sophisticated
analysis without leaving the terminal.

Let's clean up the cache of dataframes that we used during benchmarking.
To do that, let's stop the `polars` plugin.
When we execute our next commands, a new instance of the plugin will be started.

```nu
plugin stop polars
```

## Working with Dataframes

Having had a glimpse of what can be done with [`Dataframe` commands](/commands/categories/dataframe.md),
it is time to start testing them. To begin, let's create a sample
CSV file that will become the dataframe used throughout
the examples. Paste the following lines into your favorite file editor, or run
the command as shown, to create our sample CSV file.

```nu
("int_1,int_2,float_1,float_2,first,second,third,word
1,11,0.1,1.0,a,b,c,first
2,12,0.2,1.0,a,b,c,second
3,13,0.3,2.0,a,b,c,third
4,14,0.4,3.0,b,a,c,second
0,15,0.5,4.0,b,a,a,third
6,16,0.6,5.0,b,a,a,second
7,17,0.7,6.0,b,c,a,third
8,18,0.8,7.0,c,c,b,eight
9,19,0.9,8.0,c,c,b,ninth
0,10,0.0,9.0,c,c,b,ninth"
| save --raw --force test_small.csv)
```

Save the file under any name you like; for the sake of these examples,
the file will be called `test_small.csv`.

Now, to read that file as a dataframe use the `polars open` command like
