Group-by Comparison: Nushell vs. Pandas vs. DataFrames

# => ╰──────────────────────────────────────┴───────────┴─────────┴─────────┴────────────────╯
```

::: tip
As of Nushell 0.97, `polars open` opens the file as a lazy dataframe instead of an eager dataframe. To open it as an eager dataframe, use the `--eager` flag.
:::

We can have a look at the first lines of the file using [`first`](/commands/docs/first.md):

```nu
$df_0 | polars first
# => ╭───┬──────────┬─────────┬──────┬───────────┬──────────╮
# => │ # │ anzsic06 │  Area   │ year │ geo_count │ ec_count │
# => ├───┼──────────┼─────────┼──────┼───────────┼──────────┤
# => │ 0 │ A        │ A100100 │ 2000 │        96 │      130 │
# => ╰───┴──────────┴─────────┴──────┴───────────┴──────────╯
```

...and finally, we can get an idea of the inferred data types:

```nu
$df_0 | polars schema
# => ╭───────────┬─────╮
# => │ anzsic06  │ str │
# => │ Area      │ str │
# => │ year      │ i64 │
# => │ geo_count │ i64 │
# => │ ec_count  │ i64 │
# => ╰───────────┴─────╯
```

### Group-by Comparison

To get more statistically reliable timings, let's load and use the `std bench` command.

```nu
use std/bench
```

We are going to group the data by year and sum the column `geo_count`.

First, let's measure the performance of a pipeline of native Nushell commands.

```nu
bench -n 10 --pretty {
    open 'Data7602DescendingYearOrder.csv'
    | group-by year --to-table
    | update items {|i|
        $i.items.geo_count
        | math sum
    }
}
# => 3sec 268ms +/- 50ms
```

So, 3.3 seconds to perform this aggregation.

Let's try the same operation in pandas:

```nu
('import pandas as pd

df = pd.read_csv("Data7602DescendingYearOrder.csv")
res = df.groupby("year")["geo_count"].sum()
print(res)'
| save load.py -f)
```

And the result from the benchmark is:

```nu
bench -n 10 --pretty {
    python load.py | complete | null
}
# => 1sec 322ms +/- 6ms
```

Not bad at all. Pandas ran about 2.5 times faster than Nushell (1322 ms against 3268 ms), and with bigger files its advantage should grow.

To finish the comparison, let's try Nushell dataframes.
We are going to put all the operations in one `nu` file, to make sure the comparison is a fair one:

```nu
( 'polars open Data7602DescendingYearOrder.csv
    | polars group-by year
    | polars agg (polars col geo_count | polars sum)
    | polars collect'
| save load.nu -f )
```

The benchmark with dataframes (including the cost of starting a new Nushell and `polars` instance for each run, to keep the comparison honest) is:

```nu
bench -n 10 --pretty {
    nu load.nu | complete | null
}
# => 135ms +/- 4ms
```

The `polars` dataframes plugin finished the operation about 10 times faster than `pandas` with Python. Isn't that great?

As you can see, Nushell's `polars` plugin is as performant as `polars` itself. Coupled with Nushell commands and pipelines, it is capable of conducting sophisticated analysis without leaving the terminal.

Let's clean up the cache of dataframes that we used during benchmarking. To do that, let's stop the `polars` plugin. When we execute our next commands, a new instance of the plugin will be started.

```nu
plugin stop polars
```

## Working with Dataframes

After seeing a glimpse of the things that can be done with [`Dataframe` commands](/commands/categories/dataframe.md), it is time to start testing them.

To begin, let's create a sample CSV file that will become the sample dataframe used throughout the examples. In your favorite file editor, paste the next lines to create our sample CSV file.

```nu
("int_1,int_2,float_1,float_2,first,second,third,word
1,11,0.1,1.0,a,b,c,first
2,12,0.2,1.0,a,b,c,second
3,13,0.3,2.0,a,b,c,third
4,14,0.4,3.0,b,a,c,second
0,15,0.5,4.0,b,a,a,third
6,16,0.6,5.0,b,a,a,second
7,17,0.7,6.0,b,c,a,third
8,18,0.8,7.0,c,c,b,eight
9,19,0.9,8.0,c,c,b,ninth
0,10,0.0,9.0,c,c,b,ninth"
| save --raw --force test_small.csv)
```

Save the file and name it however you want to. For the sake of these examples, the file will be called `test_small.csv`.

Now, to read that file as a dataframe, use the `polars open` command like

This section benchmarks the performance of group-by operations using native Nushell commands, Python Pandas, and Nushell DataFrames. The results show that DataFrames, powered by the Polars plugin, significantly outperform both native Nushell and Pandas in terms of execution time for the group-by and sum operation on a large dataset. The section concludes with an introduction to working with DataFrames, including creating a sample CSV file and opening it as a DataFrame using `polars open`.
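
To make the "group by one column, sum another" operation concrete, here is a minimal plain-Python sketch that uses only the standard library and the same rows as `test_small.csv`. The helper name `group_sum` and the choice of grouping by `first` and summing `int_1` are illustrative assumptions; they are not part of the benchmark above, which groups by `year` and sums `geo_count`.

```python
import csv
import io
from collections import defaultdict

# Same rows as the sample file test_small.csv created above.
SAMPLE_CSV = """int_1,int_2,float_1,float_2,first,second,third,word
1,11,0.1,1.0,a,b,c,first
2,12,0.2,1.0,a,b,c,second
3,13,0.3,2.0,a,b,c,third
4,14,0.4,3.0,b,a,c,second
0,15,0.5,4.0,b,a,a,third
6,16,0.6,5.0,b,a,a,second
7,17,0.7,6.0,b,c,a,third
8,18,0.8,7.0,c,c,b,eight
9,19,0.9,8.0,c,c,b,ninth
0,10,0.0,9.0,c,c,b,ninth
"""


def group_sum(csv_text, key_col, value_col):
    """Group rows by key_col and sum the integer values found in value_col.

    Hypothetical helper for illustration only; it is not a Nushell,
    pandas, or polars API.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[key_col]] += int(row[value_col])
    return dict(totals)


if __name__ == "__main__":
    # Groups appear in the order they are first seen in the file.
    print(group_sum(SAMPLE_CSV, "first", "int_1"))
    # {'a': 6, 'b': 17, 'c': 17}
```

Each of the three tools benchmarked in this section computes the same kind of per-group total; they differ in how the data is parsed and stored, and that is where the timing differences come from.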