# => │ │ "]; PROJECT */8 COLUMNS; SELECTION: "None" │
# => │ optimized_plan │ SORT BY [col("first")] │
# => │ │ AGGREGATE │
# => │ │ [col("int_1").n_unique(), col("int_2").min(), col("float_1") │
# => │ │ .sum(), col("float_2").count()] BY [col("first")] FROM │
# => │ │ DF ["int_1", "int_2", "float_1", "float_2 │
# => │ │ "]; PROJECT 5/8 COLUMNS; SELECTION: "None" │
# => ╰────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
As you can see, the `GroupBy` object is a very powerful variable, and it is
worth keeping in memory while you explore your dataset.
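For instance, the same stored `GroupBy` can be reused with a different set of aggregations. This is a sketch assuming the `polars agg` expression syntax, and that a variable `$group` holds the `GroupBy` object created earlier:

```nu
# Reuse the stored GroupBy with a different aggregation
# ($group is assumed to hold a previously created GroupBy object)
$group
| polars agg (polars col int_1 | polars sum)
| polars collect
```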
## Creating Dataframes
It is also possible to construct dataframes from basic Nushell primitives, such
as integers, decimals, or strings. Let's create a small dataframe using the
command `polars into-df`.
```nu
let df_3 = [[a b]; [1 2] [3 4] [5 6]] | polars into-df
$df_3
# => ╭───┬───┬───╮
# => │ # │ a │ b │
# => ├───┼───┼───┤
# => │ 0 │ 1 │ 2 │
# => │ 1 │ 3 │ 4 │
# => │ 2 │ 5 │ 6 │
# => ╰───┴───┴───╯
```
::: tip
For the time being, not all Nushell primitives can be converted into
a dataframe. This will change in the future as the dataframe feature matures.
:::
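Even a plain list can be converted. As a minimal sketch, a list of integers piped into `polars into-df` produces a single-column dataframe:

```nu
# A simple list of integers becomes a one-column dataframe
[1 2 3 4] | polars into-df
```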
We can append columns to a dataframe to create a new one. As an
example, let's append two columns to our mini dataframe `$df_3`:
```nu
let df_4 = $df_3 | polars with-column $df_3.a --name a2 | polars with-column $df_3.a --name a3
$df_4
# => ╭───┬───┬───┬────┬────╮
# => │ # │ a │ b │ a2 │ a3 │
# => ├───┼───┼───┼────┼────┤
# => │ 0 │ 1 │ 2 │ 1 │ 1 │
# => │ 1 │ 3 │ 4 │ 3 │ 3 │
# => │ 2 │ 5 │ 6 │ 5 │ 5 │
# => ╰───┴───┴───┴────┴────╯
```
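`polars with-column` can also take an expression instead of an existing column. The following is a sketch assuming the plugin's lazy expression syntax; the column name `a_doubled` is our own choice:

```nu
# Append a computed column using an expression (lazy API)
$df_3
| polars into-lazy
| polars with-column ((polars col a) * 2 | polars as a_doubled)
| polars collect
```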
Nushell's powerful piping syntax allows us to create new dataframes by
taking data from other dataframes and appending it to them. Now, if you list your
stored dataframes, you will see five of them in total (along with the `LazyGroupBy` object):
```nu
polars store-ls | select key type columns rows estimated_size
# => ╭──────────────────────────────────────┬─────────────┬─────────┬──────┬────────────────╮
# => │ key │ type │ columns │ rows │ estimated_size │
# => ├──────────────────────────────────────┼─────────────┼─────────┼──────┼────────────────┤
# => │ e780af47-c106-49eb-b38d-d42d3946d66e │ DataFrame │ 8 │ 10 │ 403 B │
# => │ 3146f4c1-f2a0-475b-a623-7375c1fdb4a7 │ DataFrame │ 4 │ 1 │ 32 B │
# => │ 455a1483-e328-43e2-a354-35afa32803b9 │ DataFrame │ 5 │ 4 │ 132 B │
# => │ 0d8532a5-083b-4f78-8f66-b5e6b59dc449 │ LazyGroupBy │ │ │ │
# => │ 9504dfaf-4782-42d4-9110-9dae7c8fb95b │ DataFrame │ 2 │ 3 │ 48 B │
# => │ 37ab1bdc-e1fb-426d-8006-c3f974764a3d │ DataFrame │ 4 │ 3 │ 96 B │
# => ╰──────────────────────────────────────┴─────────────┴─────────┴──────┴────────────────╯
```
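Stored objects that are no longer needed can be dropped from the cache with `polars store-rm`, which takes the object's key. A sketch, using one of the keys from the listing above:

```nu
# Remove a stored object by its key to free the cached memory
# (use a key reported by `polars store-ls` in your own session)
polars store-rm 0d8532a5-083b-4f78-8f66-b5e6b59dc449
```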
It is worth mentioning how memory is optimized while working with
dataframes, thanks to **Apache Arrow** and **Polars**. In a very simple
representation, each column in a DataFrame is an Arrow Array, which uses
several memory specifications to keep the data as packed as possible
(check the [Arrow columnar
format](https://arrow.apache.org/docs/format/Columnar.html)). The other
optimization is that, whenever possible, columns are shared between
dataframes, avoiding memory duplication for the same data. This means that
dataframes `$df_3` and `$df_4` are sharing the same two