# Analyze computational results using Python, Pandas and LaTeX

## Introduction

Every researcher may use its own set of tools for analyzing the data of their experiments and computing statistics. Writing scripts to automatically compute and include tables in your article can make you save a lot of time. A great tool used by scientists is the R project. The latter is free software and a programming language designed for statistical analysis and graphics. I strongly recommend having a look at this tutorial if you want to learn how to use it. In this post, we will investigate another great programming language: Python. An advantage of Python over R is that it is extremely popular and even more general-purpose. Hence, it offers a huge number of libraries for almost every use case. In particular, we will use a well-known data analysis and manipulation tool: Pandas. This library can perform a lot of stuff, such as loading/saving data in various formats (CSV, Excel...), filtering, and computing numerous statistics. If you are not familiar with Python, there exist plenty of tutorials online, even interactive ones.

## Preparing the data

Suppose we implemented three algorithms to solve a minimization problem. We run experiments on three instances. Let's name our algorithms `alice`

, `bob`

, and `carol`

, and our instances `inst1`

, `inst2`

, and `inst3`

. For the sake of simplicity, we assume that our algorithms are run only once on each instance. We want to compare the algorithms in a beautiful LaTeX article. In particular, we would like to obtain for each algorithm the average and the maximum *relative gaps* with respect to the best-known solution, as well as the average *computational times*. To simplify, we stored the raw results in a single CSV file (`data.csv`

) where each row contains the characteristics of a single run: `instance`

, `algorithm`

, `obj`

(objective value), and `time`

(computational time).

```
instance,algorithm,obj,time
inst1,alice,7,16
inst1,bob,16,19
inst1,carol,6,14
inst2,alice,2,11
inst2,bob,3,18
inst2,carol,15,17
inst3,alice,9,15
inst3,bob,17,10
inst3,carol,19,13
```

## Loading the data

Now, the data analysis can start. We create a python script named `analyze.py`

and load the CSV file.

`# We need to import Pandas to use it`

import pandas as pd

# Load the contents of "data.csv" in a DataFrame object

df = pd.read_csv("data.csv")

# Display the DataFrame object

print(df)

The last instruction should display the following:

```
instance algorithm obj time
0 inst1 alice 7 16
1 inst1 bob 16 19
2 inst1 carol 6 14
3 inst2 alice 2 11
4 inst2 bob 3 18
5 inst2 carol 15 17
6 inst3 alice 9 15
7 inst3 bob 17 10
8 inst3 carol 19 13
```

The contents of `data.csv`

are stored in an object `df`

of type `DataFrame`

. This object offers powerful features to transform existing data or create new data. I recall that we want to compute the minimum/average relative gaps and the average computational times.

## Computing new data

### Best-known solution

First of all, let's compute the best-known solution for each instance, required to compute the relative gaps. To do so, we group rows by `instance`

and take the minimum `obj`

value in each group, as follows:

`# Obtain a Series object that contains the best-known solutions`

bks = df.groupby("instance")["obj"].transform(min)

Don't panic, let's decompose this instruction. First, `df.groupby("instance")`

creates a `GroupBy`

object. From the latter, we tell Pandas that we need only the `obj`

column, so we use the `[...]`

operator. Finally, the `.transform(...)`

method applies a function (`min`

) on each group. It returns a `Series`

filled with transformed values but the original shape is preserved. As a consequence, we can directly use the `bks`

(*best-known solution*) variable as a new column for `df`

, using the `.assign(...)`

method.

`# Add the "bks" column`

df = df.assign(bks=bks)

# Display the updated DataFrame

print(df)

We can observe that the column has been successfully added:

```
instance algorithm obj time bks
0 inst1 alice 7 16 6
1 inst1 bob 16 19 6
2 inst1 carol 6 14 6
3 inst2 alice 2 11 2
4 inst2 bob 3 18 2
5 inst2 carol 15 17 2
6 inst3 alice 9 15 9
7 inst3 bob 17 10 9
8 inst3 carol 19 13 9
```

### Relative gap

Legit question: why did we add the `bks`

column? Well, it is not mandatory but doing so makes it easy to compute our relative gap given by the formula: `.assign(...)`

, let's do it in one line:

`# Add the "gap" column`

df = df.assign(gap=(df["obj"] - df["bks"]) / (df["obj"]))

# Display the updated DataFrame

print(df)

If `obj`

can have a zero value, you need to adapt the formula (e.g. `(df["obj"] - df["bks"]) / (df["obj"] + 1)`

). In our case, the result is:

```
instance algorithm obj time bks gap
0 inst1 alice 7 16 6 0.142857
1 inst1 bob 16 19 6 0.625000
2 inst1 carol 6 14 6 0.000000
3 inst2 alice 2 11 2 0.000000
4 inst2 bob 3 18 2 0.333333
5 inst2 carol 15 17 2 0.866667
6 inst3 alice 9 15 9 0.000000
7 inst3 bob 17 10 9 0.470588
8 inst3 carol 19 13 9 0.526316
```

## Summarizing the results

We are ready to compute the average/maximum relative gap and the average computational time for each algorithm. First, group the rows by `algorithm`

in a temporary variable `df_g`

.

`# Group by algorithm`

df_g = df.groupby("algorithm")

Remember that the `df_g`

is an object of type `GroupBy`

. The `GroupBy`

class offers handy methods to compute the most common statistics, in particular, `.min()`

, `.max()`

, and `.mean()`

. We create a new `DataFrame`

summarizing the results:

`# Compute the average and the maximum gap, plus the average time`

df_summary = pd.DataFrame(

{

"avg_gap": df_g["gap"].mean().mul(100),

"max_gap": df_g["gap"].max().mul(100),

"avg_time": df_g["time"].mean(),

}

)

# Display the summary

print(df_summary)

Let's decompose the previous code a little bit. We create a `DataFrame`

from a dictionary where each element is a `Series`

, thus `df_summary`

will contain three columns. The `df_g["gap"].mean()`

instruction tells that we want to operate on the `gap`

column only, then compute the average of `gap`

in each group. We call `.mul(100)`

to multiply values by 100, since they are percentages. Similarly, it is easy to understand the meaning of the other columns. See the contents of `df_summary`

:

```
avg_gap max_gap avg_time
algorithm
alice 4.761905 14.285714 14.000000
bob 47.630719 62.500000 15.666667
carol 46.432749 86.666667 14.666667
```

Looks pretty, isn't it?

## Output table

I know two ways of importing the table into a LaTeX article. One way is to call the `.to_latex()`

method. This converts the table into a valid LaTeX code. The other way consists in saving the table to a CSV file (`.to_csv()`

) and letting LaTeX processing it. Both approaches have their pros and their cons, depending on your case. If you need to reuse the same data for several outputs (e.g. a table and a figure), I recommend the second approach. Otherwise, the first approach is fine for common situations.

### From Python

Let's start using the `.to_latex()`

method.

`# Display the DataFrame in LaTeX format`

print(df_summary.to_latex())

# Export the DataFrame to a LaTeX file

df_summary.to_latex("summary.tex")

This outputs the following LaTeX code.

`\begin{tabular}{lrrr}`

\toprule

{} & avg\_gap & max\_gap & avg\_time \\

algorithm & & & \\

\midrule

alice & 4.761905 & 14.285714 & 14.000000 \\

bob & 47.630719 & 62.500000 & 15.666667 \\

carol & 46.432749 & 86.666667 & 14.666667 \\

\bottomrule

\end{tabular}

Well, we require some modifications:

- Our algorithms are so great, we want to name them with a capital letter.
- Remove the row starting with
`algorithm`

. - Rename the columns as
`avg gap (%)`

,`max gap (%)`

, and`avg time (s)`

. - All values need two decimals only.

Since the table is pretty small, we could just edit it by hand. But what if we had one hundred algorithms to compare? We can exploit Pandas to automatically post-process our table. First, change the name of the rows and columns. To do so, we use the `.rename(...)`

method.

`# Rename rows (indexes) and columns`

df_summary.rename(

index={

"alice": "Alice",

"bob": "Bob",

"carol": "Carol"

},

columns={

"avg_gap": "avg gap (%)",

"max_gap": "max gap (%)",

"avg_time": "avg time (s)",

},

inplace=True,

)

Note that the `inplace`

argument indicates the method should modify directly `df_summary`

. Next, we format the values according to our needs. Through its `formatters`

argument, `.to_latex()`

allows us to apply a function on each column as follows.

`# Export the DataFrame to LaTeX`

df_summary.to_latex(

"summary.tex",

formatters={

"avg gap (%)": "{:.2f}".format,

"max gap (%)": "{:.2f}".format,

"avg time (s)": "{:.2f}".format,

},

)

Yeah! It provides the result I want. Yet, I think I can improve this Python script. In particular, I don't like to copy-paste the transformed column names, because it means I need to edit them in several places in case I change my mind. Thus, I prefer to define the formatting using the original CSV column names (`avg_gap`

, ...). We can do this elegantly with the magic of dict comprehension:

`index = {"alice": "Alice", "bob": "Bob", "carol": "Carol"}`

columns = {

"avg_gap": "avg gap (%)",

"max_gap": "max gap (%)",

"avg_time": "avg time (s)",

}

formatters = {

"avg_gap": "{:.2f}".format,

"max_gap": "{:.2f}".format,

"avg_time": "{:.2f}".format,

}

# Rename rows (indexes) and columns

df_summary.rename(index=index, columns=columns, inplace=True)

# Export the DataFrame to LaTeX

df_summary.to_latex(

"summary.tex",

formatters={columns[c]: f for c, f in formatters.items()},

index_names=False,

)

Finally, we set `index_names=False`

to remove the `algorithm`

row. Here we go! The content of `summary.tex`

should be:

`\begin{tabular}{lrrr}`

\toprule

{} & avg gap (\%) & max gap (\%) & avg time (s) \\

\midrule

Alice & 4.76 & 14.29 & 14.00 \\

Bob & 47.63 & 62.50 & 15.67 \\

Carol & 46.43 & 86.67 & 14.67 \\

\bottomrule

\end{tabular}

We can copy-paste this code into our LaTeX article. Personally, I prefer to save it as a separate file and use `\input{summary}`

. Here is my LaTeX template:

`\documentclass{article}`

\usepackage{booktabs}

\begin{document}

\begin{table}

\centering

\caption{Comparison of algorithms}

\input{summary}

\end{table}

\end{document}

### From LaTeX

The previous method exports the table to ready-to-use LaTeX using Python. Whereas editing the table from Python is often handier than editing it from LaTeX, I still like the second approach. LaTeX has the ability to import CSV files thanks to packages such as `csvsimple`

and `pgfplotstable`

. One of the great advantages of using a CSV file is that we can use the latter as a single source of truth. Why this is an advantage? For example, we can display the same data simultaneously as a table and as a chart. In the following, we assume that our `df_summary`

table has been unchanged (its columns are still named `avg_gap`

, `max_gap`

, and `avg_time`

). Although we might be able to do it in LaTeX, we choose to rename the algorithm names in the Python script before exporting the table to `summary.csv`

.

`# Rename rows (indexes)`

index = {"alice": "Alice", "bob": "Bob", "carol": "Carol"}

df_summary.rename(index=index, inplace=True)

# Export the DataFrame to CSV

df_summary.to_csv("summary.csv")

From now, we use `pgfplotstable`

to load the CSV file. Our LaTeX article follows this template:

`\documentclass{article}`

\usepackage{booktabs}

\usepackage{pgfplotstable}

\pgfplotstableread[col sep=comma]{summary.csv}{\summarytable}

\begin{document}

\begin{table}

\centering

\caption{Comparison of algorithms}

\pgfplotstabletypeset[<options...>]{\summarytable}

\end{table}

\end{document}

The `\pgfplotstableread`

command imports the CSV file to the `\summarytable`

variable. Note that we need to specify `col sep=comma`

, otherwise it is assumed that values are separated by white spaces. The `\pgfplotstabletypeset`

command outputs a table from the `\summarytable`

variable. All we need to do is to define the options to satisfy our requirements. Since there are plenty of them, we will go through them step by step.

First, we can specify which columns of the CSV file we are using. Although this is optional (we are using all the columns), I recommend doing so in the case we update our CSV file with more columns.

```
columns={algorithm, avg_gap, max_gap, avg_time},
```

Next, let's format the column `algorithm`

:

```
columns/{algorithm}/.style={
column name={},
column type=l,
string type},
```

We decided that the column `algorithm`

has no name and its content is aligned to the left (`l`

). We specified that this column contains strings, not numbers (otherwise it will raise an error). Similarly, we format the column `avg_gap`

:

```
columns/{avg_gap}/.style={
column name={avg gap (\%)},
column type=r,
precision=2,
fixed,
fixed zerofill},
```

This time, we want the column to be aligned to the right (`r`

). The `precision`

argument determines the number of decimals to show. Moreover, the numbers should be in fixed notation and filled with zeros. Except for the column name, the options for `max_gap`

and `avg_time`

are identical. To make our table look pretty, we add some `\toprule`

, `\midrule`

, and `\bottomrule`

.

```
every head row/.style={before row=\toprule, after row=\midrule},
every last row/.style={after row=\bottomrule},
```

Please find here the complete code of the table:

`\begin{table}`

\centering

\caption{Comparison of algorithms}

\pgfplotstabletypeset[

columns={algorithm, avg_gap, max_gap, avg_time},

columns/{algorithm}/.style={

column name={},

column type=l,

string type},

columns/{avg_gap}/.style={

column name={avg gap (\%)},

column type=r,

precision=2,

fixed,

fixed zerofill},

columns/{max_gap}/.style={

column name={max gap (\%)},

column type=r,

precision=2,

fixed,

fixed zerofill},

columns/{avg_time}/.style={

column name={avg time (s)},

column type=r,

precision=2,

fixed,

fixed zerofill},

every head row/.style={before row=\toprule, after row=\midrule},

every last row/.style={after row=\bottomrule},

]{\summarytable}

\end{table}

Previously, I said that we can display a chart using the same source data. This can be done by including the `pgfplots`

package.

`\usepackage{pgfplots}`

The following code creates a simple vertical bar chart embedded in a figure for the average gap.

`\begin{figure}`

\centering

\begin{tikzpicture}

\begin{axis}[

ybar,

xlabel={Algorithm},

xtick=data,

xticklabels from table={\summarytable}{algorithm},

ylabel={Average gap (\%)},

ymin=0,

bar width=40,

enlarge x limits=0.25]

\addplot table [x expr=\coordindex, y={avg_gap}]{\summarytable};

\end{axis}

\end{tikzpicture}

\caption{Comparison of algorithms}

\end{figure}

Commands and arguments of `pgfplots`

are detailed in the manual. And voilà! We obtain both a table and a chart from a single source of data.

## Conclusion

I described a method for computing statistics and creating LaTeX tables programmatically using Python and Pandas. We discussed two ways to include tables in a LaTeX article. Preparing such scripts can help saving time and avoiding mistakes, compared to writing values manually. Along with R, Pandas is a very mature and powerful library that is not limited to our use case. Check out the manual for more details.