Analyze computational results using Python, Pandas and LaTeX

Introduction

Every researcher has their own set of tools for analyzing experimental data and computing statistics. Writing scripts that automatically compute statistics and include tables in your article can save you a lot of time. A great tool used by scientists is the R project, a free software environment and programming language designed for statistical computing and graphics. I strongly recommend having a look at this tutorial if you want to learn how to use it. In this post, we will investigate another great programming language: Python. An advantage of Python over R is that it is extremely popular and even more general-purpose; hence, it offers a huge number of libraries for almost every use case. In particular, we will use a well-known data analysis and manipulation library: Pandas. This library can do a lot, such as loading/saving data in various formats (CSV, Excel…), filtering rows, and computing numerous statistics. If you are not familiar with Python, there are plenty of tutorials online, even interactive ones.

Preparing the data

Suppose we implemented three algorithms to solve a minimization problem. We run experiments on three instances. Let’s name our algorithms alice, bob, and carol, and our instances inst1, inst2, and inst3. For the sake of simplicity, we assume that our algorithms are run only once on each instance. We want to compare the algorithms in a beautiful LaTeX article. In particular, we would like to obtain for each algorithm the average and the maximum relative gaps with respect to the best-known solution, as well as the average computational times. To simplify, we stored the raw results in a single CSV file (data.csv) where each row contains the characteristics of a single run: instance, algorithm, obj (objective value), and time (computational time).

instance,algorithm,obj,time
inst1,alice,7,16
inst1,bob,16,19
inst1,carol,6,14
inst2,alice,2,11
inst2,bob,3,18
inst2,carol,15,17
inst3,alice,9,15
inst3,bob,17,10
inst3,carol,19,13

Loading the data

Now, the data analysis can start. We create a Python script named analyze.py and load the CSV file.

# We need to import Pandas to use it
import pandas as pd

# Load the contents of "data.csv" in a DataFrame object
df = pd.read_csv("data.csv")

# Display the DataFrame object
print(df)

The last instruction should display the following:

  instance algorithm  obj  time
0    inst1     alice    7    16
1    inst1       bob   16    19
2    inst1     carol    6    14
3    inst2     alice    2    11
4    inst2       bob    3    18
5    inst2     carol   15    17
6    inst3     alice    9    15
7    inst3       bob   17    10
8    inst3     carol   19    13

The contents of data.csv are stored in an object df of type DataFrame. This object offers powerful features to transform existing data or create new data. Recall that we want to compute the average/maximum relative gaps and the average computational times.
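
For instance, filtering the rows of a single algorithm is a one-liner (a small illustration, not needed for the rest of this post):

# Boolean indexing: keep only the rows where algorithm equals "alice"
print(df[df["algorithm"] == "alice"])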

Computing new data

Best-known solution

First of all, let’s compute the best-known solution for each instance, required to compute the relative gaps. To do so, we group rows by instance and take the minimum obj value in each group, as follows:

# Obtain a Series object that contains the best-known solutions
bks = df.groupby("instance")["obj"].transform(min)

Don’t panic, let’s decompose this instruction. First, df.groupby("instance") creates a GroupBy object. From the latter, we tell Pandas that we only need the obj column, so we use the [...] operator. Finally, the .transform(...) method applies a function (min) to each group. It returns a Series filled with the transformed values, but the original shape is preserved. As a consequence, we can directly use the bks (best-known solution) variable as a new column of df, using the .assign(...) method.
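
To see what .transform(...) produced, we can print bks; with our data, every row receives the minimum objective value of its instance (6 for inst1, 2 for inst2, 9 for inst3):

# Inspect the transformed Series: one value per original row
print(bks)

This displays:

0    6
1    6
2    6
3    2
4    2
5    2
6    9
7    9
8    9
Name: obj, dtype: int64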

# Add the "bks" column
df = df.assign(bks=bks)

# Display the updated DataFrame
print(df)

We can observe that the column has been successfully added:

  instance algorithm  obj  time  bks
0    inst1     alice    7    16    6
1    inst1       bob   16    19    6
2    inst1     carol    6    14    6
3    inst2     alice    2    11    2
4    inst2       bob    3    18    2
5    inst2     carol   15    17    2
6    inst3     alice    9    15    9
7    inst3       bob   17    10    9
8    inst3     carol   19    13    9

Relative gap

Legit question: why did we add the bks column? Well, it is not mandatory, but doing so makes it easy to compute our relative gap, given by the formula $\frac{\text{obj} - \text{bks}}{\text{obj}}$. Since we are familiar with .assign(...), let's do it in one line:

# Add the "gap" column
df = df.assign(gap=(df["obj"] - df["bks"]) / (df["obj"]))

# Display the updated DataFrame
print(df)

If obj can take the value zero, this formula divides by zero and needs to be adapted; one simple workaround is sketched below.
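
For example, shifting the denominator by one avoids the division by zero (this is just one convention; any strictly positive shift works, at the price of slightly biasing the gap):

# Variant of the gap that tolerates obj == 0
df = df.assign(gap=(df["obj"] - df["bks"]) / (df["obj"] + 1))

In our case, obj is always positive, so we keep the original formula. The result is: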

  instance algorithm  obj  time  bks       gap
0    inst1     alice    7    16    6  0.142857
1    inst1       bob   16    19    6  0.625000
2    inst1     carol    6    14    6  0.000000
3    inst2     alice    2    11    2  0.000000
4    inst2       bob    3    18    2  0.333333
5    inst2     carol   15    17    2  0.866667
6    inst3     alice    9    15    9  0.000000
7    inst3       bob   17    10    9  0.470588
8    inst3     carol   19    13    9  0.526316

Summarizing the results

We are ready to compute the average/maximum relative gap and the average computational time for each algorithm. First, we group the rows by algorithm and store the result in a temporary variable df_g.

# Group by algorithm
df_g = df.groupby("algorithm")

Remember that df_g is an object of type GroupBy. The GroupBy class offers handy methods to compute the most common statistics, in particular .min(), .max(), and .mean(). We create a new DataFrame summarizing the results:

# Compute the average and the maximum gap, plus the average time
df_summary = pd.DataFrame(
    {
        "avg_gap": df_g["gap"].mean().mul(100),
        "max_gap": df_g["gap"].max().mul(100),
        "avg_time": df_g["time"].mean(),
    }
)

# Display the summary
print(df_summary)

Let’s decompose the previous code a little bit. We create a DataFrame from a dictionary where each element is a Series, so df_summary will contain three columns. The df_g["gap"].mean() instruction tells Pandas that we want to operate on the gap column only, then compute the average of gap in each group. We call .mul(100) to multiply the values by 100, since they are percentages. The meaning of the other columns follows similarly. See the contents of df_summary:

             avg_gap    max_gap   avg_time
algorithm                                 
alice       4.761905  14.285714  14.000000
bob        47.630719  62.500000  15.666667
carol      46.432749  86.666667  14.666667

Looks pretty, doesn’t it?
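
As a side note, if you prefer building the summary in a single step, named aggregation (available since pandas 0.25) produces the same DataFrame; this is an equivalent sketch, not what the rest of the post relies on:

# Equivalent construction with named aggregation
df_summary = df.groupby("algorithm").agg(
    avg_gap=("gap", "mean"),
    max_gap=("gap", "max"),
    avg_time=("time", "mean"),
)

# Convert the gap columns to percentages
df_summary[["avg_gap", "max_gap"]] *= 100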

Output table

I know two ways of importing the table into a LaTeX article. One way is to call the .to_latex() method, which converts the table into valid LaTeX code. The other way consists in saving the table to a CSV file (.to_csv()) and letting LaTeX process it. Both approaches have their pros and cons, depending on your case. If you need to reuse the same data for several outputs (e.g. a table and a figure), I recommend the second approach. Otherwise, the first approach is fine for common situations.

[Image: the LaTeX table we want]

From Python

Let’s start using the .to_latex() method.

# Display the DataFrame in LaTeX format
print(df_summary.to_latex())

# Export the DataFrame to a LaTeX file
df_summary.to_latex("summary.tex")

This outputs the following LaTeX code.

\begin{tabular}{lrrr}
\toprule
{} &    avg\_gap &    max\_gap &   avg\_time \\
algorithm &            &            &            \\
\midrule
alice     &   4.761905 &  14.285714 &  14.000000 \\
bob       &  47.630719 &  62.500000 &  15.666667 \\
carol     &  46.432749 &  86.666667 &  14.666667 \\
\bottomrule
\end{tabular}

Well, the table requires some modifications:

  • Our algorithms are so great that we want their names to start with a capital letter.
  • Remove the row starting with algorithm.
  • Rename the columns as avg gap (%), max gap (%), and avg time (s).
  • All values need only two decimals.

Since the table is pretty small, we could just edit it by hand. But what if we had one hundred algorithms to compare? We can exploit Pandas to automatically post-process our table. First, we change the names of the rows and columns using the .rename(…) method.

# Rename rows (indexes) and columns
df_summary.rename(
    index={
        "alice": "Alice",
        "bob": "Bob",
        "carol": "Carol"
    },
    columns={
        "avg_gap": "avg gap (%)",
        "max_gap": "max gap (%)",
        "avg_time": "avg time (s)",
    },
    inplace=True,
)

Note that the inplace argument indicates that the method should modify df_summary directly. Next, we format the values according to our needs. Through its formatters argument, .to_latex() allows us to apply a function to each column, as follows.

# Export the DataFrame to LaTeX
df_summary.to_latex(
    "summary.tex",
    formatters={
        "avg gap (%)": "{:.2f}".format,
        "max gap (%)": "{:.2f}".format,
        "avg time (s)": "{:.2f}".format,
    },
)

Yeah! It produces the result I want. Yet, I think I can improve this Python script. In particular, I don’t like copy-pasting the transformed column names, because it means I need to edit them in several places if I change my mind. Thus, I prefer to define the formatting using the original column names (avg_gap, …). We can do this elegantly with the magic of dict comprehension:

index = {"alice": "Alice", "bob": "Bob", "carol": "Carol"}

columns = {
    "avg_gap": "avg gap (%)",
    "max_gap": "max gap (%)",
    "avg_time": "avg time (s)",
}

formatters = {
    "avg_gap": "{:.2f}".format,
    "max_gap": "{:.2f}".format,
    "avg_time": "{:.2f}".format,
}

# Rename rows (indexes) and columns
df_summary.rename(index=index, columns=columns, inplace=True)

# Export the DataFrame to LaTeX
df_summary.to_latex(
    "summary.tex",
    formatters={columns[c]: f for c, f in formatters.items()},
    index_names=False,
)

Finally, we set index_names=False to remove the algorithm row. Here we go! The content of summary.tex should be:

\begin{tabular}{lrrr}
\toprule
{} & avg gap (\%) & max gap (\%) & avg time (s) \\
\midrule
Alice &        4.76 &       14.29 &        14.00 \\
Bob   &       47.63 &       62.50 &        15.67 \\
Carol &       46.43 &       86.67 &        14.67 \\
\bottomrule
\end{tabular}

We can copy-paste this code into our LaTeX article. Personally, I prefer to save it as a separate file and use \input{summary}. Here is my LaTeX template:

\documentclass{article}
\usepackage{booktabs}

\begin{document}

\begin{table}
    \centering
    \caption{Comparison of algorithms}
    \input{summary}
\end{table}

\end{document}

From LaTeX

The previous method exports the table to ready-to-use LaTeX code from Python. While editing the table from Python is often handier than editing it from LaTeX, I still like the second approach. LaTeX can import CSV files thanks to packages such as csvsimple and pgfplotstable. One of the great advantages of using a CSV file is that we can use it as a single source of truth. Why is this an advantage? For example, we can display the same data simultaneously as a table and as a chart. In the following, we assume that our df_summary table is unchanged (its columns are still named avg_gap, max_gap, and avg_time). Although we might be able to do it in LaTeX, we choose to rename the algorithms in the Python script before exporting the table to summary.csv.

# Rename rows (indexes)
index = {"alice": "Alice", "bob": "Bob", "carol": "Carol"}
df_summary.rename(index=index, inplace=True)

# Export the DataFrame to CSV
df_summary.to_csv("summary.csv")
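
For reference, summary.csv should contain a header row followed by one row per algorithm, with full floating-point precision (digits truncated here for readability):

algorithm,avg_gap,max_gap,avg_time
Alice,4.7619047619...,14.2857142857...,14.0
Bob,47.6307189542...,62.5,15.6666666666...
Carol,46.4327485380...,86.6666666666...,14.6666666666...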

From now on, we use pgfplotstable to load the CSV file. Our LaTeX article follows this template:

\documentclass{article}
\usepackage{booktabs}
\usepackage{pgfplotstable}

\pgfplotstableread[col sep=comma]{summary.csv}{\summarytable}

\begin{document}

\begin{table}
    \centering
    \caption{Comparison of algorithms}
    \pgfplotstabletypeset[<options...>]{\summarytable}
\end{table}

\end{document}

The \pgfplotstableread command imports the CSV file into the \summarytable variable. Note that we need to specify col sep=comma; otherwise, values are assumed to be separated by white spaces. The \pgfplotstabletypeset command outputs a table from the \summarytable variable. All we need to do is define the options to satisfy our requirements. Since there are plenty of them, we will go through them step by step.

First, we can specify which columns of the CSV file we are using. Although this is optional here (we are using all the columns), I recommend doing so in case we later update our CSV file with more columns.

    columns={algorithm, avg_gap, max_gap, avg_time},

Next, let’s format the column algorithm:

    columns/{algorithm}/.style={
            column name={},
            column type=l,
            string type},

We decided that the column algorithm has no name and that its content is aligned to the left (l). We also specified that this column contains strings, not numbers (otherwise, pgfplotstable would raise an error). Similarly, we format the column avg_gap:

    columns/{avg_gap}/.style={
            column name={avg gap (\%)},
            column type=r,
            precision=2,
            fixed,
            fixed zerofill},

This time, we want the column to be aligned to the right (r). The precision option determines the number of decimals to show. Moreover, the numbers should be in fixed-point notation and padded with zeros. Except for the column name, the options for max_gap and avg_time are identical. To make our table look pretty, we add \toprule, \midrule, and \bottomrule rules:

    every head row/.style={before row=\toprule, after row=\midrule},
    every last row/.style={after row=\bottomrule},

Here is the complete code for the table:

\begin{table}
    \centering
    \caption{Comparison of algorithms}
    \pgfplotstabletypeset[
        columns={algorithm, avg_gap, max_gap, avg_time},
        columns/{algorithm}/.style={
                column name={},
                column type=l,
                string type},
        columns/{avg_gap}/.style={
                column name={avg gap (\%)},
                column type=r,
                precision=2,
                fixed,
                fixed zerofill},
        columns/{max_gap}/.style={
                column name={max gap (\%)},
                column type=r,
                precision=2,
                fixed,
                fixed zerofill},
        columns/{avg_time}/.style={
                column name={avg time (s)},
                column type=r,
                precision=2,
                fixed,
                fixed zerofill},
        every head row/.style={before row=\toprule, after row=\midrule},
        every last row/.style={after row=\bottomrule},
    ]{\summarytable}
\end{table}

Previously, I said that we can display a chart using the same source data. This can be done by including the pgfplots package.

\usepackage{pgfplots}

The following code creates a simple vertical bar chart of the average gap, embedded in a figure.

\begin{figure}
    \centering
    \begin{tikzpicture}
        \begin{axis}[
                ybar,
                xlabel={Algorithm},
                xtick=data,
                xticklabels from table={\summarytable}{algorithm},
                ylabel={Average gap (\%)},
                ymin=0,
                bar width=40,
                enlarge x limits=0.25]
            \addplot table [x expr=\coordindex, y={avg_gap}]{\summarytable};
        \end{axis}
    \end{tikzpicture}
    \caption{Comparison of algorithms}
\end{figure}

[Image: LaTeX figure based on \summarytable]

The commands and options of pgfplots are detailed in its manual. And voilà! We obtain both a table and a chart from a single source of data.

Conclusion

I described a method for computing statistics and creating LaTeX tables programmatically using Python and Pandas. We discussed two ways to include the tables in a LaTeX article. Preparing such scripts can help save time and avoid mistakes, compared to copying values manually. Like R, Pandas is a very mature and powerful tool, and it is not limited to our use case. Check out the manual for more details.