Optimize Changing Variables To Get Max Pearson's Correlation Coefficient For Multiple Columns

Amendment: If I have a pandas DataFrame that includes 5 columns (Col1, Col2, Col3, Col4 and Col5) and I need to get the maximum Pearson's correlation coefficient between(Col

Solution 1:

Not extremely elegant, but works; feel free to make this more generic:

import pandas as pd
from scipy.optimize import minimize


def minimize_me(b, df):

    # minimize() passes b as a one-element array, so we take b[0];
    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b[0])

# read your dataframe from somewhere, e.g. a csv file
df = pd.read_clipboard(sep=',')

# B is restricted to values >= 0 for now
bnds = [(0, None)]

# the initial guess for B is 1
res = minimize(minimize_me, [1.0], args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])

    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")

This will print:

0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(Col3, Col2 * Col1 ** B)

I read the data from the clipboard here; replace that with your .csv file (e.g. via pd.read_csv). You should then also avoid hardcoding the column names; the code above is just for demonstration purposes, so that you can see how to set up the optimization problem itself.
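One way to drop the hardcoded columns is to pass the column names in as arguments. The following is only a minimal sketch of that idea, not a definitive implementation: the function name fit_exponent, its parameters, and the synthetic demo DataFrame are all made up for illustration (the demo data is generated with a known exponent of 1.5 so that the result can be sanity-checked).

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def fit_exponent(df, target_cols, scale_col='Col2', base_col='Col1', b0=1.0):
    """Search for B >= 0 that maximizes the summed Pearson correlation
    between each target column and df[scale_col] * df[base_col] ** B."""

    def objective(b):
        col_mod = df[scale_col] * df[base_col] ** b[0]
        # negate, because minimize() looks for a minimum
        return -1 * sum(df[t].corr(col_mod) for t in target_cols)

    return minimize(objective, [b0], bounds=[(0, None)])


# synthetic stand-in data (hypothetical, NOT the data from the question)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'Col1': rng.uniform(1.0, 10.0, 200),
                     'Col2': rng.uniform(1.0, 5.0, 200)})
demo['Col3'] = demo['Col2'] * demo['Col1'] ** 1.5 * rng.normal(1.0, 0.05, 200)

res = fit_exponent(demo, ['Col3'])
best_b, best_corr = res.x[0], -1 * res.fun
print(best_b, best_corr)
```

With your own data you would call it as fit_exponent(df, ['Col3']) for a single target, or pass several target columns to optimize the sum of their correlations. As in the original code, check res.success before trusting the result.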

If you are interested in maximizing the sum of the correlations of Col3, Col4 and Col5 with the modified column, you can use (rest of code unmodified):

def minimize_me(b, df):

    # minimize() passes b as a one-element array, so we take b[0]
    col_mod = df['Col2'] * df['Col1'] ** b[0]

    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))

This will print:

1.0452394748131613 # your B
2.3428368479642137 # highest sum of the three correlations
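If the optimization fails, or you are unsure whether it ended in a local optimum, you can follow the hint in the error message and try several initial guesses. Below is a rough sketch of that, assuming the same summed objective as above; the helper name best_of_starts, the list of starting values, and the synthetic demo data (generated with a known exponent of 1.2) are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def neg_corr_sum(b, df):
    # same objective as above: negated sum of the three correlations
    col_mod = df['Col2'] * df['Col1'] ** b[0]
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))


def best_of_starts(df, starts=(0.5, 1.0, 2.0)):
    best = None
    for b0 in starts:
        res = minimize(neg_corr_sum, [b0], args=(df,), bounds=[(0, None)])
        # keep the successful run with the lowest objective (= highest sum)
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return best


# synthetic stand-in data (hypothetical, NOT the data from the question)
rng = np.random.default_rng(42)
demo = pd.DataFrame({'Col1': rng.uniform(1.0, 10.0, 300),
                     'Col2': rng.uniform(1.0, 5.0, 300)})
for name in ('Col3', 'Col4', 'Col5'):
    demo[name] = demo['Col2'] * demo['Col1'] ** 1.2 * rng.normal(1.0, 0.05, 300)

best = best_of_starts(demo)
if best is not None:
    print(best.x[0], -1 * best.fun)
else:
    print("None of the initial guesses led to a successful optimization.")
```

If all starting values end up at (nearly) the same B, that is a reasonable sign that the optimizer found the global maximum within the bounds.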
