Optimize Changing Variables To Get Max Pearson's Correlation Coefficient For Multiple Columns
Amendment: If I have a pandas DataFrame that includes 5 columns Col1 & Col2 & Col3 & Col4 & Col5 and I need to get max Pearson's correlation coefficient between(Col
Solution 1:
Not extremely elegant, but works; feel free to make this more generic:
import pandas as pd
from scipy.optimize import minimize

def minimize_me(b, df):
    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b)

# read your dataframe from somewhere, e.g. csv
df = pd.read_clipboard(sep=',')

# B is greater than or equal to 0 for now
bnds = [(0, None)]

# start the search at B = 1
res = minimize(minimize_me, (1), args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])
    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")
This will print:
0.9020784246026575 # your B
0.7614993786787415 # highest correlation between Col3 and Col2 * Col1 ** B
The code reads the data from the clipboard; replace that with your .csv file. You should then also avoid hardcoding the columns; the code above is just for demonstration purposes, so that you see how to set up the optimization problem itself.
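A minimal sketch of what avoiding the hardcoded columns could look like. The function names (neg_corr, fit_exponent), their parameters, and the randomly generated demo frame are all assumptions made for illustration; they are not part of the answer above, and the demo numbers are not the asker's data.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def neg_corr(b, df, base_col, scale_col, target_col):
    # minimize passes b as a 1-element array, so take the scalar exponent
    modified = df[scale_col] * df[base_col] ** b[0]
    # negate the correlation because minimize looks for a minimum
    return -1 * df[target_col].corr(modified)


def fit_exponent(df, base_col, scale_col, target_col, b0=1.0):
    # hypothetical helper: same setup as above, with column names as arguments
    return minimize(neg_corr, [b0],
                    args=(df, base_col, scale_col, target_col),
                    bounds=[(0, None)])


# made-up demo data; with real data you would use pd.read_csv instead
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Col1": rng.uniform(1, 5, 200),
                     "Col2": rng.uniform(1, 5, 200)})
demo["Col3"] = demo["Col2"] * demo["Col1"] ** 1.5 + rng.normal(0, 0.1, 200)

res = fit_exponent(demo, "Col1", "Col2", "Col3")
if res.success:
    print(res.x[0])       # fitted exponent
    print(-1 * res.fun)   # correlation at that exponent
```

Passing the column names through args keeps the objective function free of any knowledge about a particular DataFrame, so the same two functions can be reused for other column combinations.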
If you are interested in the sum, you can use (rest of code unmodified):
def minimize_me(b, df):
    col_mod = df['Col2'] * df['Col1'] ** b
    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))
This will print:
1.0452394748131613
2.3428368479642137
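Note that the second number is a sum of three correlations, so its upper limit is 3 rather than 1. If the list of target columns may change, the sum could be written over an arbitrary list instead of three explicit terms. The following is only a sketch under assumptions: neg_corr_sum and target_cols are hypothetical names, and the demo frame is randomly generated, not the data from the question.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize


def neg_corr_sum(b, df, target_cols):
    # b arrives as a 1-element array; use its scalar value as the exponent
    col_mod = df['Col2'] * df['Col1'] ** b[0]
    # sum the correlations with every target column, negated for minimize
    return -1 * sum(df[col].corr(col_mod) for col in target_cols)


# made-up demo data with three target columns of increasing noise
rng = np.random.default_rng(1)
demo = pd.DataFrame({'Col1': rng.uniform(1, 4, 150),
                     'Col2': rng.uniform(1, 4, 150)})
signal = demo['Col2'] * demo['Col1'] ** 1.2
for name, noise in [('Col3', 0.5), ('Col4', 1.0), ('Col5', 2.0)]:
    demo[name] = signal + rng.normal(0, noise, 150)

targets = ['Col3', 'Col4', 'Col5']
res = minimize(neg_corr_sum, [1.0], args=(demo, targets), bounds=[(0, None)])
if res.success:
    print(res.x[0])       # exponent that maximizes the summed correlation
    print(-1 * res.fun)   # the sum itself, at most len(targets)
```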