
Need To Create A Pandas Dataframe By Reading Csv File With Random Columns

I have the following csv file with records:

A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8

My column headers/names are: A, B, C, D, E, F, G

So my initial dataframe after

Solution 1:

You can loop through the rows with apply (axis=1) and build a pandas Series for each row from the key-value pairs obtained by splitting each cell. The newly constructed Series are automatically aligned by their index. Note that the data has no F column but does have an extra H; I'm not sure whether that is what you need, but removing H and adding an all-NaN F column should be straightforward:

df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split()
                              if isinstance(x, list) and len(x) == 2}), axis=1)


#      A    B    C    D    E    G    H
# 0    1    2   10   15  NaN  NaN  NaN
# 1    5  NaN  NaN   10  NaN    2  NaN
# 2  NaN  NaN  NaN    6    7  NaN  NaN
# 3  NaN  NaN  NaN  NaN  NaN    8    7
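The "removing H and adding F" step mentioned above can be done with reindex. The following is only a minimal sketch: it assumes the file is read with read_csv and header=None (as in Solution 3), and the names row_to_series and wide are hypothetical, chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Sample records from the question; each cell holds a "key value" pair.
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)


def row_to_series(row):
    # Hypothetical helper: split each non-missing cell on whitespace
    # and keep only well-formed (key, value) pairs.
    pairs = (cell.split() for cell in row.dropna())
    return pd.Series({p[0]: p[1] for p in pairs if len(p) == 2})


wide = df.apply(row_to_series, axis=1)

# Keep only the expected headers: this drops H and adds an all-NaN F column.
wide = wide.reindex(columns=list("ABCDEFG")).astype(float)
print(wide)
```

reindex keeps only the labels it is given, so unexpected keys such as H disappear and missing ones such as F appear as NaN.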

Solution 2:

Apply solution:

Split by whitespace, remove the NaN rows with dropna, move the keys into the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:

print (df.apply(lambda x: x.str.split(expand=True)
                               .dropna()
                               .set_index(0)
                               .squeeze(), axis=1)
         .reindex(columns=list('ABCDEFGH')))

     A    B    C    D    E    F    G    H
0    1    2   10   15  NaN  NaN  NaN  NaN
1    5  NaN  NaN   10  NaN  NaN    2  NaN
2  NaN  NaN  NaN    6    7  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    8    7
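One thing to watch with squeeze: on a frame with exactly one row and one column it returns a scalar rather than a Series, which can happen if a record holds a single pair. The sketch below is a variant (not the answer's exact code) that selects the value column directly instead. The extra "F 3" row and the name row_to_series are hypothetical, added only to illustrate the single-pair case.

```python
from io import StringIO

import pandas as pd

# Question data plus one hypothetical single-pair row ("F 3") for illustration.
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8
F 3"""
df = pd.read_csv(StringIO(data), header=None)


def row_to_series(row):
    # Split the non-missing cells into a two-column frame: 0 = key, 1 = value.
    pairs = row.dropna().str.split(expand=True)
    # Selecting column 1 always yields a Series, even for a single pair.
    return pairs.set_index(0)[1]


out = df.apply(row_to_series, axis=1).reindex(columns=list("ABCDEFGH"))
print(out)
```

The values are still strings at this point; astype(float) can be chained on if numbers are needed.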

Stack solution:

Use stack to create a Series, split by whitespace into new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, unstack, reindex by the new column names (this adds missing columns filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):

print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))

     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
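To make the chain easier to follow, here is a sketch that splits it into named intermediate steps. It assumes df is read as in Solution 3. The intermediate names (long, pairs, values, out) are hypothetical, and two details differ from the chain above: an explicit dropna after stack (in case the installed pandas version keeps missing cells when stacking) and column selection in place of squeeze.

```python
from io import StringIO

import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)

# 1. One long Series of "key value" strings, indexed by (row, original column).
#    Missing cells are dropped so they never reach the split step.
long = df.stack().dropna()

# 2. Split on whitespace into two columns: 0 holds the key, 1 the value.
pairs = long.str.split(expand=True)

# 3. Append the key to the index and keep the values as a Series.
values = pairs.set_index(0, append=True)[1]

# 4. Drop the old column position, leaving (row, key) as the index.
values = values.reset_index(level=1, drop=True)

# 5. Pivot the keys into columns, add missing headers, convert to float,
#    and clear the leftover name on the columns axis.
out = (values.unstack()
             .reindex(columns=list("ABCDEFGH"))
             .astype(float)
             .rename_axis(None, axis=1))
print(out)
```

Printing each intermediate (long, pairs, values) is a quick way to see what every step of the one-liner contributes.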

Solution 3:

Here is the code:

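# Empty result frame with one column per expected header.
res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    # First token of each cell is the column name, second is the value.
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    # Write this row's values into the matching columns of res.
    res.loc[row.name, cols] = vals

df.apply(classifier, axis=1)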

Input:

from io import StringIO
import pandas as pd
import numpy as np

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)

print("\nres:\n", res)

Output:

df:
       0      1      2      3
0   A 1    B 2   C 10   D 15
1   A 5   D 10    G 2    NaN
2   D 6    E 7    NaN    NaN
3   H 7    G 8    NaN    NaN

res:
     A    B    C    D    E    F    G    H
0    1    2   10   15  NaN  NaN  NaN  NaN
1    5  NaN  NaN   10  NaN  NaN    2  NaN
2  NaN  NaN  NaN    6    7  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    8    7
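The values written into res come from str.split, so they are strings stored in object columns even though they print like numbers. If numeric columns are needed, one option is to apply pd.to_numeric column by column. The sketch below uses a small stand-in frame (not the full res above) purely for illustration; the name numeric is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for a frame like `res`: string values and NaN in object columns.
res = pd.DataFrame(
    [["1", "2", np.nan], ["5", np.nan, "2"]],
    columns=list("ABG"),
    dtype=object,
)

# Convert each column to a numeric dtype; NaN stays NaN.
numeric = res.apply(pd.to_numeric)
print(numeric.dtypes)
```

res.astype(float) is an alternative when every column should end up as float.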
