
Issue In Encoding Non-numeric Feature To Numeric In Spark And Ipython

I am working on a problem where I have to predict numeric data (monthly employee spending) from non-numeric features. I am using the Random Forest algorithm from Spark MLlib.

Solution 1:

Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark seems like serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start with indexing your features:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler

label_col = "x3"  # For example

# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    # For classification problems
    #   - if you want to use ML you should index the label as well
    #   - if you want to use MLlib it is not necessary
    # For regression problems you should omit the label in the indexing
    # as shown below
    for x in df.columns if x not in {label_col}
    # Exclude other columns if needed
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
    outputCol="features"
)


pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)

The pipeline defined above will create the following data frame:

indexed.printSchema()
## root
##  |-- x0: string (nullable = true)
##  |-- x1: string (nullable = true)
##  |-- x2: string (nullable = true)
##  |-- x3: string (nullable = true)
##  |-- idx_x0: double (nullable = true)
##  |-- idx_x1: double (nullable = true)
##  |-- idx_x2: double (nullable = true)
##  |-- features: vector (nullable = true)

where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
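
To make the encoding concrete, here is a minimal pure-Python sketch of the idea behind StringIndexer: each distinct string in a column gets a double index, with the most frequent value mapped to 0.0. It also derives a dictionary in the shape that the categoricalFeaturesInfo argument expects (feature position mapped to number of categories). The helper name fit_string_index, the sample values, and the alphabetical tie-breaking are assumptions made for illustration. This is not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first.

    Ties are broken alphabetically; that rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical sample columns standing in for x0 and x1
columns = {
    "x0": ["sales", "hr", "sales", "it", "sales", "hr"],
    "x1": ["junior", "senior", "senior", "senior", "junior", "senior"],
}

# One mapping per column, like one StringIndexer per column
mappings = {name: fit_string_index(vals) for name, vals in columns.items()}

# Replace every string with its index, like the idx_* columns
encoded = {
    name: [mappings[name][v] for v in vals] for name, vals in columns.items()
}

# Position of the feature in the assembled vector -> number of categories
categorical_features_info = {
    position: len(mappings[name]) for position, name in enumerate(columns)
}

print(mappings["x0"])
print(encoded["x0"])
print(categorical_features_info)
```

With real data you would compute the category counts from the indexed data frame instead. Passing such a dictionary tells the tree algorithms to treat the indexed columns as categorical rather than as ordered numbers.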

You can create labeled points out of it as follows:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

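# On Spark 2.0+ a DataFrame has no .map, so go through .rdd; there the features
# may also need pyspark.mllib.linalg.Vectors.fromML before LabeledPoint accepts them.
label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))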
