Issue In Encoding Non-numeric Feature To Numeric In Spark And Ipython
I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forest algorithm.
Solution 1:
Generally speaking, if your data can be processed with Pandas data frames and scikit-learn, using Spark seems to be serious overkill. Still, if you do use it, it probably makes more sense to use Spark tools all the way. Let's start with indexing your features:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.feature import VectorAssembler
label_col = "x3"  # For example
# I assume this comes from your previous question
df = (rdd.map(lambda row: [row[i] for i in columns_num])
    .toDF(("x0", "x1", "x2", "x3")))
# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    # For classification problems
    #  - if you want to use ML you should index the label as well
    #  - if you want to use MLlib it is not necessary
    # For regression problems you should omit the label from indexing,
    # as shown below
    for x in df.columns if x not in {label_col}  # Exclude other columns if needed
]
# Assembles multiple columns into a single vector
assembler = VectorAssembler(
inputCols=["idx_{0}".format(x) for x in df.columns if x != label_col],
outputCol="features"
)
pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(df)
indexed = model.transform(df)
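To make the indexing step concrete, here is a minimal plain-Python sketch of what StringIndexer does with its default settings, as far as I understand them: distinct strings are ordered by descending frequency and the most frequent one gets index 0.0. The helper name index_strings and the alphabetical tie-breaking are assumptions made for illustration, not part of Spark.

```python
from collections import Counter


def index_strings(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here
    # (an assumption, chosen only to make the result deterministic).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


column = ["a", "b", "c", "a", "a", "c"]
mapping = index_strings(column)
encoded = [mapping[v] for v in column]
print(mapping)
print(encoded)
```

The resulting numbers are arbitrary category codes, not magnitudes, which is why tree-based methods need to be told which features are categorical.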
The pipeline defined above creates the following data frame:
indexed.printSchema()
## root
## |-- x0: string (nullable = true)
## |-- x1: string (nullable = true)
## |-- x2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- idx_x0: double (nullable = true)
## |-- idx_x1: double (nullable = true)
## |-- idx_x2: double (nullable = true)
## |-- features: vector (nullable = true)
where features should be a valid input for mllib.tree.DecisionTree (see SPARK: How to create categoricalFeaturesInfo for decision trees from LabeledPoint?).
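MLlib's tree methods take a categoricalFeaturesInfo argument: a dict mapping each categorical feature's position in the vector to its number of categories. The following plain-Python sketch shows one way such a dict could be derived from already-indexed rows. The helper name categorical_features_info and the sample rows are hypothetical, and the code assumes every feature is categorical and encoded as 0.0 to k-1.

```python
def categorical_features_info(rows):
    """Return {feature_index: number_of_categories} for indexed rows.

    Assumes each value is a category code in the range 0.0 .. k-1,
    so the number of categories is the largest code plus one.
    """
    info = {}
    for row in rows:
        for i, value in enumerate(row):
            info[i] = max(info.get(i, 0), int(value) + 1)
    return info


# Hypothetical indexed rows standing in for idx_x0, idx_x1, idx_x2
rows = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [2.0, 1.0, 1.0]]
info = categorical_features_info(rows)
print(info)
```

A dict like this could then be passed as categoricalFeaturesInfo when training. Note that maxBins has to be at least as large as the largest category count.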
You can create labeled points from the indexed data frame as follows:
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col
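label_points = (indexed
    .select(col(label_col).alias("label"), col("features"))
    # Map over the underlying RDD: DataFrame.map is not available in Spark 2.0+.
    # On 2.0+ features is an ml vector, so it may also need converting with
    # pyspark.mllib.linalg.Vectors.fromML before building the LabeledPoint.
    .rdd
    .map(lambda row: LabeledPoint(row.label, row.features)))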