python - Validating the data type of a column in a PySpark DataFrame

February 15, 2012

I have a PySpark DataFrame with 3 columns. The DDL of the Hive table 'test1' declares all of them with the string data type, so df.printSchema() shows string for every column, as below:

>>> df = spark.sql("select * from default.test1")
>>> df.printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

+----------+--------------+-------------------+
|c1        |c2            |c3                 |
+----------+--------------+-------------------+
|april     |20132014      |4                  |
|may       |20132014      |5                  |
|june      |abcdefgh      |6                  |
+----------+--------------+-------------------+

Now I want to keep only the records whose 'c2' value is an integer. That is, I need the first 2 records, which hold the integer value '20132014', and I want to exclude the other records.

In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])

If a c2 value is not a valid integer, the cast turns it into NULL and the row is dropped in the subsequent step.

Without changing the column type:

valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())
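The cast-based checks above rely on an invalid cast returning NULL. When ANSI mode is on (spark.sql.ansi.enabled), an invalid cast raises an error instead, so a pattern match on the string may be the safer check there. Below is a minimal sketch of that alternative, not part of the original answer. INTEGER_PATTERN, split_by_integer and demo are hypothetical names chosen for illustration, and the sample rows are copied from the table in the question. The regex and the cast can disagree on edge cases (surrounding whitespace, decimal strings, digit strings too large for a 32-bit integer), so check which behavior you need on your Spark version.

```python
# Hypothetical helper names; only Column.rlike, Column.isNull and
# DataFrame.where are real PySpark API here.

# Optional sign followed by one or more digits, anchored at both ends.
INTEGER_PATTERN = r"^[+-]?[0-9]+$"


def split_by_integer(df, column):
    """Return (valid, invalid) DataFrames; `column` keeps its string type."""
    is_integer = df[column].rlike(INTEGER_PATTERN)
    valid = df.where(is_integer)
    # rlike yields NULL for NULL input, so NULL values are routed
    # to `invalid` explicitly instead of vanishing from both sides.
    invalid = df.where(~is_integer | df[column].isNull())
    return valid, invalid


def demo():
    """Rebuild the sample data from the question and show both splits."""
    # Imported here so the helper above can be reused without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("c2-check").getOrCreate()
    )
    df = spark.createDataFrame(
        [
            ("april", "20132014", "4"),
            ("may", "20132014", "5"),
            ("june", "abcdefgh", "6"),
        ],
        ["c1", "c2", "c3"],
    )
    valid, invalid = split_by_integer(df, "c2")
    valid.show()    # expected to hold the april and may rows
    invalid.show()  # expected to hold the june row
    spark.stop()
```

Calling demo() in an environment with PySpark installed prints the two splits. Unlike the one-line cast version, c2 stays a string in both results.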
