python - Validating the data type of a column in a PySpark DataFrame


I have a PySpark DataFrame with 3 columns. The DDL of the Hive table 'test1' declares all of them as string, so df.printSchema() shows the string data type for each column, as below:

>>> df = spark.sql("select * from default.test1")
>>> df.printSchema()
root
 |-- c1: string (nullable = true)
 |-- c2: string (nullable = true)
 |-- c3: string (nullable = true)

>>> df.show(truncate=False)
+----------+--------------+-------------------+
|c1        |c2            |c3                 |
+----------+--------------+-------------------+
|april     |20132014      |4                  |
|may       |20132014      |5                  |
|june      |abcdefgh      |6                  |
+----------+--------------+-------------------+

Now I want to keep only the records whose 'c2' value is an integer. That is, I need the first 2 records, where 'c2' holds the integer value '20132014', and I want to exclude the other records.

In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])

If c2 is not a valid integer, the cast yields null and the row is dropped in the subsequent step. Note that c2 becomes an integer column in the result.

Without changing the type:

valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())
