Examples of all String Related Functions in Spark
Spark has lots of functions already built-in it's core, but sometimes it could be difficult to know what does each one of those. In this page, you'll find a code example of how to use each String-related function using the Dataframe API.
To use any of these functions, you can import them from the package org.apache.spark.sql
import org.apache.spark.sql.functions
// or
import org.apache.spark.sql.functions._
// or
import org.apache.spark.sql.functions.{function1, function2, ...}
All of these functions, unless specified otherwise, will work in either a select
or withColumn
statement, and they work exactly the same in python (pyspark), scala or java, but these examples are written in scala.
String manipulation
CONCAT_WS
Concatenates multiple columns together into a single string using the given separator.
concat_ws(sep: String, column1: Column, column2: Column…)
val data = Seq(("Hello", "Data")).toDF("C1", "C2")
data
.withColumn("Concat_Value",concat_ws(" ",col("C1"),col("C2")))
.show(false)
+-----+----+------------+
|C1 |C2 |Concat_Value|
+-----+----+------------+
|Hello|Data|Hello Data |
+-----+----+------------+
TRIM / RTRIM / LTRIM
trim: Trims the spaces from both ends of the given string column.
rtrim: Trims the spaces from the right end of the given string column.
ltrim: Trims the spaces from the left start of the given string column.
val data = Seq((" Hello! ")).toDF("C1")
data
.withColumn("trim_Value",trim(col("C1")))
.withColumn("rtrim_Value",rtrim(col("C1")))
.withColumn("ltrim_Value",ltrim(col("C1")))
.show(false)
+--------------+----------+-----------+-----------+
|C1 |trim_Value|rtrim_Value|ltrim_Value|
+--------------+----------+-----------+-----------+
| Hello! |Hello! | Hello! |Hello! |
+--------------+----------+-----------+-----------+
LPAD / RPAD
lpad: Fills a string with values “pad” to the left until the length of the string has a value of “len”.
lpad(columnName: String, len: String, pad: String)
rpad: Fills a string with values “pad” to the right until the length of the string has a value of “len”.
rpad(columnName: String, len: String, pad: String)
val data = Seq(("Hello!")).toDF("C1")
data
.withColumn("lpad_Value",lpad(col("C1"), 10, "-"))
.withColumn("rpad_Value",rpad(col("C1"), 10, "-"))
.show(false)
+------+----------+----------+
|C1 |lpad_Value|rpad_Value|
+------+----------+----------+
|Hello!|----Hello!|Hello!----|
+------+----------+----------+
REPEAT
Repeats the content of a column the specified amount of times.
repeat(columnName: Column, n: Int)
val data = Seq(("Hello!")).toDF("C1")
data
.withColumn("repeat_Value",repeat(col("C1"), 3))
.show(false)
+------+------------------+
|C1 |repeat_Value |
+------+------------------+
|Hello!|Hello!Hello!Hello!|
+------+------------------+
SPLIT
Breaks a string by the specified value. Returns an array with the content with all the parts.
split(columnName: Column, pattern: String)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("split_Value",split(col("C1"), " "))
.show(false)
+-----------------------+---------------------------+
|C1 |split_Value |
+-----------------------+---------------------------+
|Hello Data Engineering!|[Hello, Data, Engineering!]|
+-----------------------+---------------------------+
TRANSLATE
Translates any character in the src by a character in the replaceString. The characters in the replaceString correspond to the characters in the matchingString.
In the example, the values will be replaced:
A → Z
B → X
C → Y
translate(src: Column, matchingString: String, replaceString: String)
val data = Seq(("ABCACB!")).toDF("C1")
data
.withColumn("translate_Value",translate(col("C1"), "ABC", "ZXY"))
.show(false)
+-------+---------------+
|C1 |translate_Value|
+-------+---------------+
|ABCACB!|ZXYZYX! |
+-------+---------------+
OVERLAY
Overlays the specified portion of src with replace, starting from the position pos of src and procesing for len positions. All the arguments need to be Columns, so you may need to use lit
.
overlay(src: Column, replace: Column, pos: Column, len: Column)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("overlay_Value", functions.overlay(col("C1"),lit("Bye"),lit(0),lit(6)))
.show(false)
+-----------------------+---------------------+
|C1 |overlay_Value |
+-----------------------+---------------------+
|Hello Data Engineering!|Bye Data Engineering!|
+-----------------------+---------------------+
LENGTH
Returns the amount of characters in a given string. The length of the character include the trailing spaces.
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("Length_Value",length(col("C1")))
.show(false)
+-----------------------+------------+
|C1 |Length_Value|
+-----------------------+------------+
|Hello Data Engineering!|23 |
+-----------------------+------------+
String transformation
ASCII
Returns the ASCII value of the first character in the string.
val data = Seq(("H")).toDF("C1")
data
.withColumn("ASCII_Value",ascii(col("C1")))
.show(false)
+---+-----------+
|C1 |ASCII_Value|
+---+-----------+
|H |72 |
+---+-----------+
BASE64 / UNBASE64
base64: Transform the given string into its Base64 representation.
unbase64: Decodes a base64 encoded string. You may need to cast it to a String or you will get back the array of bytes.
val data = Seq(("Hello Data Engineering")).toDF("C1")
data
.withColumn("Base64_Value",base64(col("C1")))
.withColumn("UnBase64_Value",unbase64(col("Base64_Value")).cast(StringType))
.show(false)
+----------------------+--------------------------------+----------------------+
|C1 |Base64_Value |UnBase64_Value |
+----------------------+--------------------------------+----------------------+
|Hello Data Engineering|SGVsbG8gRGF0YSBFbmdpbmVlcmluZw==|Hello Data Engineering|
+----------------------+--------------------------------+----------------------+
DECODE / ENCODE
Decode: Decodes the given string from the charset specified.
Encode: Encodes the given string into the charset specified.
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("Decode_Value",decode(col("C1"), "utf-8"))
.withColumn("Encode_Value",encode(col("C1"), "utf-8"))
.show(false)
+-----------------------+-----------------------+----------------------------------------------------------------------+
|C1 |Decode_Value |Encode_Value |
+-----------------------+-----------------------+----------------------------------------------------------------------+
|Hello Data Engineering!|Hello Data Engineering!|[48 65 6C 6C 6F 20 44 61 74 61 20 45 6E 67 69 6E 65 65 72 69 6E 67 21]|
+-----------------------+-----------------------+----------------------------------------------------------------------+
FORMAT_NUMBER
Formats a numeric column into a String. You can specified a formating expression or the amount of decimal places that you want back, and it will round to that decimal position.
val data = Seq((16000.12345)).toDF("C1")
data
.withColumn("FormatNumber_Value",format_number(col("C1"), 2))
.show(false)
+-----------+------------------+
|C1 |FormatNumber_Value|
+-----------+------------------+
|16000.12345|16,000.12 |
+-----------+------------------+
FORMAT_STRING
Formats the argument in printf-style and returns the result as a single column.
val data = Seq(("Hello Data", 100)).toDF("C1", "C2")
data
.withColumn("FormatString_Value",format_string("%s - %d",col("C1"),col("C2")))
.show(false)
+----------+---+------------------+
|C1 |C2 |FormatString_Value|
+----------+---+------------------+
|Hello Data|100|Hello Data - 100 |
+----------+---+------------------+
SOUNDEX
Returns the Soundex Code for the string. A code based on the way a name sounds rather than the way it's spelled.
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("Soundex_Value",soundex(col("C1")))
.show(false)
+-----------------------+-------------+
|C1 |Soundex_Value|
+-----------------------+-------------+
|Hello Data Engineering!|H433 |
+-----------------------+-------------+
LEVENSHTEIN
Computes the Levenshtein distance of the two given string columns.
val data = Seq(("Hello", "Goobye")).toDF("C1", "C2")
data
.withColumn("levenshtein_Value",levenshtein(col("C1"),col("C2")))
.show(false)
+-----+------+-----------------+
|C1 |C2 |levenshtein_Value|
+-----+------+-----------------+
|Hello|Goobye|6 |
+-----+------+-----------------+
INITCAP / LOWER / UPPER
initcap: Returns a new String Column by converting the first letter of each word to uppercase. Words are delimited by whitespace.
lower: Converts a string column to lower case.
upper: Converts a string column to upper case.
val data = Seq(("hello Data Engineering!")).toDF("C1")
data
.withColumn("InitCap_Value",initcap(col("C1")))
.withColumn("Lower_Value",lower(col("C1")))
.withColumn("Upper_Value",upper(col("C1")))
.show(false)
+-----------------------+-----------------------+-----------------------+-----------------------+
|C1 |InitCap_Value |Lower_Value |Upper_Value |
+-----------------------+-----------------------+-----------------------+-----------------------+
|hello Data Engineering!|Hello Data Engineering!|hello data engineering!|HELLO DATA ENGINEERING!|
+-----------------------+-----------------------+-----------------------+-----------------------+
String search
SUBSTRING / SUBSTRING_INDEX
substring: Returns the value of the string starting at pos and with a length of len.
substring(columnName: Column, pos: Int, len: Int)
substring_index: Returns the substring from the String Column until the point were there has been count occurrences of delim. In the example it returns the string until the point where there has been 2 whitepsaces.
substring_index(columnName: Column, delim: String, count: Int)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("Substring_Value",substring(col("C1"), 0, 5))
.withColumn("SubstringIndex_Value",substring_index(col("C1"), " ", 2))
.show(false)
+-----------------------+---------------+--------------------+
|C1 |Substring_Value|SubstringIndex_Value|
+-----------------------+---------------+--------------------+
|Hello Data Engineering!|Hello |Hello Data |
+-----------------------+---------------+--------------------+
REGEXP_REPLACE
Replace all substring of the specified string value that match the pattern with replacement:
regexp_replace(columnName: Column, pattern: Column, replacement: Column)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("Replace_Value",regexp_replace(col("C1"),lit("Hello"),lit("Goodbye")))
.show(false)
+-----------------------+-------------------------+
|C1 |Replace_Value |
+-----------------------+-------------------------+
|Hello Data Engineering!|Goodbye Data Engineering!|
+-----------------------+-------------------------+
INSTR
Locates the position of the first occurrence of the substring in the given string column. Returns 0 if substring can’t be found. The difference with the locate function is that here, the substring is a fixed value for all the dataframe.
instr(columnName: Column, substring: String)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("instr_Value",instr(col("C1"), "Data"))
.show(false)
+-----------------------+-----------+
|C1 |instr_Value|
+-----------------------+-----------+
|Hello Data Engineering!|7 |
+-----------------------+-----------+
LOCATE
Locates the position of the first occurrence of the str in the given string column. Returns 0 if substring can’t be found. The difference with the instr function is that here, the str is a Column, so it can be different on each row of the dataframe.
locate(columnName: Column, str: Column)
val data = Seq(("Hello Data Engineering!")).toDF("C1")
data
.withColumn("locate_Value",locate("Data",col("C1")))
.show(false)
+-----------------------+------------+
|C1 |locate_Value|
+-----------------------+------------+
|Hello Data Engineering!|7 |
+-----------------------+------------+
Sources
You can find the source code of these functions here: