Examples of all String Related Functions in Spark

Spark has lots of functions already built-in it's core, but sometimes it could be difficult to know what does each one of those. In this page, you'll find a code example of how to use each String-related function using the Dataframe API.

To use any of these functions, you can import them from the package org.apache.spark.sql

import org.apache.spark.sql.functions
// or
import org.apache.spark.sql.functions._
// or
import org.apache.spark.sql.functions.{function1, function2, ...}

All of these functions, unless specified otherwise, will work in either a select or withColumn statement, and they work exactly the same in python (pyspark), scala or java, but these examples are written in scala.

String manipulation

CONCAT_WS

Concatenates multiple columns together into a single string using the given separator.

concat_ws(sep: String, column1: Column, column2: Column…)

val data = Seq(("Hello", "Data")).toDF("C1", "C2")

data
  .withColumn("Concat_Value",concat_ws(" ",col("C1"),col("C2")))
  .show(false)

+-----+----+------------+
|C1   |C2  |Concat_Value|
+-----+----+------------+
|Hello|Data|Hello Data  |
+-----+----+------------+

TRIM / RTRIM / LTRIM

trim: Trims the spaces from both ends of the given string column.

rtrim: Trims the spaces from the right end of the given string column.

ltrim: Trims the spaces from the left start of the given string column.

val data = Seq(("    Hello!    ")).toDF("C1")

data
  .withColumn("trim_Value",trim(col("C1")))
  .withColumn("rtrim_Value",rtrim(col("C1")))
  .withColumn("ltrim_Value",ltrim(col("C1")))
  .show(false)

+--------------+----------+-----------+-----------+
|C1            |trim_Value|rtrim_Value|ltrim_Value|
+--------------+----------+-----------+-----------+
|    Hello!    |Hello!    |    Hello! |Hello!     |
+--------------+----------+-----------+-----------+

LPAD / RPAD

lpad: Fills a string with values “pad” to the left until the length of the string has a value of “len”.

lpad(columnName: String, len: String, pad: String)

rpad: Fills a string with values “pad” to the right until the length of the string has a value of “len”.

rpad(columnName: String, len: String, pad: String)

val data = Seq(("Hello!")).toDF("C1")

data
  .withColumn("lpad_Value",lpad(col("C1"), 10, "-"))
  .withColumn("rpad_Value",rpad(col("C1"), 10, "-"))
  .show(false)

+------+----------+----------+
|C1    |lpad_Value|rpad_Value|
+------+----------+----------+
|Hello!|----Hello!|Hello!----|
+------+----------+----------+

REPEAT

Repeats the content of a column the specified amount of times.

repeat(columnName: Column, n: Int)

val data = Seq(("Hello!")).toDF("C1")

data
  .withColumn("repeat_Value",repeat(col("C1"), 3))
  .show(false)

+------+------------------+
|C1    |repeat_Value      |
+------+------------------+
|Hello!|Hello!Hello!Hello!|
+------+------------------+

SPLIT

Breaks a string by the specified value. Returns an array with the content with all the parts.

split(columnName: Column, pattern: String)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("split_Value",split(col("C1"), " "))
  .show(false)

+-----------------------+---------------------------+
|C1                     |split_Value                |
+-----------------------+---------------------------+
|Hello Data Engineering!|[Hello, Data, Engineering!]|
+-----------------------+---------------------------+

TRANSLATE

Translates any character in the src by a character in the replaceString. The characters in the replaceString correspond to the characters in the matchingString.

In the example, the values will be replaced:

A → Z

B → X

C → Y

translate(src: Column, matchingString: String, replaceString: String)

val data = Seq(("ABCACB!")).toDF("C1")

data
  .withColumn("translate_Value",translate(col("C1"), "ABC", "ZXY"))
  .show(false)

+-------+---------------+
|C1     |translate_Value|
+-------+---------------+
|ABCACB!|ZXYZYX!        |
+-------+---------------+

OVERLAY

Overlays the specified portion of src with replace, starting from the position pos of src and procesing for len positions. All the arguments need to be Columns, so you may need to use lit.

overlay(src: Column, replace: Column, pos: Column, len: Column)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("overlay_Value", functions.overlay(col("C1"),lit("Bye"),lit(0),lit(6)))
  .show(false)

+-----------------------+---------------------+
|C1                     |overlay_Value        |
+-----------------------+---------------------+
|Hello Data Engineering!|Bye Data Engineering!|
+-----------------------+---------------------+

LENGTH

Returns the amount of characters in a given string. The length of the character include the trailing spaces.

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("Length_Value",length(col("C1")))
  .show(false)

+-----------------------+------------+
|C1                     |Length_Value|
+-----------------------+------------+
|Hello Data Engineering!|23          |
+-----------------------+------------+

String transformation

ASCII

Returns the ASCII value of the first character in the string.

val data = Seq(("H")).toDF("C1")

data
  .withColumn("ASCII_Value",ascii(col("C1")))
  .show(false)

+---+-----------+
|C1 |ASCII_Value|
+---+-----------+
|H  |72         |
+---+-----------+

BASE64 / UNBASE64

base64: Transform the given string into its Base64 representation.

unbase64: Decodes a base64 encoded string. You may need to cast it to a String or you will get back the array of bytes.

val data = Seq(("Hello Data Engineering")).toDF("C1")

data
  .withColumn("Base64_Value",base64(col("C1")))
  .withColumn("UnBase64_Value",unbase64(col("Base64_Value")).cast(StringType))
  .show(false)

+----------------------+--------------------------------+----------------------+
|C1                    |Base64_Value                    |UnBase64_Value        |
+----------------------+--------------------------------+----------------------+
|Hello Data Engineering|SGVsbG8gRGF0YSBFbmdpbmVlcmluZw==|Hello Data Engineering|
+----------------------+--------------------------------+----------------------+

DECODE / ENCODE

Decode: Decodes the given string from the charset specified.

Encode: Encodes the given string into the charset specified.

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("Decode_Value",decode(col("C1"), "utf-8"))
  .withColumn("Encode_Value",encode(col("C1"), "utf-8"))
  .show(false)

+-----------------------+-----------------------+----------------------------------------------------------------------+
|C1                     |Decode_Value           |Encode_Value                                                          |
+-----------------------+-----------------------+----------------------------------------------------------------------+
|Hello Data Engineering!|Hello Data Engineering!|[48 65 6C 6C 6F 20 44 61 74 61 20 45 6E 67 69 6E 65 65 72 69 6E 67 21]|
+-----------------------+-----------------------+----------------------------------------------------------------------+

FORMAT_NUMBER

Formats a numeric column into a String. You can specified a formating expression or the amount of decimal places that you want back, and it will round to that decimal position.

val data = Seq((16000.12345)).toDF("C1")

data
  .withColumn("FormatNumber_Value",format_number(col("C1"), 2))
  .show(false)

+-----------+------------------+
|C1         |FormatNumber_Value|
+-----------+------------------+
|16000.12345|16,000.12         |
+-----------+------------------+

FORMAT_STRING

Formats the argument in printf-style and returns the result as a single column.

val data = Seq(("Hello Data", 100)).toDF("C1", "C2")

data
  .withColumn("FormatString_Value",format_string("%s - %d",col("C1"),col("C2")))
  .show(false)

+----------+---+------------------+
|C1        |C2 |FormatString_Value|
+----------+---+------------------+
|Hello Data|100|Hello Data - 100  |
+----------+---+------------------+

SOUNDEX

Returns the Soundex Code for the string. A code based on the way a name sounds rather than the way it's spelled.

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("Soundex_Value",soundex(col("C1")))
  .show(false)

+-----------------------+-------------+
|C1                     |Soundex_Value|
+-----------------------+-------------+
|Hello Data Engineering!|H433         |
+-----------------------+-------------+

LEVENSHTEIN

Computes the Levenshtein distance of the two given string columns.

val data = Seq(("Hello", "Goobye")).toDF("C1", "C2")

data
  .withColumn("levenshtein_Value",levenshtein(col("C1"),col("C2")))
  .show(false)

+-----+------+-----------------+
|C1   |C2    |levenshtein_Value|
+-----+------+-----------------+
|Hello|Goobye|6                |
+-----+------+-----------------+

INITCAP / LOWER / UPPER

initcap: Returns a new String Column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

lower: Converts a string column to lower case.

upper: Converts a string column to upper case.

val data = Seq(("hello Data Engineering!")).toDF("C1")

data
  .withColumn("InitCap_Value",initcap(col("C1")))
  .withColumn("Lower_Value",lower(col("C1")))
  .withColumn("Upper_Value",upper(col("C1")))
  .show(false)

+-----------------------+-----------------------+-----------------------+-----------------------+
|C1                     |InitCap_Value          |Lower_Value            |Upper_Value            |
+-----------------------+-----------------------+-----------------------+-----------------------+
|hello Data Engineering!|Hello Data Engineering!|hello data engineering!|HELLO DATA ENGINEERING!|
+-----------------------+-----------------------+-----------------------+-----------------------+

String search

SUBSTRING / SUBSTRING_INDEX

substring: Returns the value of the string starting at pos and with a length of len.

substring(columnName: Column, pos: Int, len: Int)

substring_index: Returns the substring from the String Column until the point were there has been count occurrences of delim. In the example it returns the string until the point where there has been 2 whitepsaces.

substring_index(columnName: Column, delim: String, count: Int)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("Substring_Value",substring(col("C1"), 0, 5))
  .withColumn("SubstringIndex_Value",substring_index(col("C1"), " ", 2))
  .show(false)

+-----------------------+---------------+--------------------+
|C1                     |Substring_Value|SubstringIndex_Value|
+-----------------------+---------------+--------------------+
|Hello Data Engineering!|Hello          |Hello Data          |
+-----------------------+---------------+--------------------+

REGEXP_REPLACE

Replace all substring of the specified string value that match the pattern with replacement:

regexp_replace(columnName: Column, pattern: Column, replacement: Column)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("Replace_Value",regexp_replace(col("C1"),lit("Hello"),lit("Goodbye")))
  .show(false)

+-----------------------+-------------------------+
|C1                     |Replace_Value            |
+-----------------------+-------------------------+
|Hello Data Engineering!|Goodbye Data Engineering!|
+-----------------------+-------------------------+

INSTR

Locates the position of the first occurrence of the substring in the given string column. Returns 0 if substring can’t be found. The difference with the locate function is that here, the substring is a fixed value for all the dataframe.

instr(columnName: Column, substring: String)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("instr_Value",instr(col("C1"), "Data"))
  .show(false)

+-----------------------+-----------+
|C1                     |instr_Value|
+-----------------------+-----------+
|Hello Data Engineering!|7          |
+-----------------------+-----------+

LOCATE

Locates the position of the first occurrence of the str in the given string column. Returns 0 if substring can’t be found. The difference with the instr function is that here, the str is a Column, so it can be different on each row of the dataframe.

locate(columnName: Column, str: Column)

val data = Seq(("Hello Data Engineering!")).toDF("C1")

data
  .withColumn("locate_Value",locate("Data",col("C1")))
  .show(false)

+-----------------------+------------+
|C1                     |locate_Value|
+-----------------------+------------+
|Hello Data Engineering!|7           |
+-----------------------+------------+

Sources

You can find the source code of these functions here:

spark/sql/functions.scala

String manipulation​

CONCAT_WS​

TRIM / RTRIM / LTRIM​

LPAD / RPAD​

REPEAT​

SPLIT​

TRANSLATE​

OVERLAY​

LENGTH​

String transformation​

ASCII​

BASE64 / UNBASE64​

DECODE / ENCODE​

FORMAT_NUMBER​

FORMAT_STRING​

SOUNDEX​

LEVENSHTEIN​

INITCAP / LOWER / UPPER​

String search​

SUBSTRING / SUBSTRING_INDEX​

REGEXP_REPLACE​

INSTR​

LOCATE​

Sources​