pyspark.sql.SparkSession

class pyspark.sql.SparkSession(sparkContext, jsparkSession=None, options={})

The entry point to programming Spark with the Dataset and DataFrame API.

A SparkSession can be used to create DataFrames, register DataFrames as tables, execute SQL over tables, cache tables, and read Parquet files. To create a SparkSession, use the builder pattern shown in the examples under builder.

Changed in version 3.4.0: Supports Spark Connect.

builder

Creates a Builder for constructing a SparkSession.

Changed in version 3.4.0: Supports Spark Connect.

Examples

Create a Spark session.

>>> spark = (
...     SparkSession.builder
...         .master("local")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )

Create a Spark session with Spark Connect.

>>> spark = (
...     SparkSession.builder
...         .remote("sc://localhost")
...         .appName("Word Count")
...         .config("spark.some.config.option", "some-value")
...         .getOrCreate()
... )  

Methods

active()

Returns the active or default SparkSession for the current thread, returned by the builder.

addArtifact(*path[, pyfile, archive, file])

Add artifact(s) to the client session.

addArtifacts(*path[, pyfile, archive, file])

Add artifact(s) to the client session.

addTag(tag)

Add a tag to be assigned to all the operations started by this thread in this session.

clearProgressHandlers()

Clear all registered progress handlers.

clearTags()

Clear the current thread's operation tags.

copyFromLocalToFs(local_path, dest_path)

Copy a file from the local file system to a cloud storage file system.

createDataFrame(data[, schema, ...])

Creates a DataFrame from an RDD, a list, a pandas.DataFrame, a numpy.ndarray, or a pyarrow.Table.
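As a minimal sketch of createDataFrame, the helper below builds a two-column DataFrame from a list of tuples and a DDL-formatted schema string. The function name `make_people_df` and the sample rows are hypothetical; `spark` is assumed to be an existing SparkSession.

```python
def make_people_df(spark):
    """Build a small DataFrame from a list of tuples (illustrative data)."""
    rows = [(1, "Alice"), (2, "Bob")]  # hypothetical sample rows
    # The schema is passed as a DDL-formatted string; a StructType
    # could be passed instead.
    return spark.createDataFrame(rows, schema="id INT, name STRING")
```

Passing an explicit schema avoids relying on type inference from the first rows.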

getActiveSession()

Returns the active SparkSession for the current thread, returned by the builder.

getTags()

Get the tags that are currently set to be assigned to all the operations started by this thread.

interruptAll()

Interrupt all operations of this session currently running on the connected server.

interruptOperation(op_id)

Interrupt an operation of this session with the given operationId.

interruptTag(tag)

Interrupt all operations of this session with the given operation tag.

newSession()

Returns a new SparkSession that has its own SQLConf, registered temporary views, and UDFs, but shares the SparkContext and table cache with the original session.
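The isolation that newSession provides can be sketched as follows: a temporary view registered in one session is expected to be invisible from a session created with newSession(). The helper name `temp_view_is_isolated` and the view name are hypothetical, and the sketch assumes a PySpark version in which `catalog.tableExists` is available.

```python
def temp_view_is_isolated(spark, view_name="tmp_numbers"):
    """Return True if a temp view created in `spark` is not visible
    from a session obtained through spark.newSession()."""
    # Register a temporary view in the original session.
    spark.range(3).createOrReplaceTempView(view_name)
    # The new session has its own temporary-view registry.
    other = spark.newSession()
    return not other.catalog.tableExists(view_name)
```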

range(start[, end, step, numPartitions])

Create a DataFrame with a single pyspark.sql.types.LongType column named id, containing elements in a range from start to end (exclusive) with step value step.
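To illustrate the semantics of range, the sketch below collects the id column into a Python list. The helper name `collect_ids` is hypothetical; `spark` is assumed to be an existing SparkSession. When only one argument is given, it is treated as the exclusive end and the range starts at 0.

```python
def collect_ids(spark, start, end=None, step=1):
    """Return the values of the `id` column produced by spark.range."""
    df = spark.range(start, end, step)
    # Each collected Row exposes the column as the attribute `id`.
    return [row.id for row in df.collect()]
```

With a live session, `collect_ids(spark, 1, 7, 2)` should return `[1, 3, 5]`, the same values as Python's `list(range(1, 7, 2))`.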

registerProgressHandler(handler)

Register a progress handler to be called when a progress update is received from the server.

removeProgressHandler(handler)

Remove a progress handler that was previously registered.

removeTag(tag)

Remove a tag previously added to be assigned to all the operations started by this thread in this session.
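The tag methods (addTag, removeTag, getTags, clearTags, interruptTag) are typically used together. One possible pattern, shown here as a sketch and not as an official API, is a context manager that scopes a tag to a block of work. The name `operation_tag` is hypothetical, and the sketch assumes a PySpark version and session type that support the tag methods listed on this page.

```python
from contextlib import contextmanager


@contextmanager
def operation_tag(spark, tag):
    """Tag operations started by this thread inside the `with` block."""
    spark.addTag(tag)
    try:
        yield tag
    finally:
        # Always remove the tag so later operations are not tagged
        # by accident, even if the body raised an exception.
        spark.removeTag(tag)
```

Work started inside `with operation_tag(spark, "nightly-etl"):` can then be cancelled from elsewhere with `spark.interruptTag("nightly-etl")`.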

sql(sqlQuery[, args])

Returns a DataFrame representing the result of the given query.
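The optional args parameter of sql lets a query use parameter markers instead of string formatting. The sketch below uses a named marker; the helper name `ids_above` and the query are illustrative, and `spark` is assumed to be an existing SparkSession on a version that supports parameterized queries.

```python
def ids_above(spark, min_id):
    """Run a parameterized query using a named parameter marker."""
    query = "SELECT id FROM range(10) WHERE id > :min_id"
    # `args` maps marker names to Python values, so the value is bound
    # by Spark and never concatenated into the SQL text.
    return spark.sql(query, args={"min_id": min_id})
```

With a live session, `ids_above(spark, 7)` should return a DataFrame containing the ids 8 and 9.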

stop()

Stop the underlying SparkContext.

table(tableName)

Returns the specified table as a DataFrame.

Attributes

builder

Creates a Builder for constructing a SparkSession.

catalog

Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.

client

Gives access to the Spark Connect client.

conf

Runtime configuration interface for Spark.
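As a small sketch of the runtime configuration interface, the helper below reads the current value of a SQL option and then overrides it. The function name `set_shuffle_partitions` is hypothetical; it relies only on `conf.get` and `conf.set`, and `spark` is assumed to be an existing SparkSession.

```python
def set_shuffle_partitions(spark, n):
    """Set spark.sql.shuffle.partitions and return the previous value."""
    key = "spark.sql.shuffle.partitions"
    previous = spark.conf.get(key)
    # Pass the new value as a string for consistency with what get() returns.
    spark.conf.set(key, str(n))
    return previous
```

Returning the previous value lets the caller restore it once a job has finished.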

dataSource

Returns a DataSourceRegistration for data source registration.

profile

Returns a Profile for performance/memory profiling.

read

Returns a DataFrameReader that can be used to read data in as a DataFrame.
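The reader returned by read is configured by chaining calls before a format-specific load method. A minimal sketch, assuming a CSV file with a header row; the helper name `read_csv_with_header` and the path argument are hypothetical.

```python
def read_csv_with_header(spark, path):
    """Load a CSV file, treating its first line as column names."""
    # option() returns the reader itself, so calls can be chained.
    return spark.read.option("header", True).csv(path)
```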

readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.

sparkContext

Returns the underlying SparkContext.

streams

Returns a StreamingQueryManager that allows managing all the StreamingQuery instances active on this context.

udf

Returns a UDFRegistration for UDF registration.
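For example, a plain Python function can be registered through udf so that it becomes callable from SQL. This is a sketch: the UDF name `str_len` and the helper `register_str_len` are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
def register_str_len(spark):
    """Register a Python function as a SQL-callable UDF named `str_len`."""

    def str_len(s):
        # Map a NULL input to a NULL output instead of raising.
        return None if s is None else len(s)

    # The return type is given as a DDL type string.
    return spark.udf.register("str_len", str_len, "int")
```

After registration, the function can be used in a query such as `spark.sql("SELECT str_len('spark')")`.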

udtf

Returns a UDTFRegistration for UDTF registration.

version

The version of Spark on which this application is running.