pyspark.sql.SparkSession#
- class pyspark.sql.SparkSession(sparkContext, jsparkSession=None, options={})[source]#
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used to create
DataFrame
, registerDataFrame
as tables, execute SQL over tables, cache tables, and read parquet files. To create aSparkSession
, use the following builder pattern:Changed in version 3.4.0: Supports Spark Connect.
- builder#
Creates a
Builder
for constructing aSparkSession
.Changed in version 3.4.0: Supports Spark Connect.
Examples
Create a Spark session.
>>> spark = ( ... SparkSession.builder ... .master("local") ... .appName("Word Count") ... .config("spark.some.config.option", "some-value") ... .getOrCreate() ... )
Create a Spark session with Spark Connect.
>>> spark = ( ... SparkSession.builder ... .remote("sc://localhost") ... .appName("Word Count") ... .config("spark.some.config.option", "some-value") ... .getOrCreate() ... )
Methods
active
()Returns the active or default
SparkSession
for the current thread, returned by the builder.addArtifact
(*path[, pyfile, archive, file])Add artifact(s) to the client session.
addArtifacts
(*path[, pyfile, archive, file])Add artifact(s) to the client session.
addTag
(tag)Add a tag to be assigned to all the operations started by this thread in this session.
Clear all registered progress handlers.
Clear the current thread's operation tags.
copyFromLocalToFs
(local_path, dest_path)Copy file from local to cloud storage file system.
createDataFrame
(data[, schema, ...])Creates a
DataFrame
from anRDD
, a list, apandas.DataFrame
, anumpy.ndarray
, or apyarrow.Table
.Returns the active
SparkSession
for the current thread, returned by the buildergetTags
()Get the tags that are currently set to be assigned to all the operations started by this thread.
Interrupt all operations of this session currently running on the connected server.
interruptOperation
(op_id)Interrupt an operation of this session with the given operationId.
interruptTag
(tag)Interrupt all operations of this session with the given operation tag.
Returns a new
SparkSession
as new session, that has separate SQLConf, registered temporary views and UDFs, but sharedSparkContext
and table cache.range
(start[, end, step, numPartitions])Create a
DataFrame
with singlepyspark.sql.types.LongType
column namedid
, containing elements in a range fromstart
toend
(exclusive) with step valuestep
.registerProgressHandler
(handler)Register a progress handler to be called when a progress update is received from the server.
removeProgressHandler
(handler)Remove a progress handler that was previously registered.
removeTag
(tag)Remove a tag previously added to be assigned to all the operations started by this thread in this session.
sql
(sqlQuery[, args])Returns a
DataFrame
representing the result of the given query.stop
()Stop the underlying
SparkContext
.table
(tableName)Returns the specified table as a
DataFrame
.Attributes
Creates a
Builder
for constructing aSparkSession
.Interface through which the user may create, drop, alter or query underlying databases, tables, functions, etc.
Gives access to the Spark Connect client.
Runtime configuration interface for Spark.
Returns a
DataSourceRegistration
for data source registration.Returns a
Profile
for performance/memory profiling.Returns a
DataFrameReader
that can be used to read data in as aDataFrame
.Returns a
DataStreamReader
that can be used to read data streams as a streamingDataFrame
.Returns the underlying
SparkContext
.Returns a
StreamingQueryManager
that allows managing all theStreamingQuery
instances active on this context.Returns a
UDFRegistration
for UDF registration.Returns a
UDTFRegistration
for UDTF registration.The version of Spark on which this application is running.