pyspark.SparkContext.pickleFile

SparkContext.pickleFile(name, minPartitions=None)

Load an RDD previously saved using the RDD.saveAsPickleFile() method.

New in version 1.1.0.

Parameters
name : str

path to the directory containing the input data files; several input paths can be passed as a single comma-separated string

minPartitions : int, optional

suggested minimum number of partitions for the resulting RDD

Returns
RDD

RDD representing unpickled data from the file(s).

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="pickleFile") as d:
...     # Write a temporary pickled file
...     path1 = os.path.join(d, "pickled1")
...     sc.parallelize(range(10)).saveAsPickleFile(path1, 3)
...
...     # Write another temporary pickled file
...     path2 = os.path.join(d, "pickled2")
...     sc.parallelize(range(-10, -5)).saveAsPickleFile(path2, 3)
...
...     # Load each pickled file
...     collected1 = sorted(sc.pickleFile(path1, 3).collect())
...     collected2 = sorted(sc.pickleFile(path2, 4).collect())
...
...     # Load the two pickled files together
...     collected3 = sorted(sc.pickleFile('{},{}'.format(path1, path2), 5).collect())
>>> collected1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> collected2
[-10, -9, -8, -7, -6]
>>> collected3
[-10, -9, -8, -7, -6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]