pyspark.SparkContext.pickleFile
- SparkContext.pickleFile(name, minPartitions=None)
Load an RDD previously saved using the RDD.saveAsPickleFile() method.

New in version 1.1.0.
- Parameters
- name : str
directory containing the input data files; a comma-separated list of paths can be given to read several inputs at once
- minPartitions : int, optional
suggested minimum number of partitions for the resulting RDD
- Returns
RDD
RDD representing unpickled data from the file(s).
See also
RDD.saveAsPickleFile()
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="pickleFile") as d:
...     # Write a temporary pickled file
...     path1 = os.path.join(d, "pickled1")
...     sc.parallelize(range(10)).saveAsPickleFile(path1, 3)
...
...     # Write another temporary pickled file
...     path2 = os.path.join(d, "pickled2")
...     sc.parallelize(range(-10, -5)).saveAsPickleFile(path2, 3)
...
...     # Load the pickled files individually
...     collected1 = sorted(sc.pickleFile(path1, 3).collect())
...     collected2 = sorted(sc.pickleFile(path2, 4).collect())
...
...     # Load two pickled files together
...     collected3 = sorted(sc.pickleFile('{},{}'.format(path1, path2), 5).collect())

>>> collected1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> collected2
[-10, -9, -8, -7, -6]
>>> collected3
[-10, -9, -8, -7, -6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
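Because name takes several inputs as one comma-separated string, it can be convenient to build that string from a list of paths. The following is a minimal sketch, not part of the PySpark API: join_input_paths is a hypothetical helper, and the /tmp paths are placeholders. It conservatively assumes that any comma inside a single path would be read as a separator, so it rejects such paths rather than trying to escape them.

```python
import os


def join_input_paths(paths):
    """Hypothetical helper: build the comma-separated string for `name`."""
    # Accept str or os.PathLike entries and trim surrounding whitespace.
    cleaned = [os.fspath(p).strip() for p in paths]
    for p in cleaned:
        if not p:
            raise ValueError("empty input path")
        # Assumption: a comma inside one path would be treated as a separator.
        if "," in p:
            raise ValueError(f"path contains a comma: {p!r}")
    return ",".join(cleaned)


# Placeholder paths, for illustration only.
name = join_input_paths(["/tmp/pickled1", "/tmp/pickled2"])
print(name)

# With an existing SparkContext `sc` (not created here), the result could
# then be passed along as:
#   rdd = sc.pickleFile(name, minPartitions=5)
```

The helper only formats the argument. Whether each path exists is checked by Spark when the RDD is actually loaded.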