pyspark.pandas.DataFrame
class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
- pandas-on-Spark DataFrame that logically corresponds to a pandas DataFrame. It holds a Spark DataFrame internally.
- Variables
- _internal – an internal immutable Frame to manage metadata.
- Parameters
- data: numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame or pandas-on-Spark Series
- Dict can contain Series, arrays, constants, or list-like objects. If data is a dict, argument order is maintained for Python 3.6 and later. Note that if data is a pandas DataFrame, a Spark DataFrame, or a pandas-on-Spark Series, other arguments should not be used.
- index: Index or array-like
- Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
- columns: Index or array-like
- Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
- dtype: dtype, default None
- Data type to force. Only a single dtype is allowed. If None, the dtype is inferred.
- copy: boolean, default False
- Copy data from inputs. Only affects DataFrame / 2d ndarray input.
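The defaulting rules for index and columns described above can be modelled in plain Python. This is only an illustrative sketch, not library code: `default_labels` is a hypothetical helper that is not part of pyspark.pandas, and it assumes the default labels are zero-based positions, as a RangeIndex would produce.

```python
def default_labels(n_rows, n_cols, index=None, columns=None):
    """Hypothetical helper modelling the documented label defaults.

    Not part of pyspark.pandas; it only mirrors the rule that missing
    labels fall back to zero-based positions.
    """
    # No index supplied: use positions 0..n_rows-1, like a RangeIndex.
    row_labels = list(index) if index is not None else list(range(n_rows))
    # No column labels supplied: use positions 0..n_cols-1.
    col_labels = list(columns) if columns is not None else list(range(n_cols))
    return row_labels, col_labels


# Two rows, explicit column labels, default row labels.
rows, cols = default_labels(2, 2, columns=["col1", "col2"])
print(rows, cols)
```

With explicit column labels and no index, the row labels come out as positions and the column labels are passed through unchanged.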
 
- Examples
- Constructing DataFrame from a dictionary.

>>> import numpy as np
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4

- Constructing DataFrame from pandas DataFrame

>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4

- Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

- To enforce a single dtype:

>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

- Constructing DataFrame from numpy ndarray:

>>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2
   a  b  c  d  e
0  3  1  4  9  8
1  4  8  4  8  4
2  7  6  5  6  7
3  8  7  9  1  0
4  2  5  4  3  9

- Methods
- abs(): Return a Series/DataFrame with absolute numeric value of each element.
- add(other): Get Addition of dataframe and other, element-wise (binary operator +).
- add_prefix(prefix): Prefix labels with string prefix.
- add_suffix(suffix): Suffix labels with string suffix.
- agg(func): Aggregate using one or more operations over the specified axis.
- aggregate(func): Aggregate using one or more operations over the specified axis.
- align(other[, join, axis, copy]): Align two objects on their axes with the specified join method.
- all([axis]): Return whether all elements are True.
- any([axis]): Return whether any element is True.
- append(other[, ignore_index, …]): Append rows of other to the end of caller, returning a new object.
- apply(func[, axis, args]): Apply a function along an axis of the DataFrame.
- applymap(func): Apply a function to a DataFrame elementwise.
- assign(**kwargs): Assign new columns to a DataFrame.
- astype(dtype): Cast a pandas-on-Spark object to a specified dtype `dtype`.
- at_time(time[, asof, axis]): Select values at particular time of day (example: 9:30AM).
- backfill([axis, inplace, limit]): Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- between_time(start_time, end_time[, …]): Select values between particular times of the day (example: 9:00-9:30 AM).
- bfill([axis, inplace, limit]): Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.
- bool(): Return the bool of a single element in the current object.
- clip([lower, upper]): Trim values at input threshold(s).
- copy([deep]): Make a copy of this object’s indices and data.
- corr([method]): Compute pairwise correlation of columns, excluding NA/null values.
- count([axis, numeric_only]): Count non-NA cells for each column.
- cummax([skipna]): Return cumulative maximum over a DataFrame or Series axis.
- cummin([skipna]): Return cumulative minimum over a DataFrame or Series axis.
- cumprod([skipna]): Return cumulative product over a DataFrame or Series axis.
- cumsum([skipna]): Return cumulative sum over a DataFrame or Series axis.
- describe([percentiles]): Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
- diff([periods, axis]): First discrete difference of element.
- div(other): Get Floating division of dataframe and other, element-wise (binary operator /).
- divide(other): Get Floating division of dataframe and other, element-wise (binary operator /).
- dot(other): Compute the matrix multiplication between the DataFrame and other.
- drop([labels, axis, columns]): Drop specified labels from columns.
- drop_duplicates([subset, keep, inplace]): Return DataFrame with duplicate rows removed, optionally only considering certain columns.
- droplevel(level[, axis]): Return DataFrame with requested index / column level(s) removed.
- dropna([axis, how, thresh, subset, inplace]): Remove missing values.
- duplicated([subset, keep]): Return boolean Series denoting duplicate rows, optionally only considering certain columns.
- eq(other): Compare if the current value is equal to the other.
- equals(other): Compare if the current value is equal to the other.
- eval(expr[, inplace]): Evaluate a string describing operations on DataFrame columns.
- expanding([min_periods]): Provide expanding transformations.
- explode(column): Transform each element of a list-like to a row, replicating index values.
- ffill([axis, inplace, limit]): Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- fillna([value, method, axis, inplace, limit]): Fill NA/NaN values.
- filter([items, like, regex, axis]): Subset rows or columns of dataframe according to labels in the specified index.
- first(offset): Select first periods of time series data based on a date offset.
- Retrieves the index of the first valid value.
- floordiv(other): Get Integer division of dataframe and other, element-wise (binary operator //).
- from_dict(data[, orient, dtype, columns]): Construct DataFrame from dict of array-like or dicts.
- from_records(data[, index, exclude, …]): Convert structured or record ndarray to DataFrame.
- ge(other): Compare if the current value is greater than or equal to the other.
- get(key[, default]): Get item from object for given key (DataFrame column, Panel slice, etc.).
- get_dtype_counts(): Return counts of unique dtypes in this object.
- groupby(by[, axis, as_index, dropna]): Group DataFrame or Series using a Series of columns.
- gt(other): Compare if the current value is greater than the other.
- head([n]): Return the first n rows.
- hist([bins]): Draw one histogram of the DataFrame’s columns.
- idxmax([axis]): Return index of first occurrence of maximum over requested axis.
- idxmin([axis]): Return index of first occurrence of minimum over requested axis.
- info([verbose, buf, max_cols, null_counts]): Print a concise summary of a DataFrame.
- insert(loc, column, value[, allow_duplicates]): Insert column into DataFrame at specified location.
- isin(values): Whether each element in the DataFrame is contained in values.
- isna(): Detects missing values for items in the current DataFrame.
- isnull(): Detects missing values for items in the current DataFrame.
- items(): This is an alias of iteritems.
- Iterator over (column name, Series) pairs.
- iterrows(): Iterate over DataFrame rows as (index, Series) pairs.
- itertuples([index, name]): Iterate over DataFrame rows as namedtuples.
- join(right[, on, how, lsuffix, rsuffix]): Join columns of another DataFrame.
- kde([bw_method, ind]): Generate Kernel Density Estimate plot using Gaussian kernels.
- keys(): Return alias for columns.
- kurt([axis, numeric_only]): Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
- kurtosis([axis, numeric_only]): Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).
- last(offset): Select final periods of time series data based on a date offset.
- Return index for last non-NA/null value.
- le(other): Compare if the current value is less than or equal to the other.
- lt(other): Compare if the current value is less than the other.
- mad([axis]): Return the mean absolute deviation of values.
- mask(cond[, other]): Replace values where the condition is True.
- max([axis, numeric_only]): Return the maximum of the values.
- mean([axis, numeric_only]): Return the mean of the values.
- median([axis, numeric_only, accuracy]): Return the median of the values for the requested axis.
- melt([id_vars, value_vars, var_name, value_name]): Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
- merge(right[, how, on, left_on, right_on, …]): Merge DataFrame objects with a database-style join.
- min([axis, numeric_only]): Return the minimum of the values.
- mod(other): Get Modulo of dataframe and other, element-wise (binary operator %).
- mul(other): Get Multiplication of dataframe and other, element-wise (binary operator *).
- multiply(other): Get Multiplication of dataframe and other, element-wise (binary operator *).
- ne(other): Compare if the current value is not equal to the other.
- nlargest(n, columns): Return the first n rows ordered by columns in descending order.
- notna(): Detects non-missing values for items in the current DataFrame.
- notnull(): Detects non-missing values for items in the current DataFrame.
- nsmallest(n, columns): Return the first n rows ordered by columns in ascending order.
- nunique([axis, dropna, approx, rsd]): Return number of unique elements in the object.
- pad([axis, inplace, limit]): Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.
- pct_change([periods]): Percentage change between the current and a prior element.
- pipe(func, *args, **kwargs): Apply func(self, *args, **kwargs).
- pivot([index, columns, values]): Return reshaped DataFrame organized by given index / column values.
- pivot_table([values, index, columns, …]): Create a spreadsheet-style pivot table as a DataFrame.
- pop(item): Return item and drop from frame.
- pow(other): Get Exponential power of dataframe and other, element-wise (binary operator **).
- prod([axis, numeric_only, min_count]): Return the product of the values.
- product([axis, numeric_only, min_count]): Return the product of the values.
- quantile([q, axis, numeric_only, accuracy]): Return value at the given quantile.
- query(expr[, inplace]): Query the columns of a DataFrame with a boolean expression.
- radd(other): Get Addition of dataframe and other, element-wise (binary operator +).
- rank([method, ascending]): Compute numerical data ranks (1 through n) along axis.
- rdiv(other): Get Floating division of dataframe and other, element-wise (binary operator /).
- reindex([labels, index, columns, axis, …]): Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
- reindex_like(other[, copy]): Return a DataFrame with matching indices as other object.
- rename([mapper, index, columns, axis, …]): Alter axes labels.
- rename_axis([mapper, index, columns, axis, …]): Set the name of the axis for the index or columns.
- replace([to_replace, value, inplace, limit, …]): Returns a new DataFrame replacing a value with another value.
- reset_index([level, drop, inplace, …]): Reset the index, or a level of it.
- rfloordiv(other): Get Integer division of dataframe and other, element-wise (binary operator //).
- rmod(other): Get Modulo of dataframe and other, element-wise (binary operator %).
- rmul(other): Get Multiplication of dataframe and other, element-wise (binary operator *).
- rolling(window[, min_periods]): Provide rolling transformations.
- round([decimals]): Round a DataFrame to a variable number of decimal places.
- rpow(other): Get Exponential power of dataframe and other, element-wise (binary operator **).
- rsub(other): Get Subtraction of dataframe and other, element-wise (binary operator -).
- rtruediv(other): Get Floating division of dataframe and other, element-wise (binary operator /).
- sample([n, frac, replace, random_state]): Return a random sample of items from an axis of object.
- select_dtypes([include, exclude]): Return a subset of the DataFrame’s columns based on the column dtypes.
- sem([axis, ddof, numeric_only]): Return unbiased standard error of the mean over requested axis.
- set_index(keys[, drop, append, inplace]): Set the DataFrame index (row labels) using one or more existing columns.
- shift([periods, fill_value]): Shift DataFrame by desired number of periods.
- skew([axis, numeric_only]): Return unbiased skew normalized by N-1.
- sort_index([axis, level, ascending, …]): Sort object by labels (along an axis).
- sort_values(by[, ascending, inplace, …]): Sort by the values along either axis.
- squeeze([axis]): Squeeze 1 dimensional axis objects into scalars.
- stack(): Stack the prescribed level(s) from columns to index.
- std([axis, ddof, numeric_only]): Return sample standard deviation.
- sub(other): Get Subtraction of dataframe and other, element-wise (binary operator -).
- subtract(other): Get Subtraction of dataframe and other, element-wise (binary operator -).
- sum([axis, numeric_only, min_count]): Return the sum of the values.
- swapaxes(i, j[, copy]): Interchange axes and swap values axes appropriately.
- swaplevel([i, j, axis]): Swap levels i and j in a MultiIndex on a particular axis.
- tail([n]): Return the last n rows.
- take(indices[, axis]): Return the elements in the given positional indices along an axis.
- to_clipboard([excel, sep]): Copy object to the system clipboard.
- to_csv([path, sep, na_rep, columns, header, …]): Write object to a comma-separated values (csv) file.
- to_delta(path[, mode, partition_cols, index_col]): Write the DataFrame out as a Delta Lake table.
- to_dict([orient, into]): Convert the DataFrame to a dictionary.
- to_excel(excel_writer[, sheet_name, na_rep, …]): Write object to an Excel sheet.
- to_html([buf, columns, col_space, header, …]): Render a DataFrame as an HTML table.
- to_json([path, compression, num_files, …]): Convert the object to a JSON string.
- to_latex([buf, columns, col_space, header, …]): Render an object to a LaTeX tabular environment table.
- to_markdown([buf, mode]): Print Series or DataFrame in Markdown-friendly format.
- to_numpy(): A NumPy ndarray representing the values in this DataFrame or Series.
- to_orc(path[, mode, partition_cols, index_col]): Write the DataFrame out as an ORC file or directory.
- Return a pandas DataFrame.
- to_parquet(path[, mode, partition_cols, …]): Write the DataFrame out as a Parquet file or directory.
- to_records([index, column_dtypes, index_dtypes]): Convert DataFrame to a NumPy record array.
- to_spark([index_col]): Spark related features.
- to_spark_io([path, format, mode, …]): Write the DataFrame out to a Spark data source.
- to_string([buf, columns, col_space, header, …]): Render a DataFrame to a console-friendly tabular output.
- to_table(name[, format, mode, …]): Write the DataFrame into a Spark table.
- transform(func[, axis]): Call `func` on self producing a Series with transformed values and that has the same length as its input.
- Transpose index and columns.
- truediv(other): Get Floating division of dataframe and other, element-wise (binary operator /).
- truncate([before, after, axis, copy]): Truncate a Series or DataFrame before and after some index value.
- unstack(): Pivot the (necessarily hierarchical) index labels.
- update(other[, join, overwrite]): Modify in place using non-NA values from another DataFrame.
- var([axis, ddof, numeric_only]): Return unbiased variance.
- where(cond[, other, axis]): Replace values where the condition is False.
- xs(key[, axis, level]): Return cross-section from the DataFrame.
- Attributes
- Transpose index and columns.
- Access a single value for a row/column label pair.
- Return a list representing the axes of the DataFrame.
- The column labels of the DataFrame.
- Return the dtypes in the DataFrame.
- Returns true if the current DataFrame is empty.
- Access a single value for a row/column pair by integer position.
- Purely integer-location based indexing for selection by position.
- The index (row labels) Column of the DataFrame.
- Access a group of rows and columns by label(s) or a boolean Series.
- Return an int representing the number of array dimensions.
- Return a tuple representing the dimensionality of the DataFrame.
- Return an int representing the number of elements in this object.
- Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.
- Return a NumPy representation of the DataFrame or the Series.
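The `where` and `mask` entries in the method list are complements of each other: `where` replaces values where the condition is False, while `mask` replaces values where it is True. The sketch below models that on a plain Python list standing in for one column. It is not the library implementation: both functions are hypothetical stand-ins, and the `None` default for `other` is an assumption made for this example only.

```python
def where(values, cond, other=None):
    """Keep each value whose condition is True; substitute `other` elsewhere."""
    return [v if c else other for v, c in zip(values, cond)]


def mask(values, cond, other=None):
    """Substitute `other` for each value whose condition is True; keep the rest."""
    return [other if c else v for v, c in zip(values, cond)]


column = [1, -2, 3]
is_positive = [v > 0 for v in column]

print(where(column, is_positive, 0))  # [1, 0, 3]
print(mask(column, is_positive, 0))   # [0, -2, 0]
```

Calling `mask` with a condition gives the same result as calling `where` with the negated condition, which is a convenient way to remember which of the two to reach for.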