Stemgraphic Modules¶
stemgraphic¶
stem_graphic.
Package implementing a complete toolkit for text and graphical stem-and-leaf plots and other visualizations adapted to stem-and-leaf pair values, such as heatmaps and sunburst charts.
It also handles very large data sets through scaling, sampling, trimming and other techniques.
See research paper ( http://artchiv.es/pydata2016/stemgraphic ) for more technical details.
A command-line utility is installed along with the package to process Excel or CSV files. See: stem -h
aliases
¶
Handy aliases for stem_graphic options.
-
stemgraphic.aliases.
stem_hist
(x, aggregation=False, alpha=1, asc=True, column=None, color='b', delimiter_color='r', display=300, flip_axes=True, legend_pos='short', outliers=False, trim=False)¶ stem_hist.
stem_hist builds a graphical histogram matching the stem-and-leaf plot, with the numbers hidden, as shown on the cover of the companion brochure.
- Parameters
legend_pos –
x – list, numpy array, time series, pandas or dask dataframe
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 1 (fully opaque)
asc – stem sorted in ascending order, defaults to True
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
color – the bar facecolor
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – X becomes Y and Y becomes X
outliers – this is NOP, for compatibility
trim – this is NOP, for compatibility
- Returns
matplotlib figure and axes instance
-
stemgraphic.aliases.
stem_kde
(x, **kw_args)¶ stem_kde builds a stem-and-leaf plot and adds an overlaid KDE as a secondary plot.
- Parameters
x – list, numpy array, time series, pandas or dask dataframe
kw_args –
- Returns
matplotlib figure and axes instance
-
stemgraphic.aliases.
stem_line
(x, aggregation=False, alpha=0, asc=True, column=None, color='k', delimiter_color='r', display=300, flip_axes=True, outliers=False, secondary_plot=None, trim=False)¶ stem_line builds a stem-and-leaf plot with lines instead of bars.
- Parameters
x – list, numpy array, time series, pandas or dask dataframe
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 0 (fully transparent)
asc – stem sorted in ascending order, defaults to True
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
color – the color of the line
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – X becomes Y and Y becomes X
outliers –
secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None
trim – this is NOP, for compatibility
- Returns
matplotlib figure and axes instance
-
stemgraphic.aliases.
stem_symmetric_dot
(x, **kw_args)¶ stem_symmetric_dot.
stem_symmetric_dot builds a symmetric stem dot plot
Example:
stem_symmetric_dot(diamonds.price)
Output:
326
¡
 0 | ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
 1 | ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
 2 | ●●●●●●●●●●●●●●●●●●●●●●●●●●
 3 | ●●●●●●●●●●●●●●●●●●●●●●●●●●●●
 4 | ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
 5 | ●●●●●●●●●●●●●●●●●●●●●●●
 6 | ●●●●●●●●●●●●●●●●●●●●
 7 | ●●●●●●●●●●●●●●●
 8 | ●●●
 9 | ●●●●●●●●
10 | ●●●●
11 | ●●●●●●●
12 | ●●●●●●
13 | ●●
14 | ●●
15 | ●●●
16 | ●●●
17 | ●●●●●●●●
18 | ●
!
18823
Scale: 18|6 => 18.6x1000 = 18600.0
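The scale line at the end of the output describes how a stem and leaf pair maps back to a value. The arithmetic can be sketched as follows, assuming a scale of 1000 and a single leaf digit. `split_value` and `rebuild_value` are hypothetical helpers written for illustration; they are not part of stemgraphic, and truncating (rather than rounding) the leaf is an assumption.

```python
def split_value(value, scale):
    """Split a number into (stem, leaf) for a given scale."""
    stem = int(value // scale)
    # The leaf is the first digit after the stem (assumed truncated, not rounded).
    leaf = int((value % scale) * 10 // scale)
    return stem, leaf


def rebuild_value(stem, leaf, scale):
    """Map a stem|leaf pair back to the approximate value it represents."""
    return stem * scale + leaf * scale / 10


stem, leaf = split_value(18600, 1000)
print(f"{stem}|{leaf} => {rebuild_value(stem, leaf, 1000)}")  # 18|6 => 18600.0
```

With the same scale, the maximum shown in the output (18823) falls on stem 18 with leaf 8, so the plot keeps only the leading digits of each value.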
- Parameters
x – list, numpy array, time series, pandas or dask dataframe
kw_args – keyword args to stem_dot
- Returns
alpha
¶
stemgraphic.alpha.
BRAND NEW in V.0.5.0!
Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. alpha is a module of the stemgraphic package to add support for categorical and text variables.
The module also adds functionality to handle whole words, besides stem-and-leaf bigrams and n-grams.
For example, for the word “alabaster”:
With word_ functions, we can look at the word frequency in a text, or compare the word to other words in a corpus through a distance function (Levenshtein by default).
With stem_ functions, we can look at the fundamental stem-and-leaf pair: the stem would be ‘a’ and the leaf would be ‘l’, for the bigram ‘al’. With a stem_order of 1 and a leaf_order of 2, we would have ‘a’ and ‘la’, for the trigram ‘ala’, and so on.
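The “alabaster” example can be reproduced with plain string slicing. This is a minimal sketch of the idea only: `split_ngram` is a hypothetical helper, not a stemgraphic function, and it ignores options such as stem_skip and leaf_skip.

```python
def split_ngram(word, stem_order=1, leaf_order=1):
    """Return (stem, leaf) taken from the start of a word."""
    stem = word[:stem_order]
    leaf = word[stem_order:stem_order + leaf_order]
    return stem, leaf


for stem_order, leaf_order in [(1, 1), (1, 2)]:
    stem, leaf = split_ngram("alabaster", stem_order, leaf_order)
    print(f"stem={stem!r} leaf={leaf!r} ngram={stem + leaf!r}")
```

The first iteration yields the bigram ‘al’ and the second the trigram ‘ala’, matching the description above.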
-
stemgraphic.alpha.
add_missing_letters
(mat, stem_order, leaf_order, letters=None)¶ Add missing stems based on LETTERS, which defaults to the a-z alphabet.
- Parameters
mat – matrix to modify
stem_order – how many stem characters per data point to display, defaults to 1
leaf_order – how many leaf characters per data point to display, defaults to 1
letters – letters that must be present as stems
- Returns
the modified matrix
-
stemgraphic.alpha.
heatmap
(src, alpha_only=False, annotate=False, asFigure=False, ax=None, caps=False, compact=True, display=None, flip_axes=False, interactive=True, leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None, trim=None)¶ heatmap.
The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘a’ and Column ‘b’ would have the count of words starting with ‘ab’. The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
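The count matrix behind such a heatmap can be sketched with the standard library alone. The helper `bigram_counts` and the word list are made up for illustration; the library's own ingestion (sampling, stop words, skips) is not reproduced here.

```python
from collections import Counter
import string


def bigram_counts(words):
    """Count (first letter, second letter) pairs over a list of words."""
    letters = string.ascii_lowercase
    counts = Counter()
    for word in words:
        word = word.lower()
        if len(word) >= 2 and word[0] in letters and word[1] in letters:
            counts[(word[0], word[1])] += 1
    return counts


words = ["about", "able", "abacus", "acorn", "bat", "bar", "be"]
counts = bigram_counts(words)

stems = sorted({stem for stem, _ in counts})
leaves = sorted({leaf for _, leaf in counts})

# Print the matrix: one row per stem, one column per leaf.
print("  " + " ".join(leaves))
for stem in stems:
    print(stem + " " + " ".join(str(counts[(stem, leaf)]) for leaf in leaves))
```

Row ‘a’, column ‘b’ holds 3 here because three of the sample words start with ‘ab’.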
- Parameters
src – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet
annotate – display annotations (Z) on heatmap
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive
compact – remove empty stems
display – maximum number of data points to display, forces sampling if smaller than len(df)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
trim – for compatibility
- Returns
-
stemgraphic.alpha.
heatmap_grid
(src1, src2, src3=None, src4=None, alpha_only=True, annot=False, caps=False, center=0, cmap=None, display=1000, leaf_order=1, leaf_skip=0, random_state=None, reverse=False, robust=False, stem_order=1, stem_skip=0, stop_words=None, threshold=0)¶ heatmap_grid.
With stem_graphic, it is possible to directly compare two different sources. In the case of a heatmap, two different data sets cannot be visualized directly on a single heatmap. For this task, we designed heatmap_grid to adapt to the number of sources to build a layout. It can take from 2 to 4 different sources.
With 2 sources, a square grid will be generated, allowing for horizontal and vertical comparisons, with an extra heatmap showing the difference between the two matrices. It also computes a norm for that difference matrix. The smaller the value, the closer the two heatmaps are.
With 3 sources, it builds a triangular grid, with each source heatmap in a corner and the difference between each pair in between.
Finally, with 4 sources, a 3 x 3 grid is built, each source in a corner and the difference between each pair in between, with the center expressing the difference between top left and bottom right diagonal.
- Parameters
src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
src4 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)
alpha_only – only use stems from a-z alphabet
annot – display annotations (Z) on heatmap
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.
center – the center of the divergent color map for the difference heatmaps
cmap – color map for difference heatmap or None (default) to use the builtin red / blue divergent map
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
robust – reduce effect of outliers on difference heatmap
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
threshold – absolute value minimum count difference for a difference heatmap element to be visible
- Returns
-
stemgraphic.alpha.
heatmatrix
(src, alpha_only=False, caps=False, charset=None, column=None, compact=True, display=None, flip_axes=None, leaf_order=1, leaf_skip=0, outliers=None, persistence=None, random_state=None, scale=None, stem_order=1, stem_skip=0, stop_words=None, trim=None, trim_blank=None, unit='', zero_blank=True, zoom=None)¶ heatmatrix.
The heatmatrix displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘a’ and Column ‘b’ would have the count of words starting with ‘ab’. The heatmatrix is useful to look at patterns. For distribution, stem_graphic is better suited.
- Parameters
src – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet
caps – bool, True to be case sensitive
charset –
column – specify which column (string or number) of the dataframe to use, else the first is selected
compact – remove empty stems
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – wide format
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
outliers – for compatibility with other text plots
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zero_blank – replace zero digit with space
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
count matrix, scale
-
stemgraphic.alpha.
matrix_difference
(mat1, mat2, thresh=0, ord=None)¶ matrix_difference.
- Parameters
mat1 – first heatmap dataframe
mat2 – second heatmap dataframe
thresh – absolute value minimum count difference for a difference heatmap element to be visible
- Returns
difference matrix, norm and ratio of the sum of the first matrix over the second
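As a rough illustration of the three returned values, the sketch below works on plain nested lists. It is not the library's implementation: the function name is hypothetical, the norm is assumed to be the Frobenius norm, and applying the threshold before computing the norm is also an assumption.

```python
import math


def matrix_difference_sketch(mat1, mat2, thresh=0):
    """Difference of two equally shaped count matrices, with small entries hidden."""
    diff = []
    for row1, row2 in zip(mat1, mat2):
        row = []
        for a, b in zip(row1, row2):
            d = a - b
            # Entries whose absolute difference is below the threshold are zeroed.
            row.append(d if abs(d) >= thresh else 0)
        diff.append(row)
    # Frobenius norm: square root of the sum of squared entries.
    norm = math.sqrt(sum(d * d for row in diff for d in row))
    # Ratio of the sum of the first matrix over the sum of the second.
    ratio = sum(map(sum, mat1)) / sum(map(sum, mat2))
    return diff, norm, ratio


mat1 = [[3, 1], [2, 0]]
mat2 = [[1, 1], [0, 4]]
diff, norm, ratio = matrix_difference_sketch(mat1, mat2)
print(diff, round(norm, 3), ratio)
```

A smaller norm means the two count matrices are closer, which is how heatmap_grid describes the comparison of two heatmaps.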
-
stemgraphic.alpha.
ngram_data
(df, alpha_only=False, ascending=True, binary=False, break_on=None, caps=False, char_filter=None, column=None, compact=False, display=750, leaf_order=1, leaf_skip=0, persistence=None, random_state=None, remove_accents=False, reverse=False, rows_only=True, sort_by='len', stem_order=1, stem_skip=0, stop_words=None)¶ ngram_data.
This is the main text ingestion function for stemgraphic.alpha. It is used by most of the visualizations. It can also be used directly, to feed a pipeline, for example.
If selected (rows_only=False), the returned dataframe includes in each row a single word, the stem, the leaf and the ngram (stem + leaf) - the index is the ‘token’ position in the original source:
      word   stem  leaf  ngram
12    salut  s     a     sa
13    chéri  c     h     ch
- Parameters
df – list, numpy array, series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – bool if the sort is ascending
binary – bool if True forces counts to 1 for anything greater than 0
break_on – letter on which to break a row, or None (default)
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
char_filter – list of characters to ignore. If None (default) CHAR_FILTER list will be used
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
compact – remove empty stems
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
rows_only – bool, by default returns only the stem and leaf rows. If False, also returns the matrix and dataframe
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
- Returns
ordered rows if rows_only, else also returns the matrix and dataframe
-
stemgraphic.alpha.
plot_sunburst_level
(normalized, ax, label=True, level=0, offset=0, ngram=False, plot=True, stem=None, vis=0)¶ plot_sunburst_level.
Utility function for the sunburst function.
- Parameters
normalized –
ax –
label –
level –
ngram –
offset –
plot –
stem –
vis –
- Returns
-
stemgraphic.alpha.
polar_word_plot
(ax, word, words, label, min_dist, max_dist, metric, offset, step)¶ polar_word_plot.
Utility function for radar plot.
- Parameters
ax – matplotlib ax
word – string, the reference word that will be placed in the middle
words – list of words to compare
label – bool if True display words centered at coordinate
min_dist – minimum distance based on metric to include a word for display
max_dist – maximum distance for a given section
metric – any metric function accepting two values and returning that metric in a range from 0 to x
offset – where to start plotting in degrees
step – how many degrees to step between plots
- Returns
-
stemgraphic.alpha.
radar
(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')¶ radar.
The radar plot compares a reference word with a corpus. By default, it calculates the Levenshtein distance between the reference word and each word in the corpus. An alternate distance or metric function can be provided. Each word is then plotted around the center based on 3 criteria.
If the word length is longer, it is plotted on the left side, else on the right side.
Distance from center is based on the distance function.
The words are equidistant, and their order is defined alphabetically or by count (only applicable if the corpus is a text and not a list of unique words, such as a password dictionary).
Stem-and-leaf support is upcoming.
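The placement criteria can be sketched without any plotting. The code below is illustrative only: `levenshtein` is a textbook dynamic-programming implementation, `radar_placement` is a hypothetical helper, and reading “the word length is longer” as referring to the compared word is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def radar_placement(word, comparisons, min_distance=1, max_distance=None):
    """Return (candidate, distance, side) tuples for candidates within range."""
    placed = []
    for candidate in sorted(comparisons):  # alphabetical order
        distance = levenshtein(word, candidate)
        if distance < min_distance:
            continue
        if max_distance is not None and distance > max_distance:
            continue
        # Assumption: longer candidates go left, all others go right.
        side = "left" if len(candidate) > len(word) else "right"
        placed.append((candidate, distance, side))
    return placed


for row in radar_placement("password", ["passw0rd", "password", "password1", "pass"]):
    print(row)
```

With the default min_distance of 1, the reference word itself (distance 0) is filtered out of the result.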
- Parameters
word – string, the reference word that will be placed in the middle
comparisons – external file, list or string or dataframe of words
ascending – bool if the sort is ascending
display – maximum number of data points to display, forces sampling if smaller than len(df)
label – bool if True display words centered at coordinate
metric – Levenshtein (default), or any metric function accepting two values and returning that metric
min_distance – minimum distance based on metric to include a word for display
max_distance – maximum distance based on metric to include a word for display
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘alpha’, can also be ‘len’
- Returns
-
stemgraphic.alpha.
scatter
(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, size=5, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)¶ scatter.
With 2 sources:
Scatter compares the word frequency of two sources, on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If one word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink).
With 3 sources:
The scatter will compare in 3d the word frequency of three sources, on each axis. Each data point hover value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in the 1st source, the Y axis reflects the same word/ngram count in the 2nd source, and the Z axis the 3rd source, each in a different color. If one word/ngram is more common in the 1st source it will be displayed in one color, if it is more common in the 2nd source in a second color, and if it is more common in the 3rd source in a third color. The values that are the same across the sources will be displayed in a 4th color (default colors are blue, black, purple and pink).
In interactive mode, hovering the data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend. Double clicking a category will show only that category.
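The two-source comparison described above amounts to counting each word in both sources and assigning one of three categories. A minimal sketch, assuming whitespace-separated tokens; `compare_counts` and the category labels are hypothetical and do not come from stemgraphic.

```python
from collections import Counter


def compare_counts(words1, words2):
    """Return {word: (count1, count2, category)} for every word in either source."""
    counts1, counts2 = Counter(words1), Counter(words2)
    result = {}
    for word in sorted(set(counts1) | set(counts2)):
        x, y = counts1[word], counts2[word]
        if x > y:
            category = "more common in source 1"
        elif y > x:
            category = "more common in source 2"
        else:
            category = "same in both"
        result[word] = (x, y, category)
    return result


src1 = "the cat sat on the mat".split()
src2 = "the dog sat on the log by the door".split()
for word, (x, y, category) in compare_counts(src1, src2).items():
    print(f"{word:>5}  x={x}  y={y}  {category}")
```

Each (x, y) pair is the coordinate of one point, and the category decides its color.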
- Parameters
src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the dots, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – word/stem count sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)
normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length
percentage – coordinates in percentage of maximum word/ngram count (in non interactive mode)
project – project src1/src2 and src1/src3 comparisons on X=0 and Z=0 planes
project_only – only show the projection (NA if project is False)
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)
- Returns
matplotlib ax, dataframe with categories
-
stemgraphic.alpha.
stem_freq_plot
(df, alpha_only=False, asFigure=False, column=None, compact=True, caps=False, display=2600, interactive=True, kind='barh', leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None)¶ stem_freq_plot.
Word frequency plot is the most common visualization in NLP. In this version it supports stem-and-leaf / n-grams.
Each row is the stem, and similar leaves are grouped together and each different group is stacked in bar charts.
Default is horizontal bar chart, but vertical, histograms, area charts and even pie charts are supported by this one visualization.
- Parameters
df – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
asFigure – return plot as plotly figure (for web applications)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
compact – do not display empty stem rows (with no leaves), defaults to True
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
display – maximum number of data points to display, forces sampling if smaller than len(df)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
kind – defaults to ‘barh’. One of ‘bar’,’barh’,’area’,’hist’. Non-interactive also supports ‘pie’
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
random_state – initial random seed for the sampling process, for reproducible research
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
- Returns
-
stemgraphic.alpha.
stem_graphic
(df, df2=None, aggregation=True, alpha=0.1, alpha_only=True, ascending=False, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, caps=True, column=None, combined=None, compact=False, delimiter_color='C3', display=750, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, leaf_skip=0, legend_pos='best', median_color='C4', mirror=False, persistence=None, primary_kw=None, random_state=None, remove_accents=False, reverse=False, secondary=False, show_stem=True, sort_by='len', stop_words=None, stem_order=1, stem_skip=0, title=None, trim_blank=False, underline_color=None)¶ stem_graphic.
The principal visualization of stemgraphic.alpha is stem_graphic. It offers all the options of stem_text (3.1) and adds automatic title, mirroring, flipping of axes, export (to pdf, svg, png, through fig.savefig) and many more options to change the visual appearance of the plot (font size, color, background color, underlining and more).
By providing a secondary text source, the plot will enable comparison through a back-to-back display.
- Parameters
df – string, filename, url, list, numpy array, time series, pandas or dask dataframe
df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 10%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
ax – matplotlib axes instance, usually from a figure or other plot
ax2 – matplotlib axes instance, usually from a figure or other plot for back to back
bar_color – the fill color of the bar representing the leaves
bar_outline – the outline color of the bar representing the leaves
break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line
caps – bool, True to be case sensitive, defaults to True; False is recommended for comparisons. (NA on dataframe)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.
combined – list (specific subset to automatically include, say, for comparisons), or None
compact – do not display empty stem rows (with no leaves), defaults to False
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
flip_axes – X becomes Y and Y becomes X
font_kw – keyword dictionary, font parameters
leaf_color – font color of the leaves
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
median_color – color of the box representing the median
mirror – mirror the plot in the axis of the delimiters
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
primary_kw – stem-and-leaf plot additional arguments
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
reverse – bool if True look at words from right to left
secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots
show_stem – bool if True (default) displays the stems
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
title – string, or None. When None and source is a file, filename will be used.
trim_blank – remove the blank between the delimiter and the first leaf, defaults to False
underline_color – color of the horizontal line under the leaves, None for no display
- Returns
matplotlib figure and axes instance, and dataframe if figure_only is False
-
stemgraphic.alpha.
stem_scatter
(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)¶ stem_scatter.
stem_scatter compares the word frequency of two sources, on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If one word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering the data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.
- Parameters
src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the dots, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)
normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length
percentage – coordinates in percentage of maximum word/ngram count
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)
- Returns
matplotlib polar ax, dataframe
-
stemgraphic.alpha.
stem_sunburst
(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=False, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=0)¶ stem_sunburst.
Stem-and-leaf based sunburst. See sunburst for details
- Parameters
words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.(NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
statistics – bool if True (default) displays statistics in center - hole has to be True
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent
- Returns
-
stemgraphic.alpha.
stem_text
(df, aggr=False, alpha_only=True, ascending=True, binary=False, break_on=None, caps=True, charset=None, column=None, compact=False, display=750, legend_pos='top', leaf_order=1, leaf_skip=0, persistence=None, remove_accents=False, reverse=False, rows_only=False, sort_by='len', stem_order=1, stem_skip=0, stop_words=None, random_state=None)¶ stem_text.
Tukey’s original stem-and-leaf plot was text, with a vertical delimiter to separate stem from leaves. Just as stemgraphic implements a text version of the plot for numbers, stemgraphic.alpha implements a text version for words. This type of plot serves a similar purpose as a stacked bar chart with each data point annotated.
It also displays some basic statistics on the whole text (or subset if using column).
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
aggr – bool if True display the aggregated count of leaves by row
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – bool if the sort is ascending
binary – bool if True forces counts to 1 for anything greater than 0
break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line
caps – bool, True (default) to be case sensitive; False is recommended for comparisons. (NA on dataframe)
column – specify which column (string or number) of the dataframe to use, or group of columns (stems), else the frame is assumed to only have one column with words.
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf characters per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
legend_pos – where to put the legend: ‘top’ (default), ‘bottom’ or None
persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None
random_state – initial random seed for the sampling process, for reproducible research
remove_accents – bool if True strips accents (NA on dataframe)
reverse – bool if True look at words from right to left
rows_only – by default returns only the stem and leaf rows. If false, also return the matrix and dataframe
sort_by – default to ‘len’, can also be ‘alpha’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
-
stemgraphic.alpha.
sunburst
(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=40)¶ sunburst.
Word sunburst charts are similar to pie or donut charts, but add some statistics in the middle of the chart, including the percentage of total words covered by a given number of unique words (e.g. top 50 words, 48% coverage).
With stem-and-leaf, the first level of the sunburst represents the stem and the second level subdivides each stem by leaves.
- Parameters
words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons. (NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True (default) display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
statistics – bool if True (default) displays statistics in center - hole has to be True
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent
- Returns
matplotlib polar ax, dataframe
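The coverage statistic mentioned above can be approximated outside the library. This is a minimal sketch, assuming coverage means the share of all word occurrences accounted for by the top most frequent words. top_coverage is a hypothetical helper, not part of stemgraphic, and it only handles a non-negative top.

```python
from collections import Counter


def top_coverage(words, top=40):
    """Return (most common words, percentage of all occurrences they cover)."""
    counts = Counter(words)
    total = sum(counts.values())
    most_common = counts.most_common(top)
    covered = sum(count for _, count in most_common)
    # Guard against an empty input to avoid dividing by zero.
    percentage = 100.0 * covered / total if total else 0.0
    return most_common, percentage


words = "the cat and the dog and the bird".split()
common, pct = top_coverage(words, top=2)
print(common, pct)
```

The sunburst functions take care of tokenizing and filtering first, so the figure they display may differ from this sketch when stop_words or alpha_only remove words.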
-
stemgraphic.alpha.
text_heatmap
(df, caps=True, charset=None, column=None, compact=True, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=True, zoom=None)¶ text heatmap.
The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. The difference with the heatmatrix is that by default it does not show zero values, and it presents a compact form by not showing whole empty rows either. Set compact=False to display those empty rows.
The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – wide format
leaf_order – how many leaf digits per data point to display, defaults to 1
outliers – for compatibility with other text plots
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False (no trimming)
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zero_blank – replace zero digit with space
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
count matrix, scale
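The row and column layout described above can be illustrated without the library. This is a minimal sketch, assuming non-negative integers and a scale supplied by the caller (stemgraphic picks one automatically when scale is None). stem_leaf_counts is a hypothetical helper, not the library's implementation.

```python
from collections import defaultdict


def stem_leaf_counts(values, scale=10):
    """Count values per (stem, leaf) cell, where the leaf is a single digit."""
    counts = defaultdict(int)
    for value in values:
        # Express the value in units of one leaf digit, dropping finer digits.
        leaf_units = int(value * 10 // scale)
        stem, leaf = divmod(leaf_units, 10)
        counts[(stem, leaf)] += 1
    return dict(counts)


# With scale=100, 4270 and 4279 both start with '427': row 42, column 7.
matrix = stem_leaf_counts([4270, 4279, 4281, 4310, 550], scale=100)
print(matrix)
```

Only cells with at least one value appear in the result, which mirrors the compact, zero-free display described for the text heatmap.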
-
stemgraphic.alpha.
word_freq_plot
(src, alpha_only=False, ascending=False, asFigure=False, caps=False, display=None, interactive=True, kind='barh', random_state=None, sort_by='count', stop_words=None, top=100)¶ word frequency bar chart.
This function creates a classical word frequency bar chart.
- Parameters
src – Either a filename including path, a url or a ready to process text in a dataframe or a tokenized format.
alpha_only – words only if True, words and numbers if False
ascending – stem sorted in ascending order, defaults to False
asFigure – if interactive, the function will return a plotly figure instead of a matplotlib ax
caps – keep capitalization (True, False)
display – if specified, sample that quantity of words
interactive – interactive graphic (True, False)
kind – horizontal bar chart (barh) - also ‘bar’, ‘area’, ‘hist’ and non interactive ‘kde’ and ‘pie’
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘count’, can also be ‘alpha’
stop_words – a list of words to ignore
top – how many different words to count by order frequency. If negative, this will be the least frequent
- Returns
text as dataframe and plotly figure or matplotlib ax
-
stemgraphic.alpha.
word_radar
(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')¶ word_radar.
Radar plot based on words. Currently the only type of radar plot supported. See radar for more details.
- Parameters
word – string, the reference word that will be placed in the middle
comparisons – external file, list or string or dataframe of words
ascending – bool if the sort is ascending
display – maximum number of data points to display, forces sampling if smaller than len(df)
label – bool if True display words centered at coordinate
metric – any metric function accepting two values and returning that metric in a range from 0 to x
min_distance – minimum distance based on metric to include a word for display
max_distance – maximum distance based on metric to include a word for display
random_state – initial random seed for the sampling process, for reproducible research
sort_by – default to ‘alpha’, can also be ‘len’
- Returns
-
stemgraphic.alpha.
word_scatter
(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=None, leaf_skip=0, log_scale=True, normalize=None, percentage=None, random_state=None, sort_by='alpha', stem_order=None, stem_skip=0, stop_words=None, whole=False)¶ word_scatter.
Scatter compares the word frequency of two sources, one on each axis. Each data point’s Z value is the word or stem-and-leaf value, while the X axis reflects that word’s count in one source and the Y axis reflects the same word’s count in the other source. If a word is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. Values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering over a data point gives the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.
- Parameters
src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe
src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional
alpha – opacity of the data points, defaults to 50%
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to True
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons. (NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
fig_xy – tuple for matplotlib figsize, defaults to (20,20)
interactive – if cufflinks is loaded, renders as interactive plot in notebook
jitter – random noise added to help see multiple data points sharing the same coordinate
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
log_scale – bool if True (default) uses log scale axes
random_state – initial random seed for the sampling process, for reproducible research
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
whole – for normalized or percentage, use whole integer values (round)
- Returns
matplotlib polar ax, dataframe
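The three color categories described above come down to comparing the two counts of each word. This is a minimal sketch of that comparison, assuming already tokenized input. compare_counts and its category labels are hypothetical, not part of stemgraphic.

```python
from collections import Counter


def compare_counts(words1, words2):
    """Return {word: (count in source 1, count in source 2, category)}."""
    counts1, counts2 = Counter(words1), Counter(words2)
    result = {}
    for word in sorted(set(counts1) | set(counts2)):
        x, y = counts1[word], counts2[word]  # a missing word counts as 0
        if x > y:
            category = "more in source 1"
        elif y > x:
            category = "more in source 2"
        else:
            category = "same in both"
        result[word] = (x, y, category)
    return result


points = compare_counts("a rose is a rose".split(), "a rose is red".split())
for word, (x, y, category) in points.items():
    print(word, x, y, category)
```

Each entry corresponds to one data point: x and y are its coordinates and the category selects its color.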
-
stemgraphic.alpha.
word_sunburst
(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=None, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=None, stem_skip=0, stop_words=None, top=40)¶ word_sunburst.
Word based sunburst. See sunburst for details
- Parameters
words – string, filename, url, list, numpy array, time series, pandas or dask dataframe
alpha_only – only use stems from a-z alphabet (NA on dataframe)
ascending – stem sorted in ascending order, defaults to False
caps – bool, True to be case sensitive, defaults to False, recommended for comparisons. (NA on dataframe)
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
hole – bool if True (default) leave space in middle for statistics
label – bool if True display words centered at coordinate
leaf_order – how many leaf digits per data point to display, defaults to 1
leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’
median – bool if True (default) display an origin and a median mark
ngram – bool if True (default) display full n-gram as leaf label
random_state – initial random seed for the sampling process, for reproducible research
statistics – bool if True (default) displays statistics in center - hole has to be True
sort_by – sort by ‘alpha’ (default) or ‘count’
stem_order – how many stem characters per data point to display, defaults to 1
stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter
stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
top – how many different words to count by order frequency. If negative, this will be the least frequent
- Returns
graphic
¶
Stemgraphic.graphic.
Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. Stemgraphic.graphic is a module implementing a graphical stem-and-leaf plot function and a stem-and-leaf heatmap plot function for numerical data. It also provides a density_plot function.
-
stemgraphic.graphic.
density_plot
(df, var=None, ax=None, bins=None, box=None, density=True, density_fill=True, display=1000, fig_only=True, fit=None, hist=None, hues=None, hue_labels=None, jitter=None, kind=None, leaf_order=1, legend=True, limit_var=False, norm_hist=None, random_state=None, rug=None, scale=None, singular=True, strip=None, swarm=None, title=None, violin=None, x_min=0, x_max=None, y_axis_label=True)¶ density_plot.
Various density and distribution plots conveniently packaged into one function. A density plot normally forces tails at each end, which might go beyond the data. To force min/max to be driven by the data, use limit_var. To specify min and max, use x_min and x_max instead. Nota bene: defaults to decimation and quantization mode.
See density_plot notebook for examples of the different combinations of plots.
Why this instead of seaborn:
Stem-and-leaf plots naturally quantize data. The amount of loss is based on scale and leaf_order and on the data itself. This function, which wraps several seaborn distribution plots, was added in order to compare various measures of density and distributions based on various levels of decimation (sampling, set through display) and of quantization (set through scale and leaf_order). Also, there is no option in seaborn to fill the area under the curve…
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
var – variable to plot, required if df is a dataframe
ax – matplotlib axes instance, usually from a figure or other plot
bins – Specification of hist bins, or None to use Freedman-Diaconis rule
box – bool, if True plots a box plot. Similar to using violin, use one or the other
density – bool, if True (default) plots a density plot
density_fill – bool, if True (default) fill the area under the density curve
display – maximum number of rows to use (1000 default) for calculations, forces sampling if < len(df)
fig_only – bool, if True (default) returns fig, ax, else returns fig, ax, max_peak, true_min, true_max
fit – object with fit method, returning a tuple that can be passed to a pdf method
hist – bool, if True plot a histogram
hues – optional, a categorical variable for multiple plots
hue_labels – optional, if using a column that is an object and/or categorical needing translation
jitter – for strip plots only, add jitter. strip + jitter is similar to using swarm, use one or the other
leaf_order – the order of magnitude of the leaf. The higher the order, the less quantization.
legend – bool, if True plots a legend
limit_var – use min / max from the data, not density plot
norm_hist – bool, if True histogram will be normed
random_state – initial random seed for the sampling process, for reproducible research
rug – bool, if True plot a rug plot
scale – force a specific scale for building the plot. Defaults to None (automatic).
singular – force display of a density plot using a singular value, by simulating values of each side
strip – bool, if True displays a strip plot
swarm – swarm plot, similar to strip plot. use one or the other
title – if present, adds a title to the plot
violin – bool, if True plots a violin plot. Similar to using box, use one or the other
x_min – force X axis minimum value. See also limit_var
x_max – force X axis maximum value. See also limit_var
y_axis_label – bool, if True displays y axis ticks and label
- Returns
see fig_only
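Decimation and quantization can be pictured as two independent steps applied before plotting. This is a minimal sketch using only the standard library, assuming quantization truncates each value to the resolution of one leaf digit (scale / 10, that is leaf_order=1). decimate_and_quantize is a hypothetical helper and does not reproduce what density_plot does internally.

```python
import random


def decimate_and_quantize(values, display=1000, scale=10, random_state=None):
    """Sample at most `display` values, then truncate them to one leaf digit."""
    rng = random.Random(random_state)  # seeded for reproducible sampling
    values = list(values)
    if display is not None and display < len(values):
        values = rng.sample(values, display)  # decimation
    step = scale / 10  # resolution of one leaf digit
    return [int(value // step) * step for value in values]  # quantization


sample = decimate_and_quantize(range(1000), display=5, scale=100, random_state=42)
print(sample)
```

A smaller display removes more data points, while a larger scale removes more digits from each one, which is why the two settings are compared separately.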
-
stemgraphic.graphic.
heatmap
(df, annotate=False, asFigure=False, ax=None, caps=None, column=None, compact=False, display=900, flip_axes=False, interactive=True, leaf_order=1, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zoom=None)¶ heatmap.
The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. In contrast to the text heatmap, the graphical heatmap does not remove empty rows by default. To activate this feature, use compact=True.
The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
annotate – display annotations (Z) on heatmap
asFigure – return plot as plotly figure (for web applications)
ax – matplotlib axes instance, usually from a figure or other plot
caps – for compatibility
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – bool, default is False
interactive – if cufflinks is loaded, renders as interactive plot in notebook
leaf_order – how many leaf digits per data point to display, defaults to 1
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False (no trimming)
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
count matrix, scale and matplotlib ax or figure if interactive and asFigure are True
-
stemgraphic.graphic.
leaf_scatter
(df, alpha=0.1, asc=True, ax=None, break_on=None, column=None, compact=False, delimiter_color='C3', display=900, figure_only=True, flip_axes=False, font_kw=None, grid=False, interactive=True, leaf_color='k', leaf_jitter=False, leaf_order=1, legend_pos='best', mirror=False, persistence=None, primary_kw=None, random_state=None, scale=None, scaled_leaf=True, zoom=None)¶ leaf_scatter.
Scatter for numerical values based on leaf for X axis (scaled or not) and stem for Y axis.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
alpha – opacity of the dots, defaults to 10%
asc – stem (Y axis) sorted in ascending order, defaults to True
ax – matplotlib axes instance, usually from a figure or other plot
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
flip_axes – X becomes Y and Y becomes X
font_kw – keyword dictionary, font parameters
grid – show grid
interactive – if plotly is available, renders as interactive plot in notebook. False to render image.
leaf_color – font color of the leaves
leaf_jitter – add jitter to see density of each specific stem/leaf combo
leaf_order – how many leaf digits per data point to display, defaults to 1
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
mirror – mirror the plot in the axis of the delimiters
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
primary_kw – stem-and-leaf plot additional arguments
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
scaled_leaf – bool, if True (default) scales the leaves
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
-
stemgraphic.graphic.
stem_graphic
(df, df2=None, aggregation=True, alpha=0.1, asc=True, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, column=None, combined=None, compact=False, delimiter_color='C3', display=900, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, legend_pos='best', median_alpha=0.25, median_color='C4', mirror=False, outliers=None, outliers_color='C3', persistence=None, primary_kw=None, random_state=None, scale=None, secondary=False, secondary_kw=None, secondary_plot=None, show_stem=True, title=None, trim=False, trim_blank=True, underline_color=None, unit='', zoom=None)¶ stem_graphic.
A graphical stem and leaf plot. stem_graphic provides horizontal, vertical or mirrored layouts, sorted in ascending or descending order, with sane default settings for the visuals, legend, median and outliers.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots
aggregation – Boolean for sum, else specify function
alpha – opacity of the bars, median and outliers, defaults to 10%
asc – stem sorted in ascending order, defaults to True
ax – matplotlib axes instance, usually from a figure or other plot
ax2 – matplotlib axes instance, usually from a figure or other plot for back to back
bar_color – the fill color of the bar representing the leaves
bar_outline – the outline color of the bar representing the leaves
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
combined – list (specific subset to automatically include, say, for comparisons), or None
compact – do not display empty stem rows (with no leaves), defaults to False
delimiter_color – color of the line between aggregate and stem and stem and leaf
display – maximum number of data points to display, forces sampling if smaller than len(df)
figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)
flip_axes – X becomes Y and Y becomes X
font_kw – keyword dictionary, font parameters
leaf_color – font color of the leaves
leaf_order – how many leaf digits per data point to display, defaults to 1
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
median_alpha – opacity of median and outliers, defaults to 25%
median_color – color of the box representing the median
mirror – mirror the plot in the axis of the delimiters
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
outliers_color – background color for the outlier boxes
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
primary_kw – stem-and-leaf plot additional arguments
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots
secondary_kw – any matplotlib keyword supported by .plot(), for the secondary plot
secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None
show_stem – bool if True (default) displays the stems
title – string to display as title
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to False (no trimming)
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
underline_color – color of the horizontal line under the leaves, None for no display
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
matplotlib figure and axes instance
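For readers new to the layout, the stem and leaf rows that stem_graphic draws can be pictured with a plain text version. This is a minimal sketch for non-negative integers with a single leaf digit. text_stem_and_leaf is a hypothetical helper, and it always skips empty stems, which corresponds to compact=True.

```python
from collections import defaultdict


def text_stem_and_leaf(values, scale=10):
    """Return one 'stem | leaves' string per non-empty stem, in ascending order."""
    rows = defaultdict(list)
    for value in sorted(values):
        # Keep one leaf digit; the remaining leading digits form the stem.
        stem, leaf = divmod(int(value * 10 // scale), 10)
        rows[stem].append(str(leaf))
    return [f"{stem:>3} | {''.join(leaves)}" for stem, leaves in sorted(rows.items())]


for row in text_stem_and_leaf([27, 12, 34, 15, 21, 27]):
    print(row)
```

The length of each row of leaves is what the bars of the graphical version represent.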
helpers
¶
helpers.py.
Helper functions for stemgraphic.
-
stemgraphic.helpers.
APOSTROPHE
= '’'¶ Typographical apostrophe - ex: I’m, l’arbre
-
stemgraphic.helpers.
CHAR_FILTER
= ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', '.', ';', ':', '!', '?', '|', '$', "'", '"', '…']¶ Characters to filter. Does a relatively good job on a majority of texts. ‘- ‘ and ‘–’ are there to skip quotes in many plays and dialogues in books, especially French.
-
stemgraphic.helpers.
DOUBLE_QUOTE
= '"'¶ Double straight quote mark
-
stemgraphic.helpers.
EMPTY
= b' '¶ empty
-
stemgraphic.helpers.
LETTERS
= 'abcdefghijklmnopqrstuvwxyz'¶ Default definition of standard letters. remove_accent has to be called explicitly for any of these letters to match their accented counterparts.
-
stemgraphic.helpers.
NON_ALPHA
= ['-', '+', '/', '[', ']', '_', '£', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', ';', "'", '"', '’', b' ', b'\xd6\xb1', '?', '¡', '¿', '«', '»', '“', '”', '-', '—']¶ List of non alpha characters. Temporary - I want to balance flexibility with convenience, but still looking at options.
-
stemgraphic.helpers.
NO_PERIOD_FILTER
= ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', ';', ':', '!', '?', '|', '$', "'", '"']¶ Similar purpose to CHAR_FILTER, but keeps the period. The last word of each sentence will end with a ‘.’ Useful for manipulating the dataframe returned by the various visualizations and ngram_data, to break down frequencies by sentence instead of the full text or list.
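Keeping the period makes it possible to split word counts by sentence. This is a minimal sketch of that idea on a raw string, using a shortened filter list for illustration only. words_by_sentence and FILTER are hypothetical names, and the library itself applies its filters while building its dataframes.

```python
from collections import Counter

# Shortened stand-in for a period-preserving filter list (illustration only).
FILTER = ['\t', '\n', ';', ':', '!', '?', '"']


def words_by_sentence(text, char_filter=FILTER):
    """Return one Counter of lowercase words per sentence."""
    for char in char_filter:
        text = text.replace(char, ' ')
    sentences, current = [], []
    for token in text.lower().split():
        word = token.rstrip('.')
        if word:
            current.append(word)
        if token.endswith('.'):
            # A trailing period marks the last word of a sentence.
            sentences.append(Counter(current))
            current = []
    if current:
        sentences.append(Counter(current))
    return sentences


for counts in words_by_sentence("The cat sat. The dog ran; the cat hid."):
    print(counts)
```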
-
stemgraphic.helpers.
OVER
= b'\xd6\xb1'¶ for typesetting overlap
-
stemgraphic.helpers.
QUOTE
= "'"¶ Straight quote mark - ex: ‘INCONCEIVABLE’
-
stemgraphic.helpers.
alpha_mapping
= {'bold': '𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳', 'boldsans': '𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇', 'circle': 'ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ', 'cursive': '𝒜𝐵𝒞𝒟𝐸𝐹𝒢𝐻𝐼𝒥𝒦𝐿𝑀𝒩𝒪𝒫𝒬𝑅𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵𝒶𝒷𝒸𝒹𝑒𝒻𝑔𝒽𝒾𝒿𝓀𝓁𝓂𝓃𝑜𝓅𝓆𝓇𝓈𝓉𝓊𝓋𝓌𝓍𝓎𝓏', 'default': 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', 'doublestruck': '𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫', 'italicbold': '𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁𝒂𝒃𝒄𝒅𝒆𝒇𝒈𝒉𝒊𝒋𝒌𝒍𝒎𝒏𝒐𝒑𝒒𝒓𝒔𝒕𝒖𝒗𝒘𝒙𝒚𝒛', 'italicboldsans': '𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯', 'medieval': '𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷', 'medievalbold': '𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟', 'square': '🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉', 'square_inverted': '🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉', 'typewriter': '𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉𝚊𝚋𝚌𝚍𝚎𝚏𝚐𝚑𝚒𝚓𝚔𝚕𝚖𝚗𝚘𝚙𝚚𝚛𝚜𝚝𝚞𝚟𝚠𝚡𝚢𝚣'}¶ Alphabet unicode mapping
-
stemgraphic.helpers.
available_alpha_charsets
()¶ available_alpha_charsets.
All supported unicode alphabet charsets, such as ‘doublestruck’ where A looks like: 𝔸
- Returns
list of charset names
-
stemgraphic.helpers.
available_charsets
()¶ available_charsets.
All supported unicode digit charsets, such as ‘doublestruck’ where 0 looks like: 𝟘
- Returns
list of charset names
-
stemgraphic.helpers.
jitter
(data, scale)¶ jitter.
Adds jitter to data, for display purpose
- Parameters
data – numpy or pandas dataframe
scale –
- Returns
-
stemgraphic.helpers.
key_calc
(stem, leaf, scale)¶ key_calc.
Calculates a value from a stem, a leaf and a scale.
- Parameters
stem –
leaf –
scale –
- Returns
calculated values
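The relationship between stem, leaf, scale and the original value can be written in one line. This is a sketch under the assumption of a single-digit leaf, where the value is the stem followed by the leaf, expressed in units of scale / 10. key_value is a hypothetical stand-in, not the library's key_calc.

```python
def key_value(stem, leaf, scale):
    """Rebuild a number from its stem, single-digit leaf and scale."""
    # Concatenate stem and leaf as digits, then rescale to the original units.
    return (int(stem) * 10 + int(leaf)) * scale / 10


print(key_value(42, 7, 10))   # stem 42, leaf 7 at scale 10
print(key_value(42, 7, 100))  # same stem and leaf at scale 100
```

This is the kind of value a legend shows next to an example stem and leaf pair, so that the reader knows how to interpret the digits.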
-
stemgraphic.helpers.
legend
(ax, x, y, asc, flip_axes, mirror, stem, leaf, scale, delimiter_color, aggregation=True, cur_font=None, display=10, pos='best', unit='')¶ legend.
Builds a graphical legend for numerical stem-and-leaf plots.
- Parameters
display –
cur_font –
ax –
x –
y –
pos –
asc –
flip_axes –
mirror –
stem –
leaf –
scale –
delimiter_color –
unit –
aggregation –
-
stemgraphic.helpers.
mapping
= {'arabic': {'0': '٠', '1': '١', '2': '٢', '3': '٣', '4': '٤', '5': '٥', '6': '٦', '7': '٧', '8': '٨', '9': '٩'}, 'arabic_r': {'0': '٠', '1': '١', '2': '٢', '3': '٣', '4': '٤', '5': '٥', '6': '٦', '7': '٧', '8': '٨', '9': '٩'}, 'bold': {'0': '𝟎', '1': '𝟏', '2': '𝟐', '3': '𝟑', '4': '𝟒', '5': '𝟓', '6': '𝟔', '7': '𝟕', '8': '𝟖', '9': '𝟗'}, 'circled': {'0': '⓪', '1': '①', '2': '②', '3': '③', '4': '④', '5': '⑤', '6': '⑥', '7': '⑦', '8': '⑧', '9': '⑨'}, 'default': {'0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9'}, 'doublestruck': {'0': '𝟘', '1': '𝟙', '2': '𝟚', '3': '𝟛', '4': '𝟜', '5': '𝟝', '6': '𝟞', '7': '𝟟', '8': '𝟠', '9': '𝟡'}, 'fullwidth': {'0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9'}, 'gurmukhi': {'0': '੦', '1': '੧', '2': '੨', '3': '੩', '4': '੪', '5': '੫', '6': '੬', '7': '੭', '8': '੮', '9': '੯'}, 'mono': {'0': '𝟶', '1': '𝟷', '2': '𝟸', '3': '𝟹', '4': '𝟺', '5': '𝟻', '6': '𝟼', '7': '𝟽', '8': '𝟾', '9': '𝟿'}, 'nko': {'0': '߀', '1': '߁', '2': '߂', '3': '߃', '4': '߄', '5': '߅', '6': '߆', '7': '߇', '8': '߈', '9': '߉'}, 'rod': {'0': '◯', '1': '𝍩', '2': '𝍪', '3': '𝍫', '4': '𝍬', '5': '𝍭', '6': '𝍮', '7': '𝍯', '8': '𝍰', '9': '𝍱'}, 'roman': {'0': '.', '1': 'Ⅰ', '2': 'Ⅱ', '3': 'Ⅲ', '4': 'Ⅳ', '5': 'Ⅴ', '6': 'Ⅵ', '7': 'Ⅶ', '8': 'Ⅷ', '9': 'Ⅸ'}, 'sans': {'0': '𝟢', '1': '𝟣', '2': '𝟤', '3': '𝟥', '4': '𝟦', '5': '𝟧', '6': '𝟨', '7': '𝟩', '8': '𝟪', '9': '𝟫'}, 'sansbold': {'0': '𝟬', '1': '𝟭', '2': '𝟮', '3': '𝟯', '4': '𝟰', '5': '𝟱', '6': '𝟲', '7': '𝟳', '8': '𝟴', '9': '𝟵'}, 'square': {'0': '🞌', '1': '🞍', '2': '■', '3': '⬛', '4': '🞓', '5': '🞒', '6': '🞑', '7': '🞐', '8': '🞏', '9': '🞎'}, 'subscript': {'0': '₀', '1': '₁', '2': '₂', '3': '₃', '4': '₄', '5': '₅', '6': '₆', '7': '₇', '8': '₈', '9': '₉'}, 'tamil': {'0': '௦', '1': '௧', '2': '௨', '3': '௩', '4': '௪', '5': '௫', '6': '௬', '7': '௭', '8': '௮', '9': '௯'}}¶ Charset unicode digit mappings
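Each entry of the mapping is a plain digit-to-character dictionary, so it can be applied with the standard string translation tools. This is a minimal sketch using a copy of the ‘circled’ entry shown above. to_charset is a hypothetical helper, not a function of stemgraphic.helpers.

```python
# Copy of the 'circled' entry from the mapping above.
CIRCLED = {'0': '⓪', '1': '①', '2': '②', '3': '③', '4': '④',
           '5': '⑤', '6': '⑥', '7': '⑦', '8': '⑧', '9': '⑨'}


def to_charset(text, digit_map):
    """Replace every ASCII digit in text by its counterpart in digit_map."""
    return text.translate(str.maketrans(digit_map))


print(to_charset("42 | 7", CIRCLED))
```

Characters that are not keys of the dictionary, such as the delimiter, are left untouched.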
-
stemgraphic.helpers.
min_max_count
(x, column=0)¶ min_max_count.
Handles min, max and count. This works on numpy, lists, pandas and dask dataframes.
- Parameters
x – list, numpy array, series, pandas or dask dataframe
column – future use
- Returns
min, max and count
-
stemgraphic.helpers.
na_count
(x, column=0)¶ na_count.
Counts the numpy NaN values in the data. This works on numpy, lists, pandas and dask dataframes.
- Parameters
x – list, numpy array, series, pandas or dask dataframe
column – future use
- Returns
all numpy nan count
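As a rough illustration of what counting NaN values involves for a plain Python list, here is a minimal sketch. It is not the library's implementation, and `count_nan` is a hypothetical name used only for this example.

```python
import math


def count_nan(values):
    """Count the float NaN entries in an iterable (hypothetical helper)."""
    # math.isnan only accepts numbers, so non-float entries such as
    # strings or integers are skipped by the isinstance check.
    return sum(1 for v in values if isinstance(v, float) and math.isnan(v))


print(count_nan([1.0, float("nan"), 3.5, float("nan"), 7]))
```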
-
stemgraphic.helpers.
npy_load
(path)¶ npy_load.
load numpy array (npy) file from disk.
- Parameters
path – path to npy file
- Returns
numpy array
-
stemgraphic.helpers.
npy_save
(path, array)¶ npy_save.
saves numpy array to npy file on disk.
- Parameters
path – path where to save npy file
array – numpy array
- Returns
path
-
stemgraphic.helpers.
percentile
(data, alpha)¶ percentile.
- Parameters
data – list, numpy array, time series or pandas dataframe
alpha – proportion, between 0 and 0.5, to select on each side of the distribution
- Returns
the actual value at that percentile
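One common way to read "the actual value at that percentile" is a nearest-rank lookup on the sorted data. The sketch below assumes that interpretation; the library may select or interpolate differently, and `percentile_sketch` is a hypothetical name.

```python
def percentile_sketch(data, alpha):
    """Return the value found at proportion `alpha` of the sorted data."""
    ordered = sorted(data)
    # Clamp the index so that alpha=1.0 still maps to the last element.
    index = min(int(alpha * len(ordered)), len(ordered) - 1)
    return ordered[index]


values = [15, 3, 8, 42, 23, 4, 16]
print(percentile_sketch(values, 0.25))
```

The returned value is always one of the original data points, which matches the "actual value" wording of the return description.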
-
stemgraphic.helpers.
pkl_load
(path)¶ pkl_load.
load matrix or dataframe pickle (pkl) file from disk.
- Parameters
path – path to pickle file
- Returns
matrix or dataframe
-
stemgraphic.helpers.
pkl_save
(path, array)¶ pkl_save.
saves matrix or dataframe to pkl file on disk.
- Parameters
path – path where to save pickle file
array – matrix (array) or dataframe
- Returns
path
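The save and load helpers follow a simple round-trip pattern: the save call returns the path, which can be handed straight to the load call. The sketch below reproduces that pattern with the standard `pickle` module; `save_pickle` and `load_pickle` are hypothetical stand-ins, not the library's functions.

```python
import os
import pickle
import tempfile


def save_pickle(path, obj):
    """Write obj to path with pickle and return the path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle(path):
    """Read back whatever object was pickled at path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


matrix = [[1, 2, 3], [4, 5, 6]]
with tempfile.TemporaryDirectory() as tmp:
    # The returned path feeds directly into the load call.
    saved = save_pickle(os.path.join(tmp, "sample.pkl"), matrix)
    restored = load_pickle(saved)
```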
-
stemgraphic.helpers.
savefig
(plt)¶ savefig.
Displays a matplotlib figure in the console terminal. This requires pysixel to be pip installed. It also requires a terminal with Sixel graphics support, such as a DEC terminal with graphics support, Linux xterm (started with -ti 340), or MLTerm (multilingual terminal, available on Windows, Linux, etc.).
This is called by the command line stem tool when using -o stdout and can also be used in an ipython session.
- Parameters
plt – matplotlib pyplot
- Returns
-
stemgraphic.helpers.
square_scale
()¶ square_scale.
Ordered key for 0-9 mapping to squares from tiny filled square to large hollow square.
- Returns
scale from 0 to 9
-
stemgraphic.helpers.
stack_columns
(row)¶ stack_columns.
stack multiple columns into a single stacked value
- Parameters
row – a row of letters
- Returns
stacked string
-
stemgraphic.helpers.
translate_alpha_representation
(text, charset=None)¶ translate_alpha_representation.
Replace the default (ASCII type) charset in a string with the equivalent in a different unicode charset.
- Parameters
text – input string
charset – unicode character set as defined by available_alpha_charsets
- Returns
translated string
-
stemgraphic.helpers.
translate_representation
(text, charset=None, index=None, zero_blank=None)¶ translate_representation.
Replace the default (ASCII type) digit glyphs in a string with the equivalent in a different unicode charset.
- Parameters
text – input string
charset – unicode character set as defined by available_alpha_charsets
index – correspond to which item in a list we are looking at, for zero_blank
zero_blank – will blank 0 if True, unless we are looking at header (row index < 2)
- Returns
translated string
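Digit translation of this kind can be sketched with `str.translate` and one entry of the mapping dictionary listed earlier. This is an illustration under stated assumptions, not the library's code: `translate_digits` is a hypothetical name, and the `zero_blank` handling here ignores the header-row rule described for the real function.

```python
# Copy of the 'circled' entry from the mapping shown earlier in this module.
CIRCLED = {"0": "⓪", "1": "①", "2": "②", "3": "③", "4": "④",
           "5": "⑤", "6": "⑥", "7": "⑦", "8": "⑧", "9": "⑨"}


def translate_digits(text, digit_map, zero_blank=False):
    """Replace ASCII digits in text using digit_map (hypothetical helper)."""
    # str.maketrans turns the one-character keys into a code point table.
    table = str.maketrans(digit_map)
    if zero_blank:
        # Simplification: blank every zero, with no header exception.
        table[ord("0")] = " "
    return text.translate(table)


print(translate_digits("17|4", CIRCLED))
```

Characters that are not keys of the mapping, such as the `|` delimiter, pass through unchanged.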
num
¶
stemgraphic.num.
BRAND NEW in V.0.5.0!
Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. num is a module of the stemgraphic package to handle numerical variables.
This module structure is new as of v.0.5.0 to match the addition of stemgraphic.alpha.
The shorthand from previous versions of stemgraphic is still available and defaults to the numerical functions:
from stemgraphic import stem_graphic, stem_text, heatmap
stopwords
¶
stopwords.
This module includes 4 lists of stop words: EN (main English list), ALT_EN (alternate English list), FR (French) and ES (Spanish).
A PT (Portuguese) list is in the works.
-
stemgraphic.stopwords.
ALT_EN
= ['a', 'am', 'an', 'and', 'are', 'as', 'at', 'been', 'for', 'from', 'in', 'is', 'of', 'on', 'or', 'out', 'so', 'such', 'that', 'the', 'these', 'this', 'those', 'to', 'upon', 'was', 'were']¶ ALT_ENglish stopwords
-
stemgraphic.stopwords.
EN
= ['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'inside', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 
'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']¶ ENglish stop words
-
stemgraphic.stopwords.
ES
= ['a', 'alguna', 'algunas', 'alguno', 'algunos', 'algún', 'ambas', 'ambos', 'ampleamos', 'ante', 'antes', 'aquel', 'aquellas', 'aquellos', 'aqui', 'arriba', 'atras', 'aun', 'bajo', 'bastante', 'bien', 'cada', 'cierta', 'ciertas', 'cierto', 'ciertos', 'como', 'con', 'conseguimos', 'conseguir', 'consigo', 'consigue', 'consiguen', 'consigues', 'cual', 'cuando', 'dentro', 'desde', 'donde', 'dos', 'el', 'ella', 'ellas', 'ellos', 'empleais', 'emplean', 'emplear', 'empleas', 'empleo', 'en', 'encima', 'entonces', 'entre', 'era', 'eramos', 'eran', 'eras', 'eres', 'es', 'esta', 'estaba', 'estado', 'estais', 'estamos', 'estan', 'estoy', 'fin', 'fue', 'fueron', 'fui', 'fuimos', 'ha', 'hace', 'haceis', 'hacemos', 'hacen', 'hacer', 'haces', 'hago', 'incluso', 'intenta', 'intentais', 'intentamos', 'intentan', 'intentar', 'intentas', 'intento', 'ir', 'la', 'largo', 'las', 'lo', 'los', 'mientras', 'mio', 'modo', 'muchos', 'muy', 'nos', 'nosotros', 'otro', 'para', 'pero', 'podeis', 'podemos', 'poder', 'podria', 'podriais', 'podriamos', 'podrian', 'podrias', 'por', 'porque', 'primero', 'puede', 'pueden', 'puedo', 'quien', 'sabe', 'sabeis', 'sabemos', 'saben', 'saber', 'sabes', 'ser', 'sea', 'sean', 'si', 'siendo', 'sin', 'sobre', 'sois', 'solamente', 'solo', 'somos', 'soy', 'su', 'sus', 'también', 'teneis', 'tenemos', 'tener', 'tengo', 'tiempo', 'tiene', 'tienen', 'todo', 'trabaja', 'trabajais', 'trabajamos', 'trabajan', 'trabajar', 'trabajas', 'trabajo', 'tras', 'tu', 'tuyo', 'ultimo', 'un', 'una', 'unas', 'uno', 'unos', 'usa', 'usais', 'usamos', 'usan', 'usar', 'usas', 'uso', 'va', 'vais', 'valor', 'vamos', 'van', 'vaya', 'verdad', 'verdadera', 'verdadero', 'vosotras', 'vosotros', 'voy', 'yo']¶ Spanish (ESpanol) stop words
-
stemgraphic.stopwords.
FR
= ['a', 'alors', 'au', 'aucuns', 'aussi', 'autre', 'autres', 'aux', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chacun', 'chacune', 'chaque', 'ci', 'comme', 'comment', 'dans', 'de', 'dedans', 'dehors', 'depuis', 'derrière', 'des', 'dessus', 'devant', 'devrait', 'doit', 'donc', 'dos', 'du', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'hors', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'ou', 'où', 'par', 'parce', 'pas', 'peu', 'peut', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quelque', 'quelques', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']¶ French (FRancais) stop words
-
stemgraphic.stopwords.
VOCALES
= ['a', 'á', 'e', 'é', 'i', 'í', 'o', 'ó', 'u', 'ú', 'ü']¶ Spanish vowels
-
stemgraphic.stopwords.
VOWELS
= ['a', 'e', 'i', 'o', 'u']¶ English vowels
-
stemgraphic.stopwords.
VOYELLES
= ['a', 'â', 'ä', 'à', 'æ', 'e', 'ê', 'ë', 'é', 'è', 'i', 'î', 'ï', 'o', 'ô', 'ö', 'œ', 'u', 'û', 'ü', 'ù', 'y']¶ French vowels
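A typical use of such lists is to filter tokens before building a plot of words. The sketch below uses a short excerpt of the ALT_EN list so that it stays self-contained; `remove_stop_words` is a hypothetical helper, not part of the module.

```python
# Short excerpt of the ALT_EN list shown above; the full list is longer.
ALT_EN_EXCERPT = ["a", "an", "and", "are", "in", "is", "of", "on", "the", "to"]


def remove_stop_words(text, stop_words):
    """Lowercase text, split on whitespace and drop the stop words."""
    stop = set(stop_words)  # set membership keeps each lookup cheap
    return [word for word in text.lower().split() if word not in stop]


print(remove_stop_words("The stem and the leaf are parts of a plot",
                        ALT_EN_EXCERPT))
```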
text
¶
stemgraphic.text. visualizations for text.
-
stemgraphic.text.
heatmap
(df, caps=None, charset=None, column=None, compact=True, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=True, zoom=None)¶ heatmap.
The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. The difference with the heatmatrix is that by default it doesn’t show zero values and it presents a compact form by not showing whole empty rows either. Set compact=False to display those empty rows.
The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
Example:
heatmap(diamonds.carat, charset='bold');
Output:
Stem-and-leaf heatmap (30.1 x 0.1 ) 𝟎 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖 𝟗 stem 𝟐 𝟔 𝟒 𝟏 𝟑 𝟏𝟎 𝟑 𝟑 𝟒𝟖 𝟒𝟔 𝟑𝟑 𝟐𝟔 𝟐𝟏 𝟏𝟐 𝟖 𝟔 𝟏𝟏 𝟔 𝟒 𝟏𝟒 𝟐𝟎 𝟏𝟐 𝟏𝟎 𝟐 𝟒 𝟓 𝟑 𝟓 𝟏𝟗 𝟏𝟖 𝟏𝟐 𝟏𝟒 𝟗 𝟔 𝟔 𝟓 𝟑 𝟐 𝟔 𝟕 𝟑 𝟏 𝟒 𝟏 𝟕 𝟑𝟑 𝟏𝟒 𝟏𝟓 𝟏𝟏 𝟔 𝟔 𝟑 𝟑 𝟔 𝟒 𝟖 𝟒 𝟓 𝟏 𝟐 𝟗 𝟑𝟐 𝟕 𝟑 𝟏 𝟏 𝟏 𝟏 𝟑 𝟏𝟎 𝟐𝟓 𝟑𝟒 𝟏𝟐 𝟏𝟑 𝟕 𝟕 𝟓 𝟑 𝟏 𝟓 𝟏𝟏 𝟖 𝟏 𝟔 𝟓 𝟓 𝟒 𝟑 𝟑 𝟐 𝟏𝟐 𝟏𝟒 𝟓 𝟓 𝟓 𝟔 𝟏 𝟒 𝟑 𝟏 𝟏 𝟏𝟑 𝟏 𝟑 𝟐 𝟐 𝟏 𝟐 𝟏 𝟏 𝟏𝟒 𝟏 𝟏 𝟏𝟓 𝟗 𝟏𝟐 𝟕 𝟔 𝟓 𝟓 𝟏 𝟏 𝟐 𝟏𝟔 𝟏 𝟏 𝟏 𝟏𝟕 𝟑 𝟒 𝟐 𝟑 𝟐 𝟏 𝟏 𝟏𝟖 𝟏 𝟏 𝟏𝟗 𝟏 𝟏 𝟐𝟎 𝟔 𝟖 𝟏 𝟏 𝟑 𝟏 𝟐𝟏 𝟐 𝟏 𝟏 𝟏 𝟏 𝟐𝟐 𝟐 𝟏 𝟏 𝟐𝟑 𝟏 𝟐 𝟑𝟎 𝟏
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
charset – valid unicode digit character set, as returned by helpers.available_charsets()
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to True
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – wide format
leaf_order – how many leaf digits per data point to display, defaults to 1
outliers – for compatibility with other text plots
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zero_blank – replace zero digit with space
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
count matrix, scale
-
stemgraphic.text.
heatmatrix
(df, caps=None, charset=None, column=None, compact=False, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=False, zoom=None)¶ heatmatrix.
The heatmatrix displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale.
The heatmatrix is useful to look at patterns. For distribution, stem_graphic is better suited.
Example:
heatmatrix(diamonds.carat, charset='bold');
Output:
Stem-and-leaf heatmap (24.0 x 0.1 )
       𝟎   𝟏   𝟐   𝟑   𝟒   𝟓   𝟔   𝟕   𝟖   𝟗
stem
𝟐      𝟏   𝟎   𝟏   𝟓   𝟒   𝟏   𝟓   𝟎   𝟒   𝟐
𝟑     𝟒𝟓  𝟒𝟎  𝟐𝟔  𝟏𝟕  𝟏𝟒   𝟕   𝟖   𝟒  𝟏𝟐   𝟕
𝟒     𝟑𝟎  𝟑𝟏  𝟏𝟖   𝟖   𝟑   𝟐   𝟑   𝟎   𝟏   𝟏
𝟓     𝟐𝟑  𝟐𝟎   𝟖   𝟓   𝟖  𝟏𝟑   𝟖   𝟔   𝟓   𝟕
𝟔      𝟔   𝟒   𝟐   𝟎   𝟑   𝟎   𝟎   𝟎   𝟎   𝟎
𝟕     𝟏𝟖  𝟐𝟐  𝟏𝟐   𝟕   𝟕   𝟖   𝟒   𝟑   𝟒   𝟐
𝟖      𝟓   𝟒   𝟑   𝟓   𝟎   𝟏   𝟓   𝟏   𝟎   𝟎
𝟗     𝟏𝟗  𝟏𝟒   𝟐   𝟐   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎
𝟏𝟎    𝟐𝟖  𝟑𝟔  𝟏𝟎   𝟖   𝟗  𝟏𝟎   𝟏  𝟏𝟒   𝟒   𝟓
𝟏𝟏     𝟕   𝟒   𝟒   𝟑   𝟒   𝟎   𝟔   𝟏   𝟏   𝟏
𝟏𝟐    𝟏𝟏   𝟗   𝟗   𝟒   𝟕   𝟐   𝟏   𝟐   𝟐   𝟏
𝟏𝟑     𝟔   𝟏   𝟒   𝟐   𝟐   𝟎   𝟎   𝟎   𝟏   𝟎
𝟏𝟒     𝟎   𝟎   𝟎   𝟎   𝟏   𝟎   𝟎   𝟎   𝟎   𝟎
𝟏𝟓    𝟏𝟎  𝟏𝟔   𝟒   𝟑   𝟑   𝟓   𝟐   𝟑   𝟐   𝟏
𝟏𝟔     𝟐   𝟏   𝟏   𝟏   𝟎   𝟏   𝟎   𝟏   𝟎   𝟎
𝟏𝟕     𝟔   𝟓   𝟎   𝟏   𝟏   𝟏   𝟎   𝟎   𝟏   𝟏
𝟏𝟖     𝟏   𝟎   𝟏   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎
𝟏𝟗     𝟏   𝟏   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎
𝟐𝟎     𝟑   𝟗   𝟒   𝟑   𝟐   𝟐   𝟏   𝟏   𝟏   𝟎
𝟐𝟏     𝟎   𝟏   𝟎   𝟎   𝟏   𝟎   𝟎   𝟏   𝟎   𝟎
𝟐𝟐     𝟎   𝟐   𝟏   𝟎   𝟎   𝟏   𝟎   𝟎   𝟎   𝟎
𝟐𝟑     𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎   𝟎
𝟐𝟒     𝟏   𝟎   𝟏   𝟎   𝟐   𝟎   𝟎   𝟎   𝟎   𝟎
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – wide format
leaf_order – how many leaf digits per data point to display, defaults to 1
outliers – for compatibility with other text plots
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
trim_blank – remove the blank between the delimiter and the first leaf, defaults to True
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zero_blank – replace zero digit with space
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
count matrix, scale
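The count matrix idea (row ‘42’, column ‘7’ holds the count of values starting with ‘427’) can be sketched in a few lines. This is an illustration only: `count_matrix` is a hypothetical name, and the sketch assumes non-negative integers and a scale that is a multiple of 10, whereas the library handles sampling, floats and automatic scaling.

```python
from collections import Counter


def count_matrix(values, scale):
    """Count values per (stem, leaf) cell, keeping empty rows and zeros."""
    unit = scale // 10  # size of one leaf step
    cells = Counter()
    for value in values:
        stem, rest = divmod(value, scale)
        cells[(stem, rest // unit)] += 1
    low = min(stem for stem, _ in cells)
    high = max(stem for stem, _ in cells)
    # One row per stem, ten leaf columns; a missing cell counts as 0.
    return {stem: [cells[(stem, leaf)] for leaf in range(10)]
            for stem in range(low, high + 1)}


matrix = count_matrix([4270, 4275, 4279, 4231, 4410], scale=100)
print(matrix[42])
```

Keeping the all-zero row for stem 43 mirrors the non-compact behavior; dropping rows whose counts are all zero would give the compact form.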
-
stemgraphic.text.
quantize
(df, column=None, display=750, leaf_order=1, random_state=None, scale=None, trim=None, zoom=None)¶ quantize.
Converts a series into stem-and-leaf and back into decimal. This has the potential effect of decimating (or truncating) values in a lossy way.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
display – maximum number of data points to display, forces sampling if smaller than len(df)
leaf_order – how many leaf digits per data point to display, defaults to 1
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
zoom – zoom level, on top of calculated scale (+1, -1 etc)
- Returns
decimated df
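The lossy effect can be seen with a small sketch that keeps only the stem and the leaf digits and discards the rest. It assumes non-negative integers and a scale that is a power of 10; `quantize_sketch` is a hypothetical name and does not reproduce the library's sampling or automatic scaling.

```python
def quantize_sketch(values, scale, leaf_order=1):
    """Keep the stem and `leaf_order` leaf digits, truncating the rest."""
    step = scale // (10 ** leaf_order)  # smallest unit that survives
    # Integer division drops everything below one leaf step.
    return [value // step * step for value in values]


print(quantize_sketch([326, 18823, 4999], scale=1000))
print(quantize_sketch([326, 18823, 4999], scale=1000, leaf_order=2))
```

With a scale of 1000 and one leaf digit, 18823 is read as 18|8 and comes back as 18800, so the last two digits are lost.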
-
stemgraphic.text.
stem_data
(x, break_on=None, column=None, compact=False, display=300, full=False, leaf_order=1, omin=None, omax=None, outliers=False, persistence=None, random_state=None, scale=None, total_rows=None, trim=False, zoom=None)¶ stem_data.
Returns scale factor, key label and list of rows.
- Parameters
x – list, numpy array, time series, pandas or dask dataframe
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
full – bool, if True returns all interim results including sorted data and stems
leaf_order – how many leaf digits per data point to display, defaults to 1
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
omin – float, if already calculated, helps speed up the process for large data sets
omax – float, if already calculated, helps speed up the process for large data sets
persistence – persist sampled dataframe
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic)
total_rows – int, if already calculated, helps speed up the process for large data sets
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
zoom – zoom level, on top of calculated scale (+1, -1 etc)
-
stemgraphic.text.
stem_dot
(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, leaf_order=1, legend_pos='best', marker=None, outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)¶ stem_dot.
stem_dot builds a stem-and-leaf plot with dots instead of bars.
Example:
stem_dot(diamonds.price)
Output:
326 ¡ 0 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 1 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 2 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 3 |●●●●●●●●●●●●●●●●●●●●●●●●●●●● 4 |●●●●●●●●●●●●●●●●●●●●●●●●●●● 5 |●●●●●●●●●●●●●●● 6 |●●●●●●●●● 7 |●●● 8 |●●●●● 9 |●●●●●●● 10 |●● 11 |●●●● 12 |●●●●● 13 |●●●●● 14 |●● 15 |●●● 16 |●● 17 |●●●● ! 18823 Scale: 17|1 => 17.1x1000 = 17100.0
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – bool, default is False
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
marker – char, symbol to use as marker. ‘●’ is default. Suggested alternatives: ‘*’, ‘+’, ‘x’, ‘.’, ‘o’
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
symmetric – if True, dot plot will be distributed on both sides of a center line
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
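The dot rows amount to one marker per data point in each stem. A minimal sketch of that idea follows, assuming non-negative integers and a fixed scale; `dot_rows` is a hypothetical helper and leaves out outliers, sampling and the legend.

```python
from collections import Counter


def dot_rows(values, scale, marker="●"):
    """Return one text row per stem, with one marker per data point."""
    counts = Counter(value // scale for value in values)
    low, high = min(counts), max(counts)
    width = len(str(high))  # right-align the stems
    # Stems with no data still get a row, as in the non-compact plot.
    return [f"{stem:>{width}} |{marker * counts[stem]}"
            for stem in range(low, high + 1)]


for row in dot_rows([326, 450, 1200, 1950, 1999, 3100], scale=1000):
    print(row)
```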
-
stemgraphic.text.
stem_hist
(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, leaf_order=1, legend_pos='best', marker=None, outliers=True, persistence=None, random_state=None, scale=None, shade=None, symmetric=False, trim=False, unit='', zoom=None)¶ stem_hist.
stem_hist builds a histogram matching the stem-and-leaf plot, with the numbers hidden, as shown on the cover of the companion brochure.
Example:
stem_hist(diamonds.price, shade='medium')
Output:
0 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 1 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 2 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 3 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 4 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 5 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 6 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 7 |▒▒▒▒▒▒▒▒▒▒▒▒ 8 |▒▒▒▒ 9 |▒▒▒▒ 10 |▒▒▒▒▒▒▒ 11 |▒▒ 12 |▒▒▒▒▒▒▒ 13 |▒▒▒▒ 14 | 15 |▒ 16 |▒ 17 | 18 |▒ Scale: 18|4 => 18.4x1000 = 18400.0
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – bool, default is False
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
marker – char, symbol to use as marker. ‘O’ is default. Suggested alternatives: ‘*’, ‘+’, ‘x’, ‘.’, ‘o’
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
shade – shade of marker: ‘none’,’light’,’medium’,’dark’,’full’
symmetric – if True, the plot will be distributed on both sides of a center line
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
-
stemgraphic.text.
stem_tally
(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, legend_pos='best', outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)¶ stem_tally.
Stem-and-leaf plot using tally marks for leaf count, up to 5 per block.
Example:
stem_tally(diamonds.price)
Output:
326
¡
0 |卌卌卌卌卌卌卌卌卌卌卌卌卌卌卌𝍩
1 |卌卌卌卌卌卌卌卌卌卌卌卌
2 |卌卌卌卌卌卌𝍫
3 |卌卌卌卌𝍩
4 |卌卌卌卌卌𝍫
5 |卌卌卌卌卌𝍩
6 |卌卌卌𝍩
7 |卌卌卌𝍩
8 |卌卌𝍩
9 |𝍫
10 |𝍪
11 |𝍬
12 |卌𝍩
13 |𝍬
14 |𝍬
15 |𝍫
16 |𝍪
17 |
18 |𝍫
!
18823
Key: 18|3 => 18.3x1000 = 18300.0
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – bool, default is False
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
symmetric – if True, the plot will be distributed on both sides of a center line
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
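The tally encoding groups leaves in blocks of five and writes the remainder with a single glyph. The sketch below assumes the remainder glyphs are the 1 to 4 entries of the ‘rod’ charset listed in the helpers mapping, which is consistent with the example output; `tally` is a hypothetical helper.

```python
# Remainder glyphs, assumed to be entries 1 to 4 of the 'rod' charset.
ROD = {1: "𝍩", 2: "𝍪", 3: "𝍫", 4: "𝍬"}


def tally(count, block="卌"):
    """Render count as blocks of five plus one remainder glyph."""
    blocks, remainder = divmod(count, 5)
    # A remainder of 0 adds nothing after the full blocks.
    return block * blocks + ROD.get(remainder, "")


print(tally(7))
```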
-
stemgraphic.text.
stem_text
(df, asc=True, break_on=None, charset=None, column=None, compact=False, display=300, flip_axes=False, legend_pos='best', outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)¶ stem_text.
Classic text based stem-and-leaf plot.
- Parameters
df – list, numpy array, time series, pandas or dask dataframe
asc – stem sorted in ascending order, defaults to True
break_on – force a break of the leaves at x in (5, 10), defaults to 10
charset – defaults to ascii; alternatives include ‘roman’, ‘rod’, ‘arabic’, ‘circled’, ‘circled_inverted’
column – specify which column (string or number) of the dataframe to use, else the first numerical is selected
compact – do not display empty stem rows (with no leaves), defaults to False
display – maximum number of data points to display, forces sampling if smaller than len(df)
flip_axes – bool, default is False
legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.
outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto
persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)
random_state – initial random seed for the sampling process, for reproducible research
scale – force a specific scale for building the plot. Defaults to None (automatic).
symmetric – if True, the plot will be distributed on both sides of a center line
trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None
unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’
zoom – zoom level, on top of calculated scale (+1, -1 etc)
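For readers new to the format, the classic text plot can be sketched as follows: each value is split into a stem and a single leaf digit, and the leaves are stacked in sorted order next to their stem. This is a simplified illustration with a hypothetical `stem_rows` helper, assuming non-negative integers and a fixed scale; it omits outliers, sampling, charsets and the legend.

```python
from collections import defaultdict


def stem_rows(values, scale):
    """Return the rows of a basic stem-and-leaf plot as strings."""
    unit = scale // 10  # one leaf digit per data point
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, rest = divmod(value, scale)
        leaves[stem].append(str(rest // unit))
    low, high = min(leaves), max(leaves)
    width = len(str(high))  # right-align the stems
    # Empty stems are kept so the shape of the distribution is preserved.
    return [f"{stem:>{width}} |{''.join(leaves[stem])}"
            for stem in range(low, high + 1)]


for row in stem_rows([12, 15, 21, 27, 27, 43], scale=10):
    print(row)
```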