Stemgraphic Modules

stemgraphic

stem_graphic.

Package implementing a complete toolkit for text and graphical stem-and-leaf plots, and for other visualizations adapted to stem-and-leaf pair values, such as heatmaps and sunburst charts.

It also handles very large data sets through scaling, sampling, trimming and other techniques.

See the research paper (http://artchiv.es/pydata2016/stemgraphic) for more technical details.

A command-line utility is installed along with the package to process Excel or CSV files. See: stem -h

aliases

Handy aliases for stem_graphic options.

stemgraphic.aliases.stem_hist(x, aggregation=False, alpha=1, asc=True, column=None, color='b', delimiter_color='r', display=300, flip_axes=True, legend_pos='short', outliers=False, trim=False)

stem_hist.

stem_hist builds a graphical histogram matching the stem-and-leaf plot, with the numbers hidden, as shown on the cover of the companion brochure.

Parameters
  • legend_pos

  • x – list, numpy array, time series, pandas or dask dataframe

  • aggregation – Boolean for sum, else specify function

  • alpha – opacity of the bars, median and outliers, defaults to 1 (fully opaque)

  • asc – stem sorted in ascending order, defaults to True

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • color – the bar facecolor

  • delimiter_color – color of the line between aggregate and stem and stem and leaf

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – X becomes Y and Y becomes X

  • outliers – this is NOP, for compatibility

  • trim – this is NOP, for compatibility

Returns

matplotlib figure and axes instance

stemgraphic.aliases.stem_kde(x, **kw_args)

stem_kde builds a stem-and-leaf plot and adds an overlaid kde as a secondary plot.

Parameters
  • x – list, numpy array, time series, pandas or dask dataframe

  • kw_args

Returns

matplotlib figure and axes instance

stemgraphic.aliases.stem_line(x, aggregation=False, alpha=0, asc=True, column=None, color='k', delimiter_color='r', display=300, flip_axes=True, outliers=False, secondary_plot=None, trim=False)

stem_line builds a stem-and-leaf plot with lines instead of bars.

Parameters
  • x – list, numpy array, time series, pandas or dask dataframe

  • aggregation – Boolean for sum, else specify function

  • alpha – opacity of the bars, median and outliers, defaults to 0 (fully transparent)

  • asc – stem sorted in ascending order, defaults to True

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • color – the color of the line

  • delimiter_color – color of the line between aggregate and stem and stem and leaf

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – X becomes Y and Y becomes X

  • outliers

  • secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None

  • trim – this is NOP, for compatibility

Returns

matplotlib figure and axes instance

stemgraphic.aliases.stem_symmetric_dot(x, **kw_args)

stem_symmetric_dot.

stem_symmetric_dot builds a symmetric stem dot plot.

Example:

stem_symmetric_dot(diamonds.price)

Output:

326
    ¡
  0 | ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  1 |            ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  2 |                     ●●●●●●●●●●●●●●●●●●●●●●●●●●
  3 |                    ●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  4 |                   ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  5 |                       ●●●●●●●●●●●●●●●●●●●●●●●
  6 |                        ●●●●●●●●●●●●●●●●●●●●
  7 |                           ●●●●●●●●●●●●●●●
  8 |                                 ●●●
  9 |                              ●●●●●●●●
 10 |                                ●●●●
 11 |                               ●●●●●●●
 12 |                               ●●●●●●
 13 |                                 ●●
 14 |                                 ●●
 15 |                                 ●●●
 16 |                                 ●●●
 17 |                              ●●●●●●●●
 18 |                                  ●
    !
18823
Scale:
18|6 => 18.6x1000 = 18600.0
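
The Scale line above can be read as a small piece of arithmetic. The following is a minimal sketch of that reading, not the library's implementation: the helper names split_stem_leaf and rebuild are hypothetical, and truncating (rather than rounding) the digits after the leaf is an assumption.

```python
def split_stem_leaf(value, scale):
    """Split a number into (stem, leaf) for a given scale.

    Assumption: digits after the leaf are truncated, not rounded.
    """
    stem = int(value // scale)
    leaf = int((value % scale) * 10 // scale)
    return stem, leaf


def rebuild(stem, leaf, scale):
    """Return the value that a stem|leaf pair stands for at this scale."""
    return (stem * 10 + leaf) * scale / 10


# 18|6 at a scale of 1000 stands for 18.6 x 1000
print(split_stem_leaf(18600, 1000))
print(rebuild(18, 6, 1000))
```

Under these assumptions, the maximum shown in the output (18823) would fall on stem 18, which is the last row of the plot.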
Parameters
  • x – list, numpy array, time series, pandas or dask dataframe

  • kw_args – keyword args to stem_dot

Returns

alpha

stemgraphic.alpha.

New in version 0.5.0.

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. alpha is a module of the stemgraphic package to add support for categorical and text variables.

The module also adds functionality to handle whole words, besides stem-and-leaf bigrams and n-grams.

For example, for the word “alabaster”:

With word_ functions, we can look at the word frequency in a text, or compare the word to other words in a corpus through a distance function (Levenshtein by default).

With stem_ functions, we can look at the fundamental stem-and-leaf pair: the stem would be ‘a’ and the leaf would be ‘l’, for a bigram ‘al’. With a stem_order of 1 and a leaf_order of 2, we would have ‘a’ and ‘la’, for a trigram ‘ala’, and so on.
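
The split described for “alabaster” can be sketched in a few lines of plain Python. This is only an illustration of the idea; the function name stem_and_leaf is hypothetical and is not part of stemgraphic.alpha.

```python
def stem_and_leaf(word, stem_order=1, leaf_order=1):
    """Return (stem, leaf, ngram) for a word.

    Hypothetical helper: the stem is the first stem_order characters,
    the leaf is the next leaf_order characters.
    """
    stem = word[:stem_order]
    leaf = word[stem_order:stem_order + leaf_order]
    return stem, leaf, stem + leaf


print(stem_and_leaf("alabaster"))                 # bigram case
print(stem_and_leaf("alabaster", leaf_order=2))   # trigram case
```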

stemgraphic.alpha.add_missing_letters(mat, stem_order, leaf_order, letters=None)

Adds missing stems based on letters. Defaults to the a-z alphabet.

Parameters
  • mat – matrix to modify

  • stem_order – how many stem characters per data point to display, defaults to 1

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • letters – letters that must be present as stems

Returns

the modified matrix

stemgraphic.alpha.heatmap(src, alpha_only=False, annotate=False, asFigure=False, ax=None, caps=False, compact=True, display=None, flip_axes=False, interactive=True, leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None, trim=None)

heatmap.

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘a’ and Column ‘b’ would have the count of words starting with ‘ab’. The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.
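
To make the row/column reading concrete, here is a minimal sketch of such a count matrix using only the standard library. The helper names bigram_counts and as_matrix are hypothetical, and lowercasing every word is an assumption that mirrors caps=False; the library builds its own matrix from its ingestion function.

```python
from collections import Counter


def bigram_counts(words):
    """Count (stem, leaf) pairs using the first two letters of each word."""
    counts = Counter()
    for word in words:
        word = word.lower()  # assumption: case-insensitive, like caps=False
        if len(word) >= 2:
            counts[(word[0], word[1])] += 1
    return counts


def as_matrix(counts):
    """Lay the counts out with one row per stem and one column per leaf."""
    stems = sorted({stem for stem, _ in counts})
    leaves = sorted({leaf for _, leaf in counts})
    rows = [[counts.get((s, l), 0) for l in leaves] for s in stems]
    return stems, leaves, rows


words = ["abacus", "able", "about", "acorn", "banana", "bat"]
counts = bigram_counts(words)
stems, leaves, rows = as_matrix(counts)
print(counts[("a", "b")])  # number of words starting with 'ab'
```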

Parameters
  • src – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet

  • annotate – display annotations (Z) on heatmap

  • asFigure – return plot as plotly figure (for web applications)

  • ax – matplotlib axes instance, usually from a figure or other plot

  • caps – bool, True to be case sensitive

  • compact – remove empty stems

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • random_state – initial random seed for the sampling process, for reproducible research

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • trim – for compatibility

Returns

stemgraphic.alpha.heatmap_grid(src1, src2, src3=None, src4=None, alpha_only=True, annot=False, caps=False, center=0, cmap=None, display=1000, leaf_order=1, leaf_skip=0, random_state=None, reverse=False, robust=False, stem_order=1, stem_skip=0, stop_words=None, threshold=0)

heatmap_grid.

With stem_graphic, it is possible to directly compare two different sources. In the case of a heatmap, two different data sets cannot be visualized directly on a single heatmap. For this task, we designed heatmap_grid to adapt to the number of sources to build a layout. It can take from 2 to 4 different sources.

With 2 sources, a square grid will be generated, allowing for horizontal and vertical comparisons, with an extra heatmap showing the difference between the two matrices. It also computes a norm for that difference matrix. The smaller the value, the closer the two heatmaps are.

With 3 sources, it builds a triangular grid, with each source heatmap in a corner and the difference between each pair in between.

Finally, with 4 sources, a 3 x 3 grid is built, each source in a corner and the difference between each pair in between, with the center expressing the difference between top left and bottom right diagonal.

Parameters
  • src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)

  • src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (required)

  • src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)

  • src4 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional)

  • alpha_only – only use stems from a-z alphabet

  • annot – display annotations (Z) on heatmap

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons.

  • center – the center of the divergent color map for the difference heatmaps

  • cmap – color map for difference heatmap or None (default) to use the builtin red / blue divergent map

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • robust – reduce effect of outliers on difference heatmap

  • random_state – initial random seed for the sampling process, for reproducible research

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • threshold – absolute value minimum count difference for a difference heatmap element to be visible

Returns

stemgraphic.alpha.heatmatrix(src, alpha_only=False, caps=False, charset=None, column=None, compact=True, display=None, flip_axes=None, leaf_order=1, leaf_skip=0, outliers=None, persistence=None, random_state=None, scale=None, stem_order=1, stem_skip=0, stop_words=None, trim=None, trim_blank=None, unit='', zero_blank=True, zoom=None)

heatmatrix.

The heatmatrix displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘a’ and Column ‘b’ would have the count of words starting with ‘ab’. The heatmatrix is useful to look at patterns. For distribution, stem_graphic is better suited.

Parameters
  • src – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet

  • caps – bool, True to be case sensitive

  • charset

  • column – specify which column (string or number) of the dataframe to use, else the first is selected

  • compact – remove empty stems

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – wide format

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • outliers – for compatibility with other text plots

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zero_blank – replace zero digit with space

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

count matrix, scale

stemgraphic.alpha.matrix_difference(mat1, mat2, thresh=0, ord=None)

matrix_difference.

Parameters
  • mat1 – first heatmap dataframe

  • mat2 – second heatmap dataframe

  • thresh – absolute value minimum count difference for a difference heatmap element to be visible

Returns

difference matrix, norm and ratio of the sum of the first matrix over the second
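
The three returned values can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the matrices are represented as dictionaries keyed by (stem, leaf), the norm is taken to be the Frobenius norm (which is what numpy.linalg.norm computes for a matrix when ord is None), differences below thresh are zeroed before the norm is taken, and the function name difference is hypothetical.

```python
import math


def difference(mat1, mat2, thresh=0):
    """Return (difference, norm, ratio) for two count dictionaries.

    Assumptions: cells whose absolute difference is below thresh are set
    to 0, and the norm is the Frobenius norm of the thresholded result.
    """
    diff = {}
    for key in set(mat1) | set(mat2):
        delta = mat1.get(key, 0) - mat2.get(key, 0)
        diff[key] = delta if abs(delta) >= thresh else 0
    norm = math.sqrt(sum(v * v for v in diff.values()))
    total2 = sum(mat2.values())
    ratio = sum(mat1.values()) / total2 if total2 else float("nan")
    return diff, norm, ratio


mat1 = {("a", "b"): 3, ("a", "c"): 1}
mat2 = {("a", "b"): 1, ("b", "a"): 2}
diff, norm, ratio = difference(mat1, mat2)
print(norm, ratio)
```

A norm of zero would mean the two count matrices are identical, which matches the reading given for heatmap_grid: the smaller the value, the closer the two heatmaps.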

stemgraphic.alpha.ngram_data(df, alpha_only=False, ascending=True, binary=False, break_on=None, caps=False, char_filter=None, column=None, compact=False, display=750, leaf_order=1, leaf_skip=0, persistence=None, random_state=None, remove_accents=False, reverse=False, rows_only=True, sort_by='len', stem_order=1, stem_skip=0, stop_words=None)

ngram_data.

This is the main text ingestion function for stemgraphic.alpha. It is used by most of the visualizations. It can also be used directly, to feed a pipeline, for example.

If selected (rows_only=False), the returned dataframe includes in each row a single word, the stem, the leaf and the ngram (stem + leaf) - the index is the ‘token’ position in the original source:

    word   stem  leaf  ngram
12  salut  s     a     sa
13  chéri  c     h     ch

Parameters
  • df – list, numpy array, series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – bool if the sort is ascending

  • binary – bool if True forces counts to 1 for anything greater than 0

  • break_on – letter on which to break a row, or None (default)

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • char_filter – list of characters to ignore. If None (default) CHAR_FILTER list will be used

  • column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.

  • compact – remove empty stems

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None

  • random_state – initial random seed for the sampling process, for reproducible research

  • remove_accents – bool if True strips accents (NA on dataframe)

  • rows_only – bool, defaults to True, in which case only the stem and leaf rows are returned. If False, the matrix and dataframe are returned as well

  • sort_by – default to ‘len’, can also be ‘alpha’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

Returns

ordered rows if rows_only, else also returns the matrix and dataframe
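
The interaction between display and random_state described in the parameter list can be sketched as follows. This is a stand-in built on the standard library's random module, assuming that sampling only happens when display is smaller than the number of items; the function name sample_words is hypothetical and the library's own sampling mechanism may differ.

```python
import random


def sample_words(words, display=750, random_state=None):
    """Return at most `display` words, sampled reproducibly.

    Assumption: no sampling takes place when display is None or is at
    least as large as the data.
    """
    words = list(words)
    if display is None or display >= len(words):
        return words
    rng = random.Random(random_state)  # a fixed seed gives a repeatable sample
    return rng.sample(words, display)


words = [f"w{i}" for i in range(100)]
first = sample_words(words, display=10, random_state=42)
second = sample_words(words, display=10, random_state=42)
print(first == second)
```

Passing the same random_state twice yields the same subset, which is the property the parameter description refers to as reproducible research.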

stemgraphic.alpha.plot_sunburst_level(normalized, ax, label=True, level=0, offset=0, ngram=False, plot=True, stem=None, vis=0)

plot_sunburst_level.

Utility function for the sunburst function.

Parameters
  • normalized

  • ax

  • label

  • level

  • ngram

  • offset

  • plot

  • stem

  • vis

Returns

stemgraphic.alpha.polar_word_plot(ax, word, words, label, min_dist, max_dist, metric, offset, step)

polar_word_plot.

Utility function for radar plot.

Parameters
  • ax – matplotlib ax

  • word – string, the reference word that will be placed in the middle

  • words – list of words to compare

  • label – bool if True display words centered at coordinate

  • min_dist – minimum distance based on metric to include a word for display

  • max_dist – maximum distance for a given section

  • metric – any metric function accepting two values and returning that metric in a range from 0 to x

  • offset – where to start plotting in degrees

  • step – how many degrees to step between plots

Returns

stemgraphic.alpha.radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')

radar.

The radar plot compares a reference word with a corpus. By default, it calculates the Levenshtein distance between the reference word and each word in the corpus. An alternate distance or metric function can be provided. Each word is then plotted around the center based on 3 criteria.

  1. If the word length is longer, it is plotted on the left side, else on the right side.

  2. Distance from center is based on the distance function.

  3. The words are equidistant, and their order is defined alphabetically or by count (only applicable if the corpus is a text and not a list of unique words, such as a password dictionary).

Stem-and-leaf support is upcoming.
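
The first two criteria above can be sketched without any plotting. The code below is an illustration only: levenshtein is a standard dynamic-programming edit distance, place is a hypothetical helper, and reading criterion 1 as "longer than the reference word" is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def place(word, comparisons, min_distance=1, max_distance=None):
    """Return (other, distance, side) for each word kept for display.

    Assumption: words longer than the reference go to the left side.
    """
    placed = []
    for other in sorted(comparisons):  # alphabetical order
        dist = levenshtein(word, other)
        if dist < min_distance:
            continue
        if max_distance is not None and dist > max_distance:
            continue
        side = "left" if len(other) > len(word) else "right"
        placed.append((other, dist, side))
    return placed


print(place("stem", ["stems", "step", "stem", "steam", "leaf"], max_distance=2))
```

With min_distance=1 the reference word itself (distance 0) is left out, which is consistent with the default shown in the signature.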

Parameters
  • word – string, the reference word that will be placed in the middle

  • comparisons – external file, list or string or dataframe of words

  • ascending – bool if the sort is ascending

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • label – bool if True display words centered at coordinate

  • metric – Levenshtein (default), or any metric function accepting two values and returning that metric

  • min_distance – minimum distance based on metric to include a word for display

  • max_distance – maximum distance based on metric to include a word for display

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – default to ‘alpha’, can also be ‘len’

Returns

stemgraphic.alpha.scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, size=5, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)

scatter.

With 2 sources:

Scatter compares the word frequency of two sources, one on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If a word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink).

With 3 sources:

The scatter will compare in 3d the word frequency of three sources, one on each axis. Each data point hover value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in the 1st source, the Y axis reflects the same word/ngram count in the 2nd source, and the Z axis the 3rd source, each in a different color. If a word/ngram is more common in the 1st source it will be displayed in one color, in the 2nd source in a second color, and if it is more common in the 3rd source, it will be displayed in a third color. The values that are the same for all three sources will be displayed in a 4th color (default colors are blue, black, purple and pink).

In interactive mode, hovering the data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend. Double clicking a category will show only that category.
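
The two-source comparison described above boils down to counting each word in both sources and labelling it by where it is more common. The following is a minimal sketch of that bookkeeping, with no plotting; the function name compare_counts and the category labels are hypothetical.

```python
from collections import Counter


def compare_counts(words1, words2):
    """Return (word, count in src1, count in src2, category) rows."""
    c1, c2 = Counter(words1), Counter(words2)
    rows = []
    for word in sorted(set(c1) | set(c2)):
        x, y = c1[word], c2[word]  # a Counter returns 0 for missing words
        if x > y:
            category = "src1"
        elif y > x:
            category = "src2"
        else:
            category = "same"
        rows.append((word, x, y, category))
    return rows


src1 = "the cat and the hat".split()
src2 = "the dog and a hat hat".split()
rows = compare_counts(src1, src2)
for row in rows:
    print(row)
```

Each row carries the X and Y coordinates plus the category that would decide the color of the point.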

Parameters
  • src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional

  • alpha – opacity of the dots, defaults to 50%

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – word/stem count sorted in ascending order, defaults to True

  • asFigure – return plot as plotly figure (for web applications)

  • ax – matplotlib axes instance, usually from a figure or other plot

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • fig_xy – tuple for matplotlib figsize, defaults to (20,20)

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • jitter – random noise added to help see multiple data points sharing the same coordinate

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)

  • normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length

  • percentage – coordinates in percentage of maximum word/ngram count (in non interactive mode)

  • project – project src1/src2 and src1/src3 comparisons on X=0 and Z=0 planes

  • project_only – only show the projection (NA if project is False)

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • whole – for normalized or percentage, use whole integer values (round)

Returns

matplotlib ax, dataframe with categories

stemgraphic.alpha.stem_freq_plot(df, alpha_only=False, asFigure=False, column=None, compact=True, caps=False, display=2600, interactive=True, kind='barh', leaf_order=1, leaf_skip=0, random_state=None, stem_order=1, stem_skip=0, stop_words=None)

stem_freq_plot.

The word frequency plot is one of the most common visualizations in NLP. This version supports stem-and-leaf / n-grams.

Each row is a stem; similar leaves are grouped together, and each group is stacked in the bar chart.

The default is a horizontal bar chart, but vertical bar charts, histograms, area charts and even pie charts are also supported by this one visualization.

Parameters
  • df – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • asFigure – return plot as plotly figure (for web applications)

  • column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • kind – defaults to ‘barh’. One of ‘bar’,’barh’,’area’,’hist’. Non-interactive also supports ‘pie’

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • random_state – initial random seed for the sampling process, for reproducible research

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

Returns

stemgraphic.alpha.stem_graphic(df, df2=None, aggregation=True, alpha=0.1, alpha_only=True, ascending=False, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, caps=True, column=None, combined=None, compact=False, delimiter_color='C3', display=750, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, leaf_skip=0, legend_pos='best', median_color='C4', mirror=False, persistence=None, primary_kw=None, random_state=None, remove_accents=False, reverse=False, secondary=False, show_stem=True, sort_by='len', stop_words=None, stem_order=1, stem_skip=0, title=None, trim_blank=False, underline_color=None)

stem_graphic.

The principal visualization of stemgraphic.alpha is stem_graphic. It offers all the options of stem_text (3.1) and adds automatic title, mirroring, flipping of axes, export (to pdf, svg, png, through fig.savefig) and many more options to change the visual appearance of the plot (font size, color, background color, underlining and more).

By providing a secondary text source, the plot will enable comparison through a back-to-back display.
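
As a rough text-only illustration of what a back-to-back display conveys, the sketch below puts the leaves of one source to the left of a shared stem column and the leaves of the other source to the right. It is not how the plot is drawn by the library; the helper names are hypothetical, and using the first letter as stem and the second as leaf assumes stem_order=1 and leaf_order=1.

```python
from collections import defaultdict


def leaves_by_stem(words):
    """Map each stem (first letter) to its sorted leaves (second letters)."""
    rows = defaultdict(list)
    for word in words:
        word = word.lower()
        if len(word) >= 2:
            rows[word[0]].append(word[1])
    return {stem: "".join(sorted(leaves)) for stem, leaves in rows.items()}


def back_to_back(words1, words2):
    """Return text lines with source 1 on the left and source 2 on the right."""
    left, right = leaves_by_stem(words1), leaves_by_stem(words2)
    width = max((len(v) for v in left.values()), default=0)
    lines = []
    for stem in sorted(set(left) | set(right)):
        # left-hand leaves are reversed so they grow away from the stem
        left_side = left.get(stem, "")[::-1].rjust(width)
        lines.append(f"{left_side} | {stem} | {right.get(stem, '')}".rstrip())
    return lines


lines = back_to_back(["apple", "ant", "bee"], ["axe", "bat", "bug", "cow"])
print("\n".join(lines))
```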

Parameters
  • df – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots

  • aggregation – Boolean for sum, else specify function

  • alpha – opacity of the bars, median and outliers, defaults to 10%

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to False

  • ax – matplotlib axes instance, usually from a figure or other plot

  • ax2 – matplotlib axes instance, usually from a figure or other plot for back to back

  • bar_color – the fill color of the bar representing the leaves

  • bar_outline – the outline color of the bar representing the leaves

  • break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line

  • caps – bool, True to be case sensitive, defaults to True; False is recommended for comparisons (NA on dataframe)

  • column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.

  • combined – list (specific subset to automatically include, say, for comparisons), or None

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • delimiter_color – color of the line between aggregate and stem and stem and leaf

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)

  • flip_axes – X becomes Y and Y becomes X

  • font_kw – keyword dictionary, font parameters

  • leaf_color – font color of the leaves

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • median_color – color of the box representing the median

  • mirror – mirror the plot in the axis of the delimiters

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • primary_kw – stem-and-leaf plot additional arguments

  • random_state – initial random seed for the sampling process, for reproducible research

  • remove_accents – bool if True strips accents (NA on dataframe)

  • reverse – bool if True look at words from right to left

  • secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots

  • show_stem – bool if True (default) displays the stems

  • sort_by – default to ‘len’, can also be ‘alpha’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • title – string, or None. When None and source is a file, filename will be used.

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to False

  • underline_color – color of the horizontal line under the leaves, None for no display

Returns

matplotlib figure and axes instance, and dataframe if figure_only is False

stemgraphic.alpha.stem_scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=1, leaf_skip=0, log_scale=True, normalize=None, percentage=None, project=False, project_only=False, random_state=None, sort_by='alpha', stem_order=1, stem_skip=0, stop_words=None, whole=False)

stem_scatter.

stem_scatter compares the word frequency of two sources, one on each axis. Each data point Z value is the word or stem-and-leaf value, while the X axis reflects that word/ngram count in one source and the Y axis reflects the same word/ngram count in the other source, in two different colors. If a word/ngram is more common in the first source it will be displayed in one color, and if it is more common in the second source, it will be displayed in a different color. The values that are the same for both sources will be displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering over a data point will give the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.

Parameters
  • src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional

  • alpha – opacity of the dots, defaults to 50%

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to True

  • asFigure – return plot as plotly figure (for web applications)

  • ax – matplotlib axes instance, usually from a figure or other plot

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • fig_xy – tuple for matplotlib figsize, defaults to (20,20)

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • jitter – random noise added to help see multiple data points sharing the same coordinate

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • log_scale – bool if True (default) uses log scale axes (NA in 3d due to open issues with mpl, cufflinks)

  • normalize – bool if True normalize frequencies in src2 and src3 relative to src1 length

  • percentage – coordinates in percentage of maximum word/ngram count

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • whole – for normalized or percentage, use whole integer values (round)

Returns

matplotlib polar ax, dataframe
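
The color coding described above can be sketched in plain Python. This is only an illustration of the comparison logic, not stemgraphic's implementation; the helper name compare_counts and the category labels are hypothetical.

```python
from collections import Counter


def compare_counts(words1, words2):
    """Pair each word's count in two sources and label which side dominates.

    Hypothetical helper for illustration only; not part of stemgraphic.
    """
    c1, c2 = Counter(words1), Counter(words2)
    rows = []
    for word in sorted(set(c1) | set(c2)):
        # X axis: count in source 1, Y axis: count in source 2
        x, y = c1[word], c2[word]
        if x > y:
            category = "more in source 1"
        elif y > x:
            category = "more in source 2"
        else:
            category = "same in both"
        rows.append((word, x, y, category))
    return rows


rows = compare_counts(
    "the cat sat on the mat".split(),
    "the dog sat on a log".split(),
)
for row in rows:
    print(row)
```

Each tuple carries what a hover label would show: the word, its count on each axis, and the category that would drive the color.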

stemgraphic.alpha.stem_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=False, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=0)

stem_sunburst.

Stem-and-leaf based sunburst. See sunburst for details

Parameters
  • words – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to False

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • hole – bool if True (default) leave space in middle for statistics

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • median – bool if True (default) display an origin and a median mark

  • ngram – bool if True display full n-gram as leaf label

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • statistics – bool if True (default) displays statistics in center - hole has to be True

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns

stemgraphic.alpha.stem_text(df, aggr=False, alpha_only=True, ascending=True, binary=False, break_on=None, caps=True, charset=None, column=None, compact=False, display=750, legend_pos='top', leaf_order=1, leaf_skip=0, persistence=None, remove_accents=False, reverse=False, rows_only=False, sort_by='len', stem_order=1, stem_skip=0, stop_words=None, random_state=None)

stem_text.

Tukey’s original stem-and-leaf plot was text, with a vertical delimiter to separate stem from leaves. Just as stemgraphic implements a text version of the plot for numbers, stemgraphic.alpha implements a text version for words. This type of plot serves a similar purpose as a stacked bar chart with each data point annotated.

It also displays some basic statistics on the whole text (or subset if using column).

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • aggr – bool if True display the aggregated count of leaves by row

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – bool if the sort is ascending

  • binary – bool if True forces counts to 1 for anything greater than 0

  • break_on – force a break of the leaves at that letter, the rest of the leaves will appear on the next line

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • column – specify which column (string or number) of the dataframe to use, or group of columns (stems) else the frame is assumed to only have one column with words.

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • leaf_order – how many leaf characters per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • legend_pos – where to put the legend: ‘top’ (default), ‘bottom’ or None

  • persistence – will save the sampled dataframe to filename (with csv or pkl extension) or None

  • random_state – initial random seed for the sampling process, for reproducible research

  • remove_accents – bool if True strips accents (NA on dataframe)

  • reverse – bool if True look at words from right to left

  • rows_only – by default returns only the stem and leaf rows. If false, also return the matrix and dataframe

  • sort_by – default to ‘len’, can also be ‘alpha’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)
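
As a rough picture of how stem_order and leaf_order split a word, here is a minimal sketch that builds text rows from a list of words. It is an assumption-laden illustration, not the library's code: the function name word_stem_leaf_rows is hypothetical, and options such as leaf_skip, stop_words or sampling are left out.

```python
from collections import defaultdict


def word_stem_leaf_rows(words, stem_order=1, leaf_order=1):
    """Group words by leading characters (stem) and list the next characters (leaf).

    Illustrative sketch only; stem_text handles many more options.
    """
    rows = defaultdict(list)
    for word in words:
        word = word.lower()
        stem = word[:stem_order]
        leaf = word[stem_order:stem_order + leaf_order]
        if stem and leaf:  # skip words too short to have a leaf
            rows[stem].append(leaf)
    # One text row per stem: stem, vertical delimiter, then the sorted leaves
    return [
        f"{stem} | {' '.join(sorted(leaves))}"
        for stem, leaves in sorted(rows.items())
    ]


for line in word_stem_leaf_rows(["wolf", "word", "wood", "tree", "tall", "a"]):
    print(line)
```

Calling the same function with stem_order=2 separates ‘wol’, ‘wor’ and ‘woo’ into distinct leaves under the shared stem ‘wo’.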

stemgraphic.alpha.sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=1, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=1, stem_skip=0, stop_words=None, top=40)

sunburst.

Word sunburst charts are similar to pie or donut charts, but add some statistics in the middle of the chart, including the percentage of total words targeted for a given number of unique words (i.e. top 50 words, 48% coverage).

With stem-and-leaf, the first level of the sunburst represents the stem and the second level subdivides each stem by leaves.

Parameters
  • words – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to False

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • hole – bool if True (default) leave space in middle for statistics

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • median – bool if True (default) display an origin and a median mark

  • ngram – bool if True (default) display full n-gram as leaf label

  • random_state – initial random seed for the sampling process, for reproducible research

  • statistics – bool if True (default) displays statistics in center - hole has to be True

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns

matplotlib polar ax, dataframe
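
The coverage statistic mentioned in the description (the share of all words accounted for by the top N unique words) can be approximated as follows. This is a sketch under the assumption that coverage is a simple ratio of counts; top_coverage is a hypothetical name and the input is assumed to be non-empty.

```python
from collections import Counter


def top_coverage(words, top=40):
    """Return the `top` most frequent words and the percentage of all words they cover."""
    counts = Counter(words)
    most_common = counts.most_common(top)
    covered = sum(count for _, count in most_common)
    return most_common, 100 * covered / len(words)


words = "to be or not to be that is the question".split()
top_words, coverage = top_coverage(words, top=2)
print(top_words, f"{coverage:.0f}% coverage")
```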

stemgraphic.alpha.text_heatmap(df, caps=True, charset=None, column=None, compact=True, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=True, zoom=None)

text heatmap.

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. The difference with the heatmatrix is that by default it doesn’t show zero values and it presents a compact form by not showing whole empty rows either. Set compact=False to display those empty rows.

The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – wide format

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • outliers – for compatibility with other text plots

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zero_blank – replace zero digit with space

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

count matrix, scale
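
To make the row and column layout concrete, the following sketch counts values by stem (row) and leaf (column). It assumes non-negative integers and a single leaf digit; count_matrix and its leaf_unit parameter are hypothetical and do not reflect how the library derives its scale.

```python
def count_matrix(values, leaf_unit=1):
    """Count values by stem (row key) and leaf digit (column index 0-9).

    Sketch only: `leaf_unit` is the assumed value of one leaf digit.
    """
    matrix = {}
    for value in values:
        stem, leaf = divmod(int(value) // leaf_unit, 10)
        row = matrix.setdefault(stem, [0] * 10)  # one column per leaf digit
        row[leaf] += 1
    return matrix


matrix = count_matrix([427, 427, 421, 435, 470])
for stem in sorted(matrix):
    print(stem, matrix[stem])
```

Because rows are only created when a value falls in them, stems 44 to 46 are absent here, which mirrors the compact form described above.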

stemgraphic.alpha.word_freq_plot(src, alpha_only=False, ascending=False, asFigure=False, caps=False, display=None, interactive=True, kind='barh', random_state=None, sort_by='count', stop_words=None, top=100)

word frequency bar chart.

This function creates a classical word frequency bar chart.

Parameters
  • src – Either a filename including path, a url or a ready to process text in a dataframe or a tokenized format.

  • alpha_only – words only if True, words and numbers if False

  • ascending – stem sorted in ascending order, defaults to False

  • asFigure – if interactive, the function will return a plotly figure instead of a matplotlib ax

  • caps – keep capitalization (True, False)

  • display – if specified, sample that quantity of words

  • interactive – interactive graphic (True, False)

  • kind – horizontal bar chart (barh) - also ‘bar’, ‘area’, ‘hist’ and non interactive ‘kde’ and ‘pie’

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – default to ‘count’, can also be ‘alpha’

  • stop_words – a list of words to ignore

  • top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns

text as dataframe and plotly figure or matplotlib ax
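
The effect of a positive or negative top can be sketched with collections.Counter. This is an illustration only; select_words is a hypothetical helper, and treating top=0 as "keep everything" is an assumption, not documented behavior.

```python
from collections import Counter


def select_words(words, top=100):
    """Return (word, count) pairs: most frequent if top > 0, least frequent if top < 0."""
    ranked = Counter(words).most_common()  # sorted by descending count
    if top == 0:
        return ranked  # assumption: zero means no limit
    if top > 0:
        return ranked[:top]
    return ranked[top:]  # the last |top| entries are the least frequent


words = "a a a b b c".split()
print(select_words(words, top=2))
print(select_words(words, top=-1))
```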

stemgraphic.alpha.word_radar(word, comparisons, ascending=True, display=100, label=True, metric=None, min_distance=1, max_distance=None, random_state=None, sort_by='alpha')

word_radar.

Radar plot based on words. Currently, the only type of radar plot supported. See radar for more detail.

Parameters
  • word – string, the reference word that will be placed in the middle

  • comparisons – external file, list or string or dataframe of words

  • ascending – bool if the sort is ascending

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • label – bool if True display words centered at coordinate

  • metric – any metric function accepting two values and returning that metric in a range from 0 to x

  • min_distance – minimum distance based on metric to include a word for display

  • max_distance – maximum distance based on metric to include a word for display

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – default to ‘alpha’, can also be ‘len’

Returns

stemgraphic.alpha.word_scatter(src1, src2, src3=None, alpha=0.5, alpha_only=True, ascending=True, asFigure=False, ax=None, caps=False, compact=True, display=None, fig_xy=None, interactive=True, jitter=False, label=False, leaf_order=None, leaf_skip=0, log_scale=True, normalize=None, percentage=None, random_state=None, sort_by='alpha', stem_order=None, stem_skip=0, stop_words=None, whole=False)

word_scatter.

Scatter compares the word frequency of two sources, one on each axis. The Z value of each data point is the word or stem-and-leaf value, while the X axis reflects that word count in one source and the Y axis reflects the same word count in the other source. If a word is more common in the first source, it is displayed in one color, and if it is more common in the second source, it is displayed in a different color. Values that are the same for both sources are displayed in a third color (default colors are blue, black and pink). In interactive mode, hovering over a data point gives the precise counts on each axis along with the word itself, and filtering by category is done by clicking on the category in the legend.

Parameters
  • src1 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • src3 – string, filename, url, list, numpy array, time series, pandas or dask dataframe, optional

  • alpha – opacity of the dots, defaults to 50%

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to True

  • asFigure – return plot as plotly figure (for web applications)

  • ax – matplotlib axes instance, usually from a figure or other plot

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • fig_xy – tuple for matplotlib figsize, defaults to (20,20)

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • jitter – random noise added to help see multiple data points sharing the same coordinate

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • log_scale – bool if True (default) uses log scale axes

  • random_state – initial random seed for the sampling process, for reproducible research

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • whole – for normalized or percentage, use whole integer values (round)

Returns

matplotlib polar ax, dataframe

stemgraphic.alpha.word_sunburst(words, alpha_only=True, ascending=False, caps=False, compact=True, display=None, hole=True, label=True, leaf_order=None, leaf_skip=0, median=True, ngram=True, random_state=None, sort_by='alpha', statistics=True, stem_order=None, stem_skip=0, stop_words=None, top=40)

word_sunburst.

Word based sunburst. See sunburst for details

Parameters
  • words – string, filename, url, list, numpy array, time series, pandas or dask dataframe

  • alpha_only – only use stems from a-z alphabet (NA on dataframe)

  • ascending – stem sorted in ascending order, defaults to False

  • caps – bool, True to be case sensitive, defaults to False, recommended for comparisons (NA on dataframe)

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • hole – bool if True (default) leave space in middle for statistics

  • label – bool if True display words centered at coordinate

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • leaf_skip – how many leaf characters to skip, defaults to 0 - useful w/shared bigrams: ‘wol’,’wor’,’woo’

  • median – bool if True (default) display an origin and a median mark

  • ngram – bool if True (default) display full n-gram as leaf label

  • random_state – initial random seed for the sampling process, for reproducible research

  • statistics – bool if True (default) displays statistics in center - hole has to be True

  • sort_by – sort by ‘alpha’ (default) or ‘count’

  • stem_order – how many stem characters per data point to display, defaults to 1

  • stem_skip – how many stem characters to skip, defaults to 0 - useful to zoom in on a single root letter

  • stop_words – stop words to remove. None (default), list or builtin EN (English), ES (Spanish) or FR (French)

  • top – how many different words to count by order frequency. If negative, this will be the least frequent

Returns

graphic

Stemgraphic.graphic.

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. Stemgraphic.graphic is a module implementing a graphical stem-and-leaf plot function and a stem-and-leaf heatmap plot function for numerical data. It also provides a density_plot function.

stemgraphic.graphic.density_plot(df, var=None, ax=None, bins=None, box=None, density=True, density_fill=True, display=1000, fig_only=True, fit=None, hist=None, hues=None, hue_labels=None, jitter=None, kind=None, leaf_order=1, legend=True, limit_var=False, norm_hist=None, random_state=None, rug=None, scale=None, singular=True, strip=None, swarm=None, title=None, violin=None, x_min=0, x_max=None, y_axis_label=True)

density_plot.

Various density and distribution plots conveniently packaged into one function. Density plot normally forces tails at each end which might go beyond the data. To force min/max to be driven by the data, use limit_var. To specify min and max use x_min and x_max instead. Nota Bene: defaults to _decimation_ and _quantization_ mode.

See density_plot notebook for examples of the different combinations of plots.

Why this instead of seaborn:

Stem-and-leaf plots naturally quantize data. The amount of loss is based on scale and leaf_order and on the data itself. This function which wraps several seaborn distribution plots was added in order to compare various measures of density and distributions based on various levels of decimation (sampling, set through display) and of quantization (set through scale and leaf_order). Also, there is no option in seaborn to fill the area under the curve…

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • var – variable to plot, required if df is a dataframe

  • ax – matplotlib axes instance, usually from a figure or other plot

  • bins – Specification of hist bins, or None to use Freedman-Diaconis rule

  • box – bool, if True plots a box plot. Similar to using violin, use one or the other

  • density – bool, if True (default) plots a density plot

  • density_fill – bool, if True (default) fill the area under the density curve

  • display – maximum number of rows to use (1000 default) for calculations, forces sampling if < len(df)

  • fig_only – bool, if True (default) returns fig, ax, else returns fig, ax, max_peak, true_min, true_max

  • fit – object with fit method, returning a tuple that can be passed to a pdf method

  • hist – bool, if True plot a histogram

  • hues – optional, a categorical variable for multiple plots

  • hue_labels – optional, if using a column that is an object and/or categorical needing translation

  • jitter – for strip plots only, add jitter. strip + jitter is similar to using swarm, use one or the other

  • leaf_order – the order of magnitude of the leaf. The higher the order, the less quantization.

  • legend – bool, if True plots a legend

  • limit_var – use min / max from the data, not density plot

  • norm_hist – bool, if True histogram will be normed

  • random_state – initial random seed for the sampling process, for reproducible research

  • rug – bool, if True plot a rug plot

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • singular – force display of a density plot using a singular value, by simulating values of each side

  • strip – bool, if True displays a strip plot

  • swarm – swarm plot, similar to strip plot. use one or the other

  • title – if present, adds a title to the plot

  • violin – bool, if True plots a violin plot. Similar to using box, use one or the other

  • x_min – force X axis minimum value. See also limit_var

  • x_max – force X axis maximum value. See also limit_var

  • y_axis_label – bool, if True displays y axis ticks and label

Returns

see fig_only
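
The two kinds of loss discussed above, decimation (sampling) and quantization (reduced precision), can be illustrated separately. The sketch below is an assumption about the general idea, not the library's code: decimate and quantize are hypothetical helpers, and truncating to a multiple of step stands in for whatever scale and leaf_order actually imply.

```python
import random


def decimate(values, display=1000, random_state=None):
    """Sample at most `display` values, reproducibly when random_state is set."""
    values = list(values)
    if display is None or display >= len(values):
        return values
    return random.Random(random_state).sample(values, display)


def quantize(values, step):
    """Truncate each value down to a multiple of `step` (assumed leaf granularity)."""
    return [(value // step) * step for value in values]


data = list(range(1000))
sample = decimate(data, display=5, random_state=42)
print(sample)
print(quantize([123, 127, 131], step=10))
```

Comparing a density estimate on the raw data with one computed on the decimated and quantized values is the kind of comparison this function was added to support.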

stemgraphic.graphic.heatmap(df, annotate=False, asFigure=False, ax=None, caps=None, column=None, compact=False, display=900, flip_axes=False, interactive=True, leaf_order=1, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zoom=None)

heatmap.

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. Unlike the text heatmap, the graphical heatmap does not remove empty rows by default. To activate this feature, use compact=True.

The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • annotate – display annotations (Z) on heatmap

  • asFigure – return plot as plotly figure (for web applications)

  • ax – matplotlib axes instance, usually from a figure or other plot

  • caps – for compatibility

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – bool, default is False

  • interactive – if cufflinks is loaded, renders as interactive plot in notebook

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

count matrix, scale and matplotlib ax or figure if interactive and asFigure are True
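
The trim parameter removes a fraction of the data from each end. A minimal sketch of that idea follows; it assumes trimming by position in the sorted data, which may differ from how the library computes its cut-offs, and trim_ends is a hypothetical name.

```python
def trim_ends(values, trim=0.1):
    """Drop the given fraction of sorted values from each end of the data set."""
    if not 0 <= trim <= 0.5:
        raise ValueError("trim must be between 0 and 0.5")
    ordered = sorted(values)
    k = int(len(ordered) * trim)  # number of values removed at each end
    return ordered[k:len(ordered) - k]


print(trim_ends([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], trim=0.1))
```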

stemgraphic.graphic.leaf_scatter(df, alpha=0.1, asc=True, ax=None, break_on=None, column=None, compact=False, delimiter_color='C3', display=900, figure_only=True, flip_axes=False, font_kw=None, grid=False, interactive=True, leaf_color='k', leaf_jitter=False, leaf_order=1, legend_pos='best', mirror=False, persistence=None, primary_kw=None, random_state=None, scale=None, scaled_leaf=True, zoom=None)

leaf_scatter.

Scatter for numerical values based on leaf for X axis (scaled or not) and stem for Y axis.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • alpha – opacity of the dots, defaults to 10%

  • asc – stem (Y axis) sorted in ascending order, defaults to True

  • ax – matplotlib axes instance, usually from a figure or other plot

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • delimiter_color – color of the line between aggregate and stem and stem and leaf

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)

  • flip_axes – X becomes Y and Y becomes X

  • font_kw – keyword dictionary, font parameters

  • grid – show grid

  • interactive – if plotly is available, renders as interactive plot in notebook. False to render image.

  • leaf_color – font color of the leaves

  • leaf_jitter – add jitter to see density of each specific stem/leaf combo

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • mirror – mirror the plot in the axis of the delimiters

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • primary_kw – stem-and-leaf plot additional arguments

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • scaled_leaf – scale leafs, bool

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

stemgraphic.graphic.stem_graphic(df, df2=None, aggregation=True, alpha=0.1, asc=True, ax=None, ax2=None, bar_color='C0', bar_outline=None, break_on=None, column=None, combined=None, compact=False, delimiter_color='C3', display=900, figure_only=True, flip_axes=False, font_kw=None, leaf_color='k', leaf_order=1, legend_pos='best', median_alpha=0.25, median_color='C4', mirror=False, outliers=None, outliers_color='C3', persistence=None, primary_kw=None, random_state=None, scale=None, secondary=False, secondary_kw=None, secondary_plot=None, show_stem=True, title=None, trim=False, trim_blank=True, underline_color=None, unit='', zoom=None)

stem_graphic.

A graphical stem and leaf plot. stem_graphic provides horizontal, vertical or mirrored layouts, sorted in ascending or descending order, with sane default settings for the visuals, legend, median and outliers.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • df2 – string, filename, url, list, numpy array, time series, pandas or dask dataframe (optional), for back-to-back stem-and-leaf plots

  • aggregation – Boolean for sum, else specify function

  • alpha – opacity of the bars, median and outliers, defaults to 10%

  • asc – stem sorted in ascending order, defaults to True

  • ax – matplotlib axes instance, usually from a figure or other plot

  • ax2 – matplotlib axes instance, usually from a figure or other plot for back to back

  • bar_color – the fill color of the bar representing the leaves

  • bar_outline – the outline color of the bar representing the leaves

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • combined – list (specific subset to automatically include, say, for comparisons), or None

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • delimiter_color – color of the line between aggregate and stem and stem and leaf

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • figure_only – bool if True (default) returns matplotlib (fig,ax), False returns (fig,ax,df)

  • flip_axes – X becomes Y and Y becomes X

  • font_kw – keyword dictionary, font parameters

  • leaf_color – font color of the leaves

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • median_alpha – opacity of median and outliers, defaults to 25%

  • median_color – color of the box representing the median

  • mirror – mirror the plot in the axis of the delimiters

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • outliers_color – background color for the outlier boxes

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • primary_kw – stem-and-leaf plot additional arguments

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • secondary – bool if True, this is a secondary plot - mostly used for back-to-back plots

  • secondary_kw – any matplotlib keyword supported by .plot(), for the secondary plot

  • secondary_plot – One or more of ‘dot’, ‘kde’, ‘margin_kde’, ‘rug’ in a comma delimited string or None

  • show_stem – bool if True (default) displays the stems

  • title – string to display as title

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • underline_color – color of the horizontal line under the leaves, None for no display

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

matplotlib figure and axes instance
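
For readers new to the plot itself, the text form that stem_graphic renders graphically can be sketched in a few lines. This is an illustration for non-negative integers with the units digit as leaf; stem_leaf_rows is a hypothetical helper, and its asc and compact arguments only mimic the meaning of the parameters documented above.

```python
from collections import defaultdict


def stem_leaf_rows(values, asc=True, compact=False):
    """Build the text rows of a stem-and-leaf plot for non-negative integers.

    Sketch only: stems are the tens and above, leaves are the units digit.
    """
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(int(value), 10)
        leaves[stem].append(str(leaf))
    if not leaves:
        return []
    stems = range(min(leaves), max(leaves) + 1)
    if compact:
        stems = [s for s in stems if s in leaves]  # hide stems with no leaves
    stems = sorted(stems, reverse=not asc)
    return [
        f"{stem:>3} | {''.join(leaves.get(stem, []))}".rstrip()
        for stem in stems
    ]


for line in stem_leaf_rows([12, 15, 21, 21, 47]):
    print(line)
```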

helpers

helpers.py.

Helper functions for stemgraphic.

stemgraphic.helpers.APOSTROPHE = '’'

Typographical apostrophe - ex: I’m, l’arbre

stemgraphic.helpers.CHAR_FILTER = ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', '.', ';', ':', '!', '?', '|', '$', "'", '"', '…']

Characters to filter. Does a relatively good job on a majority of texts. ‘- ‘ and ‘–’ are included to skip quotes in many plays and dialogues in books, especially French.

stemgraphic.helpers.DOUBLE_QUOTE = '"'

Double straight quote mark

stemgraphic.helpers.EMPTY = b' '

empty

stemgraphic.helpers.LETTERS = 'abcdefghijklmnopqrstuvwxyz'

Default definition of standard letters. remove_accent has to be called explicitly for any of these letters to match their accented counterparts.

stemgraphic.helpers.NON_ALPHA = ['-', '+', '/', '[', ']', '_', '£', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', ';', "'", '"', '’', b' ', b'\xd6\xb1', '?', '¡', '¿', '«', '»', '“', '”', '-', '—']

List of non alpha characters. Temporary - I want to balance flexibility with convenience, but still looking at options.

stemgraphic.helpers.NO_PERIOD_FILTER = ['\t', '\n', '\\', '/', '`', '*', '_', '{', '}', '[', ']', '(', ')', '<', '>', '#', '=', '+', '- ', '–', ';', ':', '!', '?', '|', '$', "'", '"']

Similar purpose to CHAR_FILTER, but keeps the period. The last word of each sentence will end with a ‘.’, which is useful for manipulating the dataframe returned by the various visualizations and ngram_data, to break down frequencies by sentence instead of the full text or list.

stemgraphic.helpers.OVER = b'\xd6\xb1'

for typesetting overlap

stemgraphic.helpers.QUOTE = "'"

Straight quote mark - ex: ‘INCONCEIVABLE’

stemgraphic.helpers.alpha_mapping = {'bold': '𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙𝐚𝐛𝐜𝐝𝐞𝐟𝐠𝐡𝐢𝐣𝐤𝐥𝐦𝐧𝐨𝐩𝐪𝐫𝐬𝐭𝐮𝐯𝐰𝐱𝐲𝐳', 'boldsans': '𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇', 'circle': 'ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ', 'cursive': '𝒜𝐵𝒞𝒟𝐸𝐹𝒢𝐻𝐼𝒥𝒦𝐿𝑀𝒩𝒪𝒫𝒬𝑅𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵𝒶𝒷𝒸𝒹𝑒𝒻𝑔𝒽𝒾𝒿𝓀𝓁𝓂𝓃𝑜𝓅𝓆𝓇𝓈𝓉𝓊𝓋𝓌𝓍𝓎𝓏', 'default': 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', 'doublestruck': '𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ𝕒𝕓𝕔𝕕𝕖𝕗𝕘𝕙𝕚𝕛𝕜𝕝𝕞𝕟𝕠𝕡𝕢𝕣𝕤𝕥𝕦𝕧𝕨𝕩𝕪𝕫', 'italicbold': '𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁𝒂𝒃𝒄𝒅𝒆𝒇𝒈𝒉𝒊𝒋𝒌𝒍𝒎𝒏𝒐𝒑𝒒𝒓𝒔𝒕𝒖𝒗𝒘𝒙𝒚𝒛', 'italicboldsans': '𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕𝙖𝙗𝙘𝙙𝙚𝙛𝙜𝙝𝙞𝙟𝙠𝙡𝙢𝙣𝙤𝙥𝙦𝙧𝙨𝙩𝙪𝙫𝙬𝙭𝙮𝙯', 'medieval': '𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ𝔞𝔟𝔠𝔡𝔢𝔣𝔤𝔥𝔦𝔧𝔨𝔩𝔪𝔫𝔬𝔭𝔮𝔯𝔰𝔱𝔲𝔳𝔴𝔵𝔶𝔷', 'medievalbold': '𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅𝖆𝖇𝖈𝖉𝖊𝖋𝖌𝖍𝖎𝖏𝖐𝖑𝖒𝖓𝖔𝖕𝖖𝖗𝖘𝖙𝖚𝖛𝖜𝖝𝖞𝖟', 'square': '🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉', 'square_inverted': '🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉', 'typewriter': '𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉𝚊𝚋𝚌𝚍𝚎𝚏𝚐𝚑𝚒𝚓𝚔𝚕𝚖𝚗𝚘𝚙𝚚𝚛𝚜𝚝𝚞𝚟𝚠𝚡𝚢𝚣'}

Alphabet unicode mapping

stemgraphic.helpers.available_alpha_charsets()

available_alpha_charsets.

All supported unicode alphabet charsets, such as ‘doublestruck’ where A looks like: 𝔸

Returns

list of charset names

stemgraphic.helpers.available_charsets()

available_charsets.

All supported unicode digit charsets, such as ‘doublestruck’ where 0 looks like: 𝟘

Returns

list of charset names

stemgraphic.helpers.jitter(data, scale)

jitter.

Adds jitter to data, for display purpose

Parameters
  • data – numpy or pandas dataframe

  • scale

Returns

stemgraphic.helpers.key_calc(stem, leaf, scale)

key_calc.

Calculates a value from a stem, a leaf and a scale.

Parameters
  • stem

  • leaf

  • scale

Returns

calculated values
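
The relation between stem, leaf and scale can be sketched as follows, using the key shown in the plot legends of this documentation (17|1 => 17.1x1000 = 17100.0). key_value is a hypothetical re-implementation for illustration only; it assumes a single leaf digit and may not match what key_calc does internally.

```python
def key_value(stem, leaf, scale):
    """Hypothetical sketch: value represented by a stem and one leaf digit."""
    # Combine stem and leaf as an integer first (17 and 1 -> 171),
    # then apply the scale; this limits floating point rounding.
    return (stem * 10 + leaf) * scale / 10


value = key_value(17, 1, 1000)   # 17|1 at scale 1000
```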

stemgraphic.helpers.legend(ax, x, y, asc, flip_axes, mirror, stem, leaf, scale, delimiter_color, aggregation=True, cur_font=None, display=10, pos='best', unit='')

legend.

Builds a graphical legend for numerical stem-and-leaf plots.

Parameters
  • display

  • cur_font

  • ax

  • x

  • y

  • pos

  • asc

  • flip_axes

  • mirror

  • stem

  • leaf

  • scale

  • delimiter_color

  • unit

  • aggregation

stemgraphic.helpers.mapping = {'arabic': {'0': '٠', '1': '١', '2': '٢', '3': '٣', '4': '٤', '5': '٥', '6': '٦', '7': '٧', '8': '٨', '9': '٩'}, 'arabic_r': {'0': '٠', '1': '١', '2': '٢', '3': '٣', '4': '٤', '5': '٥', '6': '٦', '7': '٧', '8': '٨', '9': '٩'}, 'bold': {'0': '𝟎', '1': '𝟏', '2': '𝟐', '3': '𝟑', '4': '𝟒', '5': '𝟓', '6': '𝟔', '7': '𝟕', '8': '𝟖', '9': '𝟗'}, 'circled': {'0': '⓪', '1': '①', '2': '②', '3': '③', '4': '④', '5': '⑤', '6': '⑥', '7': '⑦', '8': '⑧', '9': '⑨'}, 'default': {'0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9'}, 'doublestruck': {'0': '𝟘', '1': '𝟙', '2': '𝟚', '3': '𝟛', '4': '𝟜', '5': '𝟝', '6': '𝟞', '7': '𝟟', '8': '𝟠', '9': '𝟡'}, 'fullwidth': {'0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9'}, 'gurmukhi': {'0': '੦', '1': '੧', '2': '੨', '3': '੩', '4': '੪', '5': '੫', '6': '੬', '7': '੭', '8': '੮', '9': '੯'}, 'mono': {'0': '𝟶', '1': '𝟷', '2': '𝟸', '3': '𝟹', '4': '𝟺', '5': '𝟻', '6': '𝟼', '7': '𝟽', '8': '𝟾', '9': '𝟿'}, 'nko': {'0': '߀', '1': '߁', '2': '߂', '3': '߃', '4': '߄', '5': '߅', '6': '߆', '7': '߇', '8': '߈', '9': '߉'}, 'rod': {'0': '◯', '1': '𝍩', '2': '𝍪', '3': '𝍫', '4': '𝍬', '5': '𝍭', '6': '𝍮', '7': '𝍯', '8': '𝍰', '9': '𝍱'}, 'roman': {'0': '.', '1': 'Ⅰ', '2': 'Ⅱ', '3': 'Ⅲ', '4': 'Ⅳ', '5': 'Ⅴ', '6': 'Ⅵ', '7': 'Ⅶ', '8': 'Ⅷ', '9': 'Ⅸ'}, 'sans': {'0': '𝟢', '1': '𝟣', '2': '𝟤', '3': '𝟥', '4': '𝟦', '5': '𝟧', '6': '𝟨', '7': '𝟩', '8': '𝟪', '9': '𝟫'}, 'sansbold': {'0': '𝟬', '1': '𝟭', '2': '𝟮', '3': '𝟯', '4': '𝟰', '5': '𝟱', '6': '𝟲', '7': '𝟳', '8': '𝟴', '9': '𝟵'}, 'square': {'0': '🞌', '1': '🞍', '2': '■', '3': '⬛', '4': '🞓', '5': '🞒', '6': '🞑', '7': '🞐', '8': '🞏', '9': '🞎'}, 'subscript': {'0': '₀', '1': '₁', '2': '₂', '3': '₃', '4': '₄', '5': '₅', '6': '₆', '7': '₇', '8': '₈', '9': '₉'}, 'tamil': {'0': '௦', '1': '௧', '2': '௨', '3': '௩', '4': '௪', '5': '௫', '6': '௬', '7': '௭', '8': '௮', '9': '௯'}}

Charset unicode digit mappings

stemgraphic.helpers.min_max_count(x, column=0)

min_max_count.

Handles min, max and count. This works on numpy, lists, pandas and dask dataframes.

Parameters
  • x – list, numpy array, series, pandas or dask dataframe

  • column – future use

Returns

min, max and count

stemgraphic.helpers.na_count(x, column=0)

na_count.

Counts the missing (numpy nan) values. This works on numpy, lists, pandas and dask dataframes.

Parameters
  • x – list, numpy array, series, pandas or dask dataframe

  • column – future use

Returns

all numpy nan count
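
For plain Python lists, the two summaries above could be approximated as in the sketch below. Both functions are hypothetical, list-only stand-ins; whether the real helpers skip NaN values when computing min, max and count is an assumption made here.

```python
import math


def min_max_count_list(values):
    """Hypothetical list-only version: ignore NaN, return (min, max, count)."""
    clean = [v for v in values if not math.isnan(v)]
    return min(clean), max(clean), len(clean)


def na_count_list(values):
    """Hypothetical list-only version: number of NaN entries."""
    return sum(1 for v in values if math.isnan(v))


data = [3.2, float('nan'), 0.5, 7.0, float('nan')]
summary = min_max_count_list(data)
missing = na_count_list(data)
```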

stemgraphic.helpers.npy_load(path)

npy_load.

load numpy array (npy) file from disk.

Parameters

path – path to npy file

Returns

numpy array

stemgraphic.helpers.npy_save(path, array)

npy_save.

saves numpy array to npy file on disk.

Parameters
  • path – path where to save npy file

  • array – numpy array

Returns

path

stemgraphic.helpers.percentile(data, alpha)

percentile.

Parameters
  • data – list, numpy array, time series or pandas dataframe

  • alpha – proportion, between 0 and 0.5, to select on each side of the distribution

Returns

the actual value at that percentile
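
One way to read the alpha parameter is sketched below: the same proportion is cut from each side of the sorted data and the values found at those positions are returned. percentile_pair is a hypothetical function using a nearest-rank rule; the interpolation used by the real helper is not documented here, and returning both sides at once is a choice made for this example.

```python
def percentile_pair(data, alpha):
    """Hypothetical nearest-rank sketch: values cutting off alpha on each side."""
    if not 0 <= alpha <= 0.5:
        raise ValueError("alpha must be between 0 and 0.5")
    ordered = sorted(data)
    last = len(ordered) - 1
    low_index = round(alpha * last)
    # Mirror the index to get the matching position on the upper side.
    return ordered[low_index], ordered[last - low_index]


low, high = percentile_pair(range(101), 0.1)
```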

stemgraphic.helpers.pkl_load(path)

pkl_load.

load matrix or dataframe pickle (pkl) file from disk.

Parameters

path – path to pickle file

Returns

matrix or dataframe

stemgraphic.helpers.pkl_save(path, array)

pkl_save.

saves matrix or dataframe to pkl file on disk.

Parameters
  • path – path where to save pickle file

  • array – matrix (array) or dataframe

Returns

path
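
The save and load pair above follows the usual pickle round trip. The sketch below reproduces that pattern with the standard library only; save_pickle and load_pickle are hypothetical names, and the real helpers may handle dataframes differently.

```python
import os
import pickle
import tempfile


def save_pickle(path, obj):
    """Hypothetical sketch: write obj to path and return the path."""
    with open(path, 'wb') as handle:
        pickle.dump(obj, handle)
    return path


def load_pickle(path):
    """Hypothetical sketch: read back whatever was pickled at path."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


matrix = [[1, 0, 2], [0, 3, 0]]
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, 'matrix.pkl')
    saved_path = save_pickle(target, matrix)
    restored = load_pickle(saved_path)
```

As with any pickle file, only load files from a source you trust.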

stemgraphic.helpers.savefig(plt)

savefig.

Displays a matplotlib figure in the console terminal. This requires pysixel to be installed with pip. It also requires a terminal with Sixel graphics support, such as a DEC terminal with graphics support, Linux xterm (started with -ti 340) or MLTerm (multilingual terminal, available on Windows, Linux, etc.).

This is called by the command line stem tool when using -o stdout and can also be used in an ipython session.

Parameters

plt – matplotlib pyplot

Returns

stemgraphic.helpers.square_scale()

square_scale.

Ordered key for 0-9 mapping to squares from tiny filled square to large hollow square.

Returns

scale from 0 to 9

stemgraphic.helpers.stack_columns(row)

stack_columns.

stack multiple columns into a single stacked value

Parameters

row – a row of letters

Returns

stacked string

stemgraphic.helpers.translate_alpha_representation(text, charset=None)

translate_alpha_representation.

Replace the default (ASCII type) charset in a string with the equivalent in a different unicode charset.

Parameters
  • text – input string

  • charset – unicode character set as defined by available_alpha_charsets

Returns

translated string
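
The letter substitution can be pictured as a character-for-character translation table between the ‘default’ alphabet and a target alphabet. The sketch below rebuilds the ‘bold’ alphabet from its Unicode code points rather than importing alpha_mapping, and to_alpha_charset is a hypothetical name used only for this illustration.

```python
import string

# Same ordering as the 'default' entry: A-Z followed by a-z.
DEFAULT = string.ascii_uppercase + string.ascii_lowercase
# Mathematical Bold letters are contiguous in Unicode:
# capitals start at U+1D400, small letters at U+1D41A.
BOLD = (''.join(chr(0x1D400 + i) for i in range(26))
        + ''.join(chr(0x1D41A + i) for i in range(26)))


def to_alpha_charset(text, target=BOLD):
    """Hypothetical sketch: swap ASCII letters for their target glyphs."""
    table = str.maketrans(DEFAULT, target)
    return text.translate(table)   # other characters pass through unchanged


bold_word = to_alpha_charset("Stem-and-leaf")
```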

stemgraphic.helpers.translate_representation(text, charset=None, index=None, zero_blank=None)

translate_representation.

Replace the default (ASCII type) digit glyphs in a string with the equivalent in a different unicode charset.

Parameters
  • text – input string

  • charset – unicode character set as defined by available_charsets

  • index – correspond to which item in a list we are looking at, for zero_blank

  • zero_blank – will blank 0 if True, unless we are looking at header (row index < 2)

Returns

translated string
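
A minimal sketch of the digit substitution, including the zero_blank behaviour described above, is given below. It uses the ‘subscript’ digits, rebuilt from their code points, and a hypothetical function to_digit_charset. The exact rule tying index to the header rows is an assumption: here zeros are left alone whenever index is below 2.

```python
# Subscript digits are contiguous in Unicode, U+2080 to U+2089.
SUBSCRIPT = {str(d): chr(0x2080 + d) for d in range(10)}


def to_digit_charset(text, digits=SUBSCRIPT, index=None, zero_blank=False):
    """Hypothetical sketch: swap ASCII digits, optionally blanking zeros."""
    # Assumption: rows with index < 2 are headers and keep their zeros.
    blank_zero = zero_blank and (index is None or index >= 2)
    out = []
    for ch in text:
        if ch == '0' and blank_zero:
            out.append(' ')
        else:
            out.append(digits.get(ch, ch))
    return ''.join(out)


plain = to_digit_charset("407")
blanked = to_digit_charset("407", index=5, zero_blank=True)
header = to_digit_charset("407", index=0, zero_blank=True)
```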

num

stemgraphic.num.

BRAND NEW in V.0.5.0!

Stemgraphic provides a complete set of functions to handle everything related to stem-and-leaf plots. num is a module of the stemgraphic package to handle numerical variables.

This module structure is new as of v.0.5.0 to match the addition of stemgraphic.alpha.

The shorthand from previous versions of stemgraphic is still available and defaults to the numerical functions:

from stemgraphic import stem_graphic, stem_text, heatmap

stopwords

stopwords.

This module includes 4 lists of stop words: EN (main English list), ALT_EN (alternate English list), FR (French) and ES (Spanish).

A PT (Portuguese) list is in the works.

stemgraphic.stopwords.ALT_EN = ['a', 'am', 'an', 'and', 'are', 'as', 'at', 'been', 'for', 'from', 'in', 'is', 'of', 'on', 'or', 'out', 'so', 'such', 'that', 'the', 'these', 'this', 'those', 'to', 'upon', 'was', 'were']

ALT_ENglish stopwords
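
A typical use of such a list is to drop the stop words from a tokenized text before counting. The sketch below does this with a copy of ALT_EN; remove_stop_words is a hypothetical helper, not part of the module.

```python
# Copy of the ALT_EN list documented above (not imported from stemgraphic).
ALT_EN = ['a', 'am', 'an', 'and', 'are', 'as', 'at', 'been', 'for', 'from',
          'in', 'is', 'of', 'on', 'or', 'out', 'so', 'such', 'that', 'the',
          'these', 'this', 'those', 'to', 'upon', 'was', 'were']


def remove_stop_words(words, stop_words):
    """Hypothetical helper: keep the words that are not stop words."""
    stops = set(stop_words)   # set lookup is faster than scanning the list
    return [w for w in words if w.lower() not in stops]


kept = remove_stop_words("The quality of mercy is not strained".split(), ALT_EN)
```

Because ALT_EN is short, a word such as ‘not’ is kept; the longer EN list would remove it.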

stemgraphic.stopwords.EN = ['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'inside', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'rather', 're', 'same', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 
'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'until', 'up', 'upon', 'us', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves']

ENglish stop words

stemgraphic.stopwords.ES = ['a', 'alguna', 'algunas', 'alguno', 'algunos', 'algún', 'ambas', 'ambos', 'ampleamos', 'ante', 'antes', 'aquel', 'aquellas', 'aquellos', 'aqui', 'arriba', 'atras', 'aun', 'bajo', 'bastante', 'bien', 'cada', 'cierta', 'ciertas', 'cierto', 'ciertos', 'como', 'con', 'conseguimos', 'conseguir', 'consigo', 'consigue', 'consiguen', 'consigues', 'cual', 'cuando', 'dentro', 'desde', 'donde', 'dos', 'el', 'ella', 'ellas', 'ellos', 'empleais', 'emplean', 'emplear', 'empleas', 'empleo', 'en', 'encima', 'entonces', 'entre', 'era', 'eramos', 'eran', 'eras', 'eres', 'es', 'esta', 'estaba', 'estado', 'estais', 'estamos', 'estan', 'estoy', 'fin', 'fue', 'fueron', 'fui', 'fuimos', 'ha', 'hace', 'haceis', 'hacemos', 'hacen', 'hacer', 'haces', 'hago', 'incluso', 'intenta', 'intentais', 'intentamos', 'intentan', 'intentar', 'intentas', 'intento', 'ir', 'la', 'largo', 'las', 'lo', 'los', 'mientras', 'mio', 'modo', 'muchos', 'muy', 'nos', 'nosotros', 'otro', 'para', 'pero', 'podeis', 'podemos', 'poder', 'podria', 'podriais', 'podriamos', 'podrian', 'podrias', 'por', 'porque', 'primero', 'puede', 'pueden', 'puedo', 'quien', 'sabe', 'sabeis', 'sabemos', 'saben', 'saber', 'sabes', 'ser', 'sea', 'sean', 'si', 'siendo', 'sin', 'sobre', 'sois', 'solamente', 'solo', 'somos', 'soy', 'su', 'sus', 'también', 'teneis', 'tenemos', 'tener', 'tengo', 'tiempo', 'tiene', 'tienen', 'todo', 'trabaja', 'trabajais', 'trabajamos', 'trabajan', 'trabajar', 'trabajas', 'trabajo', 'tras', 'tu', 'tuyo', 'ultimo', 'un', 'una', 'unas', 'uno', 'unos', 'usa', 'usais', 'usamos', 'usan', 'usar', 'usas', 'uso', 'va', 'vais', 'valor', 'vamos', 'van', 'vaya', 'verdad', 'verdadera', 'verdadero', 'vosotras', 'vosotros', 'voy', 'yo']

Spanish (ESpanol) stop words

stemgraphic.stopwords.FR = ['a', 'alors', 'au', 'aucuns', 'aussi', 'autre', 'autres', 'aux', 'avant', 'avec', 'avoir', 'bon', 'car', 'ce', 'cela', 'ces', 'ceux', 'chacun', 'chacune', 'chaque', 'ci', 'comme', 'comment', 'dans', 'de', 'dedans', 'dehors', 'depuis', 'derrière', 'des', 'dessus', 'devant', 'devrait', 'doit', 'donc', 'dos', 'du', 'début', 'elle', 'elles', 'en', 'encore', 'essai', 'est', 'et', 'eu', 'fait', 'faites', 'fois', 'font', 'hors', 'ici', 'il', 'ils', 'je', 'juste', 'la', 'le', 'les', 'leur', 'là', 'ma', 'maintenant', 'mais', 'mes', 'mine', 'moins', 'mon', 'mot', 'même', 'ni', 'nommés', 'notre', 'nous', 'ou', 'où', 'par', 'parce', 'pas', 'peu', 'peut', 'plupart', 'pour', 'pourquoi', 'quand', 'que', 'quel', 'quelle', 'quelles', 'quelque', 'quelques', 'quels', 'qui', 'sa', 'sans', 'ses', 'seulement', 'si', 'sien', 'son', 'sont', 'sous', 'soyez', 'sujet', 'sur', 'ta', 'tandis', 'tellement', 'tels', 'tes', 'ton', 'tous', 'tout', 'trop', 'très', 'tu', 'voient', 'vont', 'votre', 'vous', 'vu', 'ça', 'étaient', 'état', 'étions', 'été', 'être']

French (FRancais) stop words

stemgraphic.stopwords.VOCALES = ['a', 'á', 'e', 'é', 'i', 'í', 'o', 'ó', 'u', 'ú', 'ü']

Spanish vowels

stemgraphic.stopwords.VOWELS = ['a', 'e', 'i', 'o', 'u']

English vowels

stemgraphic.stopwords.VOYELLES = ['a', 'â', 'ä', 'à', 'æ', 'e', 'ê', 'ë', 'é', 'è', 'i', 'î', 'ï', 'o', 'ô', 'ö', 'œ', 'u', 'û', 'ü', 'ù', 'y']

French vowels

text

stemgraphic.text. visualizations for text.

stemgraphic.text.heatmap(df, caps=None, charset=None, column=None, compact=True, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=True, zoom=None)

heatmap.

The heatmap displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale. The difference with the heatmatrix is that by default it doesn’t show zero values, and it presents a compact form by not showing whole empty rows either. Set compact=False to display those empty rows.

The heatmap is useful to look at patterns. For distribution, stem_graphic is better suited.

Example:

heatmap(diamonds.carat, charset='bold');

Output:

Stem-and-leaf heatmap (30.1 x 0.1 )
       𝟎   𝟏   𝟐   𝟑   𝟒   𝟓  𝟔   𝟕   𝟖  𝟗
stem
  𝟐                𝟔   𝟒   𝟏  𝟑  𝟏𝟎   𝟑
  𝟑   𝟒𝟖  𝟒𝟔  𝟑𝟑  𝟐𝟔  𝟐𝟏  𝟏𝟐  𝟖   𝟔  𝟏𝟏  𝟔
  𝟒   𝟏𝟒  𝟐𝟎  𝟏𝟐  𝟏𝟎   𝟐   𝟒  𝟓   𝟑
  𝟓   𝟏𝟗  𝟏𝟖  𝟏𝟐  𝟏𝟒   𝟗   𝟔  𝟔   𝟓   𝟑  𝟐
  𝟔    𝟕       𝟑       𝟏   𝟒      𝟏
  𝟕   𝟑𝟑  𝟏𝟒  𝟏𝟓  𝟏𝟏   𝟔   𝟔  𝟑   𝟑   𝟔  𝟒
  𝟖    𝟒   𝟓           𝟏          𝟐
  𝟗   𝟑𝟐   𝟕   𝟑   𝟏   𝟏   𝟏  𝟏   𝟑
 𝟏𝟎   𝟐𝟓  𝟑𝟒  𝟏𝟐  𝟏𝟑   𝟕   𝟕  𝟓   𝟑   𝟏  𝟓
 𝟏𝟏    𝟖   𝟏   𝟔   𝟓   𝟓   𝟒  𝟑       𝟑  𝟐
 𝟏𝟐   𝟏𝟒   𝟓   𝟓   𝟓   𝟔   𝟏  𝟒   𝟑   𝟏  𝟏
 𝟏𝟑    𝟏   𝟑   𝟐   𝟐   𝟏   𝟐  𝟏       𝟏
 𝟏𝟒        𝟏                      𝟏
 𝟏𝟓    𝟗  𝟏𝟐   𝟕   𝟔   𝟓   𝟓  𝟏       𝟏  𝟐
 𝟏𝟔    𝟏               𝟏      𝟏
 𝟏𝟕    𝟑   𝟒   𝟐   𝟑   𝟐   𝟏             𝟏
 𝟏𝟖                𝟏                  𝟏
 𝟏𝟗                    𝟏          𝟏
 𝟐𝟎    𝟔   𝟖   𝟏   𝟏   𝟑      𝟏
 𝟐𝟏    𝟐                   𝟏      𝟏   𝟏  𝟏
 𝟐𝟐    𝟐                      𝟏          𝟏
 𝟐𝟑    𝟏                                 𝟐
 𝟑𝟎        𝟏
Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • charset – valid unicode digit character set, as returned by helpers.available_charsets()

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to True

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – wide format

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • outliers – for compatibility with other text plots

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zero_blank – replace zero digit with space

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

count matrix, scale
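
The counting behind the heatmap, where row ‘42’ and column ‘7’ hold the number of values starting with ‘427’, can be sketched as below. stem_leaf_counts is a hypothetical function; it assumes non-negative integer values and an integer scale, and it leaves out sampling and trimming.

```python
from collections import Counter


def stem_leaf_counts(values, scale):
    """Hypothetical sketch: count values per (stem, leaf) pair."""
    counts = Counter()
    for v in values:
        # Drop the digits below the scale, then split off the last digit.
        stem, leaf = divmod(v // scale, 10)
        counts[(stem, leaf)] += 1
    return counts


counts = stem_leaf_counts([4270, 4275, 4281, 390], scale=10)
```

Only the pairs that occur are stored, which matches the compact display where zero counts are left blank.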

stemgraphic.text.heatmatrix(df, caps=None, charset=None, column=None, compact=False, display=900, flip_axes=False, leaf_order=1, outliers=None, persistence=None, random_state=None, scale=None, trim=False, trim_blank=True, unit='', zero_blank=False, zoom=None)

heatmatrix.

The heatmatrix displays the same underlying data as the stem-and-leaf plot, but instead of stacking the leaves, they are left in their respective columns. Row ‘42’ and Column ‘7’ would have the count of numbers starting with ‘427’ of the given scale.

The heatmatrix is useful to look at patterns. For distribution, stem_graphic is better suited.

Example:

heatmatrix(diamonds.carat, charset='bold');

Output:

Stem-and-leaf heatmap (24.0 x 0.1 )
       𝟎   𝟏   𝟐   𝟑   𝟒   𝟓  𝟔   𝟕   𝟖  𝟗
stem
  𝟐    𝟏   𝟎   𝟏   𝟓   𝟒   𝟏  𝟓   𝟎   𝟒  𝟐
  𝟑   𝟒𝟓  𝟒𝟎  𝟐𝟔  𝟏𝟕  𝟏𝟒   𝟕  𝟖   𝟒  𝟏𝟐  𝟕
  𝟒   𝟑𝟎  𝟑𝟏  𝟏𝟖   𝟖   𝟑   𝟐  𝟑   𝟎   𝟏  𝟏
  𝟓   𝟐𝟑  𝟐𝟎   𝟖   𝟓   𝟖  𝟏𝟑  𝟖   𝟔   𝟓  𝟕
  𝟔    𝟔   𝟒   𝟐   𝟎   𝟑   𝟎  𝟎   𝟎   𝟎  𝟎
  𝟕   𝟏𝟖  𝟐𝟐  𝟏𝟐   𝟕   𝟕   𝟖  𝟒   𝟑   𝟒  𝟐
  𝟖    𝟓   𝟒   𝟑   𝟓   𝟎   𝟏  𝟓   𝟏   𝟎  𝟎
  𝟗   𝟏𝟗  𝟏𝟒   𝟐   𝟐   𝟎   𝟎  𝟎   𝟎   𝟎  𝟎
 𝟏𝟎   𝟐𝟖  𝟑𝟔  𝟏𝟎   𝟖   𝟗  𝟏𝟎  𝟏  𝟏𝟒   𝟒  𝟓
 𝟏𝟏    𝟕   𝟒   𝟒   𝟑   𝟒   𝟎  𝟔   𝟏   𝟏  𝟏
 𝟏𝟐   𝟏𝟏   𝟗   𝟗   𝟒   𝟕   𝟐  𝟏   𝟐   𝟐  𝟏
 𝟏𝟑    𝟔   𝟏   𝟒   𝟐   𝟐   𝟎  𝟎   𝟎   𝟏  𝟎
 𝟏𝟒    𝟎   𝟎   𝟎   𝟎   𝟏   𝟎  𝟎   𝟎   𝟎  𝟎
 𝟏𝟓   𝟏𝟎  𝟏𝟔   𝟒   𝟑   𝟑   𝟓  𝟐   𝟑   𝟐  𝟏
 𝟏𝟔    𝟐   𝟏   𝟏   𝟏   𝟎   𝟏  𝟎   𝟏   𝟎  𝟎
 𝟏𝟕    𝟔   𝟓   𝟎   𝟏   𝟏   𝟏  𝟎   𝟎   𝟏  𝟏
 𝟏𝟖    𝟏   𝟎   𝟏   𝟎   𝟎   𝟎  𝟎   𝟎   𝟎  𝟎
 𝟏𝟗    𝟏   𝟏   𝟎   𝟎   𝟎   𝟎  𝟎   𝟎   𝟎  𝟎
 𝟐𝟎    𝟑   𝟗   𝟒   𝟑   𝟐   𝟐  𝟏   𝟏   𝟏  𝟎
 𝟐𝟏    𝟎   𝟏   𝟎   𝟎   𝟏   𝟎  𝟎   𝟏   𝟎  𝟎
 𝟐𝟐    𝟎   𝟐   𝟏   𝟎   𝟎   𝟏  𝟎   𝟎   𝟎  𝟎
 𝟐𝟑    𝟎   𝟎   𝟎   𝟎   𝟎   𝟎  𝟎   𝟎   𝟎  𝟎
 𝟐𝟒    𝟏   𝟎   𝟏   𝟎   𝟐   𝟎  𝟎   𝟎   𝟎  𝟎
Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – wide format

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • outliers – for compatibility with other text plots

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • trim_blank – remove the blank between the delimiter and the first leaf, defaults to True

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zero_blank – replace zero digit with space

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

count matrix, scale
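
In contrast with the compact heatmap, the heatmatrix shows every stem row and every leaf column, with explicit zeros. Starting from sparse (stem, leaf) counts, that expansion could look like the sketch below. dense_matrix is a hypothetical function and the counts are made-up illustration values.

```python
def dense_matrix(counts):
    """Hypothetical sketch: expand {(stem, leaf): count} into full rows."""
    stems = [stem for stem, _ in counts]
    rows = {}
    # Include every stem between the smallest and largest, even empty ones.
    for stem in range(min(stems), max(stems) + 1):
        rows[stem] = [counts.get((stem, leaf), 0) for leaf in range(10)]
    return rows


sparse = {(2, 3): 5, (2, 4): 4, (4, 0): 30}   # illustration values only
matrix = dense_matrix(sparse)
```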

stemgraphic.text.quantize(df, column=None, display=750, leaf_order=1, random_state=None, scale=None, trim=None, zoom=None)

quantize.

Converts a series into stem-and-leaf and back into decimal. This has the potential effect of decimating (or truncating) values in a lossy way.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

Returns

decimated df
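
The lossy round trip can be illustrated with integers: once a value is reduced to a stem and one leaf digit at a given scale, the digits below that scale cannot be recovered. quantize_values is a hypothetical sketch that assumes non-negative integers, an integer scale and leaf_order=1.

```python
def quantize_values(values, scale):
    """Hypothetical sketch: truncate each value to a multiple of scale."""
    # v // scale keeps the stem and leaf digits; multiplying back
    # restores the magnitude with the lower digits set to zero.
    return [(v // scale) * scale for v in values]


quantized = quantize_values([4273, 4279, 391], scale=10)
```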

stemgraphic.text.stem_data(x, break_on=None, column=None, compact=False, display=300, full=False, leaf_order=1, omin=None, omax=None, outliers=False, persistence=None, random_state=None, scale=None, total_rows=None, trim=False, zoom=None)

stem_data.

Returns scale factor, key label and list of rows.

Parameters
  • x – list, numpy array, time series, pandas or dask dataframe

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • full – bool, if True returns all interim results including sorted data and stems

  • leaf_order – how many leaf digits per data point to display, defaults to 1

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • omin – float, if already calculated, helps speed up the process for large data sets

  • omax – float, if already calculated, helps speed up the process for large data sets

  • persistence – persist sampled dataframe

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic)

  • total_rows – int, if already calculated, helps speed up the process for large data sets

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)
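
The interaction between display and random_state can be pictured as below: when the data holds more points than display, a random sample of that size is drawn, and a fixed seed makes the draw repeatable. sample_for_display is a hypothetical function built on the standard random module; the library's own sampling may differ.

```python
import random


def sample_for_display(values, display=300, random_state=None):
    """Hypothetical sketch: sample only when there is too much data."""
    values = list(values)
    if display is None or display >= len(values):
        return values
    rng = random.Random(random_state)   # seeded generator for reproducibility
    return rng.sample(values, display)


population = list(range(10_000))
first = sample_for_display(population, display=300, random_state=42)
second = sample_for_display(population, display=300, random_state=42)
```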

stemgraphic.text.stem_dot(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, leaf_order=1, legend_pos='best', marker=None, outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)

stem_dot.

stem_dot builds a stem-and-leaf plot with dots instead of bars.

Example:

stem_dot(diamonds.price)

Output:

326
    ¡
  0 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  1 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  2 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  3 |●●●●●●●●●●●●●●●●●●●●●●●●●●●●
  4 |●●●●●●●●●●●●●●●●●●●●●●●●●●●
  5 |●●●●●●●●●●●●●●●
  6 |●●●●●●●●●
  7 |●●●
  8 |●●●●●
  9 |●●●●●●●
 10 |●●
 11 |●●●●
 12 |●●●●●
 13 |●●●●●
 14 |●●
 15 |●●●
 16 |●●
 17 |●●●●
    !
18823
Scale:
17|1 => 17.1x1000 = 17100.0
Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • asc – stem sorted in ascending order, defaults to True

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – bool, default is False

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • marker – char, symbol to use as marker. ‘●’ is default. Suggested alternatives: ‘*’, ‘+’, ‘x’, ‘.’, ‘o’

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • symmetric – if True, dot plot will be distributed on both sides of a center line

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)
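
The dot display boils down to repeating the marker once per leaf on each stem row. The sketch below builds such rows from stem counts; dot_rows is a hypothetical function and ignores outliers, legend and the symmetric layout.

```python
def dot_rows(stem_counts, marker='●'):
    """Hypothetical sketch: one text row per stem, one marker per leaf."""
    width = len(str(max(stem_counts)))   # widest stem label, for alignment
    return [f"{stem:>{width}} |{marker * count}"
            for stem, count in sorted(stem_counts.items())]


rows = dot_rows({0: 5, 1: 3, 10: 2})
starred = dot_rows({0: 2}, marker='*')
```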

stemgraphic.text.stem_hist(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, leaf_order=1, legend_pos='best', marker=None, outliers=True, persistence=None, random_state=None, scale=None, shade=None, symmetric=False, trim=False, unit='', zoom=None)

stem_hist.

stem_hist builds a histogram matching the stem-and-leaf plot, with the numbers hidden, as shown on the cover of the companion brochure.

Example:

stem_hist(diamonds.price, shade='medium')

Output:

  0 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  1 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  2 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  3 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  4 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  5 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  6 |▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
  7 |▒▒▒▒▒▒▒▒▒▒▒▒
  8 |▒▒▒▒
  9 |▒▒▒▒
 10 |▒▒▒▒▒▒▒
 11 |▒▒
 12 |▒▒▒▒▒▒▒
 13 |▒▒▒▒
 14 |
 15 |
 16 |
 17 |
 18 |▒
Scale:
18|4 => 18.4x1000 = 18400.0
Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • asc – stem sorted in ascending order, defaults to True

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – bool, default is False

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • marker – char, symbol to use as marker. ‘O’ is default. Suggested alternatives: ‘*’, ‘+’, ‘x’, ‘.’, ‘o’

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • shade – shade of marker: ‘none’,’light’,’medium’,’dark’,’full’

  • symmetric – if True, dot plot will be distributed on both sides of a center line

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)

stemgraphic.text.stem_tally(df, asc=True, break_on=None, column=None, compact=False, display=300, flip_axes=False, legend_pos='best', outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)

stem_tally.

Stem-and-leaf plot using tally marks for leaf count, up to 5 per block.

Example:

stem_tally(diamonds.price)

Output:

326
    ¡
  0 |卌卌卌卌卌卌卌卌卌卌卌卌卌卌卌𝍩
  1 |卌卌卌卌卌卌卌卌卌卌卌卌
  2 |卌卌卌卌卌卌𝍫
  3 |卌卌卌卌𝍩
  4 |卌卌卌卌卌𝍫
  5 |卌卌卌卌卌𝍩
  6 |卌卌卌𝍩
  7 |卌卌卌𝍩
  8 |卌卌𝍩
  9 |𝍫
 10 |𝍪
 11 |𝍬
 12 |卌𝍩
 13 |𝍬
 14 |𝍬
 15 |𝍫
 16 |𝍪
 17 |
 18 |𝍫
    !
18823
Key:
18|3 => 18.3x1000 = 18300.0
Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • asc – stem sorted in ascending order, defaults to True

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – bool, default is False

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • symmetric – if True, dot plot will be distributed on both sides of a center line

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)
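
The tally encoding seen in the example output, a block of five followed by a rod numeral for the remainder, can be sketched as follows. tally is a hypothetical function; the glyphs are the ones appearing in the example above.

```python
FIVE = '卌'
# Rod numerals for remainders 1 to 4, as seen in the example output.
REMAINDER = ['', '𝍩', '𝍪', '𝍫', '𝍬']


def tally(count):
    """Hypothetical sketch: blocks of five, then a mark for what is left."""
    blocks, rest = divmod(count, 5)
    return FIVE * blocks + REMAINDER[rest]


seven = tally(7)
ten = tally(10)
```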

stemgraphic.text.stem_text(df, asc=True, break_on=None, charset=None, column=None, compact=False, display=300, flip_axes=False, legend_pos='best', outliers=True, persistence=None, random_state=None, scale=None, symmetric=False, trim=False, unit='', zoom=None)

stem_text.

Classic text based stem-and-leaf plot.

Parameters
  • df – list, numpy array, time series, pandas or dask dataframe

  • asc – stem sorted in ascending order, defaults to True

  • break_on – force a break of the leaves at x in (5, 10), defaults to 10

  • charset – (default to ascii), ‘roman’, ‘rod’, ‘arabic’, ‘circled’, ‘circled_inverted’

  • column – specify which column (string or number) of the dataframe to use, else the first numerical is selected

  • compact – do not display empty stem rows (with no leaves), defaults to False

  • display – maximum number of data points to display, forces sampling if smaller than len(df)

  • flip_axes – bool, default is False

  • legend_pos – One of ‘top’, ‘bottom’, ‘best’ or None, defaults to ‘best’.

  • outliers – display outliers - these are from the full data set, not the sample. Defaults to Auto

  • persistence – filename. save sampled data to disk, either as pickle (.pkl) or csv (any other extension)

  • random_state – initial random seed for the sampling process, for reproducible research

  • scale – force a specific scale for building the plot. Defaults to None (automatic).

  • symmetric – if True, dot plot will be distributed on both sides of a center line

  • trim – ranges from 0 to 0.5 (50%) to remove from each end of the data set, defaults to None

  • unit – specify a string for the unit (‘$’, ‘Kg’…). Used for outliers and for legend, defaults to ‘’

  • zoom – zoom level, on top of calculated scale (+1, -1 etc)
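
For readers new to the format, the classic text plot can be reduced to the sketch below: each value is split into a stem and one leaf digit, and the leaves are listed in order next to their stem. text_stem_and_leaf is a hypothetical function; it assumes non-negative integers and an integer scale, skips empty stems, and leaves out outliers, legend and sampling.

```python
from collections import defaultdict


def text_stem_and_leaf(values, scale):
    """Hypothetical sketch: rows of 'stem | leaves' for integer data."""
    leaves = defaultdict(list)
    for v in sorted(values):             # sorted so leaves come out in order
        stem, leaf = divmod(v // scale, 10)
        leaves[stem].append(str(leaf))
    width = len(str(max(leaves)))        # widest stem label, for alignment
    return [f"{stem:>{width}} | {''.join(leaves[stem])}"
            for stem in sorted(leaves)]


plot = text_stem_and_leaf([12, 15, 21, 27, 29, 34, 108], scale=1)
```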