`DescriptiveStats.OBeu` estimates the descriptive statistical measures needed at OpenBudgets.eu. You can measure the central tendency and dispersion of numeric variables, along with their distributions and correlations, and the frequencies of categorical variables for a given dataset on the OpenBudgets.eu data mining tool platform.
This vignette shows how to use the functions of `DescriptiveStats.OBeu` with datasets, including datasets from OpenBudgets.eu.
The `tojson` parameter is used in the `ds.analysis`, `ds.statistics`, `ds.hist`, `ds.boxplot`, `ds.correlation`, `ds.frequency`, `ds.kurtosis` and `ds.skewness` functions to specify whether the resulting object should be returned in JSON format.
First, you have to load the library.
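Assuming the package is already installed, loading it can be sketched as follows. The `requireNamespace` guard and the names `pkg` and `pkg_available` are illustrative additions so that the snippet also runs on a machine without the package; a plain `library(DescriptiveStats.OBeu)` is the usual form.

```r
# Name of the package described in this vignette.
pkg <- "DescriptiveStats.OBeu"

# TRUE if the package can be loaded, FALSE otherwise (no error is raised).
pkg_available <- requireNamespace(pkg, quietly = TRUE)

if (pkg_available) {
  # Attach the package so that the ds.* functions are on the search path.
  library(pkg, character.only = TRUE)
} else {
  message("Package ", pkg, " is not installed.")
}
```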
The data in the package include the budget of Wuppertal for 2009 to 2020, as a data frame `Wuppertal_df` and as a JSON link `Wuppertal_openspending`, as well as a sample JSON link `sample_json_link_openspending`. You can access the links using `fromJSON` from the `jsonlite` package, or copy and paste a link into a browser.
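As a rough illustration of what `fromJSON` does, the sketch below parses a small inline JSON string instead of one of the package links, so it does not need network access. The JSON content and the names `sample_json` and `parsed` are made up for this example; passing a link such as `Wuppertal_openspending` to `fromJSON` works the same way, since the function also accepts a URL.

```r
# Hypothetical JSON string standing in for the content behind a link.
sample_json <- '[{"Year": 2009, "Amount": 100.5}, {"Year": 2010, "Amount": 250}]'

parsed <- NULL
if (requireNamespace("jsonlite", quietly = TRUE)) {
  # An array of records is simplified to a data frame by default.
  parsed <- jsonlite::fromJSON(sample_json)
  str(parsed)
}
```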
Wuppertal internal structure
## 'data.frame': 6225 obs. of 10 variables:
## $ ProduktNR : chr "1109020" "1109020" "3103040" "3103040" ...
## $ Kontotyp : Factor w/ 2 levels "Aufwendung","Ertrag": 2 1 2 1 2 1 2 1 2 1 ...
## $ Art : Factor w/ 2 levels "Ergebnis","Plan": 1 1 1 1 1 1 1 1 1 1 ...
## $ Year : Factor w/ 12 levels "2009","2010",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Amount : num 203228 219134 1926839 11433219 18658 ...
## $ ProduktbereichNR: chr "11" "11" "31" "31" ...
## $ ProduktgruppeNR : chr "1109" "1109" "3103" "3103" ...
## $ Produkt : chr "(entfallen in 2012) E-Government / Internet" "(entfallen in 2012) E-Government / Internet" "Nicht definiert" "Nicht definiert" ...
## $ Produktbereich : chr "Innere Verwaltung" "Innere Verwaltung" "Soziale Leistungen" "Soziale Leistungen" ...
## $ Produktgruppe : chr "Geschäftsbereichsleitung GB 4" "Geschäftsbereichsleitung GB 4" "Grundsicherung SGB II" "Grundsicherung SGB II" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 10
## .. ..$ ProduktNR : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## .. ..$ Kontotyp : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## .. ..$ Art : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## .. ..$ Year : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_integer" "collector"
## .. ..$ Amount : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_double" "collector"
## .. ..$ ProduktbereichNR: list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_integer" "collector"
## .. ..$ ProduktgruppeNR : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_integer" "collector"
## .. ..$ Produkt : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## .. ..$ Produktbereich : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## .. ..$ Produktgruppe : list()
## .. .. ..- attr(*, "class")= chr [1:2] "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr [1:2] "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
`ds.analysis` is used to estimate the minimum, maximum, range, mean, median, first and third quartiles, variance, standard deviation, skewness and kurtosis, as well as the boxplot and histogram parameters needed for the visualization of numeric variables and the frequencies of factor variables, for a given vector, matrix or data frame.
Component | Output | Description |
---|---|---|
statistics | | |
boxplot | | |
histogram | | |
frequencies | | |
correlation | | |
`ds.analysis` returns a list object by default. Here we set the `tojson` parameter to `TRUE`, the `outliers` parameter to `FALSE`, and `fr.select = "Produktbereich"`.
The correlation component is empty because there is only one numeric variable.
wuppertalanalysis = ds.analysis(Wuppertal_df,outliers=FALSE, fr.select = "Produktbereich", tojson=TRUE) # json string format
jsonlite::prettify(wuppertalanalysis) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
## "descriptives": {
## "Min": {
## "Amount": [
## -2040680.54
## ]
## },
## "Max": {
## "Amount": [
## 507995000
## ]
## },
## "Range": {
## "Amount": [
## 510035680.54
## ]
## },
## "Mean": {
## "Amount": [
## 6171229.3658
## ]
## },
## "Median": {
## "Amount": [
## 736038.09
## ]
## },
## "Quantiles": {
## "Amount": [
## 243696.13,
## 2653000
## ]
## },
## "Variance": {
## "Amount": [
## 777106882358169.12
## ]
## },
## "StandardDeviation": {
## "Amount": [
## 27876636.8552
## ]
## },
## "Kurtosis": [
## 160.1519
## ],
## "Skewness": [
## 11.4762
## ]
## },
## "boxplot": {
## "Amount": {
## "lo.whisker": [
## -2040680.54
## ],
## "lo.hinge": [
## 243696.13
## ],
## "median": [
## 736038.09
## ],
## "up.hinge": [
## 2653000
## ],
## "up.whisker": [
## 6243113.59
## ],
## "box.width": [
## 11.83
## ],
## "lo.out": {
##
## },
## "up.out": {
##
## },
## "n": [
## 6225
## ]
## }
## },
## "histogram": {
## "Amount": {
## "cuts": [
## -50000000,
## 0,
## 50000000,
## 100000000,
## 150000000,
## 200000000,
## 250000000,
## 300000000,
## 350000000,
## 400000000,
## 450000000,
## 500000000,
## 550000000
## ],
## "counts": [
## 46,
## 6032,
## 83,
## 30,
## 10,
## 0,
## 1,
## 11,
## 2,
## 4,
## 4,
## 2
## ],
## "mean": [
## 6171229.3658
## ],
## "median": [
## 736038.09
## ]
## }
## },
## "frequencies": {
## "frequencies": {
## "Produktbereich": [
## {
## "Var1": "Allgemeine Finanzwirtschaft",
## "Freq": 101
## },
## {
## "Var1": "Bauen und Wohnen",
## "Freq": 193
## },
## {
## "Var1": "Gesundheitsdienste",
## "Freq": 207
## },
## {
## "Var1": "Innere Verwaltung",
## "Freq": 1737
## },
## {
## "Var1": "Kinder-, Jugend- u. Familienhilfe",
## "Freq": 373
## },
## {
## "Var1": "Kultur und Wissenschaft",
## "Freq": 346
## },
## {
## "Var1": "Natur- und Landschaftspflege",
## "Freq": 256
## },
## {
## "Var1": "Räuml.Planung, Entw., Geoinfo.",
## "Freq": 463
## },
## {
## "Var1": "Schulträgeraufgaben",
## "Freq": 364
## },
## {
## "Var1": "Sicherheit und Ordnung",
## "Freq": 591
## },
## {
## "Var1": "Soziale Leistungen",
## "Freq": 663
## },
## {
## "Var1": "Sportförderung",
## "Freq": 224
## },
## {
## "Var1": "Stiftungen",
## "Freq": 31
## },
## {
## "Var1": "Umweltschutz",
## "Freq": 128
## },
## {
## "Var1": "Ver- und Entsorgung",
## "Freq": 155
## },
## {
## "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
## "Freq": 261
## },
## {
## "Var1": "Wirtschaft und Tourismus",
## "Freq": 132
## }
## ]
## },
## "relative.frequencies": {
## "Produktbereich": [
## {
## "Var1": "Allgemeine Finanzwirtschaft",
## "Freq": 0.0162
## },
## {
## "Var1": "Bauen und Wohnen",
## "Freq": 0.031
## },
## {
## "Var1": "Gesundheitsdienste",
## "Freq": 0.0333
## },
## {
## "Var1": "Innere Verwaltung",
## "Freq": 0.279
## },
## {
## "Var1": "Kinder-, Jugend- u. Familienhilfe",
## "Freq": 0.0599
## },
## {
## "Var1": "Kultur und Wissenschaft",
## "Freq": 0.0556
## },
## {
## "Var1": "Natur- und Landschaftspflege",
## "Freq": 0.0411
## },
## {
## "Var1": "Räuml.Planung, Entw., Geoinfo.",
## "Freq": 0.0744
## },
## {
## "Var1": "Schulträgeraufgaben",
## "Freq": 0.0585
## },
## {
## "Var1": "Sicherheit und Ordnung",
## "Freq": 0.0949
## },
## {
## "Var1": "Soziale Leistungen",
## "Freq": 0.1065
## },
## {
## "Var1": "Sportförderung",
## "Freq": 0.036
## },
## {
## "Var1": "Stiftungen",
## "Freq": 0.005
## },
## {
## "Var1": "Umweltschutz",
## "Freq": 0.0206
## },
## {
## "Var1": "Ver- und Entsorgung",
## "Freq": 0.0249
## },
## {
## "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
## "Freq": 0.0419
## },
## {
## "Var1": "Wirtschaft und Tourismus",
## "Freq": 0.0212
## }
## ]
## }
## },
## "correlation": {
##
## }
## }
##
`ds.analysis` internally uses the functions `ds.statistics`, `ds.hist`, `ds.boxplot`, `ds.correlation` and `ds.frequency`. However, these functions can also be used independently, depending on the user's requirements.
`ds.statistics` is used to estimate the minimum, maximum, range, mean, median, first and third quartiles, variance, standard deviation, skewness and kurtosis of a given vector, matrix or data frame.
`ds.statistics` returns a list object by default:
## $Min
## $Min$Amount
## [1] -2040681
##
##
## $Max
## $Max$Amount
## [1] 507995000
##
##
## $Range
## $Range$Amount
## [1] 510035681
##
##
## $Mean
## $Mean$Amount
## [1] 6171229
##
##
## $Median
## $Median$Amount
## [1] 736038.1
##
##
## $Quantiles
## $Quantiles$Amount
## 25% 75%
## 243696.1 2653000.0
##
##
## $Variance
## $Variance$Amount
## [1] 7.771069e+14
##
##
## $StandardDeviation
## $StandardDeviation$Amount
## [1] 27876637
##
##
## $Kurtosis
## Amount
## 160.1519
##
## $Skewness
## Amount
## 11.47621
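For orientation, most of the measures in this list can be reproduced with base R. The following is a minimal sketch on a made-up vector, not the package's internal implementation; in particular, whether `ds.statistics` uses the same quantile type as the default of `quantile` is an assumption. The names `amount` and `basic_stats` are hypothetical.

```r
# Hypothetical amounts, used only for illustration.
amount <- c(120, 450, 300, 980, 75, 610, 15000)

# Collect the measures in a named list that mirrors the output above.
basic_stats <- list(
  Min = min(amount),
  Max = max(amount),
  Range = max(amount) - min(amount),
  Mean = mean(amount),
  Median = median(amount),
  Quantiles = quantile(amount, probs = c(0.25, 0.75)),  # first and third quartiles
  Variance = var(amount),                               # sample variance (n - 1)
  StandardDeviation = sd(amount)
)

str(basic_stats)
```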
To extract the results in JSON format for further use, set the `tojson` parameter to `TRUE`:
wuppertalstats = ds.statistics(Wuppertal_df, tojson = TRUE) # json format
jsonlite::prettify(wuppertalstats) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
## "Min": {
## "Amount": [
## -2040680.54
## ]
## },
## "Max": {
## "Amount": [
## 507995000
## ]
## },
## "Range": {
## "Amount": [
## 510035680.54
## ]
## },
## "Mean": {
## "Amount": [
## 6171229.3658
## ]
## },
## "Median": {
## "Amount": [
## 736038.09
## ]
## },
## "Quantiles": {
## "Amount": [
## 243696.13,
## 2653000
## ]
## },
## "Variance": {
## "Amount": [
## 777106882358169.12
## ]
## },
## "StandardDeviation": {
## "Amount": [
## 27876636.8552
## ]
## },
## "Kurtosis": [
## 160.1519
## ],
## "Skewness": [
## 11.4762
## ]
## }
##
`ds.hist` computes the parameters needed to visualize a histogram of a numeric input vector, specifying the breaks as in the base `hist` function.
## $cuts
## [1] -5.0e+07 0.0e+00 5.0e+07 1.0e+08 1.5e+08 2.0e+08 2.5e+08 3.0e+08
## [9] 3.5e+08 4.0e+08 4.5e+08 5.0e+08 5.5e+08
##
## $counts
## [1] 46 6032 83 30 10 0 1 11 2 4 4 2
##
## $mean
## [1] 6171229
##
## $median
## [1] 736038.1
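Since the breaks are specified as in base `hist`, the same kind of parameters can be obtained from `hist` directly with `plot = FALSE`. This is only a sketch on made-up data; the list layout is copied from the output above and the names `amount`, `h` and `hist_params` are hypothetical.

```r
# Hypothetical amounts, used only for illustration.
amount <- c(120, 450, 300, 980, 75, 610, 15000)

# Compute the histogram without drawing it.
h <- hist(amount, breaks = "Sturges", plot = FALSE)

hist_params <- list(
  cuts = h$breaks,      # bin boundaries
  counts = h$counts,    # number of observations per bin
  mean = mean(amount),
  median = median(amount)
)

str(hist_params)
```

Note that there is always one more cut than there are counts, as in the Wuppertal output (13 cuts, 12 counts).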
Return the results as a JSON string:
wuppertalhist = ds.hist(Wuppertal_df$Amount, breaks= "Sturges", tojson=TRUE) # json format
jsonlite::prettify(wuppertalhist) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
## "cuts": [
## -50000000,
## 0,
## 50000000,
## 100000000,
## 150000000,
## 200000000,
## 250000000,
## 300000000,
## 350000000,
## 400000000,
## 450000000,
## 500000000,
## 550000000
## ],
## "counts": [
## 46,
## 6032,
## 83,
## 30,
## 10,
## 0,
## 1,
## 11,
## 2,
## 4,
## 4,
## 2
## ],
## "mean": [
## 6171229.3658
## ],
## "median": [
## 736038.09
## ]
## }
##
`ds.boxplot` returns the parameters needed for a boxplot visualization of an input vector, matrix or data frame. If `outl` is `TRUE`, the outliers are computed at the selected `out.level` (the default is `1.5` times the interquartile range), and the box width is set to 0.15 times the square root of the size of the input data. `ds.boxplot` uses only the numeric variables of the input data, so you do not have to exclude factor or character variables.
wuppertalbox = ds.boxplot(Wuppertal_df, width = 0.15 , outl = FALSE, tojson=TRUE) # json format
jsonlite::prettify(wuppertalbox) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
## "Amount": {
## "lo.whisker": [
## -2040680.54
## ],
## "lo.hinge": [
## 243696.13
## ],
## "median": [
## 736038.09
## ],
## "up.hinge": [
## 2653000
## ],
## "up.whisker": [
## 6243113.59
## ],
## "box.width": [
## 11.83
## ],
## "lo.out": {
##
## },
## "up.out": {
##
## },
## "n": [
## 6225
## ]
## }
## }
##
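The quantities in this output resemble what base `boxplot.stats` returns, so a rough equivalent can be sketched as below. This assumes, without confirmation from the package source, that the whiskers and hinges follow the `boxplot.stats` conventions; the vector `amount` and the names `bs` and `box_params` are hypothetical.

```r
# Hypothetical amounts, used only for illustration.
amount <- c(120, 450, 300, 980, 75, 610, 15000)

# coef = 1.5 places the whiskers at most 1.5 times the hinge spread from the box.
bs <- boxplot.stats(amount, coef = 1.5)

box_params <- list(
  lo.whisker = bs$stats[1],
  lo.hinge = bs$stats[2],
  median = bs$stats[3],
  up.hinge = bs$stats[4],
  up.whisker = bs$stats[5],
  box.width = round(0.15 * sqrt(length(amount)), 2),
  out = bs$out,          # points beyond the whiskers
  n = bs$n
)

# Width rule applied to the 6225 Wuppertal observations.
wuppertal_width <- round(0.15 * sqrt(6225), 2)
```

The last line applies the width rule to the 6225 observations of `Wuppertal_df`, which gives the `box.width` of 11.83 reported above.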
`ds.correlation` estimates the correlation coefficient (the default is `"pearson"`) of the input vectors, matrix or data frame. In this example the `iris` dataset is used. Factor or character variables in the input matrix or data frame are filtered out by default.
iriscorr = ds.correlation(iris, cor.method="pearson", tojson=TRUE) # json format
jsonlite::prettify(iriscorr) # use prettify of jsonlite library to add indentation to the returned JSON string
## [
## {
## "Sepal.Length": 1,
## "Sepal.Width": -0.12,
## "Petal.Length": 0.87,
## "Petal.Width": 0.82,
## "_row": "Sepal.Length"
## },
## {
## "Sepal.Length": 0,
## "Sepal.Width": 1,
## "Petal.Length": -0.43,
## "Petal.Width": -0.37,
## "_row": "Sepal.Width"
## },
## {
## "Sepal.Length": 0,
## "Sepal.Width": 0,
## "Petal.Length": 1,
## "Petal.Width": 0.96,
## "_row": "Petal.Length"
## },
## {
## "Sepal.Length": 0,
## "Sepal.Width": 0,
## "Petal.Length": 0,
## "Petal.Width": 1,
## "_row": "Petal.Width"
## }
## ]
##
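The coefficients above can be cross-checked with base `cor`. The sketch below keeps only the numeric columns of `iris`, rounds to two decimals and zeroes the lower triangle to mimic the layout of the JSON output; that the package produces its result in exactly this way is an assumption, and the names `numeric_cols` and `corr` are hypothetical.

```r
# Keep only the numeric columns (drops the Species factor).
numeric_cols <- iris[sapply(iris, is.numeric)]

# Pearson correlation matrix, rounded to two decimals.
corr <- round(cor(numeric_cols, method = "pearson"), 2)

# Zero the lower triangle so that each pair is reported once.
corr[lower.tri(corr)] <- 0

print(corr)
```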
`ds.frequency` computes the frequencies and relative frequencies of the factor or character variables of the input dataset. Here it is applied to `Produktbereich` from the `Wuppertal_df` dataset, and the result is returned as a JSON string.
wuppertalfreq = ds.frequency(Wuppertal_df$Produktbereich, tojson = TRUE)
jsonlite::prettify(wuppertalfreq) # use prettify of jsonlite library to add indentation to the returned JSON string
## {
## "frequencies": {
## "data": [
## {
## "Var1": "Allgemeine Finanzwirtschaft",
## "Freq": 101
## },
## {
## "Var1": "Bauen und Wohnen",
## "Freq": 193
## },
## {
## "Var1": "Gesundheitsdienste",
## "Freq": 207
## },
## {
## "Var1": "Innere Verwaltung",
## "Freq": 1737
## },
## {
## "Var1": "Kinder-, Jugend- u. Familienhilfe",
## "Freq": 373
## },
## {
## "Var1": "Kultur und Wissenschaft",
## "Freq": 346
## },
## {
## "Var1": "Natur- und Landschaftspflege",
## "Freq": 256
## },
## {
## "Var1": "Räuml.Planung, Entw., Geoinfo.",
## "Freq": 463
## },
## {
## "Var1": "Schulträgeraufgaben",
## "Freq": 364
## },
## {
## "Var1": "Sicherheit und Ordnung",
## "Freq": 591
## },
## {
## "Var1": "Soziale Leistungen",
## "Freq": 663
## },
## {
## "Var1": "Sportförderung",
## "Freq": 224
## },
## {
## "Var1": "Stiftungen",
## "Freq": 31
## },
## {
## "Var1": "Umweltschutz",
## "Freq": 128
## },
## {
## "Var1": "Ver- und Entsorgung",
## "Freq": 155
## },
## {
## "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
## "Freq": 261
## },
## {
## "Var1": "Wirtschaft und Tourismus",
## "Freq": 132
## }
## ]
## },
## "relative.frequencies": {
## "data": [
## {
## "Var1": "Allgemeine Finanzwirtschaft",
## "Freq": 0.0162
## },
## {
## "Var1": "Bauen und Wohnen",
## "Freq": 0.031
## },
## {
## "Var1": "Gesundheitsdienste",
## "Freq": 0.0333
## },
## {
## "Var1": "Innere Verwaltung",
## "Freq": 0.279
## },
## {
## "Var1": "Kinder-, Jugend- u. Familienhilfe",
## "Freq": 0.0599
## },
## {
## "Var1": "Kultur und Wissenschaft",
## "Freq": 0.0556
## },
## {
## "Var1": "Natur- und Landschaftspflege",
## "Freq": 0.0411
## },
## {
## "Var1": "Räuml.Planung, Entw., Geoinfo.",
## "Freq": 0.0744
## },
## {
## "Var1": "Schulträgeraufgaben",
## "Freq": 0.0585
## },
## {
## "Var1": "Sicherheit und Ordnung",
## "Freq": 0.0949
## },
## {
## "Var1": "Soziale Leistungen",
## "Freq": 0.1065
## },
## {
## "Var1": "Sportförderung",
## "Freq": 0.036
## },
## {
## "Var1": "Stiftungen",
## "Freq": 0.005
## },
## {
## "Var1": "Umweltschutz",
## "Freq": 0.0206
## },
## {
## "Var1": "Ver- und Entsorgung",
## "Freq": 0.0249
## },
## {
## "Var1": "Verkehrsflächen/-anlagen,ÖPNV",
## "Freq": 0.0419
## },
## {
## "Var1": "Wirtschaft und Tourismus",
## "Freq": 0.0212
## }
## ]
## }
## }
##
If the input is a data frame and the `select` parameter is not specified, the frequencies of all the factor variables are returned.
All the numeric variables of the input data are filtered out of the estimations internally.
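Absolute and relative frequencies of this kind can be sketched with base `table` and `prop.table`. The data and the names `area`, `counts` and `freq` are hypothetical, and rounding the relative frequencies to four decimals is inferred from the output above rather than from the package source.

```r
# Hypothetical categorical values, used only for illustration.
area <- c("Umweltschutz", "Stiftungen", "Umweltschutz",
          "Soziale Leistungen", "Umweltschutz")

# Absolute frequencies per category.
counts <- table(area)

# Relative frequencies, rounded after conversion to a data frame.
rel <- as.data.frame(prop.table(counts))
rel$Freq <- round(rel$Freq, 4)

freq <- list(
  frequencies = as.data.frame(counts),
  relative.frequencies = rel
)

print(freq)
```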
`ds.kurtosis` calculates the kurtosis of the input vector, matrix or data frame. Factor or character variables that may be included in the input matrix or data frame are omitted from the estimations.
## [160.1519]
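The vignette does not state which estimator is used. As a point of reference only, the sketch below implements the moment-based definitions: kurtosis as the fourth central moment divided by the squared second central moment (not the excess kurtosis), and skewness as the third central moment divided by the second raised to the power 1.5. The function names `moment_kurtosis` and `moment_skewness` are hypothetical and are not part of the package.

```r
# Moment-based kurtosis: m4 / m2^2 (equals 3 for a normal distribution).
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^4) / mean(centered^2)^2
}

# Moment-based skewness: m3 / m2^(3/2) (equals 0 for symmetric data).
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Small symmetric example.
k <- moment_kurtosis(1:5)
s <- moment_skewness(1:5)
```

For the symmetric vector `1:5` the skewness is 0 and the kurtosis is 1.7; a strongly right-skewed, heavy-tailed variable such as `Amount` gives much larger values.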