no code implementations • 3 Jun 2024 • Mykola Trokhymovych, Indira Sen, Martin Gerlach
With over 60M articles, Wikipedia has become the largest platform for open and freely accessible knowledge.
1 code implementation • 30 Jun 2021 • Charles C. Hyland, Yuanming Tao, Lamiae Azizi, Martin Gerlach, Tiago P. Peixoto, Eduardo G. Altmann
We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks.
1 code implementation • 28 Jan 2019 • Hanyu Shi, Martin Gerlach, Isabel Diersen, Doug Downey, Luis A. N. Amaral
Topic models are in widespread use in natural language processing and beyond.
3 code implementations • 19 Dec 2018 • Martin Gerlach, Francesc Font-Clos
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years.
1 code implementation • 4 Aug 2017 • Martin Gerlach, Tiago P. Peixoto, Eduardo G. Altmann
By adapting existing community-detection methods -- using a stochastic block model (SBM) with non-parametric priors -- we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents).
no code implementations • 11 Nov 2016 • Eduardo G. Altmann, Laercio Dias, Martin Gerlach
This finding allows us to identify the contribution of specific words (and word frequencies) for the different generalized entropies and also to estimate the size of the databases needed to obtain a reliable estimation of the divergences.
no code implementations • 1 Oct 2015 • Martin Gerlach, Francesc Font-Clos, Eduardo G. Altmann
Quantifying the similarity between symbolic sequences is a traditional problem in Information Theory which requires comparing the frequencies of symbols in different sequences.
no code implementations • 11 Feb 2015 • Eduardo G. Altmann, Martin Gerlach
Zipf's law is just one out of many universal laws proposed to describe statistical regularities in language.
no code implementations • 17 Jun 2014 • Martin Gerlach, Eduardo G. Altmann
In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies.
no code implementations • 17 Jun 2014 • Fakhteh Ghanbarnejad, Martin Gerlach, Jose M. Miotto, Eduardo G. Altmann
Combining data analysis with simulations of simple models (e.g., the Bass dynamics on complex networks), we identify signatures of endogenous and exogenous factors in the S-curves of adoption.
no code implementations • 6 Dec 2012 • Martin Gerlach, Eduardo G. Altmann
We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes.