Data Science applied to Arca: development and availability of tools for information retrieval in the Institutional Repository of Fundação Oswaldo Cruz


  • Marcel de Moraes Pedroso, Fundação Oswaldo Cruz (FIOCRUZ), Instituto de Comunicação e Informação Científica e Tecnológica em Saúde (ICICT), Rio de Janeiro, RJ, Brasil
  • Jefferson da Costa Lima, Fundação Oswaldo Cruz (FIOCRUZ), Instituto de Comunicação e Informação Científica e Tecnológica em Saúde (ICICT), Rio de Janeiro, RJ, Brasil
  • Vinicius Belchior Assef Neto, Fundação Oswaldo Cruz (FIOCRUZ), Instituto de Comunicação e Informação Científica e Tecnológica em Saúde (ICICT), Rio de Janeiro, RJ, Brasil



Data Science, Information Storage and Retrieval, Data Mining, Machine Learning, Institutional Repositories.


The Arca institutional repository is the main open-access instrument of the Oswaldo Cruz Foundation (Fiocruz). Its mission is to gather, host, preserve, make available and give visibility to the institution's intellectual production. The Foundation's thematic diversity and institutional complexity pose a methodological challenge for the classification and retrieval of deposited digital objects and for the governance of the metadata recorded by the communities that make up the repository. In 2016, the Arca search engine recorded more than 400,000 queries. The repository needs an Information Retrieval (IR) system that meets its specific indexing requirements and the growing demand for information from users both inside and outside Fiocruz. In this work we propose the use of Data Science tools, especially Data Mining and Machine Learning techniques, to improve Information Retrieval through the automatic classification of digital objects deposited in Arca, and to develop and make available an IR system based on quality metrics related to the concepts of precision and recall.
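The precision and recall concepts mentioned above can be illustrated with a minimal sketch. It assumes the standard set-based definitions for a single query (precision is the share of retrieved items that are relevant, recall is the share of relevant items that were retrieved). The function name and the document identifiers are hypothetical and are not taken from the Arca system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, using set-based definitions."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
retrieved_ids = ["doc1", "doc2", "doc3", "doc4"]  # what the search engine returned
relevant_ids = ["doc2", "doc4", "doc7"]           # what a human judged relevant

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy query, two of the four retrieved items are relevant and two of the three relevant items were found, so precision is 0.50 and recall is about 0.67.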



How to Cite

Pedroso, M. de M., Lima, J. da C., & Assef Neto, V. B. (2017). Data Science applied to Arca: development and availability of tools for information retrieval in the Institutional Repository of Fundação Oswaldo Cruz. Revista Eletrônica De Comunicação, Informação & Inovação Em Saúde, 11.



Pecha Kucha