Lázaro is an observatory of anglicism usage in the Spanish press. The purpose of this project is to apply a data-driven approach to the study of anglicisms (i.e., unadapted lexical borrowings from English) in Spanish newspapers. Every day, Lázaro collects the latest news published by 22 Spanish news sources, analyzes the articles, and extracts the anglicisms used in them.
This talk (in Spanish) summarizes how the project was built:
The observatory currently monitors the following sources:
Source | Topic |
---|---|
El País | General news |
elDiario.es | General news |
ABC | General news |
El Mundo | General news |
La Vanguardia | General news |
El Confidencial | General news |
20 Minutos | General news |
Agencia EFE | General news |
Agencia Sinc | Science & Tech |
Muy Interesante | Science & Tech |
La Marea | Politics |
El Salto | Politics |
El Economista | Economy |
Cinco Días | Economy |
JotDown | Culture |
El Mundo Today | Satirical news |
Marca | Sports |
Rolling Stone | Music |
Fotogramas | Cinema |
Diez Minutos | Gossip |
Men's Health | Lifestyle |
Elle | Lifestyle |
The core of the project is a machine learning model that extracts unadapted lexical borrowings (especially English lexical borrowings, or anglicisms) from Spanish articles. The model is a BiLSTM-CRF fed with bilingual EN-ES embeddings along with subword embeddings (more information on the model can be found in the ACL paper). A previous version of the observatory, active from April 2020 to August 2022, ran on a Conditional Random Field (CRF) model (more information about that previous model can be found here).
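Sequence-labeling models such as a BiLSTM-CRF typically assign one label per token, and the labelled tokens are then grouped into spans. The sketch below illustrates that grouping step only. It assumes a BIO scheme with an `ENG` label; the label names, the function `bio_to_spans`, and the sample sentence are illustrative assumptions, not the observatory's actual code or output.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labelled tokens into (span text, span type) pairs.

    Assumed (hypothetical) label scheme: "B-X" opens a span of type X,
    "I-X" continues it, and "O" marks a token outside any span.
    """
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the span in progress, if there is one.
        nonlocal current, current_type
        if current:
            spans.append((" ".join(current), current_type))
        current, current_type = [], None

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            current.append(token)
        else:
            # "O", or an I- tag with no matching open span: treat as outside.
            flush()
    flush()
    return spans


# Hand-labelled toy sentence, not real model output.
tokens = ["Las", "fake", "news", "llegan", "al", "prime", "time"]
labels = ["O", "B-ENG", "I-ENG", "O", "O", "B-ENG", "I-ENG"]
print(bio_to_spans(tokens, labels))
# [('fake news', 'ENG'), ('prime time', 'ENG')]
```

Grouping at the span level rather than the token level keeps multiword borrowings such as "fake news" together as a single unit.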
The code of the observatory and the training corpus are available on GitHub. The anglicism detection model can also be used through the HuggingFace model hub or via the pylazaro Python library.
More information about the model and the training corpus can be found in the following publications:
@lazarobot
The Twitter bot @lazarobot tweets every day the new anglicisms extracted by the model (i.e., anglicisms that have never been seen before by the model), along with their context and a link to the article where they were found.
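The "never seen before" filter can be pictured as a set-membership check against previously recorded anglicisms. The following is a minimal sketch under stated assumptions: the function `find_new_anglicisms`, the dictionary fields (`borrowing`, `context`, `url`), the lowercase normalization, and the sample data are all hypothetical, and the bot's real implementation may differ.

```python
from typing import Dict, List, Set


def find_new_anglicisms(
    extracted: List[Dict[str, str]], seen: Set[str]
) -> List[Dict[str, str]]:
    """Return the items whose borrowing is not yet in `seen`, and record them.

    Borrowings are compared in lowercase (an assumption made for this sketch),
    so "Podcast" and "podcast" count as the same anglicism.
    """
    new_items = []
    for item in extracted:
        key = item["borrowing"].lower()
        if key not in seen:
            seen.add(key)  # later occurrences are no longer "new"
            new_items.append(item)
    return new_items


# Hypothetical data: one anglicism already on record, three extracted today.
seen = {"streaming"}
today = [
    {"borrowing": "streaming", "context": "... en streaming ...", "url": "https://example.com/a"},
    {"borrowing": "Podcast", "context": "... un nuevo Podcast ...", "url": "https://example.com/b"},
    {"borrowing": "podcast", "context": "... el podcast ...", "url": "https://example.com/c"},
]

for item in find_new_anglicisms(today, seen):
    print(item["borrowing"], "|", item["context"], "|", item["url"])
# Podcast | ... un nuevo Podcast ... | https://example.com/b
```

Keeping the context and URL alongside each borrowing means a first-time anglicism can be reported together with the sentence and article where it appeared.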
The purpose of this project is to describe and analyze the usage of anglicisms in the Spanish press. This project by no means seeks to criticize or condemn the usage of anglicisms, or those who use them.
The motivation behind Lázaro Observatory is not to defend an alleged linguistic purity, but to study the phenomenon of lexical borrowing from a descriptive and data-driven point of view.
The name of the project, Lázaro, is a tribute to the Spanish linguist Lázaro Carreter, whose prescriptivist columns against the usage of anglicisms in the Spanish press were extremely popular in Spain during the 1980s and the 1990s.
The project behind Lázaro Observatory has received the following awards:
Lázaro Observatory has been featured in the following Spanish media:
Research and third-party projects that use the data provided by Lázaro Observatory:
Lázaro Observatory is a project created and developed by Elena Álvarez Mellado. The project was originally created within the Broadening Linguistic Technologies Lab at Brandeis University (Massachusetts) under the supervision of Constantine Lignos and is currently developed within the Natural Language Processing and Information Retrieval research group at UNED University in Madrid (Spain).