Buscador híbrido¶
Los búscadores híbridos son aquellos buscadores que nos permiten combinar diferentes tipos de búsqueda. Hoy hemos visto la búsqueda vectorial sin embargo existen otros dipos de búsqueda, como por ejemplo la búsqueda por índices. La búsqueda por índices es un tipo de búsqueda en la que el objetivo es buscar si una palabra está literalmente en el texto o no
Base de conocimiento¶
Corpus¶
In [ ]:
Copied!
import pandas as pd
df = pd.read_parquet("hf://datasets/spanish-ir/messirve/es/train-00000-of-00001.parquet")
df
import pandas as pd
df = pd.read_parquet("hf://datasets/spanish-ir/messirve/es/train-00000-of-00001.parquet")
df
/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: The secret `HF_TOKEN` does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets. warnings.warn(
Out[ ]:
| id | query | docid | docid_text | query_date | answer_date | match_score | expanded_search | answer_type | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3812723 | a como se saca el porcentaje | 59256#13 | El valor porcentual se calcula multiplicando e... | 2024-03-30 | 2024-04-19 | 0.9011 | False | feat_snip |
| 1 | 3812733 | a cual dedo se pone el anillo de compromiso | 956254#2 | Pero desde hace cientos de años, se dice que l... | 2024-03-30 | 2024-04-25 | 0.7400 | False | feat_snip |
| 2 | 3812734 | a cual lleva tilde | 254539#6 | Los pronombres interrogativos y exclamativos "... | 2024-03-30 | 2024-04-19 | 1.0000 | False | feat_snip |
| 3 | 3812757 | a cuales países emigran los dominicanos | 8005#46 | Todos los niveles de la educación se desplomar... | 2024-03-30 | 2024-04-19 | 1.0000 | False | feat_snip |
| 4 | 3812773 | a cuantas calorias equivale un julio | 40424#5 | Para convertir las kilocalorías en kilojulios ... | 2024-04-04 | 2024-04-19 | 1.0000 | False | feat_snip |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23471 | 3948801 | sobre qué cuerpo de agua caminó jesús | 7917425#1 | Según los Evangelios, un anochecer Jesús y sus... | 2024-03-30 | 2024-04-19 | 0.8008 | False | feat_snip |
| 23472 | 3948814 | tipo de uva para vino blanco | 7363536#2 | Las que producen principalmente vinos blancos ... | 2024-03-30 | 2024-04-25 | 0.9481 | False | feat_snip |
| 23473 | 3948829 | x donde orina la mujer | 28992#4 | En la mujer la uretra tiene una longitud de 3,... | 2024-03-30 | 2024-04-19 | 0.9375 | False | feat_snip |
| 23474 | 3948831 | xbox cual es el mas nuevo | 58676#4 | La Xbox fue la primera de una marca continua d... | 2024-03-30 | 2024-04-19 | 1.0000 | False | feat_snip |
| 23475 | 3948833 | xbox cual es el ultimo modelo | 6243734#18 | El 12 de diciembre de 2019, durante el evento ... | 2024-03-30 | 2024-04-25 | 0.9307 | False | feat_snip |
23476 rows × 9 columns
In [ ]:
Copied!
corpus = df['docid_text'].to_list()
queries = df['query'].to_list()
corpus = df['docid_text'].to_list()
queries = df['query'].to_list()
Vectorización¶
In [ ]:
Copied!
import nltk
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
# El vectorizador de sklearn ya hace automáticamente la tokenizacion
vectorizer = TfidfVectorizer(
lowercase=True,
analyzer='word', # Tipo de tokenización: ‘word’, ‘char’, ‘char_wb’
stop_words= nltk.corpus.stopwords.words('spanish'), # Stop Words
ngram_range=(1, 3), # Ngramas generados
#max_df=200, # No tiene en cuenta las palabras que salen maś de N veces
#min_df=2, # No tiene en cuenta las palabras que salen menos de de N veces
max_features=5000, # Nº de dimensiones que genera el vectorizador
vocabulary=None, # Si queremos pasarle un vocabulario (wordnet)
).fit(corpus)
database = df[['id','docid_text']]
vectors = pd.DataFrame(vectorizer.transform(corpus).toarray())
database = database.join(vectors).set_index('id')
database
import nltk
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
# El vectorizador de sklearn ya hace automáticamente la tokenizacion
vectorizer = TfidfVectorizer(
lowercase=True,
analyzer='word', # Tipo de tokenización: ‘word’, ‘char’, ‘char_wb’
stop_words= nltk.corpus.stopwords.words('spanish'), # Stop Words
ngram_range=(1, 3), # Ngramas generados
#max_df=200, # No tiene en cuenta las palabras que salen maś de N veces
#min_df=2, # No tiene en cuenta las palabras que salen menos de de N veces
max_features=5000, # Nº de dimensiones que genera el vectorizador
vocabulary=None, # Si queremos pasarle un vocabulario (wordnet)
).fit(corpus)
database = df[['id','docid_text']]
vectors = pd.DataFrame(vectorizer.transform(corpus).toarray())
database = database.join(vectors).set_index('id')
database
[nltk_data] Downloading package stopwords to /root/nltk_data... [nltk_data] Unzipping corpora/stopwords.zip.
Out[ ]:
| docid_text | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 4990 | 4991 | 4992 | 4993 | 4994 | 4995 | 4996 | 4997 | 4998 | 4999 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 3812723 | El valor porcentual se calcula multiplicando e... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3812733 | Pero desde hace cientos de años, se dice que l... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3812734 | Los pronombres interrogativos y exclamativos "... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3812757 | Todos los niveles de la educación se desplomar... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3812773 | Para convertir las kilocalorías en kilojulios ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3948801 | Según los Evangelios, un anochecer Jesús y sus... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3948814 | Las que producen principalmente vinos blancos ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3948829 | En la mujer la uretra tiene una longitud de 3,... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3948831 | La Xbox fue la primera de una marca continua d... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3948833 | El 12 de diciembre de 2019, durante el evento ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
23476 rows × 5001 columns
Reducción de dimensionalidad¶
In [ ]:
Copied!
from sklearn.decomposition import TruncatedSVD
dim_reducer = TruncatedSVD(
n_components=300,
n_iter=2,
random_state=42
).fit(vectors)
database = df[['id','docid_text']]
vectors_reduced = pd.DataFrame(dim_reducer.transform(vectors))
database = database.join(vectors_reduced).set_index('id')
database
from sklearn.decomposition import TruncatedSVD
dim_reducer = TruncatedSVD(
n_components=300,
n_iter=2,
random_state=42
).fit(vectors)
database = df[['id','docid_text']]
vectors_reduced = pd.DataFrame(dim_reducer.transform(vectors))
database = database.join(vectors_reduced).set_index('id')
database
Out[ ]:
| docid_text | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | |||||||||||||||||||||
| 3812723 | El valor porcentual se calcula multiplicando e... | 0.101333 | -0.048685 | -0.009112 | -0.006216 | -0.006795 | 0.012850 | 0.027118 | -0.006417 | 0.010441 | ... | -0.001412 | -0.008448 | -0.071117 | -0.015181 | -0.028041 | -0.026897 | 0.045657 | 0.019699 | -0.001128 | -0.016296 |
| 3812733 | Pero desde hace cientos de años, se dice que l... | 0.070896 | -0.017466 | -0.001265 | 0.021529 | -0.019737 | 0.023293 | -0.005873 | 0.063538 | 0.051257 | ... | -0.022752 | 0.005064 | 0.011659 | 0.017520 | 0.008474 | -0.004018 | 0.022399 | -0.012608 | 0.030998 | 0.024373 |
| 3812734 | Los pronombres interrogativos y exclamativos "... | 0.012753 | -0.004595 | -0.000479 | 0.003199 | -0.002317 | -0.005708 | -0.000023 | 0.011214 | 0.000748 | ... | 0.003718 | 0.011335 | -0.022292 | -0.018718 | 0.014676 | 0.023493 | 0.003396 | -0.019152 | -0.023918 | 0.010601 |
| 3812757 | Todos los niveles de la educación se desplomar... | 0.136740 | 0.045374 | 0.024625 | 0.044793 | -0.014669 | -0.017622 | 0.000171 | 0.094869 | -0.001124 | ... | -0.006863 | 0.002470 | -0.002823 | 0.023762 | -0.020675 | 0.008375 | -0.029880 | -0.000067 | -0.018833 | -0.000475 |
| 3812773 | Para convertir las kilocalorías en kilojulios ... | 0.009960 | -0.008380 | -0.001740 | -0.004949 | 0.001792 | 0.001492 | 0.004582 | 0.002994 | -0.000958 | ... | -0.074880 | 0.051426 | 0.005806 | 0.007208 | 0.008484 | 0.017190 | -0.041506 | 0.004952 | -0.017484 | 0.014560 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3948801 | Según los Evangelios, un anochecer Jesús y sus... | 0.093580 | 0.025883 | -0.000035 | 0.032461 | 0.075862 | 0.008044 | 0.026076 | 0.023474 | -0.013794 | ... | -0.006529 | 0.020348 | -0.021678 | -0.006694 | 0.001444 | 0.004102 | -0.024192 | -0.033301 | -0.004889 | -0.008134 |
| 3948814 | Las que producen principalmente vinos blancos ... | 0.060115 | -0.014904 | -0.012748 | -0.023614 | 0.005865 | 0.005664 | 0.042966 | 0.002643 | -0.019031 | ... | -0.006741 | 0.045930 | -0.003952 | 0.000411 | 0.013414 | 0.061696 | 0.001620 | 0.019851 | -0.009238 | 0.044953 |
| 3948829 | En la mujer la uretra tiene una longitud de 3,... | 0.063755 | 0.000795 | -0.017230 | -0.030829 | 0.007358 | 0.028072 | -0.005443 | 0.025751 | 0.014481 | ... | -0.014338 | -0.004415 | 0.019717 | 0.007655 | -0.019533 | -0.010370 | 0.016517 | -0.014694 | 0.032864 | 0.021345 |
| 3948831 | La Xbox fue la primera de una marca continua d... | 0.028335 | 0.012092 | 0.004249 | 0.031147 | -0.021678 | -0.023396 | 0.023673 | -0.011457 | -0.008859 | ... | -0.022090 | 0.030033 | -0.034493 | -0.032832 | -0.051353 | 0.060501 | -0.043464 | -0.009363 | -0.001162 | 0.062794 |
| 3948833 | El 12 de diciembre de 2019, durante el evento ... | 0.076779 | 0.037045 | 0.012580 | 0.064977 | -0.044422 | -0.050232 | 0.038795 | -0.026799 | -0.024790 | ... | -0.016883 | 0.031044 | -0.016681 | -0.024578 | -0.025334 | 0.067752 | -0.053906 | -0.020431 | -0.018708 | 0.070876 |
23476 rows × 301 columns
Buscador¶
In [ ]:
Copied!
from scipy.spatial.distance import cosine
def hibrid_search(
keywords = ["xbox"],
sample_query = "xbox cual es el ultimo modelo",
negative_words = ["microsoft"],
n_samples = 10,
):
# Filas que contienen los keywords
has_keyword = database['docid_text'].apply(
lambda x: all([keyword in x.lower().split() for keyword in keywords]))
database_filtered = database[has_keyword]
# Filas que no contienen los negative negative_words
has_keyword = database['docid_text'].apply(
lambda x: all([word not in x.lower().split() for word in negative_words]))
database_filtered = database[has_keyword]
# Filtrar por coseno
query = vectorizer.transform([sample_query]).toarray()
query = dim_reducer.transform(query).flatten()
database_filtered['distance'] = database_filtered.drop('docid_text',axis=1).apply(
lambda x: cosine(x,query), axis=1)
return database_filtered[['docid_text','distance']
].sort_values(ascending=True, by='distance'
).drop_duplicates(
).head(n_samples)
response = hibrid_search()
response
from scipy.spatial.distance import cosine
def hibrid_search(
keywords = ["xbox"],
sample_query = "xbox cual es el ultimo modelo",
negative_words = ["microsoft"],
n_samples = 10,
):
# Filas que contienen los keywords
has_keyword = database['docid_text'].apply(
lambda x: all([keyword in x.lower().split() for keyword in keywords]))
database_filtered = database[has_keyword]
# Filas que no contienen los negative negative_words
has_keyword = database['docid_text'].apply(
lambda x: all([word not in x.lower().split() for word in negative_words]))
database_filtered = database[has_keyword]
# Filtrar por coseno
query = vectorizer.transform([sample_query]).toarray()
query = dim_reducer.transform(query).flatten()
database_filtered['distance'] = database_filtered.drop('docid_text',axis=1).apply(
lambda x: cosine(x,query), axis=1)
return database_filtered[['docid_text','distance']
].sort_values(ascending=True, by='distance'
).drop_duplicates(
).head(n_samples)
response = hibrid_search()
response
/usr/local/lib/python3.12/dist-packages/scipy/spatial/distance.py:682: RuntimeWarning: invalid value encountered in scalar divide
dist = 1.0 - uv / math.sqrt(uu * vv)
/tmp/ipython-input-1478102247.py:26: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
database_filtered['distance'] = database_filtered.drop('docid_text',axis=1).apply(
Out[ ]:
| docid_text | distance | |
|---|---|---|
| id | ||
| 3948831 | La Xbox fue la primera de una marca continua d... | 0.111995 |
| 3929957 | Xbox Game Pass Ultimate permite tener Xbox Gam... | 0.190291 |
| 3840870 | Xbox One cuenta con una GPU integrada basada e... | 0.326214 |
| 3897096 | Su lanzamiento fue el 15 de noviembre de 2013 ... | 0.413698 |
| 3930089 | En el E3 2021, Capcom anunció que el contenido... | 0.531548 |
| 3929878 | Se trata de una recopilación de los dos primer... | 0.546959 |
| 3905564 | Sin embargo, IGN dio al juego un análisis de 8... | 0.617022 |
| 3881507 | Halo es una franquicia de videojuegos de cienc... | 0.659656 |
| 3933960 | La versión más reciente de Windows es Windows ... | 0.668302 |
| 3935143 | Minecraft es un videojuego de construcción de ... | 0.671576 |