refactor: flatten Microsoft skills from nested to flat directory structure
Rewrote sync_microsoft_skills.py (v4) to use each SKILL.md's frontmatter 'name' field as the flat directory name under skills/, replacing the nested skills/official/microsoft/<lang>/<category>/<service>/ hierarchy. This fixes CI failures caused by the indexing, validation, and catalog scripts expecting skills/<id>/SKILL.md (depth 1).

Changes:
- Rewrite scripts/sync_microsoft_skills.py for flat output with collision detection
- Update scripts/tests/inspect_microsoft_repo.py for flat name mapping
- Update scripts/tests/test_comprehensive_coverage.py for name uniqueness checks
- Delete skills/official/ nested directory
- Add 129 Microsoft skills as flat directories (e.g. skills/azure-mgmt-botservice-dotnet/)
- Move attribution files to docs/ (LICENSE-MICROSOFT, microsoft-skills-attribution.json)
- Rebuild skills_index.json, CATALOG.md, README.md (845 total skills)
skills/azure-search-documents-py/SKILL.md (new file, 528 lines):
---
name: azure-search-documents-py
description: |
  Azure AI Search SDK for Python. Use for vector search, hybrid search, semantic ranking, indexing, and skillsets.
  Triggers: "azure-search-documents", "SearchClient", "SearchIndexClient", "vector search", "hybrid search", "semantic search".
package: azure-search-documents
---

# Azure AI Search SDK for Python

Full-text, vector, and hybrid search with AI enrichment capabilities.
## Installation

```bash
pip install azure-search-documents
```

## Environment Variables

```bash
AZURE_SEARCH_ENDPOINT=https://<service-name>.search.windows.net
AZURE_SEARCH_API_KEY=<your-api-key>
AZURE_SEARCH_INDEX_NAME=<your-index-name>
```
## Authentication

### API Key

```python
import os

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX_NAME"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"])
)
```

### Entra ID (Recommended)

```python
import os

from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential

client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX_NAME"],
    credential=DefaultAzureCredential()
)
```
## Client Types

| Client | Purpose |
|--------|---------|
| `SearchClient` | Search and document operations |
| `SearchIndexClient` | Index management, synonym maps |
| `SearchIndexerClient` | Indexers, data sources, skillsets |
## Create Index with Vector Field

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SearchableField,
    SimpleField
)

endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
key = os.environ["AZURE_SEARCH_API_KEY"]
index_client = SearchIndexClient(endpoint, AzureKeyCredential(key))

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchField(
        name="content_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="my-vector-profile"
    )
]

vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="my-hnsw")
    ],
    profiles=[
        VectorSearchProfile(
            name="my-vector-profile",
            algorithm_configuration_name="my-hnsw"
        )
    ]
)

index = SearchIndex(
    name="my-index",
    fields=fields,
    vector_search=vector_search
)

index_client.create_or_update_index(index)
```
## Upload Documents

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
key = os.environ["AZURE_SEARCH_API_KEY"]
client = SearchClient(endpoint, "my-index", AzureKeyCredential(key))

documents = [
    {
        "id": "1",
        "title": "Azure AI Search",
        "content": "Full-text and vector search service",
        "content_vector": [0.1, 0.2, ...]  # 1536 dimensions
    }
]

result = client.upload_documents(documents)
print(f"Uploaded {len(result)} documents")
```
## Keyword Search

```python
results = client.search(
    search_text="azure search",
    select=["id", "title", "content"],
    top=10
)

for result in results:
    print(f"{result['title']}: {result['@search.score']}")
```
## Vector Search

```python
from azure.search.documents.models import VectorizedQuery

# Your query embedding (1536 dimensions)
query_vector = get_embedding("semantic search capabilities")

vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=10,
    fields="content_vector"
)

results = client.search(
    vector_queries=[vector_query],
    select=["id", "title", "content"]
)

for result in results:
    print(f"{result['title']}: {result['@search.score']}")
```
## Hybrid Search (Vector + Keyword)

```python
from azure.search.documents.models import VectorizedQuery

vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=10,
    fields="content_vector"
)

results = client.search(
    search_text="azure search",
    vector_queries=[vector_query],
    select=["id", "title", "content"],
    top=10
)
```
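Azure AI Search fuses the keyword and vector result sets with Reciprocal Rank Fusion (RRF). A stdlib sketch of the fusion idea, assuming the commonly cited constant k=60 (illustrative only, not the service implementation):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: "b" ranks well in both lists, so it tops the fused ranking
keyword_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = rrf_fuse([keyword_ranking, vector_ranking])
```

This is why a document that is merely decent in both rankings can outrank one that is excellent in only one of them.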
## Semantic Ranking

```python
from azure.search.documents.models import QueryType

results = client.search(
    search_text="what is azure search",
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    select=["id", "title", "content"],
    top=10
)

for result in results:
    print(f"{result['title']}")
    if result.get("@search.captions"):
        print(f"  Caption: {result['@search.captions'][0].text}")
```
## Filters

```python
results = client.search(
    search_text="*",
    filter="category eq 'Technology' and rating gt 4",
    order_by=["rating desc"],
    select=["id", "title", "category", "rating"]
)
```
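Filter values are OData string literals, so embedded single quotes must be escaped by doubling them. A small helper sketch (hypothetical names) for building safe equality filters:

```python
def odata_string(value: str) -> str:
    """Quote a value as an OData string literal; single quotes are escaped by doubling."""
    return "'" + value.replace("'", "''") + "'"

def eq_filter(field: str, value: str) -> str:
    """Build a `field eq 'value'` clause with proper quoting."""
    return f"{field} eq {odata_string(value)}"

# eq_filter("author", "O'Brien") -> "author eq 'O''Brien'"
```

Building clauses through a helper like this avoids malformed filters when user input contains quotes.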
## Facets

```python
results = client.search(
    search_text="*",
    facets=["category,count:10", "rating"],
    top=0  # Only get facets, no documents
)

for facet_name, facet_values in results.get_facets().items():
    print(f"{facet_name}:")
    for facet in facet_values:
        print(f"  {facet['value']}: {facet['count']}")
```
## Autocomplete & Suggest

```python
# Autocomplete
results = client.autocomplete(
    search_text="sea",
    suggester_name="my-suggester",
    mode="twoTerms"
)

# Suggest
results = client.suggest(
    search_text="sea",
    suggester_name="my-suggester",
    select=["title"]
)
```
## Indexer with Skillset

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SearchIndexerSkillset,
    EntityRecognitionSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry
)

endpoint = os.environ["AZURE_SEARCH_ENDPOINT"]
key = os.environ["AZURE_SEARCH_API_KEY"]
indexer_client = SearchIndexerClient(endpoint, AzureKeyCredential(key))

# Create data source (connection_string for the source storage account must be supplied)
data_source = SearchIndexerDataSourceConnection(
    name="my-datasource",
    type="azureblob",
    connection_string=connection_string,
    container=SearchIndexerDataContainer(name="documents")
)
indexer_client.create_or_update_data_source_connection(data_source)

# Create skillset
skillset = SearchIndexerSkillset(
    name="my-skillset",
    skills=[
        EntityRecognitionSkill(
            inputs=[InputFieldMappingEntry(name="text", source="/document/content")],
            outputs=[OutputFieldMappingEntry(name="organizations", target_name="organizations")]
        )
    ]
)
indexer_client.create_or_update_skillset(skillset)

# Create indexer
indexer = SearchIndexer(
    name="my-indexer",
    data_source_name="my-datasource",
    target_index_name="my-index",
    skillset_name="my-skillset"
)
indexer_client.create_or_update_indexer(indexer)
```
## Best Practices

1. **Use hybrid search** to combine vector and keyword relevance
2. **Enable semantic ranking** for natural-language queries
3. **Index in batches** of 100-1000 documents for efficiency
4. **Use filters** to narrow results before ranking
5. **Match vector dimensions** to your embedding model
6. **Use the HNSW algorithm** for large-scale vector search
7. **Create suggesters** at index creation time (they cannot be added later)
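Practice 3 can be applied with a plain chunking helper before calling `upload_documents` (a minimal sketch; the `client` in the comment is the `SearchClient` assumed from the sections above):

```python
from typing import Iterator

def batched(documents: list[dict], batch_size: int = 1000) -> Iterator[list[dict]]:
    """Yield successive slices of at most batch_size documents per upload call."""
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

# Hypothetical usage with a SearchClient named `client`:
# for batch in batched(docs, 500):
#     client.upload_documents(batch)
```

Keeping each call within the 100-1000 document range avoids oversized request payloads while still amortizing per-request overhead.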
## Reference Files

| File | Contents |
|------|----------|
| [references/vector-search.md](references/vector-search.md) | HNSW configuration, integrated vectorization, multi-vector queries |
| [references/semantic-ranking.md](references/semantic-ranking.md) | Semantic configuration, captions, answers, hybrid patterns |
| [scripts/setup_vector_index.py](scripts/setup_vector_index.py) | CLI script to create vector-enabled search index |
---

## Additional Azure AI Search Patterns

Write clean, idiomatic Python code for Azure AI Search using `azure-search-documents`.
## Installation

```bash
pip install azure-search-documents azure-identity
```

## Environment Variables

```bash
AZURE_SEARCH_ENDPOINT=https://<search-service>.search.windows.net
AZURE_SEARCH_INDEX_NAME=<index-name>
# For API key auth (not recommended for production)
AZURE_SEARCH_API_KEY=<api-key>
```
## Authentication

**DefaultAzureCredential (preferred)**:
```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient

credential = DefaultAzureCredential()
client = SearchClient(endpoint, index_name, credential)
```

**API Key**:
```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

client = SearchClient(endpoint, index_name, AzureKeyCredential(api_key))
```
## Client Selection

| Client | Purpose |
|--------|---------|
| `SearchClient` | Query indexes, upload/update/delete documents |
| `SearchIndexClient` | Create/manage indexes, knowledge sources, knowledge bases |
| `SearchIndexerClient` | Manage indexers, skillsets, data sources |
| `KnowledgeBaseRetrievalClient` | Agentic retrieval with LLM-powered Q&A |
## Index Creation Pattern

```python
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SearchField, VectorSearch, VectorSearchProfile,
    HnswAlgorithmConfiguration, AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters, SemanticSearch,
    SemanticConfiguration, SemanticPrioritizedFields, SemanticField
)

index = SearchIndex(
    name=index_name,
    fields=[
        SearchField(name="id", type="Edm.String", key=True),
        SearchField(name="content", type="Edm.String", searchable=True),
        SearchField(name="embedding", type="Collection(Edm.Single)",
                    vector_search_dimensions=3072,
                    vector_search_profile_name="vector-profile"),
    ],
    vector_search=VectorSearch(
        profiles=[VectorSearchProfile(
            name="vector-profile",
            algorithm_configuration_name="hnsw-algo",
            vectorizer_name="openai-vectorizer"
        )],
        algorithms=[HnswAlgorithmConfiguration(name="hnsw-algo")],
        vectorizers=[AzureOpenAIVectorizer(
            vectorizer_name="openai-vectorizer",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=aoai_endpoint,
                deployment_name=embedding_deployment,
                model_name=embedding_model
            )
        )]
    ),
    semantic_search=SemanticSearch(
        default_configuration_name="semantic-config",
        configurations=[SemanticConfiguration(
            name="semantic-config",
            prioritized_fields=SemanticPrioritizedFields(
                content_fields=[SemanticField(field_name="content")]
            )
        )]
    )
)

index_client = SearchIndexClient(endpoint, credential)
index_client.create_or_update_index(index)
```
## Document Operations

```python
from azure.search.documents import SearchClient, SearchIndexingBufferedSender

# Batch upload with automatic batching
with SearchIndexingBufferedSender(endpoint, index_name, credential) as sender:
    sender.upload_documents(documents)

# Direct operations via SearchClient
search_client = SearchClient(endpoint, index_name, credential)
search_client.upload_documents(documents)           # Add new
search_client.merge_documents(documents)            # Update existing
search_client.merge_or_upload_documents(documents)  # Upsert
search_client.delete_documents(documents)           # Remove
```
## Search Patterns

```python
# Basic search
results = search_client.search(search_text="query")

# Vector search
from azure.search.documents.models import VectorizedQuery

results = search_client.search(
    search_text=None,
    vector_queries=[VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=5,
        fields="embedding"
    )]
)

# Hybrid search (vector + keyword)
results = search_client.search(
    search_text="query",
    vector_queries=[VectorizedQuery(vector=embedding, k_nearest_neighbors=5, fields="embedding")],
    query_type="semantic",
    semantic_configuration_name="semantic-config"
)

# With filters
results = search_client.search(
    search_text="query",
    filter="category eq 'technology'",
    select=["id", "title", "content"],
    top=10
)
```
## Agentic Retrieval (Knowledge Bases)

For LLM-powered Q&A with answer synthesis, see [references/agentic-retrieval.md](references/agentic-retrieval.md).

Key concepts:
- **Knowledge Source**: Points to a search index
- **Knowledge Base**: Wraps knowledge sources + LLM for query planning and synthesis
- **Output modes**: `EXTRACTIVE_DATA` (raw chunks) or `ANSWER_SYNTHESIS` (LLM-generated answers)
## Async Pattern

```python
from azure.search.documents.aio import SearchClient

async with SearchClient(endpoint, index_name, credential) as client:
    results = await client.search(search_text="query")
    async for result in results:
        print(result["title"])
```
## Best Practices

1. **Use environment variables** for endpoints, keys, and deployment names
2. **Prefer `DefaultAzureCredential`** over API keys for production
3. **Use `SearchIndexingBufferedSender`** for batch uploads (handles batching/retries)
4. **Always define a semantic configuration** for agentic retrieval indexes
5. **Use `create_or_update_index`** for idempotent index creation
6. **Close clients** with context managers or explicit `close()`
## Field Types Reference

| EDM Type | Python | Notes |
|----------|--------|-------|
| `Edm.String` | `str` | Searchable text |
| `Edm.Int32` | `int` | Integer |
| `Edm.Int64` | `int` | Long integer |
| `Edm.Double` | `float` | Floating point |
| `Edm.Boolean` | `bool` | True/False |
| `Edm.DateTimeOffset` | `datetime` | ISO 8601 |
| `Collection(Edm.Single)` | `list[float]` | Vector embeddings |
| `Collection(Edm.String)` | `list[str]` | String arrays |
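A client-side sanity check against this mapping can be sketched as follows (hypothetical helper; the service performs its own validation on upload):

```python
from datetime import datetime

# Python-side view of the EDM types in the table above (illustrative)
EDM_PYTHON_TYPES: dict[str, type] = {
    "Edm.String": str,
    "Edm.Int32": int,
    "Edm.Int64": int,
    "Edm.Double": float,
    "Edm.Boolean": bool,
    "Edm.DateTimeOffset": datetime,
    "Collection(Edm.Single)": list,  # list[float] vector embeddings
    "Collection(Edm.String)": list,  # list[str]
}

def check_document(doc: dict, schema: dict[str, str]) -> list[str]:
    """Return field names whose Python values don't match the declared EDM type."""
    return [
        field for field, edm in schema.items()
        if field in doc and not isinstance(doc[field], EDM_PYTHON_TYPES[edm])
    ]
```

Catching a type mismatch locally gives a clearer error than a failed indexing result for that document.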
## Error Handling

```python
from azure.core.exceptions import (
    HttpResponseError,
    ResourceNotFoundError,
    ResourceExistsError
)

try:
    result = search_client.get_document(key="123")
except ResourceNotFoundError:
    print("Document not found")
except HttpResponseError as e:
    print(f"Search error: {e.message}")
```