
Comprehensive notes covering OML4SQL, OML4Py, supervised and unsupervised learning, AI Vector Search, ONNX integration, and MLOps for Oracle AI Database 26ai (Release 23.26.1).
Oracle Machine Learning (OML) is a comprehensive, in-database machine learning platform embedded within Oracle AI Database 26ai. Unlike traditional ML frameworks that require data to be extracted from the database and moved to external environments, OML brings the algorithms directly to the data. This architecture eliminates the latency, security risks, and scalability constraints associated with data movement.
The OML architecture consists of several key components:
Traditional ML frameworks (scikit-learn, TensorFlow, PyTorch) typically require extracting data from the database into external environments. This approach introduces several challenges:
OML addresses these challenges by embedding ML algorithms directly into the database kernel, leveraging Oracle's parallel execution engine and advanced query optimization.
OML4SQL (formerly Oracle Data Mining) allows data scientists and SQL developers to build, evaluate, and deploy machine learning models using SQL and PL/SQL. This is particularly valuable for organizations where SQL expertise is more prevalent than Python or R.
Key OML4SQL capabilities:
DBMS_DATA_MINING package.PREDICTION, PREDICTION_PROBABILITY, and CLUSTER_ID.
-- Build a classification model using OML4SQL
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'CUSTOMER_CHURN_MODEL',
mining_function => 'CLASSIFICATION',
data_table_name => 'CUSTOMER_DATA',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'CHURN_FLAG',
settings_table_name => 'MODEL_SETTINGS'
);
END;
/
-- Score new data using the model
SELECT CUSTOMER_ID,
PREDICTION(CUSTOMER_CHURN_MODEL USING *) AS PREDICTED_CHURN,
PREDICTION_PROBABILITY(CUSTOMER_CHURN_MODEL USING *) AS CHURN_PROBABILITY
FROM NEW_CUSTOMER_DATA;
OML Notebooks provide a collaborative, Jupyter-like environment for developing, sharing, and deploying machine learning workflows. They support SQL, Python, R, and Markdown cells within a single notebook.
Key features:
The OML UI provides a comprehensive interface for managing the complete ML lifecycle:
Data quality is critical for ML success. OML4SQL provides functions for data exploration and preparation directly in SQL:
-- Explore data distribution
SELECT CHURN_FLAG, COUNT(*) AS RECORD_COUNT,
AVG(AGE) AS AVG_AGE,
AVG(INCOME) AS AVG_INCOME
FROM CUSTOMER_DATA
GROUP BY CHURN_FLAG;
-- Identify missing values
SELECT COUNT(*) - COUNT(AGE) AS MISSING_AGE,
COUNT(*) - COUNT(INCOME) AS MISSING_INCOME,
COUNT(*) - COUNT(TENURE) AS MISSING_TENURE
FROM CUSTOMER_DATA;
-- Data preparation with transformations
CREATE OR REPLACE VIEW PREPARED_CUSTOMER_DATA AS
SELECT CUSTOMER_ID,
CHURN_FLAG,
NVL(AGE, 0) AS AGE,
NVL(INCOME, 0) AS INCOME,
NVL(TENURE, 0) AS TENURE,
CASE GENDER WHEN 'M' THEN 1 ELSE 0 END AS GENDER_BINARY
FROM CUSTOMER_DATA;
Oracle AI Database 26ai introduces enhanced model lineage tracking. The BUILD_SOURCE column in data dictionary views now records the exact query used to train a model, providing full traceability.
-- Query model lineage information
SELECT MODEL_NAME,
BUILD_SOURCE, -- Contains the exact training query
CREATION_DATE,
ALGORITHM_NAME,
MINING_FUNCTION
FROM ALL_MINING_MODELS
WHERE MODEL_NAME = 'CUSTOMER_CHURN_MODEL';
OML4Py is Oracle's Python interface for in-database machine learning. It allows data scientists to leverage the Python ecosystem (pandas, NumPy, scikit-learn, etc.) while executing ML algorithms directly inside the Oracle AI Database.
Key architecture components:
Connecting OML4Py to the database is straightforward using Oracle's cx_Oracle driver and the OML4Py connection manager.
# Import OML4Py library
import oml
# Connect to Oracle AI Database
oml.connect(
user="my_username",
password="my_password",
dsn="my_database:1521/service_name"
)
# Verify connection
print(oml.version())
Oracle AI Database 26ai introduces the native VECTOR data type, which OML4Py fully supports. This enables similarity search, feature representation, and embedding-based analytics directly from Python.
# Create a table with a VECTOR column from Python
import oml
import numpy as np
# Generate random embeddings
embeddings = np.random.rand(100, 128).astype(np.float32)
# Create a proxy DataFrame
df = oml.DataFrame({
'product_id': list(range(100)),
'embedding': list(embeddings)
})
# Store in database
df.create_table('product_embeddings', overwrite=True)
# Query similar products
similar = df.cbind(
oml.similarity(
df['embedding'],
df['embedding'][0],
metric='cosine'
)
).sort(2, ascending=False).head(10)
Data preparation is critical for model quality. OML4Py provides comprehensive data preparation capabilities at scale.
The ODMS_EXPLOSION_MIN_SUPP setting controls the minimum support threshold for feature explosion in high-cardinality categorical variables. This helps manage memory usage and model interpretability.
# Configure data preparation settings
settings = {
'ODMS_EXPLOSION_MIN_SUPP': 0.01, # Minimum support threshold (1%)
'ODMS_MISSING_VALUE_HANDLING': 'AUTO',
'ODMS_OUTLIER_HANDLING': 'AUTO'
}
# Prepare data with transformations
from oml import OMLDataFrame
df = oml.sync(oml.dataframe('CUSTOMER_DATA'))
# Apply transformations
df_prepared = df.transform(
oml.transform.scale(['age', 'income', 'tenure']),
oml.transform.encode_one_hot(['category', 'region']),
oml.transform.fillna('all', 0)
)
OML4Py provides built-in scaling functions for standardizing features:
from oml import transform
# Standard scaling (z-score)
df_scaled = df.transform(
transform.scale(['age', 'income', 'tenure'])
)
# Min-max scaling
df_normalized = df.transform(
transform.minmax_scale(['age', 'income', 'tenure'])
)
Splitting data into training, testing, and validation sets is essential for unbiased model evaluation.
# Split data using OML4Py
train_data, test_data, valid_data = df.split(
train_fraction=0.6,
test_fraction=0.2,
valid_fraction=0.2,
random_state=42
)
print(f"Training: {train_data.shape[0]} rows")
print(f"Testing: {test_data.shape[0]} rows")
print(f"Validation: {valid_data.shape[0]} rows")
OML4Py provides comprehensive tools for handling missing values and outliers at scale:
# Detect and handle missing values
missing_counts = train_data.isnull().sum()
print("Missing values per column:", missing_counts)
# Fill missing values
df_filled = train_data.fillna({
'age': train_data['age'].mean(),
'income': train_data['income'].median(),
'category': 'UNKNOWN'
})
# Outlier detection and handling
from oml import outlier_detection
outliers = outlier_detection.z_score(df_filled, threshold=3.0)
df_clean = df_filled[~outliers] # Remove outliers
OML4Py leverages database parallelism for high-performance data processing. Best practices include:
# Benchmark data preparation
import time
start = time.time()
# Perform operations in the database
df_prepared = df.transform(
transform.scale(['age', 'income']),
transform.encode_one_hot(['category', 'region'])
)
end = time.time()
print(f"Data preparation completed in {end - start:.2f} seconds")
Supervised learning is the most common ML paradigm, where models learn patterns from labelled data to predict outcomes for new, unseen data. OML provides extensive support for both classification (predicting discrete categories) and regression (predicting continuous values).
OML supports multiple classification algorithms, each with different strengths:
# OML4SQL: Build a Random Forest model
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'RF_CHURN_MODEL',
mining_function => 'CLASSIFICATION',
data_table_name => 'PREPARED_CUSTOMER_DATA',
case_id_column_name => 'CUSTOMER_ID',
target_column_name => 'CHURN_FLAG',
settings_table_name => 'RF_SETTINGS'
);
END;
/
# OML4Py: Build an XGBoost model
from oml import OMLModel
# Create XGBoost classifier
xgb_model = OMLModel(
algorithm='xgboost',
task='classification',
target='churn_flag'
)
# Train the model
xgb_model.fit(train_data)
# Make predictions
predictions = xgb_model.predict(test_data)
XGBoost (Extreme Gradient Boosting) is a powerful ensemble algorithm that has become a cornerstone of modern machine learning. It offers several advantages:
# OML4Py: XGBoost with hyperparameter tuning
from oml import OMLModel, GridSearchCV
# Define XGBoost model
xgb = OMLModel(
algorithm='xgboost',
task='classification',
target='churn_flag'
)
# Parameter grid for tuning
param_grid = {
'max_depth': [3, 6, 9],
'learning_rate': [0.01, 0.1, 0.3],
'n_estimators': [50, 100, 200],
'subsample': [0.8, 1.0]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
estimator=xgb,
param_grid=param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(train_data)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")
Evaluating model performance is critical for selecting the best model and understanding its limitations.
# OML4SQL: Evaluate model
SELECT * FROM TABLE(
DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX(
model_name => 'CUSTOMER_CHURN_MODEL'
)
);
# OML4Py: Evaluate model
from oml import metrics
predictions = xgb_model.predict(test_data)
confusion = metrics.confusion_matrix(test_data['churn_flag'], predictions)
print("Confusion Matrix:\n", confusion)
accuracy = metrics.accuracy_score(test_data['churn_flag'], predictions)
precision = metrics.precision_score(test_data['churn_flag'], predictions)
recall = metrics.recall_score(test_data['churn_flag'], predictions)
f1 = metrics.f1_score(test_data['churn_flag'], predictions)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
Understanding why models make certain predictions is increasingly important for regulatory compliance and business trust. OML provides built-in explainability features:
# OML4Py: Feature importance
feature_importance = xgb_model.feature_importance()
print("Feature Importance:\n", feature_importance)
# Explain a single prediction
explanation = xgb_model.explain(test_data.iloc[0])
print("Explanation for row 0:\n", explanation)
Models are stored as first-class database objects, allowing for versioning, backup, and sharing:
# List all models
SELECT MODEL_NAME, ALGORITHM_NAME, CREATION_DATE, MINING_FUNCTION
FROM ALL_MINING_MODELS
ORDER BY CREATION_DATE DESC;
# Export model to a file
DECLARE
model_blob BLOB;
BEGIN
model_blob := DBMS_DATA_MINING.EXPORT_MODEL('CUSTOMER_CHURN_MODEL');
-- Store model_blob in a file or table
END;
/
# Import a model from a file
BEGIN
DBMS_DATA_MINING.IMPORT_MODEL(
model_name => 'CUSTOMER_CHURN_MODEL_V2',
model_blob => :blob_data
);
END;
/
Clustering is an unsupervised learning technique that groups similar data points together. It's widely used for customer segmentation, anomaly detection, and exploratory data analysis.
The most popular clustering algorithm, K-Means partitions data into K clusters based on distance to cluster centroids.
# OML4SQL: Build a K-Means model
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'CUSTOMER_SEGMENT_MODEL',
mining_function => 'CLUSTERING',
data_table_name => 'CUSTOMER_DATA',
case_id_column_name => 'CUSTOMER_ID',
settings_table_name => 'KMEANS_SETTINGS'
);
END;
/
# OML4Py: K-Means clustering
from oml import OMLModel
kmeans = OMLModel(
algorithm='kmeans',
task='clustering',
n_clusters=5
)
kmeans.fit(train_data)
# Assign cluster IDs
cluster_assignments = kmeans.predict(test_data)
# View cluster centers
cluster_centers = kmeans.cluster_centers_
print("Cluster Centers:\n", cluster_centers)
Hierarchical clustering builds a tree of clusters, providing a hierarchical view of data relationships. It's useful for smaller datasets where interpretability is key.
Anomaly detection identifies data points that deviate significantly from normal behaviour. OML provides several algorithms for this purpose:
# OML4Py: Anomaly detection with Isolation Forest
from oml import OMLModel
iso_forest = OMLModel(
algorithm='isolation_forest',
contamination=0.01,
random_state=42
)
iso_forest.fit(train_data)
# Predict anomalies (1 = anomaly, -1 = normal)
anomaly_scores = iso_forest.predict(test_data)
anomalies = test_data[anomaly_scores == 1]
print(f"Found {anomalies.shape[0]} anomalies")
Non-negative Matrix Factorization (NMF) is a dimensionality reduction technique that decomposes a matrix into two lower-rank matrices with non-negative entries. It's particularly useful for:
# OML4SQL: NMF for feature extraction
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'FEATURE_EXTRACTION_MODEL',
mining_function => 'FEATURE_EXTRACTION',
data_table_name => 'TEXT_EMBEDDINGS',
case_id_column_name => 'DOCUMENT_ID',
settings_table_name => 'NMF_SETTINGS'
);
END;
/
The Exponential Smoothing Method (ESM) in OML provides automatic time series forecasting with the following key features:
# OML4Py: Time series forecasting with ESM
from oml import OMLModel
esm = OMLModel(
algorithm='exponential_smoothing',
task='time_series',
target='sales',
time_column='date',
seasonal_period=12 # Monthly seasonality
)
esm.fit(sales_data)
# Generate forecasts
forecast = esm.forecast(horizon=12) # 12-month forecast
print("Forecast:\n", forecast)
# Plot forecast
esm.plot_forecast()
ESA is a text analysis technique that maps documents to dense vector representations (doc2vec embeddings) based on their semantic content. Key enhancements in 26ai include:
# OML4SQL: ESA for text analysis
BEGIN
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'TEXT_ANALYSIS_MODEL',
mining_function => 'FEATURE_EXTRACTION',
data_table_name => 'DOCUMENTS',
case_id_column_name => 'DOCUMENT_ID',
settings_table_name => 'ESA_SETTINGS'
);
END;
/
# Use ESA embeddings for classification
SELECT DOCUMENT_ID,
PREDICTION(CATEGORY_MODEL USING *) AS PREDICTED_CATEGORY
FROM DOCUMENTS_WITH_EMBEDDINGS;
Using K-Means clustering to identify customer segments based on purchasing behaviour, demographics, and engagement metrics. This enables targeted marketing campaigns and personalized recommendations.
Using ESM and other time series models to forecast product demand, enabling better inventory management, production planning, and supply chain optimization.
AI Vector Search is a breakthrough capability in Oracle AI Database 26ai that enables semantic similarity search across unstructured data. It's powered by the native VECTOR data type and advanced indexing techniques.
Vectors represent data as mathematical embeddings that capture semantic meaning. They can be:
# Generate embeddings from text
from oml import embedding
# Text to embedding
text = "This is a sample product description for similarity search"
embedding_vector = embedding.generate(text, model='all-MiniLM-L6-v2')
# Store embedding in database
df = oml.DataFrame({
'product_id': [101],
'description': [text],
'embedding': [embedding_vector]
})
df.create_table('product_embeddings', overwrite=True)
The power of AI Vector Search lies in combining it with traditional relational predicates and graph queries in a single SQL statement.
-- Hybrid query: Vector similarity + relational filter + graph relationships
SELECT p.product_id, p.product_name,
VECTOR_DISTANCE(p.embedding, :search_vector) AS similarity
FROM products p
WHERE p.price BETWEEN 100 AND 500
AND p.category = 'Electronics'
AND EXISTS (
SELECT 1 FROM product_relationships pr
WHERE pr.product_id = p.product_id
AND pr.relationship_type = 'COMPATIBLE'
)
ORDER BY similarity
FETCH FIRST 10 ROWS ONLY;
Oracle 26ai supports multiple vector index types for high-performance similarity search:
-- Create an HNSW vector index
CREATE VECTOR INDEX products_vec_idx ON products(embedding)
ORGANIZATION HNSW
DISTANCE COSINE
WITH TARGET ACCURACY 95;
-- Performance tuning with concurrent DML
ALTER VECTOR INDEX products_vec_idx
REBUILD PARAMETERS ('optimize_for_concurrent_dml=true');
ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models. Oracle AI Database 26ai provides first-class support for ONNX models.
# OML4Py: Export model to ONNX format
xgb_model.export_onnx('xgboost_churn_model.onnx')
# Import ONNX model into the database
BEGIN
DBMS_VECTOR.IMPORT_ONNX_MODEL(
model_name => 'ONNX_XGBOOST_MODEL',
file_name => 'xgboost_churn_model.onnx'
);
END;
/
Oracle 26ai introduces in-memory sharing of external initializers for ONNX models, reducing memory usage and improving scalability for concurrent inference workloads.
-- Deploy ONNX model with in-memory sharing
BEGIN
DBMS_VECTOR.DEPLOY_ONNX_MODEL(
model_name => 'ONNX_XGBOOST_MODEL',
memory_share => TRUE,
max_concurrent => 100
);
END;
/
# Score using ONNX model inside the database
SELECT product_id,
ONNX_SCORE('ONNX_XGBOOST_MODEL', features) AS predicted_price
FROM products
WHERE product_id IN (101, 102, 103);
-- Monitor vector index performance
SELECT INDEX_NAME, INDEX_TYPE, STATUS,
VECTOR_MEMORY_USAGE_MB,
VECTOR_INDEX_SIZE_MB
FROM V$VECTOR_INDEX_STATUS;
-- Optimize vector memory
ALTER SYSTEM SET VECTOR_MEMORY_SIZE = 4G;
MLOps (Machine Learning Operations) is the practice of applying DevOps principles to machine learning. The ML lifecycle includes:
Oracle AI Database 26ai provides comprehensive model management capabilities:
-- Track model lineage
SELECT MODEL_NAME,
ALGORITHM_NAME,
MINING_FUNCTION,
BUILD_SOURCE, -- Training query used
CREATION_DATE,
LAST_MODIFIED_DATE
FROM ALL_MINING_MODELS
WHERE MODEL_NAME LIKE 'CUSTOMER%';
-- Monitor model usage
SELECT MODEL_NAME,
LAST_SCORED_DATE,
SCORE_COUNT,
AVG_SCORE_TIME_MS
FROM DBA_MODEL_USAGE
ORDER BY LAST_SCORED_DATE DESC;
OML Services provides REST endpoints for model inference, enabling integration with web applications and microservices.
# Deploy model as a REST endpoint
from oml import OMLEndpoint
endpoint = OMLEndpoint(
model='CUSTOMER_CHURN_MODEL',
endpoint_name='churn_prediction',
description='Real-time churn prediction'
)
endpoint.deploy()
# Model is now available at:
# https://your-database-server/oml/api/v1/predict/churn_prediction
# Test the endpoint
response = endpoint.test({
'age': 35,
'income': 75000,
'tenure': 24,
'category': 'Electronics'
})
print(f"Prediction: {response['prediction']}")
OML models can be integrated with Oracle Analytics Cloud and third-party BI tools for visualization and decision support:
PREDICTION functions in BI dashboards.Models degrade over time as data distributions change. Monitoring for drift is essential for maintaining model accuracy.
# Monitor model performance over time
CREATE TABLE model_performance_history AS
SELECT SYSDATE AS evaluation_date,
model_name,
accuracy,
precision,
recall,
f1_score,
data_version
FROM model_evaluation;
# Check for performance degradation
SELECT evaluation_date, accuracy,
LAG(accuracy) OVER (ORDER BY evaluation_date) AS prev_accuracy,
(accuracy - LAG(accuracy) OVER (ORDER BY evaluation_date)) AS change
FROM model_performance_history
WHERE model_name = 'CUSTOMER_CHURN_MODEL'
ORDER BY evaluation_date DESC
FETCH FIRST 30 ROWS ONLY;
Oracle Scheduler and OML automation APIs enable automated retraining pipelines:
# Create a scheduled job for model retraining
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'RETRAIN_CHURN_MODEL_JOB',
job_type => 'PLSQL_BLOCK',
job_action => 'BEGIN retrain_churn_model(); END;',
start_date => SYSDATE,
repeat_interval => 'FREQ=MONTHLY; INTERVAL=1',
enabled => TRUE,
comments => 'Monthly retraining of churn model'
);
END;
/
# OML4Py: Automated retraining script
from oml import OMLModel, OMLDataFrame
def retrain_churn_model():
# Load new data
new_data = OMLDataFrame('CUSTOMER_DATA_NEW')
# Retrain model
model = OMLModel('CUSTOMER_CHURN_MODEL')
model.retrain(new_data)
# Evaluate new model
metrics = model.evaluate(validation_data)
# Log performance
log_performance(metrics)
# Deploy if performance improves
if metrics['accuracy'] > 0.85:
model.deploy()
else:
model.rollback()
Securing ML models and data is critical for enterprise deployments:
# Grant model privileges GRANT CREATE MINING MODEL TO data_scientist_role; GRANT SELECT ON CUSTOMER_CHURN_MODEL TO analyst_role; GRANT EXECUTE ON customer_churn_pkg TO app_user; # Revoke access REVOKE SELECT ON CUSTOMER_CHURN_MODEL FROM analyst_role;
Project Objective: Develop and deploy a complete machine learning solution using OML4SQL and OML4Py, including data preparation, model training, ONNX export, and REST API deployment for a real-world business problem.
Project Deliverables:
📚 References:
All documentation links are for Oracle AI Database 26ai (Release 23.26.1).