Shipping a Cordova hot update is the easy part. Keeping your OTA pipeline secure, observable, and recoverable under real-world pressure is what separates weekend projects from production-grade operations. This guide consolidates everything you need on the operational side: signing and secret management, automated QA gates, rollback scripts, performance KPIs, incident response, and stakeholder communication.
Table of Contents

1. CI/CD Pipeline Architecture for Hot Updates
2. Security: Code Signing, Secrets & Policy Enforcement
3. QA Automation: From Lint to Canary
4. Rollback Plan: Artifacts, Scripts & Observability
5. Performance Metrics & KPI Dashboards
6. Incident Response Playbook
7. Stakeholder Communication
8. Putting It All Together

1. CI/CD Pipeline Architecture for Hot Updates
A hot-update pipeline looks different from a standard app-release pipeline. You are not producing an IPA or APK—you are producing a signed, versioned bundle of web assets that must be validated, staged, and delivered without any app-store intermediary. That means every safeguard the store normally provides (review, signing verification, rollback) is now your responsibility.
Pipeline Stages
A production-ready OTA pipeline flows through six stages:
- Source — PR merge triggers the pipeline. The branch name or tag encodes the target: `ota/v3.4.1`.
- Build — Webpack/Vite produces the minified web bundle. Source maps are generated but stored separately (never shipped OTA).
- Sign — The bundle is hashed (SHA-256) and signed using a key managed by your KMS. The signature file ships alongside the bundle.
- Test — Static analysis, unit tests, device-level smoke tests, and canary deployment run in sequence.
- Stage — The signed bundle is uploaded to a CDN staging bucket. A version manifest is updated but not yet pointed at by production.
- Release — The production manifest pointer is updated. Phased rollout percentages are applied. Monitoring begins.
Reference Pipeline (GitHub Actions)
```yaml
# .github/workflows/ota-release.yml
name: OTA Hot Update Release

on:
  push:
    tags: ['ota/v*']

env:
  BUNDLE_DIR: www
  STAGING_BUCKET: s3://myapp-ota-staging
  PROD_BUCKET: s3://myapp-ota-prod
  CDN_DISTRIBUTION: E1A2B3C4D5E6F7

jobs:
  build-sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install & build
        run: |
          npm ci
          npm run build:prod
          # Output: www/ directory with index.html, js/, css/
      - name: Generate manifest
        run: |
          VERSION=${GITHUB_REF_NAME#ota/}
          SHA=$(find www -type f -exec sha256sum {} \; | sort | sha256sum | cut -d' ' -f1)
          cat > www/manifest.json <<MANIFEST
          {
            "version": "$VERSION",
            "bundleHash": "$SHA",
            "minNativeVersion": "3.0.0",
            "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
          }
          MANIFEST
      - name: Sign bundle with AWS KMS
        env:
          AWS_REGION: us-east-1
        run: |
          sha256sum www/manifest.json | cut -d' ' -f1 > manifest.digest
          aws kms sign \
            --key-id alias/ota-signing-key \
            --message fileb://manifest.digest \
            --message-type RAW \
            --signing-algorithm RSASSA_PKCS1_V1_5_SHA_256 \
            --output text --query Signature > www/manifest.sig
      - uses: actions/upload-artifact@v4
        with:
          name: ota-bundle
          path: www/

  test:
    needs: build-sign
    uses: ./.github/workflows/ota-test.yml

  deploy-staging:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with: { name: ota-bundle, path: www }
      - name: Upload to staging
        run: |
          VERSION=${GITHUB_REF_NAME#ota/}
          aws s3 sync www/ $STAGING_BUCKET/$VERSION/ --delete

  release-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # requires manual approval
    steps:
      - name: Promote to production
        run: |
          VERSION=${GITHUB_REF_NAME#ota/}
          aws s3 sync $STAGING_BUCKET/$VERSION/ $PROD_BUCKET/$VERSION/
          # Update the "current" pointer
          echo "{\"current\": \"$VERSION\"}" | \
            aws s3 cp - $PROD_BUCKET/latest.json \
              --content-type application/json
          # Invalidate CDN
          aws cloudfront create-invalidation \
            --distribution-id $CDN_DISTRIBUTION \
            --paths "/latest.json" "/$VERSION/*"
      - name: Notify Slack
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          VERSION=${GITHUB_REF_NAME#ota/}
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d "{\"text\":\"OTA $VERSION released to production.\"}"
```
Key design decisions: the production environment gate forces a manual approval step, source maps never leave the build runner, and the CDN invalidation targets only the changed paths.
2. Security: Code Signing, Secrets & Policy Enforcement
OTA updates bypass the app store’s built-in signature verification. If an attacker compromises your update endpoint or CDN, they can push arbitrary JavaScript to every device. That makes your signing and secret management practices existentially important.
Code Signing with HSM/KMS
Never store signing keys on CI runners or developer laptops. Use a Hardware Security Module (HSM) or cloud KMS where the private key never leaves the hardware boundary:
- AWS KMS — Use an asymmetric RSA or ECC key with the `SIGN_VERIFY` usage type. The CI runner sends a digest; KMS returns the signature. The key itself is non-extractable.
- Azure Key Vault — Similar flow via the `az keyvault key sign` command. Supports RSA-PSS and ECDSA.
- On-prem HSM (Thales, Yubico) — Use PKCS#11 tooling. The HSM plugs into your signing server; the CI runner calls the server via mTLS.
On the client side, embed the public key in the native binary (not in the web assets—those are what you are verifying). Before applying any OTA bundle, the app must:
- Download `manifest.json` and `manifest.sig`.
- Compute the SHA-256 digest of the manifest.
- Verify the signature against the embedded public key.
- Verify that every file in the bundle matches the hashes listed in the manifest.
If any check fails, the app must discard the bundle and continue running the previous version.
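The four checks can be rehearsed end to end with openssl before you implement them in the native layer. The sketch below is self-contained: it generates a throwaway keypair to stand in for KMS (in production the private key never leaves the HSM, and only the verify half runs on-device, against the embedded public key):

```shell
set -euo pipefail
workdir=$(mktemp -d) && cd "$workdir"

# Stand-in keypair; production signing happens inside KMS/HSM
openssl genrsa -out signing.key 2048 2>/dev/null
openssl rsa -in signing.key -pubout -out signing.pub 2>/dev/null

# Publisher side: digest the manifest, then sign the digest (mirrors the CI job)
printf '{"version":"3.4.1","bundleHash":"abc"}' > manifest.json
sha256sum manifest.json | cut -d' ' -f1 > manifest.digest
openssl dgst -sha256 -sign signing.key -out manifest.sig manifest.digest

# Device side: recompute the digest, then verify the signature over it
sha256sum manifest.json | cut -d' ' -f1 > recomputed.digest
cmp -s manifest.digest recomputed.digest || { echo "digest mismatch"; exit 1; }
openssl dgst -sha256 -verify signing.pub -signature manifest.sig recomputed.digest
# → prints "Verified OK" when the signature checks out
```

Flipping a single byte in `manifest.json` makes the digest comparison fail, which is exactly the discard-and-keep-running path described above.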
Secret Protection
Your OTA pipeline handles sensitive credentials: KMS access, CDN tokens, S3 write permissions. Lock these down with defense in depth:
- MFA on CI secrets — Require multi-factor authentication to view or edit pipeline secrets in GitHub/GitLab settings. Use OIDC federation where possible to eliminate long-lived credentials entirely.
- TLS everywhere — All bundle transfers (CI to S3, S3 to CDN, CDN to device) must use TLS 1.2+. Pin the CDN certificate in your Cordova app using `cordova-plugin-advanced-http` or a custom native bridge.
- Encrypted storage buckets — Enable SSE-S3 or SSE-KMS on your OTA buckets. Enable bucket versioning so you have an audit trail of every object mutation.
- Least-privilege IAM — The CI runner’s role should have `s3:PutObject` on the OTA bucket and `kms:Sign` on the signing key—nothing else. No `s3:*`, no admin policies.
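That least-privilege role can be written as a two-statement policy. A sketch (the bucket name, account ID, and key ID are illustrative placeholders, not real resources):

```shell
# Write the CI runner's policy: only s3:PutObject and kms:Sign, nothing else.
cat > ota-ci-policy.json <<'POLICY'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "UploadBundlesOnly",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::myapp-ota-staging/*"
    },
    {
      "Sid": "SignManifestsOnly",
      "Effect": "Allow",
      "Action": "kms:Sign",
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLEKEYID"
    }
  ]
}
POLICY
jq -r '.Statement[].Action' ota-ci-policy.json
# → s3:PutObject
# → kms:Sign
```

Create it with `aws iam create-policy --policy-document file://ota-ci-policy.json` and attach it to the runner's OIDC-federated role rather than to a long-lived user.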
Automated Policy Enforcement
Human review catches intent problems; machines catch known-bad patterns. Wire these into the pipeline as blocking gates:
```yaml
# In your CI test job
- name: SAST scan
  run: npx eslint www/ --rule '{"no-eval": "error", "no-implied-eval": "error"}'

- name: Dependency audit
  run: |
    npm audit --audit-level=high
    # Fail on known CVEs in production deps
    npx audit-ci --high

- name: License check
  run: npx license-checker --failOn 'GPL-3.0;AGPL-3.0'

- name: Bundle size gate
  run: |
    MAX_KB=2048
    SIZE_KB=$(du -sk www/ | cut -f1)
    if [ "$SIZE_KB" -gt "$MAX_KB" ]; then
      echo "Bundle $SIZE_KB KB exceeds $MAX_KB KB limit"
      exit 1
    fi
```
The bundle-size gate is critical for OTA: a bloated bundle means slower downloads, higher failure rates on poor connections, and increased CDN costs. Set the threshold based on your 90th-percentile user’s connection speed.
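Turning a connection-speed target into a byte budget is simple arithmetic. A sketch (the 2 Mbps figure is an assumed p90 link speed, not a measurement; substitute your own telemetry):

```shell
# Estimate worst-case download time for the bundle-size limit above.
MAX_KB=2048     # the gate's limit
P90_MBPS=2      # assumed 90th-percentile link speed
# KB -> kilobits (x8), divided by kilobits per second
SECS=$(( MAX_KB * 8 / (P90_MBPS * 1000) ))
echo "A ${MAX_KB} KB bundle takes ~${SECS}s at ${P90_MBPS} Mbps"
# → A 2048 KB bundle takes ~8s at 2 Mbps
```

An 8-second worst case already breaches a 5-second p95 download target, which is an argument for a tighter limit or for delta updates.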
3. QA Automation: From Lint to Canary
OTA updates skip the app-store review cycle, which means your own QA pipeline is the only thing between a merged PR and a broken app on a user’s device. Build it in layers.
Layer 1: Static & Unit Tests
These run in seconds and catch the majority of regressions:
- ESLint with strict rules — Ban `eval`, `document.write`, and `innerHTML` assignment. Enforce `no-undef` to catch missing imports that would crash at runtime.
- TypeScript (if applicable) — Run `tsc --noEmit` as a gate. Type errors in OTA bundles are especially dangerous because there is no compilation step on the device.
- Unit tests — Cover business logic, API response parsing, and state transitions. Target 80%+ coverage on code that ships OTA. Use Vitest or Jest with jsdom for fast execution.
Layer 2: Device-Level Smoke Tests
Static checks cannot catch platform-specific rendering bugs, plugin initialization failures, or webview quirks. Run smoke tests on actual (or emulated) devices:
```yaml
# ota-test.yml (called from main pipeline)
name: OTA Device Tests

on: workflow_call

jobs:
  emulator-smoke:
    runs-on: macos-latest
    strategy:
      matrix:
        platform: [ios, android]
    steps:
      - uses: actions/download-artifact@v4
        with: { name: ota-bundle, path: www }
      - name: Boot emulator
        run: |
          if [ "${{ matrix.platform }}" = "ios" ]; then
            xcrun simctl boot "iPhone 15"
          else
            $ANDROID_HOME/emulator/emulator -avd Pixel_7_API_34 -no-window &
            adb wait-for-device
          fi
      - name: Install test harness app
        run: |
          # Pre-built app shell with OTA client pointed at local server
          cordova run ${{ matrix.platform }} --emulator --no-build
      - name: Serve OTA bundle locally & trigger update
        run: |
          npx serve www -l 8080 &
          # Tell test app to pull from localhost:8080
          curl -X POST http://localhost:9090/trigger-update \
            -d '{"url":"http://localhost:8080"}'
      - name: Run Appium smoke suite
        run: |
          npx appium &
          npx wdio run wdio.ota-smoke.conf.js
          # Tests: app launches, main screen renders,
          # critical navigation works, no JS errors in console
```
Layer 3: Canary Deployment
Before full rollout, push the update to a small group and watch:
- Internal dogfood — Your team gets the update first. Run it for at least 2 hours.
- Canary ring (1–5% of users) — Selected randomly or by opt-in beta flag. Monitor crash rate, error rate, and API latency for 4–24 hours depending on your traffic volume.
- Automated promotion — If canary KPIs stay within thresholds (see Section 5), the pipeline automatically expands to 25%, 50%, 100%. If any threshold is breached, rollback triggers automatically.
Automated Sign-Off Gate
The promotion decision should be codified, not left to a human checking a dashboard:
```bash
#!/bin/bash
# canary-check.sh — run every 15 minutes during the canary window

CRASH_RATE=$(curl -s "$MONITORING_API/crash-rate?version=$VERSION&window=1h")
ERROR_RATE=$(curl -s "$MONITORING_API/error-rate?version=$VERSION&window=1h")
P95_LOAD=$(curl -s "$MONITORING_API/p95-load-time?version=$VERSION&window=1h")

CRASH_THRESHOLD="0.5"   # percent
ERROR_THRESHOLD="2.0"   # percent
LOAD_THRESHOLD="3000"   # milliseconds

fail=0
echo "$CRASH_RATE $CRASH_THRESHOLD" | awk '{if ($1 > $2) exit 1}' || { echo "FAIL: crash rate $CRASH_RATE% > $CRASH_THRESHOLD%"; fail=1; }
echo "$ERROR_RATE $ERROR_THRESHOLD" | awk '{if ($1 > $2) exit 1}' || { echo "FAIL: error rate $ERROR_RATE% > $ERROR_THRESHOLD%"; fail=1; }
echo "$P95_LOAD $LOAD_THRESHOLD" | awk '{if ($1 > $2) exit 1}' || { echo "FAIL: p95 load $P95_LOAD ms > $LOAD_THRESHOLD ms"; fail=1; }

if [ "$fail" -eq 1 ]; then
  echo "Canary check FAILED — initiating rollback"
  ./rollback.sh "$PREVIOUS_VERSION"
  exit 1
fi
echo "Canary check PASSED"
```
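The expand-or-halt logic around that gate can be sketched as a small ring loop. Here the KPI check is injected as a pluggable command so the demo runs offline; in production it would be the canary-check script above, and each ring would also update the rollout percentage and wait out an observation window:

```shell
set -u
promote() {
  local check_cmd="$1"
  for percent in 5 25 50 100; do
    echo "expanding rollout to ${percent}%"
    # production: write the percentage into the manifest, then
    # sleep for the observation window before checking KPIs
    if ! $check_cmd "$percent"; then
      echo "KPI breach at ${percent}%, halting"
      return 1
    fi
  done
  echo "rollout complete at 100%"
}

promote true             # healthy canary: walks 5 -> 25 -> 50 -> 100
promote false || true    # breached canary: halts at the first ring
```

Because the check runs once per ring, a regression that only shows up under real traffic is caught before it reaches the majority of users.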
4. Rollback Plan: Artifacts, Scripts & Observability
Every OTA release must be reversible within minutes. If your rollback requires a human to SSH into a server and manually copy files, you have already failed. Automate it end-to-end.
Artifact Retention Policy
Retain at minimum the last two stable bundles in your production bucket, alongside their signed manifests. This gives you:
- Immediate rollback target — The bundle that was running 5 minutes ago.
- Fallback rollback target — In case the previous version also had a latent issue exposed by the new version’s traffic pattern.
Store SHA-256 hashes of each bundle in a version registry (a simple JSON file in S3, a DynamoDB table, or a database row). The registry serves as the source of truth for what is “known good.”
| Version | Bundle Hash | Status | Released | Retired |
|---|---|---|---|---|
| v3.4.1 | a1b2c3... | current | 2026-03-29T14:00Z | — |
| v3.4.0 | d4e5f6... | rollback-ready | 2026-03-22T10:30Z | — |
| v3.3.2 | 7a8b9c... | rollback-ready | 2026-03-15T09:00Z | — |
| v3.3.1 | 0d1e2f... | archived | 2026-03-08T11:00Z | 2026-03-22 |
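A registry like the table above fits in a single JSON file that the release job rewrites atomically. A sketch with jq (the file name and field names are illustrative):

```shell
set -euo pipefail
cat > registry.json <<'EOF'
[
  {"version": "v3.4.0", "bundleHash": "d4e5f6", "status": "current"},
  {"version": "v3.3.2", "bundleHash": "7a8b9c", "status": "rollback-ready"}
]
EOF

NEW_VERSION="v3.4.1"
NEW_HASH="a1b2c3"
# Demote the old current to rollback-ready, then prepend the new release
jq --arg v "$NEW_VERSION" --arg h "$NEW_HASH" '
  map(if .status == "current" then .status = "rollback-ready" else . end)
  | [{version: $v, bundleHash: $h, status: "current"}] + .
' registry.json > registry.json.tmp && mv registry.json.tmp registry.json

jq -r '.[] | "\(.version)  \(.status)"' registry.json
```

In production the same transform would run against the S3 copy (download, rewrite, upload), with bucket versioning providing the audit trail of every mutation.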
Rollback Script
```bash
#!/bin/bash
# rollback.sh — revert OTA to a previous known-good version
set -euo pipefail

TARGET_VERSION="${1:?Usage: rollback.sh <version>}"
PROD_BUCKET="s3://myapp-ota-prod"
CDN_DISTRIBUTION="E1A2B3C4D5E6F7"
SLACK_WEBHOOK="${SLACK_WEBHOOK:?Set SLACK_WEBHOOK env var}"

echo "[$(date -u)] Starting rollback to $TARGET_VERSION"

# 1. Verify the target bundle exists and is signed
aws s3 ls "$PROD_BUCKET/$TARGET_VERSION/manifest.json" || {
  echo "ERROR: $TARGET_VERSION not found in production bucket"
  exit 1
}

# 2. Update the production pointer
echo "{\"current\": \"$TARGET_VERSION\", \"rolledBackAt\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" | \
  aws s3 cp - "$PROD_BUCKET/latest.json" --content-type application/json

# 3. Invalidate CDN edge caches
INVALIDATION_ID=$(aws cloudfront create-invalidation \
  --distribution-id "$CDN_DISTRIBUTION" \
  --paths "/latest.json" \
  --query 'Invalidation.Id' --output text)
echo "CDN invalidation: $INVALIDATION_ID"

# 4. Wait for propagation (typically 30-60s for a single path)
aws cloudfront wait invalidation-completed \
  --distribution-id "$CDN_DISTRIBUTION" \
  --id "$INVALIDATION_ID"

# 5. Notify stakeholders
curl -s -X POST "$SLACK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d "{
    \"text\": \"ROLLBACK executed: OTA reverted to $TARGET_VERSION. CDN invalidation complete. Investigate the failed release.\"
  }"

echo "[$(date -u)] Rollback to $TARGET_VERSION complete"
```
Wire this script into two triggers:
- Automated — Called by the canary-check script (Section 3) when KPIs breach thresholds.
- Manual — Available as a GitHub Actions workflow dispatch or a Slack slash command (`/ota-rollback v3.4.0`) for the on-call engineer.
Post-Rollback Observability
After a rollback, confirm that the old version is actually being served and adopted:
- Check CDN access logs for the `latest.json` response body.
- Monitor your analytics for the version distribution—the rolled-back version should climb back to 95%+ within 30 minutes for active users.
- If your app caches aggressively, some users may be stuck on the bad version until their next cold start. Consider a force-restart mechanism for critical incidents.
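The first check, confirming what the CDN actually claims to serve, reduces to comparing `latest.json` against the rollback target. A sketch written as a function so the parsing runs offline; in production the JSON payload would come from `curl -sf "$CDN_URL/latest.json"` (the `CDN_URL` variable is an assumption):

```shell
# Compare the version the CDN reports against the rollback target.
check_served_version() {
  local latest_json="$1" expected="$2" served
  served=$(printf '%s' "$latest_json" | jq -r '.current')
  if [ "$served" = "$expected" ]; then
    echo "OK: CDN serving $served"
  else
    echo "MISMATCH: CDN serving $served, expected $expected" >&2
    return 1
  fi
}

# Offline demo with the payload rollback.sh writes
check_served_version '{"current":"v3.4.0","rolledBackAt":"2026-03-29T14:05:00Z"}' "v3.4.0"
# → OK: CDN serving v3.4.0
```

Running this every minute after a rollback, and alerting on MISMATCH, catches the case where the CDN invalidation silently failed.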
5. Performance Metrics & KPI Dashboards
You cannot manage what you do not measure. Define clear KPIs with thresholds that trigger human review (warning) or automated action (critical).
Delivery KPIs
| Metric | Target | Warning | Critical | Measurement |
|---|---|---|---|---|
| Download success rate | ≥ 99% | < 98% | < 95% | CDN 200 responses / total requests |
| Download time (p95) | < 5s | > 8s | > 15s | Client-reported download duration |
| Install (apply) time | < 3s | > 7s | > 15s | Time from download complete to new bundle active |
| Signature verification failures | 0 | > 0 | > 10 in 1h | Client error reports |
| Time-to-fix (hotfix turnaround) | < 4h | > 12h | > 24h | Tag-to-release for hotfix OTAs |
Experience KPIs
| Metric | Target | Warning | Critical | Measurement |
|---|---|---|---|---|
| Crash-free sessions | ≥ 99.5% | < 99% | < 98% | Firebase Crashlytics / Sentry |
| JS error rate (new errors) | 0 | > 0.5% | > 2% | window.onerror + unhandledrejection |
| First Contentful Paint | < 1.5s | > 2.5s | > 4s | Performance observer in webview |
| Time to Interactive | < 3s | > 5s | > 8s | Custom TTI marker |
| API error rate (post-update) | baseline ±0.5% | ±2% | ±5% | HTTP 4xx/5xx from app |
Adoption Tracking
| Metric | Target | Notes |
|---|---|---|
| 1-hour adoption | ≥ 20% of DAU | Active users who check for updates on foreground |
| 24-hour adoption | ≥ 80% of DAU | Healthy update-check interval |
| 72-hour adoption | ≥ 95% of DAU | Stragglers are offline or have update disabled |
| Update-check failures | < 1% | Devices that tried to check but got a network error |
Dashboard Setup (Grafana Example)
```yaml
# docker-compose.grafana.yml — OTA monitoring stack
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_PASS}"
    volumes:
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
```

```yaml
# prometheus.yml scrape config for OTA metrics.
# Your app backend exposes /metrics with:
#   ota_download_total{version, status}
#   ota_download_duration_seconds{version, quantile}
#   ota_install_duration_seconds{version, quantile}
#   ota_active_version{version}            (gauge per version)
#   ota_signature_failures_total{version}
```
Build four Grafana panels: (1) a version adoption pie chart, (2) a download success rate time series with warning/critical threshold lines, (3) a p95 install time graph, and (4) a crash-free sessions overlay comparing the current version to the previous one.
6. Incident Response Playbook
When an OTA update goes wrong, you need a documented, rehearsed response—not a Slack thread where five people guess at what to do.
Severity Levels
| Level | Criteria | Response Time | Actions |
|---|---|---|---|
| SEV-1 (Critical) | App crashes on launch, data loss, security breach | < 15 min | Automatic rollback, page on-call, exec notification |
| SEV-2 (Major) | Core feature broken, >5% error rate spike | < 1 hour | Manual rollback decision, on-call investigates |
| SEV-3 (Minor) | Cosmetic issue, <1% affected, no data impact | < 4 hours | Hotfix OTA in next cycle, no rollback |
Step-by-Step Response (SEV-1)
- Detect (T+0) — Automated monitoring fires alert. PagerDuty/Opsgenie pages the on-call mobile engineer.
- Assess (T+5 min) — On-call checks the dashboard: which version, how many users affected, what is the error signature? Confirm it is OTA-related (not a backend outage).
- Rollback (T+10 min) — Run `./rollback.sh v3.4.0` or trigger via Slack command. Do not wait for root cause—stop the bleeding first.
- Verify (T+15 min) — Confirm CDN is serving the old version. Watch crash rate and error rate drop. Check adoption metrics to confirm users are picking up the rollback.
- Communicate (T+20 min) — Post to the incident channel: what happened, what was done, current status. Notify support team with a customer-facing talking point.
- Investigate (T+1h) — Pull the failed bundle’s source maps, reproduce the issue locally, identify root cause.
- Postmortem (T+48h) — Write a blameless postmortem. Identify what detection, testing, or process change would have caught this before release. File follow-up tickets.
Runbook Template
```markdown
## OTA Incident Runbook

**Trigger:** Alert from [Sentry/Crashlytics/Grafana] indicating OTA regression

**Pre-requisites:**
- Access to AWS CLI with the ota-operator role
- Slack #ota-incidents channel
- rollback.sh on your machine or available via GitHub Actions dispatch

**Decision tree:**
1. Is crash-free rate < 98%?
   YES → Immediate rollback (SEV-1)
   NO  → Continue to step 2
2. Is JS error rate > 2% above baseline?
   YES → Rollback within 1 hour (SEV-2)
   NO  → Continue to step 3
3. Is the issue cosmetic / low-impact?
   YES → Hotfix in next OTA cycle (SEV-3)
   NO  → Escalate to engineering lead

**Post-rollback checklist:**
- [ ] CDN serving correct version (curl -s $CDN_URL/latest.json)
- [ ] Error rate returning to baseline
- [ ] Support team notified with talking points
- [ ] Incident channel updated with timeline
- [ ] Failed bundle quarantined (moved to s3://myapp-ota-quarantine/)
- [ ] Postmortem scheduled within 48 hours
```
7. Stakeholder Communication
OTA updates are invisible to users when they work and catastrophic when they don’t. Your non-engineering stakeholders—support, marketing, product, executives—need structured, predictable communication so they are never surprised.
Weekly Rollup Email
Send every Monday morning. Keep it scannable:
```text
Subject: OTA Weekly Rollup — Mar 23-29, 2026

RELEASES
  v3.4.1 (Mar 29) — Fixed checkout flow timeout on Android 12+
  v3.4.0 (Mar 25) — Added promo banner for spring campaign

DELIVERY HEALTH
  Download success: 99.2% (target: 99%) ✓
  Install time p95: 2.1s (target: 3s) ✓
  Signature failures: 0 ✓

EXPERIENCE HEALTH
  Crash-free sessions: 99.7% (target: 99.5%) ✓
  JS error rate: 0.3% (target: <0.5%) ✓

ADOPTION
  v3.4.1 — 87% of DAU (24h), 96% (48h)
  v3.4.0 — retired, <2% remaining

INCIDENTS
  None this week.

NEXT WEEK
  v3.5.0 (est. Apr 1) — New onboarding flow. Canary planned for 12h.

---
Questions? Reply to this email or post in #app-releases.
```
Dedicated Status Channels
Different audiences need different levels of detail. Set up three channels:
- #ota-engineering — Every pipeline run, test result, and metric alert. Noisy by design. Engineers monitor this during releases.
- #ota-releases — One message per release and one per rollback. Support and marketing subscribe here. Messages include: version, what changed (one sentence), rollout percentage, and ETA to full rollout.
- #ota-incidents — Only fires when something goes wrong. Executives and product leads subscribe here. Includes severity, user impact, and resolution ETA.
Feedback Loops
Communication is not one-directional. Build feedback back into the pipeline:
- Support ticket tagging — Support agents tag tickets with the OTA version. A weekly query surfaces tickets correlated with specific releases.
- In-app feedback prompt — After an OTA update applies, show a subtle “How is the app working?” prompt to 5% of users. Funnel responses to a Slack channel or spreadsheet.
- Canary opt-in — Let power users opt into early updates via an in-app toggle. These users provide high-signal feedback because they actively look for issues.
- Postmortem distribution — After any SEV-1 or SEV-2 incident, share the blameless postmortem with all stakeholder channels. This builds trust and demonstrates process maturity.
8. Putting It All Together
Here is the operational checklist for every OTA release. Print it, pin it in Slack, or wire it into a GitHub issue template:
Pre-Release
- [ ] Bundle built from tagged commit, signed via KMS
- [ ] SAST scan, dependency audit, license check, bundle-size gate all green
- [ ] Unit tests pass, device smoke tests pass on iOS + Android emulators
- [ ] Rollback target identified and verified (previous version in prod bucket)
- [ ] Release notes drafted for #ota-releases
Release
- [ ] Deploy to staging, verify manually or with integration test
- [ ] Promote to canary (1–5%), start automated canary checks
- [ ] Post to #ota-releases: version, changes, canary window
- [ ] Monitor dashboards for canary window duration (4–24h)
Post-Release
- [ ] Canary checks pass → expand to 25% → 50% → 100%
- [ ] Confirm 80%+ adoption at 24h, 95%+ at 72h
- [ ] Archive previous-previous version (keep last 2 active)
- [ ] Update weekly rollup data
- [ ] Close release tracking issue
OTA hot updates give you a superpower: shipping fixes and features to users in minutes instead of days. But superpowers require discipline. Build the pipeline, automate the guardrails, define the thresholds, rehearse the rollback, and communicate relentlessly. The goal is not just fast releases—it is fast releases that you trust.