All case studies
Case studyVivly × Aquin· May 2026

Structuring Social Data for AI

Aquin x Vivly dataset inspector

1.Abstract

This case study details the pipeline of extracting, structuring, and validating public sentiment data surrounding privacy and safety concerns related to AI-powered wearable devices. Discussions around Meta Ray-Ban smart glasses intensified in early 2026 as users, researchers, and online communities debated issues such as surveillance, consent, data collection, and transparency in AI training practices.

To analyze this broader conversation, the project integrated Vivly to autonomously identify signals and Aquin to rigorously inspect the dataset. While the retrieval pipeline surfaced a much larger corpus of wearable AI and privacy-related discussions, this case study specifically focuses on conversations related to Meta Ray-Ban smart glasses. The final result is a curated 1,500-entry training dataset sourced from Reddit and Hacker News that captures public discussions around consumer trust, privacy expectations, passive recording concerns, and AI-enabled wearable technology.

2.Background

The dataset was collected during a period of heightened public discussion around privacy practices associated with Meta Ray-Ban glasses. Conversations across online communities focused on topics such as passive recording, AI training transparency, third-party data handling, and how wearable AI devices may affect expectations of privacy in public and personal spaces.

One incident that intensified these discussions involved investigative reports alleging that contractors associated with AI data labeling operations reviewed user-generated video and audio clips captured through the glasses. The reports raised broader concerns around informed consent, transparency in AI training workflows, data handling practices, and the privacy implications of wearable AI systems.

Following public backlash and regulatory scrutiny, Meta reportedly paused and later ended parts of its collaboration with Sama, a Kenya-based data annotation company involved in the workflow. The controversy also contributed to public debate around AI governance, consumer trust, and responsible deployment of always-on wearable technologies.

3.Architecture

This project utilizes two primary platforms to process the unstructured data into a secure training set.

1. Data Acquisition and Structuring: Vivly

Vivly is a signal identification platform for public and social data. It surfaces meaningful signals from large scale discussions, helping enterprises understand exactly what is being discussed, by whom, and why it matters.

The project used the Vivly SDK, available via pip and npm, to fetch relevant discussions around the Meta Ray-Ban privacy controversy.

2. Dataset Validation and Compliance: Aquin

Aquin is a platform dedicated to building, inspecting, and improving artificial intelligence models, especially large language models. It focuses on peering into how models work internally to ensure they are reliable, safe, and accurate before deployment.

Because the dataset contained raw internet reactions to a highly sensitive privacy controversy, it required thorough sanitization before being used for training and analysis. For this, we used Aquin's Dataset Inspector, which ingests raw social data and processes it through a safety and compliance framework designed for artificial intelligence datasets. The platform performed the following critical checks:

4.Data Preparation

4.1 Data Sources

To capture authentic public reactions to Meta Ray-Ban smart glasses, we sourced data from two primary platforms: Reddit and Hacker News. These sites host some of the most unfiltered debates on emerging tech and privacy.

Reddit
Reddit
Hacker News
Hacker News

4.2 Data Extraction

We used the Vivly CLI to automate the extraction.

terminalvivly cli
$ vivly route \
    "Meta Ray-Ban wearable AI privacy discussions and recording concerns" \
    --reddit --hackernews \
    --items=1500 --format=jsonl

We queried the Vivly SDK with the above prompt, and it analysed the intent and identified the specific communities actively discussing it.

Redditr/RayBanStories
Redditr/RaybanMeta
Redditr/privacy

Using the Vivly SDK, the pipeline initially surfaced several thousand public discussions across Reddit and Hacker News related to wearable AI, privacy expectations, surveillance concerns, and Meta Ray-Ban smart glasses. After multiple filtering and validation stages designed to remove noise, duplicates, and low-relevance entries, the dataset was refined into a curated corpus of 1,500 high-signal discussions for downstream analysis and experimentation.

4.3 Result

Total entries~1,500 discussion items
Data rangeCurrent year (spike period)
Sample dataStructure
{
  "id": "1stq3ct",
  "url": "https://www.reddit.com/r/privacy/comments/1stq3ct/...",
  "score": 479,
  "title": "Being recorded with meta glasses during work",
  "content": "Today I was doing my job at a restaurant...",
  "subreddit": "privacy",
  "created_date": "2026-04-23T17:51:33+00:00",
  "num_comments": 227,
  "comments": [
    {
      "id": "ohv9lnq",
      "body": "Mention it to bosses as it has to be addressed in some standard
               yet inoffensive way for staff - that you can politely decline to
               be recorded more than a couple of seconds, say.",
      "score": 351,
      "depth": 0,
      "created_utc": 1776968911.0
    }
  ]
}

5.Data Processing

5.1 Dataset Preparation for Aquin

The raw JSON data extracted from Reddit and Hacker News was deeply valuable but far too unstructured for direct model training

The key step here was using Claude Sonnet 4.6 not to generate content, but to restructure it.

The model analyzed the raw data and logically grouped scattered discussions based on shared article links and core topics. This preserved the contextual richness of the human conversations while organizing them into coherent, unified threads.

This consolidation step compressed approximately 1,500 individual discussions into 296 structured conversational rows while preserving the core semantic context and sentiment patterns present across the source material. The resulting structure was significantly more efficient for downstream inspection, clustering, and compliance analysis.

Once the discussions were logically grouped, the data was passed through a lightweight formatting script. This step required no additional AI processing. The script converted the grouped data into a strict, LLaMA-compatible prompt-and-answer format. The output was a clean JSON Lines (JSONL) file structured to match the ingestion requirements of Aquin's Dataset Inspector.

Finally, the formatted JSONL file was uploaded into Aquin, where the Dataset Inspector automatically processed the entries through its predefined evaluation pipelines.

6.Process Views

A selection of views from the dataset inspector, audit surfaces, and pipeline output across each stage of the project.

6.1 Prompt Injection Scan

Clean

This process scans the dataset's training rows to detect embedded prompt injection patterns. It specifically looks for inputs designed to hijack the AI by overriding its primary instructions.

The dataset was analyzed (296 rows) and returned a completely "Clean" verdict. Zero rows were flagged, and the average injection score was an incredibly low 0.0037, meaning the data is secure from basic injection attacks.

Aquin · Dataset InspectorPrompt Injection Scan

Flagged

0%

0 rows

High Conf

0

≥ 0.88

Verdict

Clean

user0 flagged · 0%
assistant0 flagged · 0%

No prompt injection patterns detected

6.3 Bias Surface and Fairness Analysis

Low Risk

This analysis detects protected attributes, such as gender, race, or age, and measures label imbalances. The goal is to ensure the dataset is fair, balanced, and won't train the AI to exhibit discriminatory behavior.

The bias risk was marked as "Low." The system detected zero protected attributes and zero label columns across the 296 rows, concluding that there are no significant bias signals or fairness concerns.

Aquin · Dataset InspectorBias Surface and Fairness Analysis

Protected Attrs

0

detected

Label Columns

0

analysed

Bias Risk

Low

Summary Flags

No significant bias signals detected

No protected attributes or label columns detected

6.4 Toxicity Analysis

Clean

This scan evaluates the dataset for harmful, offensive, or inappropriate language. It identifies toxic rows, provides a severity breakdown, and pins the worst offenders for manual review.

The overall verdict is "Clean." While 4.7% of the data (14 rows) was flagged for minor toxicity, only 1 single row was classified as "severe" (scoring ≥ 0.8). The vast majority of the sample remains safe.

Aquin · Dataset InspectorToxicity Analysis

Flagged

4.7%

14 rows

Severe

1

≥ 0.8

Overall

CLEAN

Toxicity Distribution

Non-toxic95.3%
Minor toxic4.7%
Severe (≥ 0.8)0.3%

6.5 System Prompt Leak & Role Confusion Check

Clean

Based on the secondary prompt injection scan you uploaded, this specific check digs deeper into adversarial attacks that attempt to cause "role confusion" or trick the AI into leaking its confidential backend system prompts.

Just like the primary injection scan, this deep dive came back "Clean." The system confirmed a 0% flag rate for these advanced manipulation tactics in both the user and assistant columns.

Aquin · Dataset InspectorSystem Prompt Leak & Role Confusion Check

Flagged

0%

0 rows

High Conf

0

≥ 0.88

Verdict

Clean

user0 flagged · 0%
assistant0 flagged · 0%

No system prompt leaks or role confusion patterns detected

6.6 Synthetic Content Detection

Human

This process analyzes the text to determine if it was generated by an AI (synthetic) rather than written by a human. It scores the likelihood of AI origin and pinpoints the exact rows that look machine-generated.

The overall dataset is classified as "Human," with a very low average synthetic score of 0.1432. However, it did flag 0.7% of the data (2 rows) as highly synthetic: Row #178 (assistant) hit 100% synthetic confidence, and Row #138 (user) hit 90% confidence.

Aquin · Dataset InspectorSynthetic Content Detection

Synthetic

0.7%

2 rows

High Conf

2

≥ 0.9

Verdict

HUMAN

Score Distribution

HumanUncertainLikely Synth.Synthetic

Per-Column Breakdown

assistantavg 19%

0.3% flagged · 1 high conf

useravg 9%

0.3% flagged · 1 high conf

Flagged Rows (≥ 0.7 score)

#178assistant
HIGH 100%
#138user
HIGH 90%

6.7 Poisoned Sample Detection

Clean

This process analyzes the dataset to detect "poisoned" training samples, maliciously altered data meant to corrupt the AI's learning, by searching for cluster outliers, label inconsistencies, and loss anomaly signals.

Across the 296 rows analyzed, the dataset performed perfectly with a 0% flagged rate and a 0 high-confidence score, resulting in a completely "Clean" verdict. The average anomaly score remained extremely low at 0.1423.

A deeper look into the signal analysis confirmed that no significant cluster outliers were detected, no label inconsistencies were found, and zero loss anomalies were present. The dataset is currently free from any poisoned sample vulnerabilities.

Aquin · Dataset InspectorPoisoned Sample Detection

Flagged

0%

0 rows

High Conf

0

≥ 0.8

Verdict

Clean

user0 flagged · 0%
assistant0 flagged · 0%

No poisoned samples detected

6.9 Privacy and PII Scan

Mapped

This scan identifies Personally Identifiable Information (PII) across the dataset (names, contacts, locations) and reports which columns carry the highest exposure so teams know exactly what to address before production.

The scan surfaced 106 entities across 90 rows (30.4% of the dataset). The breakdown is exactly what you would expect from a global privacy story: 105 of those entities are nationality and religion mentions, the kind of contextual detail that makes social data valuable for understanding real public sentiment. The one concrete action item is a single phone number that appeared in a user comment, which is straightforward to redact.

The exposure is concentrated in the user column (29.7% of rows, 104 entities), while the assistant column is nearly clean (0.7% of rows, 2 entities). This distribution is typical for raw forum data. The scan has done its job: the team now knows exactly which rows to touch and which to leave alone.

Aquin · Dataset InspectorPrivacy and PII Scan

PII Rows

30.4%

90 of 296

Entities

106

detected

Risk

HIGH

2types

By Category

Sensitive105
Contact1

Entity Breakdown

Nationality / Religion
Medium105
Phone
High1

PII Density Per Column

user
High0.16/100

29.7% rows affected · 104 entities

assistant
Medium0.01/100

0.7% rows affected · 2 entities

6.10 Text Quality and Duplication Analysis

Low Duplication

This check evaluates the foundational quality of the dataset's text by analyzing the language distribution and scanning for exact or near-duplicate rows that could skew the AI's training.

The dataset showed exceptional text hygiene in this assessment. The language distribution is 100% English, meaning there are no mixed-language translation anomalies to account for.

Furthermore, the duplicate detection process (using a 0.85 Jaccard similarity threshold) confirmed that 100% of the 296 rows are clean. The system found 0% near-duplicates and 0 exact identical rows, resulting in a "Low Duplication" status.

Aquin · Dataset InspectorText Quality and Duplication Analysis
100%

Language Distribution

English100%

Clean Rows

100%

296 rows

Near-Dupes

0%

0 rows

Exact Dupes

0

Identical

Low Duplicationthreshold 0.85 Jaccard

6.11 Compliance Audit Trail Flags

Audit Complete

The audit trail grades the dataset against established regulatory frameworks and produces a prioritized action list, so teams know exactly what to resolve before the dataset enters a training pipeline.

The audit assessed 5 clauses and returned a clear, prioritized picture. The 4 flagged items all trace back to the same root cause: the nationality and religion mentions identified in the PII scan. These are expected in any dataset built from a global privacy controversy, and now they are precisely mapped, which is exactly the output you need before production.

Critically, the dataset passed Section 9 (Sensitive Personal Data) outright, confirming the absence of financial records, health data, Aadhaar, and PAN numbers. The hard categories are clean. What remains is a single well-scoped remediation: address the nationality mentions and the one phone number, and the dataset clears the remaining flags.

Aquin · Dataset InspectorCompliance Audit Trail Flags
40

1 dim · 5 clauses assessed

4 failed0 warned1 passed

Flagged: all trace to PII (4)

!Art. 10(3): Special categories of personal data25
!Section 4: Lawful basis for processing personal data25
!MAP 2.2: Identify risks, privacy25
!MANAGE 3.1: Remediation priority, Privacy / PII25

Passed: sensitive categories clean (1)

Section 9: No financial, health, or biometric identifiers100

6.12 Framework Scores and Remediation Plan

Roadmap Ready

The final step translates the audit findings into framework scores and a concrete remediation roadmap, so the team leaves with a clear path to a production-ready dataset, not just a list of issues.

These are pre-remediation baseline scores for a raw social dataset. These are the expected starting point before a compliance pass, not a measure of the data's usefulness. The India DPDPA score of 62% reflects that the hardest compliance requirements (no financial, health, or biometric data) are already met. The EU AI Act and NIST AI RMF scores track directly to the nationality mentions and the single phone number, both of which are well-understood and fixable in one pass.

The pipeline produced this full compliance picture, mapped, scored, and prioritized, automatically. A team running this without Vivly and Aquin would have reached the same point after weeks of manual review. The remediation plan itself is three steps, all scoped, none ambiguous.

Aquin · Dataset InspectorFramework Scores and Remediation Plan

Pre-Remediation Baseline Scores

raw social data · before compliance pass
EU AI Act25%

PII remediation scoped

India DPDPA62%

Partial, hard rules met

NIST AI RMF25%

PII remediation scoped

Three-step path to production

01Redact the one phone number and anonymize nationality mentions. All rows are mapped by the scan.
02Document the lawful basis for processing under DPDPA. Standard for any social data pipeline.
03Set up periodic re-scans as the dataset grows. Aquin handles this automatically.

7.Conclusion

The challenge with social data has rarely been access. The real difficulty is turning noisy, inconsistent public discussions into datasets that are structured enough for downstream analysis and model development. Public forums contain sarcasm, reposts, fragmented context, and low-signal commentary that make reliable dataset construction difficult at scale.

This project explored a different approach. Using the Vivly SDK, the pipeline identified relevant communities discussing wearable AI privacy concerns and surfaced high-signal discussions related to Meta Ray-Ban smart glasses. After filtering and refinement, the dataset was narrowed into a curated set of 1,500 discussion entries aligned with the core themes of the case study.

The structured output was then passed through a Claude-assisted organization stage followed by Aquin's inspection pipeline. Across multiple automated inspection layers, the dataset showed strong structural consistency with minimal indicators of adversarial manipulation, synthetic amplification, or poisoned content.

Because the dataset was collected directly from public online forums, certain discussions still contained personally identifiable information and sensitive user-provided details originating from the source platforms themselves. Before any downstream training or experimentation, the next stage of the pipeline will focus on PII scrubbing, remediation, and compliance alignment to ensure the dataset is safer and more suitable for research use.