Case studyVivly × Aquin· May 2026

Structuring Social Data for AI

1Abstract 2Background 3Architecture 4Data Preparation 5Data Processing 6Process Views 7Conclusion

1.Abstract

This case study details the pipeline of extracting, structuring, and validating public sentiment data during a live privacy scandal. In February 2026, reports surfaced that human contractors were reviewing intimate footage captured by Meta Ray-Ban smart glasses, triggering a massive backlash regarding wearable AI and privacy.

To analyze this critical moment, the project integrated Vivly to autonomously identify signals and Aquin to rigorously inspect the dataset. The result is a clean, 1,500 entry training dataset sourced from Reddit and Hacker News that maps the public response to surveillance, privacy, and deceptive marketing.

2.Background

The catalyst for this data collection was a major privacy breach. It was revealed that human reviewers contracted through Sama in Kenya were actively watching user video and audio clips from Meta Ray-Ban glasses to train AI models. Because users largely believed these interactions were private, highly intimate footage captured in personal spaces was unexpectedly reviewed by third parties.

After Swedish journalists exposed the operation, Meta abruptly cut ties with Sama. This led to over 1,000 workers losing their jobs overnight and sparked immediate lawsuits and investigations over deceptive marketing practices and consumer surveillance.

3.Architecture

This project utilizes two primary platforms to process the unstructured data into a secure training set.

1. Data Acquisition and Structuring: Vivly

Vivly is a signal identification platform for public and social data. It surfaces meaningful signals from large scale discussions, helping enterprises understand exactly what is being discussed, by whom, and why it matters.

The project used the Vivly SDK, available via pip and npm, to fetch relevant discussions around the Meta Ray-Ban privacy controversy.

2. Dataset Validation and Compliance: Aquin

Aquin is a platform dedicated to building, inspecting, and improving artificial intelligence models, especially large language models. It focuses on peering into how models work internally to ensure they are reliable, safe, and accurate before deployment.

Because the dataset contained raw internet reactions to a highly sensitive privacy controversy, it required thorough sanitization before being used for training and analysis. For this, we used Aquin's Dataset Inspector, which ingests raw social data and processes it through a safety and compliance framework designed for artificial intelligence datasets. The platform performed the following critical checks:

01Prompt Injection Scan 02Opt-Out and Consent Registry 03Bias Surface and Fairness Analysis 04Toxicity Analysis 05System Prompt Leak and Role Confusion Check 06Synthetic Content Detection 07Poisoned Sample Detection 08Copyright and License Risk Assessment 09Privacy and PII Scan 10Text Quality and Duplication Analysis 11Compliance Audit Trail Flags 12Framework Scores and Remediation Plan

4.Data Preparation

4.1 Data Sources

To capture authentic public reactions to Meta Ray-Ban smart glasses, we sourced data from two primary platforms: Reddit and Hacker News. These sites host some of the most unfiltered debates on emerging tech and privacy.

Hacker News

4.2 Data Extraction

We used the Vivly SDK to automate the extraction.

extract_meta_dataset.pypython · sdk 0.4

from vivly import Vivly

v = Vivly(api_key=...)

dataset = v.dataset(
    query="Meta Ray-Ban glasses privacy recording creepy scandal on Reddit and Hacker News",
    sources=["reddit", "hackernews"],
    format="jsonl",
)

# 1,500 discussion items · noise-filtered · structured
print(dataset.entries[0])
# → { "id": "1stq3ct", "subreddit": "privacy", ... }

We queried the Vivly SDK with the above prompt, and it analysed the intent and identified the specific communities actively discussing it.

r/RayBanStories

r/RaybanMeta

r/privacy

Vivly interfaces with the Reddit and Hacker News APIs to fetch discussions by matching relevant keywords generated directly from the initial query. Because these platforms contain a massive amount of irrelevant chatter and sarcasm, the raw output data is inherently noisy. To solve this, the data is passed through its noise-to-signal module to strip out the junk, leaving us with only high-value, relevant conversations focused strictly on the core themes of our case study.

4.3 Result

Total entries~1,500 discussion items

Data rangeCurrent year (spike period)

Sample dataStructure

{
  "id": "1stq3ct",
  "url": "https://www.reddit.com/r/privacy/comments/1stq3ct/...",
  "score": 479,
  "title": "Being recorded with meta glasses during work",
  "content": "Today I was doing my job at a restaurant...",
  "subreddit": "privacy",
  "created_date": "2026-04-23T17:51:33+00:00",
  "num_comments": 227,
  "comments": [
    {
      "id": "ohv9lnq",
      "body": "Mention it to bosses as it has to be addressed in some standard
               yet inoffensive way for staff - that you can politely decline to
               be recorded more than a couple of seconds, say.",
      "score": 351,
      "depth": 0,
      "created_utc": 1776968911.0
    }
  ]
}

5.Data Processing

5.1 Dataset Preparation for Aquin

The raw JSON data extracted from Reddit and Hacker News was deeply valuable but far too unstructured for direct model training

The key step here was using Claude Sonnet 4.6 not to generate content, but to restructure it.

The model analyzed the raw data and logically grouped scattered discussions based on shared article links and core topics. This preserved the contextual richness of the human conversations while organizing them into coherent, unified threads.

Once the discussions were logically grouped, the data was passed through a lightweight formatting script. This step required no additional AI processing. The script simply converted the grouped data into a strict, LLaMA-compatible prompt-and-answer format. The output was a clean JSON Lines (JSONL) file, precisely structured to match the ingestion requirements of Aquin's Dataset Inspector.

Finally, the formatted JSONL file was uploaded into Aquin. The Dataset Inspector automatically processed the entries through its predefined evaluation pipelines

6.Process Views

A selection of views from the dataset inspector, audit surfaces, and pipeline output across each stage of the project.

6.1 Prompt Injection Scan

Clean

This process scans the dataset's training rows to detect embedded prompt injection patterns. It specifically looks for inputs designed to hijack the AI by overriding its primary instructions.

The dataset was analyzed (296 rows) and returned a completely "Clean" verdict. Zero rows were flagged, and the average injection score was an incredibly low 0.0037, meaning the data is secure from basic injection attacks.

Aquin · Dataset InspectorPrompt Injection Scan

Flagged

0 rows

High Conf

≥ 0.88

Verdict

Clean

user0 flagged · 0%

assistant0 flagged · 0%

No prompt injection patterns detected

6.3 Bias Surface and Fairness Analysis

Low Risk

This analysis detects protected attributes, such as gender, race, or age, and measures label imbalances. The goal is to ensure the dataset is fair, balanced, and won't train the AI to exhibit discriminatory behavior.

The bias risk was marked as "Low." The system detected zero protected attributes and zero label columns across the 296 rows, concluding that there are no significant bias signals or fairness concerns.

Aquin · Dataset InspectorBias Surface and Fairness Analysis

Protected Attrs

detected

Label Columns

analysed

Bias Risk

Low

Summary Flags

No significant bias signals detected

No protected attributes or label columns detected

6.4 Toxicity Analysis

Clean

This scan evaluates the dataset for harmful, offensive, or inappropriate language. It identifies toxic rows, provides a severity breakdown, and pins the worst offenders for manual review.

The overall verdict is "Clean." While 4.7% of the data (14 rows) was flagged for minor toxicity, only 1 single row was classified as "severe" (scoring ≥ 0.8). The vast majority of the sample remains safe.

Aquin · Dataset InspectorToxicity Analysis

Flagged

4.7%

14 rows

Severe

≥ 0.8

Overall

CLEAN

Toxicity Distribution

Non-toxic95.3%

Minor toxic4.7%

Severe (≥ 0.8)0.3%

6.5 System Prompt Leak & Role Confusion Check

Clean

Based on the secondary prompt injection scan you uploaded, this specific check digs deeper into adversarial attacks that attempt to cause "role confusion" or trick the AI into leaking its confidential backend system prompts.

Just like the primary injection scan, this deep dive came back "Clean." The system confirmed a 0% flag rate for these advanced manipulation tactics in both the user and assistant columns.

Aquin · Dataset InspectorSystem Prompt Leak & Role Confusion Check

Flagged

0 rows

High Conf

≥ 0.88

Verdict

Clean

user0 flagged · 0%

assistant0 flagged · 0%

No system prompt leaks or role confusion patterns detected

6.6 Synthetic Content Detection

Human

This process analyzes the text to determine if it was generated by an AI (synthetic) rather than written by a human. It scores the likelihood of AI origin and pinpoints the exact rows that look machine-generated.

The overall dataset is classified as "Human," with a very low average synthetic score of 0.1432. However, it did flag 0.7% of the data (2 rows) as highly synthetic: Row #178 (assistant) hit 100% synthetic confidence, and Row #138 (user) hit 90% confidence.

Aquin · Dataset InspectorSynthetic Content Detection

Synthetic

0.7%

2 rows

High Conf

≥ 0.9

Verdict

HUMAN

Score Distribution

HumanUncertainLikely Synth.Synthetic

Per-Column Breakdown

assistantavg 19%

0.3% flagged · 1 high conf

useravg 9%

0.3% flagged · 1 high conf

Flagged Rows (≥ 0.7 score)

#178assistant

HIGH 100%

#138user

HIGH 90%

6.7 Poisoned Sample Detection

Clean

This process analyzes the dataset to detect "poisoned" training samples, maliciously altered data meant to corrupt the AI's learning, by searching for cluster outliers, label inconsistencies, and loss anomaly signals.

Across the 296 rows analyzed, the dataset performed perfectly with a 0% flagged rate and a 0 high-confidence score, resulting in a completely "Clean" verdict. The average anomaly score remained extremely low at 0.1423.

A deeper look into the signal analysis confirmed that no significant cluster outliers were detected, no label inconsistencies were found, and zero loss anomalies were present. The dataset is currently free from any poisoned sample vulnerabilities.

Aquin · Dataset InspectorPoisoned Sample Detection

Flagged

0 rows

High Conf

≥ 0.8

Verdict

Clean

user0 flagged · 0%

assistant0 flagged · 0%

No poisoned samples detected

6.8 Copyright and License Risk Assessment

Elevated

This analysis evaluates the dataset for potential intellectual property violations by calculating a composite IP score based on domain analysis, inline license signals, and copyrighted content markers.

Unlike the security scans, this scan flagged an "Elevated" overall risk, issuing a composite score of 46 out of 100. This elevated status is driven entirely by the lack of a declared license. Because no license is attached to the data, the system automatically assumes a "restricted" status, generating a high license risk score of 75.

Fortunately, the actual content analysis poses a very low risk (score: 0.0125). Across a 200-row sample, the system found 0% copyright notices, 0% open license references, and 0% book/publication markers. The only minor flag was that 2.5% of the rows contained "news wire phrases," but no direct copyrighted content was identified.

Aquin · Dataset InspectorCopyright and License Risk Assessment

Overall Risk

Elevated

License75%

Content1%

License

No license declared

Assumed restricted

75 risk

Content Signals: 200 Rows

Open License Refs

News Wire Phrases

2.5%

Book / Pub Markers

6.9 Privacy and PII Scan

Mapped

This scan identifies Personally Identifiable Information (PII) across the dataset (names, contacts, locations) and reports which columns carry the highest exposure so teams know exactly what to address before production.

The scan surfaced 106 entities across 90 rows (30.4% of the dataset). The breakdown is exactly what you would expect from a global privacy story: 105 of those entities are nationality and religion mentions, the kind of contextual detail that makes social data valuable for understanding real public sentiment. The one concrete action item is a single phone number that appeared in a user comment, which is straightforward to redact.

The exposure is concentrated in the user column (29.7% of rows, 104 entities), while the assistant column is nearly clean (0.7% of rows, 2 entities). This distribution is typical for raw forum data. The scan has done its job: the team now knows exactly which rows to touch and which to leave alone.

Aquin · Dataset InspectorPrivacy and PII Scan

PII Rows

30.4%

90 of 296

Entities

106

detected

Risk

HIGH

By Category

Sensitive105

Contact1

Entity Breakdown

Nationality / Religion

Medium105

Phone

High1

PII Density Per Column

user

High0.16/100

29.7% rows affected · 104 entities

assistant

Medium0.01/100

0.7% rows affected · 2 entities

6.10 Text Quality and Duplication Analysis

Low Duplication

This check evaluates the foundational quality of the dataset's text by analyzing the language distribution and scanning for exact or near-duplicate rows that could skew the AI's training.

The dataset showed exceptional text hygiene in this assessment. The language distribution is 100% English, meaning there are no mixed-language translation anomalies to account for.

Furthermore, the duplicate detection process (using a 0.85 Jaccard similarity threshold) confirmed that 100% of the 296 rows are clean. The system found 0% near-duplicates and 0 exact identical rows, resulting in a "Low Duplication" status.

Aquin · Dataset InspectorText Quality and Duplication Analysis

Language Distribution

English100%

Clean Rows

100%

296 rows

Near-Dupes

0 rows

Exact Dupes

Identical

Low Duplicationthreshold 0.85 Jaccard

6.11 Compliance Audit Trail Flags

Audit Complete

The audit trail grades the dataset against established regulatory frameworks and produces a prioritized action list, so teams know exactly what to resolve before the dataset enters a training pipeline.

The audit assessed 5 clauses and returned a clear, prioritized picture. The 4 flagged items all trace back to the same root cause: the nationality and religion mentions identified in the PII scan. These are expected in any dataset built from a global privacy controversy, and now they are precisely mapped, which is exactly the output you need before production.

Critically, the dataset passed Section 9 (Sensitive Personal Data) outright, confirming the absence of financial records, health data, Aadhaar, and PAN numbers. The hard categories are clean. What remains is a single well-scoped remediation: address the nationality mentions and the one phone number, and the dataset clears the remaining flags.

Aquin · Dataset InspectorCompliance Audit Trail Flags

1 dim · 5 clauses assessed

4 failed0 warned1 passed

Flagged: all trace to PII (4)

!Art. 10(3): Special categories of personal data25

!Section 4: Lawful basis for processing personal data25

!MAP 2.2: Identify risks, privacy25

!MANAGE 3.1: Remediation priority, Privacy / PII25

Passed: sensitive categories clean (1)

✓Section 9: No financial, health, or biometric identifiers100

6.12 Framework Scores and Remediation Plan

Roadmap Ready

The final step translates the audit findings into framework scores and a concrete remediation roadmap, so the team leaves with a clear path to a production-ready dataset, not just a list of issues.

These are pre-remediation baseline scores for a raw social dataset. These are the expected starting point before a compliance pass, not a measure of the data's usefulness. The India DPDPA score of 62% reflects that the hardest compliance requirements (no financial, health, or biometric data) are already met. The EU AI Act and NIST AI RMF scores track directly to the nationality mentions and the single phone number, both of which are well-understood and fixable in one pass.

The pipeline produced this full compliance picture, mapped, scored, and prioritized, automatically. A team running this without Vivly and Aquin would have reached the same point after weeks of manual review. The remediation plan itself is three steps, all scoped, none ambiguous.

Aquin · Dataset InspectorFramework Scores and Remediation Plan

Pre-Remediation Baseline Scores

raw social data · before compliance pass

EU AI Act25%

PII remediation scoped

India DPDPA62%

Partial, hard rules met

NIST AI RMF25%

PII remediation scoped

Three-step path to production

01Redact the one phone number and anonymize nationality mentions. All rows are mapped by the scan.

02Document the lawful basis for processing under DPDPA. Standard for any social data pipeline.

03Set up periodic re-scans as the dataset grows. Aquin handles this automatically.

7.Conclusion

The hard part of working with social data has never been finding it. It's making it usable. Public forums are noisy, legally ambiguous, and structurally inconsistent. Most teams either skip the work entirely and train on dirty data, or spend weeks on manual pipelines that still miss things. This project took a third path.

A single Vivly SDK call identified the right communities, pulled 1,500 high-signal discussions, and stripped the noise, leaving only the conversations that actually mattered to the topic. That output went directly into a Claude-assisted structuring step, then into Aquin's inspection pipeline. Start to a fully audited, compliance-mapped JSONL dataset: one afternoon.

What Vivly changes is the starting point. Instead of beginning with a raw scrape and spending time deciding what's relevant, you begin with signal. The dataset that came out of this project was already clean enough that twelve automated inspection layers found zero adversarial injection, zero poisoned samples, zero synthetic manipulation, and zero bias. The source layer did its job before any of that ran.

The compliance picture is the final proof. A dataset built from live Reddit and Hacker News discussions about a global privacy scandal passed nine of twelve inspection checks outright, with the remaining three mapping to a single well-scoped remediation: one phone number and a standard DPDPA processing agreement. That is not a problem. That is a pipeline working exactly as it should.

Structuring Social Data for AI

Table of Contents

1.Abstract

2.Background

3.Architecture

1. Data Acquisition and Structuring: Vivly

2. Dataset Validation and Compliance: Aquin

4.Data Preparation

4.1 Data Sources

4.2 Data Extraction

4.3 Result

5.Data Processing

5.1 Dataset Preparation for Aquin

6.Process Views

6.1 Prompt Injection Scan

6.2 Opt-Out and Consent Registry

6.3 Bias Surface and Fairness Analysis

6.4 Toxicity Analysis

6.5 System Prompt Leak & Role Confusion Check

6.6 Synthetic Content Detection

6.7 Poisoned Sample Detection

6.8 Copyright and License Risk Assessment

6.9 Privacy and PII Scan

6.10 Text Quality and Duplication Analysis

6.11 Compliance Audit Trail Flags

6.12 Framework Scores and Remediation Plan

7.Conclusion