Introduction: Why Businesses Need to Understand the Legal Framework of Web Scraping in 2026

Web scraping has evolved from an engineers' tool into a strategic data-management discipline. In 2026, its legality is shaped not only by technology but also by the nuances of national and international law: US court precedents, European GDPR enforcement practice, Russian regulation (152-FZ), and the position of Roskomnadzor. Consequently, the same action may be legal in one jurisdiction, conditionally permissible in another, and risky in a third. This guide will help you navigate the legal landscape confidently, build compliance "by design", minimize risks, and extract maximum value from open data—without conflicts with regulators or rights holders.

What you will gain: a systematic understanding of the legal categories of data; an up-to-date overview of precedents (including the long-running hiQ vs LinkedIn litigation), European regulatory practice, and Russian court decisions; clear frameworks for assessing legality; step-by-step instructions for setting up processes; checklists; tools; real-world cases; and common mistakes. We write in plain language but at professional depth, so you can implement best practices today.

Important Note: This material provides general legal information and analytical recommendations. It is not legal advice and does not create an attorney-client relationship. Please consult with a lawyer familiar with your industry and jurisdictions before making decisions.

Basics: What is Web Scraping and How the Law Views Data

Key Terms and Their Legal Implications

  • Web Scraping — automated extraction of data from publicly available HTML pages or APIs. Legally significant: method of access (public/restricted), presence of technical barriers, terms of use.
  • Open Data — data accessible without barriers for human reading. Important: "openness" does not negate copyright, related rights, database rights, and personal data requirements.
  • Personal Data (PD) — in the EU/EEA under GDPR, any information relating to an identified or identifiable natural person. In Russia under 152-FZ — any information relating, directly or indirectly, to an identified or identifiable natural person.
  • Publicly Accessible PD — in the EU: personal data published by the subject or by a third party; it remains PD, with the full set of legal requirements. In Russia: the 2021 amendments require separate consent for dissemination; the mere act of publication does not imply free use.
  • Terms of Service (ToS) — contractual provisions of a website or API. Violating them can have civil consequences and may be associated with unauthorized access norms in some jurisdictions if technical measures are circumvented.
  • robots.txt — a file with recommendations for web robots. Technical rules for indexing and bypassing. In most jurisdictions, it does not have the force of law on its own, but ignoring it can increase risks (indicating bad faith).
  • API vs HTML — access via an API is usually licensed and formalized, while HTML scraping relies on the public nature of the interface. Legally, an API is preferable, though its contractual restrictions are typically stricter.
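
To make the robots.txt point concrete: Python's standard library can parse the file and answer "may my bot fetch this path?" with no third-party dependencies. The rules and bot name below are illustrative, not taken from any real site:

```python
from urllib import robotparser

# Illustrative robots.txt content (not from a real site)
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

agent = "ExampleBot/1.0"
print(rp.can_fetch(agent, "/public/catalog"))   # True: not disallowed
print(rp.can_fetch(agent, "/private/reports"))  # False: matches Disallow
print(rp.crawl_delay(agent))                    # 5: respect it between requests
```

In production you would call `rp.set_url(".../robots.txt")` and `rp.read()` to fetch the live file; remember that, as noted above, honoring it is a good-faith signal rather than a legal requirement in most jurisdictions.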

Key Legal Axes for Assessment

  • Jurisdiction: where you, servers, users, and data subjects are located.
  • Type of Data: personal/non-personal; trade secrets; copyright and related rights; database rights (in the EU).
  • Method of Access: public webpage without registration vs behind a login, bypassing CAPTCHAs and paywalls, session usage.
  • Purpose of Processing: journalism, research, compatibility, competition, commercial analytics, security.
  • Volume and Frequency: "reasonable" extraction of individual elements vs systematic copying of a substantial part of the database.

Deep Dive: Global Legal Frameworks and Trends

USA: The hiQ vs LinkedIn Case and Related Positions

The hiQ vs LinkedIn case has set the tone for discussions around scraping public profiles for years. As of late 2024, courts have affirmed that access to publicly available pages without circumventing authentication does not equate to "unauthorized access" under the US Computer Fraud and Abuse Act (CFAA), especially following the guiding impact of the Van Buren case. However, there are other legal levers: contractual claims related to ToS, database and content protection, unfair competition, trespass to chattels, and other theories. Several high-profile disputes have ended in settlements and/or clarifications of platform practices. In 2025-2026, businesses should monitor any new developments in similar cases in federal courts, but for now, the fundamental line for publicly accessible pages remains: CFAA is applied cautiously, without broadening to "simply reading" what is available to the public.

Practical takeaway: in the USA, scraping public pages without circumventing authentication does not equal a criminal computer attack. However, violating ToS and ignoring official protocols (including robots.txt) may increase civil liability risks and lead to litigation, especially with large-scale copying or commercial parasitism.

EU/EEA: GDPR, ePrivacy, and Database Rights

  • GDPR: any personal data from public sources remains PD. A legal basis is required (most often "legitimate interest"), along with notification under Article 14 (or a documented exemption), minimization, retention periods, security, and mechanisms for data subjects' rights. Regulators (e.g., CNIL, the Irish DPC, and others) have repeatedly emphasized that "public" does not mean "uncontrolled". Non-compliance with these principles can lead to significant fines, as investigations into mass scraping incidents that produced unauthorized aggregation of profiles have shown.
  • Regulatory Decisions: European supervisory authorities have levied significant fines for inadequate protection against scraping (as a result of insufficient "privacy by design" measures by operators publishing data), and for illegal subsequent processing by scrapers. Practices by services forming biometric and behavioral profiles based on public images and pages demonstrate a strict approach to opaque processing and absence of legal basis.
  • Sui generis Database Rights (Directive 96/9/EC): prohibit extraction or re-use of a substantial part of a database, as well as repeated and systematic extraction of insubstantial parts that conflicts with its normal exploitation. Key CJEU cases (such as Innoweb v. Wegener) make clear that metasearch engines and database clones that appropriate the economic value of the source infringe. This is critical for projects built on "mirroring" someone else's database.

Russia: 152-FZ and Roskomnadzor's Position

In Russia, any information about an identifiable individual is personal data. The 2021 amendments tightened the regime for "publicly accessible PD": separate consent is required for dissemination, and the subject may set conditions on access. An aggregator collecting such data becomes a PD operator with all the attendant obligations: purposes, legal bases, notification of Roskomnadzor (in specified cases), localization (242-FZ), subjects' rights, and security.

Judicial practice and regulatory oversight in Russia consistently proceed from the premise that posting information on the Internet does not grant a "free license" to reuse it. Unlawful parsing of personal data and its publication in aggregators lead to privacy lawsuits, Roskomnadzor orders, and administrative fines. For non-personal data, the key issues remain copyright, trade secrets, and unfair competition. Circumventing technical restrictions and breaking protection measures fall under the criminal provisions on unauthorized access to computer information.

Robots.txt, ToS, API: How the Law Views Technical and Contractual Signals

  • robots.txt: legally, it is usually treated as a technical policy rather than a binding legal prohibition. It is nevertheless important as evidence: ignoring it can be read as intent to circumvent explicit rules, and combined with ToS violations and CAPTCHAs it increases the likelihood of losing a dispute.
  • ToS: in the EU, a ToS breach is a contractual matter; in the USA, it carries the risk of civil lawsuits (contract, tort). In Russia, ToS typically operate as a public offer or contract of adhesion. The key questions: did you accept the ToS, how was that acceptance recorded, and is there a defensible justification for your use?
  • API: licensing agreements and rate limits create clear legal frameworks. Pros: predictability and data quality. Cons: limits on volume and purpose. Attempts to bypass API limits via HTML scraping or proxies usually increase risks.

2026 Trends

  • Shifting Focus on Platform's Duty of Care: Regulators are raising expectations for website owners to prevent unauthorized scraping of personal data and inform users about risks.
  • Localization and Data Sovereignty: more requirements to store copies of PD locally and limit cross-border transfers.
  • Transparency in the Data Supply Chain: from source to consumer — a requirement for verifiable legal bases and contracts.
  • Ethics and Trust: Companies compete not just on the volume of data but also on the "ethics" of their sourcing and processing.

Practice 1: Legal Assessment Framework for Scraping from A to Z

Step 1. Data and Purpose Mapping

  1. Describe the goals of scraping: price analytics, market research, scientific purposes, quality control, risk monitoring.
  2. Classify the type of data: personal, metadata, ordinary business data (prices, SKUs, schedules), sensitive elements (biometrics, financial identifiers).
  3. Assess accessibility: public page, is registration required, is there a CAPTCHA, paywall, tokens.
  4. Identify jurisdictions: where you are, where the server is, where the data subjects are, where the data is transferred.

Step 2. Choosing the Legal Basis (GDPR) and Legal Regime (Russia)

  • EU/EEA (GDPR): most often — "legitimate interest" (Article 6(1)(f)). A Legitimate Interest Assessment (LIA) needs to be conducted: describe the interest, necessity of processing, assess the balance against the rights of subjects, implement protective measures (minimization, pseudonymization, purpose limitation).
  • Russia (152-FZ): determine if you are processing personal data. If so, a legal basis is needed: consent, law, contract, other specified grounds. For "publicly accessible PD", check for separate consent for distribution and access conditions. Consider localization (242-FZ) and notify Roskomnadzor if necessary.

Step 3. Transparency and Notification

  • GDPR Article 14: if PD is not collected directly from the subject (as in scraping), notification is required. Exceptions may apply where providing the information is impossible or would require disproportionate effort; in that case, publish a general public notice about your processing, make it easy for subjects to exercise their rights, and document the proportionality assessment.
  • Russia: inform subjects per your PD position; provide mechanisms for appeals and removal. For data distributed with restrictions, comply with the regime set by the subject.

Step 4. Contractual Cleanliness

  • Analyze the ToS of the source: is there a prohibition on automated collection, restrictions on commercial use, licensing terms?
  • Check API opportunities: if the API is available and meets the needs, it is usually preferable.
  • Assess database rights (EU): is there a risk of extracting substantial parts or systematically restoring content?

Step 5. DPIA and Protective Measures

  • If the risk is high (large PD, profiling, vulnerable groups) — conduct a DPIA: threats, measures, residual risk, mitigation plan.
  • Implement minimization: collect only necessary fields, store as little as possible, remove on a schedule.
  • Monitor cross-border transfers: EU — standard contractual clauses and country assessment.
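
Minimization and scheduled deletion from Step 5 are easiest to enforce at ingestion time. A sketch: the allow-list of fields and the 90-day retention below are illustrative choices, not requirements from any regulation:

```python
from datetime import datetime, timedelta, timezone

ALLOWED_FIELDS = {"title", "price", "location", "posted_at"}  # illustrative schema
RETENTION = timedelta(days=90)                                # illustrative period

def minimize(record: dict) -> dict:
    """Keep only allow-listed fields and stamp an explicit expiry,
    so the deletion job never has to guess what to remove or when."""
    kept = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    kept["_delete_after"] = (datetime.now(timezone.utc) + RETENTION).isoformat()
    return kept

raw = {"title": "Senior Analyst", "location": "Berlin",
       "recruiter_email": "jane@example.com"}  # PD we did not need
clean = minimize(raw)
print(sorted(clean))  # ['_delete_after', 'location', 'title']
```

Dropping unneeded PD before it ever reaches storage is far easier to defend than cleaning it up afterwards.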

Step 6. Registers and Operational Procedures

  • RoPA (record of processing activities): purposes, data categories, recipients, retention periods, security measures.
  • DSR procedures (data subject requests): access, deletion, objection to processing.
  • Incident management: breach notification policy, internal communication, response plan.

Conclusion: Decision-Making Matrix

Summarize everything into a "risk map": type of data × method of access × jurisdictions × purpose. Green zone — public non-PD, API, explicit license. Yellow zone — public PD with LIA, notification, minimization. Red zone — circumventing barriers, systematic copying of a database, special category PD.
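
For internal tooling, the matrix can be encoded as a small triage helper. The inputs and thresholds below are a simplified sketch of the zones above, not a legal determination:

```python
def risk_zone(is_personal: bool,
              behind_barrier: bool,
              substantial_db_copy: bool,
              special_category: bool) -> str:
    """Toy triage mirroring the green/yellow/red matrix; illustrative only."""
    if behind_barrier or substantial_db_copy or special_category:
        return "red"     # circumvention, DB replication, or special-category PD
    if is_personal:
        return "yellow"  # public PD: needs LIA, notification, minimization
    return "green"       # public non-PD via API or explicit license

# Public prices via API, no PD:
print(risk_zone(False, False, False, False))  # green
# Public profiles with names/emails:
print(risk_zone(True, False, False, False))   # yellow
# Anything behind a paywall or login:
print(risk_zone(True, True, False, False))    # red
```

A yellow or red result should trigger the LIA/DPIA steps described earlier, not a hard-coded go/no-go decision.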

Practice 2: Technical Design and Ethics of Scraping

Principles of “Privacy & Compliance by Design”

  • Respect for the Source: adhere to robots.txt as a basic policy; if something is prohibited — assess legal grounds and auxiliary measures or seek alternative sources.
  • Rate Limiting and Load: set request limits, use caching and "sleep" intervals; check peak hours to avoid disrupting resource operations.
  • Identify Yourself: a clear User-Agent, contact email for complaints; this reduces the risk of escalations.
  • Data Quality: verify validity, store checksums and scrape dates; document the source for audit.
  • Minimization: do not collect sensitive fields without absolute necessity; apply pseudonymization.
  • Security: encryption in storage and transit, access control, logging, end-to-end identifiers for tracing.
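
The rate-limiting and self-identification principles above can be sketched as a minimal per-domain throttle. The interval, bot name, and contact address are illustrative placeholders:

```python
import time

class PoliteThrottle:
    """Enforce a minimum interval between requests to the same domain.
    Values below are illustrative; tune per source and per its ToS."""

    def __init__(self, min_interval_s: float = 10.0):
        self.min_interval_s = min_interval_s
        self._last_hit: dict[str, float] = {}
        # A clear identity makes escalation less likely (hypothetical contact):
        self.headers = {"User-Agent": "ExampleBot/1.0 (+mailto:ops@example.com)"}

    def wait_turn(self, domain: str) -> float:
        """Block until the domain's quiet interval has passed; return the wait."""
        now = time.monotonic()
        last = self._last_hit.get(domain)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_interval_s - (now - last))
            if waited:
                time.sleep(waited)
        self._last_hit[domain] = time.monotonic()
        return waited

throttle = PoliteThrottle(min_interval_s=0.1)
throttle.wait_turn("example.com")         # first hit: no wait
print(throttle.wait_turn("example.com"))  # second hit: waits ~0.1 s
```

Call `wait_turn(domain)` before every request and send `throttle.headers`; the per-domain bookkeeping keeps one slow source from throttling all the others.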

Step-by-Step Implementation

  1. Scanning: audit robots.txt and ToS, map URLs and data patterns, evaluate CAPTCHAs and page dynamics.
  2. Request Planning: rate limit, time windows, retries with exponential backoff, cache at results level.
  3. Extraction: parsing with a clear schema, skipping fields not in the target.
  4. Cleaning: filtering, normalization, removing explicit personal fields without legal basis.
  5. Storage: segmentation by sources, data lifespan, deletion policies.
  6. Control: monitoring errors, 4xx/5xx, feedback with the source in case of failures.
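
Step 2's "retries with exponential backoff" from the list above can look like this sketch, where `fetch` is any callable returning (status, body); the retryable status codes and delay constants are illustrative choices:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0):
    """Retry throttling/availability errors with exponential backoff + jitter."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status < 400:
            return body
        if status in (429, 503):  # the server asked us to slow down
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)
            continue
        raise RuntimeError(f"non-retryable status {status} for {url}")
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")

# Simulated source that throttles twice, then answers:
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
print(fetch_with_backoff(lambda url: next(responses), "https://example.com",
                         base_delay=0.01))
```

Backing off on 429/503 rather than hammering the source is both a reliability practice and evidence of good faith if a dispute ever arises.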

Ethical Standards

  • Do not create load that disrupts the normal operation of the site.
  • Do not bypass technical access barriers and do not mimic real user behavior without permission.
  • Respect requests for exclusion and data removal.
  • Consider the interests of data subjects, even if there is a formal legal basis.

Practice 3: Contractual Strategy: ToS, Licenses, API

The “Negotiate or Limit” Model

  • First Choice — API: if it meets business goals, arrange access. Pros: predictability, SLA, legal certainty. Cons: limits and fees.
  • Content License: for systematic data usage from a third-party site, consider a licensing agreement. It is cheaper than litigation if data is critical.
  • ToS-aware Scraping: if ToS prohibit bots — check for the possibility of written permission, small-volume programs, partnership.

Verifying Database and Content Rights

  • EU: evaluate if you extract a "substantial part" of the database or reproduce its economic value. Regular requests replicating the database are risky.
  • Copyright: texts, images, page structures; citation and fair use are limited.

Pre-Contractual Analysis Framework

  1. Business value of data and alternatives.
  2. Volume and frequency of access.
  3. Data regime (PD/non-PD), jurisdictions, cross-border transfers.
  4. Licensing models and compliance costs vs litigation risks.

Practice 4: Infrastructure and Proxies: How to Be Legal and Transparent

Legal Guidelines for Proxy Use

  • Purpose: proxies are permissible for traffic balancing, geo-testing, reliability, and privacy of infrastructure — but not for bypassing access prohibitions or masking ToS violations.
  • Legality and Consent: use only providers that legally obtain resources and consent from outgoing IP owners (especially for mobile proxies). Exclude unauthorized botnets and gray networks.
  • Transparency: document sources of IP, geography, whether you obtained permission for specific jurisdictions, and how complaints are handled.

Operational Model Without Bypassing Prohibitions

  1. Proxy Policy: document prohibiting the use of proxies for circumventing CAPTCHAs, paywalls, authentication, and rate limits set by the site owner.
  2. Segmentation: separate proxy pools for testing, production, and feedback to investigate incidents.
  3. Ethical Limits: at the code and proxy gateway level, set maximum request frequency below that of the average user and observe "quiet" windows.
  4. Logs: maintain logs (hashed identifiers) to respond to claims and exclude abuses.
  5. Source Registry: for each provider — contract, jurisdiction, contact, SLA for abuse notifications.

Mobile Proxies: When Appropriate and How Safe

  • Use Cases: geographic testing of mobile interfaces, availability checks, measuring speed and quality.
  • Compliance Control: audit the provider for legal sources of IP; written assurances of consent from end-users; complaint response processes.
  • Technical Measures: whitelists of domains (where requests are allowed), speed limits, prohibition on sending personal identifiers through proxies without encryption.
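
The domain-whitelist measure above can be enforced in the client (or at the proxy gateway) before any request is routed. The domains below are placeholders:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "shop.example.org"}  # illustrative whitelist

def proxy_allowed(url: str, allowed: set = ALLOWED_DOMAINS) -> bool:
    """Refuse to route a URL through the proxy pool unless its host
    matches a whitelisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)

print(proxy_allowed("https://example.com/catalog"))       # True
print(proxy_allowed("https://api.example.com/v1/items"))  # True (subdomain)
print(proxy_allowed("https://evil-example.com/"))         # False
```

Matching on the parsed hostname with an explicit dot prefix avoids the classic mistake of `"evil-example.com"` passing a naive `endswith("example.com")` check.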

The bottom line: proxies are a network-engineering tool, not a means of bypassing prohibitions. Any scenario aimed at "circumventing blocks and detection" increases legal risk and contravenes ethics.

Practice 5: Documenting Processes: Make Compliance Verifiable

Artifacts for Auditors and Regulators

  • Data Map: sources, data categories, fields, jurisdictions, purposes.
  • RoPA: record of processing for each purpose; updated with changes.
  • LIA: justification of legitimate interest (EU), balance with the rights of subjects, mitigation measures.
  • DPIA: for high-risk scenarios (mass profiling, sensitive data).
  • Policies: scraping policy, proxy policy, storage and deletion policy, incident response policy.
  • Notification Templates: transparency page (Art. 14), responses to DSR, processes for withdrawal of consents (Russia: conditions for PD dissemination).

Step-by-Step Operationalization

  1. Appoint a process owner (Data Steward) and connect Legal × Engineering × Security.
  2. Describe the end-to-end pipeline: collection, processing, storage, access, deletion.
  3. Assign KPIs: response time to DSR, proportion of minimized fields, average data lifespan, success of audits.
  4. Conduct tabletop exercises: scenarios for data subject complaints, regulator requests, rights holder claims.
  5. Implement regular reviews of ToS and robots.txt for key sources.

Templates to Have

  • LIA Template (brief form: purpose, necessity, balance, measures, conclusion).
  • DPIA Template (risk register, likelihood, impact, countermeasures).
  • DSR response template (including requester identification, timelines, exceptions).
  • Template request for scraping permission to the site owner (describing scope, purposes, frequency, contacts).

Practice 6: Content and IP: How Not to Cross the Line

Copyright

  • What is Protected: texts, photographs, design, code; facts as such are not, but their selection and arrangement may be protected.
  • Fair Use: limited, depends on jurisdiction; do not count on it as your main strategy.

Database Rights (EU)

  • Avoid substantial extraction and systematic copying of insubstantial parts that restores economic value.
  • Technical Measures: selective sampling, aggregation without reconstruction of the source, references to the original source for verification.

Trade Secrets and Unfair Competition

  • Do not extract data from closed sections; do not use others' secrets obtained by bypassing barriers.
  • Do not create an illusion of partnership or affiliation with the source if it does not exist.

Practice 7: API vs HTML: How to Choose and Combine

When API is Better

  • There are stable needs and SLA-critical processes.
  • Legal and technical support is required.
  • It is important to comply with limits and licenses, as well as to receive updates on schemas.

When HTML is Appropriate

  • Data is simple, non-personal, there is no API, and public access is obvious.
  • A quick snapshot of the market is needed.

Hybrid Model

  • Main flow through API; HTML as a backup for validation and filling gaps, with strict limits and ethical rules.

Common Mistakes: What NOT to Do

  • Ignore ToS and robots.txt "because technically possible".
  • Collect everything indiscriminately: violation of the principle of minimization.
  • Store indefinitely: lack of deletion and updating timelines.
  • Transfer data across borders without legal mechanisms.
  • Lack of notifications and transparency under Art. 14 (EU) or 152-FZ requirements.
  • Use dubious proxies, linked to botnets and consent violations.
  • Bypass CAPTCHAs and authentication: high legal and reputational risk.

Tools and Resources: What to Use

Legal and Compliance Tools

  • Generators and templates for LIA/DPIA and records of processing.
  • Platforms for DSR management and auditing.
  • Systems for data lineage and data catalogs for source transparency.

Technical Tools

  • Parsing frameworks with support for rate limiting, retries, and caching.
  • Tools for anonymization and pseudonymization.
  • SIEM/logging, access control, encryption at the database and transport level.

Operational Practices

  • Periodic ToS reviews and robots.txt for key domains.
  • Internal checklists before launching a new source.
  • Team training on scraping ethics and "minimization" principles.

Case Studies and Results: Business Practice

Case 1: Price Monitoring without PD

Company X sells electronics. Objective — daily monitoring of competitor prices. Data: product names, SKUs, prices, availability. Actions: ToS analysis (no prohibition on indexing; there are prohibitions on bulk content copying). Technically: aggressive caching, no-login access, rate limiting 0.1 RPS per domain, nighttime windows. Legal: non-PD; analysis of database rights (EU) — only sample items; no reconstruction of the database. Result: stable feed without complaints, reduction in purchase costs by 3.7%, no incidents over 12 months.

Case 2: Job Aggregator (EU)

Company Y collects job postings from employer websites. Data: titles, descriptions, locations, sometimes contact emails of recruiters (PD). Legal: LIA, Article 14 notification through a public page, and opt-out mechanism for contact addresses, removal of addresses upon first request, minimization (storing emails in hashed form until an employer's request). Contractual work: licensing offers to large sites where ToS prohibit bots. Result: 10 partnership agreements, compliance maintenance, no fines; market coverage increase of 18%.
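
Case 2's "storing emails in hashed form" can be implemented with a keyed hash, so the same address always yields the same token (for deduplication) while raw addresses never sit in the dataset. The key below is a placeholder; keep the real one in a secrets manager and rotate it:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-me-and-rotate"  # placeholder; never hard-code in production

def pseudonymize_email(email: str) -> str:
    """HMAC-SHA256 over a normalized address: stable for deduplication,
    infeasible to reverse without the key."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

a = pseudonymize_email("Jane.Doe@Example.com")
b = pseudonymize_email("jane.doe@example.com ")
print(a == b)  # True: normalization makes the token stable
```

A keyed HMAC is preferable to a plain SHA-256 here: without the key, an attacker cannot confirm a guessed address by hashing it themselves. Note that under GDPR such tokens are pseudonymized, not anonymized, and remain PD.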

Case 3: Russian Marketing Analyst

Company Z analyzes open freelancer profiles on freelance platforms. Data: nicknames, portfolios, bids, reviews; potentially PD. Under Russian law: the company qualifies as a PD operator, so it notified Roskomnadzor of its activities, localized copies in Russia, adopted a processing policy, excludes profiles from indexing on request, collects only public fields, and excludes phone numbers and emails (absent explicit consent to dissemination). Result: a legally clean product, no regulator orders, and goodwill from the platforms (data-feed exchanges).

FAQ: 10 Key Questions

1. Is it legal to scrape pages without a login?

If the page is public and there is no circumvention of technical barriers, in many jurisdictions this is not construed as unauthorized access. However, risks remain: violating ToS, database (EU), PD (GDPR/152-FZ). Check the legal basis, minimization, notification, and adhere to robots.txt.

2. How does the law view robots.txt?

It is a technical recommendation, not a law. However, ignoring it may strengthen evidence of bad faith and ToS violations. In compliance practice, robots.txt should be respected by default.

3. Is a legal basis needed under GDPR if the data is public?

Yes. Publicness does not negate GDPR requirements. Most often, legitimate interest with LIA suffices. Minimization, transparency (Article 14), retention periods, and mechanisms for subjects' rights are mandatory.

4. What has changed regarding the hiQ vs LinkedIn case by 2026?

As of late 2024, the basic line is: scraping public pages without bypassing authentication is not a CFAA crime in itself. During 2025-2026, watch for new decisions in similar disputes. Do not rely on CFAA as an "indulgence": ToS, copyright, database rights, and other norms remain.

5. Can contact emails be scraped?

Risks are heightened as this is PD. For the EU — LIA and Article 14 notification or exemption, strict minimization and purpose. For Russia — grounds per 152-FZ and respect for dissemination conditions. In some cases, it is better to omit emails from initial collection.

6. What about mobile proxies?

Use only legitimate sources, not for bypassing prohibitions. Outline a policy, limit speed, maintain logs, and respond to complaints. Circumventing CAPTCHAs/authentications via proxies increases the risk of violations.

7. What are the consequences of violating ToS?

Civil lawsuits for ToS violations, blocking, potential claims for unfair competition and IP infringement. In certain scenarios, a combination of actions may be treated as unauthorized access.

8. Do I need to notify Roskomnadzor?

Depends on the nature of PD processing and the grounds. If you are a PD operator, check notification, localization, and policy requirements. If in doubt — conduct an audit with a specialist.

9. How to comply with Article 14 if there are many subjects?

Assess "disproportionate efforts": if applicable, use public notification, clear opt-out channels, and minimize the volume of PD. Document the assessment.

10. How to avoid database claims in the EU?

Do not extract substantial parts and do not restore economic value. Work with sampling, aggregation, references to the original source, and, where possible, licensing.

Liability: Fines, Lawsuits, Reputation

EU/EEA

  • GDPR: fines of up to €20 million or 4% of global annual turnover, whichever is higher; individual mass-scraping cases have led to hefty fines both for operators that failed to protect PD from unauthorized extraction and for scrapers whose subsequent processing was unlawful.
  • Database Rights: judicial bans, compensation for damages, forfeiture of profits.

USA

  • Civil lawsuits for violating ToS, copyright, unfair competition, trespass to chattels; judicial bans and compensation.

Russia

  • 152-FZ and the Administrative Code: administrative fines for PD-processing violations, remediation orders, restrictions on the operation of websites/aggregators.
  • Criminal Code: for unauthorized access to computer information when circumventing protections.
  • Civil Lawsuits: protection of honor, dignity, privacy, and IP; damages and compensation.

Reputation

Even lawful scraping can provoke negativity if transparency is lacking. Proactive communication, ethics, and clear exclusion mechanisms reduce risks.

Checklists and Ready Frameworks

Pre-Scrape Checklist

  • Goal and minimum required fields defined.
  • ToS, robots.txt, and API presence checked.
  • PD/non-PD classified, jurisdictions identified.
  • LIA/DPIA prepared if necessary.
  • Retention and deletion timelines established.
  • Rate limits and caching configured.
  • DSR and opt-out mechanisms described.

“4 Quadrants” Framework

  • Data: PD vs non-PD.
  • Access: public vs limited.
  • Law: EU/USA/Russia/other.
  • Purpose: legitimate interest/research/journalism/marketing.

Post-Scrape Checklist

  • Quality check, removal of unnecessary fields.
  • Sources and dates documented.
  • Registers (RoPA), LIA/DPIA updated.
  • Cross-border transfers checked.
  • Transparency page and FAQ updated.

What to Monitor in 2025-2026

  • New rulings in disputes analogous to hiQ vs LinkedIn and the courts' approach to combined claims (ToS + IP + unfair competition).
  • Decisions by European regulators (CNIL, DPC, etc.) regarding mass scraping of PD, including requirements for "privacy by design" for platforms.
  • Russian practices regarding "publicly accessible PD", localization, and Roskomnadzor mandates; development of administrative fines.
  • Updates in ePrivacy and possible clarifications from the EDPB regarding monitoring of public sources.

Conclusion: A Sustainable Scraping Strategy

Legal web scraping is not a set of tricks, but a systematic discipline at the intersection of law, engineering, and ethics. The right questions to ask are: why do we need this data, can we manage with less, what will we tell the data subject and website owner, how will we demonstrate our good faith after a year. In 2026, those who establish processes "legally by design" will win: respect robots.txt and ToS, choose APIs when possible, document legal bases, minimize collection, protect data, and engage transparently with sources and subjects. This approach reduces risks, speeds up approvals, and builds trust — a resource that is hard to replicate and impossible to scrape.

Your next steps: audit your current sources against checklists; update LIA/DPIA; implement a proxy and scraping ethics policy; create a transparency page and DSR processes; train your team and appoint owners; regularly review ToS for key sources and monitor regulator practices. Sustainable compliance is a competitive advantage. Use it.