Conducting Internet Research

Considerations for Participant Protections When Conducting Internet Research

If an activity falls under the category of human subjects research, it is regulated by the federal government and the Teachers College (TC) Institutional Review Board (IRB). TC IRB has provided a guide to help researchers determine whether their activities constitute human subjects research.

Internet research is the practice of using information from the Internet in research, especially freely available information on the World Wide Web or in Internet-based resources (e.g., discussion forums, social media). This guide covers considerations pertaining to participant protections when conducting Internet research, including:

Private versus public spaces for exempt research
Identifiable data in public datasets
Minimizing risk when using sensitive Internet data
Common Internet research approaches

The following information is from an NIH videocast. (Odwazny, L. (2014, May 8). Conducting Internet Research: Challenges and Strategies for IRBs [Video]. VideoCast NIH. https://videocast.nih.gov/summary.asp?Live=13932&bhcp=1)

Private Versus Public Spaces for Exempt Research

Federal regulations define a category of human subjects research that is exempt from IRB review as:

“Research that only includes interactions involving educational tests (cognitive, diagnostic, aptitude, achievement), survey procedures, interview procedures, or observation of public behavior (including visual or auditory recording).”

With regard to online information, if the data is publicly available (such as Census data or labor statistics), its use is usually not considered human subjects research. However, if the data includes identifiable information, meaning the data can be linked back to a specific individual, then the research may need to undergo IRB review. Additionally, research using de-identified data pulled from a private source, such as data provided by a company, may also be considered human subjects research.

Public behavior is any behavior that a subject would or could perform in public without special devices or interventions. Public behavior on the Internet, however, is more difficult to pinpoint. Federal regulations indicate that an environment may be private if a reasonable user would consider their interactions in that environment to be private. To help identify public behavior on the Internet, consider:

Is the behavior displayed on a public or private profile/website?

Typically, posts on a private or password-protected social media profile or site are not considered public behavior.
Even if a website is publicly available, the information on the website may be protected by other measures (e.g., community guidelines, terms of use, etc.).
Sites that require users to pay for access to their content (e.g., purchasing a dataset) are not always considered private, even if the information is behind a paywall.

What types of virtual communication are typically considered public?

Discussions and chats on public forums, news broadcasts, and free podcasts or videos are typically considered public communications.
Emails and person-to-person chat messages are often private, rather than public, communications.
However, institutions may dictate that any activity on their devices (e.g., a company laptop or phone) is subject to review. In these cases, the institutions can limit an individual’s privacy.

What community guidelines, terms of service, or website policies indicate whether the environment is a public or private space?

Some websites explicitly state that the interactions on their site are not to be used for research purposes.
Other sites may not explicitly refuse research activities, but they may require users to be respectful of others’ experiences. Depending on the website, “respect” may have a variety of meanings, including respect of user privacy.

What normative behaviors (behaviors not explicitly stated) within the Internet community might indicate that the environment is a public space?

Expectations of privacy may not always equate to the reality of privacy.
For example, individuals may share personal information on an open forum because there is an expectation within the community that other users will respect their privacy. However, the community guidelines may not explicitly state that their website is private.

What types of users are active on the website, and would a reasonable user expect their behaviors to be private?

Forums and websites directed towards youth may require extra precautions, as young users may be on the website with or without their guardian’s permission.
If a user shares media on a private profile, but then that media becomes publicly available through re-posts, the media should still be considered private. It is likely that a reasonable user would expect shares on private profiles to remain private.
A site may only be open to certain types of users based on demographics or life experiences (e.g., cancer survivors, support groups for addiction, etc.). In these cases, a reasonable user may expect greater privacy based on the types of users they expect to interact with.

TC IRB will determine whether an Internet environment is private or public based on the IRB protocol submission.

Identifiable Data in Public Datasets

Identifiable data is information or records about a research participant that allows others to identify that person. Names, social security numbers, and bank account numbers are considered personal identifiers and are among the identifiers protected under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). TC IRB has posted a blog, Understanding Identifiable Data, that further explains the different types of identifiers. Data that includes personal identifiers does not fall under the Exempt category.

Other types of participant information may include indirect identifiers, such as birthdate, age, ethnicity, gender, etc. Taken alone, these pieces of information are usually not enough to identify any single participant. However, researchers have shown that certain combinations of these identifiers may identify participants. For example, Sweeney (2000) demonstrated that 87% of the United States population could be uniquely identified based solely on their ZIP code, gender, and date of birth.
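To illustrate how combinations of indirect identifiers can single out individuals, the following is a minimal sketch that counts how many records in a dataset are unique on a chosen set of fields. The function name, field names, and toy records are hypothetical and chosen only for illustration; this is not a TC IRB requirement or a complete re-identification risk assessment.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the fraction of records whose combination of the given
    indirect identifiers appears exactly once in the dataset."""
    if not records:
        return 0.0
    # Build one tuple per record from the selected fields.
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy records (not real data).
records = [
    {"zip": "10027", "gender": "F", "dob": "1990-01-15"},
    {"zip": "10027", "gender": "F", "dob": "1990-01-15"},
    {"zip": "10027", "gender": "M", "dob": "1985-06-30"},
    {"zip": "10025", "gender": "F", "dob": "1990-01-15"},
]

print(unique_fraction(records, ["gender"]))                # 0.25
print(unique_fraction(records, ["zip", "gender", "dob"]))  # 0.5
```

In this toy example, gender alone isolates one record in four, while the three fields together isolate two in four. A high fraction on real data suggests that the fields, taken together, may act as an identifier even though no single field does.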

It is important to remember that while data may be publicly available, it may still contain identifiable information. In these cases, the IRB will decide the risk to participants on a case-by-case basis. With Internet information, consider these to be possible identifiers:

Usernames

Users may include their partial or full name in a username. When collecting usernames from a site, researchers should consider replacing usernames with pseudonyms.

IP Address

IP addresses identify devices or network connections and can often be traced to a household or an individual. Researchers should be wary of pairing IP addresses with other information.

Purchase Habits

With the surge in online shopping, individuals’ unique online purchase habits have been shown to be possible identifiers.

Digital Images, Audio, & Video

Photos, audio recordings, or videos of an individual are typically considered identifiable, unless the images or audio are captured in a way that protects the subject’s identity.

Avatars or Profile Pictures

Although avatars and profile pictures may not include real photos of the user, it is possible that they were chosen because of a resemblance to the user.

Keystroke Dynamics or Typing Biometrics

Detailed information about an individual’s timing and rhythm when typing on a keyboard can serve as a unique identifier. "Keystroke rhythm" measures when each key is pressed and released while a user is typing. These rhythm combinations can be as distinctive to an individual as a fingerprint or a signature.
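As one way to act on the username and IP address considerations above, the following is a minimal sketch that replaces usernames with consistent pseudonyms and coarsens IP addresses before analysis. The function names, the "user_" prefix, the example key, and the /24 and /48 prefix lengths are assumptions made for illustration, not TC IRB requirements. Coarsening reduces, but does not eliminate, the risk of identification.

```python
import hashlib
import hmac
import ipaddress


def pseudonymize_username(username: str, secret_key: bytes) -> str:
    """Map a username to a stable pseudonym using a keyed hash (HMAC-SHA256).
    The same username and key always yield the same pseudonym, so a user's
    posts can still be linked to each other without storing the username."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:12]


def truncate_ip(address: str) -> str:
    """Zero out the host portion of an IP address.
    Assumed prefix lengths: /24 for IPv4 and /48 for IPv6."""
    ip = ipaddress.ip_address(address)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)


# Hypothetical key; in practice it would be stored separately from the data.
key = b"example-key-kept-apart-from-the-dataset"

print(pseudonymize_username("jane.doe1990", key))
print(truncate_ip("192.0.2.123"))             # 192.0.2.0
print(truncate_ip("2001:db8:abcd:12::1"))     # 2001:db8:abcd::
```

A keyed hash is used rather than a plain hash so that someone without the key cannot confirm a guessed username by hashing it. Anyone holding the key can, so the key warrants the same protection as the original identifiers.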

Minimizing Risk When Using Sensitive Internet Data

In cases where sensitive Internet data must be used for research purposes, researchers should take precautions to ensure the safety and privacy of participants. The nature of online research increases risk to participants in some areas. Researchers should develop a plan to minimize risk in the following areas:

Reduced Participant Contact: when research is conducted over the Internet, researchers have limited or no direct contact with subjects. This makes it more difficult for researchers to gauge subjects' reactions to the study interventions.

Researchers should think through multiple possibilities for interventions, debriefing, and follow-up, if applicable.
Researcher and TC IRB contact information should be presented on the informed consent form before the study begins. This will ensure that participants know whom to contact if they have questions or concerns.

Breach of Confidentiality: when storing or collecting data on devices connected to the Internet, there is a heightened risk for identifiable participant data to be leaked.

TC IRB has published a Data Security Plan outlining best practices for securing and transmitting data. Researchers should implement these practices as they apply to their specific study.
In the case of a breach of confidentiality, researchers must file an adverse event report with TC IRB.

Common Internet Research Approaches

The Secretary’s Advisory Committee on Human Research Protections (SACHRP) has provided examples of common Internet research practices. These include elements of research conducted over the Internet. Below are possible examples of Internet research where human subjects may be involved:

Data Mining or Data Scraping: using Internet information that is readily available to the public. This type of information gathering typically does not involve direct interactions with participants.
- Existing datasets (secondary data analysis)
- Social media/blog posts
- Chat room interactions
Online Subject Recruitment: using the Internet as a space for recruiting or interacting with participants.
- Qualtrics
- Amazon Mechanical Turk
- REDCap
- Social media
- Email
Research on the Internet: directly studying the Internet and its effects.
- Patterns on social media or websites
- Evolution of privacy issues
- Spread of false information
Research on Internet Users: directly studying Internet users and their behaviors.
- Online shopping patterns and personalized digital marketing
- Online interventions such as “nudging”

Increased Internet use for research requires researchers and IRBs to become familiar with Internet research-related topics and concerns. Research submitted to the IRB will be reviewed on a case-by-case basis. The Institutional Review Board at Teachers College will make the final determination of whether a study requires review. Researchers should email IRB@tc.edu if they have any questions or concerns about their study design or about whether it requires IRB review.