AI & Information Literacy considerations:
Model / Tool | Type | Primary Data Sources | Coverage by Source Type (approx.) | Special Features |
GPT‑4 (OpenAI) | LLM | Web (Common Crawl, RefinedWeb), textbooks, code, Wikipedia, licensed data | Web: ~70–80%; Books: ~5–10%; Academic: ~5–10%; Code: ~5–10% | General-purpose: research Q&A, writing, coding |
GPT‑3.5 / ChatGPT | LLM | WebText2, Books1/2, Wikipedia, Common Crawl | Web: ~82%; Books: ~16%; Wiki: ~3% | Fast general use; limited academic depth |
Gemini (Google) | LLM (multimodal) | Web, books, textbooks, code, multimedia, real-time web | Web: ~70–80%; Books: ~5–10%; Code: ~10%; Multimedia: ~5–10% | Multimodal (text, images, audio, video) |
Copilot (GitHub) | Code assistant | GitHub repos, Stack Overflow, public codebases | Code: ~80%; Docs/QA: ~15%; NLP: ~5% | Optimized for coding and developer needs |
LLaMA / Falcon / BLOOM | Open-source LLM | Common Crawl, books, Wikipedia, arXiv, PubMed, The Pile | Web: ~70%; Books: ~10%; Academic: ~5%; Code: ~5% | Open-access, customizable |
BERT | Encoder model | English Wikipedia (~2.5B words), BookCorpus (~0.8B words) | Wiki: ~76%; Books: ~24% | Foundation for text classification & summarization |
T5 | Encoder-decoder | C4 dataset (filtered web) | Web: ~90%; Books: <10% | Text-to-text tasks (translate, summarize, etc.) |
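SciBERT / PubMedBERT | Domain LLM | Semantic Scholar full-text papers (SciBERT); PubMed abstracts and articles (PubMedBERT) | SciBERT: biomedical ~82%, computer science ~18%; PubMedBERT: biomedical ~100% | Strong in medicine & life sciences |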
BloombergGPT | Domain LLM | Financial docs, filings, news, web data | Finance: ~50%; Web: ~42%; News: ~7% | Specialized for financial research |
Elicit | Research AI tool | Semantic Scholar (126M+ papers) | Academic: ~100% | Finds papers, extracts claims, Q&A |
Research Rabbit | Discovery tool | Semantic Scholar, CrossRef metadata | Academic: ~100% | Builds citation & co-author networks |
Jenni AI | Summarizer | Preprints, PDFs, research summaries | Academic: ~70%; Preprints/blogs: ~30% | Draft generation & summarization |
Paperpal | Writing assistant | Publisher data, user uploads | Academic: ~100% | Proofreading, grammar, style suggestions |
Scite | Citation analytics | Scholarly papers (citation contexts) | Academic: ~100% | Shows supportive/contrasting citations |
SciSpace | Research assistant | Academic PDFs (STEM-heavy), user uploads | STEM academic: ~100% | PDF annotation, term definitions |
ResearchGPT (Consensus / SciSpace) | Research GPT assistant | ~200–282M academic papers via Consensus or SciSpace | Academic: ~100% across disciplines | Chat-based Q&A, citation-backed summaries, PDF upload, image support. Sources: dailyai.com, yeschat.ai, gptr.dev, digital-science.com, morganstanley.com, consensus.app |
Developed and modified in ChatGPT using the following prompt: "Create a professional table showing the major Large language models and generative AI Tools, especially for students and researchers, along with what datasets they pull and the percentage of material from the various disciplines included in the databases."
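Because the percentages in the table above are approximate and AI-generated, it can help to check that each row's shares add up to something plausible before reusing them. The following is a minimal Python sketch, not part of the original guide: the function names `parse_coverage` and `total_range` are hypothetical, and the sketch assumes each share is written as `Label: ~N%`, `Label: ~N–M%`, or `Label: <N%`. It only adds up the stated shares; it says nothing about whether the figures themselves are accurate.

```python
import re

# One share looks like "Web: ~82%", "Web: ~70–80%", or "Books: <10%".
# Groups: label, optional prefix (~ or <), first number, optional second number.
SHARE = re.compile(r"([A-Za-z/ ]+):\s*([~<]?)(\d+)(?:[–-](\d+))?%")


def parse_coverage(cell):
    """Return {label: (low, high)} for one coverage cell of the table."""
    shares = {}
    for label, prefix, first, second in SHARE.findall(cell):
        value = int(first)
        if prefix == "<":
            # "<10%" is read as "somewhere between 0 and 10 percent".
            low, high = 0, value
        elif second:
            # "~70–80%" is read as a range.
            low, high = value, int(second)
        else:
            # "~82%" is read as a single point estimate.
            low, high = value, value
        shares[label.strip()] = (low, high)
    return shares


def total_range(shares):
    """Sum the lower and upper bounds across all labels."""
    lows = sum(low for low, _ in shares.values())
    highs = sum(high for _, high in shares.values())
    return lows, highs


if __name__ == "__main__":
    gpt35 = parse_coverage("Web: ~82%Books: ~16%Wiki: ~3%")
    print(total_range(gpt35))  # (101, 101): rounded shares slightly exceed 100

    gpt4 = parse_coverage("Web: ~70–80%Books: ~5–10%Academic: ~5–10%Code: ~5–10%")
    print(total_range(gpt4))   # (85, 110): 100 falls inside the stated ranges
```

A row whose range does not include 100, or misses it by more than rounding error, is a signal to trace the figures back to a primary source.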
Provider | Index Size | Key Sources | OA Full‑Text Access | API Available | Best For |
Allen Institute for AI / AI2 (Semantic Scholar) | ~280 M papers | arXiv, bioRxiv, SSRN, HAL, CrossRef, publisher feeds, U.S. gov sites | Partial (OA + some partnerships) | Yes (academic) | Scholarly search & AI‑driven insights |
OurResearch (OpenAlex) | ~250 M works | CrossRef metadata, DOAJ, institutional repos, Unpaywall, Microsoft Academic Graph (legacy), ROR | Extensive OA | Yes (open API) | Metadata research, alt‑metrics, open data |
Google Scholar | ~400 M+ items | Publisher sites, repositories, personal pages, preprints, theses | Mixed (OA + paywalled links) | No official API | Broad scholarly search |
Digital Science (Dimensions) | ~350 M publications, plus grants, patents, and clinical trials | CrossRef, PubMed, arXiv, publisher content, clinical trials, patents | Mixed (OA + proprietary) | Yes (limited) | Citation analysis, grants, patents, altmetrics |
CORE (Jisc) | ~300 M documents | Institutional repositories, open-access journals via OAI-PMH | Fully Open Access | Yes (open API) | Full-text discovery across OA repositories |
Cambia / Lens Foundation | ~200 M works | CrossRef, PubMed, Patents, Preprints, institutional repos | Mixed (OA + paywalled links) | Yes (API & data dumps) | Scholarly + patent search, licensing analysis |
Developed and modified in ChatGPT using the prompt: "create a professional table showing the major datasets used in AI Tools and Web-based research and where their data/ material comes from."
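Several of the indexes above list an open API. As an illustration only, the Python sketch below builds (but does not send) a search URL for OpenAlex, the open scholarly index maintained by OurResearch. The function name `build_openalex_url` and the example email address are hypothetical. The endpoint and the `search`, `filter`, `per-page`, and `mailto` parameters are given as they appear in OpenAlex's public documentation at the time of writing, and should be checked against the current docs before use.

```python
from urllib.parse import urlencode

# Assumed base endpoint for works (articles, books, datasets) in OpenAlex.
OPENALEX_WORKS = "https://api.openalex.org/works"


def build_openalex_url(search_terms, open_access_only=True, per_page=5, mailto=None):
    """Build a query URL for the OpenAlex works endpoint without sending it."""
    params = {"search": search_terms, "per-page": per_page}
    if open_access_only:
        # Restrict results to works flagged as open access.
        params["filter"] = "open_access.is_oa:true"
    if mailto:
        # Optional contact address that identifies the caller to the service.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example query and contact address.
    print(build_openalex_url("information literacy", mailto="librarian@example.edu"))
```

Building the URL separately from sending it makes the query easy to inspect, paste into a browser, or record in a search log, which fits the documentation habits expected in a reproducible literature search.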