Maho Takahashi

Linguistics PhD

mtakahas[at]ucsd[dot]edu

How to generate random datasets

import pandas as pd

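# Note: pd.util.testing has been deprecated since pandas 1.0 and is not
# available in pandas 2.x, so the calls on this page need an older pandas.
sample_df = pd.util.testing.makeDataFrame()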
sample_df.head()

	A	B	C	D
57UYdLCAz9	0.143324	0.012548	-1.501348	-0.873817
wxTnNfCFsX	-1.200068	-0.079628	-0.101257	-1.825547
4PYx9lgmIg	1.054539	-1.463891	0.282045	-0.203671
J7JCQvcH86	1.323521	1.695652	-0.674408	0.638206
46tUB2L5Xm	0.180413	1.887923	0.992859	0.996487

By default, the sample dataset contains 30 rows and 4 columns (A to D) of random floating-point values.
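For pandas versions where the testing helper is unavailable, a dataset of the same shape can be built by hand. The following is a minimal sketch, not the helper's actual implementation: the fixed seed, the `rng` and `alphabet` names, and the 10-character random labels are assumptions chosen to resemble the output above.

```python
import string

import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only to make the sketch reproducible.
rng = np.random.default_rng(0)

n_rows, n_cols = 30, 4

# Random 10-character alphanumeric row labels, similar in spirit to the index above.
alphabet = list(string.ascii_letters + string.digits)
index = ["".join(rng.choice(alphabet, size=10)) for _ in range(n_rows)]

# Standard-normal values in four columns named A to D.
sample_df = pd.DataFrame(
    rng.standard_normal((n_rows, n_cols)),
    index=index,
    columns=list("ABCD"),
)
sample_df.head()
```

Changing `n_rows` and `n_cols` (together with the column names) gives a dataset of any other size.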

Include NaNs in a random dataset

sample_df = pd.util.testing.makeMissingDataframe()
sample_df.head()

	A	B	C	D
i3ExiwBU7S	-1.212042	-0.892190	-0.039163	0.010801
zFzIsE2HUx	0.842377	-1.474449	0.896692	NaN
lHoqmwixVC	-0.432291	-0.140856	-0.820675	-0.876261
ZGtc7XPikQ	-0.330881	-2.578911	NaN	0.627988
lXMCFHfcYz	1.095504	-0.670436	1.614122	-0.500160
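Missing values can also be injected into any existing DataFrame with a random boolean mask. This is a hedged sketch using public pandas and NumPy calls only; `nan_fraction` is a hypothetical parameter introduced here, and the 10% value is an arbitrary choice, not the rate used by the helper above.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
sample_df = pd.DataFrame(rng.standard_normal((30, 4)), columns=list("ABCD"))

# Hypothetical parameter: the approximate fraction of cells to blank out.
nan_fraction = 0.1
mask = rng.random(sample_df.shape) < nan_fraction

# DataFrame.mask replaces every cell where the condition is True with NaN.
sample_df = sample_df.mask(mask)
sample_df.isna().sum()
```

The last line reports the number of NaNs per column, which is a quick way to confirm that the mask was applied.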

Include a time series in a random dataset

sample_df = pd.util.testing.makeTimeDataFrame()
sample_df.head()

	A	B	C	D
2000-01-03	-1.150055	0.340213	-0.193093	1.140389
2000-01-04	-0.116650	-1.928765	-1.934851	1.065835
2000-01-05	0.436392	-1.976887	0.124077	-1.292386
2000-01-06	2.362524	-0.733541	-0.745991	-0.600506
2000-01-07	-0.537766	0.622937	-1.650008	-0.308583
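The index in the output above runs over consecutive business days starting on 2000-01-03. A comparable dataset can be sketched with `pd.bdate_range`; the start date and the 30-row length are assumptions taken from the outputs shown on this page, not from the helper's source.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only to make the sketch reproducible.
rng = np.random.default_rng(0)

# Business-day (Monday to Friday) index starting on 2000-01-03.
index = pd.bdate_range(start="2000-01-03", periods=30)

sample_df = pd.DataFrame(
    rng.standard_normal((30, 4)),
    index=index,
    columns=list("ABCD"),
)
sample_df.head()
```

Using `pd.date_range` with another `freq` value instead would give a daily, hourly, or monthly index in the same way.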

Include mixed variable types in a random dataset

sample_df = pd.util.testing.makeMixedDataFrame()
sample_df.dtypes

A           float64
B           float64
C            object
D    datetime64[ns]
dtype: object
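A DataFrame with the same mix of column types (two float columns, one string column, one datetime column) can be assembled from a dictionary. This is an illustrative sketch: the five-row length, the `foo` labels, and the 2009 start date are hypothetical values picked for the example.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
n_rows = 5

sample_df = pd.DataFrame(
    {
        # Float column of standard-normal values.
        "A": rng.standard_normal(n_rows),
        # Float column holding random 0.0 / 1.0 values.
        "B": rng.integers(0, 2, size=n_rows).astype(float),
        # String column with hypothetical labels foo1, foo2, ...
        "C": [f"foo{i}" for i in range(1, n_rows + 1)],
        # Datetime column of consecutive business days.
        "D": pd.bdate_range(start="2009-01-01", periods=n_rows),
    }
)
sample_df.dtypes
```

Depending on the pandas version, the string column may be reported as `object` or as a dedicated string dtype.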